Description
47,772 tables derived from 1,379 parent tables drawn from TabFact and WikiTableQuestions, fragmented at four cumulative noise tiers (clean, schema, cell, hard). The dataset is part of the TRL-Bench suite for evaluating tabular encoders. It was created by logo-lab and last updated on June 11, 2026.
Use Cases
Benchmarking table retrieval models over the fragmented data lake.
Evaluating tabular encoder robustness across the four cumulative noise tiers (clean, schema, cell, hard).
Training models for table union operations using the union target data.
Training models for table join operations using the join target data.
Studying representation-level performance for cross-paradigm tasks as described in the associated paper.
Strengths
Large scale: 47,772 derived tables, roughly 35 per parent table on average.
Structured noise injection across four defined tiers (clean, schema, cell, hard).
Derived from established source datasets (TabFact and WikiTableQuestions).
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row counts and file formats are not documented, which makes it harder to assess suitability before download.
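Because formats and column semantics are undocumented, a first pass over a local copy can at least show what is there. The following is a minimal sketch, not a loader for this dataset: the function name `summarize_download` is hypothetical, the directory path is whatever you downloaded to, and the CSV peek is an assumption that only does anything if the download happens to contain `.csv` files.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_download(root):
    """Count files by extension and, for any CSV files, read header and row count.

    Assumption: the dataset's real file format is unknown. If no .csv files
    exist under `root`, the second return value is simply an empty dict.
    """
    root = Path(root)

    # Tally every regular file by its (lowercased) extension.
    by_ext = Counter(
        p.suffix.lower() or "<none>" for p in root.rglob("*") if p.is_file()
    )

    # For CSV files only: first line as header, remaining lines as data rows.
    csv_info = {}
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            header = next(reader, [])
            n_rows = sum(1 for _ in reader)
        csv_info[str(path.relative_to(root))] = (header, n_rows)

    return by_ext, csv_info
```

Calling `summarize_download("path/to/local/copy")` returns the extension tally first, which is the safer signal here; the per-file headers are a starting point for inferring field semantics, not a substitute for documentation.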
Provenance
Source
Derived from TabFact and WikiTableQuestions parent tables.
Collection Method
Fragmented at four cumulative noise tiers to create a compositional data lake.
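To make "cumulative" concrete: a table at a given tier carries the noise of every earlier tier plus its own. The sketch below illustrates only that layering. The tier names come from the description above, but the three perturbation functions are hypothetical stand-ins chosen for illustration; the dataset's actual noise procedures are not documented here.

```python
import random

# Tier order as listed in the dataset description.
TIERS = ["clean", "schema", "cell", "hard"]


def noise_schema(header, rows, rng):
    # Hypothetical schema noise: replace column names with generic labels.
    return [f"col_{i}" for i in range(len(header))], rows


def noise_cell(header, rows, rng):
    # Hypothetical cell noise: blank out one randomly chosen cell per row.
    noisy = []
    for row in rows:
        row = list(row)
        row[rng.randrange(len(row))] = ""
        noisy.append(row)
    return header, noisy


def noise_hard(header, rows, rng):
    # Hypothetical hard noise: shuffle the column order.
    order = list(range(len(header)))
    rng.shuffle(order)
    return [header[i] for i in order], [[r[i] for i in order] for r in rows]


STEPS = {"schema": noise_schema, "cell": noise_cell, "hard": noise_hard}


def apply_tier(header, rows, tier, seed=0):
    """Apply every perturbation up to and including `tier` (cumulative)."""
    rng = random.Random(seed)
    header, rows = list(header), [list(r) for r in rows]
    # "clean" applies nothing; each later tier adds one more step.
    for name in TIERS[1 : TIERS.index(tier) + 1]:
        header, rows = STEPS[name](header, rows, rng)
    return header, rows
```

Under this layering, a robustness evaluation can compare an encoder's output for the same parent table across `apply_tier(..., "clean")` through `apply_tier(..., "hard")` and attribute any drop to the step added at each tier.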
Time Range
Not specified.
Freshness
Last updated 2026-06-11 03:52:46; freshness should be verified.
Geography
Not specified.
License
Unknown; restrictions should be verified before use.