Name: TheJoin: A Collection of 650 Multi-Domain Relational Databases for Pretraining
Creator: stanford-rdl
Published: 2026-06-10T20:35:13
Keywords: Machine Learning Pretraining, Tabular, Multi Domain, Large Scale, Relational Databases, Tabular Foundation Models

Description

A collection of 650 relational databases spanning domains such as e-commerce, finance, sports, biomedical, and government, ported to the RelBench manifest format. It was created by stanford-rdl for large-scale pretraining of relational and tabular foundation models. Each database is self-describing, and tasks ship their labels as-is.

Use Cases

Pretraining relational foundation models on the collection's 650 multi-domain databases.
Benchmarking tabular machine learning models across diverse domains such as e-commerce and finance.
Developing self-describing data systems based on the RelBench manifest format used by the collection.

Strengths

Contains 650 distinct relational databases, indicating substantial scale.
Spans many domains including academic, e-commerce, finance, sports, biomedical, and government, suggesting broad coverage.
Databases are self-describing and formatted for pretraining, which likely simplifies automated processing.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row counts are not reported, which makes it harder to assess suitability before download.
Freshness should be verified; the last metadata update was 2026-06-12.
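
Because row counts and column semantics are not documented, a first pass after download is usually a quick profile of each table. The following is a minimal sketch of such a profile, not part of the collection's tooling: `infer_kind` and `profile_table` are hypothetical helper names, and the sketch assumes each table has already been read into an iterable of dict-like rows by whatever loader matches the files on disk. The sample rows are invented for illustration only.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def infer_kind(value: Any) -> str:
    """Classify one cell as 'null', 'bool', 'int', 'float', or 'text'."""
    if value is None or value == "":
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    text = str(value).strip()
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def profile_table(rows: Iterable[Mapping[str, Any]]) -> dict:
    """Return the row count and, per column, a tally of inferred value kinds."""
    row_count = 0
    kinds: dict[str, Counter] = {}
    for row in rows:
        row_count += 1
        for column, value in row.items():
            kinds.setdefault(column, Counter())[infer_kind(value)] += 1
    return {
        "row_count": row_count,
        "columns": {column: dict(tally) for column, tally in kinds.items()},
    }


# Invented sample rows (hypothetical table, not taken from the collection).
sample = [
    {"order_id": "1", "amount": "19.99", "status": "paid"},
    {"order_id": "2", "amount": "", "status": "refunded"},
]
profile = profile_table(sample)
print(profile)
```

A profile like this only suggests value types and missingness; it does not recover what a field means, so column semantics still need manual review.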

Provenance

Source: stanford-rdl
Collection Method: Ported from various sources to the RelBench manifest format.
Freshness: Last updated 2026-06-12 19:54:50; freshness should be verified.

License is unknown; terms of use must be verified before application.
