This dataset contains 601,328 protein domain structures classified hierarchically by class, architecture, topology, and homologous superfamily. Created by LiteFold, it ships with a deterministic, S35-cluster-aware split: 541,123 rows for training and 60,205 for testing. The split keeps all domains from the same homologous superfamily and S35 cluster (CATH's 35% sequence-identity clustering level) on the same side, so closely related domains do not appear in both train and test.
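To make the idea of a deterministic, group-aware split concrete, here is a minimal sketch. It is not LiteFold's actual procedure, which is not documented here. It hashes a group key so that every row sharing that key lands in the same split, and the assignment is reproducible across runs. The field name `superfamily`, the function names, and the 10% test fraction are all assumptions made for illustration.

```python
import hashlib


def assign_split(group_key: str, test_fraction: float = 0.1) -> str:
    """Map a group key to 'train' or 'test' using a stable hash.

    The same key always yields the same answer, so the split is
    deterministic and every row sharing the key stays together.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # integer in [0, 2**64)
    threshold = int(test_fraction * 2**64)          # cut-off for the test set
    return "test" if bucket < threshold else "train"


def split_rows(rows, key_field="superfamily", test_fraction=0.1):
    """Partition rows by hashing a grouping field.

    `key_field` is a hypothetical column name; the real dataset's
    column names must be checked after download.
    """
    splits = {"train": [], "test": []}
    for row in rows:
        side = assign_split(str(row[key_field]), test_fraction)
        splits[side].append(row)
    return splits
```

Hashing on the superfamily alone is enough in this sketch because CATH's S35 clusters are nested inside superfamilies. Note that a hash-based split only approximates the requested test fraction, since whole groups move together.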
Use Cases
- Training protein structure prediction models based on hierarchical domain classification.
- Benchmarking clustering algorithms based on the S35-cluster-aware data splits.
- Studying protein domain evolution and relationships based on homologous superfamily labels.
Strengths
- Substantial scale: 601,328 classified protein domains in total.
- The deterministic, S35-cluster-aware split with 541,123 training and 60,205 test rows helps prevent data leakage.
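The leakage claim above can be checked after download. The sketch below assumes each row is a dict and that the grouping column is called `s35_cluster`; both are assumptions, since the real column names are not documented. It also confirms that the stated split sizes add up to the stated total.

```python
def leaked_groups(train_rows, test_rows, key_field):
    """Return the group keys that occur in both splits.

    An empty set means no group (for example, no S35 cluster)
    straddles the train/test boundary.
    """
    train_keys = {row[key_field] for row in train_rows}
    test_keys = {row[key_field] for row in test_rows}
    return train_keys & test_keys


# Row counts as stated in the dataset description.
TRAIN_ROWS = 541_123
TEST_ROWS = 60_205
TOTAL_ROWS = 601_328

# The two splits should account for every row.
assert TRAIN_ROWS + TEST_ROWS == TOTAL_ROWS
```

Running `leaked_groups` once per grouping column (superfamily and S35 cluster) gives a quick check that the split behaves as described.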
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Only aggregate row counts are given (601,328 total; 541,123 train and 60,205 test); counts per class or per superfamily are not, which may limit suitability assessment.
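Because column-level documentation is absent, a first pass over a sample of rows can help infer field semantics. The following is a minimal sketch that assumes rows arrive as an iterable of dicts; the function name and the sample size are illustrative choices, not part of any dataset tooling.

```python
from collections import defaultdict
from itertools import islice


def summarize_fields(rows, sample_size=1000):
    """Summarize field names, observed value types, and one example value.

    Only the first `sample_size` rows are inspected. A value of None is
    counted as missing; a field that is absent from a row is not counted.
    """
    summary = defaultdict(
        lambda: {"types": set(), "missing": 0, "example": None}
    )
    seen = 0
    for row in islice(rows, sample_size):
        seen += 1
        for name, value in row.items():
            info = summary[name]
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if info["example"] is None:
                info["example"] = value   # keep the first non-null value
    return seen, dict(summary)
```

The per-field type sets and example values are usually enough to tell identifiers, hierarchy labels, and structural payloads apart before committing to a full download or training run.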
Provenance
- Source: CATH hierarchical classification database.
- Freshness: Last updated 2026-05-27 13:02:55; verify before use.