Description
TEDBench is a benchmark for protein fold classification comprising 369,740 protein structures for training, 46,217 for validation, and 46,218 for testing. It is built from TED annotations projected onto the Foldseek-clustered AlphaFold Database and was presented in the paper 'Protein Fold Classification at Scale: Benchmarking and Pretraining'.
Use Cases
Benchmarking protein fold classification models based on the 965 CATH topology classes.
Pretraining models for structural bioinformatics using the large-scale, non-redundant set of protein structures.
Analyzing the distribution and characteristics of rare protein topologies among the 965 CATH topology classes.
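As a rough illustration of the rare-topology use case, the sketch below counts how often each class label occurs and lists those at or below a chosen threshold. The function name `rare_classes`, the `max_count` threshold, and the toy labels are all hypothetical; the dataset's actual field names and label encoding are undocumented and must be checked after download.

```python
from collections import Counter


def rare_classes(labels, max_count):
    """Return (label, count) pairs whose count is at most max_count.

    Results are sorted by ascending count, then by label.
    """
    counts = Counter(labels)
    return sorted(
        ((label, n) for label, n in counts.items() if n <= max_count),
        key=lambda item: (item[1], item[0]),
    )


# Hypothetical toy labels written in CATH Class.Architecture.Topology form.
# These are NOT taken from TEDBench; they only exercise the function.
toy_labels = ["3.40.50", "3.40.50", "3.40.50", "1.10.10", "1.10.10", "2.60.40"]
rare = rare_classes(toy_labels, max_count=2)
print(rare)
```

On real data, the same function could be applied to the topology label column of the training split once its name is known.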
Strengths
Large scale with 369,740 training structures, 46,217 validation structures, and 46,218 test structures.
Covers 965 distinct CATH topology (T-level) classes, including rare topologies.
Designed as a non-redundant benchmark built on Foldseek-clustered structures, which may reduce bias from near-duplicate structures.
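The split sizes quoted in this card can be checked with a few lines of arithmetic. The sketch below uses only the three counts stated here; the dictionary keys are illustrative names, not confirmed split identifiers in the dataset.

```python
# Split sizes as stated in this card; key names are illustrative only.
splits = {"train": 369_740, "validation": 46_217, "test": 46_218}

total = sum(splits.values())
fractions = {name: n / total for name, n in splits.items()}

for name, n in splits.items():
    print(f"{name}: {n} structures ({fractions[name]:.1%} of {total})")
```

The counts sum to 462,175 structures, which corresponds to roughly an 80/10/10 train/validation/test split.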
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported in the listing metadata; the split sizes given in the description total 462,175 structures, which should be confirmed after download.
Last updated 2026-05-20 06:39:14; freshness should be verified.
Provenance
Source
TEDBench, built from The Encyclopedia of Domains (TED) annotations and the AlphaFold Database.
Collection Method
TED annotations were projected onto the Foldseek-clustered AlphaFold Database.
Freshness
Last updated 2026-05-20 06:39:14.
License is unknown; users should verify terms before use.