Name: ESM Atlas: Protein Features and Structures for 6.8 Billion Sequences
License: CC-BY-SA-4.0
Keywords: Metagenomics, Bioinformatics, Protein Structure, Tabular, Computational Biology, Large Scale, Synthetic, Multimodal

Description

Over 6.8 billion deduplicated protein sequences from all domains of life, including viral proteins and metagenomic dark matter, form the basis of this dataset. It provides sparse autoencoder features for these proteins and predicted 3D structures for approximately 1.1 billion of them, generated by the ESMC and ESMFold2 models. The dataset is organized into 7.7 million clusters based on feature similarity and is published by Biohub under a CC-BY-SA-4.0 license.

Use Cases

Training or benchmarking protein language models using the sparse autoencoder features.
Analyzing protein structure-function relationships using the predicted 3D structures.
Exploring protein family evolution and functional grouping across the 7.7 million similarity clusters.
Investigating metagenomic 'dark matter' and viral protein diversity within the sequence collection.

Strengths

Extremely large scale, derived from over 6.8 billion publicly available protein sequences.
Includes predicted three-dimensional structures for approximately 1.1 billion proteins.
Proteins are organized into 7.7 million clusters, enabling functional grouping analysis.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Last update date is unknown; freshness unverified.
Data may reflect source bias inherent to the original public sequence repositories.
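
Because column-level documentation is absent, a first pass over any downloaded file usually means profiling its fields. The following is a minimal sketch of that step, assuming the file (or a sample of it) has been exported as delimited text with a header row; the actual on-disk format is not documented here, and the column names in the sample are hypothetical.

```python
import csv
import io
from collections import defaultdict


def guess_type(value):
    """Classify one raw cell as 'empty', 'int', 'float', or 'str'."""
    if value is None or value == "":
        return "empty"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            pass
    return "str"


def profile_columns(handle, delimiter=",", max_rows=1000):
    """Return {column: sorted value types seen} for the first max_rows rows."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    seen = defaultdict(set)
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        for name, value in row.items():
            if name is None:
                # Cells beyond the header width have no column name; skip them.
                continue
            seen[name].add(guess_type(value))
    return {name: sorted(kinds) for name, kinds in seen.items()}


# Hypothetical sample: the real column names are not documented.
sample = io.StringIO("id,score,label\n1,0.5,a\n2,,b\n")
print(profile_columns(sample))
```

A column that reports both 'empty' and another type contains missing values, which is worth knowing before loading billions of rows.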

Provenance

Source: Biohub
Collection Method: Computational outputs generated by ESMC and ESMFold2 models from a deduplicated set of publicly available protein sequences.

Data is hosted on Amazon S3 and can be downloaded with the AWS CLI; a companion web explorer is available.
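
At this scale it is prudent to preview a transfer before starting it. The sketch below only assembles an `aws s3 sync` command with the CLI's `--dryrun` flag; it does not run anything. The bucket name and prefix are not given on this page, so the URI and the destination directory are placeholders that must be replaced with the values published by the data provider.

```python
import shlex


def build_sync_command(s3_uri, dest_dir, dry_run=True):
    """Assemble an `aws s3 sync` argument list without executing it."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    cmd = ["aws", "s3", "sync", s3_uri, dest_dir]
    if dry_run:
        # --dryrun lists the operations that would run without transferring data.
        cmd.append("--dryrun")
    return cmd


# Placeholder only: the real bucket and prefix are not stated in this listing.
PLACEHOLDER_URI = "s3://EXAMPLE-BUCKET/EXAMPLE-PREFIX/"
print(shlex.join(build_sync_command(PLACEHOLDER_URI, "./esm_atlas_sample")))
```

The returned list can be passed to `subprocess.run` once the real URI is known and the dry-run output looks right.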
