This dataset is built on more than 6.8 billion deduplicated protein sequences from all domains of life, including viral proteins and metagenomic dark matter. It provides sparse autoencoder features for these proteins and predicted 3D structures for approximately 1.1 billion of them, generated by the ESMC and ESMFold2 models. The dataset is organized into 7.7 million clusters based on feature similarity. It is published by Biohub under a CC-BY-SA-4.0 license.
The data is stored in Amazon S3 and can be accessed with the AWS CLI; a companion web explorer is also available.
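As a rough illustration of the AWS CLI access route, the sketch below only assembles `aws s3` commands and does not run them. The bucket name, prefix, and local directory are hypothetical placeholders, since this listing does not give the actual S3 location. The `--no-sign-request` flag (anonymous access) is included on the assumption that the bucket is publicly readable, which is not confirmed here.

```python
import shlex

# Hypothetical placeholder: the real bucket name is not stated in this listing.
EXAMPLE_BUCKET = "example-protein-dataset"


def build_s3_command(action, bucket, prefix="", dest=None, anonymous=True):
    """Return an AWS CLI argument list that lists or syncs objects under a prefix.

    action    -- "ls" to list objects, "sync" to mirror them to a local directory
    bucket    -- S3 bucket name (assumed; not given in the source)
    prefix    -- key prefix inside the bucket (assumed layout)
    dest      -- local target directory, required for "sync"
    anonymous -- append --no-sign-request, assuming the bucket is public
    """
    if action not in ("ls", "sync"):
        raise ValueError("action must be 'ls' or 'sync'")
    cmd = ["aws", "s3", action, f"s3://{bucket}/{prefix}"]
    if action == "sync":
        if dest is None:
            raise ValueError("sync requires a local destination directory")
        cmd.append(dest)
    if anonymous:
        cmd.append("--no-sign-request")
    return cmd


if __name__ == "__main__":
    # Print the commands instead of executing them, so nothing is downloaded.
    print(shlex.join(build_s3_command("ls", EXAMPLE_BUCKET, "clusters/")))
    print(shlex.join(build_s3_command("sync", EXAMPLE_BUCKET, "clusters/", dest="./clusters")))
```

Once the real bucket and prefix are known, the returned list can be passed to `subprocess.run`. Given the size of the dataset, listing a single prefix first is probably safer than syncing everything.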