ProtAttn-QuadNet: Protein-Protein Interaction Pairs with ProtBERT Embeddings
by Md Shahidul Islam·Updated 1mo ago
5.0 GB · 2 files
Available on 1 platform
Description
This dataset provides ProtBERT-derived sequence embeddings for 573,661 reviewed protein entries from UniProtKB, together with two labeled datasets for protein-protein interaction (PPI) prediction. The balanced dataset contains 249,814 protein pairs, and an oversampled version contains 1,082,662 pairs. Authored by Md Shahidul Islam and last updated on 2026-05-02.
Use Cases
Train attention-based deep learning models for PPI prediction based on ProtBERT-derived sequence embeddings.
Benchmark PPI prediction algorithms using the provided balanced and oversampled labeled datasets.
Analyze protein interaction networks based on the large collection of reviewed protein entries.
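As a rough illustration of the first use case, a pair-level model needs one input row per protein pair, built from the two per-protein embeddings. The sketch below is only an assumption about how that preparation could look: the listing does not document the file layout, so the dictionary of ID-to-vector mappings, the function name `build_pair_features`, and the plain concatenation are all hypothetical, and the toy 4-dimensional vectors merely stand in for real embeddings. It prepares inputs only and does not implement any attention model.

```python
import numpy as np

def build_pair_features(embeddings, pairs):
    """Concatenate the two per-protein vectors of each pair into one feature row.

    embeddings: dict mapping a protein ID to a 1-D NumPy vector (assumed layout)
    pairs: iterable of (id_a, id_b) tuples
    """
    rows = [np.concatenate([embeddings[a], embeddings[b]]) for a, b in pairs]
    return np.stack(rows)

# Toy vectors stand in for real embeddings; the IDs are made up.
toy_embeddings = {
    "P1": np.array([1.0, 0.0, 0.0, 0.0]),
    "P2": np.array([0.0, 1.0, 0.0, 0.0]),
    "P3": np.array([0.0, 0.0, 1.0, 0.0]),
}
toy_pairs = [("P1", "P2"), ("P2", "P3")]

features = build_pair_features(toy_embeddings, toy_pairs)
print(features.shape)  # (2, 8): two pairs, two 4-dimensional vectors each
```

Concatenation is order-sensitive, so (A, B) and (B, A) produce different rows; whether that matters depends on the model being trained.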
Strengths
Includes 573,661 reviewed protein sequence embeddings from the authoritative UniProtKB database.
Provides a balanced dataset of 249,814 labeled protein pairs and an oversampled dataset of 1,082,662 pairs for model training.
Substantial data volume: 5.0 GB across 2 files.
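One caution when benchmarking on the oversampled dataset: oversampling commonly works by repeating existing examples, and the listing does not say how the 1,082,662 pairs were produced. If repeats are present, a naive random split could place copies of the same pair in both training and test sets. The sketch below shows one possible way to keep every copy of a pair on the same side of a split. The row layout `(id_a, id_b, label)`, the helper names, and the toy IDs are assumptions for illustration, not the dataset's actual schema.

```python
import random
from collections import defaultdict

def pair_key(id_a, id_b):
    """Order-independent key so (A, B) and (B, A) count as the same pair."""
    return tuple(sorted((id_a, id_b)))

def split_by_pair(rows, test_fraction=0.5, seed=0):
    """Split (id_a, id_b, label) rows so all copies of a pair stay together."""
    groups = defaultdict(list)
    for row in rows:
        groups[pair_key(row[0], row[1])].append(row)
    keys = sorted(groups)                  # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = int(len(keys) * test_fraction)
    test_keys = set(keys[:n_test])
    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test

# Toy rows: the first pair appears three times, as oversampling might produce.
rows = [
    ("P1", "P2", 1), ("P2", "P1", 1), ("P1", "P2", 1),
    ("P3", "P4", 0), ("P5", "P6", 0), ("P7", "P8", 1), ("P9", "P10", 0),
]
train, test = split_by_pair(rows)

train_keys = {pair_key(a, b) for a, b, _ in train}
test_keys = {pair_key(a, b) for a, b, _ in test}
print(train_keys & test_keys)  # set(): no pair appears on both sides
```

Splitting on pair keys rather than on rows changes the effective test fraction when some pairs are repeated more than others, so the resulting split sizes are worth checking.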
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count of the primary embedding file is not documented, so the 573,661 figure in the description should be confirmed after download.
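Because column semantics have to be inferred after download, a first step could be to list what the ZIP archives contain and look at the first few rows of any tabular member. The following is a minimal sketch using only the Python standard library. It assumes the archives hold delimited text files (`.csv` or `.tsv`), which the listing does not confirm; embeddings may well be stored in a binary format that this helper would simply skip. The function name `preview_archive` and the demo file name are hypothetical.

```python
import csv
import io
import itertools
import zipfile

def preview_archive(zip_source, max_rows=3):
    """Return {member_name: first rows} for each CSV/TSV member of a ZIP archive."""
    previews = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            name = info.filename.lower()
            if info.is_dir() or not name.endswith((".csv", ".tsv")):
                continue  # skip folders and non-tabular members
            delimiter = "\t" if name.endswith(".tsv") else ","
            with archive.open(info) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.reader(text, delimiter=delimiter)
                previews[info.filename] = list(itertools.islice(reader, max_rows))
    return previews

# Demo on a small in-memory archive, since the real file names are unknown.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("pairs_demo.csv", "col_a,col_b,col_c\nP1,P2,1\nP3,P4,0\n")
buffer.seek(0)

demo_preview = preview_archive(buffer)
print(demo_preview)
```

For a downloaded archive, a file path can be passed in place of the in-memory buffer. Only the first rows are read, so the preview stays cheap even for multi-gigabyte members.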
Provenance
Source
UniProtKB database
Collection Method
Sequence embeddings were derived with ProtBERT for reviewed UniProtKB protein entries.
Freshness
Last updated 2026-05-02 08:05:50; freshness should be verified.
License
CC-BY-4.0
Format
Packaged in ZIP archives.