A corpus of computer science paper abstracts sourced from the arXiv API. The abstracts are organized into four thematic categories, including Algorithms and NLP/AI. The dataset was collected for an NLP Lab at IIT Jammu during the 2025-2026 academic year.
Use Cases
- Multi-class text classification based on thematic labels
- Analyzing topic distribution in computer science literature based on arXiv categories
- Training models for automated paper categorization based on abstract content
- Benchmarking NLP models on scientific text classification tasks
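As a rough illustration of the multi-class classification use case, the sketch below implements a small multinomial Naive Bayes baseline using only the Python standard library. The training texts are invented placeholders, not rows from this dataset, and only the two label names mentioned in the description (Algorithms and NLP/AI) are used. The `NaiveBayes` class and `tokenize` helper are hypothetical names introduced for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder abstracts, invented for illustration (not from the dataset).
train_texts = [
    "We give a faster approximation algorithm for shortest paths in graphs.",
    "A transformer language model is fine-tuned for text generation.",
]
train_labels = ["Algorithms", "NLP/AI"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("An algorithm for graphs with improved running time.")
```

A bag-of-words baseline like this is mainly useful as a reference point when benchmarking stronger models. With real data, the texts and labels would come from whichever columns hold the abstract and the thematic label.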
Strengths
- Abstracts are grouped under four thematic labels, giving a ready-made target for supervised classification
- Data is sourced from the arXiv API, a reputable repository for scientific preprints
- Last updated on 2026-03-12, indicating recent maintenance
Limitations
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not documented, so it is hard to judge before download whether the dataset is large enough for a given task
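Both gaps can be closed with a quick inspection after download. The sketch below assumes the data ships as a CSV file with a header row, which this card does not confirm. The column names `abstract` and `label` and the function `summarize_csv` are hypothetical and used only for the in-memory demonstration.

```python
import csv
import io
from collections import Counter


def summarize_csv(handle, label_column=None):
    """Return (column names, row count, label counts) for an open CSV file."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    row_count = 0
    label_counts = Counter()
    for row in reader:
        row_count += 1
        if label_column is not None and label_column in row:
            label_counts[row[label_column]] += 1
    return columns, row_count, dict(label_counts)


# In-memory stand-in for the downloaded file; the rows are invented placeholders.
sample = io.StringIO(
    "abstract,label\n"
    "Toy abstract one,Algorithms\n"
    "Toy abstract two,NLP/AI\n"
)
columns, row_count, label_counts = summarize_csv(sample, label_column="label")
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle in. The label counts also give a first look at how balanced the four categories are.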
Provenance
- Source: arXiv API
- Collection Method: Collected via API for an NLP Lab project
- Freshness: Last updated 2026-03-12 08:48:37
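The collection script itself is not documented here, so the following is only a sketch of how abstracts could be pulled from the public arXiv API query endpoint, which returns an Atom feed. The helper names (`build_query_url`, `parse_feed`, `fetch_abstracts`) are hypothetical, and the mapping from arXiv categories such as `cs.CL` to the dataset's four thematic labels is unknown and not reproduced.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category, start=0, max_results=10):
    """Build a query URL for one arXiv category, e.g. 'cs.CL'."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Extract id, title, abstract and categories from an Atom feed string."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        records.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            "title": " ".join(title.split()),      # collapse wrapped lines
            "abstract": " ".join(summary.split()),
            "categories": [c.get("term") for c in entry.findall("atom:category", ATOM)],
        })
    return records


def fetch_abstracts(category, max_results=10):
    """Fetch and parse one page of results (performs a network request)."""
    url = build_query_url(category, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_feed(response.read().decode("utf-8"))
```

Calling `fetch_abstracts("cs.CL", max_results=5)` would request five entries. For larger pulls, page through results with the `start` parameter and pause between requests.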