This dataset aggregates 32,784 computer science research papers from arXiv, conferences, and journals. It is maintained by ResearchScope and updated automatically via GitHub Actions. It provides splits for per-source analysis, instruction tuning, and per-section fine-tuning.
Use Cases
- Training language models for academic text generation based on the paper abstracts and full texts.
- Building paper recommendation systems based on metadata and content from arXiv and conference sources.
- Fine-tuning models for specific sections of research papers using the per-section splits.
- Analyzing trends in computer science research across different publication venues.
- Instruction-tuning models for tasks like summarization or question-answering on scientific literature.
Strengths
- Contains 32,784 papers, providing a substantial corpus for analysis.
- Breaks the corpus down by source: 7,784 papers from arXiv, 20,000 from conferences, and 5,000 from journals (32,784 in total).
- Offers structured splits for per-source, instruction-tuning, and per-section fine-tuning tasks.
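The per-source counts stated above can be checked against the total with a few lines of arithmetic. This is a minimal sketch; the dictionary keys are illustrative labels, not confirmed split names.

```python
# Per-source paper counts as stated in this card.
# The keys are illustrative labels, not confirmed split names.
source_counts = {"arxiv": 7_784, "conferences": 20_000, "journals": 5_000}

total = sum(source_counts.values())

# Share of the corpus contributed by each source, as a percentage.
shares = {name: round(100 * n / total, 1) for name, n in source_counts.items()}

print(total)
print(shares)
```

Conference papers make up roughly 61% of the corpus, which is worth keeping in mind when comparing results across sources.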
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row counts for individual splits are not documented, which makes it harder to assess suitability before download.
- The data may reflect selection bias from the chosen arXiv, conference, and journal sources.
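Because column-level documentation is absent, a first pass over the downloaded rows can at least surface field names, observed value types, and how often each field is empty. The sketch below assumes the rows are already available as a list of dictionaries (for example, after parsing JSON Lines). The helper name `summarize_fields` and the sample records are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import defaultdict


def summarize_fields(records):
    """Return {field: {"types": sorted type names, "missing": count}}.

    A field counts as missing in a record when the key is absent,
    or when its value is None or an empty string.
    """
    types = defaultdict(set)
    present = defaultdict(int)
    for row in records:
        for key, value in row.items():
            types[key]  # register the field even if every value is empty
            if value is None or value == "":
                continue
            types[key].add(type(value).__name__)
            present[key] += 1
    n = len(records)
    return {
        key: {"types": sorted(types[key]), "missing": n - present[key]}
        for key in sorted(types)
    }


# Hypothetical sample rows; the real field names must be read from the data.
sample = [
    {"title": "Paper A", "abstract": "Text", "year": 2024},
    {"title": "Paper B", "abstract": "", "year": None},
    {"title": "Paper C", "source": "arxiv"},
]

for field, info in summarize_fields(sample).items():
    print(field, info)
```

A summary like this does not replace documentation, but it shows which fields are sparsely populated before any of them is relied on for training or analysis.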
Provenance
- Source: ResearchScope, aggregating from arXiv, conferences, and journals.
- Collection method: updated automatically via GitHub Actions.
- Freshness: last updated 2026-06-13 08:10:37; freshness should be verified.
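One way to verify freshness is to compare the stated timestamp with a reference date. The sketch below assumes the timestamp is in UTC, which this card does not state. The function names `age_in_days` and `is_stale` and the 30-day threshold are illustrative choices, not part of the dataset.

```python
from datetime import datetime, timezone

# Timestamp as given in this card; treating it as UTC is an assumption.
LAST_UPDATED = "2026-06-13 08:10:37"


def age_in_days(last_updated, now):
    """Whole days elapsed between the last update and `now` (both UTC)."""
    stamp = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S")
    stamp = stamp.replace(tzinfo=timezone.utc)
    return (now - stamp).days


def is_stale(last_updated, now, max_age_days=30):
    """True when the data is older than `max_age_days` (illustrative threshold)."""
    return age_in_days(last_updated, now) > max_age_days


now = datetime.now(timezone.utc)
print(age_in_days(LAST_UPDATED, now), is_stale(LAST_UPDATED, now))
```

A negative age means the reference date is earlier than the stated update time, which would itself suggest a time zone or clock mismatch worth checking.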