Description
Approximately 1 million academic papers from sources like arXiv and bioRxiv have been processed into a unified, multi-layered knowledge graph. The dataset, created by InternScience, decomposes each paper into five modules covering metadata, entities, abstracted knowledge, citation context, and fine-grained relations. It was last updated on June 12, 2026.
Use Cases
Building a semantic search engine for academic papers based on the decomposed knowledge modules.
Training or evaluating information extraction models on the structured entities and relations.
Analyzing citation networks and research trends using the provided citation context.
Developing question-answering agents for scientific domains based on the abstracted knowledge representations.
Strengths
Contains about 1 million papers, providing a substantial scale for analysis.
Papers are decomposed into a structured, five-module representation (A–E) designed for querying.
Sources include established preprint repositories like arXiv and bioRxiv.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Exact row count is not reported (only the approximate figure of 1 million papers is given), which may limit suitability assessment.
The description references a full page for details, indicating core metadata may be incomplete here.
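Because column-level documentation is absent, a first step after download is usually to profile the fields empirically. The following is a minimal sketch of that step, assuming the records can be loaded as a list of Python dictionaries. The helper name `profile_fields` and the sample field names (`paper_id`, `entities`, `year`) are hypothetical and are not taken from the dataset itself.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable


def profile_fields(records: Iterable[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Summarize which fields appear, how often, and with which Python types."""
    type_counts: dict[str, Counter] = defaultdict(Counter)
    total = 0
    for record in records:
        total += 1
        for key, value in record.items():
            # Count the observed Python type of each value, per field.
            type_counts[key][type(value).__name__] += 1

    summary: dict[str, dict[str, Any]] = {}
    for key, counts in type_counts.items():
        present = sum(counts.values())
        summary[key] = {
            "types": dict(counts),
            # Fraction of records in which the field is present at all.
            "coverage": present / total,
        }
    return summary


# Hypothetical records; the real field names are not documented in this card.
sample = [
    {"paper_id": "p1", "entities": ["graphene"], "year": 2021},
    {"paper_id": "p2", "entities": [], "year": None},
    {"paper_id": "p3", "entities": ["CRISPR"]},
]

for name, info in profile_fields(sample).items():
    print(name, info)
```

Running a profile like this on a small slice of each module can show which fields are optional or mixed-type before committing to a schema.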
Provenance
Source
Processed from arXiv, bioRxiv, and other unspecified sources by Agents-K1.
Collection Method
Papers were decomposed into a unified, queryable representation organized into five modules.
Freshness
Last updated 2026-06-12 03:03:40 (time zone not specified); freshness should be verified.
License is unknown and must be verified before use.