SGI-Bench is a scientist-aligned benchmark for evaluating Scientific General Intelligence in large language models across the full inquiry cycle. The benchmark spans 10 disciplines and contains more than 1,000 expert-curated samples inspired by Science's 125 Big Questions. It was created by InternScience and last updated on HuggingFace in June 2026.
Use Cases
- Benchmarking LLM performance on the deliberation stage of the scientific inquiry cycle.
- Evaluating a model's conception abilities on multidisciplinary problems spanning the benchmark's 10 disciplines.
- Testing agentic action and perception workflows in a scientific context.
Strengths
- More than 1,000 expert-curated samples provide a substantial evaluation corpus.
- Benchmark spans 10 distinct scientific disciplines for broad coverage.
- Evaluation framework is structured around the scientist-aligned Deliberation, Conception, Action, and Perception cycle.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The exact row count is not reported (only "more than 1,000" samples is stated), which may limit suitability assessment.
- Description metadata is limited; actual data quality requires manual inspection after download.
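Because field semantics have to be inferred after download, a quick per-field summary can help with a first inspection. The following is a minimal sketch, not part of SGI-Bench itself: the helper name `summarize_fields` and the sample rows (including the field names `question`, `discipline`, and `answer`) are hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict


def summarize_fields(records):
    """Return (row_count, per-field summary) for an iterable of dict rows.

    The summary records the observed Python type names, the number of
    non-null values, and the first non-null example for each field.
    """
    summary = defaultdict(lambda: {"types": set(), "non_null": 0, "example": None})
    total = 0
    for row in records:
        total += 1
        for field, value in row.items():
            info = summary[field]
            info["types"].add(type(value).__name__)
            if value is not None:
                info["non_null"] += 1
                if info["example"] is None:
                    info["example"] = value
    return total, dict(summary)


# Hypothetical rows for illustration only; the real columns are undocumented.
sample_rows = [
    {"question": "Q1", "discipline": "physics", "answer": None},
    {"question": "Q2", "discipline": "chemistry", "answer": "42"},
]

total, fields = summarize_fields(sample_rows)
for name, info in fields.items():
    print(name, sorted(info["types"]), f"{info['non_null']}/{total}", repr(info["example"]))
```

In practice the rows would come from the downloaded data instead of `sample_rows`. Any iterable of dictionaries should work, which also yields the actual row count as a side effect.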
Provenance
- Source: InternScience on HuggingFace
- Collection Method: Expert-curated samples inspired by Science's 125 Big Questions.
- Freshness: Last updated 2026-06-02 12:05:10; freshness should be verified.
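One way to verify freshness is to compare the recorded timestamp against the time of the check. The sketch below assumes the timestamp is in UTC (the source does not state a timezone). The helper names `age_in_days` and `is_stale`, the 180-day threshold, and the check date are all hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


def age_in_days(last_updated, now, fmt="%Y-%m-%d %H:%M:%S"):
    """Days elapsed between the recorded update time (assumed UTC) and `now`."""
    updated = datetime.strptime(last_updated, fmt).replace(tzinfo=timezone.utc)
    return (now - updated) / timedelta(days=1)


def is_stale(last_updated, now, max_age_days=180):
    """True if the record is older than `max_age_days` (an arbitrary threshold)."""
    return age_in_days(last_updated, now) > max_age_days


recorded = "2026-06-02 12:05:10"  # timestamp taken from the provenance entry
check_time = datetime(2026, 12, 31, tzinfo=timezone.utc)  # hypothetical check date

print(round(age_in_days(recorded, check_time), 1), is_stale(recorded, check_time))
```

A real check would pass `datetime.now(timezone.utc)` as `now` and, ideally, confirm the timestamp against the repository's current metadata instead of relying on the value recorded here.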