Description
Nemotron-SFT-Science-v2 is a science reasoning dataset created by NVIDIA and last updated on June 4, 2026. It contains problems and solutions across three domains: Physics, Biology, and Chemistry. The problems are a mix of synthetic and vendor-sourced items in multiple-choice and open-question formats. Each is paired with an LLM-generated solution produced under one of three setups: chain-of-thought reasoning, Python tool use, or search tool use.
Use Cases
Fine-tuning LLMs for science question answering in both multiple-choice and open-question formats.
Training models for chain-of-thought reasoning, using the solutions generated in the CoT setup.
Developing models that call external tools during problem-solving, using the solutions generated in the Python and Tavily API tool-use setups.
Benchmarking model performance across scientific domains, since the problems span Physics, Biology, and Chemistry.
Strengths
Covers three distinct scientific domains: Physics, Biology, and Chemistry.
Includes two question formats: multiple-choice questions (MCQ) and open questions (OpenQ).
Provides solutions generated via three distinct reasoning setups: chain-of-thought, Python tool usage, and search tool usage.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to assess whether the dataset is suitable for a given task.
The dataset includes synthetic data, which may not fully reflect the complexity of real-world science problems.
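Because column documentation and row count are both missing, a first pass after download usually means counting records and inspecting each field. The following is a minimal sketch of that step, assuming the data can be read as JSON Lines (the actual file format is not stated here). The helper names `load_jsonl` and `summarize_fields`, and the field names in `SAMPLE`, are hypothetical and only illustrate the approach.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumes a JSON Lines file)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize_fields(records):
    """Report, per field: observed value types, non-null count, first non-null example."""
    summary = defaultdict(lambda: {"types": set(), "present": 0, "example": None})
    for record in records:
        for key, value in record.items():
            info = summary[key]
            info["types"].add(type(value).__name__)
            if value is not None:
                info["present"] += 1
                if info["example"] is None:
                    info["example"] = value
    # Convert the type sets to sorted lists so the result is stable and printable.
    return {
        key: {
            "types": sorted(info["types"]),
            "present": info["present"],
            "example": info["example"],
        }
        for key, info in summary.items()
    }


# Hypothetical records; the real field names must be read from the downloaded data.
SAMPLE = [
    {"question": "Q1", "domain": "Physics", "options": ["A", "B"]},
    {"question": "Q2", "domain": "Biology", "options": None},
]

row_count = len(SAMPLE)
field_summary = summarize_fields(SAMPLE)
```

On real data, `summarize_fields(load_jsonl(path))` gives a quick view of which fields exist and how often they are populated, which is a reasonable starting point for inferring their meaning.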
Provenance
Source
NVIDIA
Collection Method
Combines synthetically generated problems (MCQ, RQA) with non-synthetic vendor problems, each paired with LLM-generated solutions.
Time Range
Not specified
Freshness
Last updated 2026-06-04 04:50:29; freshness should be verified.
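One way to verify freshness is to compute the age of the last-updated timestamp against a reference time. This is a small sketch under the assumption that the timestamp is in UTC (the time zone is not stated); the function name `age_in_days` is hypothetical.

```python
from datetime import datetime, timezone

# Timestamp as listed above; UTC is an assumption, since no time zone is given.
LAST_UPDATED = "2026-06-04 04:50:29"


def age_in_days(last_updated, now):
    """Return the number of days between `last_updated` (assumed UTC) and `now`."""
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S")
    updated = updated.replace(tzinfo=timezone.utc)
    return (now - updated).total_seconds() / 86400
```

Passing `datetime.now(timezone.utc)` as `now` gives the current age; a threshold for what counts as stale depends on the use case.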
Geography
Not specified
License is unknown; terms of use must be verified before the dataset is used.