Description
ChemCoTBench-V2 is a public, active benchmark of 5,620 samples for evaluating chemical reasoning in large language models. Created by fresnellll, it evaluates both final-answer correctness and process-level reasoning by pairing model-facing inputs with verified formal reasoning traces. It was last updated on June 3, 2026.
Use Cases
Benchmarking the final-answer accuracy of LLMs on chemical problems against the verified answers.
Evaluating the step-by-step reasoning process of LLMs against the provided formal reasoning traces.
Training or fine-tuning models for improved scientific reasoning using the model-facing inputs and verified traces.
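As a rough illustration of the first use case, final-answer accuracy can be sketched as an exact-match comparison. The field names `id` and `answer` are hypothetical, because the dataset's columns are undocumented, and whitespace-only normalization is an assumption; chemically aware matching (for example, canonicalizing SMILES strings) would likely be needed in practice.

```python
def normalize(text: str) -> str:
    """Collapse whitespace only; case is preserved because it can be meaningful in chemical notation."""
    return " ".join(text.split())


def final_answer_accuracy(predictions: dict[str, str], references: list[dict]) -> float:
    """Fraction of reference items whose predicted answer matches exactly after normalization.

    Items with no prediction count as wrong.
    """
    if not references:
        raise ValueError("references must not be empty")
    correct = 0
    for item in references:
        predicted = predictions.get(item["id"])  # "id" is a hypothetical field name
        if predicted is not None and normalize(predicted) == normalize(item["answer"]):
            correct += 1
    return correct / len(references)


# Placeholder items, not taken from the dataset.
references = [
    {"id": "q1", "answer": "CCO"},
    {"id": "q2", "answer": "2"},
]
predictions = {"q1": " CCO ", "q2": "3"}
print(final_answer_accuracy(predictions, references))  # 0.5
```

Exact match is the simplest possible scorer; it says nothing about the quality of the reasoning, which would need a separate comparison against the formal traces.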
Strengths
Contains 5,620 benchmark samples for evaluation.
Each item includes a verified formal reasoning trace for process-level assessment.
Specifically designed for evaluating both answer correctness and reasoning process in chemical domains.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not independently confirmed; the 5,620-sample figure comes from the description and should be checked after download.
Description metadata is limited; actual data quality requires manual inspection after download.
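Given these limitations, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded as a list of dictionaries; the loading step is omitted, and the field names in the sample rows are placeholders rather than the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(records: list[dict]) -> dict:
    """Report the row count plus, per field, the observed value types and missing count."""
    all_keys = set()
    for record in records:
        all_keys.update(record)

    types = defaultdict(set)
    missing = defaultdict(int)
    for record in records:
        for key in all_keys:
            value = record.get(key)
            if value is None:
                # Absent keys and explicit None values are both counted as missing.
                missing[key] += 1
            else:
                types[key].add(type(value).__name__)

    return {
        "row_count": len(records),
        "fields": {
            key: {"types": sorted(types[key]), "missing": missing[key]}
            for key in sorted(all_keys)
        },
    }


# Placeholder rows; real records would come from the downloaded files.
sample_rows = [
    {"prompt": "Balance: H2 + O2 -> H2O", "answer": "2 H2 + O2 -> 2 H2O", "trace": ["step 1"]},
    {"prompt": "Name CH4", "answer": "methane"},
]
report = summarize_fields(sample_rows)
print(report["row_count"])
for name, info in report["fields"].items():
    print(name, info)
```

The resulting row count can be compared against the stated 5,620 samples, and the per-field summary gives a starting point for inferring column semantics.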
Provenance
Source
Hugging Face user fresnellll
Collection Method
Likely curated for the purpose of model evaluation, as described in the associated research.
Freshness
Last updated 2026-06-03 06:23:18; freshness should be verified.
License is unknown; terms of use must be verified before application.