Description
This dataset contains 550,000 reasoning traces distilled from the KIMI-K2.5 language model on reasoning-intensive tasks. It contains 2 billion tokens, distributed across coding (60%), science (15%), math (10%), computer science (5%), logical questions (5%), and creative writing (5%). It was created by ansulev and last updated on Hugging Face in April 2026.
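The stated percentages can be turned into rough per-domain sizes. The sketch below is only an illustration: it assumes the percentages apply to the trace count (the description does not say whether they are measured by traces or by tokens), and the names `estimate_domain_counts` and `DOMAIN_SHARE_PERCENT` are hypothetical.

```python
# Figures taken from the description. Whether the percentages are measured
# by trace count or by token count is an assumption made for this sketch.
TOTAL_TRACES = 550_000
TOTAL_TOKENS = 2_000_000_000

DOMAIN_SHARE_PERCENT = {
    "coding": 60,
    "science": 15,
    "math": 10,
    "computer science": 5,
    "logical questions": 5,
    "creative writing": 5,
}


def estimate_domain_counts(total, share_percent):
    """Return approximate item counts per domain using integer arithmetic."""
    return {domain: total * pct // 100 for domain, pct in share_percent.items()}


# The six shares should cover the whole dataset.
assert sum(DOMAIN_SHARE_PERCENT.values()) == 100

trace_estimates = estimate_domain_counts(TOTAL_TRACES, DOMAIN_SHARE_PERCENT)
mean_tokens_per_trace = TOTAL_TOKENS / TOTAL_TRACES

for domain, count in trace_estimates.items():
    print(f"{domain:18s} ~{count:,} traces")
print(f"mean tokens per trace ~{mean_tokens_per_trace:,.0f}")
```

Under that assumption the coding subset would hold roughly 330,000 traces, and the average trace would be a few thousand tokens long. These are estimates derived from the description, not measured values.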
Use Cases
Training or fine-tuning language models for code generation based on the 60% coding subset.
Benchmarking model reasoning capabilities on science problems using the Physics, Chemistry, and Biology traces.
Studying step-by-step logical reasoning processes for math and logical questions.
Analyzing the structure of model-generated reasoning traces across different domains like creative writing.
Strengths
Large scale with 550,000 distinct reasoning traces.
Broad domain coverage across six distinct categories, with coding being the largest at 60%.
Substantial token volume of 2 billion tokens for training or analysis.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown beyond the 550,000 traces stated in the description, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
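Because column documentation is missing, a first pass over a handful of downloaded records can help infer what each field holds. The following is a minimal sketch using only the standard library; the helper `summarize_fields` and the field names in `sample_records` (`prompt`, `reasoning`, `domain`, `tokens`) are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_fields(records):
    """Summarize each field: observed value types, number of records that
    contain it, and the longest string value seen."""
    summary = defaultdict(lambda: {"types": set(), "present": 0, "max_len": 0})
    for record in records:
        for name, value in record.items():
            info = summary[name]
            info["types"].add(type(value).__name__)
            info["present"] += 1
            if isinstance(value, str):
                info["max_len"] = max(info["max_len"], len(value))
    return dict(summary)


# Hypothetical records standing in for rows loaded from the downloaded files.
sample_records = [
    {"prompt": "Reverse a list.", "reasoning": "Use slicing.", "domain": "coding"},
    {"prompt": "2 + 2?", "reasoning": "Add the numbers.", "domain": "math", "tokens": 12},
]

field_summary = summarize_fields(sample_records)
print(f"rows inspected: {len(sample_records)}")
for name, info in field_summary.items():
    print(name, sorted(info["types"]), info["present"], info["max_len"])
```

Fields that appear in only some records, or that mix several types, are the ones most worth inspecting by hand before training on the data.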
Provenance
Source
Distilled from the KIMI-K2.5 language model.
Collection Method
Collected using a modified Datagen tool, as referenced in the description.
Freshness
Last updated 2026-04-03 09:49:47; freshness should be verified.
License
Unknown; users must verify the terms of use before download.