Description
A curated collection of open-source datasets for distilling knowledge from Anthropic's Claude models. The repository contains at least two unified subsets, including 'claude-sonnet-4.6-120000x' with 119,446 samples and 'claude-opus-4.6-10000x' with 9,633 samples. The data was aggregated and formatted by ansulev, with credit to original creators, and was last updated on 2026-06-16.
Use Cases
Train student models via knowledge distillation on the general, code, math, and psychology data.
Benchmark distillation techniques against the unified, deduplicated subsets.
Fine-tune models for specific tasks, such as code generation or mathematical reasoning, using the relevant portions of the data.
Strengths
The repository provides at least 129,079 total samples across its listed subsets.
According to the description, the data has been unified and deduplicated.
The source of each subset is documented, with credit attributed to original creators.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The total row count for the full collection is not stated; only the two listed subsets (129,079 samples combined) are counted, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
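Because field semantics have to be inferred after download, a quick profiling pass can help. The following is a minimal sketch, assuming the data can be read as JSON Lines with one object per row; the file format is not stated in the description, and the column names `prompt`, `response`, and `domain` are hypothetical placeholders, not the dataset's documented schema.

```python
import json
from collections import Counter, defaultdict

def summarize_fields(jsonl_lines, max_rows=1000):
    """Report, per field, how many rows contain it and which value types occur."""
    presence = Counter()
    types = defaultdict(Counter)
    rows = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if rows >= max_rows:
            break  # profile only a sample of the rows
        record = json.loads(line)  # assumes each row is a JSON object
        rows += 1
        for key, value in record.items():
            presence[key] += 1
            types[key][type(value).__name__] += 1
    summary = {
        key: {"present_in": presence[key], "types": dict(types[key])}
        for key in presence
    }
    return summary, rows

# Hypothetical sample rows; the real column names are not documented.
sample = [
    '{"prompt": "2+2?", "response": "4", "domain": "math"}',
    '{"prompt": "Hi", "response": "Hello", "domain": null}',
]
summary, n = summarize_fields(sample)
for field, info in summary.items():
    print(field, info)
```

With a real download, the same function could be fed an open file handle instead of the `sample` list. Fields that appear in only some rows, or that mix types, are the ones most worth inspecting by hand.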
Provenance
Source
Aggregated from multiple open-source Claude distillation datasets.
Collection Method
Curated, unified, and deduplicated from original sources.
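The description does not say how deduplication was performed. As an illustration only, the sketch below shows one common approach, exact-match deduplication on whitespace- and case-normalized text; it is not the aggregator's documented method, and the field names `prompt` and `response` are assumptions.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def dedupe(records, key_fields=("prompt", "response")):
    """Keep the first record for each distinct normalized key."""
    seen = set()
    unique = []
    for record in records:
        # Join the normalized key fields with a separator unlikely to occur in text.
        joined = "\x1f".join(normalize(str(record.get(f, ""))) for f in key_fields)
        digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

# Hypothetical rows: the second differs from the first only in case and spacing.
rows = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "what is  2+2?", "response": "4"},
    {"prompt": "Name a prime.", "response": "7"},
]
deduped = dedupe(rows)
print(len(rows), "->", len(deduped))  # prints "3 -> 2"
```

Running a check like this on the downloaded subsets is one way to verify the deduplication claim independently. Near-duplicates, such as paraphrased prompts, would need a fuzzier method than exact hashing.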
Freshness
Last updated 2026-06-16 16:25:36; freshness should be verified.
License
License is unknown; terms of use for the aggregated data should be verified.