Name: Claude Distills: Unified Datasets for Language Model Distillation
Creator: ansulev
Published: 2026-06-16T16:25:36
Keywords: Claude Distillation, Text, Language Model, Instruction Tuning, Synthetic Data

Description

A curated collection of open-source datasets for distilling knowledge from Anthropic's Claude models. The repository lists at least two unified subsets: 'claude-sonnet-4.6-120000x' (119,446 samples) and 'claude-opus-4.6-10000x' (9,633 samples). ansulev aggregated and formatted the data, crediting the original creators, and last updated it on 2026-06-16.

Use Cases

Train student models via knowledge distillation on the general, code, math, and psychology data that the description mentions.
Benchmark distillation techniques on the unified, deduplicated subsets.
Fine-tune models for specific tasks such as code generation or mathematical reasoning, using the relevant portions of the data.

Strengths

The listed subsets provide at least 129,079 samples in total (119,446 + 9,633).
The data is described as unified and deduplicated across its sources.
The source of each subset is documented, with credit attributed to original creators.
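
The sample total can be checked directly from the two subset sizes quoted in the description. A minimal sketch; it uses only those published figures and does not query the repository:

```python
# Subset sizes as quoted in the description above.
subset_sizes = {
    "claude-sonnet-4.6-120000x": 119_446,
    "claude-opus-4.6-10000x": 9_633,
}

# Lower bound on the collection size: only the listed subsets are counted.
total = sum(subset_sizes.values())
print(total)  # 129079
```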

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count for the full collection is not stated; only the two listed subsets have known sizes, which may limit suitability assessment.
Description metadata is limited; data quality can only be judged by manual inspection after download.
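
Because the columns are undocumented, a first pass after download might simply tally which fields appear and what types they hold. The sketch below assumes the rows have already been loaded as a list of dictionaries; the field names `prompt`, `response`, and `source` are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Return {field: {"present": count, "types": Counter}} for dict records."""
    summary = defaultdict(lambda: {"present": 0, "types": Counter()})
    for record in records:
        for field, value in record.items():
            summary[field]["present"] += 1
            summary[field]["types"][type(value).__name__] += 1
    return dict(summary)


# Hypothetical rows for illustration; the real column names are not documented.
sample = [
    {"prompt": "What is 2 + 2?", "response": "4", "source": "math"},
    {"prompt": "Reverse a string in Python.", "response": "s[::-1]"},
]

for field, info in summarize_fields(sample).items():
    print(field, info["present"], dict(info["types"]))
```

Fields that appear in only some rows, or that mix types, are the ones most worth inspecting by hand.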

Provenance

Source: Aggregated from multiple open-source Claude distillation datasets.
Collection Method: Curated, unified, and deduplicated from original sources.
Freshness: Last updated 2026-06-16 16:25:36; check the source for later revisions.
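
The card does not say how unification and deduplication were carried out. As an illustration only, one common approach is exact-match deduplication on normalized text. The sketch below assumes each record is a dictionary with hypothetical `prompt` and `response` fields; it is not the creator's documented method.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(records, keys=("prompt", "response")):
    """Keep the first record for each distinct normalized combination of keys."""
    seen = set()
    unique = []
    for record in records:
        # Join fields with a separator that is unlikely to occur in the text.
        joined = "\x1f".join(normalize(record.get(k, "")) for k in keys)
        digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


# Hypothetical rows merged from two sources; field names are assumptions.
rows = [
    {"prompt": "Define entropy.", "response": "A measure of uncertainty."},
    {"prompt": "define  entropy.", "response": "A measure of uncertainty."},
    {"prompt": "Define entropy.", "response": "Disorder in a system."},
]

unique_rows = deduplicate(rows)
print(len(unique_rows))
```

Exact matching after normalization only catches trivially repeated rows; near-duplicates with reworded prompts would need a fuzzy method, which is not shown here.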

License: Unknown; verify the terms of use for the aggregated data before using it.
