Name: Lm 100M En Sft Data: 100M-Parameter English Language Model Instruction Tuning Dataset
Creator: Aeryx-ai
Published: 2026-06-18T17:35:31
Keywords: Text, Language Model, English Text, Instruction Tuning, Sft Data

Description

A 100M English language model instruction tuning dataset used for supervised fine-tuning. The dataset, created by Aeryx-ai, combines the shared ChatML instruct dataset, SmolTalk core, and Dolly-15k. It was used in an experiment comparing two ~100M-parameter models with identical architecture and SFT but different pretraining token budgets.

Use Cases

Supervised fine-tuning of small language models based on the described instruction data.
Studying the impact of pretraining token budget on model performance based on the described A/B experiment.
Replicating or analyzing the instruction-tuning process for the described ~100M-parameter models.
Creating instruction-response pairs for model training based on the ChatML and Dolly-15k sources mentioned.

Strengths

Dataset was used for a controlled experiment comparing two ~100M-parameter models, suggesting a defined purpose.
Combines multiple established instruction sources: ChatML, SmolTalk core, and Dolly-15k.
Last updated on 2026-06-18 17:35:34, indicating recent activity.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and license information are unknown, which may limit suitability assessment.

Provenance

Source: huggingface, author Aeryx-ai
Collection Method: Likely aggregated and processed from the ChatML instruct dataset, SmolTalk core, and Dolly-15k.
Freshness: Last updated 2026-06-18 17:35:34; freshness should be verified.

License is unknown, which may restrict usage. The description references a 32k tokenizer and the dropping of all-masked windows at pack time, which may affect data format.

Text Language Model English Text Instruction Tuning Sft Data

Lm 100M En Sft Data: 100M-Parameter English Language Model Instruction Tuning Dataset

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info