Available on 1 platform
An English instruction-tuning dataset for supervised fine-tuning (SFT) of ~100M-parameter language models. The dataset, created by Aeryx-ai, combines the shared ChatML instruct dataset, SmolTalk core, and Dolly-15k. It was used in an experiment comparing two ~100M-parameter models with identical architecture and identical SFT but different pretraining token budgets.
The license is unknown, so usage rights are unclear and may be restricted. The description references a 32k tokenizer and the dropping of all-masked windows at pack time, both of which may affect the data format.
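The description does not say how packing is implemented, so the following is only a minimal sketch of what "dropping all-masked windows at pack time" could look like. It assumes each example is a pair of token IDs and a 0/1 loss mask (1 marks tokens that contribute to the SFT loss), and that examples are concatenated and cut into fixed-length windows. The function name `pack_windows`, the `window_len` parameter, and the choice to discard the trailing partial window are all hypothetical, not taken from the dataset.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (token_ids, loss_mask), where mask 1 = supervised token.
Window = Tuple[List[int], List[int]]


def pack_windows(examples: Iterable[Window], window_len: int) -> List[Window]:
    """Concatenate examples and cut the stream into fixed-length windows.

    Windows whose loss mask is entirely 0 are dropped, because they
    would contribute nothing to the fine-tuning loss.
    """
    ids: List[int] = []
    mask: List[int] = []
    for tok, m in examples:
        if len(tok) != len(m):
            raise ValueError("token_ids and loss_mask must have equal length")
        ids.extend(tok)
        mask.extend(m)

    windows: List[Window] = []
    # Assumption: the trailing partial window is discarded rather than padded.
    for start in range(0, len(ids) - window_len + 1, window_len):
        w_ids = ids[start:start + window_len]
        w_mask = mask[start:start + window_len]
        if any(w_mask):  # keep only windows with at least one supervised token
            windows.append((w_ids, w_mask))
    return windows
```

Under these assumptions, a window made up only of prompt tokens (all mask values 0) never reaches the trainer, so the number of packed windows can be smaller than the raw token count would suggest.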