SmolTalk is a synthetic dataset of 1 million samples created for supervised fine-tuning (SFT) of large language models. HuggingFaceTB developed it to address performance gaps observed with existing public SFT datasets and used it to build the SmolLM2-Instruct model family. A research paper documents the dataset's methodology and details.
The full dataset description, column details, sample data, and license information are available only on the Hugging Face dataset page. For methodological context, see the linked research paper.
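As a rough illustration of how a dataset like this might be pulled in for SFT, the sketch below streams a few samples with the Hugging Face `datasets` library and flattens each conversation into one string. Several things here are assumptions rather than facts taken from this page: the repository ID `HuggingFaceTB/smoltalk`, the `all` config name, a `messages` column holding `role`/`content` dictionaries, and the `<|role|>` text layout, which is a hypothetical format and not the SmolLM2 chat template. Check the dataset page before relying on any of them.

```python
from typing import Dict, List


def to_sft_text(messages: List[Dict[str, str]]) -> str:
    """Flatten one chat-style sample into a single training string.

    Assumes each message is a dict with "role" and "content" keys.
    The "<|role|>" markers are a hypothetical layout for illustration.
    """
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)


def load_smoltalk_sample(n: int = 3) -> List[str]:
    """Stream the first n samples and format them (requires network access)."""
    # Imported lazily so to_sft_text works without `datasets` installed.
    from datasets import load_dataset

    # Assumed repository ID, config name, and column name; verify them
    # on the dataset page.
    ds = load_dataset(
        "HuggingFaceTB/smoltalk", "all", split="train", streaming=True
    )
    return [to_sft_text(row["messages"]) for row in ds.take(n)]
```

Streaming avoids downloading all 1 million samples just to inspect a handful; calling `load_smoltalk_sample(3)` would return three formatted conversations if the assumed names are correct.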