Name: Kurtis EON1 SFT: 200,000 Samples for Instruction-Tuning
Creator: ethicalabs
Published: 2025-12-31T13:30:42
Keywords: Library: polars, Library: dask, OPTIMIZED-PARQUET, Text Generation, Modality: text, Size Category: 100K < n < 1M, Library: mlcroissant, Library: datasets, Text, Parquet, AI Training, License: Apache 2.0, Synthetic Data, Region: EU

Description

The dataset contains 200,000 text samples aggregated from nine public datasets, including HuggingFaceTB/cosmopedia-v2 (32.15%) and teknium/OpenHermes-2.5 (29.62%). It was created by ethicalabs and last updated on March 16, 2026. It appears to be a curated collection for supervised fine-tuning of language models.
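
As a rough check on the stated shares, the sketch below converts the two percentages given above into approximate sample counts. It assumes the percentages apply to the full 200,000 samples; the variable and function names are illustrative only. The other seven sources are not itemized here, so only their combined share is derived.

```python
TOTAL_SAMPLES = 200_000

# Shares stated in the description (percent of all samples).
stated_shares = {
    "HuggingFaceTB/cosmopedia-v2": 32.15,
    "teknium/OpenHermes-2.5": 29.62,
}

def approx_count(share_percent, total=TOTAL_SAMPLES):
    """Convert a percentage share into an approximate number of samples."""
    return round(total * share_percent / 100)

counts = {name: approx_count(pct) for name, pct in stated_shares.items()}

# Whatever is not attributed to the two named sources belongs to the other seven.
remaining_share = round(100 - sum(stated_shares.values()), 2)
remaining_count = TOTAL_SAMPLES - sum(counts.values())

for name, n in counts.items():
    print(f"{name}: ~{n:,} samples")
print(f"Other seven sources combined: {remaining_share}% (~{remaining_count:,} samples)")
```

Under that assumption, the two named sources account for roughly 64,300 and 59,240 samples, leaving about 76,460 samples (38.23%) spread across the remaining seven sources.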

Use Cases

Supervised fine-tuning of language models on instruction-response pairs.
Training conversational AI agents on the psychology and mental health therapy subsets.
Benchmarking model performance on diverse instruction types drawn from the multiple source datasets.
Exploring data synthesis and mixture strategies for AI training, using the stated source distribution as a reference.

Strengths

200,000 total samples provide a substantial base for model training.
Aggregates data from nine distinct sources, suggesting diversity in content and style.
The dataset is stored in an optimized Parquet format, which suggests efficient storage and access.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Only two of the nine source shares are listed here, so the per-source row counts are largely unknown, which may limit suitability assessment.
Data may reflect source bias inherent to the aggregated Hugging Face datasets.
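
Because column-level documentation is absent, one way to begin inferring field semantics is to summarize which fields appear in a handful of rows and what value types they hold. The following is a minimal sketch using only the Python standard library. The field names `instruction`, `response`, and `source` in the sample rows are hypothetical placeholders, not the dataset's confirmed columns; in practice the rows would come from the downloaded Parquet file through a reader such as polars or datasets.

```python
from collections import defaultdict

def summarize_fields(rows):
    """Map each field name to the sorted list of value type names seen."""
    seen = defaultdict(set)
    for row in rows:
        for field, value in row.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}

# Hypothetical rows; the real column names must be checked after download.
sample_rows = [
    {"instruction": "Explain Parquet.", "response": "A columnar file format.", "source": "example"},
    {"instruction": "Say hi.", "response": None, "source": "example"},
]

summary = summarize_fields(sample_rows)
for field, types in summary.items():
    print(field, types)
```

A field that reports more than one type, such as a mix of `str` and `NoneType`, is a hint that some rows have missing values and deserves a closer look before training.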

Provenance

Source: Aggregated from nine public datasets on Hugging Face.
Collection Method: Likely a curated mixture of existing text datasets.
Time Range: Not specified
Freshness: Last updated 2026-03-16 13:22:51; freshness should be verified.
Geography: Not specified

The dataset is tagged with the Apache 2.0 license; users should still verify the license of each source dataset before use.
