Description
A blend of publicly available datasets for instruction tuning, including samples from OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K. The dataset was created by NVIDIA and last updated on March 9, 2024. It has four columns; the available metadata lists neither the column names nor the total number of rows.
Use Cases
Fine-tuning language models for instruction-following based on the blended OASST and FLAN samples.
Training models on mathematical reasoning tasks based on the included GSM8K data.
Improving code generation and understanding models based on the CodeContests subset.
Creating general-purpose conversational AI using the mixture of instruction-tuning datasets.
Strengths
Sourced from multiple established datasets including OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K.
Described as including only subsets whose licenses permit commercial use; this should be confirmed per subset.
The sampling strategy accounts for source dataset sizes and mixing ratios, which may balance representation across sources.
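The exact sampling procedure used for this blend is not documented here. As a rough illustration of what size- and ratio-aware blending can mean, the sketch below draws samples from each source in proportion to a mixing weight and caps each draw at the source's size. The function name `blend`, the toy source contents, and the weights are all hypothetical and are not taken from the dataset.

```python
import random

def blend(sources, ratios, total, seed=0):
    """Draw about `total` samples across sources in proportion to `ratios`.

    sources: dict mapping a source name to a list of samples.
    ratios:  dict mapping the same names to non-negative weights.
    A source never contributes more samples than it contains.
    """
    rng = random.Random(seed)
    weight_sum = sum(ratios[name] for name in sources)
    blended = []
    for name, samples in sources.items():
        target = round(total * ratios[name] / weight_sum)
        count = min(target, len(samples))  # cap at the source size
        for sample in rng.sample(samples, count):
            blended.append({"source": name, "sample": sample})
    rng.shuffle(blended)  # interleave sources in the final mix
    return blended

# Toy stand-ins for the real subsets; sizes and weights are invented.
toy_sources = {
    "gsm8k": [f"math-{i}" for i in range(50)],
    "flan": [f"task-{i}" for i in range(1000)],
    "oasst": [f"chat-{i}" for i in range(200)],
}
toy_ratios = {"gsm8k": 1.0, "flan": 2.0, "oasst": 1.0}
mix = blend(toy_sources, toy_ratios, total=100)
```

Capping at the source size means a small source can fall short of its target share, so the realized proportions should be checked after blending rather than assumed.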
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count is not reported, which makes it hard to judge before download whether the dataset is large enough for a given task.
Description metadata is limited; actual data quality requires manual inspection after download.
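Because the column names and row count are undocumented, a first pass after download could simply tally them. The sketch below assumes the data can be read as JSON Lines (one JSON object per line), which is an assumption about the file format and not something the metadata confirms. The helper names `read_jsonl` and `summarize_records`, and the column names in the toy records, are hypothetical.

```python
import json
from collections import Counter

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

def summarize_records(records):
    """Count rows and, per column, non-null values and Python value types."""
    rows = 0
    columns = {}
    for record in records:
        rows += 1
        for key, value in record.items():
            stats = columns.setdefault(key, {"non_null": 0, "types": Counter()})
            if value is not None:
                stats["non_null"] += 1
            stats["types"][type(value).__name__] += 1
    return {"row_count": rows, "columns": columns}

# Toy records with invented column names, standing in for the real file.
toy_records = [
    {"prompt": "What is 2 + 2?", "response": "4", "source": "gsm8k"},
    {"prompt": "Say hello.", "response": None, "source": "oasst"},
]
summary = summarize_records(toy_records)
```

For the real file, `summarize_records(read_jsonl(path))` would produce the same kind of summary without loading every row into memory at once.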
Provenance
Source
NVIDIA, blending samples from OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K.
Collection Method
Blended and sampled from multiple publicly available datasets.
Freshness
Last updated 2024-03-09 00:05:34; check the source for newer revisions before use.
License
The license of the final blend is not stated; users should verify that each included subset permits commercial use.