Description
CohereLabs provides synthetic text completions for the s1K-X training split prompts, generated by five different large language models. The dataset includes outputs from models like GEMMA3-27B-IT, KIMI-K2-INSTRUCT, and QWEN3-235B, sampled at a temperature of 0.3. This collection, last updated in October 2025, is designed for research into model aggregation and training data synthesis.
Use Cases
Training or fine-tuning language models based on aggregated synthetic completions.
Comparing response quality and style across different teacher models, such as GEMMA3-27B-IT and QWEN3-235B.
Developing algorithms for selecting the best output from multiple model generations.
Studying the diversity of synthetic data sampled at a low temperature (T=0.3).
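As an illustration of the output-selection use case, the sketch below picks, for a single prompt, the completion that agrees most with the others by token overlap. This is one simple heuristic chosen for illustration, not a method documented for this dataset. The function names, the model keys, and the toy completions are all hypothetical.

```python
from typing import Dict


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings (whitespace tokens)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_consensus(completions: Dict[str, str]) -> str:
    """Return the model name whose completion is, on average,
    most similar to the completions of the other models."""
    if not completions:
        raise ValueError("no completions given")
    if len(completions) == 1:
        return next(iter(completions))

    def mean_similarity(name: str) -> float:
        others = [text for other, text in completions.items() if other != name]
        return sum(jaccard(completions[name], text) for text in others) / len(others)

    return max(completions, key=mean_similarity)


# Made-up completions for one prompt; keys are placeholder model names.
toy = {
    "model_a": "the answer is 42",
    "model_b": "the answer is 42 indeed",
    "model_c": "the answer is probably 41",
}
best = select_consensus(toy)
```

Token overlap rewards agreement rather than correctness, so in practice a reward model or task-specific verifier would likely replace `jaccard`.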
Strengths
Includes completions from five distinct teacher models, providing comparative breadth.
Outputs are generated with a specified sampling temperature (T=0.3), offering consistency.
Completions were generated by both cloud-hosted (TogetherAI) and locally hosted model instances.
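To make the sampling temperature concrete, the sketch below applies the standard temperature-scaled softmax to a made-up set of logits. The listing does not describe how the teacher models were decoded beyond T=0.3, so the function name and the logits are assumptions for illustration only.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
p_low = softmax_with_temperature(logits, 0.3)  # temperature reported for this dataset
p_one = softmax_with_temperature(logits, 1.0)  # unscaled baseline for comparison
```

At T=0.3 the distribution concentrates on the highest-scoring token more than at T=1.0, which is why low-temperature samples tend to be more consistent and less diverse.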
Limitations
Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which makes it hard to assess suitability before download.
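Because the schema and row count are undocumented, a first inspection pass after download might look like the sketch below. It assumes the rows have already been loaded as a list of dictionaries. The field names in the sample are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple


def summarize_columns(rows: Iterable[Dict[str, Any]]) -> Tuple[int, Dict[str, Dict[str, Any]]]:
    """Count rows and record, per column, the observed value types
    and the number of non-null values."""
    summary: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "non_null": 0}
    )
    row_count = 0
    for row in rows:
        row_count += 1
        for column, value in row.items():
            entry = summary[column]  # registers the column even if the value is null
            if value is not None:
                entry["types"].add(type(value).__name__)
                entry["non_null"] += 1
    return row_count, dict(summary)


# Hypothetical rows standing in for the downloaded data.
sample_rows = [
    {"prompt": "What is 2 + 2?", "model": "model_a", "completion": "4"},
    {"prompt": "Name a prime.", "model": "model_b", "completion": None},
]
row_count, summary = summarize_columns(sample_rows)
```

Columns with mixed types or many nulls in the summary are the ones most worth checking by hand.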
Provenance
Source
CohereLabs via Hugging Face.
Collection Method
Synthetic text generation from five specified LLMs for a defined prompt split.
Time Range
Not specified.
Freshness
Last updated 2025-10-02 05:39:38; freshness should be verified.
Geography
Not specified.
License is unknown; terms of use must be verified before use.