Name: Fusion Synth Data S1Kx: Synthetic Completions from 5 LLMs
Creator: CohereLabs
Published: 2025-09-30T11:55:31
Keywords: Text Generation, Model Comparison, Text, LLM Training, Synthetic Data, Synthetic

Description

CohereLabs provides synthetic text completions for the s1K-X training split prompts, generated by five different large language models. The dataset includes outputs from models like GEMMA3-27B-IT, KIMI-K2-INSTRUCT, and QWEN3-235B, sampled at a temperature of 0.3. This collection, last updated in October 2025, is designed for research into model aggregation and training data synthesis.

Use Cases

Training or fine-tuning language models based on aggregated synthetic completions.
Comparing response quality and style across teacher models such as GEMMA3-27B-IT and QWEN3-235B.
Developing algorithms for selecting the best output from multiple model generations.
Studying the diversity of synthetic data sampled at a low temperature (T=0.3).
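
As an illustration of the output-selection use case, the sketch below picks one completion out of several candidates for the same prompt. It does not use the dataset's real schema, which is undocumented: the model names, the candidate texts, and the `length_score` heuristic are all hypothetical placeholders, and a real setup would likely swap in a reward model or judge as the scoring function.

```python
from typing import Callable, Dict


def select_best(completions: Dict[str, str], score: Callable[[str], float]) -> str:
    """Return the name of the model whose completion gets the highest score."""
    if not completions:
        raise ValueError("no completions to choose from")
    return max(completions, key=lambda name: score(completions[name]))


def length_score(text: str) -> float:
    # Hypothetical placeholder heuristic: prefer answers close to 50 words.
    return -abs(len(text.split()) - 50)


# Hypothetical candidates for a single prompt (not real dataset rows).
candidates = {
    "model_a": "short answer",
    "model_b": " ".join(["word"] * 48),
    "model_c": " ".join(["word"] * 120),
}

best = select_best(candidates, length_score)
print(best)
```

Keeping the scoring function as a parameter means the selection logic stays the same when the heuristic is replaced by something stronger.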

Strengths

Includes completions from five distinct teacher models, providing comparative breadth.
Outputs are generated at a stated sampling temperature (T=0.3), so all completions share the same decoding setting.
Data is sourced from both cloud (TogetherAI) and locally hosted model instances.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which makes it harder to assess suitability before download.
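
Because column documentation is missing, a first inspection pass after download might look like the sketch below. It assumes the rows have already been loaded as a list of dictionaries; the field names `prompt` and `completion` are hypothetical and stand in for whatever columns the dataset actually contains.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_columns(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Collect the observed value types and non-null counts for each column."""
    summary: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "non_null": 0}
    )
    for row in rows:
        for key, value in row.items():
            entry = summary[key]  # registers the column even if the value is None
            if value is not None:
                entry["types"].add(type(value).__name__)
                entry["non_null"] += 1
    return dict(summary)


# Hypothetical sample rows; the real field names must be checked after download.
sample_rows = [
    {"prompt": "p1", "completion": None},
    {"prompt": "p2", "completion": "c2"},
]

summary = summarize_columns(sample_rows)
print(len(sample_rows), summary)
```

The same pass also yields the row count, which the listing does not report.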

Provenance

Source: CohereLabs via Hugging Face.
Collection Method: Synthetic text generation from five specified LLMs for a defined prompt split.
Time Range: Not specified.
Freshness: Last updated 2025-10-02 05:39:38; freshness should be verified.
Geography: Not specified.

License is unknown; terms of use must be verified before application.
