Name: Fusion Synth Data UFB: Multilingual Model Completions for 10,000 Prompts
Creator: CohereLabs
Published: 2025-09-30T11:26:37
Keywords: Text Generation, Text, Multilingual, Language Model, Synthetic Data, Synthetic

Description

This dataset is a 10,000-prompt subset of the UFB dataset, translated into 9 languages, with completions generated by 5 teacher models plus 2 aggregations. It was created by CohereLabs and last updated on October 2, 2025. Completions were sampled from five teacher models: GEMMA3-27B-IT, kimik2, qwen3, deepseek-v3, and command-a.

Use Cases

Training or fine-tuning language models based on synthetic multilingual completions.
Evaluating model performance and response quality across different teacher models.
Researching best-of-N sampling and aggregation methods for text generation.
Analyzing the stylistic or qualitative differences in outputs from various model architectures.
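
As a rough illustration of the best-of-N use case, the sketch below picks one completion per prompt from several teacher outputs. It is not the method used to build this dataset: the best_of_n helper, the example completions, and the word-count scorer are all hypothetical, and a real setup would most likely substitute a reward model or judge for the scorer. Only the teacher names are taken from the description.

```python
from typing import Callable, Dict, Tuple


def best_of_n(completions: Dict[str, str],
              score: Callable[[str], float]) -> Tuple[str, str]:
    """Return the (teacher_name, completion) pair with the highest score."""
    if not completions:
        raise ValueError("no completions to choose from")
    # max() keeps the first maximal key, so ties resolve by insertion order.
    teacher = max(completions, key=lambda name: score(completions[name]))
    return teacher, completions[teacher]


# Hypothetical data: the completions and the scorer are made up for illustration.
candidates = {
    "command-a": "Paris.",
    "qwen3": "The capital of France is Paris.",
    "deepseek-v3": "Paris is the capital.",
}

# Placeholder scorer (word count); swap in a real quality metric in practice.
winner, text = best_of_n(candidates, score=lambda s: float(len(s.split())))
print(winner, "->", text)
```

Keeping the scorer as a parameter makes it straightforward to compare selection strategies over the same set of teacher completions.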

Strengths

Contains completions for a defined subset of 10,000 prompts.
Data includes outputs from 5 distinct teacher models, providing comparative breadth.
Prompts are available in 9 languages, supporting multilingual training and evaluation.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
The listing defers to an external page for the full description, so complete details require a click-through.
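
Because field semantics are undocumented, a first step after download is often to tabulate which fields appear and what value types they hold. The sketch below does this for rows already loaded as Python dictionaries. The summarize_fields helper and the field names in the sample rows ("prompt", "language", "completion") are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def summarize_fields(rows: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each field name to the sorted Python type names seen for it."""
    seen = defaultdict(set)
    for row in rows:
        for key, value in row.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical rows: these field names are assumptions, not the real columns.
sample_rows = [
    {"prompt": "Hello", "language": "fr", "completion": "Bonjour"},
    {"prompt": "Hi", "language": "de", "completion": None},
]

summary = summarize_fields(sample_rows)
for field, types in summary.items():
    print(field, types)
```

Fields that report more than one type, such as a string column that sometimes holds None, are worth checking before training or evaluation.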

Provenance

Source: CohereLabs via Hugging Face.
Collection Method: Completions were sampled from teacher models, some via TogetherAI API and others via locally hosted instances.
Time Range: Creation date is not specified; published 2025-09-30 and last updated 2025-10-02.
Freshness: Last updated 2025-10-02 05:32:31; check the source page for newer revisions.
Geography: Spatial coverage is not specified.

License is unknown, which may restrict usage.
