Description
A 10,000-prompt subset of the UFB dataset, translated into 9 languages, with completions generated by 5 different teacher models and 2 aggregations. Completions were sampled from models including GEMMA3-27B-IT, kimik2, qwen3, deepseek-v3, and command-a. The dataset was created by CohereLabs and last updated on October 2, 2025.
Use Cases
Training or fine-tuning language models based on synthetic multilingual completions.
Evaluating model performance and response quality across different teacher models.
Researching best-of-N sampling and aggregation methods for text generation.
Analyzing the stylistic or qualitative differences in outputs from various model architectures.
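The best-of-N use case can be illustrated with a minimal sketch. It assumes each prompt comes with one completion per teacher model and that some scoring function is available. The teacher names, the example texts, and the word-count scorer below are hypothetical placeholders, not fields or methods taken from the dataset; in practice the scorer would be a reward model or another quality metric.

```python
from typing import Callable, Dict


def best_of_n(completions: Dict[str, str], score: Callable[[str], float]) -> str:
    """Return the name of the teacher whose completion scores highest."""
    if not completions:
        raise ValueError("need at least one completion")
    # max() over the dict iterates its keys; each key is ranked by the
    # score of the completion it maps to.
    return max(completions, key=lambda name: score(completions[name]))


def toy_score(text: str) -> float:
    """Placeholder scorer (assumption): longer answers score higher."""
    return float(len(text.split()))


# Hypothetical completions for a single prompt, keyed by teacher name.
candidates = {
    "teacher_a": "Paris.",
    "teacher_b": "The capital of France is Paris.",
    "teacher_c": "I am not sure.",
}

winner = best_of_n(candidates, toy_score)
print(winner, "->", candidates[winner])
```

Swapping `toy_score` for a different scorer changes the selection rule without touching the selection logic, which makes it straightforward to compare aggregation strategies on the same set of completions.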
Strengths
Contains completions for a defined subset of 10,000 prompts.
Data includes outputs from 5 distinct teacher models, providing comparative breadth.
Prompts are available in 9 languages, which supports multilingual training and evaluation.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not stated, which makes it harder to assess suitability before download.
Only a summary is shown here; the full description is on an external page and requires a click-through.
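Because column-level documentation is absent, a first pass after download is usually to list the fields and the value types each one holds. The sketch below assumes the rows have already been loaded as a list of dictionaries. The column names and values in `sample_rows` are invented for illustration and do not reflect the dataset's actual schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def summarize_columns(rows: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each column name to the sorted list of value type names seen in it."""
    seen = defaultdict(set)
    for row in rows:
        for key, value in row.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical rows (assumption): stand-ins for whatever the download contains.
sample_rows = [
    {"prompt": "Hello", "language": "en", "completion": "Hi there"},
    {"prompt": "Bonjour", "language": "fr", "completion": None},
]

summary = summarize_columns(sample_rows)
for column, types in summary.items():
    print(f"{column}: {', '.join(types)}")
```

A column that reports more than one type, such as a mix of strings and `None`, is a sign that some rows have missing values and may need filtering before training.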
Provenance
Source
CohereLabs via Hugging Face.
Collection Method
Completions were sampled from teacher models, some via the TogetherAI API and others via locally hosted instances.
Time Range
Creation date is not specified; last update was October 2025.
Freshness
Last updated 2025-10-02 05:32:31; check the source page for later revisions before relying on this date.