Description
An aligned multilingual dataset for supervised fine-tuning, containing multiple-choice questions across three configurations. The dataset was created by cs-552-2026-claude-bots and last updated on May 31, 2026. It comprises 148,497 total base questions, with 133,647 rows for training and 14,850 for evaluation.
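The split sizes quoted above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the three figures from the description; the constant and function names are illustrative and are not part of the dataset.

```python
# Row counts quoted in the description.
TOTAL_BASE_QUESTIONS = 148_497
TRAIN_ROWS = 133_647
EVAL_ROWS = 14_850


def split_fractions(train_rows, eval_rows):
    """Return the (train, eval) shares of the combined row count."""
    total = train_rows + eval_rows
    return train_rows / total, eval_rows / total


# The two splits should add up to the stated total.
assert TRAIN_ROWS + EVAL_ROWS == TOTAL_BASE_QUESTIONS

train_frac, eval_frac = split_fractions(TRAIN_ROWS, EVAL_ROWS)
print(f"train: {train_frac:.1%}, eval: {eval_frac:.1%}")
# prints "train: 90.0%, eval: 10.0%"
```

The figures are consistent with an approximately 90/10 train/evaluation split, although the description does not state how the split was drawn.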
Use Cases
Supervised fine-tuning of language models on multiple-choice questions.
Evaluating model reasoning across languages using the 'with_same_language_reasoning' and 'with_engish_language_reasoning' configurations.
Benchmarking multilingual knowledge and question-answering performance using the aggregated source datasets.
Studying the effect of answer presentation format on model performance using the 'only_boxed_answer' configuration.
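For the answer-format use case, a scoring script has to pull the final choice out of each model response. The sketch below assumes, based only on the configuration name 'only_boxed_answer', that answers are wrapped in a LaTeX-style `\boxed{...}` marker. The dataset card does not document the actual format, and the helper names `extract_boxed_answer` and `accuracy` are hypothetical.

```python
import re

# Assumption: the final answer appears as \boxed{...} holding a short,
# brace-free label such as "B". The dataset card does not confirm this.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed_answer(text):
    # Return the content of the last \boxed{...} in text, or None if absent.
    matches = BOXED_PATTERN.findall(text)
    return matches[-1].strip() if matches else None


def accuracy(responses, gold_labels):
    # Fraction of responses whose boxed answer equals the gold label.
    if len(responses) != len(gold_labels):
        raise ValueError("responses and gold_labels must have equal length")
    if not responses:
        raise ValueError("cannot score an empty list")
    correct = sum(
        extract_boxed_answer(response) == gold
        for response, gold in zip(responses, gold_labels)
    )
    return correct / len(responses)


responses = [
    r"Paris is the capital, so the answer is \boxed{B}.",
    r"\boxed{A}",
    "I am not sure.",  # no boxed answer, counted as incorrect
]
print(accuracy(responses, ["B", "C", "D"]))
```

Taking the last match rather than the first is a design choice: a response that reasons before answering may mention intermediate boxed values, and the final one is usually the intended answer.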
Strengths
Provides a substantial training set of 133,647 rows for supervised fine-tuning.
Offers three distinct configurations for experimentation, allowing comparison of reasoning and presentation formats.
Aggregates data from multiple established sources, including CohereLabs/Global-MMLU (71,258 rows) and nayeon212/BLEnD (60,203 rows).
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
The specific languages covered and the nature of the 'alignment' are not detailed in the provided description.
Provenance
Source
Aggregated from CohereLabs/Global-MMLU, nayeon212/BLEnD, QCRI/MultiNativQA, and openai/MMMLU.
Collection Method
Likely involves processing and aligning multiple-choice questions from the source datasets into a consistent format for SFT.
Freshness
Last updated 2026-05-31 09:24:35; freshness should be verified.
License is unknown, which may restrict commercial or research use.