Name: Multilingual Dataset: Aligned Multiple-Choice SFT Data with 133,647 Training Rows
Creator: cs-552-2026-claude-bots
Published: 2026-05-31T09:19:13
Keywords: Text, Multilingual, Multiple Choice, Multilingual Nlp, Knowledge Evaluation

Description

An aligned multilingual dataset for supervised fine-tuning (SFT), containing multiple-choice questions in three configurations. The dataset was created by cs-552-2026-claude-bots and last updated on May 31, 2026. It comprises 148,497 base questions in total, split into 133,647 training rows and 14,850 evaluation rows (roughly a 90/10 split).
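The split figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this description; the variable names are illustrative and are not fields of the dataset.

```python
# Counts quoted in the description above.
TOTAL_BASE_QUESTIONS = 148_497
TRAIN_ROWS = 133_647
EVAL_ROWS = 14_850

# The two splits should account for every base question.
assert TRAIN_ROWS + EVAL_ROWS == TOTAL_BASE_QUESTIONS

# Share of each split, as a fraction of the total.
train_share = TRAIN_ROWS / TOTAL_BASE_QUESTIONS
eval_share = EVAL_ROWS / TOTAL_BASE_QUESTIONS

print(f"train: {train_share:.2%}, eval: {eval_share:.2%}")
# prints "train: 90.00%, eval: 10.00%"
```

The two splits sum exactly to the stated total, which is consistent with a 90/10 partition of the base questions.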

Use Cases

Supervised fine-tuning of language models on questions in the dataset's multiple-choice format.
Evaluating model reasoning across languages using the 'with_same_language_reasoning' and 'with_engish_language_reasoning' configurations.
Benchmarking multilingual knowledge and question-answering performance using the aggregated source datasets.
Studying the effect of answer presentation format on model performance using the 'only_boxed_answer' configuration.
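For the answer-format use case, scoring usually comes down to pulling the final choice out of a model response. The following is a minimal sketch, assuming that answers are written as `\boxed{X}` with a short option label such as `A` inside the braces. That format is inferred from the configuration name 'only_boxed_answer' and is not confirmed by the dataset description; the function names `extract_boxed_answer` and `accuracy` are hypothetical helpers, not part of the dataset.

```python
import re
from typing import Optional, Sequence

# Assumption: the final answer appears as \boxed{X}, where X is a short
# option label (for example "A"). Nested braces are not handled.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]+)\}")


def extract_boxed_answer(response: str) -> Optional[str]:
    r"""Return the content of the last \boxed{...} in a response, or None."""
    matches = BOXED_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def accuracy(responses: Sequence[str], gold_labels: Sequence[str]) -> float:
    """Fraction of responses whose extracted answer equals the gold label."""
    if len(responses) != len(gold_labels):
        raise ValueError("responses and gold_labels must have the same length")
    if not responses:
        return 0.0
    correct = sum(
        extract_boxed_answer(response) == gold
        for response, gold in zip(responses, gold_labels)
    )
    return correct / len(responses)


# Hypothetical model outputs, one with reasoning and one answer-only.
outputs = [r"Paris is the capital, so \boxed{B}", r"\boxed{C}"]
print(accuracy(outputs, ["B", "A"]))
```

Taking the last match rather than the first is a design choice: a response that includes reasoning may mention a boxed option before settling on its final answer. A response with no boxed answer counts as incorrect.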

Strengths

Provides a substantial training set of 133,647 rows for supervised fine-tuning.
Offers three distinct configurations for experimentation, allowing comparison of reasoning and presentation formats.
Aggregates data from multiple established sources, including CohereLabs/Global-MMLU (71,258 rows) and nayeon212/BLEnD (60,203 rows).

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
The specific languages covered and the nature of the 'alignment' are not detailed in the provided description.

Provenance

Source: Aggregated from CohereLabs/Global-MMLU, nayeon212/BLEnD, QCRI/MultiNativQA, and openai/MMMLU.
Collection Method: Likely involves processing and aligning multiple-choice questions from the source datasets into a consistent format for SFT.
Freshness: Last updated 2026-05-31 09:24:35; freshness should be verified.

The license is unknown, so permission for commercial or research use cannot be assumed.
