MCIF is a human-annotated benchmark for evaluating instruction-following across speech, vision, and text modalities in four languages. The dataset was created by FBK-MT and was last updated in February 2026.
Use Cases
- Benchmark multimodal large language model (MLLM) performance on crosslingual instruction-following tasks using speech, text, and image inputs.
- Evaluate model understanding of long-form scientific content in English, German, Italian, and Chinese.
- Assess multimodal reasoning capabilities by requiring models to process and integrate information from audio transcripts, visual data, and textual instructions.
Strengths
- Covers three core modalities: speech, vision, and text.
- Spans four diverse languages: English, German, Italian, and Chinese.
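One way to organise an evaluation over this coverage is a modality-by-language grid. The sketch below is a minimal illustration, not part of MCIF itself: it assumes each modality is scored separately per language, and the names `evaluation_grid`, `summarize`, and `score_fn` are hypothetical. The scoring function is supplied by the caller, so no real results are implied.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Modalities and languages as listed in this card.
MODALITIES = ("speech", "vision", "text")
# ISO 639-1 codes for English, German, Italian, Chinese (labeling choice is an assumption).
LANGUAGES = ("en", "de", "it", "zh")


def evaluation_grid() -> List[Tuple[str, str]]:
    """Return every (modality, language) pair to evaluate."""
    return list(product(MODALITIES, LANGUAGES))


def summarize(score_fn: Callable[[str, str], float]) -> Dict[str, Dict[str, float]]:
    """Score each grid cell with a caller-supplied function and average
    the scores per modality and per language."""
    scores = {cell: score_fn(*cell) for cell in evaluation_grid()}
    by_modality = {m: mean(scores[(m, l)] for l in LANGUAGES) for m in MODALITIES}
    by_language = {l: mean(scores[(m, l)] for m in MODALITIES) for l in LANGUAGES}
    return {"by_modality": by_modality, "by_language": by_language}
```

Here `score_fn(modality, language)` stands in for whatever metric a given evaluation uses. Reporting averages along both axes makes it easier to see whether a weakness is tied to a modality or to a language.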
Limitations
- Dataset size, row count, and file formats are not specified.
- Limited to content from scientific talks, which may not represent general conversational or instructional data.
Provenance
- Source: FBK-MT via Hugging Face.
- Collection Method: Human-annotated, based on scientific talks.
- Time Range: Not specified.
- Freshness: Last updated in February 2026.
- Geography: Not specified.