This comprehension dataset covers 74 spoken languages and American Sign Language (ASL). It extends the Belebele text dataset and was built by aligning the Belebele, Flores200, and Fleurs datasets. Facebook created the dataset, and it was last updated on December 17, 2024.
Use Cases
- Training speech comprehension models in any of the 74 covered spoken languages.
- Evaluating American Sign Language (ASL) processing systems on the sign language portion of the data.
- Benchmarking multilingual model performance across diverse language families.
- Developing multimodal AI that integrates text, speech, and sign language understanding.
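For the benchmarking use case, a per-language breakdown is usually more informative than a single pooled score. The sketch below is only an illustration under stated assumptions: the keys `language`, `predicted`, and `correct` are hypothetical names for model outputs joined with gold answers, not documented fields of this dataset, and the sample rows are toy values.

```python
from collections import defaultdict


def accuracy_by_language(rows):
    """Return {language: accuracy} for an iterable of result rows.

    Each row is a dict with the hypothetical keys 'language',
    'predicted', and 'correct' (assumed names, not the dataset's schema).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        lang = row["language"]
        totals[lang] += 1
        if row["predicted"] == row["correct"]:
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Toy results, invented for illustration only.
sample = [
    {"language": "eng", "predicted": 1, "correct": 1},
    {"language": "eng", "predicted": 2, "correct": 3},
    {"language": "fra", "predicted": 4, "correct": 4},
]

scores = accuracy_by_language(sample)
# Macro average: every language counts equally, however many rows it has.
macro_average = sum(scores.values()) / len(scores)
```

A macro average is used here so that languages with many rows do not dominate the summary; a pooled (micro) average would be the alternative if every question should count equally.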
Strengths
- Covers 74 spoken languages, providing broad linguistic diversity.
- Includes American Sign Language (ASL), enabling multimodal research.
- Extends the established Belebele text dataset, suggesting a foundation in reading comprehension tasks.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not stated, which makes it hard to judge whether the dataset is large enough for a given task.
- The description references an external page for full details, requiring an extra step for complete information.
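Because neither column documentation nor a row count is provided, a first pass after download typically means profiling the records directly. The following is a minimal sketch assuming the data has already been loaded as a sequence of dictionaries; the loading step is omitted, and the field names `passage_id`, `audio_path`, and `answer` are hypothetical placeholders, not the dataset's real columns.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Count rows and summarize the value types seen in each field.

    Only explicit None values are counted as missing; a key that is
    absent from a record is not counted.
    """
    type_counts = defaultdict(Counter)
    missing = Counter()
    row_count = 0
    for record in records:
        row_count += 1
        for name, value in record.items():
            if value is None:
                missing[name] += 1
            else:
                type_counts[name][type(value).__name__] += 1
    columns = {
        name: {"types": dict(type_counts[name]), "missing": missing[name]}
        for name in sorted(set(type_counts) | set(missing))
    }
    return {"row_count": row_count, "columns": columns}


# Toy records with hypothetical field names, for illustration only.
records = [
    {"passage_id": 1, "audio_path": "a.wav", "answer": 2},
    {"passage_id": 2, "audio_path": None, "answer": 4},
]

profile = profile_records(records)
```

The resulting `profile` gives the row count plus, for each field, the Python types observed and the number of `None` values, which is a reasonable starting point for inferring what each field means.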
Provenance
- Source: Facebook
- Collection Method: Built by aligning the Belebele, Flores200, and Fleurs datasets.
- Freshness: Last updated 2024-12-17 13:39:10; confirm that this is still current before relying on it.