2M-Belebele: Highly-Multilingual Speech and ASL Comprehension Dataset | DataSalon

Speech & Audio

Name: 2M-Belebele: Highly-Multilingual Speech and ASL Comprehension Dataset
Creator: facebook
Published: 2024-12-16T08:45:30
Keywords: Language Dataset, Comprehension, Multilingual Speech, American Sign Language, Multilingual, Audio, Multimodal

Available on 1 platform

Description

This comprehension dataset covers 74 spoken languages and American Sign Language (ASL). It extends the Belebele text dataset and was built by aligning the Belebele, Flores200, and Fleurs datasets. It was created by Facebook and last updated on December 17, 2024.

Use Cases

Training speech comprehension models across the 74 spoken languages covered.
Evaluating American Sign Language (ASL) processing systems using the dataset's ASL coverage.
Benchmarking multilingual model performance across diverse language families.
Developing multimodal AI that integrates text, speech, and sign language understanding.

Strengths

Covers 74 spoken languages, providing broad linguistic diversity.
Includes American Sign Language (ASL), enabling multimodal research.
Extends the established Belebele text dataset, suggesting a foundation in reading comprehension tasks.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
The description references an external page for full details, requiring an extra step for complete information.
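
Since column-level documentation is absent and the row count is unknown, one way to infer field semantics after download is to summarize each column's observed value types, missing-value count, and a sample value. The following is a minimal sketch, assuming the downloaded rows are available as plain Python dictionaries. The helper name `summarize_columns` and every field name in `sample_rows` are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_columns(rows):
    """Summarize each column: observed type names, missing count, first example."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    summary = defaultdict(lambda: {"types": set(), "missing": 0, "example": None})
    for row in rows:
        for col in columns:
            value = row.get(col)  # absent keys are treated as missing
            info = summary[col]
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if info["example"] is None:
                info["example"] = value

    return {
        col: {
            "types": sorted(summary[col]["types"]),
            "missing": summary[col]["missing"],
            "example": summary[col]["example"],
        }
        for col in columns
    }


# Hypothetical rows for illustration only; real field names must be
# checked against the downloaded files.
sample_rows = [
    {"language": "eng", "question": "q1", "audio_path": None},
    {"language": "fra", "question": "q2", "audio_path": "clip_0001.wav", "answer_index": 2},
]

report = summarize_columns(sample_rows)
print("rows:", len(sample_rows))
for column, info in report.items():
    print(column, info)
```

Running the same summary on the real rows gives a quick first view of which fields hold text, audio references, or labels before consulting the external page for full details.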

Provenance

Source: Facebook
Collection Method: Built from aligning the Belebele, Flores200, and Fleurs datasets.
Freshness: Last updated 2024-12-17 13:39:10; verify against the source before use.

Quality Score

D37

Community

491 downloads

13 likes

0 views

Dataset Info

Author: facebook
Created: Dec 16, 2024
Updated: Dec 17, 2024
Last synced: Jun 8, 2026
