DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Amharic BDU-Speech: 32,901 Paired Audio and Transcriptions | DataSalon

Speech & Audio

Name: Amharic BDU-Speech: 32,901 Paired Audio and Transcriptions
Creator: chappM
Published: 2026-03-06T15:53:29
Keywords: size categories: 10K<n<100K, modality: audio, modality: text, library: mlcroissant, arXiv: 2503.18485, library: datasets, license: CC BY 4.0, region: US, language: am, task categories: automatic speech recognition, Arrow

by chappM·Updated 4mo ago

Available on 1 platform

Description

This collection contains 32,901 paired Amharic speech audio files and transcriptions, processed from Yohannes A. Ejigu's BDU-speech dataset. Last updated in March 2026, it provides mono audio recordings structured for automatic speech recognition (ASR) research and model training.

Use Cases

Fine-tuning transformer models like Whisper using the 'audio' and 'sentence' columns
Developing acoustic models for the Amharic language
Benchmarking ASR performance on Ethiopian linguistic data
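
For the benchmarking use case, word error rate (WER) is a common ASR metric. The sketch below is a minimal, standard-library-only illustration, assuming whitespace tokenization and no text normalization; whether that suits Amharic transcriptions in this dataset is not stated in the listing, so treat it as a starting point. The function name `word_error_rate` is hypothetical and not part of any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no punctuation stripping or script normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for a 'sentence' value and a model output.
print(word_error_rate("a b c d", "a x c"))
```

Here the hypothesis has one substituted word and one missing word against a four-word reference. In practice the reference would come from the 'sentence' column and the hypothesis from the model under test.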

Strengths

32,901 training records
CC BY 4.0 open license
Processed from the established BDU-speech source

Limitations

Variable sampling rates across files require normalization
Limited metadata regarding speaker demographics or recording environments

Provenance

Source: Yohannes A. Ejigu (BDU-speech dataset)
Freshness: Last updated March 2026
Geography: Ethiopia

Note: Audio files are decoded as mono, but sampling rates vary across the dataset; resampling to a consistent rate (e.g., 16 kHz) is recommended before training.
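
To illustrate what resampling to a single rate involves, here is a minimal sketch using only the Python standard library. It uses naive linear interpolation with no anti-aliasing filter, so it is meant to show the idea and is not a substitute for a proper resampler; the function name `resample_linear` and the 16000 default are assumptions for illustration. With the Hugging Face `datasets` library, casting the audio column via `cast_column("audio", Audio(sampling_rate=16000))` is the more usual route.

```python
def resample_linear(samples, orig_rate, target_rate=16000):
    """Resample a mono signal by linear interpolation (illustrative only).

    Assumption: `samples` is a sequence of floats for a single channel.
    No low-pass filtering is applied, so downsampling may alias.
    """
    if orig_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_rate == target_rate or len(samples) < 2:
        return [float(s) for s in samples]

    n_out = max(1, round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * step
        i = int(pos)
        if i >= last:
            # Past the final input sample: hold the last value.
            out.append(float(samples[last]))
        else:
            frac = pos - i
            out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out


# Toy signals with made-up rates: halve the rate, then double it.
print(resample_linear([0.0, 1.0, 2.0, 3.0], orig_rate=4, target_rate=2))
print(resample_linear([0.0, 2.0], orig_rate=1, target_rate=2))
```

Applying one target rate to every file gives the model inputs on a consistent time scale, which is the normalization the note above calls for.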

Quality Score

D37

Community

79 downloads

2 likes

0 views

Dataset Info

Author: chappM
Created: Mar 6, 2026
Updated: Mar 6, 2026
