Description

CORAA v1.1 contains 290.77 hours of Brazilian Portuguese audio with transcriptions, segmented into over 400,000 audio files. The dataset is compiled from five distinct speech projects, including academic recordings and TEDx talks, and is validated for automatic speech recognition (ASR) research.

Use Cases

Train an automatic speech recognition (ASR) model on transcribed Brazilian Portuguese audio.
Fine-tune a speech-to-text model using segmented audio files from diverse sources like TEDx talks and academic projects.
Analyze linguistic patterns or dialectal variations across the five constituent speech projects within the corpus.
Benchmark ASR system performance on validated Brazilian Portuguese audio segments.
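
For the benchmarking use case, ASR output is commonly scored with word error rate (WER): the word-level edit distance between the reference transcription and the model output, divided by the number of reference words. The following is a minimal sketch in plain Python, not the scoring procedure of the CORAA authors. The function names, the lowercase and whitespace normalization, and the example sentences are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def _edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def _tokenize(text: str) -> List[str]:
    # Assumption: lowercase + whitespace split; real evaluations often
    # also strip punctuation or normalize numerals.
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for a single utterance."""
    ref = _tokenize(reference)
    if not ref:
        raise ValueError("reference transcription is empty")
    return _edit_distance(ref, _tokenize(hypothesis)) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = _tokenize(reference)
        errors += _edit_distance(ref, _tokenize(hypothesis))
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, not taken from the corpus.
    print(word_error_rate("o gato dorme no sofá", "o gato dormiu sofá"))
```

Summing errors and reference words over the whole test set, as corpus_wer does, keeps very short segments from dominating the score the way a per-utterance average would.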

Strengths

Substantial volume of 290.77 hours of validated speech data.
Diverse audio sources from five distinct Brazilian Portuguese speech projects.
Over 400,000 segmented audio files for granular analysis.
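
The two headline figures together bound the mean segment length, which matters when choosing batch sizes or maximum input lengths. A rough calculation, assuming exactly 400,000 files (the description only says "over 400,000", so the true mean is somewhat lower):

```python
# Figures taken from the description above.
TOTAL_HOURS = 290.77
MIN_FILES = 400_000  # lower bound on the file count ("over 400,000")

total_seconds = TOTAL_HOURS * 3600
# Dividing by a lower bound on the file count gives an upper bound on the mean.
max_mean_segment_seconds = total_seconds / MIN_FILES

print(f"Mean segment length is at most about {max_mean_segment_seconds:.2f} s")
```

Under that assumption the segments average no more than roughly 2.6 seconds each, so the corpus consists mostly of short utterances.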

Limitations

Limited to Brazilian Portuguese; models trained on it may not generalize to other varieties of Portuguese or to other languages.
Audio quality and recording conditions may vary across the five source projects.
Details of the dataset composition and validation procedure are not given here and must be looked up in the full external description.

Provenance

Source: Compilation of five projects: ALIP, C-ORAL Brazil, NURC-Recife, SP-2010, and TEDx talks in Portuguese.
Collection Method: Audio collection and transcription from constituent speech projects, with validation.
Freshness: Last updated in December 2022.
Geography: Brazil

The full dataset description, including detailed validation methods and license information, is available only on the external Hugging Face dataset page.

License: unknown
Region: us
arXiv: 2110.15731

Brazilian Portuguese Speech Recognition Corpus with 290 Hours
