Description
Kennslurómur is a collection of audio recordings and corresponding text from instructional lectures recorded in courses at Reykjavik University and the University of Iceland. The dataset is intended for training speech recognition models. Lecturers provided the recordings, which were first transcribed by a speech recognizer and then proofread by students and a professional proofreader.
Use Cases
Train an Icelandic speech recognition model using the audio recordings and corresponding text transcripts.
Develop a forced alignment tool to map text transcripts to specific timestamps in the lecture audio.
Analyze lecture content and vocabulary for linguistic research on academic Icelandic.
Fine-tune a language model on the proofread text corpus for domain-specific natural language processing tasks.
Strengths
Recordings come from two major Icelandic universities, so they capture academic Icelandic as it is spoken in lectures.
Transcripts were corrected by both students and a professional proofreader rather than left as raw speech recognizer output.
Limitations
The dataset size, number of rows, and total audio duration are not reported, which makes it hard to judge whether the corpus is large enough for model training.
Content is restricted to academic lectures, so models trained on it may not generalize to other domains or to colloquial speech.
Recordings may come from a limited number of lecturers, so models trained on them may be biased toward those speakers.
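Because the scale of the corpus is not given here, one way to gauge it after download is to count the audio files and sum their durations. The following is a minimal sketch, not part of the dataset's tooling. It assumes the audio is stored as WAV files somewhere under a single directory. The file format, the directory layout, and the name `summarize_wav_dir` are all assumptions made for illustration.

```python
import wave
from pathlib import Path


def summarize_wav_dir(root):
    """Return (file_count, total_seconds) for all .wav files under root.

    Assumption: the audio is distributed as WAV. If the dataset uses
    another format (e.g. FLAC or MP3), a different reader is needed.
    """
    count = 0
    seconds = 0.0
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration of one file = number of frames / frames per second.
            seconds += wav.getnframes() / wav.getframerate()
        count += 1
    return count, seconds


if __name__ == "__main__":
    # "kennsluromur_audio" is a hypothetical local path, not a documented one.
    n_files, total = summarize_wav_dir("kennsluromur_audio")
    print(f"{n_files} files, {total / 3600:.2f} hours of audio")
```

Grouping the same totals by lecturer, if speaker identifiers are available, would also give a rough picture of how concentrated the recordings are among a few speakers.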
Provenance
Source
Reykjavik University and the University of Iceland.
Collection Method
Lectures were recorded, transcribed via speech recognition, and the text was corrected by students and a professional proofreader.
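The size of the correction step can be quantified with word error rate (WER), treating the proofread text as the reference and the raw recognizer output as the hypothesis. The sketch below is a generic WER computation based on word-level edit distance. It is not the procedure used by the dataset's creators, and it assumes both versions of a transcript are available, which this description does not confirm. The function name and the sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example strings, not taken from the dataset.
    proofread = "góðan daginn og velkomin"
    recognizer_output = "góðan dag og velkomin"
    print(word_error_rate(proofread, recognizer_output))  # 0.25
```

In practice, both strings would usually be normalized first (lowercasing, stripping punctuation) so that formatting differences are not counted as recognition errors.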
Freshness
Last updated on 2022-08-22.
Geography
Iceland.
The full description and specific details are available only on the linked dataset page. License information is unknown.