A 2021 collection of Polish-language text samples prepared for punctuation restoration in Automatic Speech Recognition (ASR) workflows. The dataset pairs unpunctuated transcriptions with their punctuated versions to support the training of sequence labeling models.
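The pairing of unpunctuated and punctuated text maps naturally onto per-token labels. The following is a minimal sketch of that conversion, not the dataset's own preprocessing: the label names (`COMMA`, `PERIOD`, `QUESTION`, `O`), the function `to_labels`, and the choice to lowercase tokens are all assumptions made for illustration.

```python
# Hypothetical label set; the dataset's actual tag inventory may differ.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_labels(punctuated: str):
    """Turn a punctuated sentence into (tokens, labels) for sequence labeling.

    Each token receives the label of the punctuation mark that directly
    follows it, or "O" when no mark follows.
    """
    tokens, labels = [], []
    for raw in punctuated.split():
        # Only the trailing character is inspected for a punctuation mark.
        mark = raw[-1] if raw[-1] in PUNCT else ""
        # Assumption: the unpunctuated side is lowercased, as raw ASR output often is.
        word = raw.rstrip("".join(PUNCT)).lower()
        if not word:
            continue  # skip tokens made only of punctuation
        tokens.append(word)
        labels.append(PUNCT.get(mark, "O"))
    return tokens, labels


tokens, labels = to_labels("Dzień dobry, jak się masz?")
```

Here `tokens` holds the unpunctuated words and `labels` holds one tag per word, which is the input/target shape most token classification models expect.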
Use Cases
- Train a sequence labeling model to predict punctuation marks for unpunctuated Polish text.
- Develop ASR post-processing pipelines to improve the readability of raw speech transcripts.
- Fine-tune transformer models to handle Polish-specific syntactic structures for punctuation recovery.
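In an ASR post-processing pipeline, the labels predicted by such a model still have to be merged back into readable text. A minimal sketch of that final step is shown here, assuming a hypothetical label set (`O`, `COMMA`, `PERIOD`, `QUESTION`) and a hypothetical helper `restore`; the model that produces the labels is out of scope.

```python
# Hypothetical mapping from predicted labels back to punctuation marks.
LABEL_TO_MARK = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}
SENTENCE_END = {"PERIOD", "QUESTION"}


def restore(tokens, labels):
    """Rebuild readable text from ASR tokens and per-token punctuation labels."""
    out = []
    capitalize = True  # the first word of the transcript starts a sentence
    for tok, lab in zip(tokens, labels):
        word = tok[:1].upper() + tok[1:] if capitalize else tok
        # Unknown labels are treated as "no punctuation".
        out.append(word + LABEL_TO_MARK.get(lab, ""))
        # Capitalize the next word only after sentence-final punctuation.
        capitalize = lab in SENTENCE_END
    return " ".join(out)


text = restore(
    ["dzień", "dobry", "jak", "się", "masz"],
    ["O", "COMMA", "O", "O", "QUESTION"],
)
```

Capitalization after sentence-final marks is a design choice of this sketch; proper nouns and abbreviations would need separate handling.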
Strengths
- Focuses exclusively on the Polish language and its unique punctuation requirements.
- Targets the unpunctuated output of Automatic Speech Recognition (ASR) systems.
- Includes text data from 2021, reflecting contemporary language usage.