Description
47,140 Sinhala text pairs for training spelling correction models, split into 37,712 training and 9,428 test samples. The dataset, created by SPEAK-PP, contains dyslexic/noisy sentences paired with their clean, corrected versions. It was last updated on June 8, 2026.
Use Cases
Train Sinhala spelling correction models on the noisy-clean sentence pairs.
Benchmark model performance on Sinhala text using the predefined train/test split.
Generate synthetic noisy text for data augmentation, modeled on the error types in the dataset, such as dyslexia-like mistakes.
Study typographical and dyslexic error patterns in Sinhala text.
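For the benchmarking use case, one common way to score a correction model is exact-match accuracy plus character error rate (CER). The sketch below is a minimal, standard-library illustration, not a metric prescribed by the dataset. The function names and the Latin placeholder strings are assumptions made for the example. Distances are counted in Unicode code points, so Sinhala combining marks each count as a separate character.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def evaluate(predictions, references):
    """Return (exact-match accuracy, character error rate) for paired lists."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(p == r for p, r in zip(predictions, references))
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    ref_chars = sum(len(r) for r in references)
    return exact / len(references), edits / ref_chars


# Placeholder Latin strings stand in for Sinhala sentences (hypothetical data).
references = ["the cat sat", "hello world"]
predictions = ["the cat sat", "helo wrold"]
accuracy, cer = evaluate(predictions, references)
print(f"exact match: {accuracy:.2f}, CER: {cer:.3f}")
```

Scoring the uncorrected noisy sentences against their clean versions with the same function gives a "no correction" baseline to compare a model against.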
Strengths
47,140 total text pairs provide a substantial corpus for model training.
A defined split of 37,712 training and 9,428 test samples facilitates standard machine learning workflows.
The dataset explicitly includes dyslexia-like mistakes alongside common typos, which suggests a focus on diverse error types.
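The split sizes quoted above can be checked with simple arithmetic. The snippet below is a small sanity check using only the numbers from the description; the constant names are chosen for illustration.

```python
TRAIN_PAIRS = 37_712
TEST_PAIRS = 9_428
TOTAL_PAIRS = 47_140

# The two splits should account for every pair in the dataset.
assert TRAIN_PAIRS + TEST_PAIRS == TOTAL_PAIRS

train_fraction = TRAIN_PAIRS / TOTAL_PAIRS
test_fraction = TEST_PAIRS / TOTAL_PAIRS
print(f"train: {train_fraction:.1%}, test: {test_fraction:.1%}")
# prints "train: 80.0%, test: 20.0%"
```

The counts correspond to an 80/20 train/test split.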
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row counts are taken from the description (47,140 pairs) and should be confirmed after download.
Last updated 2026-06-08 15:36:18; freshness should be verified.
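Because the columns are undocumented, a quick profile of the downloaded file can help work out which field holds the noisy text and which the corrected text. The sketch below assumes the data can be read as CSV with a header row, which the listing does not confirm. The column names `col_a` and `col_b`, the sample rows, and both helper functions are hypothetical.

```python
import csv
import io


def profile_columns(rows):
    """Summarise each column: non-empty count and mean length in code points."""
    totals = {}
    for row in rows:
        for name, value in row.items():
            stats = totals.setdefault(name, {"non_empty": 0, "chars": 0})
            if value:
                stats["non_empty"] += 1
                stats["chars"] += len(value)
    return {
        name: {
            "non_empty": s["non_empty"],
            "mean_length": s["chars"] / s["non_empty"] if s["non_empty"] else 0.0,
        }
        for name, s in totals.items()
    }


def differing_rows(rows, col_a, col_b):
    """Count rows where the two columns hold different text."""
    return sum(1 for row in rows if row[col_a] != row[col_b])


# Hypothetical sample: the real file format and column names are not documented.
sample = io.StringIO(
    "col_a,col_b\n"
    "helo wrold,hello world\n"
    "the cat sat,the cat sat\n"
)
rows = list(csv.DictReader(sample))
profile = profile_columns(rows)
changed = differing_rows(rows, "col_a", "col_b")
print(profile)
print(f"rows where the columns differ: {changed}")
```

For real data, replace the in-memory sample with the downloaded file opened with `encoding="utf-8"`, and confirm any guess about column roles by reading a few rows manually.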
Provenance
Source
SPEAK-PP
Freshness
Last updated 2026-06-08 15:36:18.
License
Unknown; terms of use must be verified before application.