V3 V1 V2 Code Mixed Syntheic Correct Noisy Pairs: Sinhala Spelling Correction Dataset

Name: V3 V1 V2 Code Mixed Syntheic Correct Noisy Pairs: Sinhala Spelling Correction Dataset
Creator: SPEAK-PP
Published: 2026-01-29T10:15:17
Keywords: Text Pairs, Text, Noisy Text, Natural Language Processing, Spelling Correction, Sinhala Language

by SPEAK-PP, updated 14d ago

Available on 1 platform

Description

47,140 Sinhala text pairs for training spelling correction models, split into 37,712 training and 9,428 test samples. The dataset, created by SPEAK-PP, contains dyslexic/noisy sentences paired with their clean, corrected versions. It was last updated on June 8, 2026.
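The split sizes quoted above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the three figures from the description; the constant and function names are illustrative and are not part of the dataset.

```python
# Split sizes as stated in the dataset description.
TRAIN_PAIRS = 37_712
TEST_PAIRS = 9_428
TOTAL_PAIRS = 47_140


def split_fractions(train: int, test: int) -> tuple:
    """Return the (train, test) shares of the combined sample count."""
    total = train + test
    return train / total, test / total


# The two splits should add up to the stated total.
assert TRAIN_PAIRS + TEST_PAIRS == TOTAL_PAIRS

train_frac, test_frac = split_fractions(TRAIN_PAIRS, TEST_PAIRS)
print(f"train={train_frac:.2%}, test={test_frac:.2%}")
# prints: train=80.00%, test=20.00%
```

The stated figures are consistent with each other and correspond to an 80/20 train/test split.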

Use Cases

Train spelling correction models on the noisy-clean text pairs.
Benchmark model performance on Sinhala text using the defined train/test split.
Generate synthetic noisy text for data augmentation, modeled on the dataset's error types such as dyslexia-like mistakes.
Study typographical and dyslexic error patterns in Sinhala text.
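For the benchmarking use case, one possible way to score a correction model on the test split is sketched below. It computes exact-match accuracy and a character error rate (CER) from plain lists of strings. The function names and the choice of metrics are assumptions for illustration; the dataset page does not prescribe an evaluation protocol.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(predictions: list, references: list) -> dict:
    """Score model outputs against the clean reference sentences."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    exact = sum(p == r for p, r in zip(predictions, references))
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return {
        "exact_match": exact / len(references),
        "cer": edits / chars if chars else 0.0,
    }


# Placeholder strings stand in for model outputs and clean sentences.
scores = evaluate(["abc", "xyz"], ["abc", "xyy"])
print(scores)
```

Because Sinhala vowel signs are separate combining code points, a CER computed over code points can differ from one computed over user-perceived characters; which unit to use is a design choice to fix before comparing models.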

Strengths

47,140 total text pairs provide a substantial corpus for model training.
A defined split of 37,712 training and 9,428 test samples facilitates standard machine learning workflows.
The dataset explicitly includes dyslexia-like mistakes alongside common typos, which suggests a focus on diverse error types.
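To illustrate how error types like these might be simulated for data augmentation, the sketch below applies random character swaps, drops, and doublings to a clean string. It is a hypothetical noise model, not the procedure SPEAK-PP used to build the dataset; the function name, the three operations, and the `rate` parameter are all assumptions.

```python
import random
from typing import Optional


def add_noise(text: str, rate: float = 0.1, seed: Optional[int] = None) -> str:
    """Return a noisy copy of `text` (hypothetical noise model, for illustration only)."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(("swap", "drop", "double"))
            if op == "swap" and i + 1 < len(chars):
                # Transpose this character with the next one.
                out.extend((chars[i + 1], chars[i]))
                i += 2
                continue
            if op == "drop":
                # Omit this character entirely.
                i += 1
                continue
            if op == "double":
                # Repeat this character once.
                out.extend((chars[i], chars[i]))
                i += 1
                continue
        # No edit chosen (or a swap at the final position): keep the character.
        out.append(chars[i])
        i += 1
    return "".join(out)


clean = "spelling correction"
noisy = add_noise(clean, rate=0.2, seed=0)
print(clean, "->", noisy)
```

Passing a fixed `seed` makes the corruption reproducible, which helps when regenerating the same augmented pairs across experiments.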

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not independently confirmed; the 47,140 figure comes from the description and should be checked after download.
Last updated 2026-06-08 15:36:18; freshness should be verified.

Provenance

Source: SPEAK-PP
Freshness: Last updated 2026-06-08 15:36:18.

License is unknown; terms of use must be verified before application.

Related Topics

Text, Text Pairs, Noisy Text, Natural Language Processing, Spelling Correction, Sinhala Language

Quality Score

D39

Description: 42
Source: 41
Reputation: 39
Access: 26

Community

Downloads: 10
Likes: 1
Views: 0

Dataset Info

Author: SPEAK-PP
Created: Jan 29, 2026
Updated: Jun 8, 2026
Last synced: Jun 15, 2026

