Description

A gold-standard benchmark dataset for sentence alignment across Sinhala, English, and Tamil. The data was crawled from news websites including Army, Hiru, ITN, and Newsfirst, and the aligned sentences were derived from a prior document alignment dataset.
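
As an illustration of how a gold-standard alignment benchmark is typically used, the sketch below scores a set of predicted sentence pairs against gold pairs with precision, recall, and F1. The pair format (source ID, target ID), the example IDs, and the function name `alignment_scores` are hypothetical; the dataset's actual column structure is not stated in this summary.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical pair format: (source sentence ID, target sentence ID).
Pair = Tuple[str, str]


def alignment_scores(gold: Iterable[Pair], predicted: Iterable[Pair]) -> Dict[str, float]:
    """Compare predicted alignments with gold alignments using exact pair matching."""
    gold_set: Set[Pair] = set(gold)
    pred_set: Set[Pair] = set(predicted)

    # A predicted pair counts as correct only if the same pair is in the gold set.
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up IDs for illustration only; these are not taken from the dataset.
gold_pairs = [("si-1", "en-1"), ("si-2", "en-2"), ("si-3", "en-4")]
predicted_pairs = [("si-1", "en-1"), ("si-2", "en-3"), ("si-3", "en-4"), ("si-4", "en-5")]

scores = alignment_scores(gold_pairs, predicted_pairs)
print(scores)
```

Exact pair matching is the simplest scoring choice; benchmarks that allow one-to-many alignments may need a more lenient matching rule.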

Use Cases

Train multilingual sentence embedding models using aligned Sinhala-English-Tamil sentence pairs.
Benchmark machine translation quality for Sinhala, Tamil, and English language pairs.
Analyze cross-lingual news coverage by comparing aligned sentences from sources like Army.lk and Hiru News.

Strengths

Gold-standard benchmark annotations for sentence alignment.
Data sourced from multiple established news websites including Army.lk and Hiru News.
Covers three distinct languages: Sinhala, Tamil, and English.

Limitations

Dataset size, row count, and column structure are not stated in this summary.
Potential geographic and topical bias: the content is limited to news from a few specific Sri Lankan sources.
Relies on annotations from a prior document alignment dataset, which may propagate any errors from that source.

Provenance

Source: News websites (Army.lk, Hirunews.lk, Newsfirst.lk, ITNnews.lk).
Collection Method: Crawled from news websites, with sentence alignments annotated based on a prior document alignment dataset.
Freshness: Last updated on 2024-02-16.
Geography: Likely Sri Lanka, based on the language focus and news sources.

The full description is available on the Hugging Face dataset page; key details like size, format, and license are not provided in this summary.

Task Categories: sentence similarity, translation
Languages: en, si, ta
Region: us

Sinhala Tamil English Sentence Alignments from News Sources
