Manually Aligned News Documents in Sinhala Tamil and English | DataSalon

Category: Multimodal & LLM
Name: Manually Aligned News Documents in Sinhala Tamil and English
Creator: NLPC-UOM
Published: 2022-05-23T03:08:04
Keywords: Task category: Sentence Similarity; Languages: en, si, ta; Modality: text; Region: us

Available on 1 platform

Description

Gold-standard benchmark for document alignment across Sinhala, Tamil, and English. It contains manually annotated document pairs crawled from four Sri Lankan news websites: Army, Hiru, ITN, and Newsfirst.

Use Cases

Train a multilingual document alignment model using manually annotated Sinhala-English-Tamil pairs.
Benchmark cross-lingual retrieval systems on gold-standard Sinhala, Tamil, and English news documents.
Analyze linguistic patterns and translation quality across Sinhala, Tamil, and English news sources.
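
The benchmarking use case above comes down to comparing a system's predicted document pairs against the manually aligned gold pairs. A minimal sketch of that scoring step follows, assuming alignments are represented as (source_id, target_id) tuples. The document IDs and the function name `score_alignment` are hypothetical, since the dataset's actual file layout and column names are not listed here.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source document ID, target document ID)


def score_alignment(gold: Iterable[Pair], predicted: Iterable[Pair]) -> Dict[str, float]:
    """Return precision, recall, and F1 of predicted pairs against gold pairs."""
    gold_set = set(gold)
    pred_set = set(predicted)
    true_pos = len(gold_set & pred_set)  # pairs the system aligned correctly

    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical IDs, for illustration only (not taken from the dataset).
gold_pairs = [("si_001", "en_001"), ("si_002", "en_002"), ("si_003", "en_003")]
predicted_pairs = [("si_001", "en_001"), ("si_002", "en_005")]

scores = score_alignment(gold_pairs, predicted_pairs)
print(scores)
```

Treating the pairs as sets keeps the metric independent of ordering and ignores duplicate predictions. A ranking-based metric could be swapped in if the system under test returns scored candidates rather than hard pairs.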

Strengths

Gold-standard benchmark dataset with manual annotation for alignment quality.
Covers three languages: Sinhala, Tamil, and English.
Data sourced from four distinct Sri Lankan news websites.

Limitations

Unknown dataset size, row count, and file formats.
Geographic and topical bias towards news content from specific Sri Lankan sources.
Potential temporal staleness as the specific crawl dates are not provided.

Provenance

Source: Crawled from Army, Hiru, ITN, and Newsfirst news websites.
Collection Method: Web crawling followed by manual annotation for document alignment.
Freshness: Last updated on 2024-02-16.
Geography: Sri Lanka (based on news source domains).

The full description and data structure are available on the Hugging Face dataset page; column names, sample data, and file formats are not listed on this page.

Quality Score

D34

Community

502 downloads

0 views

Dataset Info

Author: NLPC-UOM
Created: May 23, 2022
Updated: Feb 16, 2024
Last synced: Apr 30, 2026
