Wiki Serbian-Croatian: Combined Wikipedia Corpus in Cyrillic Script

Name: Wiki Serbian-Croatian: Combined Wikipedia Corpus in Cyrillic Script
Creator: RafaelUI
Published: 2026-06-06T11:14:57
Keywords: Croatian Language, Text, Wikipedia, Natural Language Processing, Serbian Language, Text Corpus

Description

This cleaned corpus combines articles from the Serbian and Croatian Wikipedias. Croatian text has been transliterated to Cyrillic script, and wiki markup, infoboxes, and stub articles have been removed. The corpus was compiled by RafaelUI and is available on Hugging Face.

Use Cases

Train language models for Serbian based on the combined, cleaned text corpus.
Analyze linguistic differences between Serbian and Croatian dialects using the transliterated text.
Benchmark text processing tools on a cleaned Wikipedia dataset with removed markup.
Study the effects of script unification (Cyrillic) on a corpus originally containing Latin script.

Strengths

Croatian Latin text has been transliterated to Serbian Cyrillic script.
Wiki markup, infoboxes, tables, templates, and calendar stub articles have been removed.
Articles in which more than 40% of the characters are Latin were filtered out, which likely focuses the corpus on Cyrillic text.
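
The 40% Latin-character filter can be approximated with a short Python sketch. How the corpus author actually measured the ratio is not documented, so this version assumes it is computed over alphabetic characters only; the function names latin_ratio and keep_article are hypothetical.

```python
import unicodedata


def latin_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Latin letters.

    Assumption: digits, punctuation, and whitespace are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names of Latin letters start with "LATIN".
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters)


def keep_article(text: str, max_latin: float = 0.40) -> bool:
    """Keep an article unless more than max_latin of its letters are Latin."""
    return latin_ratio(text) <= max_latin
```

A filter of this kind would presumably run after transliteration, so that what remains above the threshold is mostly untransliterated or foreign-language text; that ordering is also an assumption.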

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Freshness should be verified; the last update date is 2026-06-06.
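
Because neither the columns nor the row count are documented, a first step after download could be to tally field names and value types. The sketch below works on any iterable of dictionary-like records; the sample records and the field names "title" and "text" are invented for illustration and are not the dataset's confirmed schema.

```python
from collections import Counter, defaultdict


def summarize_records(records):
    """Count rows and tally the Python type names seen under each field."""
    field_types = defaultdict(Counter)
    rows = 0
    for record in records:
        rows += 1
        for key, value in record.items():
            field_types[key][type(value).__name__] += 1
    return rows, {key: dict(counts) for key, counts in field_types.items()}


# Hypothetical sample records; the real field names must be checked after download.
sample = [
    {"title": "Београд", "text": "Београд је главни град Србије."},
    {"title": "Загреб", "text": "Загреб је главни град Хрватске."},
]
rows, fields = summarize_records(sample)
print(rows, fields)
```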

Provenance

Source: Serbian (sr) and Croatian (hr) Wikipedia articles.
Collection Method: Articles were cleaned of markup, infoboxes, tables, and templates. Croatian Latin text was transliterated to Cyrillic.
Freshness: Last updated 2026-06-06 11:22:35
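
The transliteration step can be illustrated with a minimal Python sketch of Latin-to-Cyrillic conversion. This is not the author's actual pipeline, which is not documented here; it is a naive character mapping that treats every "lj", "nj", and "dž" as a digraph, so it mishandles words where those letters are separate sounds (for example "injekcija" or "nadživeti"). The name to_cyrillic is hypothetical.

```python
import unicodedata

# Digraphs must be matched before single letters.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLES = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def to_cyrillic(text: str) -> str:
    """Naively transliterate Croatian Latin text to Serbian Cyrillic."""
    # Compose letter + combining diacritic pairs so "č", "š", etc. match.
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            # "Lj" and "LJ" both become a single capital letter.
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLES.get(ch.lower())
        if cyr is None:
            out.append(ch)  # digits, punctuation, q/w/x/y pass through
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


print(to_cyrillic("Ljubav i njegov džep"))
```

A production transliterator would also need an exception list for the separate-sound cases and a policy for foreign names and embedded Latin-script terms.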

Source data is under CC BY-SA 4.0; corpus compilation is under CC BY 4.0.

Quality Score

Overall: C41
Description: 51
Source: 36
Reputation: 39
Access: 26

Community

Downloads: 15
Likes: 1
Views: 0

Dataset Info

Author: RafaelUI
Created: Jun 6, 2026
Updated: Jun 6, 2026
Last synced: Jun 13, 2026