Kazakh News Corpus from 2020-2024 with 1.7 Million Tokens

Name: Kazakh News Corpus from 2020-2024 with 1.7 Million Tokens
Creator: Assel Ormanova
Published: 2026-05-28T11:57:07
License: CC-BY-4.0
Keywords: ZIP, News Corpus, Text, Natural Language Processing, Kazakh Language, Text Corpus

by Assel Ormanova, updated 11d ago

4.0 MB, 2 files

Description

The corpus comprises 1,700 news publications in Kazakh, collected between 2020 and 2024 from major Kazakhstani news platforms such as Tengri News and Egemen Kazakhstan. It contains 1,007,037 tokens, 107,501 types, and 109,395 lemmas, and a frequency list is provided with it. The corpus was compiled by Assel Ormanova and is available under a CC-BY-4.0 license.

Use Cases

Train language models for Kazakh based on contemporary news text.
Analyze linguistic patterns and vocabulary usage in modern Kazakh media.
Conduct frequency analysis and build lexicons using the provided frequency list.
Study news topics and discourse in Kazakhstan from 2020 to 2024.
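
The frequency analysis mentioned above can be sketched in Python as follows. This is a minimal illustration, not the method used to build the provided frequency list: the regex tokenizer, the lowercasing step, and the sample sentence are all assumptions, and counts produced this way will likely differ from those generated by #LancsBox.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode letters, optionally joined by hyphens.
# Kazakh-specific Cyrillic letters (ә, ғ, қ, ң, ө, ұ, ү, һ, і) are matched too.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def frequency_list(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, ties in code point order."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample text, not taken from the corpus.
sample = "Қазақстан жаңалықтары. Қазақстан Республикасы бүгін."
for word, count in frequency_list(sample)[:3]:
    print(word, count)
```

Lowercasing before counting merges sentence-initial and mid-sentence forms of the same word; skip that step if case distinctions matter for the analysis.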

Strengths

Contains 1,007,037 tokens from a defined 5-year period (2020-2024).
Sourced from 12+ named, major Kazakhstani news platforms, suggesting diverse journalistic content.
Includes a frequency list and lemma counts (109,395 lemmas) for linguistic analysis.
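
For a rough sense of scale, the figures reported in this listing can be combined into two simple summary statistics. The sketch below uses only the numbers stated above; the function name is illustrative, and the type-token ratio is sensitive to corpus size, so it should not be compared directly with corpora of a different size.

```python
# Figures as stated in the dataset description.
TOKENS = 1_007_037
TYPES = 107_501
PUBLICATIONS = 1_700


def type_token_ratio(types: int, tokens: int) -> float:
    """Number of distinct word forms divided by the number of running words."""
    return types / tokens


ttr = type_token_ratio(TYPES, TOKENS)
mean_tokens = TOKENS / PUBLICATIONS

print(f"Type-token ratio: {ttr:.4f}")                    # 0.1067
print(f"Mean tokens per publication: {mean_tokens:.1f}")  # 592.4
```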

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The 4.0 MB file size indicates a relatively small corpus for modern NLP training.
Data may reflect the editorial bias inherent to the selected news sources.

Provenance

Source: Compilation from Kazakhstani news platforms (e.g., Tengri News, Egemen Kazakhstan).
Collection Method: Downloaded from news platforms and processed with #LancsBox software.
Time Range: 2020-2024
Freshness: Last updated 2026-05-28 11:57:08; freshness should be verified.
Geography: Kazakhstan

Files are in ZIP and TXT formats; all text is in the Kazakh language.
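
Since the listing does not document the archive layout, a first step after download is to list and decode the TXT members of the ZIP file. The sketch below assumes UTF-8 encoding and uses a small in-memory archive with made-up member names for demonstration; the actual file names and encoding should be checked against the downloaded files.

```python
import io
import zipfile


def read_txt_members(archive, encoding="utf-8"):
    """Return {member name: decoded text} for every .txt file in a ZIP archive.

    `archive` may be a path or a file-like object. UTF-8 is an assumption.
    """
    texts = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            texts[info.filename] = zf.read(info).decode(encoding)
    return texts


# Demonstration on an in-memory archive with hypothetical member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("article_0001.txt", "Қазақстан жаңалықтары")
    zf.writestr("notes.md", "ignored")
buffer.seek(0)

texts = read_txt_members(buffer)
print(sorted(texts))
```

For the real dataset, pass the path of the downloaded ZIP file instead of the in-memory buffer.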

Quality Score

Overall: C (47)
Description: 46
Source: 43
Reputation: 35
Access: 79

Dataset Info

License: CC-BY-4.0
Author: Assel Ormanova
Files: 2
Created: May 28, 2026
Updated: May 28, 2026
Last synced: May 28, 2026
