Kazakh News Corpus from 2020-2024 with 1.0 Million Tokens
by Assel Ormanova·Updated 11d ago
4.0 MB · 2 files
Description
The corpus comprises 1,700 Kazakh-language news publications collected between 2020 and 2024 from major Kazakhstani news platforms such as Tengri News and Egemen Kazakhstan. It contains 1,007,037 tokens, 107,501 types, and 109,395 lemmas, and it ships with a frequency list. It was compiled by Assel Ormanova and is available under a CC-BY-4.0 license.
Use Cases
Train language models for Kazakh based on contemporary news text.
Analyze linguistic patterns and vocabulary usage in modern Kazakh media.
Conduct frequency analysis and build lexicons using the provided frequency list.
Study news topics and discourse in Kazakhstan from 2020 to 2024.
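For the frequency-analysis use case, a minimal sketch of building a word frequency list from raw Kazakh text might look like the following. The tokenization rule (runs of Unicode letters, lowercased) and the sample sentence are assumptions made for illustration; they are not the rules #LancsBox applies, so counts produced this way will not necessarily match the frequency list shipped with the corpus.

```python
import re
from collections import Counter

# Assumed tokenization rule: a token is a run of Unicode letters,
# optionally joined by internal hyphens or apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def frequency_list(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs sorted by descending count."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(tokens).most_common()


# Hypothetical sample text, not taken from the corpus.
sample = (
    "Қазақстан жаңалықтары. Қазақстан Республикасы, "
    "жаңалықтары және Қазақстан."
)
freq = frequency_list(sample)

for word, count in freq:
    print(f"{word}\t{count}")
```

The same function can be applied to each text file in the corpus and the resulting counters summed to approximate a corpus-wide list.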
Strengths
Contains 1,007,037 tokens from a defined 5-year period (2020-2024).
Sourced from 12+ named, major Kazakhstani news platforms, suggesting diverse journalistic content.
Includes a frequency list and lemma counts (109,395 lemmas) for linguistic analysis.
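The counts reported in the description allow a couple of quick derived statistics. The sketch below computes the type-token ratio and the mean number of tokens per publication; the variable names are illustrative, and the figures are copied from the description rather than recomputed from the files.

```python
# Figures as stated in the dataset description.
tokens = 1_007_037
types = 107_501
publications = 1_700

# Type-token ratio: share of distinct word forms among all running words.
type_token_ratio = types / tokens

# Average length of one news publication, in tokens.
mean_tokens_per_publication = tokens / publications

print(f"Type-token ratio: {type_token_ratio:.4f}")
print(f"Mean tokens per publication: {mean_tokens_per_publication:.1f}")
```

This gives a type-token ratio of roughly 0.107 and an average of about 592 tokens per publication.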
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
At roughly 1 million tokens (4.0 MB), the corpus is small by the standards of modern NLP training.
Data may reflect the editorial bias inherent to the selected news sources.
Provenance
Source
Compilation from Kazakhstani news platforms (e.g., Tengri News, Egemen Kazakhstan).
Collection Method
Downloaded from news platforms and processed with #LancsBox software.
Time Range
2020-2024
Freshness
Last updated 2026-05-28 11:57:08; freshness should be verified.
Geography
Kazakhstan
Files are in ZIP and TXT formats; all text is in the Kazakh language.
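Because the internal layout of the files is undocumented, a first step after download is usually to look inside the archive. The sketch below previews the first lines of every TXT member in a ZIP file. UTF-8 encoding is an assumption, the helper name `preview_zip` is hypothetical, and the demo runs on a small in-memory archive rather than on the actual dataset files.

```python
import io
import zipfile


def preview_zip(source, max_lines: int = 5) -> dict[str, list[str]]:
    """Return the first few lines of every .txt member in a ZIP archive.

    `source` may be a file path or a binary file-like object.
    """
    previews = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".txt"):
                continue
            with archive.open(name) as raw:
                # Assumption: members are UTF-8; undecodable bytes are replaced.
                reader = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                lines = []
                for line in reader:
                    lines.append(line.rstrip("\n"))
                    if len(lines) >= max_lines:
                        break
                previews[name] = lines
    return previews


# Demo on an in-memory archive with a hypothetical member name and content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", "бірінші жол\nекінші жол\n")
buffer.seek(0)

demo = preview_zip(buffer)
for member, lines in demo.items():
    print(member, lines)
```

Passing the path of the downloaded archive instead of `buffer` should show whether each file holds running text, one document per line, or a tabular frequency list.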