Description
Kazakh Instruction V2 is a dataset of self-instruct data pairs for the Kazakh language. It was created by translating the Stanford Alpaca instruction dataset with the Google Translations API, then manually correcting translation errors and adding instructions that cover Kazakh names, places, history, and culture. The dataset, authored by AmanMussa, was last updated on February 23, 2026.
Use Cases
Fine-tune language models for Kazakh instruction following based on the self-instruct data pairs.
Improve model performance on Kazakh cultural and historical topics based on the added instructions.
Enhance language model fluency and accuracy in Kazakh through translated and manually corrected examples.
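For the fine-tuning use case, each data pair is usually rendered into a single training string. The sketch below shows one way to do that, assuming the dataset keeps the instruction, input, and output fields of the original Stanford Alpaca schema. That is an assumption, because this dataset's columns are undocumented. The function name format_example, the prompt headers, and the sample record are illustrative and not taken from the dataset.

```python
from typing import Mapping


def format_example(record: Mapping[str, str]) -> str:
    """Render one instruction record as a single prompt/response string.

    ASSUMPTION: the keys "instruction", "input", and "output" follow the
    original Stanford Alpaca schema. The Kazakh dataset's real column
    names are undocumented and may differ.
    """
    instruction = (record.get("instruction") or "").strip()
    context = (record.get("input") or "").strip()
    response = (record.get("output") or "").strip()

    parts = [f"### Instruction:\n{instruction}"]
    # Only emit the input section when the record actually carries context.
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


# Hypothetical record written for illustration; it is not a row from the dataset.
sample = {
    "instruction": "Қазақстанның астанасын атаңыз.",
    "input": "",
    "output": "Астана.",
}

if __name__ == "__main__":
    print(format_example(sample))
```

Skipping the empty input section keeps prompts short for instructions that need no extra context. Check the column names against the downloaded files before relying on this layout.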
Strengths
Manual correction of translation errors suggests improved data quality over raw machine translation.
Addition of Kazakh-specific names, places, history, and culture likely increases the dataset's relevance for the target language.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, so suitability for large-scale training cannot be assessed before download.
The dataset is a translation of an English dataset, which may introduce translation artifacts such as unnatural phrasing.
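Because the columns and row count are undocumented, a quick pass over the downloaded records can recover both. The following is a minimal sketch, assuming the records have already been loaded as a list of dictionaries (for example from a JSON file). The helper name summarize_schema and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_schema(records: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Count rows and report, per column, the value types seen and empty values."""
    row_count = 0
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    empty_counts: Counter = Counter()

    for record in records:
        row_count += 1
        for column, value in record.items():
            type_counts[column][type(value).__name__] += 1
            # Treat None and the empty string as "empty" for this summary.
            if value is None or value == "":
                empty_counts[column] += 1

    return {
        "rows": row_count,
        "columns": {
            column: {"types": dict(counts), "empty": empty_counts[column]}
            for column, counts in type_counts.items()
        },
    }


# Hypothetical rows for illustration; real column names must be read from the data.
rows = [
    {"instruction": "a", "input": "", "output": "b"},
    {"instruction": "c", "input": "d", "output": "e"},
]

if __name__ == "__main__":
    print(summarize_schema(rows))
```

A column that is empty in many rows, or that mixes value types, is worth inspecting by hand before training.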
Provenance
Source
Translated from the Stanford Alpaca instruction dataset.
Collection Method
Translated via Google Translations API, then manually corrected and augmented.
Freshness
Last updated 2026-02-23 10:30:27.
Geography
Kazakhstan (implied by content additions).
License
Unknown; verify the terms of use before using the dataset.