Name: Kazakh Instruction V2: Translated and Augmented Self-Instruct Data
Creator: AmanMussa
Published: 2023-11-16T13:47:44
Keywords: Machine Translation, Text, Cultural Data, LLM Fine-Tuning, Instruction Tuning, Kazakh Language

Description

Kazakh Instruction V2 is a dataset of self-instruct data pairs for the Kazakh language. It was created by machine-translating the Stanford Alpaca instruction dataset with Google's translation API, then manually correcting the output and adding instructions that cover Kazakh names, places, history, and culture. The dataset, authored by AmanMussa, was last updated on February 23, 2026.

Use Cases

Fine-tune language models to follow instructions in Kazakh using the self-instruct pairs.
Improve model performance on Kazakh cultural and historical topics using the added, culture-specific instructions.
Improve language model fluency and accuracy in Kazakh using the translated and manually corrected examples.
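
As an illustration of the fine-tuning use case, the sketch below turns one record into a prompt/completion pair. It assumes the records keep the `instruction`, `input`, and `output` fields of the Stanford Alpaca source; this dataset's real column names are undocumented and may differ. The prompt layout, the function names, and the sample record are all hypothetical and are not taken from the dataset.

```python
from typing import Dict


def build_prompt(record: Dict[str, str]) -> str:
    """Format one record as a prompt.

    Assumes Alpaca-style keys ("instruction", "input"); adjust the key
    names once the actual columns have been inspected.
    """
    instruction = record.get("instruction", "").strip()
    context = record.get("input", "").strip()
    if context:
        # Records with extra context get a separate input section.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def build_example(record: Dict[str, str]) -> Dict[str, str]:
    """Return a prompt/completion pair suitable for supervised fine-tuning."""
    return {
        "prompt": build_prompt(record),
        "completion": record.get("output", "").strip(),
    }


# Made-up sample record for demonstration only (not from the dataset).
sample = {
    "instruction": "Қазақстанның астанасы қай қала?",
    "input": "",
    "output": "Астана.",
}
example = build_example(sample)
print(example["prompt"] + example["completion"])
```

Keeping the formatting in one small function makes it easy to swap in a different template, or different field names, without touching the rest of a training pipeline.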

Strengths

Manual correction of translation errors suggests improved data quality over raw machine translation.
Addition of Kazakh-specific names, places, history, and culture likely increases the dataset's relevance for the target language.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not documented, so suitability for large-scale training cannot be assessed before download.
The dataset is derived from a machine translation of an English dataset, so translation artifacts may remain despite manual correction.
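
Because the columns and row count are undocumented, a first step after download could be a quick structural summary. The sketch below assumes the data has already been loaded as a list of dictionaries (for example, from a JSON file); the function name and the sample rows are hypothetical placeholders, not actual dataset content.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_records(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count rows and, per field, how many rows hold a non-empty value."""
    row_count = 0
    non_empty: Counter = Counter()
    for record in records:
        row_count += 1
        for key, value in record.items():
            # Treat None and the empty string as missing values.
            if value not in (None, ""):
                non_empty[key] += 1
    return {"rows": row_count, "non_empty_by_field": dict(non_empty)}


# Placeholder rows standing in for the downloaded data.
sample_rows = [
    {"instruction": "a", "input": "", "output": "b"},
    {"instruction": "c", "input": "d", "output": "e"},
]
print(summarize_records(sample_rows))
```

The per-field counts reveal which columns exist and how often optional fields are actually filled in, which is the information the dataset card leaves out.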

Provenance

Source: Translated from the Stanford Alpaca instruction dataset.
Collection Method: Machine-translated with Google's translation API, then manually corrected and augmented.
Freshness: Last updated 2026-02-23 10:30:27.
Geography: Kazakhstan (implied by content additions).

License is unknown; terms of use must be verified before application.
