Name: Nogai-Russian SFT Biblical Corpus v1: Parallel Text for Turkic Language LLMs
Creator: ansarzeinulla
Published: 2026-06-13T14:29:46
Keywords: Low Resource Nlp, Text, Natural Language Processing, Biblical Text, Turkic Languages, Llm Fine Tuning, Parallel Corpus

Description

A specialized parallel dataset engineered for Supervised Fine-Tuning of Large Language Models in zero-resource Turkic languages. The corpus, created by ansarzeinulla and last updated in June 2026, is designed to mitigate catastrophic forgetting during model adaptation to endangered languages. It contains high-fidelity Nogai-Russian translations of biblical text.

Use Cases

Supervised Fine-Tuning of frontier LLMs like Qwen or Llama based on the described parallel biblical text.
Mitigating catastrophic forgetting during Continuous Pre-Training for endangered languages based on the dataset's stated purpose.
Training or evaluating machine translation models for the Nogai-Russian language pair based on the parallel corpus nature.
Researching LLM alignment and instruction-following capabilities in low-resource language contexts based on the SFT focus.

Strengths

Designed specifically for a high-fidelity, specialized task: Supervised Fine-Tuning for zero-resource Turkic languages.
Targets a concrete NLP problem: mitigating catastrophic forgetting during model adaptation to endangered languages.
Last updated metadata indicates a recent update on 2026-06-13.

Limitations

Description metadata is limited; actual data quality, size, and structure require manual inspection after download.
Column-level documentation is absent; field semantics and the nature of the parallel alignment must be inferred after download.
Row count, file formats, and license are unknown, which limits suitability assessment for specific projects.

Provenance

Source: huggingface user ansarzeinulla
Collection Method: Likely curated or constructed from biblical text translations.
Freshness: Last updated 2026-06-13 14:38:36; freshness should be verified.

License is unknown, which may restrict commercial or research use.

Text Low Resource Nlp Natural Language Processing Biblical Text Turkic Languages Llm Fine Tuning Parallel Corpus

Nogai-Russian SFT Biblical Corpus v1: Parallel Text for Turkic Language LLMs

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info