Description
A collection of instruction-following and toxicity-alignment datasets for 14 Indic languages, published by ai4bharat and last updated on July 25, 2024. It includes subsets such as IndicAlign-Instruct, Indic-ShareLlama, and IndicAlign-Toxic, which were translated with IndicTrans2. The full curation process is described in an associated arXiv paper.
Use Cases
Fine-tuning language models for instruction following with the IndicAlign-Instruct subset.
Training or evaluating toxicity detection models with the IndicAlign-Toxic subset.
Developing multilingual conversational AI using translated datasets like Wiki-Chat and Wiki-Conv.
Conducting linguistic or cultural analysis across 14 Indic languages using the diverse collection of translated texts.
Strengths
Covers 14 Indic languages, providing linguistic diversity.
Includes distinct subsets for instruction-following and toxic content alignment.
Names the translation model used (IndicTrans2), so the translation step is traceable.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which makes it hard to judge suitability for a given task before download.
Description metadata is limited; actual data quality requires manual inspection after download.
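Because field semantics must be inferred after download, a quick pass over a handful of rows can show which columns exist and what types they hold. The following is a minimal sketch, not the dataset's documented loading procedure: `summarize_fields` and `peek_remote` are hypothetical helper names, the sample field names are invented stand-ins rather than the real schema, and the repository ID, configuration name, and `train` split passed to `peek_remote` are assumptions to confirm on the dataset page.

```python
from collections import Counter, defaultdict
from itertools import islice
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(
    records: Iterable[Mapping[str, Any]], limit: int = 100
) -> Dict[str, Dict[str, int]]:
    """Count the Python types seen for each field in the first `limit` records."""
    summary: Dict[str, Counter] = defaultdict(Counter)
    for record in islice(records, limit):
        for field, value in record.items():
            summary[field][type(value).__name__] += 1
    return {field: dict(counts) for field, counts in summary.items()}


def peek_remote(
    repo_id: str, config: str, split: str = "train", limit: int = 100
) -> Dict[str, Dict[str, int]]:
    """Stream a few rows from the Hugging Face Hub and summarize them.

    Needs network access and the third-party `datasets` package.
    The split name defaults to "train", which is an assumption.
    """
    from datasets import load_dataset  # imported lazily: pip install datasets

    stream = load_dataset(repo_id, config, split=split, streaming=True)
    return summarize_fields(stream, limit)


if __name__ == "__main__":
    # Stand-in rows; these field names are hypothetical, not the real schema.
    sample = [
        {"id": 1, "text": "namaste", "lang": "hin"},
        {"id": 2, "text": None, "lang": "tam"},
    ]
    print(summarize_fields(sample))
```

Streaming avoids downloading a full subset just to read its schema, but it does not report the row count; that still requires the repository's own metadata or a full pass over the data.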
Provenance
Source
ai4bharat on Hugging Face.
Collection Method
Datasets were translated using IndicTrans2; curation details are in an arXiv paper.
Freshness
Last updated 2024-07-25 03:38:13 (time zone not stated); check the source for newer revisions.
Geography
Focuses on Indic languages, which suggests coverage of South Asia.
License
Unknown; users must verify the terms of use before using the data.