Name: IndicAlign: Instruction and Toxic Alignment Datasets for 14 Indic Languages
Creator: ai4bharat
Published: 2024-03-05T11:16:02
Keywords: Alignment Datasets, Text, Multilingual NLP

Description

A collection of instruction and toxic alignment datasets for 14 Indic languages, created by ai4bharat and last updated on July 25, 2024. It includes subsets such as IndicAlign-Instruct, Indic-ShareLlama, and IndicAlign-Toxic, which were translated using IndicTrans2. The full curation process is described in an associated arXiv paper.

Use Cases

Fine-tuning language models for instruction-following based on the IndicAlign-Instruct subset.
Training or evaluating toxicity detection models based on the IndicAlign-Toxic subset.
Developing multilingual conversational AI using translated datasets like Wiki-Chat and Wiki-Conv.
Conducting linguistic or cultural analysis across 14 Indic languages using the diverse collection of translated texts.

Strengths

Covers 14 Indic languages, providing linguistic diversity.
Includes distinct subsets for instruction-following and toxic content alignment.
Names the translation model used to create the datasets, IndicTrans2.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are not stated, which makes it hard to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
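
Because field semantics have to be inferred after download, a quick first pass is to sample some rows and tabulate which fields appear, what types they hold, and how often they are populated. The following is a minimal sketch of that idea, not a documented workflow for this dataset: summarize_fields is a hypothetical helper, and the sample rows and their field names (id, lang, text, source) are invented for illustration and are not the dataset's real schema.

```python
from collections import defaultdict
from itertools import islice


def summarize_fields(records, sample_size=100):
    """Inspect the first `sample_size` rows of an iterable of dict-like rows.

    Returns a tuple (rows_sampled, summary), where summary maps each field
    name to the Python type names observed, the number of sampled rows with
    a non-null value, and one example non-null value.
    """
    summary = defaultdict(lambda: {"types": set(), "non_null": 0, "example": None})
    sampled = 0
    for row in islice(records, sample_size):
        sampled += 1
        for field, value in row.items():
            info = summary[field]
            info["types"].add(type(value).__name__)
            if value is not None:
                info["non_null"] += 1
                if info["example"] is None:
                    info["example"] = value
    return sampled, dict(summary)


# Hypothetical rows standing in for a downloaded subset. The field names
# are assumptions made for this example, not the dataset's actual columns.
sample_rows = [
    {"id": 1, "lang": "hin", "text": "example one"},
    {"id": 2, "lang": "tam", "text": None},
    {"id": 3, "lang": "ben", "text": "example three", "source": "demo"},
]

sampled, fields = summarize_fields(sample_rows)
for name, info in sorted(fields.items()):
    print(
        f"{name}: types={sorted(info['types'])}, "
        f"non_null={info['non_null']}/{sampled}, example={info['example']!r}"
    )
```

Any iterable of dict-like rows can be passed in place of sample_rows, so the same helper could be pointed at rows read from a downloaded subset. Fields that appear in only some rows, or that mix types, are the ones most worth checking by hand.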

Provenance

Source: ai4bharat on Hugging Face.
Collection Method: Datasets were translated using IndicTrans2; curation details are in an arXiv paper.
Freshness: Last updated 2024-07-25 03:38:13; check the source for newer revisions before use.
Geography: Focus on Indic languages, suggesting coverage of regions in South Asia.

License: Unknown; verify the terms of use before using the data.
