Malayalam Instruct Dataset-L: A Large-Scale Malayalam Instruction-Tuning Corpus

Name: Malayalam Instruct Dataset-L: A Large-Scale Malayalam Instruction-Tuning Corpus
Creator: siyah1
Published: 2026-06-16T09:27:28
Keywords: Malayalam, Text, Multilingual, Language Model, Large Scale, Natural Language Processing, Multilingual Text, Instruction Tuning

Available on 1 platform

Description

Malayalam Instruct Dataset-L is a large-scale instruction-tuning dataset for the Malayalam language. It was compiled programmatically from more than 20 sources, including multilingual text corpora, translation engines, and RSS feeds, and draws heavily on the CulturaX database. The dataset was created by siyah1 and last updated on June 17, 2026.

Use Cases

Instruction-tuning large language models on Malayalam instruction data.
Training models for Malayalam text generation on the compiled instruction-response pairs.
Benchmarking model performance on low-resource language tasks.
Developing multilingual NLP applications that draw on the dataset's multiple sources.
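
For the fine-tuning use cases above, each instruction-response pair usually has to be rendered into a single training string. The following is a minimal sketch, not the author's method. It assumes each record exposes `instruction` and `response` fields (hypothetical names, since the dataset's columns are undocumented) and uses an arbitrary prompt template that is not taken from the dataset.

```python
# Hypothetical field names and template; adjust after inspecting the real data.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_example(record, instruction_key="instruction", response_key="response"):
    """Render one instruction-response record as a single training string."""
    # `or ""` guards against missing keys and explicit None values.
    instruction = str(record.get(instruction_key) or "").strip()
    response = str(record.get(response_key) or "").strip()
    if not instruction or not response:
        return None  # skip incomplete pairs rather than train on them
    return PROMPT_TEMPLATE.format(instruction=instruction, response=response)
```

Returning `None` for incomplete pairs lets the caller filter them out before tokenization. The key names are parameters so they can be changed once the real schema is known.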

Strengths

Compiled from over 20 distinct multilingual text sources.
Heavily features the large-scale CulturaX database.
Described as one of the largest Malayalam instruction-tuning datasets available.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes suitability harder to assess.
Last updated 2026-06-17 06:59:47; freshness should be verified.
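
Because neither the columns nor the row count are documented, a first step after download is to inspect the data directly. The sketch below assumes the data is available as JSON Lines (one JSON object per line). The actual file format is not stated on this page, so treat that as an assumption. It counts rows and records which fields appear and with which value types.

```python
import json
from collections import Counter, defaultdict


def summarize_jsonl(lines):
    """Count rows and record which fields appear, with their value types."""
    row_count = 0
    field_types = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        row = json.loads(line)  # assumes each line holds one JSON object
        row_count += 1
        for key, value in row.items():
            field_types[key][type(value).__name__] += 1
    return row_count, {key: dict(counts) for key, counts in field_types.items()}
```

The function only iterates over lines, so it accepts an open file handle (opened with `encoding="utf-8"`) as well as a list of strings. Fields that appear in fewer rows than the total count are optional or inconsistently populated.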

Provenance

Source: Compiled from sources including the uonlp/CulturaX database, translation engines, and RSS feeds.
Collection Method: Programmatically scraped, deduplicated, and unified.
Freshness: Last updated 2026-06-17 06:59:47.
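
The page says only that the data was "deduplicated" and "unified"; it does not describe how. As an illustration of one common approach, not the author's actual pipeline, the sketch below performs exact-match deduplication after Unicode NFC normalization and whitespace collapsing. The field names `instruction` and `response` are hypothetical.

```python
import hashlib
import unicodedata


def normalize_text(text):
    """NFC-normalize and collapse whitespace so equivalent strings compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(records, keys=("instruction", "response")):
    """Keep the first occurrence of each record, comparing normalized key fields."""
    seen = set()
    unique = []
    for record in records:
        # Join fields with a separator that is unlikely to occur in the text.
        joined = "\x1f".join(normalize_text(str(record.get(k) or "")) for k in keys)
        digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Hashing keeps memory use bounded by the number of distinct records, not by their length. This catches only exact duplicates after normalization; near-duplicates, such as the same text from two translation engines, would need a fuzzy method.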

License information is unknown.

Quality Score

Overall: D37
Description: 42
Source: 36
Reputation: 38
Access: 22

Community

8 downloads
1 like
0 views

Dataset Info

Author: siyah1
Created: Jun 16, 2026
Updated: Jun 17, 2026
Last synced: Jun 23, 2026
