Name: Tw Instruct 500K: 500,000 Synthetic Task-Oriented Dialogues for Taiwan Mandarin
Creator: lianghsun
Published: 2025-01-07T04:07:35
Keywords: Text, Task Oriented Dialogue, Natural Language Processing, Synthetic Data, Synthetic, Taiwan Mandarin

Description

500,000 synthetic task-oriented dialogues in Traditional Chinese, tailored for Taiwan's social context. The dataset was created by author lianghsun, combining reference-based and reference-free generation methods using LLMs, with a subset derived from the lianghsun/tw-instruct collection. It was last updated on January 10, 2025.

Use Cases

Fine-tuning large language models for task completion based on the described synthetic dialogue structure.
Training conversational AI assistants for Taiwan-specific domains and social contexts mentioned in the description.
Benchmarking model performance on generating or understanding Traditional Chinese task-oriented dialogues.
Studying the characteristics of LLM-generated synthetic data for instruction tuning.

Strengths

Contains 500,000 dialogue instances, providing a substantial volume for model training.
Specifically targets Taiwan's social context and common tasks, offering regional linguistic relevance.
Combines both reference-based and reference-free generation methods, as described, which may enhance diversity.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Data may reflect bias inherent to the source texts and generation methods used.

Provenance

Source: huggingface
Collection Method: Synthetic generation by LLMs, combining reference-based (using texts from training lianghsun/Llama-3.2-Taiwan-3B) and reference-free methods.
Freshness: Last updated 2025-01-10 05:03:52; freshness should be verified.
Geography: Taiwan

License is unknown; terms of use must be verified before application.

Text Task Oriented Dialogue Natural Language Processing Synthetic Data Synthetic Taiwan Mandarin

Tw Instruct 500K: 500,000 Synthetic Task-Oriented Dialogues for Taiwan Mandarin

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info