Sign in to view source links and access this dataset
Description
500,000 synthetic task-oriented dialogues in Traditional Chinese, tailored for Taiwan's social context. The dataset was created by author lianghsun, combining reference-based and reference-free generation methods using LLMs, with a subset derived from the lianghsun/tw-instruct collection. It was last updated on January 10, 2025.
Use Cases
Fine-tuning large language models for task completion based on the described synthetic dialogue structure.
Training conversational AI assistants for Taiwan-specific domains and social contexts mentioned in the description.
Benchmarking model performance on generating or understanding Traditional Chinese task-oriented dialogues.
Studying the characteristics of LLM-generated synthetic data for instruction tuning.
Strengths
Contains 500,000 dialogue instances, providing a substantial volume for model training.
Specifically targets Taiwan's social context and common tasks, offering regional linguistic relevance.
Combines both reference-based and reference-free generation methods, as described, which may enhance diversity.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Data may reflect bias inherent to the source texts and generation methods used.
Provenance
Source
huggingface
Collection Method
Synthetic generation by LLMs, combining reference-based (using texts from training lianghsun/Llama-3.2-Taiwan-3B) and reference-free methods.
Freshness
Last updated 2025-01-10 05:03:52; freshness should be verified.
Geography
Taiwan
License is unknown; terms of use must be verified before application.