Name: Tw Instruct 500K 2511
Creator: lianghsun
Published: 2025-11-13T03:23:17
Keywords: Text, Fine Tuning, Large Language Model, Chinese Language, Synthetic Dialogue, Synthetic

Description

tw-instruct-500k-2511 is a 2025 November version of a synthetic dialogue dataset for training Taiwanese Mandarin conversational models. It combines reference-based and reference-free generation methods to produce instruction-response pairs aligned with Taiwanese context. The dataset was created by lianghsun and updated on HuggingFace in May 2026.

Use Cases

Supervised fine-tuning (SFT) of Taiwanese Mandarin dialogue models based on the dataset's stated purpose.
Training models on culturally contextual responses based on reference-based generation from Taiwanese texts.
Generating reference-free instruction-response pairs based on the Self-Instruct methodology mentioned.
Cleaning and standardizing dialogue data based on the described process of removing overly long, short, or duplicate samples.
Adopting OpenAI messages schema for dialogue formatting based on the dataset's unified structure.

Strengths

Dataset format is unified to the OpenAI messages schema, suggesting a standardized structure.
Data has been cleaned to remove overly long, short, and duplicate samples.
Includes both reference-based and reference-free instruction samples, indicating a multi-source generation approach.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.

Provenance

Source: lianghsun on HuggingFace.
Collection Method: Synthetic generation via reference-based (using Taiwanese texts) and reference-free (using Self-Instruct) methods.
Freshness: Last updated 2026-05-03 21:19:02; freshness should be verified.
Geography: Taiwan (based on the focus on Taiwanese context and texts).

License is unknown; terms of use must be verified before application.

Text Fine Tuning Large Language Model Chinese Language Synthetic Dialogue Synthetic

Tw Instruct 500K 2511

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info