A dataset of 50,000 Chinese text samples for intent classification, published by trytax on Hugging Face. The data is synthetically generated and each sample is labeled with an intent and a domain. It was last updated on March 13, 2026.
Use Cases
- Train a Chinese intent classifier, using the 'intent' field as the target label.
- Benchmark prompt routing systems, using the 'domain' field as the routing label.
- Test model performance on synthetic conversational data based on the 'source' field.
- Evaluate classification models using the provided train/validation/test split.
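For the evaluation use case, the following is a minimal sketch of scoring predicted intents against gold labels with accuracy and macro-averaged F1. It assumes the gold labels have already been read from the 'intent' field of the test split; the function names and the label strings in the usage note are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def accuracy(gold, pred):
    """Fraction of samples whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty list")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty list")

    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A call such as `macro_f1(["greeting", "refund"], ["greeting", "greeting"])` uses hypothetical intent names. Macro averaging is chosen here because it weights every intent equally, which matters if the label distribution turns out to be imbalanced; the source does not state whether it is.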
Strengths
- Contains 50,000 total samples.
- Provides a fixed, reproducible split of 45,000 training, 2,500 validation, and 2,500 test samples.
- Includes multiple annotation fields: text, intent, domain, and source.
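The split sizes above work out to a 90/5/5 partition of the 50,000 samples. A short sketch of that arithmetic, plus a check that could be run against the downloaded data, follows. The split names `train`, `validation`, and `test` used as dictionary keys are an assumption; the actual names in the repository may differ.

```python
# Sizes as stated in this card; the key names are assumed, not confirmed.
EXPECTED_SIZES = {"train": 45_000, "validation": 2_500, "test": 2_500}


def split_fractions(sizes):
    """Return each split's share of the total sample count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def find_mismatches(actual_sizes, expected_sizes=EXPECTED_SIZES):
    """List human-readable differences between actual and expected sizes."""
    problems = []
    for name, expected in expected_sizes.items():
        actual = actual_sizes.get(name)
        if actual is None:
            problems.append(f"missing split: {name}")
        elif actual != expected:
            problems.append(f"{name}: expected {expected}, found {actual}")
    return problems
```

Passing the observed row counts per split to `find_mismatches` returns an empty list when the download matches the documented split.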
Limitations
- Data is synthetically generated, which may not reflect real-world user query distributions.
- Column-level documentation is absent; field semantics must be inferred after download.
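Because field semantics have to be inferred after download, one way to start is to profile the distinct values in each column. The sketch below assumes the samples are available as a list of dictionaries keyed by field name; `profile_columns` is a hypothetical helper, and the example rows are invented placeholders rather than real samples from the dataset.

```python
from collections import Counter


def profile_columns(rows, max_values=5):
    """Summarize each column: number of distinct values and the most common ones."""
    counters = {}
    for row in rows:
        for column, value in row.items():
            counters.setdefault(column, Counter())[value] += 1
    return {
        column: {
            "distinct": len(counter),
            "top": counter.most_common(max_values),
        }
        for column, counter in counters.items()
    }


# Invented rows for illustration only; real field values are undocumented.
example_rows = [
    {"text": "帮我查一下明天的天气", "intent": "query_weather",
     "domain": "weather", "source": "template"},
    {"text": "今天会下雨吗", "intent": "query_weather",
     "domain": "weather", "source": "template"},
    {"text": "我想申请退款", "intent": "request_refund",
     "domain": "shopping", "source": "template"},
]

for column, summary in profile_columns(example_rows).items():
    print(column, summary["distinct"], summary["top"])
```

Columns with few distinct values, such as 'intent', 'domain', and 'source', are likely categorical labels, and listing their most frequent values is usually enough to guess what each one encodes.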
Provenance
- Source: trytax on Hugging Face
- Collection Method: Synthetic/rule-based generation
- Time Range: not specified
- Freshness: Last updated 2026-03-13 16:43:27; freshness should be verified.
- Geography: not specified