sarjai's Arabic Voice Agent End-of-Turn 5M dataset contains 5,000,000 synthetic Arabic end-of-turn examples for training and evaluating turn-taking models in voice agents. Each row is a two-turn text exchange: an agent_turn followed by a user_stt_text field, a simulated surface form of the user's reply. Each row is labeled for whether the user has finished speaking. The dataset was last updated on Hugging Face in May 2026.
Use Cases
- Train turn-taking prediction models on the labeled end-of-turn examples.
- Evaluate how dialogue systems handle the simulated user speech text.
- Benchmark Arabic conversational AI models on the two-turn text structure.
- Study synthetic dialogue generation patterns for voice agents.
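The evaluation use case above can be sketched as a small scoring routine. This is a minimal illustration, not the dataset author's method: the field names agent_turn and user_stt_text come from the description, while the `label` column name, the toy rows, and the word-count predictor are all hypothetical and should be replaced once the real schema has been inspected.

```python
from typing import Callable, Iterable, Mapping


def evaluate_end_of_turn(
    rows: Iterable[Mapping[str, object]],
    predict: Callable[[str, str], bool],
) -> dict:
    """Score a binary end-of-turn predictor against labeled rows.

    Assumes each row has `agent_turn`, `user_stt_text`, and a truthy/falsy
    `label` (hypothetical name) meaning "the user has finished speaking".
    """
    tp = fp = fn = tn = 0
    for row in rows:
        gold = bool(row["label"])
        pred = bool(predict(str(row["agent_turn"]), str(row["user_stt_text"])))
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif not pred and gold:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hand-written toy rows for illustration only; they are NOT taken from the dataset.
TOY_ROWS = [
    {"agent_turn": "كيف يمكنني مساعدتك؟", "user_stt_text": "أريد حجز موعد غدا", "label": 1},
    {"agent_turn": "كيف يمكنني مساعدتك؟", "user_stt_text": "أريد أن", "label": 0},
    {"agent_turn": "ما اسمك؟", "user_stt_text": "اسمي", "label": 0},
    {"agent_turn": "هل تؤكد الحجز؟", "user_stt_text": "نعم", "label": 1},
]


def toy_predict(agent_turn: str, user_stt_text: str) -> bool:
    # Naive placeholder baseline: treat replies of three or more words as finished.
    return len(user_stt_text.split()) >= 3


metrics = evaluate_end_of_turn(TOY_ROWS, toy_predict)
print(metrics)
```

The word-count baseline misses short complete replies such as a one-word confirmation, which is the kind of error a trained turn-taking model would be expected to reduce. Reporting precision and recall separately matters here because a false "finished" prediction interrupts the user, while a false "not finished" prediction only adds latency.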
Strengths
- Contains 5,000,000 total examples, providing a large-scale resource.
- Includes a dedicated validation split of 50,000 rows and a test split of 50,000 rows.
- Focuses on a specific, high-value NLP task for Arabic conversational AI.
Limitations
- Data is synthetic, which may not fully reflect real-world conversational patterns.
- Column-level documentation is absent; field semantics must be inferred after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
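Because field semantics have to be inferred after download, a first pass over a sample of rows can at least show which columns exist, what types they hold, and how often they are missing. The sketch below assumes the rows are available as plain dictionaries; the `label` field name and the toy rows are hypothetical stand-ins, and `summarize_fields` is an illustrative helper rather than part of any library.

```python
from collections import Counter, defaultdict
from typing import Iterable, Mapping


def summarize_fields(rows: Iterable[Mapping[str, object]], sample_size: int = 1000) -> dict:
    """Report, per field, how many sampled rows contain it and which value types appear."""
    type_counts = defaultdict(Counter)
    seen = 0
    for row in rows:
        if seen >= sample_size:
            break
        seen += 1
        for name, value in row.items():
            type_counts[name][type(value).__name__] += 1
    return {
        name: {
            "present": sum(counts.values()),
            "missing": seen - sum(counts.values()),
            "types": dict(counts),
        }
        for name, counts in type_counts.items()
    }


# Hand-written toy rows for illustration only; they are NOT taken from the dataset.
sample_rows = [
    {"agent_turn": "كيف يمكنني مساعدتك؟", "user_stt_text": "أريد حجز موعد", "label": 1},
    {"agent_turn": "ما اسمك؟", "user_stt_text": "اسمي", "label": 0},
    {"agent_turn": "هل تؤكد الحجز؟", "user_stt_text": None},
]

summary = summarize_fields(sample_rows)
for field, info in summary.items():
    print(field, info)
```

A summary like this does not establish what a field means, but it narrows down which columns need manual reading and flags nulls or mixed types before any training run.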
Provenance
- Source
  - sarjai on Hugging Face.
- Collection Method
  - Synthetically generated.
- Time Range
  - Not specified.
- Freshness
  - Last updated 2026-05-25 13:28:43; freshness should be verified.
- Geography
  - Not specified.