This benchmark contains evaluation data for long-form Text-to-Speech (TTS) and for speech and audio understanding tasks in English and Chinese. It is designed to test how well omni-modal large language models generate personalized, long-horizon speech and interpret complex audio signals.
Use Cases
- Benchmark the accuracy of long-form TTS systems using the provided English and Chinese text inputs
- Evaluate the audio understanding performance of omni-modal LLMs on the speech comprehension tasks
- Test the consistency of personalized voice cloning over long-horizon speech generation
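
As an illustration of the first use case, one common way to score TTS accuracy is to transcribe the synthesized audio with a speech recognizer and compare the transcript with the input text. The sketch below assumes that approach; this description does not state which metric the benchmark itself uses, and the names `edit_distance` and `error_rate` are hypothetical helpers, not part of any released tooling. Word-level units are used for English and character-level units for Chinese, since Chinese text is not whitespace-delimited.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Word error rate (unit="word") or character error rate (unit="char").

    `reference` is the text given to the TTS system; `hypothesis` is an ASR
    transcript of the synthesized audio (obtained separately, assumed here).
    """
    if unit == "word":
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
    else:
        # Character-level comparison, ignoring whitespace.
        ref = [c for c in reference if not c.isspace()]
        hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(error_rate("今天天气很好", "今天天气真好", unit="char"))
```

For long-form inputs, the same function can be applied per segment or to the full passage; text normalization (punctuation, numerals) is left out of this sketch and would need to match whatever the evaluation protocol specifies.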
Strengths
- Includes evaluation samples for both English and Chinese language processing
- Focuses on long-form TTS tasks, measuring how well systems sustain quality over extended speech synthesis
- Provides test cases for speech and audio understanding within the MGM-Omni framework