BAS4R is a dataset for Bangla speech analysis comprising 120,125 audio files that total 143.88 hours. It contains both authentic and spoofed speech from 200 native speakers across ten Bangladeshi districts. Al Arian Ahmad contributed the dataset to Harvard Dataverse; its last recorded update is 2026-05-22.
Use Cases
- Train anti-spoofing classifiers on systematically generated spoofed speech samples.
- Evaluate speaker verification robustness using recordings made under realistic acoustic and channel-degraded conditions.
- Develop gender-aware voice analysis models using speech from 110 male and 90 female participants.
- Research accent-robust spoofing detection using the regional pronunciation variability across ten districts.
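For the classifier and speaker verification use cases, one common precaution (an assumption here, not something the dataset description prescribes) is to keep speakers disjoint between training and test sets so that a model is not scored on voices it has already heard. The sketch below shows one way to do that in Python. The speaker IDs are hypothetical placeholders; how BAS4R actually labels its 200 speakers must be checked after download.

```python
import random


def speaker_disjoint_split(speaker_ids, test_fraction=0.2, seed=0):
    """Split speakers (not files) into disjoint train and test sets."""
    unique = sorted(set(speaker_ids))      # deterministic starting order
    rng = random.Random(seed)              # local RNG, reproducible
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test = set(unique[:n_test])
    train = set(unique[n_test:])
    return train, test


# Hypothetical IDs standing in for the 200 speakers.
speakers = [f"spk{i:03d}" for i in range(200)]
train_speakers, test_speakers = speaker_disjoint_split(speakers)
print(len(train_speakers), len(test_speakers))  # 160 40
```

Files would then be assigned to a split according to the speaker they belong to, which keeps every recording of a given speaker, authentic or spoofed, on the same side.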
Strengths
- Large scale with 120,125 audio files totaling approximately 143.88 hours of speech.
- Structured organization into five major categories with defined file counts (e.g., 28,830 files per spoofing category).
- Diverse speaker pool of 200 native Bangla speakers from ten districts, capturing regional linguistic diversity.
- Systematically generated spoofed samples covering multiple conditions, such as GSM codec, telephone transmission, and pitch shift.
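The counts quoted above allow a quick sanity check of scale. The following minimal sketch uses only the three figures stated in this summary (total files, total hours, and files per spoofing category) to derive the mean clip length and the share of the dataset taken by one spoofing category. The function names are illustrative, not part of any dataset tooling.

```python
TOTAL_FILES = 120_125
TOTAL_HOURS = 143.88
FILES_PER_SPOOF_CATEGORY = 28_830


def mean_clip_seconds(total_hours, total_files):
    """Average duration of one file, in seconds."""
    return total_hours * 3600 / total_files


def category_share(files_in_category, total_files):
    """Fraction of all files that fall in one category."""
    return files_in_category / total_files


print(round(mean_clip_seconds(TOTAL_HOURS, TOTAL_FILES), 2))    # 4.31
print(category_share(FILES_PER_SPOOF_CATEGORY, TOTAL_FILES))    # 0.24
```

On these figures, a typical file is a short utterance of roughly four seconds, and each spoofing category of 28,830 files accounts for 24% of the dataset.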
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, which may limit suitability assessment for certain modeling tasks.
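Because the layout has to be inferred after download, a first step could be to inventory the extracted files. The sketch below counts audio files per top-level folder and per extension. The directory name `BAS4R` and the set of audio extensions are assumptions for illustration; the actual folder names and file formats are not documented in this summary.

```python
from collections import Counter
from pathlib import Path

# Assumed extensions; adjust after inspecting the download.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def inventory_audio(root):
    """Count audio files under root, by top-level folder and by extension."""
    root = Path(root)
    per_folder = Counter()
    per_extension = Counter()
    for path in root.rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in AUDIO_EXTENSIONS:
            parts = path.relative_to(root).parts
            # Files directly under root are grouped under ".".
            top = parts[0] if len(parts) > 1 else "."
            per_folder[top] += 1
            per_extension[suffix] += 1
    return per_folder, per_extension


if __name__ == "__main__":
    download_dir = Path("BAS4R")  # hypothetical extraction directory
    if download_dir.is_dir():
        folders, extensions = inventory_audio(download_dir)
        print("total audio files:", sum(folders.values()))
        for name, count in sorted(folders.items()):
            print(f"{name}: {count}")
        print(dict(extensions))
    else:
        print(f"{download_dir} not found; point download_dir at the extracted data")
```

Comparing the printed total against the stated 120,125 files, and the per-folder counts against the stated 28,830 files per spoofing category, gives a quick check that the download is complete.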
Provenance
- Source: Harvard Dataverse
- Collection Method: Speech samples collected from 200 native Bangla speakers under controlled and realistic acoustic conditions; spoofed samples generated via physical replay setups, communication channels, effect-based modifications, and signal-processing transformations.
- Time Range: Not specified
- Freshness: Last updated 2026-05-22 13:50:41; freshness should be verified.
- Geography: Ten districts of Bangladesh (Barishal, Chapainawabganj, Chittagong, Habiganj, Kishoreganj, Kushtia, Naogaon, Narail, Pabna, and Sylhet).