DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Speech & Audio

ViMedCSS: 24 Hours of Vietnamese Medical Code-Switching Speech

Name: ViMedCSS: 24 Hours of Vietnamese Medical Code-Switching Speech
Creator: shannonnonshan
Published: 2026-03-07T04:59:25
Keywords: size_categories:10K<n<100K, library:polars, library:dask, modality:audio, arxiv:2602.12911, optimized-parquet, modality:text, code-switching, library:mlcroissant, library:datasets, license:cc-by-4.0, parquet, region:us, task_categories:automatic-speech-recognition, language:vi, medical

Available on 1 platform

Description

ViMedCSS provides 24.3 hours of Vietnamese medical speech across 11,832 training utterances, developed for the LREC 2026 conference. Each recording features at least one English medical term embedded within Vietnamese speech to support code-switching automatic speech recognition (ASR).

Use Cases

Training ASR models to recognize English medical terms within Vietnamese speech
Evaluating code-switching detection algorithms using the annotated CS terms
Fine-tuning speech-to-text systems for Vietnamese healthcare environments
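
As a rough illustration of the evaluation use case above, the following sketch computes how many annotated code-switching terms an ASR hypothesis recovers. The function name `cs_term_recall` and the case-insensitive substring matching are assumptions made for this example; the listing does not specify the dataset's annotation schema or any official metric.

```python
def cs_term_recall(reference_terms, hypothesis):
    """Fraction of annotated code-switching terms found in an ASR hypothesis.

    Hypothetical helper: matching is a case-insensitive substring check,
    which is a simplification and may over-count short terms.
    """
    if not reference_terms:
        return None  # undefined when the utterance has no annotated terms
    hyp = hypothesis.lower()
    hits = sum(1 for term in reference_terms if term.lower() in hyp)
    return hits / len(reference_terms)


# Example: one of two annotated English terms appears in the transcript.
score = cs_term_recall(["MRI", "stent"], "bệnh nhân cần chụp mri")
print(score)
```

A stricter variant would match on token boundaries rather than substrings; which is appropriate depends on how the CS terms are annotated in the actual data.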

Strengths

Contains 12,314 annotated code-switching terms in the training set
24.3 hours of domain-specific audio recordings
Utterances are short, averaging 7.39 seconds each
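
The figures quoted on this page can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above (24.3 hours, 11,832 training utterances, 12,314 annotated terms); it assumes the 24.3 hours refers to the same training utterances, as the description suggests.

```python
# Figures taken from the listing above.
total_hours = 24.3
num_utterances = 11_832
num_cs_terms = 12_314

# Average utterance duration in seconds.
avg_seconds = total_hours * 3600 / num_utterances

# Average number of annotated code-switching terms per utterance.
terms_per_utterance = num_cs_terms / num_utterances

print(f"average duration: {avg_seconds:.2f} s")
print(f"CS terms per utterance: {terms_per_utterance:.2f}")
```

The computed average of about 7.39 seconds agrees with the stated duration, and roughly 1.04 terms per utterance is consistent with each recording containing at least one English medical term.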

Limitations

Small total duration of 24.3 hours compared to general ASR datasets
High domain-specific bias toward medical vocabulary

Provenance

Source: shannonnonshan, LREC 2026
Freshness: Last updated March 2026.
Geography: Vietnam

The dataset is licensed under CC BY 4.0 and distributed in Parquet format, so it can be read with Parquet-aware libraries such as Hugging Face Datasets or Polars.
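
A minimal sketch of inspecting one of the Parquet files with Polars follows. It assumes a shard has already been downloaded locally; the file name `train.parquet` and the helper `summarize_parquet` are hypothetical, and the column names are not documented on this page, so the code only reports what it finds.

```python
def summarize_parquet(path):
    """Return the row count and column names of a local Parquet file.

    Hypothetical helper for illustration. Polars is imported inside the
    function so the module can be loaded even where Polars is not installed.
    """
    import polars as pl

    df = pl.read_parquet(path)
    return {"rows": df.height, "columns": df.columns}


# Usage, assuming a locally downloaded shard (the path is a placeholder):
# info = summarize_parquet("train.parquet")
# print(info["rows"], info["columns"])
```

Because audio columns can be large, inspecting the column list first makes it easier to select only the transcript fields when the audio itself is not needed.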

Quality Score

C40

Community

14 downloads

1 like

0 views

Dataset Info

Author: shannonnonshan
Created: Mar 7, 2026
Updated: Mar 7, 2026
Last synced: Apr 29, 2026
