Name: Neapolitan Spoken Corpus: 141 Audio Sentences for ASR Benchmarking
Creator: anonymous-nsc-author
Published: 2025-07-08T02:42:27
Keywords: library:polars, modality:audio, size_categories:n<1K, modality:text, CSV, library:mlcroissant, Neapolitan, library:datasets, library:pandas, Text, license:cc-by-nc-4.0, Audio, region:us, Natural Language Processing, Audio Corpus, Low Resource, Low Resource Language, Speech Recognition

Description

Neapolitan-Spoken-Corpus (NSC) is described as the first publicly available speech corpus for benchmarking automatic speech recognition (ASR) systems on the Neapolitan dialect. It includes 141 sentence-level audio recordings, each with a gold-standard orthographic transcription. The dataset was created by anonymous-nsc-author to address the lack of computational resources for dialectological research.

Use Cases

Benchmarking ASR model performance on Neapolitan using the 141 audio recordings and their reference transcriptions.
Training or fine-tuning speech recognition models for a low-resource Romance dialect.
Conducting linguistic research on Neapolitan using the provided orthographic transcriptions.
Developing tools for dialect preservation and documentation.
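
To make the benchmarking use case concrete, the sketch below computes corpus-level word error rate (WER), a common ASR metric. It is a minimal illustration, not the corpus author's evaluation protocol. The function names are hypothetical, text normalization is limited to lowercasing and whitespace splitting (an assumption; Neapolitan orthography with apostrophes and diacritics may call for different rules), and the example strings are placeholders rather than sentences from the corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word-level edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumed normalization: lowercase and split on whitespace only.
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# Placeholder strings, not corpus data: one substitution and one deletion
# against a four-word reference.
print(corpus_wer(["w1 w2 w3 w4"], ["w1 x w3"]))
```

In practice, `references` would hold the gold-standard transcriptions and `hypotheses` the outputs of the ASR system under test, in the same order. With only 141 sentences, a single WER figure carries noticeable sampling uncertainty, so small differences between systems should be read with caution.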

Strengths

First publicly available speech corpus specifically for the Neapolitan dialect.
Includes 141 sentence-level audio recordings with gold-standard transcriptions.

Limitations

Row count is not stated in the metadata; the description implies 141 rows (one per recording), but this should be confirmed after download.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
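
Since column names and row count are undocumented, a first inspection pass after download might look like the sketch below. It uses only Python's standard `csv` module. The function name, the column names in the demo, and the demo rows are all hypothetical; the real schema of the corpus is not known from this listing.

```python
import csv


def summarize_table(lines, sample_rows=3):
    """Report column names, row count, and a few sample rows.

    `lines` is any iterable of CSV text lines, such as an open file
    or a list of strings.
    """
    reader = csv.DictReader(lines)
    samples = []
    n_rows = 0
    for row in reader:
        if n_rows < sample_rows:
            samples.append(row)
        n_rows += 1
    return {
        "columns": list(reader.fieldnames or []),
        "n_rows": n_rows,
        "samples": samples,
    }


# In-memory demo with made-up column names and values (not the real schema).
demo = [
    "file_name,transcription",
    "clip_001.wav,placeholder text one",
    "clip_002.wav,placeholder text two",
]
info = summarize_table(demo)
print(info["columns"], info["n_rows"])
```

For a downloaded file, the same function can be given an open file handle, for example one created with `open(path, newline="", encoding="utf-8")`, where `path` points at whatever CSV the download contains. Comparing the reported `n_rows` against the 141 recordings mentioned in the description is a quick consistency check.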

Provenance

Source: anonymous-nsc-author via Hugging Face
Collection Method: Likely recorded and transcribed specifically for this corpus.
Time Range: Not specified
Freshness: Last updated 2025-07-08 02:49:47; freshness should be verified.
Geography: Southern Italy (Neapolitan dialect region)

License is listed as 'cc-by-nc-4.0' on the platform, indicating a Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.
