Name: Preprocessed Reddit Text for Automatic Speech Recognition
Creator: DDSC
Published: 2022-03-02T23:29:22
Keywords: library:polars, size_categories:1M<n<10M, Preprocessed Text, modality:text, Social Media Text, library:mlcroissant, library:datasets, library:pandas, Text, Parquet, region:us, Speech Recognition, Automatic Speech Recognition

Description

Preprocessed text data sourced from Reddit, intended for training or evaluating Automatic Speech Recognition (ASR) systems. The dataset was created by DDSC and last updated on the Hugging Face platform in February 2022. Its size is indicated as between 1 million and 10 million entries.

Use Cases

Training language models for ASR on the conversational text patterns found in Reddit posts.
Adapting the language-model component of a speech recognition system to informal language using the preprocessed text.
Benchmarking ASR language models on informal text; using the text as target transcriptions for accuracy evaluation would additionally require paired audio, which the text-only modality tag does not indicate.
Analyzing the impact of text preprocessing (indicated by 'Preprocessed' in the title) on downstream ASR model performance.
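
The language-model use cases above can be illustrated with a minimal sketch: a bigram model with add-one smoothing, scored by perplexity on held-out lines. The tokenizer, the sample lines, and all function names are assumptions made for illustration; they are not taken from the dataset or from DDSC's pipeline, whose preprocessing steps are not documented.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def tokenize(line):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return [BOS] + line.lower().split() + [EOS]


def train_bigram(lines):
    """Count bigrams and their left contexts over the training lines."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for line in lines:
        tokens = tokenize(line)
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def perplexity(model, lines):
    """Per-token perplexity under add-one (Laplace) smoothing."""
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words
    log_prob, n = 0.0, 0
    for line in lines:
        tokens = tokenize(line)
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


# Hypothetical sample lines, not taken from the dataset.
train_lines = ["i think this is fine", "this is not fine", "i think so"]
model = train_bigram(train_lines)
seen = perplexity(model, ["i think this is fine"])
unseen = perplexity(model, ["fine is this think i"])
```

Lower perplexity on text that resembles the training lines (here `seen` versus `unseen`) is the signal one would look for when judging whether Reddit-style text helps a language model handle informal input. A production setup would more likely use a dedicated language-modeling toolkit than this toy counter.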

Strengths

Dataset size is categorized between 1 million and 10 million entries, providing substantial data volume.
Data is preprocessed, which may reduce initial cleaning effort for model training.
Stored in the efficient Parquet format as indicated by platform tags.

Limitations

Specific column names, data schema, and exact row count are unknown.
Data is from 2022 or earlier, potentially lacking recent linguistic trends from Reddit.
Geographic, demographic, and language coverage is unclear; a bias toward US English Reddit users is possible but unconfirmed.
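
Because the column names and schema are unknown, a first step after downloading is usually to inspect a sample of rows. The sketch below is one hedged way to do that with the standard library only: it summarizes each column and guesses which one holds the free text by average string length. The helper names and the sample rows are hypothetical. In practice the rows might come from something like `pandas.read_parquet(path).head(1000).to_dict("records")`, where `path` is a placeholder for a locally downloaded Parquet file.

```python
from collections import defaultdict


def summarize_columns(rows):
    """Return {column: (string_fraction, mean_string_length)} over row dicts."""
    totals = defaultdict(lambda: [0, 0, 0])  # seen, strings, total chars
    for row in rows:
        for name, value in row.items():
            stats = totals[name]
            stats[0] += 1
            if isinstance(value, str):
                stats[1] += 1
                stats[2] += len(value)
    return {
        name: (strings / seen, chars / strings if strings else 0.0)
        for name, (seen, strings, chars) in totals.items()
    }


def guess_text_column(rows):
    """Pick the column with the longest average string value, or None."""
    summary = summarize_columns(rows)
    candidates = {n: s for n, s in summary.items() if s[0] > 0}
    if not candidates:
        return None
    return max(candidates, key=lambda n: candidates[n][1])


# Hypothetical rows; the real column names are unknown.
sample_rows = [
    {"id": 1, "body": "this is a longer comment", "lang": "xx"},
    {"id": 2, "body": "short one", "lang": "xx"},
]
text_column = guess_text_column(sample_rows)
```

The guess is only a heuristic and should be confirmed by reading a few values by hand before any training run.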

Provenance

Source: Text content sourced from the social media platform Reddit.
Collection Method: Data was gathered and preprocessed by author DDSC; specific methods are unknown.
Freshness: Last updated in February 2022; static dataset with no stated update frequency.
Geography: Platform tags include a US region marker, which may reflect where the data is hosted rather than what it covers; actual geographic coverage is unknown.

License terms are unknown and must be verified before use. The specific preprocessing steps applied to the Reddit text are not detailed.
