Name: Balanced Toxicity Text Subset from The Pile
Creator: tomekkorbak
Published: 2022-04-17T17:38:31
Keywords: Size Categories: 10K<n<100K, Library: polars, Text Toxicity, AI Safety, Modality: text, Library: mlcroissant, Language Model Training, Library: datasets, Library: pandas, Text, Parquet, Region: us, Content Moderation

Description

A filtered subset of The Pile, focused on text with toxicity labels, curated by tomekkorbak and hosted on Hugging Face. Its size category (10K<n<100K) places it between 10,000 and 100,000 text samples; the exact count is not stated. It was last updated in April 2022. The data is intended for training and evaluating language models on toxic content.

Use Cases

Train a binary classifier to predict toxicity labels from text content.
Fine-tune a language model for controlled text generation, conditioning on toxicity features.
Analyze linguistic patterns and features associated with toxic text segments.
Benchmark the performance of detoxification or safety alignment techniques on a balanced dataset.
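
For the classifier use case, a held-out split that preserves the toxic/non-toxic ratio is a common first step. The sketch below is a minimal, standard-library illustration under stated assumptions: the field names `text` and `is_toxic` are hypothetical, since the dataset's actual column names are not documented, and `stratified_split` is an illustrative helper rather than part of any library.

```python
import random
from collections import defaultdict


def stratified_split(records, label_key, test_fraction=0.2, seed=0):
    """Split records into (train, test) lists, keeping each label's share equal.

    `records` is a list of dicts; `label_key` names the label field
    (hypothetical here, because the real schema is undocumented).
    """
    rng = random.Random(seed)

    # Group records by label so each class is sampled separately.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    train, test = [], []
    for label in sorted(by_label, key=repr):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so the classes are interleaved in the output lists.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

With a balanced input, both output lists stay balanced, which keeps accuracy on the test split easy to interpret.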

Strengths

Contains between 10,000 and 100,000 text samples, per the '10K<n<100K' size category.
Data is specifically filtered and balanced for toxicity, addressing class imbalance common in such tasks.
Derived from The Pile, a known large-scale, diverse text corpus for language modeling.
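
Because the labeling scheme is not documented, the balance claim is worth checking after download. The following is a minimal sketch, assuming the labels have already been extracted into a plain Python list; the helper names, the binary encoding, and the 5-point tolerance are illustrative assumptions, not properties of the dataset.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of samples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels provided")
    return {label: n / total for label, n in counts.items()}


def is_balanced(labels, tolerance=0.05):
    """True if every class share is within `tolerance` of an even split."""
    shares = label_shares(labels)
    if len(shares) < 2:
        # A single class cannot be called balanced.
        return False
    expected = 1 / len(shares)
    return all(abs(share - expected) <= tolerance for share in shares.values())
```

For example, `is_balanced([0, 1] * 50)` is true, while a 90/10 split fails the check.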

Limitations

Specific column names and the exact toxicity labeling methodology are not documented.
The dataset's last update was in 2022, potentially missing newer linguistic trends or toxicity forms.
Geographic and demographic coverage is unclear, likely inheriting biases from its source corpus.

Provenance

Source: Filtered from The Pile, a large-scale text corpus.
Collection Method: Curated and filtered by author tomekkorbak; specific filtering criteria unknown.
Freshness: Last updated 2022-04-17; no stated update frequency.
Geography: Not documented. The platform's "region:us" tag most likely describes where the repository is hosted rather than where the content originates.

License terms are unspecified; users must verify permissible use. The dataset is stored in Parquet format and can be loaded with Parquet-capable libraries such as Polars or pandas.
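
Since the column names are undocumented, a reasonable first step is to load the Parquet files and inspect the schema. The sketch below assumes the files have already been downloaded to a local directory (the directory layout and the helper names `find_parquet_files` and `load_dataset_frame` are hypothetical). It relies on `pandas.read_parquet`, which needs a Parquet engine such as pyarrow installed.

```python
from pathlib import Path


def find_parquet_files(data_dir):
    """Return all .parquet files under data_dir, sorted for a stable order."""
    return sorted(Path(data_dir).rglob("*.parquet"))


def load_dataset_frame(data_dir):
    """Read every Parquet file under data_dir into one pandas DataFrame."""
    # Imported lazily so file discovery works without pandas installed.
    import pandas as pd

    files = find_parquet_files(data_dir)
    if not files:
        raise FileNotFoundError(f"no .parquet files found under {data_dir}")
    frames = [pd.read_parquet(path) for path in files]
    return pd.concat(frames, ignore_index=True)
```

After calling `df = load_dataset_frame(...)` on the download directory, printing `df.dtypes` and `len(df)` reveals the actual column names, their types, and the exact sample count.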

