Name: TestiMole: A Multi-Billion Token Italian Text Corpus from Web Scraping
Creator: mrinaldi
Published: 2024-05-14T11:53:07
Keywords: Web Scraping, Conversational Data, Text, Italian Language, Large Scale, Natural Language Processing, Text Corpus

Description

TestiMole is a large Italian text corpus of almost 100 billion tokens, as counted with the Tiktoken cl100k BPE tokenizer. It was created through a massive web scraping effort and is one of the largest publicly available datasets for the Italian language as of June 2024. The dataset, authored by mrinaldi, consists mainly of conversational data from sources such as Italian Usenet hierarchies and message boards.
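
Because the token figure depends on the tokenizer used, a count on a sample of the data may be worth reproducing before relying on it. The following is a minimal sketch, not the author's counting procedure: `count_tokens` and the sample strings are hypothetical, and the tokenizer is passed in as a callable so the code runs with the standard library alone.

```python
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> int:
    """Sum the number of tokens that `encode` produces over all texts."""
    return sum(len(encode(text)) for text in texts)


# Hypothetical sample strings; these are not taken from the dataset.
sample = ["Ciao a tutti, come va?", "Qualcuno ha letto il thread di ieri?"]

# Whitespace splitting is only a stand-in tokenizer so the sketch runs anywhere.
whitespace_total = count_tokens(sample, str.split)
print(whitespace_total)
```

Assuming the `tiktoken` package is installed, passing `tiktoken.get_encoding("cl100k_base").encode` in place of `str.split` should yield counts comparable to the figure quoted in the description.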

Use Cases

Training large language models for Italian, for which the multi-billion token scale is well suited.
Analyzing conversational patterns in Italian using the Usenet and message board data.
Benchmarking text processing and tokenization tools on a large, web-scraped Italian corpus.
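
For the benchmarking use case, a throughput measurement could be sketched roughly as follows. This is an illustration under stated assumptions: `tokens_per_second` is a hypothetical helper, the documents are made-up strings rather than dataset records, and `str.split` stands in for whichever tokenizer is actually under test.

```python
import time
from typing import Callable, Iterable, Sequence, Tuple


def tokens_per_second(
    texts: Iterable[str], encode: Callable[[str], Sequence]
) -> Tuple[int, float]:
    """Return (total tokens, tokens per second) for one pass over `texts`."""
    start = time.perf_counter()
    total = sum(len(encode(text)) for text in texts)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the timer on very small inputs.
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate


# Hypothetical documents standing in for corpus records.
docs = ["Buongiorno a tutti"] * 100
total, rate = tokens_per_second(docs, str.split)
print(f"{total} tokens, {rate:.0f} tokens/s")
```

A single pass like this is noisy; repeating the measurement and comparing medians would give a steadier number.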

Strengths

Almost 100 billion tokens, making it one of the largest public Italian text resources.
Consists mainly of conversational data from specific sources like Italian Usenet and message boards.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
The dataset page lists a last-updated date of 2026-06-14, more than two years after the 2024-05-14 publication date, so the freshness metadata may be inaccurate and should be verified.
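
Since field semantics have to be inferred after download, a first step could be to tally which keys and value types appear in a sample of records. The sketch below assumes the data can be read as JSON Lines, which the dataset page does not confirm; `summarize_fields` and the example records are hypothetical.

```python
import io
import json
from collections import Counter


def summarize_fields(lines, limit=1000):
    """Count (key, value type) pairs over the first `limit` lines of JSON records."""
    seen = Counter()
    for i, line in enumerate(lines):
        if i >= limit:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            continue  # only top-level JSON objects are summarized
        for key, value in record.items():
            seen[(key, type(value).__name__)] += 1
    return seen


# Hypothetical records; the real field names are undocumented.
fake_file = io.StringIO(
    '{"text": "Ciao", "source": "usenet"}\n'
    '{"text": "Salve", "source": "forum", "year": 2010}\n'
)
summary = summarize_fields(fake_file)
for (key, type_name), count in sorted(summary.items()):
    print(key, type_name, count)
```

If the files turn out to use another format, only the parsing line would need to change; the tallying idea stays the same.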

Provenance

Source: Web scraping of Italian Usenet hierarchies and message boards.
Collection Method: Massive web scraping effort.
Freshness: Last updated 2026-06-14 10:08:13; freshness should be verified.
Geography: Italy (inferred from language)

License is unknown; users must verify terms of use before downloading.
