TestiMole is a large Italian-language text dataset containing almost 100 billion tokens, as counted with the Tiktoken cl100k BPE tokenizer. It was created through large-scale web scraping and, as of June 2024, is one of the largest publicly available datasets for Italian. The dataset, authored by mrinaldi, consists mainly of conversational data from sources such as Italian Usenet hierarchies and message boards.
The license is unknown; users must verify the terms of use before downloading.
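The token count quoted above depends on the tokenizer, so a figure measured with a different one would not be directly comparable. The following is a minimal sketch of how such a count could be reproduced on a sample of text. It is not the author's actual counting script: the helper names `count_tokens` and `cl100k_encoder` are hypothetical, and it assumes that "cl100k" refers to the `cl100k_base` encoding shipped with the `tiktoken` package.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Sum the number of tokens produced by `encode` over all texts."""
    return sum(len(encode(text)) for text in texts)


def cl100k_encoder() -> Callable[[str], List[int]]:
    """Build an encoder backed by tiktoken's cl100k_base encoding.

    Assumption: the dataset's "cl100k" means cl100k_base.
    Requires `pip install tiktoken`; the import is deferred so that
    count_tokens stays usable with any other tokenizer.
    """
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() makes strings such as "<|endoftext|>" encode as
    # plain text instead of raising, which can matter for scraped posts.
    return lambda text: enc.encode(text, disallowed_special=())
```

With `tiktoken` installed, `count_tokens(posts, cl100k_encoder())` returns the total for any iterable of strings `posts`. Passing the encoder as an argument keeps the counting logic independent of the tokenizer, which makes it easy to compare counts across tokenizers.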