Name: Fineweb URLs: Source URLs and Domains for LLM Training Data
Creator: nhagar
Published: 2025-04-21T21:06:38
Keywords: Task Category: text-generation, Library: polars, Library: dask, Language: en, Text Generation, Modality: text, Web Data, Library: mlcroissant, Library: datasets, DOI: 10.57967/hf/5441, Tabular, Parquet, URL Extraction, Size Category: 10B<n<100B, Region: us, LLM Training, License: odc-by

Description

A dataset published by nhagar on April 21, 2025 and last updated on May 15, 2025. It provides the URLs and top-level domains associated with training records in the HuggingFaceFW/fineweb dataset. It was built by downloading the source data, extracting URLs and domains, and retaining only those identifiers, with the aim of making LLM training datasets more accessible to explore.

Use Cases

Analyze the distribution of source domains in a large-scale text corpus based on the extracted top-level domains.
Study the provenance and web source composition of LLM training data based on the provided URLs.
Filter or subset a larger text dataset based on specific source domains using the URL identifiers.
Investigate potential data quality or bias by examining the types of websites included in the training set.
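
The first and third use cases above can be sketched with the standard library alone. This is a minimal illustration, not the dataset's documented interface: the field names "url" and "domain" and the sample rows are assumptions, since the card does not document the columns.

```python
from collections import Counter

# Hypothetical rows; the field names "url" and "domain" are assumptions,
# not documented column names.
rows = [
    {"url": "https://example.com/a", "domain": "example.com"},
    {"url": "https://example.com/b", "domain": "example.com"},
    {"url": "https://example.org/c", "domain": "example.org"},
]


def domain_counts(rows):
    """Count how many records come from each domain."""
    return Counter(row["domain"] for row in rows)


def filter_by_domain(rows, allowed):
    """Keep only the records whose domain is in the allowed set."""
    allowed = set(allowed)
    return [row for row in rows if row["domain"] in allowed]


counts = domain_counts(rows)
top = counts.most_common(1)                       # most frequent domain first
subset = filter_by_domain(rows, ["example.org"])  # subset by source domain
```

At the scale suggested by the size tag, the same counting and filtering would more plausibly be done lazily with a library such as polars or dask, both of which appear in the dataset's tags.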

Strengths

Part of a curated collection explicitly designed to make exploring LLM training datasets more straightforward.
Dataset was last updated on 2025-05-15, indicating recent maintenance.
Platform tags categorize the dataset for text generation and place it in the 10B to 100B size category, marking it as a large-scale resource.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
An exact row count is not stated. The file format (Parquet), license (ODC-By), and size category (10B to 100B) are known only from platform tags and should be confirmed before use.

Provenance

Source: HuggingFaceFW/fineweb
Collection Method: Created by downloading source data and extracting URLs and top-level domains.
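
The exact extraction logic is not documented here. As a rough sketch of what URL and domain extraction can look like, the helpers below (hypothetical names, not taken from the dataset's code) pull the host out of a URL and take its last label as a naive top-level domain. Resolving registered domains under multi-part suffixes such as "co.uk" would need a public suffix list, which this sketch does not use.

```python
from urllib.parse import urlsplit


def extract_host(url):
    """Return the lowercased host of a URL, or "" if none can be parsed."""
    host = urlsplit(url).hostname  # drops the port and lowercases the host
    return host or ""


def naive_tld(host):
    """Return the last dot-separated label of a host (an assumption-laden
    stand-in for top-level domain extraction)."""
    return host.rsplit(".", 1)[-1] if "." in host else ""


host = extract_host("https://Blog.Example.com:8080/post?id=1")
tld = naive_tld(host)
```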
Time Range: Not specified.
Freshness: Last updated 2025-05-15 05:03:39.
Geography: Not specified.
