This dataset, created by nhagar on May 15, 2025, provides the URLs and top-level domains associated with training records in the HuggingFaceFW/fineweb dataset. It was built by downloading the source data, extracting the URL and domain of each record, and retaining only those identifiers, with the aim of making LLM training datasets easier to explore.
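The extraction step described above can be illustrated with a minimal sketch. It is not the author's actual pipeline: the record layout (a dict with `text` and `url` keys), the output key `domain`, and the helper name `extract_identifiers` are all assumptions made for illustration. The sketch also treats the full hostname as the domain; reducing it to a registered domain would need a public-suffix-aware library.

```python
from urllib.parse import urlparse


def extract_identifiers(record):
    """Keep only the URL and its hostname, dropping every other field.

    Assumption: `record` is a dict with a "url" key (hypothetical field name).
    """
    url = record["url"]
    # urlparse(...).hostname is lowercased and excludes any port or credentials;
    # it is None when the string has no network location.
    host = urlparse(url).hostname or ""
    return {"url": url, "domain": host}


# Hypothetical source records standing in for the downloaded data.
records = [
    {"text": "some page text", "url": "https://www.example.com/a"},
    {"text": "other page text", "url": "http://blog.example.org/post?id=1"},
]

identifiers = [extract_identifiers(r) for r in records]
```

Only the two identifier fields survive, which is what keeps the derived dataset far smaller than the text corpus it describes.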
Use Cases
- Analyze the distribution of source domains in a large-scale text corpus based on the extracted top-level domains.
- Study the provenance and web source composition of LLM training data based on the provided URLs.
- Filter or subset a larger text dataset based on specific source domains using the URL identifiers.
- Investigate potential data quality or bias by examining the types of websites included in the training set.
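As a rough sketch of the first and third use cases, the snippet below counts top-level domains and subsets rows by hostname using only the Python standard library. The sample URLs, the `url` column name, and the helper names are assumptions, since the dataset's actual column names are not documented. Taking the last hostname label as the top-level domain is a simplification: it yields `uk` rather than `co.uk`, for example.

```python
from collections import Counter
from urllib.parse import urlparse


def top_level_domain(url):
    """Return the last label of the hostname (e.g. "com"), or "" if there is none."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def tld_distribution(urls):
    """Count how often each top-level domain occurs in an iterable of URLs."""
    return Counter(top_level_domain(u) for u in urls)


def filter_by_domain(rows, allowed_domains, url_key="url"):
    """Keep rows whose hostname exactly matches one of `allowed_domains`.

    Assumption: each row is a dict and `url_key` names its URL column.
    """
    allowed = {d.lower() for d in allowed_domains}
    return [r for r in rows if (urlparse(r[url_key]).hostname or "") in allowed]


# Hypothetical sample standing in for rows read from the dataset.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/c",
    "https://news.example.co.uk/d",
]

counts = tld_distribution(urls)
subset = filter_by_domain([{"url": u} for u in urls], {"example.com"})
```

For a corpus of this size the same functions would normally be applied to a streamed or sampled slice rather than to the full set of rows held in memory.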
Strengths
- Part of a curated collection explicitly designed to make exploring LLM training datasets more straightforward.
- Recently published: created and last updated on 2025-05-15.
- Platform tags categorize the dataset under text generation and mark it as a large-scale resource (size category: 10B-100B).
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Exact row count, file formats, and license information are not stated, which makes suitability harder to assess.
Provenance
- Source: HuggingFaceFW/fineweb
- Collection Method: Created by downloading source data and extracting URLs and top-level domains.
- Time Range: not specified
- Freshness: Last updated 2025-05-15 05:03:39.
- Geography: not specified