Part of uv-scripts, a collection of self-contained scripts that run locally or on Hugging Face Jobs. This script removes duplicate and near-duplicate text samples from a Hugging Face dataset using SemHash with Model2Vec embeddings; the pipeline is CPU-optimized and requires no GPU. The dataset page was last updated on 2026-06-05.
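To make the idea of near-duplicate removal concrete, here is a minimal sketch in plain Python. It is not the script's code and does not use the SemHash or Model2Vec APIs: the character n-gram `embed` function is a toy stand-in for a real embedding model, and the names `embed`, `cosine`, `deduplicate`, and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding (assumption): counts of character n-grams.

    A real pipeline would use learned embeddings such as Model2Vec.
    """
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return Counter(
        normalized[i:i + n] for i in range(max(len(normalized) - n + 1, 1))
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_vectors = [], []
    for text in texts:
        vector = embed(text)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


samples = [
    "The cat sat on the mat.",
    "The cat sat on the mat!",            # near-duplicate (punctuation only)
    "the  cat sat on the mat.",           # duplicate after normalization
    "Stock markets fell sharply today.",  # unrelated
]
unique = deduplicate(samples)
print(unique)
```

The pairwise loop is quadratic in the number of kept samples, which is fine for a demonstration but not for large datasets; dedicated tools typically rely on approximate nearest-neighbor search instead.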
Use Cases
- Use semantic similarity detection to clean training data and prevent train/test leakage.
- Prepare text corpora for model training by removing redundant samples.
- Run a CPU-optimized deduplication pipeline on Hugging Face Jobs infrastructure.
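The train/test leakage use case can be sketched as a cross-split filter: drop every training sample that is too similar to any test sample. This is an illustrative sketch only, not the script's implementation. The bag-of-words `embed` function stands in for a real embedding model, and `drop_leaked` and the 0.8 threshold are hypothetical names and values.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): counts of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_leaked(train, test, threshold=0.8):
    """Return the training samples that are not near-duplicates of any test sample."""
    test_vectors = [embed(sample) for sample in test]
    return [
        sample
        for sample in train
        if all(cosine(embed(sample), tv) < threshold for tv in test_vectors)
    ]


train = [
    "How do I reset my password?",
    "What is the capital of France?",
    "Explain gradient descent briefly.",
]
test = ["how do i reset my password"]

clean_train = drop_leaked(train, test)
print(clean_train)
```

Filtering the training split rather than the test split keeps the evaluation set unchanged, so scores stay comparable across runs.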
Strengths
- The script is CPU-optimized, so no GPU resources are needed.
- It is part of the uv-scripts collection, whose scripts are self-contained and designed for one-command execution.
- Last updated on 2026-06-05, indicating recent maintenance.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count and dataset size are unknown, which makes suitability hard to assess before download.
Provenance
- Source: huggingface
- Collection Method: Script for processing datasets hosted on Hugging Face.
- Time Range: null
- Freshness: Last updated 2026-06-05 13:26:03.
- Geography: null