50 million (query, document) pairs uniformly sampled from the 'lightonai/embeddings-pre-training-curated' corpus. The dataset was created by author 'capemox' and was last updated on the Hugging Face platform on 2026-05-29. Pairs were sampled proportionally from 34 source subsets using uniform Bernoulli sampling with seed 42: each pair is kept independently with the same probability, so subsets are represented roughly in proportion to their size.
Use Cases
- Pre-training dense retrieval models on large-scale query-document pairs.
- Benchmarking embedding model performance across a full quality range, including hard positives.
- Studying sampling strategies for embedding pre-training, using the described proportional, uniform Bernoulli method as a reference point.
Strengths
- Contains 50 million query-document pairs, providing substantial scale for model training.
- Sampled from 34 distinct source subsets, suggesting diversity in content origins.
- Uses a reproducible sampling strategy with a fixed seed (42), aiding experiment replication.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
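- Exact row count is unconfirmed: the description states 50 million pairs, but Bernoulli sampling yields a random sample size, so the actual count should be checked when assessing suitability.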
- Description metadata is limited; actual data quality requires manual inspection after download.
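Because the column semantics have to be inferred after download, a first pass over a handful of rows can help. The following is a minimal sketch, not a documented workflow for this dataset: the helper name `summarize_fields` and the column names `query` and `document` are assumptions made for illustration, and the real schema may differ.

```python
from collections import defaultdict
from itertools import islice


def summarize_fields(rows, limit=1000):
    """Report, per field, the observed value types and mean string length.

    `rows` is any iterable of dict-like records; only the first `limit`
    records are read, so this also works on a streamed dataset.
    """
    types = defaultdict(set)
    lengths = defaultdict(list)
    for row in islice(rows, limit):
        for key, value in row.items():
            types[key].add(type(value).__name__)
            if isinstance(value, str):
                lengths[key].append(len(value))
    return {
        key: {
            "types": sorted(types[key]),
            "mean_chars": (
                sum(lengths[key]) / len(lengths[key]) if lengths[key] else None
            ),
        }
        for key in types
    }


# Hypothetical rows: the real column names are undocumented and may differ.
rows = [
    {"query": "what is bernoulli sampling",
     "document": "Bernoulli sampling keeps each item independently."},
    {"query": "dense retrieval",
     "document": "Dense retrieval encodes queries and documents as vectors."},
]

summary = summarize_fields(rows)
for field, stats in summary.items():
    print(field, stats)
```

In practice the rows would come from the downloaded data rather than a literal list, for example from an iterable returned by the Hugging Face `datasets` library in streaming mode. The repository ID is not stated in this description, so it is left out here.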
Provenance
- Source: sampled from the 'lightonai/embeddings-pre-training-curated' corpus on Hugging Face.
- Collection Method: uniform Bernoulli sampling (seed 42) applied proportionally across 34 source subsets.
- Freshness: last updated 2026-05-29 11:38:30; freshness should be verified.
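The collection method named above can be illustrated on a toy corpus. This is a sketch under stated assumptions, not the author's actual pipeline: the subset names, subset sizes, sampling rate, and the use of Python's `random` module are all hypothetical, and the description does not say whether sampling ran over the pooled corpus or per subset. With one shared rate the two are equivalent in expectation.

```python
import random
from collections import Counter


def bernoulli_sample(records, rate, seed=42):
    """Keep each record independently with probability `rate`.

    A fixed seed makes the selection reproducible for the same input order.
    """
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]


# Toy corpus: three hypothetical subsets of different sizes.
corpus = (
    [("subset_a", i) for i in range(6000)]
    + [("subset_b", i) for i in range(3000)]
    + [("subset_c", i) for i in range(1000)]
)

sample = bernoulli_sample(corpus, rate=0.1, seed=42)
counts = Counter(name for name, _ in sample)

# The sample size is random, close to rate * len(corpus) but not fixed,
# and per-subset counts follow the subset sizes only approximately.
print(len(sample), dict(counts))
```

Two properties follow from this design. Rerunning with the same seed and input order reproduces the same sample, and the total count is only approximately the target, which is why the realized row count is worth checking directly.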