Name: Denseon Pretrain 50M Balanced: 50 Million Query-Document Pairs
Creator: capemox
Published: 2026-05-30T12:41:20
Keywords: Text, Query Document Pairs, Large Scale, Text Embeddings, Balanced Sampling, Pretraining Data

Description

50 million (query, document) text pairs sampled from 34 source subsets using balanced temperature sampling. The dataset was created by capemox and last updated on Hugging Face in May 2026. Pairs are allocated to each subset in proportion to the square root of its size (temperature T=2), with surplus allocation redistributed iteratively.
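The allocation rule can be illustrated with a short sketch. This is not the author's code: the function name `allocate`, the toy subset sizes, and the reading of "surplus" as the part of a quota that exceeds a subset's size are all assumptions made for illustration. Each subset's weight is its size raised to 1/T (the square root for T=2); any subset whose quota would exceed its size is capped at its full size, and the freed budget is re-split among the remaining subsets.

```python
def allocate(sizes, total, temperature=2.0):
    """Split `total` samples across sources with weights size ** (1 / temperature).

    Hypothetical helper: capped sources keep all their rows and the surplus
    is redistributed to the remaining sources, repeating until no quota
    exceeds a source's size.
    """
    if total > sum(sizes.values()):
        raise ValueError("total exceeds the number of available pairs")

    alloc = {}
    active = dict(sizes)   # sources that have not been capped yet
    remaining = total      # budget still to be distributed

    while active:
        weights = {k: n ** (1.0 / temperature) for k, n in active.items()}
        wsum = sum(weights.values())
        capped = [k for k in active if remaining * weights[k] / wsum >= active[k]]
        if not capped:
            break
        for k in capped:
            alloc[k] = active.pop(k)   # take the whole source
            remaining -= alloc[k]

    if active:
        # Final proportional split, rounded with the largest-remainder method
        # so that the allocations sum exactly to `total`.
        shares = {k: remaining * weights[k] / wsum for k in active}
        for k in active:
            alloc[k] = int(shares[k])
        leftover = remaining - sum(int(s) for s in shares.values())
        by_fraction = sorted(active, key=lambda k: shares[k] - int(shares[k]), reverse=True)
        for k in by_fraction[:leftover]:
            alloc[k] += 1

    return alloc


# Toy sizes, not the real subset sizes of this dataset.
toy_sizes = {"tiny": 4, "mid": 400, "big": 40000}
toy_alloc = allocate(toy_sizes, total=2000)
```

In the toy example the smallest subset is capped at its full size, and the other two share the rest in proportion to the square roots of their sizes, so the large subset dominates less than it would under size-proportional sampling.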

Use Cases

Train dense retrieval models on large-scale query-document pairs.
Fine-tune text embedding models using balanced, multi-source pretraining data.
Benchmark sampling strategies for large-scale text-pair datasets against the described temperature weighting method.
Pretrain contrastive learning models for semantic search using the (query, document) pair structure.

Strengths

Contains 50 million text pairs, providing substantial scale for pretraining.
Sampling strategy balances representation across 34 distinct source subsets.
Uses a defined temperature weighting (T=2) and iterative redistribution method for allocation.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The exact row count is not documented beyond the stated 50 million pairs, which may limit suitability assessment.
Data may reflect source composition bias inherent to the original curated collection.

Provenance

Source: Sampled from lightonai/embeddings-pre-training-curated.
Collection Method: Pairs sampled uniformly at random, with the number drawn from each subset set by T=2 temperature weighting.
Freshness: Last updated 2026-05-30 12:48:55; freshness should be verified.
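
The collection method above can be sketched as follows. This is only an illustration under stated assumptions: the helper `sample_indices`, the fixed seed, the toy sizes and quotas, and the choice to sample without replacement are hypothetical, since the source does not document these details.

```python
import random


def sample_indices(sizes, allocation, seed=0):
    """Draw row indices uniformly at random within each subset.

    Hypothetical helper: `sizes` maps subset name to row count and
    `allocation` maps subset name to the number of rows to draw.
    Sampling is without replacement (an assumption).
    """
    rng = random.Random(seed)
    picks = {}
    for name, k in allocation.items():
        # random.sample over a range avoids materializing every index.
        picks[name] = sorted(rng.sample(range(sizes[name]), k))
    return picks


# Toy numbers, not the real subset sizes or quotas of this dataset.
example_sizes = {"subset_a": 1000, "subset_b": 50}
example_allocation = {"subset_a": 30, "subset_b": 10}
example_picks = sample_indices(example_sizes, example_allocation)
```

Fixing the seed makes the draw reproducible, which helps when comparing sampling strategies on the same source collection.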

License is unknown; terms of use must be verified before application.
