Description
A dataset by Issactoto comparing coreset sampling against random sampling. The comparison reports a count of over 1.5 billion items together with the mean, median, and standard deviation of similarity scores. The dataset was last updated on May 12, 2026.
Use Cases
Compare sampling algorithm performance based on reported similarity metrics
Analyze dataset diversity based on the lower mean and median similarity scores from coreset sampling
Evaluate coverage stability across samples based on the reported lower standard deviation of similarity
Benchmark coreset sampling techniques against random baselines
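The kind of comparison these use cases describe can be sketched on synthetic data. This is a minimal illustration, not the dataset author's method: the card does not say how similarity was measured or which coreset algorithm was used, so cosine similarity, greedy k-center selection, and the names `pairwise_similarity_stats` and `k_center_greedy` are all assumptions introduced here.

```python
import math
import random
import statistics


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def pairwise_similarity_stats(vectors):
    """Count, mean, median, and population std of all pairwise similarities."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return {
        "count": len(sims),
        "mean": statistics.fmean(sims),
        "median": statistics.median(sims),
        "std": statistics.pstdev(sims),
    }


def k_center_greedy(vectors, k, seed=0):
    """Assumed coreset strategy: repeatedly add the point that is least
    similar to everything already chosen (greedy k-center)."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(vectors))]
    # best_sim[i] is the highest similarity between point i and any chosen point.
    best_sim = [cosine(v, vectors[chosen[0]]) for v in vectors]
    while len(chosen) < k:
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        nxt = min(candidates, key=lambda i: best_sim[i])
        chosen.append(nxt)
        for i, v in enumerate(vectors):
            best_sim[i] = max(best_sim[i], cosine(v, vectors[nxt]))
    return [vectors[i] for i in chosen]


# Synthetic pool of 200 random 8-dimensional vectors (placeholder data only).
rng = random.Random(42)
pool = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]

random_sample = rng.sample(pool, 20)
coreset_sample = k_center_greedy(pool, 20)

random_stats = pairwise_similarity_stats(random_sample)
coreset_stats = pairwise_similarity_stats(coreset_sample)

for label, stats in (("random", random_stats), ("coreset", coreset_stats)):
    print(label, {key: round(value, 4) for key, value in stats.items()})
```

Lower mean and median similarity in the selected sample would indicate a more diverse selection, which is the reading the use cases above rely on. The numbers this sketch prints come from synthetic vectors and say nothing about the dataset's own figures.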
Strengths
Direct comparison of coreset vs random sampling with over 1.5 billion items counted
Reports multiple similarity metrics (coreset vs random): mean (0.4392 vs 0.5999), median (0.4416 vs 0.6087), and standard deviation (0.1168 vs 0.1565)
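To put the reported figures on a common scale, the relative gap between the two strategies can be computed directly. The sketch below assumes the first value in each pair belongs to coreset sampling and the second to random sampling, which matches the card's statement that coreset sampling yields the lower scores; `relative_reduction` is a helper name introduced here for illustration.

```python
# Reported values from the card, as (coreset, random) pairs.
# Assumption: the lower (first) value in each pair is the coreset figure.
reported = {
    "mean": (0.4392, 0.5999),
    "median": (0.4416, 0.6087),
    "std": (0.1168, 0.1565),
}


def relative_reduction(coreset, random_baseline):
    """Fraction by which the coreset value falls below the random baseline."""
    return (random_baseline - coreset) / random_baseline


reductions = {
    name: relative_reduction(coreset, random_baseline)
    for name, (coreset, random_baseline) in reported.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower under coreset sampling")
```

Under that assumption, all three metrics are roughly 25 to 27 percent lower for coreset sampling than for the random baseline.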
Limitations
Description metadata is limited; actual data quality requires manual inspection after download
Column-level documentation is absent; field semantics must be inferred after download
Row count is unknown, which may limit suitability assessment
Provenance
Source
Hugging Face
Collection Method
Likely involves sampling from a larger dataset, but the specific gathering method is not detailed.
Freshness
Last updated 2026-05-12 00:25:10; freshness should be verified
License
License is unknown; usage restrictions must be verified.