Korean Visual Document Retrieval Hard Negatives is a multimodal training set for fine-tuning embedding models. The dataset, created by whybe-choi, was last updated on 2026-04-25. Each row contains a text query, a page image document, one positive match, and seven mined hard negatives.
Use Cases
- Fine-tune visual-document retrieval models that match Korean text queries to page images.
- Improve model ranking performance using the provided hard negative examples.
- Benchmark cross-modal retrieval systems for Korean document pages.
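
One common way to use a positive plus a fixed set of hard negatives in fine-tuning is an InfoNCE-style contrastive loss. The sketch below is illustrative only and is not taken from the dataset card: the function name `info_nce_loss`, the temperature value, and the assumption that embeddings arrive as NumPy arrays are all hypothetical choices.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Contrastive loss for one query with one positive and K hard negatives.

    query:     (d,)   embedding of the text query
    positive:  (d,)   embedding of the matching page image
    negatives: (K, d) embeddings of the hard-negative page images
    The temperature default is an arbitrary illustrative value.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(np.asarray(query, dtype=float))
    # Stack the positive first so that index 0 is the target class.
    candidates = normalize(
        np.vstack([np.asarray(positive, dtype=float)[None, :],
                   np.asarray(negatives, dtype=float)])
    )
    logits = candidates @ q / temperature      # cosine similarities, shape (K+1,)
    logits = logits - logits.max()             # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])                # negative log-likelihood of the positive

# Random stand-in embeddings: one query, one positive, seven negatives.
rng = np.random.default_rng(0)
loss = info_nce_loss(rng.normal(size=16), rng.normal(size=16), rng.normal(size=(7, 16)))
print(f"loss = {loss:.4f}")
```

The loss is low when the query is much closer to the positive than to any of the seven negatives, and it grows as a negative overtakes the positive, which is why harder negatives give a stronger training signal.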
Strengths
- Includes seven hard negative examples per query, which are useful for training robust retrieval models.
- Hard negatives were mined using the Qwen/Qwen3-VL-Embedding-8B model, so negatives were selected by an embedding model rather than sampled at random.
- Positives sharing the same query within the same source dataset were excluded from the negative pool, which lowers the risk of false negatives (pages that answer the query but are labeled as negatives).
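
The mining step described in the list above can be sketched as a similarity search that masks out known positives before taking the top seven candidates. This is a minimal illustration under stated assumptions, not the dataset author's pipeline: the function name `mine_hard_negatives` is hypothetical, cosine similarity is an assumed scoring choice, and plain NumPy arrays stand in for embeddings that would come from Qwen/Qwen3-VL-Embedding-8B.

```python
import numpy as np

def mine_hard_negatives(query_emb, doc_embs, positive_ids, k=7):
    """Return indices of the k most similar non-positive documents.

    query_emb:    (d,)   query embedding
    doc_embs:     (N, d) embeddings of all candidate pages in one source dataset
    positive_ids: indices of every page labeled positive for this query
    """
    q = np.asarray(query_emb, dtype=float)
    docs = np.asarray(doc_embs, dtype=float)
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    scores = docs @ q  # cosine similarity of each page to the query
    # Remove every known positive from the negative pool.
    mask = np.asarray(list(positive_ids), dtype=int)
    scores[mask] = -np.inf

    # Sort by descending similarity; a stable sort keeps ties in index order.
    order = np.argsort(-scores, kind="stable")
    # Skip masked entries in case fewer than k non-positive pages exist.
    return [int(i) for i in order if np.isfinite(scores[i])][:k]
```

Masking before ranking matters: a page that is a valid answer to the same query would otherwise score very high and be picked as a "negative", which would teach the model to push a correct match away.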
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Row count is not stated, so it is hard to judge before download whether the set is large enough for a given fine-tuning run.
- Column-level documentation is absent; field semantics must be inferred after download.
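
Because field semantics have to be inferred after download, a quick first step could be to list the field names and value types seen in a handful of rows. The helper below is a hypothetical sketch: `summarize_fields` is not part of any library, and the sample rows and their field names are invented placeholders, not the dataset's real schema.

```python
from itertools import islice

def summarize_fields(rows, sample_size=5):
    """Map each field name to the sorted Python type names seen in the first rows.

    rows: any iterable of dict-like records.
    """
    seen = {}
    for row in islice(rows, sample_size):
        for name, value in row.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}

# Made-up rows standing in for real data; the field names are placeholders.
sample = [
    {"query": "예시 질의", "negatives": ["page_a", "page_b"]},
    {"query": "다른 질의", "negatives": None},
]
print(summarize_fields(sample))
```

With the Hugging Face `datasets` library, rows streamed via `load_dataset(..., streaming=True)` are yielded as dictionaries and could be passed to this helper directly; the repository ID and split name are not given here and would need to be looked up.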
Provenance
- Source: Hugging Face
- Collection Method: Hard negatives were mined with Qwen/Qwen3-VL-Embedding-8B within each source dataset.
- Freshness: Last updated 2026-04-25 13:31:56; freshness should be verified.