This training set, released by vidore in 2024, contains 127,460 query-image pairs for visual document retrieval. About 63% of the pairs come from academic datasets such as DocVQA; the remaining 37% are synthetic, built from PDF pages paired with pseudo-questions generated by Claude-3 Sonnet.
The set is intended for training the ColPali retrieval architecture; see arXiv 2407.01449 for the training methodology.
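To give a rough sense of scale, the percentages above can be turned into approximate pair counts. This is a minimal sketch: the names `TOTAL_PAIRS`, `SHARES`, and `approximate_counts` are illustrative and not part of the dataset, and because the published shares are rounded to whole percents, the resulting counts are estimates rather than exact split sizes.

```python
# Approximate per-source pair counts from the stated total and rounded shares.
TOTAL_PAIRS = 127_460
SHARES = {"academic": 0.63, "synthetic": 0.37}  # rounded figures from the description


def approximate_counts(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Return the estimated number of pairs per source, rounded to the nearest integer."""
    return {name: round(total * share) for name, share in shares.items()}


counts = approximate_counts(TOTAL_PAIRS, SHARES)
for name, n in counts.items():
    print(f"{name}: ~{n:,} pairs")
```

Under these assumptions, the academic portion is on the order of 80,000 pairs and the synthetic portion on the order of 47,000.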