Vision-language document retrieval training pairs, converted from vidore/colpali_train_set into a Tevatron-compatible format. The data is structured for training multi-vector retrieval models such as ColPali within the Tevatron ecosystem.
Use Cases
- Train a document retrieval model with the Tevatron framework by pairing each query with its positive documents
- Fine-tune vision-language models for document search using the Tevatron-compatible data structure
- Benchmark retrieval performance on document-based datasets using the Tevatron training pipeline
- Implement multi-vector retrieval training using the pre-formatted query and document features
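As a rough illustration of the query-to-positive mapping mentioned above, the sketch below flattens records into (query, positive document id) training pairs. The column names `query_id`, `query`, and `positive_ids` are hypothetical placeholders, not the confirmed schema of this dataset, so check the actual columns before adapting it.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout for illustration only; the real column
# names in this dataset may differ.
EXAMPLE_ROWS: List[Dict[str, object]] = [
    {"query_id": "q0", "query": "What is the 2020 revenue?", "positive_ids": ["d0"]},
    {"query_id": "q1", "query": "Who signed the report?", "positive_ids": ["d1", "d2"]},
    {"query_id": "q2", "query": "", "positive_ids": ["d3"]},  # empty query, skipped
]


def iter_training_pairs(
    rows: Iterable[Dict[str, object]],
    query_key: str = "query",            # assumed column name
    positives_key: str = "positive_ids",  # assumed column name
) -> Iterator[Tuple[str, str]]:
    """Yield one (query, positive document id) pair per positive document."""
    for row in rows:
        query = str(row.get(query_key) or "").strip()
        if not query:
            # Rows without a usable query cannot form a training pair.
            continue
        for doc_id in row.get(positives_key) or []:
            yield query, str(doc_id)


pairs = list(iter_training_pairs(EXAMPLE_ROWS))
```

Passing the key names as parameters keeps the helper usable once the real column names are known.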
Strengths
- Formatted for compatibility with the Tevatron information retrieval toolkit
- Derived from the vidore/colpali_train_set vision-language dataset
- Structured for training multi-vector retrieval models like ColPali
- Provides a standardized schema for training retrieval models on document images and queries
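For readers unfamiliar with multi-vector retrieval, the sketch below shows the late-interaction (MaxSim) scoring commonly associated with ColPali-style models: each query vector is matched to its best document vector and the maxima are summed. It is a minimal pure-Python illustration with toy two-dimensional embeddings, not Tevatron's implementation; the function names `maxsim_score` and `rank_documents` are made up for this example.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any document vector."""
    if not doc_vecs:
        raise ValueError("document must have at least one vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(
    query_vecs: Sequence[Vector], docs: Dict[str, List[Vector]]
) -> List[str]:
    """Return document ids sorted from highest to lowest MaxSim score."""
    return sorted(
        docs,
        key=lambda doc_id: maxsim_score(query_vecs, docs[doc_id]),
        reverse=True,
    )


# Toy embeddings (assumed values): two query token vectors, two document pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # one patch matches each query vector
    "page_b": [[0.5, 0.5]],              # a single, less specific patch
}

score_a = maxsim_score(query, pages["page_a"])
score_b = maxsim_score(query, pages["page_b"])
ranking = rank_documents(query, pages)
```

In practice the vectors would come from the model's query and document encoders and the scoring would run on batched tensors; the loop form here only makes the arithmetic visible.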