MMDocIRT2ITRetrieval is an evaluation dataset from the Massive Text Embedding Benchmark (MTEB). It contains 313 long documents averaging 65.1 pages, categorized into ten domains, including research reports, academic papers, and government documents. The content is multimodal, with text making up 60.4% of it.
Use Cases
- Benchmarking text-to-image retrieval systems on multimodal document content.
- Evaluating cross-modal embedding models on long documents averaging 65.1 pages.
- Testing document retrieval algorithms across ten distinct domains, such as research reports and laws.
- Analyzing the distribution and integration of multimodal information (text and images) in professional documents.
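As an illustration of the benchmarking use case, the sketch below scores a toy text-to-image retrieval run with recall@k over cosine similarity. It is a minimal example under stated assumptions: the function names, the embedding vectors, and the query and image IDs are all hypothetical, and recall@k is used only as one common retrieval metric, not as the metric this benchmark is confirmed to report.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, image_vecs, relevant, k):
    """Fraction of queries whose relevant image ID appears in the top-k results."""
    hits = 0
    for qid, qvec in query_vecs.items():
        # Rank all image IDs by similarity to the query, best first.
        ranked = sorted(
            image_vecs,
            key=lambda iid: cosine(qvec, image_vecs[iid]),
            reverse=True,
        )
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Hypothetical embeddings; a real run would use a cross-modal model's outputs.
queries = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
images = {"img_a": [0.9, 0.1], "img_b": [0.1, 0.9], "img_c": [0.7, 0.7]}
relevant = {"q1": "img_a", "q2": "img_c"}

print(recall_at_k(queries, images, relevant, k=1))
print(recall_at_k(queries, images, relevant, k=2))
```

In this toy setup each query has exactly one relevant image; a dataset with several relevant items per query would need a metric that accounts for that.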
Strengths
- The dataset includes 313 long documents, providing a substantial evaluation corpus.
- Documents average 65.1 pages, offering a testbed for long-form content retrieval.
- Documents are categorized into ten distinct domains, enabling domain-specific analysis.
- The multimodal distribution is quantified, with text comprising 60.4% of the content.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The row count is not reported, which makes suitability harder to assess before download.
- Description metadata is limited; actual data quality requires manual inspection after download.
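Because field semantics have to be inferred after download, a first inspection pass might list which fields appear in the records, which Python types they hold, and how often they are populated. The sketch below is a minimal illustration that works on plain dictionaries; the helper name `summarize_fields` and the sample records are hypothetical and do not reflect the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted type names seen and a non-null count."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for name, value in record.items():
            types[name].add(type(value).__name__)
            if value is not None:
                counts[name] += 1
    return {
        name: {"types": sorted(types[name]), "non_null": counts[name]}
        for name in types
    }


# Hypothetical records standing in for rows loaded from the dataset.
sample = [
    {"id": "q1", "text": "What is the reported revenue?", "page": 3},
    {"id": "q2", "text": None, "page": 7},
]

for field, info in summarize_fields(sample).items():
    print(field, info)
```

The same helper could be applied to a slice of rows from whichever loader is used to fetch the data; the repository ID and split names would need to be confirmed on the dataset page first.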
Provenance
- Source: mteb (Massive Text Embedding Benchmark)
- Freshness: last updated 2026-03-14 19:26:11; freshness should be verified.