Description
DEJIMA is a large-scale Japanese multimodal dataset containing 3.88 million image-caption pairs and 3.88 million image-question-answer pairs. It was created by MIL-UT using a reproducible pipeline involving web-scale image collection, strict filtering, evidence extraction, and LLM-based annotation under grounding constraints. The dataset was last updated on December 2, 2025.
Use Cases
Training Japanese image captioning models on the 3.88M image-caption pairs.
Fine-tuning Japanese visual question answering models on the 3.88M image-QA pairs.
Benchmarking multimodal models on Japanese-language grounding and evidence extraction tasks.
Developing or evaluating scalable data generation pipelines for non-English multimodal datasets.
Strengths
Contains 3.88 million image-caption pairs (DEJIMA-Cap).
Contains 3.88 million image-question-answer pairs (DEJIMA-VQA).
Annotations are LLM-generated under grounding constraints, which may reduce captions and answers that are unsupported by the extracted evidence.
Constructed through a scalable and fully reproducible pipeline.
Limitations
The dataset description is sparse; data quality has to be checked by manually inspecting samples after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect bias inherent to web-scale image collection.
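Because column-level documentation is absent, a first pass over a downloaded sample can at least show which fields exist, what types they hold, and how often they are missing. The following is a minimal sketch, assuming the records have already been loaded as a list of Python dicts; the helper `summarize_fields` and the field names `image_id`, `caption`, and `question` are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Summarize field names, value types, missing counts, and one example per field."""
    summary = defaultdict(lambda: {"types": Counter(), "present": 0, "example": None})
    for record in records:
        for key, value in record.items():
            entry = summary[key]
            entry["types"][type(value).__name__] += 1
            entry["present"] += 1
            # Keep the first non-None value as a representative example.
            if entry["example"] is None and value is not None:
                entry["example"] = value
    total = len(records)
    return {
        key: {
            "types": dict(entry["types"]),
            "missing": total - entry["present"],  # records lacking this key
            "example": entry["example"],
        }
        for key, entry in summary.items()
    }


# Hypothetical records: these field names are placeholders, not the real columns.
sample = [
    {"image_id": "0001", "caption": "港に停泊する船"},
    {"image_id": "0002", "caption": "", "question": "何が写っていますか?"},
]

for field, info in summarize_fields(sample).items():
    print(field, info)
```

A summary like this only reveals structure; what each field actually means still has to be confirmed by reading a handful of records alongside their images.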