Available on 1 platform
The dataset contains 1 million pages of fully parallel synthetic documents rendered in 22 languages for OCR, layout detection, and visual question answering tasks. It was created by Cognitive-Lab and was last updated on the Hugging Face platform in May 2026. It is described as one of the largest open-source multilingual, multi-task document datasets: the same ~45,700 source pages are rendered in every language, which accounts for the total of roughly 1 million pages (22 × ~45,700).
The license is unknown; verify the terms of use before using the dataset.