This dataset contains 100,056 rasterized page images from arXiv AI/ML papers, intended as a benchmark corpus for OCR tasks. Created by obswork, it was rendered at 144 DPI from 4,866 source PDFs and last updated on 2026-04-19. Images are encoded as WebP and packed into Parquet shards, which the Hugging Face datasets library decodes automatically.
The license is unknown; users should verify permissible uses.
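
To give a feel for what the stated figures imply, here is a minimal sketch of the arithmetic behind 144 DPI rendering and the page-to-PDF ratio. It relies on the default PDF user-space unit of 1/72 inch. The US Letter page size is an assumption for illustration only (the listing does not state page dimensions), and the helper names `render_scale`, `rendered_size`, and `mean_pages_per_pdf` are hypothetical, not part of the dataset or any library. The renderer actually used may round pixel dimensions differently.

```python
from __future__ import annotations

# Default PDF user space is 72 points per inch, so the DPI fixes the scale factor.
PDF_POINTS_PER_INCH = 72

# Figures taken from the dataset description.
RENDER_DPI = 144
TOTAL_PAGES = 100_056
SOURCE_PDFS = 4_866

# Assumption: a US Letter page (8.5 x 11 in = 612 x 792 pt). Real pages may differ.
LETTER_PT = (612, 792)


def render_scale(dpi: int) -> float:
    """Scale factor from PDF points to pixels at the given DPI."""
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = render_scale(dpi)
    return round(width_pt * scale), round(height_pt * scale)


def mean_pages_per_pdf(pages: int, pdfs: int) -> float:
    """Average number of page images contributed by each source PDF."""
    return pages / pdfs


if __name__ == "__main__":
    print(render_scale(RENDER_DPI))                # 2.0
    print(rendered_size(*LETTER_PT, RENDER_DPI))   # (1224, 1584)
    print(round(mean_pages_per_pdf(TOTAL_PAGES, SOURCE_PDFS), 2))  # 20.56
```

Under these assumptions, each image would be roughly 1224 x 1584 pixels, and each source PDF contributes about 20.6 pages on average.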