BigDocs-7.5M is a dataset created by ServiceNow for training multimodal models on document and code tasks, as described in the associated arXiv paper. It is hosted on Hugging Face and was last updated on June 20, 2025. It appears to contain both text and image data.
Some parts of the dataset are distributed without direct image columns and carry an image identifier column instead; the provided `get_bigdocs_75m.py` script substitutes the images back in.
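The general idea of putting images back in from an identifier column can be illustrated with a small, self-contained sketch. This is not the interface of `get_bigdocs_75m.py`, whose arguments and behavior are not described here. The function name `substitute_images`, the column names `img_id` and `image`, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List


def substitute_images(
    records: Iterable[Dict[str, Any]],
    image_lookup: Dict[str, bytes],
    id_column: str = "img_id",    # hypothetical column name
    image_column: str = "image",  # hypothetical column name
) -> List[Dict[str, Any]]:
    """Return copies of the records with an image column filled from a lookup."""
    restored = []
    for record in records:
        row = dict(record)  # copy so the input rows are left untouched
        image_id = row.get(id_column)
        # Identifiers missing from the lookup yield None, so the caller
        # can decide whether to skip, retry, or fail on those rows.
        row[image_column] = image_lookup.get(image_id)
        restored.append(row)
    return restored


# Toy rows and a toy image source standing in for the real data.
rows = [
    {"img_id": "a1", "query": "Describe the chart."},
    {"img_id": "b2", "query": "Transcribe the table."},
]
images = {"a1": b"\x89PNG-placeholder"}

restored_rows = substitute_images(rows, images)
```

In this sketch, rows whose identifier cannot be resolved keep a `None` image rather than raising an error; whether the actual script behaves that way is not stated in the description above.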