Available on 1 platform
D.Html contains fewer than 1,000 document page images, each paired with structured HTML and Markdown markup, for OCR and reconstruction tasks. Developed by prithivMLmods and updated in March 2026, the collection focuses on preserving document hierarchy, such as headings and paragraphs.
Accessing the data requires a Parquet-compatible library such as pandas or Polars. The dataset is licensed under Apache 2.0.
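As a rough illustration of Parquet-based access, the sketch below reads one Parquet file and reports its row count and column names. The helper names `load_pages` and `describe` are hypothetical, and the listing does not state the file layout or the column names, so nothing here assumes a schema. The reader defaults to `pandas.read_parquet`; passing `polars.read_parquet` instead should work the same way.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def load_pages(path: str, reader: Optional[Callable[[str], Any]] = None) -> Any:
    """Read one Parquet file and return the resulting table object.

    `reader` is any callable that takes a path and returns a table,
    for example pandas.read_parquet or polars.read_parquet.
    """
    if Path(path).suffix != ".parquet":
        raise ValueError(f"expected a .parquet file, got: {path}")
    if reader is None:
        # Imported lazily so Polars-only environments do not need pandas.
        import pandas as pd

        reader = pd.read_parquet
    return reader(path)


def describe(table: Any) -> dict:
    """Report row count and column names without assuming a schema."""
    # len() and .columns are available on both pandas and Polars DataFrames.
    return {"rows": len(table), "columns": list(table.columns)}
```

With a locally downloaded file (the name `train.parquet` is only a placeholder), `describe(load_pages("train.parquet"))` would list the actual columns, which is a reasonable first step before relying on any particular field for the images or markup.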