This dataset is a collection of document images intended for training and evaluating models that detect document orientation and the number of text columns. It appears to contain images of scientific papers, legal acts, reports, and tables. It was created by dedoc and includes documents in languages such as Russian, English, French, Spanish, and Portuguese.
Use Cases
- Train a model to classify document orientation based on image content.
- Evaluate model performance on detecting the number of text columns in a document.
- Benchmark document layout analysis algorithms across multiple document types.
- Develop preprocessing pipelines for multilingual document digitization.
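
The evaluation use cases above can be sketched with a small scoring helper. This is a minimal illustration, not the dataset's own tooling: the function name `evaluate` is hypothetical, and the label scheme (rotation angle in degrees) is an assumption, since the dataset's actual label format is undocumented. The same helper would work for column counts if they are stored as integers.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return overall accuracy and a confusion Counter keyed by (true, predicted)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples to evaluate")
    # Count how often each (true label, predicted label) pair occurs.
    confusion = Counter(zip(y_true, y_pred))
    # Correct predictions are the pairs where both labels agree.
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / len(y_true), confusion


# Hypothetical labels: rotation angle in degrees (assumed scheme, not confirmed).
true_angles = [0, 90, 180, 270, 0, 90]
pred_angles = [0, 90, 180, 90, 0, 270]

accuracy, confusion = evaluate(true_angles, pred_angles)
print(f"accuracy: {accuracy:.2f}")
for (t, p), n in sorted(confusion.items()):
    print(f"true={t:>3} predicted={p:>3} count={n}")
```

Keeping the confusion counts alongside the accuracy makes it easier to see whether errors cluster on particular rotations, for example 90 and 270 degrees being mistaken for each other.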
Strengths
- Dataset is designed to represent a variety of document types, including scientific papers, legal acts, reports, and tables.
- Includes multilingual content, covering languages such as Russian, English, French, Spanish, and Portuguese.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, so it is unclear whether the dataset is large enough for a given training or evaluation task.
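
Because the row count and field semantics are undocumented, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded as an iterable of dictionaries; the helper name `summarize_records` and the field names in the sample rows (`image_name`, `orientation`, `columns`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def summarize_records(records, max_examples=5):
    """Count rows and collect the value types and a few example values per field."""
    row_count = 0
    fields = defaultdict(lambda: {"types": set(), "examples": set()})
    for record in records:
        row_count += 1
        for name, value in record.items():
            info = fields[name]
            info["types"].add(type(value).__name__)
            # Keep only a handful of example values per field.
            if len(info["examples"]) < max_examples:
                try:
                    info["examples"].add(value)
                except TypeError:
                    # Unhashable values (lists, dicts) are skipped as examples.
                    pass
    return row_count, dict(fields)


# Hypothetical rows standing in for the downloaded data.
sample = [
    {"image_name": "doc_001.png", "orientation": 90, "columns": 2},
    {"image_name": "doc_002.png", "orientation": 0, "columns": 1},
    {"image_name": "doc_003.png", "orientation": 90, "columns": 1},
]

row_count, fields = summarize_records(sample)
print("rows:", row_count)
for name, info in fields.items():
    print(name, sorted(info["types"]), sorted(info["examples"], key=str))
```

A summary like this gives the row count directly and shows which fields hold a small set of discrete values, which is a reasonable hint that they are labels rather than free-form metadata.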
Provenance
- Source: Hugging Face
- Freshness: Last updated 2024-08-02 11:22:19; freshness should be verified.