Description
A private dataset of 4,431 document samples, including 4,000 source-only samples with fact and title labels, built for a review application. It was created by mannycooper and last updated on June 2, 2026. Its primary purpose is to support development of an Office and PDF title extraction tool.
Use Cases
Training a named entity recognition model to identify document titles based on labeled facts and titles.
Evaluating the performance of automated title extraction algorithms for Office and PDF documents.
Building a review application to manually verify or correct machine-generated document metadata.
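As an illustration of the evaluation use case, the sketch below scores predicted titles against reference titles with normalized exact match. It is a minimal example under stated assumptions, not the dataset author's method: the function names and the sample titles are hypothetical, and exact match is only one of several reasonable metrics.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Apply NFKC, case-fold, and collapse whitespace so trivial
    formatting differences are not counted as extraction errors."""
    text = unicodedata.normalize("NFKC", title).casefold()
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(predicted, reference):
    """Fraction of predictions equal to the reference after normalization."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        normalize_title(p) == normalize_title(r)
        for p, r in zip(predicted, reference)
    )
    return hits / len(reference)


# Hypothetical titles, for illustration only (not taken from the dataset).
reference = ["Quarterly Report 2023", "Meeting  Notes", "Design Spec"]
predicted = ["quarterly report 2023", "Meeting Notes", "Design Specification"]

score = exact_match_accuracy(predicted, reference)
print(f"exact-match accuracy: {score:.2f}")
```

Exact match is strict: a prediction that differs by one word scores zero, so a token-overlap or edit-distance metric may be a useful complement when titles are long.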
Strengths
Contains 4,000 source-only samples with explicit fact and title labels, providing a foundation for supervised learning.
The dataset is specifically curated for a targeted application in document title extraction.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count beyond the sample numbers given above is unknown, which may make it harder to assess suitability.
Description metadata is limited; actual data quality requires manual inspection after download.
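Because column-level documentation is absent, a first pass after download might simply profile each field. The sketch below assumes the records have already been loaded as a list of dictionaries; the column names `source`, `fact`, and `title` and the sample rows are hypothetical placeholders, since the actual schema is not documented.

```python
from collections import Counter


def summarize_columns(rows):
    """Return, per column, the count of missing values, the observed
    Python types, and one example value."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat absent keys, None, and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        summary[col] = {
            "missing": len(values) - len(present),
            "types": dict(Counter(type(v).__name__ for v in present)),
            "example": present[0] if present else None,
        }
    return summary


# Hypothetical rows; the real field names must be checked after download.
rows = [
    {"source": "a.pdf", "fact": "Published by ACME", "title": "Annual Report"},
    {"source": "b.docx", "fact": None, "title": ""},
    {"source": "c.pdf", "title": "Design Spec"},
]

summary = summarize_columns(rows)
for name, info in summary.items():
    print(name, info)
```

A profile like this only hints at field semantics; manual inspection of a sample of rows is still needed to judge label quality.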
Provenance
Source
huggingface
Collection Method
Private dataset for a specific application; collection method not specified.
Freshness
Last updated 2026-06-02 04:44:24; freshness should be verified.
The dataset is marked private, with a note not to make it public until the source files have been cleared; users should verify the license and usage terms before use.