PDF-WuKong is a dataset for training and evaluating large multimodal models on long PDF documents. It accompanies the research paper 'PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling'. It is hosted on Hugging Face by user yh0075 and was last updated on 2025-01-06.
Use Cases
- Training multimodal language models for PDF comprehension with the end-to-end sparse sampling method described in the paper.
- Benchmarking model performance on long document reading tasks using the provided PDF corpus.
- Extracting structured text and image information from XML-based PDF documents using the included code.
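The extraction use case can be illustrated with a minimal sketch. It is not the code shipped with the dataset, and the XML schema is not documented here, so the example avoids assuming one: it walks every element, keeps any non-empty text, and treats attribute values ending in a common image suffix as image references. The function name `extract_text_and_images`, the suffix list, and the sample document (including its tag and attribute names) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: image references appear as attribute values with these suffixes.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")


def extract_text_and_images(xml_string):
    """Return (text_chunks, image_refs) from an XML string, schema-agnostic.

    text_chunks: list of (tag, text) for elements with non-empty leading text.
    image_refs:  list of (tag, value) for attributes that look like image paths.
    """
    root = ET.fromstring(xml_string)
    text_chunks = []
    image_refs = []
    for element in root.iter():  # document order, root included
        # Only the text directly after the opening tag; tail text is ignored.
        if element.text and element.text.strip():
            text_chunks.append((element.tag, element.text.strip()))
        for value in element.attrib.values():
            if value.lower().endswith(IMAGE_SUFFIXES):
                image_refs.append((element.tag, value))
    return text_chunks, image_refs


# Hypothetical sample; the real files may use entirely different tags.
SAMPLE_XML = """<paper>
  <section title="Introduction">
    <paragraph>Long PDFs are hard to read end to end.</paragraph>
    <figure src="figures/overview.png">
      <caption>Model overview.</caption>
    </figure>
  </section>
</paper>"""

texts, images = extract_text_and_images(SAMPLE_XML)
```

Once the actual schema is known from the downloaded files, the generic walk would likely be replaced with targeted lookups on the real tag names.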
Strengths
- Dataset is directly linked to a named research paper, providing a clear academic context.
- Last updated on 2025-01-06, so the hosted files are relatively recent.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- The row count is not reported, which makes it hard to judge whether the dataset is large enough for a given task.
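Because the row count and field semantics have to be recovered after download, a small profiling pass is a reasonable first step. The sketch below assumes the downloaded data has already been loaded into an iterable of dictionaries, one per record; the loading step depends on the file format and is not shown. The function name `profile_records` and the sample field names (`question`, `page`, `image`) are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Summarize row count plus per-field value types and missing-value counts."""
    type_counts = defaultdict(Counter)  # field -> Counter of Python type names
    missing = Counter()                 # field -> number of None values
    row_count = 0
    for record in records:
        row_count += 1
        for field, value in record.items():
            if value is None:
                missing[field] += 1
            else:
                type_counts[field][type(value).__name__] += 1
    fields = sorted(set(type_counts) | set(missing))
    return {
        "rows": row_count,
        "fields": {
            name: {"types": dict(type_counts[name]), "missing": missing[name]}
            for name in fields
        },
    }


# Hypothetical records standing in for the real data.
sample_records = [
    {"question": "What is shown?", "page": 3, "image": None},
    {"question": "Summarize.", "page": 7, "image": "fig1.png"},
]
summary = profile_records(sample_records)
```

The summary only shows which fields exist and what types they hold; what each field means still has to be checked against the paper and a sample of actual records.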
Provenance
- Source: yh0075 on Hugging Face.
- Collection Method: Likely extracted from PDF documents for the PDF-WuKong research project.
- Time Range: Not specified.
- Freshness: Last updated 2025-01-06 02:17:02.
- Geography: Not specified.