This dataset contains 44,663 Traditional Chinese pharmaceutical label documents from Taiwan's Food and Drug Administration (TFDA). It was created by twinkle-ai and last updated on 2026-05-03. Each record contains rendered WebP images of every PDF page, plus structured data extracted into a 17-field JSON schema.
Use Cases
- Fine-tuning language models on structured pharmaceutical label data.
- Training vision-language models on rendered document images paired with structured text.
- Building document question-answering systems for drug information retrieval.
- Developing tools for Traditional Chinese medical NLP tasks using the corpus.
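
As one possible starting point for the fine-tuning use case, the sketch below serializes a single record's structured fields as a prompt/completion pair on one JSONL line. The function `to_sft_example`, the keys `product_name` and `indications`, and the sample values are all hypothetical, because the actual 17-field schema is not documented here; substitute the real keys after inspecting the data.

```python
import json


def to_sft_example(record: dict, name_field: str, answer_field: str) -> str:
    """Serialize one record as a JSONL line holding a prompt/completion pair.

    Both field names are supplied by the caller because the real schema
    keys are not documented here (they are assumptions in this sketch).
    """
    prompt = (
        f"What does the label say under '{answer_field}' "
        f"for {record[name_field]}?"
    )
    example = {"prompt": prompt, "completion": record[answer_field]}
    # ensure_ascii=False keeps Traditional Chinese text readable in the output.
    return json.dumps(example, ensure_ascii=False)


# Hypothetical record; real keys and values will differ.
sample = {"product_name": "範例藥品", "indications": "緩解頭痛"}
line = to_sft_example(sample, "product_name", "indications")
```

Passing the field names as arguments keeps the sketch independent of any particular schema, so the same helper can be reused once the real column names are known.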
Strengths
- Contains 44,663 records of Taiwanese drug labels.
- Provides multimodal data with both rendered WebP images and structured JSON data for each document.
- Follows a unified 17-field JSON schema for structured extraction.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The stated record count (44,663) should be confirmed against the downloaded data.
- Last updated 2026-05-03 02:21:40; freshness should be verified.
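
Because field semantics have to be inferred after download, a first pass over the structured data might simply tally which keys appear and what value types they hold. The following is a minimal sketch under the assumption that each record has already been parsed into a Python dict; `profile_fields`, the keys `license_no` and `ingredients`, and the sample rows are hypothetical and do not reflect the real schema.

```python
from collections import Counter, defaultdict


def profile_fields(records):
    """Summarize how often each key appears and which value types it holds."""
    presence = Counter()
    types = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            presence[key] += 1
            types[key][type(value).__name__] += 1
    return {
        key: {"present_in": presence[key], "types": dict(types[key])}
        for key in sorted(presence)
    }


# Hypothetical records standing in for parsed JSON rows.
rows = [
    {"license_no": "A-001", "ingredients": ["x", "y"]},
    {"license_no": "A-002", "ingredients": None},
]
profile = profile_fields(rows)
```

On the real data, `len(profile)` could be compared with the 17 fields the schema is said to have, and the number of rows with the stated 44,663 records.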
Provenance
- Source: Taiwan Food and Drug Administration (TFDA) drug license query system.
- Collection Method: PDFs were downloaded from URLs listed in a government open data Excel file, rendered to WebP images, and processed with OCR and structured extraction.
- Time Range: not specified.
- Freshness: last updated 2026-05-03 02:21:40.
- Geography: Taiwan.