Traditional Chinese Medicine Oncology Texts with BERT-Based Annotation for NLP Tasks
by Zhang, Guoqiang / Harvard Dataverse·Updated 4d ago
Description
A custom corpus of Traditional Chinese Medicine oncology texts constructed from national terminology databases, Chinese medical paper abstracts, and consultation forum Q&A. The dataset includes three annotated subsets (D1: 2,000 paragraphs; D2: 6,000 paragraphs; D3: 12,000 paragraphs) with sequence labeling and paragraph-level semantic classification. It was authored by Zhang, Guoqiang and last updated on 2026-06-09.
Use Cases
Train tokenization (word segmentation) models for Traditional Chinese Medicine texts using the corpus's Begin-Inside-Outside (BIO) sequence labels.
Develop semantic classification models for TCM oncology paragraphs using the paragraph-level annotations.
Benchmark joint modeling architectures (e.g., BERT-CRF-MLP) for medical text processing on the structured corpus.
Build structured information extraction systems for TCM knowledge bases or patient records, drawing on the fine-grained terminology annotations.
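To make the Begin-Inside-Outside scheme concrete, the following is a minimal sketch of decoding character-level BIO tags into labeled spans. The label names (`DISEASE`, `HERB`), the example sentence, and the function name `bio_to_spans` are illustrative assumptions; the dataset's actual tag set and file layout are not documented here and should be checked after download.

```python
from typing import List, Tuple

# (start index, end index exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a sequence of BIO tags into labeled spans.

    Assumes tags look like "B-X", "I-X", or "O" (hypothetical label set).
    A stray "I-X" with no matching open span is treated as a span start.
    """
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # "I-X" continues the open span only when the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the currently open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # "B-X" (or a stray "I-X") opens a new span.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Illustrative example: one tag per Chinese character.
chars = list("肺癌患者服用黄芪")
tags = ["B-DISEASE", "I-DISEASE", "O", "O", "O", "O", "B-HERB", "I-HERB"]
terms = [("".join(chars[s:e]), lab) for s, e, lab in bio_to_spans(tags)]
```

Here `terms` holds the recovered term strings paired with their labels. The same decoding step is typically applied to both gold and predicted tag sequences before span-level evaluation.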
Strengths
Includes three well-defined corpus subsets with specific sizes: D1 (2,000 paragraphs), D2 (6,000 paragraphs), and D3 (12,000 paragraphs).
Model performance metrics are provided, such as a worst-case accuracy of 91.364% and a Strict-F1 score of 93.723 on the D3 subset.
The corpus integrates multiple sources, including a national standard TCM terminology database and anonymized Q&A content.
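The listing does not define how the Strict-F1 score is computed. A common convention, assumed in the sketch below, is exact-match span F1: a predicted span counts as correct only if its start, end, and label all match a gold span. The function name `strict_f1` and the example spans are hypothetical and do not reproduce the dataset's reported figures.

```python
from typing import Set, Tuple

# (start index, end index exclusive, label)
Span = Tuple[int, int, str]


def strict_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match span F1 (assumed meaning of "Strict-F1")."""
    true_positives = len(gold & pred)  # spans identical in boundaries and label
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative example: the second prediction has a wrong right boundary,
# so it earns no credit under strict matching.
gold = {(0, 2, "DISEASE"), (6, 8, "HERB")}
pred = {(0, 2, "DISEASE"), (6, 7, "HERB")}
score = strict_f1(gold, pred)
```

Under strict matching, partially overlapping spans earn no credit, so this metric is usually lower than a relaxed, overlap-based F1 on the same predictions.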
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count for the full dataset is unknown, which may limit suitability assessment.
The data appears to be focused on Chinese-language TCM texts, which may limit generalizability to other medical domains or languages.
Provenance
Source
National standard TCM terminology database, abstracts of Chinese medical research papers, and Q&A content from TCM consultation forums.
Collection Method
Constructed by integrating open-source medical terminology data and anonymized TCM question-answer corpora, with applied sequence labeling and classification annotations.
Freshness
Last updated 2026-06-09 16:16:23; freshness should be verified.
License
Unknown; terms of use must be verified before application.