Name: Traditional Chinese Medicine Oncology Texts with BERT-Based Annotation for NLP Tasks
Creator: Zhang, Guoqiang
Published: 2026-06-09T16:16:23
Keywords: Named Entity Recognition, Traditional Chinese Medicine, Oncology Texts, Benchmark, Healthcare, Text Classification, Computer Vision, Text, Natural Language Processing, Medical NLP

Description

A custom corpus of Traditional Chinese Medicine oncology texts constructed from national terminology databases, Chinese medical paper abstracts, and consultation forum Q&A. The dataset includes three annotated subsets (D1: 2,000 paragraphs; D2: 6,000 paragraphs; D3: 12,000 paragraphs) with sequence labeling and paragraph-level semantic classification. It was authored by Zhang, Guoqiang and last updated on 2026-06-09.

Use Cases

Train tokenization models for Traditional Chinese Medicine texts using the corpus's Begin-Inside-Outside (BIO) sequence labels.
Develop semantic classification models for TCM oncology paragraphs using the paragraph-level annotations.
Benchmark joint modeling architectures (e.g., BERT-CRF-MLP) for medical text processing on the structured corpus.
Build structured information extraction systems for TCM knowledge bases or patient records using the fine-grained terminology annotations.
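
Since column-level documentation is absent, the exact label format is not known. As a minimal sketch of how Begin-Inside-Outside labels are typically consumed, the Python snippet below converts a per-character tag sequence into labeled spans. The tag names (B-HERB, I-DIS, and so on), the per-character granularity, and the helper name bio_to_spans are assumptions for illustration, not the dataset's documented schema.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into (start, end, label) spans.

    Hypothetical helper: assumes tags look like "B-X", "I-X", or "O".
    A stray "I-X" with no matching open span is treated leniently
    as the start of a new span.
    """
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the labels match.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Illustrative sentence and tags; not taken from the dataset.
text = "黄芪治疗肺癌"
tags = ["B-HERB", "I-HERB", "O", "O", "B-DIS", "I-DIS"]
spans = bio_to_spans(tags)
terms = [(text[s:e], lab) for s, e, lab in spans]
```

Here terms pairs each recovered surface string with its label, which is the form an information extraction pipeline would usually store.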

Strengths

Includes three well-defined corpus subsets with specific sizes: D1 (2,000 paragraphs), D2 (6,000 paragraphs), and D3 (12,000 paragraphs).
Model performance metrics are provided, such as a worst-case accuracy of 91.364% and a Strict-F1 score of 93.723 on the D3 subset.
The corpus integrates multiple sources, including a national standard TCM terminology database and anonymized Q&A content.
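
The card does not define how Strict-F1 is computed. A common reading is exact-match, span-level F1, where a predicted entity counts only if its boundaries and label both match the gold annotation. The Python sketch below follows that assumed definition; the function name strict_f1 and the example spans are hypothetical, and the reported figures may have been computed differently.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def strict_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Exact-match span F1 (assumed meaning of "Strict-F1").

    A prediction is a true positive only if start, end, and label
    all equal those of a gold span.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative spans; the second prediction has a wrong right boundary.
gold_spans = [(0, 2, "HERB"), (4, 6, "DIS")]
pred_spans = [(0, 2, "HERB"), (4, 5, "DIS")]
score = strict_f1(gold_spans, pred_spans)  # precision 0.5, recall 0.5
```

When scoring a whole corpus rather than one paragraph, each span would also need a paragraph identifier so that identical offsets in different paragraphs are not conflated.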

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count for the full dataset is unknown, which may limit suitability assessment.
The data appears to be focused on Chinese-language TCM texts, which may limit generalizability to other medical domains or languages.

Provenance

Source: National standard TCM terminology database, abstracts of Chinese medical research papers, and Q&A content from TCM consultation forums.
Collection Method: Constructed by integrating open-source medical terminology data and anonymized TCM question-answer corpora, with applied sequence labeling and classification annotations.
Freshness: Last updated 2026-06-09 16:16:23; freshness should be verified.

License is unknown; terms of use must be verified before application.
