Over 5 billion tokens of Traditional Chinese Medicine text from websites and books, alongside a large-scale image-text dataset, form the pretraining data for ShizhenGPT. The dataset was created by CarsonnnNN and released on Hugging Face, with a last recorded update in March 2026. It is described as the largest existing open-source TCM corpus and image-text dataset for pretraining.
Use Cases
- Pretraining a domain-specific multimodal LLM on the TCM text corpus and image-text pairs.
- Fine-tuning models for TCM question answering on the large-scale text corpus.
- Training vision-language models for TCM applications on the image-text dataset.
- Linguistic or semantic analysis of TCM literature using the web- and book-derived text.
Strengths
- Corpus contains over 5 billion tokens, described as the largest existing open-source TCM text dataset.
- Includes a large-scale TCM image-text pretraining dataset alongside the text corpus.
Limitations
- Description metadata is limited; actual data quality, structure, and file formats require manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count and on-disk size are not reported, which makes it harder to assess suitability before downloading.
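Since file formats and sizes are undocumented, a first inspection pass after download could be sketched as follows. This is a minimal example that uses only the Python standard library; the function name `summarize_files` and the local path `./shizhengpt_data` are hypothetical and not part of the dataset itself.

```python
from collections import Counter
from pathlib import Path


def summarize_files(root):
    """Return {extension: (file_count, total_bytes)} for all files under `root`."""
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Normalize case so ".JSONL" and ".jsonl" are grouped together.
            ext = path.suffix.lower() or "(no extension)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: (counts[ext], sizes[ext]) for ext in sorted(counts)}


if __name__ == "__main__":
    # Hypothetical download location; replace with the actual local path.
    for ext, (n, size) in summarize_files("./shizhengpt_data").items():
        print(f"{ext}: {n} files, {size / 1e6:.1f} MB")
```

The per-extension summary only shows which formats are present and how large they are; column names and field semantics still have to be read from the files themselves.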
Provenance
- Source: CarsonnnNN on Hugging Face.
- Collection Method: Collected from TCM-related websites and books.
- Time Range: Not specified.
- Freshness: Last updated 2026-03-12 17:18:56; freshness should be verified.
- Geography: Not specified.
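One way to verify freshness is to compare the timestamp recorded here with the one currently shown on the dataset's repository page. The sketch below assumes both timestamps use the same format and timezone (the timezone is not stated in this description); `has_newer_update` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamp recorded in this description; its timezone is not stated.
RECORDED = "2026-03-12 17:18:56"


def has_newer_update(recorded, observed, fmt="%Y-%m-%d %H:%M:%S"):
    """Return True if `observed` is strictly later than `recorded`."""
    return datetime.strptime(observed, fmt) > datetime.strptime(recorded, fmt)
```

If `has_newer_update(RECORDED, observed)` returns True for the timestamp read off the repository page, the details in this description may be out of date.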