Traditional Chinese Medicine Multimodal Pre-training Corpus | DataSalon

Multimodal & LLM

Name: Traditional Chinese Medicine Multimodal Pre-training Corpus
Creator: FreedomIntelligence
Published: 2025-08-22T02:57:56
Keywords: Image Text Pairs, Traditional Chinese Medicine, Computer Vision, Medical Text, Natural Language Processing, Multimodal

Available on 1 platform

Description

The corpus contains over 5 billion tokens of Traditional Chinese Medicine (TCM) text collected from websites and books, and is described as the largest existing TCM corpus. FreedomIntelligence released this multimodal dataset to pre-train the ShizhenGPT model. It was last updated in September 2025.

Use Cases

Pre-training a TCM-specific LLM on the 5B-token text corpus.
Training a multimodal model for TCM image-text alignment tasks.
Fine-tuning models for TCM knowledge retrieval from the curated text data.
Benchmarking model performance on specialized medical terminology and concepts.
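
As a rough planning aid for the pre-training use case, the sketch below estimates how many optimizer steps one pass over a 5-billion-token corpus would yield. The sequence length and batch size are illustrative assumptions, not values published for this dataset or for ShizhenGPT, and the function name is hypothetical.

```python
# Rough epoch-size arithmetic for pre-training on a ~5B-token corpus.
# ASSUMPTIONS (not from the listing): the sequence length and batch size below.

# The listing says "over 5 billion tokens", so treat this as a lower bound.
CORPUS_TOKENS = 5_000_000_000


def steps_per_epoch(corpus_tokens: int, seq_len: int, batch_size: int) -> int:
    """Return the number of full optimizer steps in one pass over the corpus."""
    tokens_per_step = seq_len * batch_size
    return corpus_tokens // tokens_per_step


if __name__ == "__main__":
    seq_len = 4096    # hypothetical context length
    batch_size = 256  # hypothetical number of sequences per optimizer step
    steps = steps_per_epoch(CORPUS_TOKENS, seq_len, batch_size)
    print(f"{seq_len * batch_size:,} tokens per step, about {steps:,} steps per epoch")
```

Because the true token count and tokenizer are not stated in the listing, the result is only an order-of-magnitude guide.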

Strengths

Corpus contains over 5 billion tokens of TCM text.
Described as the largest existing TCM corpus and image-text dataset.

Limitations

Specific row counts, column names, and sample data are unavailable.
Dataset size and file formats are unknown.
Potential bias towards information available on specific TCM websites and books.

Provenance

Source: FreedomIntelligence.
Collection Method: Collected from TCM-related websites and books.
Freshness: Last updated September 2025.

License information is unknown. Users must refer to the linked paper and GitHub repository for full dataset details and usage terms.

Quality Score

D39

Community

555 downloads

8 likes

0 views

Dataset Info

Author: FreedomIntelligence
Created: Aug 22, 2025
Updated: Sep 8, 2025
Last synced: Jul 24, 2026
