DanQing100M: 100 Million Chinese Image-Text Pairs for Vision-Language Pre-training

Name: DanQing100M: 100 Million Chinese Image-Text Pairs for Vision-Language Pre-training
Creator: DeepGlint-AI
Published: 2026-01-05T15:02:46
Keywords: Image Text Pairs, Web Data, Pre Training, Vision Language, Chinese, Computer Vision, Large Scale, Multimodal

Description

DanQing100M is a large-scale Chinese vision-language dataset containing 100 million image-text pairs, totaling 12 terabytes. It was created by researchers including Hengyu Shen, Tiancheng Gu, and others from DeepGlint-AI, using web data from 2024 to 2025. The dataset is intended for vision-language pre-training tasks.

Use Cases

Pre-train vision-language models on the 100 million Chinese image-text pairs.
Fine-tune Chinese multimodal models for tasks such as image captioning using web-sourced data.
Benchmark model performance on large-scale Chinese vision-language understanding tasks.
Conduct research on cross-modal alignment for Chinese language and imagery.

Strengths

Contains 100 million image-text pairs, a large-scale resource.
Dataset size is 12 terabytes, about 120 KB per pair on average, which suggests images are stored at reasonably high resolution.
Data is sourced from recent web data (2024-2025), suggesting contemporary relevance.
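
The size figures above imply a per-pair average and a rough download cost. A minimal sketch of that arithmetic follows, assuming "12 terabytes" means decimal terabytes (10^12 bytes each) and assuming a hypothetical sustained 1 Gbit/s link; the function names are illustrative and are not part of the dataset or any tooling.

```python
def average_bytes_per_pair(total_bytes: float, num_pairs: int) -> float:
    """Return the mean storage footprint of one image-text pair."""
    if num_pairs <= 0:
        raise ValueError("num_pairs must be positive")
    return total_bytes / num_pairs


def transfer_days(total_bytes: float, bandwidth_bits_per_sec: float) -> float:
    """Return the ideal transfer time in days, ignoring protocol overhead."""
    if bandwidth_bits_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    seconds = total_bytes * 8 / bandwidth_bits_per_sec
    return seconds / 86_400  # seconds per day


# Figures taken from the dataset description; decimal TB is an assumption.
TOTAL_BYTES = 12 * 10**12
NUM_PAIRS = 100_000_000

avg_kb = average_bytes_per_pair(TOTAL_BYTES, NUM_PAIRS) / 1000
days = transfer_days(TOTAL_BYTES, 1e9)  # hypothetical 1 Gbit/s link

print(f"Average size per pair: {avg_kb:.0f} KB")
print(f"Ideal transfer time at 1 Gbit/s: {days:.2f} days")
```

Real transfers will take longer than the ideal figure, so treat it as a lower bound when planning storage and bandwidth.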

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
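An exact row count is not reported in the catalog metadata; the 100 million figure comes from the dataset title and description and should be verified after download.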
License information is unknown, which may restrict usage.
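
Because column-level documentation is absent, field names and types have to be inferred from a sample of records. The sketch below shows one way to do that once a few records have been loaded as dictionaries. The field names `url`, `caption`, and `width` are hypothetical placeholders, not the dataset's actual schema, and the loading step is left out because the file format is not documented here.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted Python type names seen in the sample."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical sample records; the real field names must be read from the data.
sample = [
    {"url": "https://example.com/a.jpg", "caption": "一只猫", "width": 640},
    {"url": "https://example.com/b.jpg", "caption": None, "width": 480},
]

for field, types in summarize_fields(sample).items():
    print(f"{field}: {', '.join(types)}")
```

A field that reports more than one type (for example both `str` and `NoneType`) points to missing values that a pre-training pipeline would need to filter or fill.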

Provenance

Source: DeepGlint-AI
Collection Method: Likely gathered from web sources.
Time Range: 2024-2025
Freshness: Last updated 2026-03-25 03:26:02; freshness should be verified.
Geography: China (implied by Chinese language focus)

License restrictions are unknown and must be verified before use.

Quality Score

Overall: C (43)
Description: 48
Source: 36
Reputation: 56
Access: 26

Community

2.5K downloads

49 likes

0 views

Dataset Info

Author: DeepGlint-AI
Created: Jan 5, 2026
Updated: Mar 25, 2026
Last synced: May 6, 2026
