Wukong100M: 100 Million Chinese Image-Text Pairs | DataSalon

Category: Multimodal & LLM
Name: Wukong100M: 100 Million Chinese Image-Text Pairs
Creator: wanng
Published: 2022-12-11T04:26:12
Keywords: Image Text Pairs, Computer Vision, Chinese Language, Natural Language Processing, Multimodal

Description

Wukong100M is a subset of the Noah-Wukong multimodal dataset consisting of 100 million Chinese image-text pairs. It was uploaded to Hugging Face by the author 'wanng' and last updated on December 11, 2022. The text metadata for the pairs occupies approximately 16GB.

Use Cases

Training Chinese vision-language models based on the 100 million image-text pairs.
Fine-tuning image captioning systems for Chinese text generation.
Conducting research on cross-modal retrieval between Chinese text and images.
Pre-training foundation models for downstream Chinese multimodal applications.

Strengths

Contains approximately 100 million data points, providing a large-scale resource.
Focuses on Chinese language content, which may be less common in other large multimodal collections.
The text metadata for the pairs has a known size of about 16GB.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Last updated 2022-12-11 06:24:05; freshness should be verified.
Column-level documentation is absent; field semantics must be inferred after download.
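
Because the columns are undocumented, a first step after download is usually to tabulate which fields appear and what kinds of values they hold. The following is a minimal sketch of that inspection using only the Python standard library. The field names `url` and `caption` and the sample rows are hypothetical placeholders, not the dataset's confirmed schema; substitute a small sample of real rows once the metadata has been downloaded.

```python
from collections import Counter, defaultdict


def summarize_fields(records, max_examples=2):
    """Summarize field names, observed value types, and a few example values."""
    types = defaultdict(Counter)
    examples = defaultdict(list)
    for record in records:
        for name, value in record.items():
            # Count how often each Python type appears under this field name.
            types[name][type(value).__name__] += 1
            # Keep only the first few values as examples.
            if len(examples[name]) < max_examples:
                examples[name].append(value)
    return {
        name: {"types": dict(types[name]), "examples": examples[name]}
        for name in types
    }


# Hypothetical rows: the real column names are undocumented and may differ.
sample_rows = [
    {"url": "https://example.com/a.jpg", "caption": "一只猫"},
    {"url": "https://example.com/b.jpg", "caption": "城市夜景"},
    {"url": "https://example.com/c.jpg", "caption": None},
]

summary = summarize_fields(sample_rows)
for name, info in summary.items():
    print(name, info["types"], info["examples"])
```

Mixed types under one field (for example, strings alongside missing values) are an early hint that some rows need filtering before training.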

Provenance

Source: Noah-Wukong multimodal dataset
Collection Method: Subset extraction focusing on the Chinese-language portion.
Time Range: not specified
Freshness: 2022-12-11 06:24:05
Geography: not specified

The source notes an image download success rate of about 80% and describes the full set of images as 'very, very large'.
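
For capacity planning, the figures quoted in this listing can be turned into rough estimates. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the 80% success rate applies uniformly across all 100 million pairs and that "16GB" means decimal gigabytes (10^9 bytes). The function name is illustrative.

```python
TOTAL_PAIRS = 100_000_000        # pair count stated in the listing
SUCCESS_RATE = 0.80              # approximate image download success rate
METADATA_BYTES = 16 * 10**9      # ~16GB of text metadata, assuming decimal GB


def expected_downloads(total_pairs, success_rate):
    """Estimate how many images download successfully at a given rate."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return round(total_pairs * success_rate)


usable = expected_downloads(TOTAL_PAIRS, SUCCESS_RATE)
missing = TOTAL_PAIRS - usable
bytes_per_pair = METADATA_BYTES / TOTAL_PAIRS

print(f"Expected usable pairs: {usable:,}")
print(f"Expected failed downloads: {missing:,}")
print(f"Average text metadata per pair: {bytes_per_pair:.0f} bytes")
```

Under these assumptions, roughly 80 million pairs would end up with a usable image, so training pipelines should tolerate missing images rather than assume the full 100 million.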

Quality Score

D38

Community

214 downloads

17 likes

0 views

Dataset Info

Author: wanng
Created: Dec 11, 2022
Updated: Dec 11, 2022
Last synced: Apr 21, 2026
