Name: Open-Qwen2VL Data: Filtered Image-Text Pairs for Multimodal LLM Pre-Training
Creator: weizhiwang
Published: 2024-09-27T06:48:06
Keywords: Vision Language, Multimodal LLM, Pre-Training Data, Academic Resources, Multimodal

Description

A collection of filtered image-text pairs drawn from academic resources and used to pre-train the Open-Qwen2VL multimodal large language model. Its subsets include ccs_webdataset, derived from CC3M-CC12M-SBU and filtered with CLIP, and datacomp_medium_dfn_webdataset. It was created by weizhiwang and last updated on April 16, 2025.

Use Cases

Pre-training vision-language models on filtered image-text pairs.
Benchmarking data filtering techniques, such as CLIP-based filtering and DFN, for multimodal datasets.
Training or fine-tuning models for tasks requiring aligned visual and textual understanding.

Strengths

Data is derived from established academic image-text collections like CC3M-CC12M-SBU.
Subsets are filtered with CLIP and DFN (Data Filtering Network), which is likely to improve image-text alignment over the unfiltered sources.
Associated with a published research project (Open-Qwen2VL) with a project page and code repository.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and license are unknown, which makes it hard to assess suitability before downloading.
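
Because field semantics have to be inferred after download, a quick look at which files belong to each sample can help. The sketch below assumes the subsets are stored as WebDataset-style tar shards, which the "webdataset" suffix in the subset names suggests but this card does not confirm. Under that convention, files that share a basename form one sample and the file extension names the field. The function name group_shard_members is hypothetical, and only the Python standard library is used.

```python
import tarfile
from collections import defaultdict
from typing import BinaryIO, Dict, List


def group_shard_members(fileobj: BinaryIO) -> Dict[str, List[str]]:
    """Map each sample key to the list of field extensions found for it.

    Assumes a WebDataset-style layout (an assumption, not confirmed by the
    dataset card): the key is the path up to the first dot of the file's
    basename, and everything after that dot is the field name.
    """
    samples: Dict[str, List[str]] = defaultdict(list)
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            dirname, _, basename = member.name.rpartition("/")
            stem, _, ext = basename.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            samples[key].append(ext)
    return dict(samples)
```

Calling it on an opened shard, for example with open(path, "rb") where path is a locally downloaded file, returns a dictionary whose values show which fields (such as an image and a caption) each sample carries.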

Provenance

Source: Combines data from sources like CC3M-CC12M-SBU and DataComp-Medium-128M.
Collection Method: Filtered using CLIP and DFN (Data Filtering Network) techniques.
Freshness: Last updated 2025-04-16 00:39:28; check the hosting page for later revisions.
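
The general idea behind CLIP-based filtering can be sketched as follows: score each image-text pair by the cosine similarity of its image and text embeddings, then keep the pairs above a cutoff. This is a minimal illustration and not the Open-Qwen2VL pipeline. It assumes the embeddings have already been computed by some CLIP-style encoder, and the 0.3 default threshold and the function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Tuple

# (sample id, image embedding, text embedding); embeddings are assumed
# to be precomputed by a CLIP-style model outside this sketch.
Pair = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs: Iterable[Pair], threshold: float = 0.3) -> List[str]:
    """Return ids of pairs whose image-text similarity meets the threshold.

    The threshold value is illustrative only; the dataset card does not
    state the cutoff that was actually used.
    """
    kept = []
    for sample_id, image_emb, text_emb in pairs:
        if cosine_similarity(image_emb, text_emb) >= threshold:
            kept.append(sample_id)
    return kept
```

A fixed threshold is the simplest selection rule. Keeping a top fraction of pairs ranked by score is a common alternative when a target dataset size matters more than an absolute score.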

License is unknown; users must verify permissions before use. The full description is hosted on a separate Hugging Face dataset page.
