DataSalon

Cc2Dataset: Multimodal Image, Audio, and Video Pairs from Common Crawl | DataSalon

Multimodal & LLM

Name: Cc2Dataset: Multimodal Image, Audio, and Video Pairs from Common Crawl
Creator: rom1504
Published: 2022-11-29T22:54:06
License: MIT
Keywords: Big Data, Multimodal

Available on 1 platform

Description

Cc2Dataset extracts multimodal pairs (image-text, audio-text, and video-text) from the Common Crawl web archive. Developed by rom1504, first published in November 2022 and last updated in December 2023, it provides a pipeline that converts raw web documents into structured caption-media datasets. It targets big-data workloads in which each media item is paired with text from its surrounding document.
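To make the idea of a caption-media pair concrete, here is a minimal sketch that pulls (media URL, text) pairs out of a single HTML document using only the Python standard library. It is an illustration of the concept, not Cc2Dataset's own implementation or API: the names `MediaPairParser` and `extract_pairs` are hypothetical, and using the `alt` or `title` attribute as the caption is an assumption made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical mapping from HTML tag to modality label (assumption for this sketch).
MEDIA_TAGS = {"img": "image", "audio": "audio", "video": "video"}


class MediaPairParser(HTMLParser):
    """Collects media elements that carry both a source URL and some text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        modality = MEDIA_TAGS.get(tag)
        if modality is None:
            return
        attr = dict(attrs)
        src = attr.get("src")
        # Assumption: treat alt text, or failing that the title, as the caption.
        text = (attr.get("alt") or attr.get("title") or "").strip()
        if src and text:
            self.pairs.append(
                {
                    "modality": modality,
                    "url": urljoin(self.base_url, src),  # resolve relative links
                    "text": text,
                }
            )


def extract_pairs(html, base_url):
    """Return a list of {"modality", "url", "text"} dicts found in one page."""
    parser = MediaPairParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Media elements without any accompanying text are skipped, since a pair needs both sides. A real pipeline would apply the same step to every document in a crawl rather than to one page.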

Use Cases

Training vision-language models using extracted image-text pairs
Developing speech-to-text systems using extracted audio-text pairs
Building video understanding models using video-text pairs

Strengths

Supports multimodal extraction including audio-text and video-text pairs
Utilizes Common Crawl as a primary source for web-scale data
MIT license permits commercial and research use

Limitations

Requires significant distributed computing resources for execution
Output quality is limited by the inherent noise of automated web scraping
No fixed record count as the output size depends on the user-selected crawl
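Because scraped pairs are noisy, users typically add their own cleaning pass after extraction. The sketch below shows one simple possibility: normalizing whitespace, dropping captions that are too short or too long, and removing exact duplicates. The function name `filter_pairs`, the record layout, and the thresholds are all assumptions chosen for illustration; none of them come from Cc2Dataset.

```python
def filter_pairs(pairs, min_words=2, max_chars=500):
    """Drop low-information and duplicate pairs.

    `pairs` is assumed to be a list of dicts with "url" and "text" keys.
    The thresholds are illustrative defaults, not recommended values.
    """
    seen = set()
    kept = []
    for pair in pairs:
        # Collapse runs of whitespace so near-identical captions compare equal.
        text = " ".join(pair["text"].split())
        if len(text.split()) < min_words or len(text) > max_chars:
            continue
        key = (pair["url"], text.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append({**pair, "text": text})
    return kept
```

Heavier filtering, such as language detection or similarity scoring between media and text, would build on the same pattern but requires additional libraries.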

Provenance

Source: Common Crawl
Collection Method: scraped
Freshness: Last updated December 2023; depends on the currency of the Common Crawl source used.
Geography: Global

This is a tool for dataset generation; users must manage the infrastructure for downloading and processing Common Crawl archives.
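As a rough picture of what managing that infrastructure involves, the sketch below splits a list of archive file paths into per-worker shards so that each worker can download and process its own subset. This is a generic round-robin scheme under stated assumptions, not a description of how Cc2Dataset schedules work; `shard_paths` and the example paths are hypothetical.

```python
def shard_paths(paths, num_workers):
    """Assign archive paths to workers round-robin.

    Returns a list of `num_workers` lists; shard sizes differ by at most one.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [paths[i::num_workers] for i in range(num_workers)]


# Placeholder file names, for illustration only.
example_paths = ["part-0.gz", "part-1.gz", "part-2.gz", "part-3.gz", "part-4.gz"]
shards = shard_paths(example_paths, 2)
```

Each shard can then be handed to a separate process or machine, which keeps the workers independent of one another.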

Related Topics

Multimodal, Big Data

Quality Score

D23

Community

320 likes

0 views

Dataset Info

License: MIT
Author: rom1504
Created: Nov 29, 2022
Updated: Dec 9, 2023
Language: Python
Last synced: Jul 8, 2026
