Description
The cc2dataset tool extracts multimodal pairs (image-text, audio-text, and video-text) from the Common Crawl web archive. Developed by rom1504 and last updated in 2023, it provides a pipeline that converts raw web documents into structured caption-media datasets. It is designed for web-scale workloads in which each media item is paired with its surrounding document context.
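To make the idea of pairing media with its document context concrete, here is a minimal sketch in Python that pulls image URL and alt-text pairs out of a single HTML page using only the standard library. It is an illustration of the general technique, not the tool's actual implementation or API: the names `ImageAltExtractor` and `extract_image_text_pairs`, the sample HTML, and the choice of alt text as the caption source are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageAltExtractor(HTMLParser):
    """Hypothetical helper: collect (image URL, alt text) pairs from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # Valueless or missing alt attributes come back as None.
        alt = (attr_map.get("alt") or "").strip()
        # Keep only images that have both a source and a non-empty caption.
        if src and alt:
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_image_text_pairs(html, base_url):
    """Return a list of (absolute image URL, caption) tuples for one page."""
    parser = ImageAltExtractor(base_url)
    parser.feed(html)
    return parser.pairs


# Assumed sample page: one captioned image and one uncaptioned spacer.
sample_html = (
    '<p>Our cat</p>'
    '<img src="/img/cat.jpg" alt="A tabby cat on a sofa">'
    '<img src="spacer.gif">'
)
pairs = extract_image_text_pairs(sample_html, "https://example.com/pets/")
print(pairs)
```

Images without usable text are dropped here, which is one simple way to reduce noise. A real web-scale pipeline would run a step like this across many archive files in parallel and would need its own rules for audio and video links.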
Use Cases
Training vision-language models using extracted image-text pairs
Developing speech-to-text systems using audio-text document associations
Building video understanding models using video-text pairs
Strengths
Supports multimodal extraction including audio-text and video-text pairs
Utilizes Common Crawl as a primary source for web-scale data
MIT license allows for open commercial and research use
Limitations
Requires significant distributed computing resources for execution
Output quality is limited by the inherent noise of automated web scraping
No fixed record count, since the output size depends on the user-selected crawl
Provenance
Source
Common Crawl
Collection Method
Scraped
Freshness
Tool last updated December 2023; the freshness of the output depends on how recent the selected Common Crawl snapshot is.
Geography
Global
This is a tool for dataset generation; users must manage the infrastructure for downloading and processing Common Crawl archives.