DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

OBELICS: 141M Interleaved Image-Text Documents with 115B Tokens | DataSalon

Multimodal & LLM

Name: OBELICS: 141M Interleaved Image-Text Documents with 115B Tokens
Creator: huggingface
Published: 2023-06-05T18:26:19
License: Apache-2.0
Keywords: Machine Learning, Multimodal

by huggingface / huggingface · Updated 1y ago

Available on 1 platform

Description

OBELICS is a collection of 141 million interleaved image-text web documents containing 115 billion text tokens and 353 million images. Created by Hugging Face and last updated in August 2024, it is a large open-source resource for multimodal AI development.

Use Cases

Pre-training multimodal models using the 353M images and 115B text tokens
Training vision-language models to understand context within interleaved document structures
Benchmarking large-scale data processing pipelines on 141M web documents
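
To make "interleaved document structure" concrete, here is a minimal sketch of turning one document into a single training sequence. It assumes each document is stored as two parallel lists, where every slot holds either a text span or an image reference. That layout, the field names `texts` and `images`, the `<image>` placeholder, and the toy document are all assumptions for illustration and are not taken from this page.

```python
from typing import Optional

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker, not a documented token


def interleave(
    texts: list[Optional[str]], images: list[Optional[str]]
) -> tuple[str, list[str]]:
    """Merge parallel text/image slots into one sequence, preserving order.

    Assumes each position holds either a text span or an image reference.
    """
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same length")

    parts: list[str] = []
    image_refs: list[str] = []
    for text, image in zip(texts, images):
        if image is not None:
            # Keep the image's position in the text with a placeholder.
            parts.append(IMAGE_PLACEHOLDER)
            image_refs.append(image)
        elif text is not None:
            parts.append(text)
    return " ".join(parts), image_refs


# Toy document for illustration only (not real dataset content).
doc = {
    "texts": ["A short caption.", None, "More text after the image."],
    "images": [None, "https://example.com/cat.jpg", None],
}

sequence, refs = interleave(doc["texts"], doc["images"])
print(sequence)
print(refs)
```

Keeping the image references in a separate list lets a data loader fetch or decode images lazily while the text sequence records where each image belongs.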

Strengths

141 million documents
353 million images
115 billion text tokens
Apache-2.0 license

Limitations

Significant storage and compute requirements due to the 115B token scale
Inherent noise and potential biases common in web-scraped content
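
For a rough sense of scale, the headline figures above can be turned into per-document averages and a back-of-envelope storage estimate. This is only a sketch: the document, image, and token counts come from this page, but the average image size is a hypothetical placeholder and should be replaced with a measured value.

```python
# Figures stated on this page.
DOCUMENTS = 141_000_000
IMAGES = 353_000_000
TEXT_TOKENS = 115_000_000_000

# Hypothetical average size of one downloaded image (assumption, not from the page).
ASSUMED_BYTES_PER_IMAGE = 100 * 1024


def estimate_image_storage_tb(n_images: int, bytes_per_image: int) -> float:
    """Return estimated raw image storage in terabytes (1 TB = 10**12 bytes)."""
    return n_images * bytes_per_image / 1e12


tokens_per_doc = TEXT_TOKENS / DOCUMENTS
images_per_doc = IMAGES / DOCUMENTS
image_storage_tb = estimate_image_storage_tb(IMAGES, ASSUMED_BYTES_PER_IMAGE)

print(f"Average text tokens per document: {tokens_per_doc:.0f}")
print(f"Average images per document: {images_per_doc:.1f}")
print(f"Estimated image storage at the assumed size: {image_storage_tb:.1f} TB")
```

The averages work out to roughly 816 tokens and 2.5 images per document. The storage figure scales linearly with the assumed image size, so it is only as good as that assumption.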

Provenance

Source: Hugging Face
Collection Method: scraped and curated
Freshness: Last updated August 2024
Geography: global

Requires significant storage and compute resources to process 141M documents; licensed under Apache-2.0.

Related Topics

Multimodal, Machine Learning

Quality Score

D31

Community

211 likes

0 views

Dataset Info

License: Apache-2.0
Author: huggingface
Org: huggingface
Created: Jun 5, 2023
Updated: Aug 28, 2024
Language: Python
Last synced: Jun 23, 2026
