Description
500 six-panel comic strips generated with OpenAI's gpt-image-1, totaling 3,000 images. Each strip is paired with structured metadata including art style, a recurring protagonist, and a caption for every panel. The dataset was created by baulab to study spatial grounding in vision-language models, specifically tracking attention across multi-panel images.
Use Cases
Benchmarking spatial grounding in vision-language models, using the fixed six-panel layout as the ground-truth region structure.
Training models to associate each per-panel caption with its corresponding image region, using the provided metadata.
Analyzing how model attention tracks narrative progression across sequential comic panels.
Studying the relationship between generated art styles and textual descriptions in a controlled, structured format.
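The grounding and attention use cases above both depend on mapping each panel to a region of the strip. The following is a minimal sketch of that mapping, not the dataset authors' method. It assumes the six panels form a regular 2-row by 3-column grid in reading order, which the listing does not confirm; the function names and the 1536x1024 example size are hypothetical.

```python
from typing import List, Sequence, Tuple

# (left, top, right, bottom), with right and bottom exclusive
Box = Tuple[int, int, int, int]


def panel_boxes(width: int, height: int, rows: int = 2, cols: int = 3) -> List[Box]:
    """Split a width x height area into rows * cols panels in reading order.

    ASSUMPTION: panels form a regular grid; the real layout may differ.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def pair_captions(captions: Sequence[str], boxes: Sequence[Box]) -> List[Tuple[str, Box]]:
    """Pair each per-panel caption with its assumed image region."""
    if len(captions) != len(boxes):
        raise ValueError("expected one caption per panel")
    return list(zip(captions, boxes))


def attention_mass_per_panel(
    attn: Sequence[Sequence[float]], rows: int = 2, cols: int = 3
) -> List[float]:
    """Sum a 2D attention map (a grid of patch scores) inside each panel."""
    height = len(attn)
    width = len(attn[0]) if height else 0
    masses = []
    for left, top, right, bottom in panel_boxes(width, height, rows, cols):
        total = sum(
            attn[y][x] for y in range(top, bottom) for x in range(left, right)
        )
        masses.append(total)
    return masses


# Example: regions for a hypothetical 1536x1024 strip image.
boxes = panel_boxes(1536, 1024)

# Example: a 4x6 patch grid with uniform attention gives equal mass per panel.
uniform = [[1] * 6 for _ in range(4)]
masses = attention_mass_per_panel(uniform)
```

Comparing the per-panel masses against the panel a prompt refers to gives a simple grounding score. If the strips do not follow a regular grid, the boxes would need to come from annotation or panel detection instead.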
Strengths
Contains 500 unique comic strips, resulting in 3,000 individual images.
Each strip is paired with structured metadata including art style, protagonist, and per-panel captions.
Designed for a specific research purpose: studying spatial grounding in multi-panel contexts.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported in the dataset metadata; the 500-strip and 3,000-image figures come from the description and should be confirmed after download.
Data may reflect bias inherent to the specific generative model (gpt-image-1) used for creation.
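Because column-level documentation is absent, a first step after download is to inspect the records directly. The sketch below summarizes which fields appear and what value types they hold, and flags records that lack six captions. The field names `art_style`, `protagonist`, and `captions` and the sample records are hypothetical stand-ins; the real column names must be read from the downloaded data.

```python
from typing import Any, Dict, Iterable, List, Set


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names seen for it."""
    summary: Dict[str, Set[str]] = {}
    for record in records:
        for name, value in record.items():
            summary.setdefault(name, set()).add(type(value).__name__)
    return summary


def find_caption_problems(
    records: Iterable[Dict[str, Any]],
    caption_field: str = "captions",  # hypothetical column name
    expected: int = 6,
) -> List[int]:
    """Return indices of records without exactly `expected` captions."""
    bad = []
    for i, record in enumerate(records):
        captions = record.get(caption_field)
        if not isinstance(captions, list) or len(captions) != expected:
            bad.append(i)
    return bad


# Hypothetical records for illustration only; not taken from the dataset.
SAMPLE_RECORDS = [
    {
        "art_style": "watercolor",
        "protagonist": "a red fox",
        "captions": ["p1", "p2", "p3", "p4", "p5", "p6"],
    },
    {
        "art_style": "pixel art",
        "protagonist": "a robot",
        "captions": ["p1", "p2"],  # deliberately incomplete
    },
]

field_summary = summarize_fields(SAMPLE_RECORDS)
problem_rows = find_caption_problems(SAMPLE_RECORDS)
```

Running the same two checks over the full download would also settle the row count, since the number of records iterated is the number of strips.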
Provenance
Source
Hugging Face
Collection Method
Generated with OpenAI's gpt-image-1 model.
Freshness
Last updated 2026-06-11 06:20:00; freshness should be verified.
License
Unknown; verify before use.