Name: LLaVAR Visual Instruction Data with Text-Rich Images
Creator: SALT-NLP
Published: 2023-07-06T00:03:43
Keywords: Multimodal Ai, Large Language Model, Natural Language Processing, Text Rich Images, Visual Instruction Tuning, Multimodal

Description

LLaVAR provides a collection of 422,000 pretraining and 16,000 to 20,000 instruction-following data pairs for training multimodal AI models. Created by SALT-NLP, this dataset enhances visual instruction tuning by focusing on images containing text. The dataset was released and last updated in July 2023.

Use Cases

Train a vision-language model to answer questions about text content within images using the instruction-following data pairs.
Fine-tune a model for OCR-based visual reasoning tasks using the 422K pretraining samples derived from LAION with OCR results.
Benchmark model performance on interpreting and reasoning about text-rich visual scenes using the high-quality GPT-4 generated instructions.
Expand a model's capability to follow complex, language-only instructions that refer to visual text elements present in the associated images.

Strengths

Contains 422,000 pretraining data samples for foundational model training.
Includes 16,000 high-quality instruction-following data pairs generated with GPT-4.
Offers an expanded finetuning set of 20,000 samples for greater diversity.

Limitations

Specific image sources, demographics, and geographic coverage are not detailed, which may introduce bias.
The dataset's construction relies on GPT-4 for instruction generation, which may propagate any biases present in that model.
The total number of unique images and their resolution or quality metrics are not provided.

Provenance

Source: Sourced from the LAION dataset and augmented by the SALT-NLP team.
Collection Method: Pretraining data collected based on OCR results from LAION; finetuning data generated by interacting with language-only GPT-4 to create instruction-following pairs.
Freshness: Last updated in July 2023.

The full dataset description and access details are on the Hugging Face dataset page; users must visit the provided link for complete information. License information is not specified in the provided input.

Multimodal Multimodal Ai Large Language Model Natural Language Processing Text Rich Images Visual Instruction Tuning

LLaVAR Visual Instruction Data with Text-Rich Images

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info