Name: Filtered Wikipedia Image Text Dataset
Creator: laion
Published: 2022-03-02T23:29:22
Keywords: Library: polars, Library: dask, Size category: 1M < n < 10M, Modality: text, Modality: tabular, Library: mlcroissant, Modality: image, Library: datasets, Parquet, arXiv:2103.00020, Region: US

Description

Filtered WIT is an image-text dataset derived from the Wikipedia Image Text (WIT) dataset and packaged as tar archives of 10,000 samples each. Each sample consists of a .jpg image, a .txt caption, and a .json metadata file. The dataset is provided by LAION and was last updated in January 2022.

Use Cases

Train multimodal models using the .jpg image and corresponding .txt caption for image-to-text or text-to-image tasks.
Analyze the relationship between image content and descriptive text by processing the .txt captions and .jpg files.
Extract and utilize metadata from the .json files for filtering or augmenting image-text pairs based on specific attributes.
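
The use cases above all start from the same step: reading a tar archive and regrouping its .jpg, .txt, and .json files into samples. The following is a minimal sketch of that step using only the Python standard library. It assumes, without confirmation from the dataset page, that the three files of a sample share a base name and sit next to each other in the archive. The function names and the metadata key used in the usage note are hypothetical.

```python
import json
import tarfile


def iter_samples(tar_path):
    """Yield (key, parts) per sample; parts maps extension -> raw bytes.

    Assumption: the files of one sample are adjacent in the archive and
    share a base name, e.g. 00001.jpg, 00001.txt, 00001.json.
    """
    current_key, parts = None, {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, dot, ext = member.name.rpartition(".")
            if not dot:
                continue  # skip files without an extension
            if key != current_key:
                if current_key is not None:
                    yield current_key, parts
                current_key, parts = key, {}
            parts[ext.lower()] = tar.extractfile(member).read()
    if current_key is not None:
        yield current_key, parts


def decode_sample(parts):
    """Turn raw bytes into (image_bytes, caption, metadata)."""
    caption = parts["txt"].decode("utf-8").strip()
    metadata = json.loads(parts["json"])
    return parts["jpg"], caption, metadata


def filtered_pairs(tar_path, predicate):
    """Yield (image_bytes, caption) for samples whose metadata passes predicate."""
    for _key, parts in iter_samples(tar_path):
        if not {"jpg", "txt", "json"} <= parts.keys():
            continue  # incomplete sample
        image, caption, metadata = decode_sample(parts)
        if predicate(metadata):
            yield image, caption
```

For example, filtered_pairs("shard.tar", lambda m: m.get("width", 0) >= 256) would keep only pairs whose metadata reports a width of at least 256, if the .json files happen to contain a "width" key. The image is returned as raw bytes so that any decoding library can be applied afterwards.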

Strengths

Each tar file holds 10,000 samples, giving a predictable shard size for batch processing.
Includes three distinct file types (.jpg, .txt, .json) per sample, offering multiple data modalities.
Derived from the established Wikipedia Image Text (WIT) dataset, indicating a known provenance.

Limitations

Exact total row count and dataset size are not stated; the size-category tag only indicates between 1 million and 10 million rows, which makes the overall scale hard to assess.
Data structure and specific column schema are not detailed, requiring inspection of the source script or files.
Last update was in January 2022, which may result in temporal staleness for current applications.
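
Since the schema is not documented, one way to inspect it is to tally the top-level keys found in the .json files of a single archive. The sketch below does that with the standard library only. It assumes each .json file holds one JSON object, and the function name is hypothetical. The same pass also counts the .json files it read, which gives a per-archive sample count.

```python
import json
import tarfile
from collections import Counter


def summarize_metadata_keys(tar_path, limit=None):
    """Count how often each top-level key appears in the .json files of a tar.

    Returns (number_of_json_files_read, Counter of key -> occurrences).
    Pass limit to stop after that many .json files.
    """
    key_counts = Counter()
    seen = 0
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not (member.isfile() and member.name.lower().endswith(".json")):
                continue
            record = json.loads(tar.extractfile(member).read())
            if isinstance(record, dict):
                key_counts.update(record.keys())
            seen += 1
            if limit is not None and seen >= limit:
                break
    return seen, key_counts
```

Keys whose count equals the number of files read are present in every sample; keys with lower counts are optional and need a default when filtering.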

Provenance

Source: Derived from dalle-mini/wit, which is based on the Wikipedia Image Text (WIT) dataset.
Collection Method: Data was filtered and packaged into tars using a script; metadata is stored in parquet files.
Freshness: 2022-01-29

Because samples are packed in tar archives, users must extract or stream the archives themselves. The full description and specifics are on the Hugging Face dataset page. License information is unknown.
