Name: Furry E621 SFW 7M HQ: 6.92 Million SFW Image Captions for AI Training
Creator: CaptionEmporium
Published: 2024-03-12T16:35:49
Keywords: Image Captions, Furry Art, Text Generation, Multimodal AI, Computer Vision, Multimodal

Description

CaptionEmporium provides 6.92 million captions for safe-for-work images from the e621/e926 platform, covering content up to January 2023. The captions were generated by a large language model (mistralai/Mistral-7B-v0.1) and a multimodal model (THUDM/CogVLM), with 8 LLM captions and 1 CogVLM caption per image. Most captions are described as substantially longer than 77 tokens.

Use Cases

Training image captioning models based on the large volume of descriptive text.
Fine-tuning text-to-image generative models on SFW furry art paired with these captions.
Benchmarking the quality of AI-generated captions against the provided LLM and CogVLM outputs.
Studying the characteristics of long-form captions that exceed the 77-token length noted in the description.
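
As a rough illustration of the last use case, the sketch below estimates what share of a caption sample exceeds a token budget. It is a minimal sketch under stated assumptions: the whitespace tokenizer is a crude stand-in (a real study would pass in the tokenizer of the target model), and the function names and sample captions are hypothetical, not taken from the dataset.

```python
from typing import Callable, Iterable, List

# 77 is the token length mentioned in the description; kept as a parameter.
TOKEN_LIMIT = 77


def whitespace_tokenize(text: str) -> List[str]:
    """Crude stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def long_caption_share(
    captions: Iterable[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    limit: int = TOKEN_LIMIT,
) -> float:
    """Return the fraction of captions whose token count exceeds `limit`."""
    captions = list(captions)
    if not captions:
        return 0.0
    over = sum(1 for caption in captions if len(tokenize(caption)) > limit)
    return over / len(captions)


# Hypothetical sample: one short caption and one 100-word caption.
sample = ["a fox sitting on a log", " ".join(["word"] * 100)]
share = long_caption_share(sample)
print(f"{share:.0%} of sampled captions exceed {TOKEN_LIMIT} tokens")
```

Swapping `tokenize` for a subword tokenizer will generally raise the counts, since subword tokenizers tend to split words into several tokens.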

Strengths

Large scale with 6.92 million captions.
Includes multiple AI-generated caption perspectives (8 LLM + 1 CogVLM per image).
Captions are noted to be substantially longer than 77 tokens, providing detailed descriptions.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
The dataset is derived from a specific niche community (e621/e926), which may reflect source bias.
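
Because column-level documentation is absent, a first inspection pass after download might summarize which fields appear and what they hold. The following is a minimal sketch, assuming the records can be read as dictionaries; the field names `id`, `caption`, and `tags` are invented for illustration and are not the dataset's documented schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Collect observed value types, occurrence count, and max length per field."""
    summary: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "count": 0, "max_len": 0}
    )
    for record in records:
        for name, value in record.items():
            info = summary[name]
            info["types"].add(type(value).__name__)
            info["count"] += 1
            # Length is only meaningful for strings and sequences.
            if isinstance(value, (str, list, tuple)):
                info["max_len"] = max(info["max_len"], len(value))
    return dict(summary)


# Hypothetical rows (assumption): real field names must be checked after download.
rows = [
    {"id": 1, "caption": "a red fox", "tags": ["fox", "red"]},
    {"id": 2, "caption": "a wolf howling at night", "tags": []},
]
report = summarize_fields(rows)
for field, info in report.items():
    print(field, sorted(info["types"]), info["count"], info["max_len"])
```

A field that shows more than one type, or a count lower than the number of rows, is a hint that its semantics need a closer manual look.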

Provenance

Source: e621/e926 platform (safe-for-work split).
Collection Method: Captions were generated with an LLM (mistralai/Mistral-7B-v0.1) and a custom multilabel classifier, along with CogVLM (THUDM/CogVLM).
Time Range: Extends to January 2023.
Freshness: Last updated 2024-03-21 01:26:16; freshness should be verified.

License is unknown; terms of use must be verified before application.
