This dataset contains 10,921 high-resolution remote sensing images collected from satellite imagery sources, each paired with 5 descriptive natural language captions. It covers 30 distinct scene categories, including airports, bridges, and residential areas, for a total of 54,605 caption-image pairs (10,921 images × 5 captions).
Use Cases
- Train image captioning models to generate descriptive text based on visual features in remote sensing imagery
- Develop cross-modal retrieval systems to identify specific satellite images using natural language queries
- Fine-tune vision-language models for scene classification across the 30 provided land-use categories
- Benchmark automated text-to-image synthesis for geographic and environmental monitoring contexts
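As a rough sketch of how the captioning and retrieval use cases above might consume the data, the snippet below flattens per-image records into individual (image, caption) pairs. The record layout (the `filename`, `category`, and `captions` keys), the file names, and the placeholder captions are assumptions made for illustration; they are not the dataset's documented schema.

```python
from typing import Dict, List, Tuple

# Hypothetical records: one dict per image, holding its 5 captions.
# Key names, file names, and caption text are illustrative assumptions.
records: List[Dict] = [
    {
        "filename": "airport_1.jpg",
        "category": "airport",
        "captions": [f"placeholder airport caption {i}" for i in range(1, 6)],
    },
    {
        "filename": "beach_1.jpg",
        "category": "beach",
        "captions": [f"placeholder beach caption {i}" for i in range(1, 6)],
    },
]


def flatten_pairs(records: List[Dict]) -> List[Tuple[str, str]]:
    """Expand each image record into one (filename, caption) pair per caption."""
    pairs = []
    for rec in records:
        for caption in rec["captions"]:
            pairs.append((rec["filename"], caption))
    return pairs


pairs = flatten_pairs(records)
print(len(pairs))  # 2 images x 5 captions each
```

Applied to the full dataset, the same expansion would be expected to yield one pair per caption, which is one way to feed a captioning or retrieval training loop that samples a single caption at a time.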
Strengths
- 10,921 remote sensing images sourced from Google Earth, Baidu Map, MapABC, and Tianditu
- 54,605 natural language captions providing 5 unique descriptions per image
- 30 distinct scene categories including 'airport', 'playground', 'viaduct', and 'beach'
- Standardized image dimensions of 224x224 pixels for consistent model training
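The figures listed above are easy to check after download. The following is a minimal sketch of such a sanity check, assuming the annotations have already been loaded into a list of dicts with hypothetical `filename`, `category`, and `captions` keys (an assumed layout, not a documented one). It only compares counts against the numbers stated in this card.

```python
from typing import Dict, List

# Figures taken from this card.
EXPECTED_IMAGES = 10_921
CAPTIONS_PER_IMAGE = 5
EXPECTED_CATEGORIES = 30
EXPECTED_PAIRS = EXPECTED_IMAGES * CAPTIONS_PER_IMAGE  # 54,605


def summarize(records: List[Dict]) -> Dict:
    """Count images, caption pairs, and categories in the loaded records."""
    return {
        "images": len(records),
        "pairs": sum(len(r["captions"]) for r in records),
        "categories": len({r["category"] for r in records}),
        # Files whose caption count differs from the expected 5.
        "irregular": [
            r["filename"]
            for r in records
            if len(r["captions"]) != CAPTIONS_PER_IMAGE
        ],
    }


def matches_card(summary: Dict) -> bool:
    """Return True only if every count agrees with the card's figures."""
    return (
        summary["images"] == EXPECTED_IMAGES
        and summary["pairs"] == EXPECTED_PAIRS
        and summary["categories"] == EXPECTED_CATEGORIES
        and not summary["irregular"]
    )


# Tiny hypothetical sample, far smaller than the real dataset.
sample = [
    {"filename": "viaduct_1.jpg", "category": "viaduct", "captions": ["c"] * 5},
    {"filename": "beach_1.jpg", "category": "beach", "captions": ["c"] * 4},
]
sample_summary = summarize(sample)
print(sample_summary["irregular"])
print(matches_card(sample_summary))
```

A mismatch reported by `matches_card` would point to an incomplete download or a different annotation layout than the one assumed here.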