A Remote Sensing Vision-Language Foundation Model for Zero-Shot Tasks
Abstract
Foundation models have revolutionized AI, particularly in vision-language tasks, achieving strong performance across domains. Despite these advances, remote sensing (RS) remains underserved due to the lack of large-scale image-text datasets. This paper addresses the gap by introducing two novel datasets, RS-WebLI and Google Maps, designed specifically for training remote sensing vision-language models (RS-VLMs).
The RS-WebLI dataset leverages web images filtered for RS relevance, enriched with high-quality captions derived from their associated alt-text. The Google Maps dataset uses Gemini, a multimodal large language model, to generate accurate and descriptive captions by aligning Google Maps data with high-resolution satellite and aerial imagery. Together, the two datasets comprise roughly 20M image-text pairs covering a diverse array of remote sensing objects and contexts, forming a robust foundation for RS-specific tasks.
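The following is a minimal, hypothetical sketch of how such image-text pairs could be assembled. The record fields, the `rs_relevance` scorer, and the `caption_with_gemini` call are illustrative assumptions, not the actual data pipeline.

```python
# Hypothetical sketch of pairing RS images with text from two sources:
# filtered web images with alt-text, and map tiles captioned by a multimodal LLM.
from dataclasses import dataclass

@dataclass
class ImageTextPair:
    image_path: str
    caption: str
    source: str  # "rs-webli" or "google-maps"

def build_webli_pairs(web_records, rs_relevance, threshold=0.5):
    """Keep web images whose content looks like remote sensing imagery and
    reuse their (cleaned) alt-text as the caption."""
    return [
        ImageTextPair(r["image_path"], r["alt_text"].strip(), "rs-webli")
        for r in web_records
        if r.get("alt_text") and rs_relevance(r["image_path"]) >= threshold
    ]

def build_maps_pairs(tiles, caption_with_gemini):
    """Ask a multimodal LLM to describe each aerial tile, conditioning the
    prompt on the co-registered map features (roads, buildings, land use)."""
    return [
        ImageTextPair(t["tile_path"],
                      caption_with_gemini(t["tile_path"], t["map_features"]),
                      "google-maps")
        for t in tiles
    ]

# Tiny usage example with stand-in inputs.
pairs = build_webli_pairs(
    [{"image_path": "img_001.jpg", "alt_text": "Aerial view of a container port"}],
    rs_relevance=lambda path: 0.9,  # stand-in relevance scorer
)
print(pairs)
```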
We fine-tuned MaMMUT, a state-of-the-art (SOTA) vision-language model, on these datasets. The model employs a contrastive learning framework, enabling robust zero-shot capabilities, and the MaMMUT architecture additionally incorporates a generative loss component, further enhancing its adaptability. To evaluate the model's zero-shot performance, we used two main methods. The first, zero-shot classification, tests the model's ability to classify a remote sensing image into a pre-defined set of classes without training directly on the target dataset. For this task we use the following RS image classification datasets: Functional Map of the World (FMOW), RESISC45, UCM Classification and SkyScript classification. For every dataset, we compose a set of prompts of the form "An aerial image of {class name}" and use a simple nearest-neighbor search over the embeddings to find the best-matching class for every image; the metric is top-1 accuracy. The second evaluation method is zero-shot retrieval. For this task, we use the following remote sensing image-caption datasets: NWPU RESISC, UCM Captions, RSITMD and RSICD. As in zero-shot classification, we use nearest-neighbor search on the model's output embeddings, here to match every image with its corresponding captions. Following other works in the field, we report the average of the top-1, top-5 and top-10 recall scores.
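A minimal sketch of both evaluation protocols is given below. The embeddings are assumed to come from the fine-tuned model's image and text encoders; here they are replaced by synthetic arrays so the script runs stand-alone, and the class count and prompt format are only illustrative.

```python
# Zero-shot classification (top-1 accuracy) and image-to-text retrieval
# (mean of recall@1/5/10) via cosine-similarity nearest neighbors.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_embeds, class_text_embeds):
    """One text embedding per class, built from prompts such as
    "An aerial image of {class name}"; return the nearest class per image."""
    sims = normalize(image_embeds) @ normalize(class_text_embeds).T
    return sims.argmax(axis=1)

def top1_accuracy(preds, labels):
    return float((preds == labels).mean())

def recall_at_k(image_embeds, caption_embeds, k):
    """Image-to-text retrieval where caption i is the ground truth for image i."""
    sims = normalize(image_embeds) @ normalize(caption_embeds).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic stand-ins for model outputs (replace with real embeddings).
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 512))
cls_txt = rng.normal(size=(45, 512))               # e.g. 45 class prompts
cap_txt = img + 0.1 * rng.normal(size=img.shape)   # paired captions
labels = rng.integers(0, 45, size=100)

print("top-1 accuracy:", top1_accuracy(zero_shot_classify(img, cls_txt), labels))
print("mean recall:", np.mean([recall_at_k(img, cap_txt, k) for k in (1, 5, 10)]))
```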
The study also evaluates supervised learning regimes, in which the VLMs are fine-tuned on task-specific datasets such as FMOW and FloodNet. These models outperform traditional masked-image models, showcasing the advantage of vision-language pre-training for RS applications. To assess generalization, we introduce the Google Maps Hold-out dataset, which excludes specific object types from training. Results indicate that the model recognizes these unseen objects well, validating its versatility.
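A hedged sketch of a hold-out split in the spirit of this evaluation is shown below: pairs whose captions mention held-out object types are removed from training and used only at evaluation time. The specific object types listed are hypothetical examples, not the actual held-out set.

```python
# Split image-text pairs so that held-out object types never appear in training.
HELD_OUT = {"wind turbine", "solar farm", "helipad"}  # hypothetical examples

def split_holdout(pairs, held_out=HELD_OUT):
    """Return (train, eval) lists: a pair goes to eval if its caption
    mentions any held-out object type."""
    train, held = [], []
    for image_path, caption in pairs:
        bucket = held if any(t in caption.lower() for t in held_out) else train
        bucket.append((image_path, caption))
    return train, held

pairs = [("tile_001.png", "An aerial image of a wind turbine near a road"),
         ("tile_002.png", "An aerial image of a residential neighborhood")]
train, held = split_holdout(pairs)
print(len(train), "train pairs;", len(held), "held-out pairs")
```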
This work establishes a comprehensive framework for developing RS-VLMs, addressing dataset limitations and model scalability. It sets a precedent for leveraging foundation models in RS, paving the way for enhanced zero-shot and fine-tuned applications in remote sensing analytics. Future directions include expanding dataset diversity and exploring advanced architectures to further push the boundaries of RS vision-language understanding.