Evaluating the Robustness of Foundation Models for Satellite Imagery

Bibliographic Details
Main Authors: Gilbert Rotich, Sudeep Sarkar
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11039825/
Description
Summary: With abundant remote sensing data, satellite imagery foundation models have advanced significantly. Currently, the usefulness of the majority of these models is demonstrated in their respective works on only one or two downstream tasks, specifically classification and segmentation, using limited datasets that lack complexity in scene diversity, resolution, and size. Consequently, generalization across varied tasks and imagery remains unsolved and a major challenge in remote sensing. To address this gap, we benchmarked eight state-of-the-art foundation models, comprising five self-supervised variants of MAE (SatMAE, Scale-MAE, Cross-Scale MAE) and the Geospatial Foundation Model, alongside a supervised Multi-Task Pretraining (MTP) model, all built upon Vision Transformer (ViT) backbones. Each model was pretrained on one of four large collections (fMoW, SAMRS, Million-AID, GeoPile) and then used to initialize networks for four downstream tasks, namely oriented object detection, classification, segmentation, and change detection, across three diverse datasets (DOTA-v2.0, BigEarthNet, and xBD). In every case, the full network underwent end-to-end fine-tuning. This study aims to provide valuable insights to guide the development of more generalizable models and fine-tuning protocols, advancing the capabilities of foundation models for diverse remote sensing applications. Results show that performance is task-dependent: while the foundation models match or exceed supervised baselines in image classification and segmentation, they struggle with other tasks such as change detection and oriented object detection. Cross-dataset performance is further hampered by variations in the types of images provided by different satellites. Importantly, the quality of the pretraining data, namely its spatial resolution and scene diversity, drives downstream success far more than sheer dataset size (the number of image samples seen during training), as evidenced by SAMRS outperforming larger image collections such as GeoPile and fMoW. Finally, scaling from ViT-Base to ViT-Large yields less than a 1% improvement, underscoring the inefficiency of naive model enlargement. Together, these insights highlight the critical need for task-agnostic architectures and richly diverse pretraining strategies to fully realize the promise of self-supervised learning in remote sensing.
ISSN:2169-3536
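
As a concrete illustration of the end-to-end fine-tuning protocol described in the summary (a pretrained ViT backbone initializing a downstream network, with no layers frozen), the sketch below shows what one such setup might look like in PyTorch with timm. The checkpoint filename, the specific timm model name, and the 19-label multi-label classification head (loosely modeled on BigEarthNet's 19-class nomenclature) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of end-to-end fine-tuning: a ViT backbone initialized from
# self-supervised pretraining weights, topped with a fresh task head, with
# ALL parameters left trainable (no freezing).
import torch
import torch.nn as nn
import timm

NUM_CLASSES = 19  # assumption: BigEarthNet-style 19-label multi-label taxonomy

# Build a ViT-Base backbone; num_classes=0 strips timm's default classifier
# so the model outputs pooled features instead of logits.
backbone = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Load pretrained encoder weights (path is a placeholder, not a real release).
# strict=False because MAE-style checkpoints also contain decoder keys that
# have no counterpart in the encoder-only backbone.
state = torch.load("satmae_vitb_pretrained.pth", map_location="cpu")
backbone.load_state_dict(state, strict=False)

# Attach a new task-specific head; nothing is frozen, so every parameter,
# backbone and head alike, receives gradients (end-to-end fine-tuning).
model = nn.Sequential(backbone, nn.Linear(backbone.num_features, NUM_CLASSES))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.BCEWithLogitsLoss()  # multi-label classification loss

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """One fine-tuning step over the full network (backbone + head)."""
    optimizer.zero_grad()
    logits = model(images)          # (batch, NUM_CLASSES)
    loss = criterion(logits, targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same pattern extends to the other tasks in the benchmark by swapping the head (e.g., a segmentation decoder or an oriented-detection head) while keeping the pretrained ViT encoder and the fully trainable setup unchanged.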