Multi-granularity representation learning with vision Mamba for infrared small target detection

Heterogeneous environments and a low Signal-to-Clutter Ratio (SCR) make Infrared Small Target Detection (IRSTD) challenging. Convolutional Neural Networks (CNNs) are limited in their global view, while Transformers, with their quadratic computational complexity, struggle with local feature refinement. Inspired by...

Bibliographic Details
Main Authors: Yongji Li, Luping Wang, Shichao Chen
Format: Article
Language: English
Published: Elsevier 2025-08-01
Series: International Journal of Applied Earth Observations and Geoinformation
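Subjects: Infrared small target detection; State space model; Vision Mamba; Nested contextual pyramid; Asymmetric convolution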
Online Access: http://www.sciencedirect.com/science/article/pii/S1569843225002924
author Yongji Li
Luping Wang
Shichao Chen
collection DOAJ
description Heterogeneous environments and a low Signal-to-Clutter Ratio (SCR) make Infrared Small Target Detection (IRSTD) challenging. Convolutional Neural Networks (CNNs) are limited in their global view, while Transformers, with their quadratic computational complexity, struggle with local feature refinement. Inspired by the quad-directional scanning State Space Model (SSM) with linear complexity for long-range modeling, this research reconceptualizes the spatial and structural information of small targets in IR images. Multi-granularity features and long-range dependencies of small targets are considered simultaneously. Specifically, we tailor a nested structure with cross-fertilization of global and local information. Each layer of the top-level pyramid network embeds a tiny, well-configured contextual pyramid block to extract fine-grained features of small targets. The subsequent Mamba module restructures the feature maps to derive coarse-grained features of “visual sentences”. The fusion of contextual information and local features achieves precise localization of small targets. Furthermore, we propose Asymmetric Convolution (AConv) to replace the Depthwise Convolution (DWConv) in the Visual State Space (VSS) module and the regular convolution in each lateral connection of the nested pyramid network, reducing the parameter count and computation. Both qualitative and quantitative experiments demonstrate that our proposed model outperforms 12 recent baseline methods on two public datasets.
format Article
id doaj-art-8c6874fee71b41f49d586c5ff739b80c
institution Matheson Library
issn 1569-8432
language English
publishDate 2025-08-01
publisher Elsevier
record_format Article
series International Journal of Applied Earth Observations and Geoinformation
spelling doaj-art-8c6874fee71b41f49d586c5ff739b80c
2025-07-06T04:23:20Z
eng
Elsevier
International Journal of Applied Earth Observations and Geoinformation
1569-8432
2025-08-01
142
104645
Multi-granularity representation learning with vision Mamba for infrared small target detection
Yongji Li 0
Luping Wang 1
Shichao Chen 2
School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China
School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China; Corresponding authors.
School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China; Corresponding authors.
http://www.sciencedirect.com/science/article/pii/S1569843225002924
Infrared small target detection
State space model
Vision Mamba
Nested contextual pyramid
Asymmetric convolution
title Multi-granularity representation learning with vision Mamba for infrared small target detection
topic Infrared small target detection
State space model
Vision Mamba
Nested contextual pyramid
Asymmetric convolution
url http://www.sciencedirect.com/science/article/pii/S1569843225002924