Multi-granularity representation learning with vision Mamba for infrared small target detection
Main Authors: | Yongji Li, Luping Wang, Shichao Chen |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-08-01 |
Series: | International Journal of Applied Earth Observations and Geoinformation |
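Subjects: | Infrared small target detection; State space model; Vision Mamba; Nested contextual pyramid; Asymmetric convolution |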
Online Access: | http://www.sciencedirect.com/science/article/pii/S1569843225002924 |
author | Yongji Li, Luping Wang, Shichao Chen |
---|---|
collection | DOAJ |
description | Heterogeneous environments and a low Signal-to-Clutter Ratio (SCR) make Infrared Small Target Detection (IRSTD) challenging. Convolutional Neural Networks (CNNs) are constrained by their limited global view, while Transformers, whose computational complexity is quadratic, struggle with local feature refinement. Inspired by the quad-directional scanning State Space Model (SSM), which models long-range dependencies with linear complexity, this research reconceptualizes the spatial and structural information of small targets in IR images, considering multi-granularity features and the long-range dependencies of small targets simultaneously. Specifically, we tailor a nested structure that cross-fertilizes global and local information. Each layer of the top-level pyramid network embeds a tiny, well-configured contextual pyramid block to extract fine-grained features of small targets. The subsequent Mamba module restructures the feature maps to derive coarse-grained features of “visual sentences”. Fusing contextual information with local features achieves precise localization of small targets. Furthermore, we propose Asymmetric Convolution (AConv) to replace the Depthwise Convolution (DWConv) in the Visual State Space (VSS) module and the regular convolution in each lateral connection of the nested pyramid network, reducing parameters and computation. Both qualitative and quantitative experiments demonstrate that the proposed model outperforms 12 recent baseline methods on two public datasets. |
format | Article |
institution | Matheson Library |
issn | 1569-8432 |
language | English |
publishDate | 2025-08-01 |
publisher | Elsevier |
series | International Journal of Applied Earth Observations and Geoinformation |
affiliations | Yongji Li: School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China; Luping Wang (corresponding author): School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China; Shichao Chen (corresponding author): School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China |
volume | 142 (article 104645) |
title | Multi-granularity representation learning with vision Mamba for infrared small target detection |
topic | Infrared small target detection; State space model; Vision Mamba; Nested contextual pyramid; Asymmetric convolution |
url | http://www.sciencedirect.com/science/article/pii/S1569843225002924 |
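The abstract's "quad-directional scanning" appears to follow the cross-scan idea used in VMamba's Visual State Space (VSS) block: a 2D feature map is unrolled into four 1D sequences (row-wise, column-wise, and both reversed), each sequence is processed by an SSM in time linear in its length, and the four outputs are folded back and summed. The NumPy sketch below is an assumption-based illustration, not the authors' code. The helper names `cross_scan`, `cross_merge` and `toy_scan` are hypothetical, and `toy_scan` is a fixed scalar recurrence standing in for Mamba's input-dependent selective scan.

```python
import numpy as np


def cross_scan(x: np.ndarray) -> np.ndarray:
    """Unroll a (C, H, W) map into four sequences, returned as (4, C, H*W).

    Order (assumed, VMamba-style): row-major, column-major, then each reversed.
    """
    c, h, w = x.shape
    rows = x.reshape(c, h * w)                     # left->right, top->bottom
    cols = x.transpose(0, 2, 1).reshape(c, h * w)  # top->bottom, left->right
    return np.stack([rows, cols, rows[:, ::-1], cols[:, ::-1]])


def cross_merge(seqs: np.ndarray, h: int, w: int) -> np.ndarray:
    """Undo each scan order and sum the four sequences back into (C, H, W)."""
    c = seqs.shape[1]
    rows = seqs[0] + seqs[2][:, ::-1]              # row-major, both directions
    cols = seqs[1] + seqs[3][:, ::-1]              # column-major, both directions
    return rows.reshape(c, h, w) + cols.reshape(c, w, h).transpose(0, 2, 1)


def toy_scan(seq: np.ndarray, a: float = 0.9, b: float = 0.1) -> np.ndarray:
    """Linear-time recurrence h_t = a * h_{t-1} + b * x_t along the last axis.

    A fixed scalar stand-in, NOT Mamba's input-dependent selective scan.
    """
    out = np.empty_like(seq)
    state = np.zeros(seq.shape[:-1], dtype=seq.dtype)
    for t in range(seq.shape[-1]):                 # single pass: O(L) steps
        state = a * state + b * seq[..., t]
        out[..., t] = state
    return out


x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # toy (C, H, W) map
seqs = cross_scan(x)                    # shape (4, 2, 12)
y = cross_merge(toy_scan(seqs), 3, 4)   # back to shape (2, 3, 4)
```

Because every direction visits all H*W positions once, the cost grows linearly with image size, in contrast to the quadratic cost of full self-attention; after merging, each position carries context accumulated from four scan directions.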
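The parameter saving attributed to AConv can be illustrated by simple counting. This record does not specify the exact AConv design; the sketch below assumes, hypothetically, that it factorizes a k×k kernel into a 1×k convolution followed by a k×1 convolution (a common form of asymmetric convolution), with biases omitted. All function names are illustrative.

```python
def dwconv_params(channels: int, k: int) -> int:
    """k x k depthwise convolution: one k*k filter per channel (no bias)."""
    return channels * k * k


def asym_dwconv_params(channels: int, k: int) -> int:
    """Hypothetical asymmetric depthwise pair: 1 x k then k x 1, per channel."""
    return channels * k + channels * k


def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Regular k x k convolution mapping c_in to c_out channels (no bias)."""
    return c_in * c_out * k * k


def asym_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Hypothetical asymmetric pair: 1 x k (c_in -> c_out), then k x 1 (c_out -> c_out)."""
    return c_in * c_out * k + c_out * c_out * k


c, k = 64, 3
print(dwconv_params(c, k), asym_dwconv_params(c, k))    # 576 384
print(conv_params(c, c, k), asym_conv_params(c, c, k))  # 36864 24576
```

Both factorized forms keep a k×k receptive field, and for stride-1 layers the multiply-adds per output pixel scale with the same counts. For the depthwise case the ratio is 2k/k² = 2/k, so the saving is about one third for 3×3 kernels and grows with kernel size; the trade-off is that a 1×k, k×1 pair can only represent separable (rank-1) kernels per channel.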