Multi-granularity representation learning with vision Mamba for infrared small target detection

Heterogeneous environments and a low Signal-to-Clutter Ratio (SCR) make Infrared Small Target Detection (IRSTD) challenging. Convolutional Neural Networks (CNNs) are limited in their global view, while Transformers, with their quadratic computational complexity, struggle with local feature refinement. Inspired by...

Bibliographic Details
Main Authors:	Yongji Li, Luping Wang, Shichao Chen
Format:	Article
Language:	English
Published:	Elsevier 2025-08-01
Series:	International Journal of Applied Earth Observations and Geoinformation
Subjects:	Infrared small target detection, State space model, Vision Mamba, Nested contextual pyramid, Asymmetric convolution
Online Access:	http://www.sciencedirect.com/science/article/pii/S1569843225002924

_version_	1839637920674217984
author	Yongji Li, Luping Wang, Shichao Chen
author_facet	Yongji Li, Luping Wang, Shichao Chen
author_sort	Yongji Li
collection	DOAJ
description	Heterogeneous environments and a low Signal-to-Clutter Ratio (SCR) make Infrared Small Target Detection (IRSTD) challenging. Convolutional Neural Networks (CNNs) are limited in their global view, while Transformers, with their quadratic computational complexity, struggle with local feature refinement. Inspired by the quad-directional scanning State Space Model (SSM), which models long-range dependencies with linear complexity, this research reconceptualizes the spatial and structural information of small targets in IR images. Multi-granularity features and long-range dependencies of small targets are considered simultaneously. Specifically, we tailor a nested structure that cross-fertilizes global and local information. Each layer of the top-level pyramid network embeds a tiny, well-configured contextual pyramid block to extract fine-grained features of small targets. The subsequent Mamba module restructures the feature maps to derive coarse-grained features of “visual sentences”. Fusing contextual information with local features achieves precise localization of small targets. Furthermore, we propose the Asymmetric Convolution (AConv) to replace the Depthwise Convolution (DWConv) in the Visual State Space (VSS) module and the regular convolution in each lateral connection of the nested pyramid network, reducing parameters and computation. Both qualitative and quantitative experiments demonstrate that our proposed model outperforms 12 recent baseline methods on two public datasets.
format	Article
id	doaj-art-8c6874fee71b41f49d586c5ff739b80c
institution	Matheson Library
issn	1569-8432
language	English
publishDate	2025-08-01
publisher	Elsevier
record_format	Article
series	International Journal of Applied Earth Observations and Geoinformation
spelling	doaj-art-8c6874fee71b41f49d586c5ff739b80c | 2025-07-06T04:23:20Z | eng | Elsevier | International Journal of Applied Earth Observations and Geoinformation | 1569-8432 | 2025-08-01 | 142 | 104645 | Multi-granularity representation learning with vision Mamba for infrared small target detection | Yongji Li [0], Luping Wang [1], Shichao Chen [2] | [0] School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China | [1] School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, China; Corresponding authors. | [2] School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China; Corresponding authors. | Heterogeneous environments and low Signal-to-Clutter Ratio (SCR) pose a challenge for Infrared Small Target Detection (IRSTD). Convolutional Neural Network (CNN) is constrained by the global view. Transformer with quadratic computational complexity struggles for local feature refinement. Inspired by the quad-directional scanning State Space Model (SSM) with linear complexity for long-range modeling, this research reconceptualizes the spatial and structural information of small targets in IR images. Multi-granularity features and long-range dependency of small targets are considered simultaneously. Specifically, we tailor a nested structure with cross-fertilization of global and local information. Each layer of the top-level pyramid network embeds a tiny well-configured contextual pyramid block to extract fine-grained features of small targets. The following Mamba module restructures the feature maps to derive coarse-grained features of “visual sentences”. The fusion of contextual information and local feature achieves precise localization of small targets. Furthermore, we propose the Asymmetric Convolution (AConv) for substituting the Depthwise Convolution (DWConv) in the Visual State Space (VSS) module and the regular convolution in each lateral connection of the nested pyramid network to alleviate the parameters and computation. Both qualitative and quantitative experiments demonstrate that our proposed model outperforms 12 recent baseline methods on two public datasets. | http://www.sciencedirect.com/science/article/pii/S1569843225002924 | Infrared small target detection, State space model, Vision Mamba, Nested contextual pyramid, Asymmetric convolution
spellingShingle	Yongji Li Luping Wang Shichao Chen Multi-granularity representation learning with vision Mamba for infrared small target detection International Journal of Applied Earth Observations and Geoinformation Infrared small target detection State space model Vision Mamba Nested contextual pyramid Asymmetric convolution
title	Multi-granularity representation learning with vision Mamba for infrared small target detection
title_full	Multi-granularity representation learning with vision Mamba for infrared small target detection
title_fullStr	Multi-granularity representation learning with vision Mamba for infrared small target detection
title_full_unstemmed	Multi-granularity representation learning with vision Mamba for infrared small target detection
title_short	Multi-granularity representation learning with vision Mamba for infrared small target detection
title_sort	multi granularity representation learning with vision mamba for infrared small target detection
topic	Infrared small target detection, State space model, Vision Mamba, Nested contextual pyramid, Asymmetric convolution
url	http://www.sciencedirect.com/science/article/pii/S1569843225002924
work_keys_str_mv	AT yongjili multigranularityrepresentationlearningwithvisionmambaforinfraredsmalltargetdetection AT lupingwang multigranularityrepresentationlearningwithvisionmambaforinfraredsmalltargetdetection AT shichaochen multigranularityrepresentationlearningwithvisionmambaforinfraredsmalltargetdetection

Similar Items