Scene Text Detection Based on Multi-scale Feature Extraction and Bidirectional Feature Fusion

Bibliographic Details
Main Authors: LIAN Zhe, YIN Yanjun, ZHI Min, XU Qiaozhi
Format: Article
Language: Chinese
Published: Harbin University of Science and Technology Publications 2024-08-01
Series:Journal of Harbin University of Science and Technology
Subjects:
Online Access:https://hlgxb.hrbust.edu.cn/#/digest?ArticleID=2346
Description
Summary: Natural scene text detection is a fundamental research task in image processing with a wide range of applications. Current methods usually adopt single-scale convolution and multi-scale feature fusion to capture the semantic features of scene text. However, single-scale convolution struggles to represent text targets of different shapes and scales, and simple upsampling-based multi-scale fusion enforces only consistency of scale while ignoring the relative importance of features at different scales. To address these problems, a scene text detection algorithm based on multi-scale feature extraction and bidirectional feature fusion is proposed. The algorithm builds a multi-scale feature extraction module from convolutional kernels of different sizes, accommodating text targets of different scales and shapes while capturing contextual dependencies at different distances. In the fusion stage, a bidirectional feature fusion module adds bottom-up fusion paths to enable information interaction across scales. Coordinate attention is introduced after fusion to enhance high-level detail information and compensate for the detail lost during fusion. Extensive experiments on the ICDAR2015, MSRA-TD500, and CTW1500 datasets yield F1 scores of 87.8%, 87.1%, and 83.2%, with detection speeds of 17.2 frames/s, 31.1 frames/s, and 22.3 frames/s, respectively, showing good robustness compared with other advanced detection methods.
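The two fusion ideas named in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration only, not the paper's implementation: the learned 1x1 convolutions of coordinate attention are omitted, the nearest-neighbour resize stands in for the paper's up/downsampling, and all function names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def resize_nn(x, h, w):
    """Nearest-neighbour resize of a (C, H, W) map; stands in for up/downsampling."""
    C, H, W = x.shape
    ri = np.arange(h) * H // h
    ci = np.arange(w) * W // w
    return x[:, ri][:, :, ci]

def bidirectional_fuse(feats):
    """feats: list of (C, Hi, Wi) maps, highest resolution first.
    Top-down pass pushes coarse semantics down; the added bottom-up
    pass pushes fine detail back up, giving two-way scale interaction."""
    # top-down: add the upsampled coarser map into the next finer one
    td = [f.copy() for f in feats]
    for i in range(len(td) - 2, -1, -1):
        td[i] += resize_nn(td[i + 1], *td[i].shape[1:])
    # bottom-up: add the downsampled finer map into the next coarser one
    bu = [f.copy() for f in td]
    for i in range(1, len(bu)):
        bu[i] += resize_nn(bu[i - 1], *bu[i].shape[1:])
    return bu

def coordinate_attention(x):
    """Direction-aware gating: pool along width and along height separately,
    then reweight the map; the learned transforms are omitted for brevity."""
    attn_h = sigmoid(x.mean(axis=2))[:, :, None]   # (C, H, 1) row gate
    attn_w = sigmoid(x.mean(axis=1))[:, None, :]   # (C, 1, W) column gate
    return x * attn_h * attn_w
```

Applied to a small pyramid of feature maps, `bidirectional_fuse` preserves each map's shape while mixing information across scales, and `coordinate_attention` rescales the fused map per row and per column.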
ISSN:1007-2683