A lip reading method based on adaptive pooling attention Transformer
Format: Article
Language: Chinese
Published: POSTS&TELECOM PRESS Co., LTD, 2025-06-01
Series: 智能科学与技术学报
Online Access: http://www.cjist.com.cn/zh/article/doi/10.11959/j.issn.2096-6652.202515/
Summary: Lip reading technology establishes a mapping between lip movements and specific language characters by processing a series of consecutive lip images, thereby enabling semantic information recognition. Existing methods mainly use recurrent networks for spatiotemporal modeling of sequential video frames. However, they suffer from significant information loss, especially when the video information is incomplete or noisy. In such cases, the model often struggles to distinguish between lip movements at different time points, leading to a significant decline in recognition performance. To address this issue, a lip reading method based on an adaptive pooling attention Transformer (APAT-LR) was proposed. This method introduced an adaptive pooling module before the multi-head self-attention (MHSA) mechanism in the standard Transformer, using a concatenation strategy of max pooling and average pooling. This module effectively suppressed irrelevant information and enhanced the representation of key features. Experiments on the CMLR and GRID datasets showed that the proposed APAT-LR method could reduce the recognition error rate, verifying the effectiveness of the approach.
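The abstract describes concatenating max-pooled and average-pooled frame features before feeding them to multi-head self-attention. A minimal NumPy sketch of that pooling step is shown below; the function name, window size, and tensor shapes are assumptions, not the paper's exact design.

```python
import numpy as np

def adaptive_pool_features(x, kernel=2):
    """Hypothetical sketch of the pooling module from the abstract:
    given frame features x of shape (T, D), apply max pooling and
    average pooling along the time axis over non-overlapping windows,
    then concatenate the two results channel-wise, giving (T', 2D)."""
    T, D = x.shape
    # Pad by repeating the last frame so T is divisible by the window size.
    pad = (-T) % kernel
    if pad:
        x = np.concatenate([x, np.repeat(x[-1:], pad, axis=0)], axis=0)
    windows = x.reshape(-1, kernel, D)   # (T', kernel, D)
    max_pool = windows.max(axis=1)       # emphasizes salient activations
    avg_pool = windows.mean(axis=1)      # preserves smooth temporal context
    return np.concatenate([max_pool, avg_pool], axis=-1)  # (T', 2D)

# Example: 4 frames of 3-dimensional features, pooled in windows of 2.
x = np.arange(12, dtype=float).reshape(4, 3)
out = adaptive_pool_features(x, kernel=2)
print(out.shape)  # (2, 6)
```

In the method as described, the resulting sequence would then pass into the standard Transformer's MHSA layers; that part is omitted here.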
ISSN: 2096-6652