A lip reading method based on adaptive pooling attention Transformer
Format: Article
Language: Chinese
Published: POSTS&TELECOM PRESS Co., LTD, 2025-06-01
Series: 智能科学与技术学报
Online Access: http://www.cjist.com.cn/zh/article/doi/10.11959/j.issn.2096-6652.202515/
Summary: Lip reading technology establishes a mapping between lip movements and specific language characters by processing a series of consecutive lip images, thereby enabling semantic information recognition. Existing methods mainly use recurrent networks for spatiotemporal modeling of sequential video frames. However, they suffer from significant information loss, especially when the video information is incomplete or noisy. In such cases, the model often struggles to distinguish between lip movements at different time points, leading to a significant decline in recognition performance. To address this issue, a lip reading method based on an adaptive pooling attention Transformer (APAT-LR) was proposed. This method introduced an adaptive pooling module before the multi-head self-attention (MHSA) mechanism in the standard Transformer, using a concatenation strategy of max pooling and average pooling. This module effectively suppressed irrelevant information and enhanced the representation of key features. Experiments on the CMLR and GRID datasets showed that the proposed APAT-LR method could reduce the recognition error rate, verifying the effectiveness of the approach.
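The abstract describes concatenating max-pooled and average-pooled frame features before feeding them to multi-head self-attention. A minimal NumPy sketch of that pooling step is shown below; the function name, window size, and tensor shapes are assumptions, not the paper's exact design.

```python
import numpy as np

def adaptive_pool_features(x, kernel=2):
    """Hypothetical sketch of the pooling module from the abstract:
    given frame features x of shape (T, D), apply max pooling and
    average pooling along the time axis over non-overlapping windows,
    then concatenate the two results channel-wise, giving (T', 2D)."""
    T, D = x.shape
    # Pad by repeating the last frame so T is divisible by the window size.
    pad = (-T) % kernel
    if pad:
        x = np.concatenate([x, np.repeat(x[-1:], pad, axis=0)], axis=0)
    windows = x.reshape(-1, kernel, D)   # (T', kernel, D)
    max_pool = windows.max(axis=1)       # emphasizes salient activations
    avg_pool = windows.mean(axis=1)      # preserves smooth temporal context
    return np.concatenate([max_pool, avg_pool], axis=-1)  # (T', 2D)

# Example: 4 frames of 3-dimensional features, pooled in windows of 2.
x = np.arange(12, dtype=float).reshape(4, 3)
out = adaptive_pool_features(x, kernel=2)
print(out.shape)  # (2, 6)
```

In the method as described, the resulting sequence would then pass into the standard Transformer's MHSA layers; that part is omitted here.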
ISSN: 2096-6652