Data stream-pairwise bottleneck transformer for engagement estimation from video conversation


Bibliographic Details
Main Authors: Keita Suzuki, Nobukatsu Hojo, Kazutoshi Shinoda, Saki Mizuno, Ryo Masumura
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-06-01
Series: Frontiers in Artificial Intelligence
Online Access: https://www.frontiersin.org/articles/10.3389/frai.2025.1516295/full
Description
Summary: This study aims to assess participant engagement in multiparty conversations using video and audio data. For this task, the interactions among numerous data streams, such as video and audio from multiple participants, should be modeled effectively while accounting for the redundancy of video and audio across frames. To model participant interactions efficiently under such redundancy, a previous study proposed feeding participant feature sequences into global token-based transformers, which constrain attention across feature sequences to pass through only a small set of internal units, allowing the model to focus on key information. However, that approach still estimates each participant's features with standard cross-attention transformers, which connect all frames across different modalities and thus remain subject to redundancy. To address this, we propose jointly modeling the interactions among all data streams with global token-based transformers, without distinguishing between cross-modal and cross-participant interactions. Experiments on the RoomReader corpus confirm that the proposed model outperforms previous models, achieving accuracy from 0.720 to 0.763, weighted F1 scores from 0.733 to 0.771, and macro F1 scores from 0.236 to 0.277.
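The global token-based bottleneck the abstract describes can be pictured as a gather/broadcast attention pattern: a small set of global tokens first attends to every frame of every stream, and each stream's frames then attend only to those global tokens, so all cross-stream information must pass through the bottleneck. The sketch below illustrates this with plain numpy and single-head attention; the dimensions, token count, and random "learned" tokens are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d))
    return w @ v

rng = np.random.default_rng(0)
d = 16          # feature dimension (illustrative)
n_streams = 4   # e.g. video + audio streams for two participants
t = 10          # frames per stream
n_global = 2    # small bottleneck of global tokens

streams = [rng.standard_normal((t, d)) for _ in range(n_streams)]
g = rng.standard_normal((n_global, d))  # "learned" global tokens (random here)

# Gather: global tokens attend to the concatenation of all stream frames.
all_frames = np.concatenate(streams, axis=0)      # (n_streams * t, d)
g = attend(g, all_frames, all_frames)             # (n_global, d)

# Broadcast: each stream's frames attend only to the global tokens, so
# cross-stream information flows exclusively through the bottleneck.
updated = [attend(s, g, g) + s for s in streams]  # residual update

print(updated[0].shape)  # each stream keeps its (t, d) shape
```

Compared with full pairwise cross-attention, whose cost grows with the product of frame counts across streams, this routing keeps the attention cost linear in the total number of frames for a fixed number of global tokens.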
ISSN: 2624-8212