CAG-MoE: Multimodal Emotion Recognition with Cross-Attention Gated Mixture of Experts

Bibliographic Details
Main Authors: Axel Gedeon Mengara Mengara, Yeon-kug Moon
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Mathematics
Subjects: multimodal emotion recognition; deep learning; multimodal fusion; transformers; mixture of experts
Online Access: https://www.mdpi.com/2227-7390/13/12/1907
collection DOAJ
description Multimodal emotion recognition faces substantial challenges due to the inherent heterogeneity of data sources, each with its own temporal resolution, noise characteristics, and potential for incompleteness. For example, physiological signals, audio features, and textual data capture complementary yet distinct aspects of emotion, requiring specialized processing to extract meaningful cues. These challenges include aligning disparate modalities, handling varying levels of noise and missing data, and effectively fusing features without diluting critical contextual information. In this work, we propose a novel Mixture of Experts (MoE) framework that addresses these challenges by integrating specialized transformer-based sub-expert networks, a dynamic gating mechanism with sparse Top-k activation, and a cross-modal attention module. Each modality is processed by multiple dedicated sub-experts designed to capture intricate temporal and contextual patterns, while the dynamic gating network selectively weights the contributions of the most relevant experts. Our cross-modal attention module further enhances the integration by facilitating precise exchange of information among modalities, thereby reinforcing robustness in the presence of noisy or incomplete data. Additionally, an auxiliary diversity loss encourages expert specialization, ensuring the fused representation remains highly discriminative. Extensive theoretical analysis and rigorous experiments on benchmark datasets—the Korean Emotion Multimodal Database (KEMDy20) and the ASCERTAIN dataset—demonstrate that our approach significantly outperforms state-of-the-art methods in emotion recognition, setting new performance baselines in affective computing.
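
For readers skimming the record, the sketch below illustrates the kind of architecture the abstract describes: per-modality transformer sub-experts routed through a sparse Top-k gate, followed by a cross-modal attention step. It is a minimal illustration written for this record, not the authors' implementation; all module names, layer sizes, expert counts, and the gating layout are assumptions.

# Minimal sketch (not the authors' code): sparse Top-k gated transformer
# sub-experts plus a cross-modal attention fusion step. Dimensions and
# module choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGatedExperts(nn.Module):
    """Routes each sample to its k highest-scoring transformer sub-experts."""
    def __init__(self, dim=128, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                          # x: (batch, seq, dim)
        scores = self.gate(x.mean(dim=1))          # pooled gating logits per sample
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)      # renormalize over the k selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # accumulate weighted expert outputs
            for e, expert in enumerate(self.experts):
                routed = top_idx[:, slot] == e     # samples whose slot-th pick is expert e
                if routed.any():
                    w = weights[routed, slot].view(-1, 1, 1)
                    out[routed] += w * expert(x[routed])
        return out

class CrossModalAttention(nn.Module):
    """One modality queries another so fused features keep cross-modal context."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, context_mod):
        fused, _ = self.attn(query_mod, context_mod, context_mod)
        return fused

# Usage with dummy audio/text streams of matching feature size (hypothetical shapes):
audio = torch.randn(8, 50, 128)
text = torch.randn(8, 30, 128)
audio_expert_out = TopKGatedExperts()(audio)
fused = CrossModalAttention()(audio_expert_out, text)
print(fused.shape)  # torch.Size([8, 50, 128])
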
id doaj-art-2b4cd7f36fa549a8b0c8da95338ff8e7
institution Matheson Library
issn 2227-7390
Author affiliations: Axel Gedeon Mengara Mengara and Yeon-kug Moon, Department of Artificial Intelligence Data Science, Sejong University, 209 Neungdong-ro, Gwangjin District, Seoul 05006, Republic of Korea
Citation: Mathematics, Vol. 13, Issue 12, Article 1907 (2025). DOI: 10.3390/math13121907