Multi-Class Decoding of Attended Speaker Direction Using Electroencephalogram and Audio Spatial Spectrum

Bibliographic Details
Main Authors: Yuanming Zhang, Jing Lu, Fei Chen, Haoliang Du, Xia Gao, Zhibin Lin
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Transactions on Neural Systems and Rehabilitation Engineering
Online Access: https://ieeexplore.ieee.org/document/11091336/
Description
Summary: Prior research on directional focus decoding, a.k.a. selective Auditory Attention Decoding (sAAD), has primarily focused on binary “left-right” tasks. However, decoding the attended speaker’s precise direction is desirable. Existing approaches often underutilize spatial audio information, resulting in suboptimal performance. In this paper, we address this limitation by leveraging a recent dataset containing two concurrent speakers at two of 14 possible directions. We demonstrate that models relying solely on EEG yield limited decoding accuracy in leave-one-out settings. To enhance performance, we propose integrating spatial spectra as an additional input. We evaluate three model architectures, namely CNN, LSM-CNN, and Deformer, under two strategies for utilizing spatial information: all-in-one (end-to-end) and pairwise (two-stage) decoding. All-in-one decoders take dual-modal inputs directly and output the attended direction, whereas pairwise decoders first use the spatial spectra to decode the competing pair of directions and then apply a specific model to decode the attended direction. Our proposed all-in-one Sp-EEG-Deformer model achieves 14-class decoding accuracies of 55.35% and 57.19% in leave-one-subject-out and leave-one-trial-out scenarios, respectively, using 1-second decision windows (chance level: 50%, indicating random guessing). Meanwhile, the pairwise Sp-EEG-Deformer decoder achieves a 14-class decoding accuracy of 63.62% (10 s). Our experiments reveal that spatial spectra are particularly effective at reducing the 14-class problem to a binary one. EEG features, on the other hand, are more discriminative and play a crucial role in precisely identifying the final attended direction within this reduced 2-class set. These results highlight the effectiveness of our proposed dual-modal directional decoding strategies.
ISSN: 1534-4320, 1558-0210
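
The pairwise (two-stage) strategy described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the top-two peak picking in stage 1, and the toy binary decoder in stage 2 are all assumptions made for illustration, standing in for the trained models the paper evaluates. The sketch only shows why the task reduces from 14 classes to 2 once the spatial spectrum has identified the competing pair, which is also why a 50% chance level applies.

```python
import numpy as np

N_DIRECTIONS = 14  # number of candidate speaker directions (from the summary)


def decode_pair(spatial_spectrum):
    """Stage 1 (assumed): pick the two strongest directions in the spectrum."""
    spectrum = np.asarray(spatial_spectrum, dtype=float)
    if spectrum.shape != (N_DIRECTIONS,):
        raise ValueError(f"expected {N_DIRECTIONS} direction bins")
    top_two = np.argsort(spectrum)[-2:]  # indices of the two largest values
    return tuple(sorted(int(i) for i in top_two))


def toy_binary_decoder(eeg_window, pair):
    """Hypothetical stand-in for a trained EEG model; returns 0 or 1."""
    return int(np.mean(eeg_window) > 0)


def decode_attended(spatial_spectrum, eeg_window, binary_decoder):
    """Stage 2: choose the attended direction within the competing pair."""
    pair = decode_pair(spatial_spectrum)
    choice = binary_decoder(eeg_window, pair)  # 0 selects pair[0], 1 selects pair[1]
    return pair[choice]


# Toy usage: two speakers at direction indices 3 and 10.
spectrum = np.zeros(N_DIRECTIONS)
spectrum[3], spectrum[10] = 1.0, 0.8
eeg = np.ones((64, 128))  # placeholder window (channels x samples), shape assumed
print(decode_pair(spectrum))
print(decode_attended(spectrum, eeg, toy_binary_decoder))
```

In practice a measured spatial spectrum has broad peaks, so taking the two largest bins could select neighbouring directions; a learned stage-1 model, as the summary describes, avoids that.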