GSR-Fusion: A Deep Multimodal Fusion Architecture for Robust Sign Language Recognition Using RGB, Skeleton, and Graph-Based Modalities


Detailed Description

Bibliographic Details
Main Authors: Wuttichai Vijitkunsawat, Teeradaj Racharak
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11045351/
Additional Bibliographic Details
Abstract: Sign Language Recognition (SLR) plays a critical role in bridging communication gaps between the deaf and hearing communities. This research introduces GSR-Fusion, a deep multimodal fusion architecture that combines RGB-based, skeleton-based, and graph-based modalities to enhance gesture recognition. Unlike traditional unimodal models, GSR-Fusion utilizes gesture initiation and termination detection, along with a cross-modality fusion approach using a merge (network) technique, enabling it to capture both spatial-temporal and relational features from multiple data sources. The model incorporates ViViT for RGB feature extraction, Transformers for sequential pose modeling, and A3T-GCN for joint graph representation, which together form a comprehensive understanding of gestures. The study investigates five key experimental setups, covering single-hand static and dynamic gestures (one, two, and three strokes) as well as two-hand static and dynamic gestures from the Thai Finger Spelling dataset. Additionally, we compare our architecture with existing models on global datasets, including WLASL and MS-ASL, to evaluate its performance. The results show that GSR-Fusion outperforms state-of-the-art models on multiple datasets. On WLASL, it achieves 83.45% accuracy for 100 classes and 75.23% for 300 classes, surpassing models like SignBERT and Fusion3. Similarly, on MS-ASL, it attains 84.31% for 100 classes and 80.57% for 200 classes, outperforming both RGB-based and skeleton-based models. These results highlight the effectiveness of GSR-Fusion in recognizing complex gestures, demonstrating its ability to generalize across different sign languages and datasets. The research emphasizes the importance of multimodal fusion in advancing sign language recognition for real-world applications.
ISSN: 2169-3536
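The abstract describes a three-branch design (ViViT for RGB frames, a Transformer for pose sequences, A3T-GCN for joint graphs) whose outputs are combined by a merge network before classification. The sketch below illustrates one plausible reading of that merge step as a concatenate-then-MLP fusion head in PyTorch; the module names, feature dimensions, and the specific merge strategy are illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical late-fusion head for a three-branch SLR model, loosely
# following the abstract's description. Each branch (e.g. ViViT, a pose
# Transformer, A3T-GCN) is assumed to have already produced a per-clip
# feature vector; only the merge-and-classify step is sketched here.
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    def __init__(self, rgb_dim=768, pose_dim=256, graph_dim=256,
                 hidden_dim=512, num_classes=100):
        super().__init__()
        # Project each modality to a common width before merging.
        self.rgb_proj = nn.Linear(rgb_dim, hidden_dim)
        self.pose_proj = nn.Linear(pose_dim, hidden_dim)
        self.graph_proj = nn.Linear(graph_dim, hidden_dim)
        # Merge network: concatenation followed by an MLP classifier.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, rgb_feat, pose_feat, graph_feat):
        # Inputs: per-clip feature vectors of shape (batch, dim).
        fused = torch.cat(
            [self.rgb_proj(rgb_feat),
             self.pose_proj(pose_feat),
             self.graph_proj(graph_feat)],
            dim=-1,
        )
        return self.classifier(fused)


# Example with random feature vectors for a batch of 4 clips.
head = FusionHead(num_classes=100)
logits = head(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 100])
```

In this reading, each backbone can be trained or fine-tuned separately and only the fusion head learns cross-modal interactions; other merge strategies (attention-based fusion, gated sums) would fit the same interface.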