Are Artificial Intelligence Models Listening Like Cardiologists? Bridging the Gap Between Artificial Intelligence and Clinical Reasoning in Heart-Sound Classification Using Explainable Artificial Intelligence

Bibliographic Details
Main Authors: Sami Alrabie, Ahmed Barnawi
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Bioengineering
Subjects:
Online Access: https://www.mdpi.com/2306-5354/12/6/558
Description
Summary: In recent years, deep learning has shown promise in automating heart-sound classification. Although heart-sound auscultation is fast, non-invasive, and cost-effective, its diagnostic accuracy still depends largely on the clinician’s expertise, making rare or complex conditions particularly challenging to detect. This study is motivated by two key concerns in the field of heart-sound classification. First, we observed that automatic heart-sound segmentation algorithms, which are commonly used for data augmentation, produce varying outcomes, raising concerns about the accuracy of both the segmentation process and the resulting classification performance. Second, we noticed inconsistent accuracy scores across different pretrained models, prompting the need for interpretable explanations to validate these results. We argue that, without interpretability to support reported metrics, accuracy scores can be misleading because of ambiguity in how training data interact with pretrained models. Specifically, it remains unclear whether these models classify spectrogram images generated from heart-sound signals in a way that aligns with clinical reasoning, where experts focus on specific components of the heart cycle, such as S1, systole, S2, and diastole. To address this, we applied explainable AI (XAI) techniques with two primary objectives: (1) to assess whether the model truly focuses on clinically relevant features, so that classification results can be verified and trusted, and (2) to investigate whether incorporating attention mechanisms can improve both performance and the model’s focus on meaningful segments of the signal. To the best of our knowledge, this is the first study conducted on a manually segmented dataset that objectively evaluates model behavior using XAI and explores performance enhancement by combining attention mechanisms with pretrained models. We employ the Grad-CAM method to visualize the model’s attention and gain insight into its decision-making process. The experimental results show that integrating multi-head attention significantly improves both classification accuracy and interpretability. Notably, ResNet50 with multi-head attention achieved an accuracy of 97.3%, outperforming both the baseline and SE-enhanced models. Moreover, the mean intersection over union (mIoU) for interpretability increased from 75.7% to 82.0%, indicating the model’s improved focus on diagnostically relevant regions.
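The summary above describes combining a pretrained ResNet50 with multi-head attention and inspecting its focus with Grad-CAM. The sketch below shows one way such a model could be assembled. It is an illustration under stated assumptions, not the authors’ implementation: the abstract does not specify a framework, layer sizes, or class count, so PyTorch/torchvision, the 8 attention heads, the 5 output classes, and the name ResNet50WithMHA are all assumptions.

# Minimal sketch (assumptions: PyTorch >= 1.9, torchvision >= 0.13; the layer
# sizes, head count, and class count are illustrative, not from the paper).
import torch
import torch.nn as nn
from torchvision import models

class ResNet50WithMHA(nn.Module):
    """ResNet50 backbone followed by multi-head self-attention over the
    spatial feature map of a heart-sound spectrogram, then classification."""
    def __init__(self, num_classes=5, num_heads=8):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep everything up to the last convolutional stage; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])   # -> (B, 2048, H, W)
        self.attn = nn.MultiheadAttention(embed_dim=2048, num_heads=num_heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):
        f = self.features(x)                      # (B, 2048, H, W)
        tokens = f.flatten(2).transpose(1, 2)     # (B, H*W, 2048): one token per spatial cell
        attended, _ = self.attn(tokens, tokens, tokens)
        pooled = self.pool(attended.transpose(1, 2)).squeeze(-1)   # (B, 2048)
        return self.classifier(pooled)

model = ResNet50WithMHA()
logits = model(torch.randn(2, 3, 224, 224))       # two spectrograms resized to 224x224
print(logits.shape)                               # torch.Size([2, 5])

Grad-CAM heat maps would then be computed on the last convolutional stage (self.features here), and the reported mIoU presumably compares such heat maps against the manually annotated S1, systole, S2, and diastole regions; both of those steps are omitted from this sketch.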
ISSN: 2306-5354