Relative Applicability of Diverse Automatic Speech Recognition Platforms for Transcription of Psychiatric Treatment Sessions

Bibliographic Details
Main Authors: Rana Zeeshan, John Bogue, Mamoona Naveed Asghar
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11063333/
Description
Summary: Service delivery in mental healthcare involves documenting sensitive patient-clinician conversations, which must be handled with great caution. Conventionally, clinicians take handwritten notes, which are often hard to read and leave no searchable database, hindering research. Digitizing these conversations via Automatic Speech Recognition (ASR) based Speech-to-Text (STT) transcription enables progressive analysis of mental health cases. ASR applications usually require an audio recording prior to transcription in order to label speakers (diarization). Although such models are good enough for most use cases, storing audio recordings in psychiatry complicates data handling and the adoption of ASR platforms in mental healthcare. This study used a two-stage methodology: first, a list of 32 well-reputed STT transcription tools was evaluated for applicability in psychiatry; second, experimental testing was carried out using nine audio clips derived from three psychiatric session recordings, varying in duration (1, 3, and 10 minutes) and in speakers’ gender. Metrics such as inference time, Word Error Rate (WER), and Diarization Error Rate (DER) were analyzed. The results indicated that while WER was low (0-7%), DER varied significantly (2-32%), influenced by audio length and speaker characteristics. DER was notably lower for clips with speakers of differing genders or ages, and higher for speakers of similar demographics. The study also compared synchronous and asynchronous diarization approaches, highlighting challenges in accuracy, privacy, and processing efficiency in psychiatry. These findings provide actionable insights for selecting ASR tools in mental healthcare and underscore the need for targeted improvements in ASR technology to address the unique demands of this field.
ISSN: 2169-3536
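
Note on the WER metric: the summary reports Word Error Rate without defining it. In its standard form, WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the ASR output, divided by the number of words in the reference. The sketch below illustrates that standard definition only; it is not the authors' evaluation code, the function name `word_error_rate` is hypothetical, and the lowercase/whitespace normalization is an assumption (published evaluations often also strip punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption: text is normalized by lowercasing and splitting on
    whitespace only; punctuation handling is left to the caller.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences for illustration; not data from the study.
reference = "the patient reports poor sleep"
hypothesis = "the patient report poor sleep"
print(f"WER = {word_error_rate(reference, hypothesis):.0%}")
```

Because the denominator is the reference length, a hypothesis with many inserted words can push WER above 100%.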