Developing Effective Techniques for the Recognition of Shanghai Dialect Text


Bibliographic Details
Main Authors: Yida Bao, Zheng Zhang, Mohammad Arifuzzaman, Tran Duc Le, Qi Li, Masuzyo Mwanza, Jiaqing Lin, Philippe Gaillard, Jiafeng Ye
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11053757/
Description
Summary: Recognizing Shanghai dialect text is crucial for preserving local dialects, yet research on automatically distinguishing it from Standard Mandarin remains limited. We construct a carefully balanced dataset specifically for Shanghai dialect recognition and propose a two-stage approach for automatic language classification. In the first stage, we employ Jieba tokenization to retain dialect-specific lexical nuances, ensuring that essential semantic and syntactic distinctions are captured. In the second stage, we independently train a BERT-Chinese-based classifier and a traditional Support Vector Machine classifier for dialect recognition. The BERT model leverages powerful contextual representations to capture subtle differences between Shanghai dialect and Standard Mandarin, while the Support Vector Machine serves as a conventional baseline. Extensive experiments comparing the two approaches show that, although the Support Vector Machine performs the classification task adequately, the BERT-based classifier achieves significantly higher accuracy and is more sensitive to the nuanced linguistic features of the dialect. Further analysis through attention visualization reveals how the model attends to unique dialectal features, highlighting distinctive lexical and structural differences between Shanghai dialect and Mandarin text. To the best of our knowledge, this study is the first to apply NLP techniques to language classification between Shanghai dialect and Standard Mandarin, underscoring the potential of automated dialect recognition as an effective method for dialect documentation and preservation.
ISSN: 2169-3536
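
The baseline half of the pipeline described in the summary (word segmentation followed by a Support Vector Machine) might look roughly like the sketch below. It is not the authors' code. The TF-IDF features, the default LinearSVC settings, the label names, and the eight toy sentences are all assumptions made for illustration; the summary does not state which features or hyperparameters were used, and the sentences are not taken from the paper's dataset. If the third-party jieba package is not installed, the sketch falls back to character-level tokens, which is a substitute technique rather than the one named in the summary.

```python
# Illustrative sketch only: NOT the authors' code, data, or hyperparameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

try:
    import jieba  # third-party Chinese word segmenter named in the summary

    def tokenize(text):
        # Segment a sentence into a list of word tokens.
        return jieba.lcut(text)
except ImportError:
    def tokenize(text):
        # Fallback (assumption): one token per character when jieba is absent.
        return list(text)

# Hypothetical toy examples written for this sketch, not the paper's dataset.
texts = [
    "侬今朝到啥地方去",   # Shanghai dialect
    "阿拉一道去白相",
    "侬饭吃过了伐",
    "今朝天气老好额",
    "你今天要去哪里",     # Standard Mandarin
    "我们一起去玩",
    "你吃过饭了吗",
    "今天天气很好",
]
labels = ["shanghainese"] * 4 + ["mandarin"] * 4

# TF-IDF over the tokens, then a linear SVM (both choices are assumptions).
model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None, lowercase=False),
    LinearSVC(),
)
model.fit(texts, labels)

# Sanity check on two of the training sentences.
predictions = model.predict(["侬今朝到啥地方去", "你今天要去哪里"])
print(list(predictions))
```

A toy set this small only shows the shape of the pipeline; it says nothing about the accuracy reported in the article, and the BERT classifier and attention analysis are not covered here.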