TDQE: a quality evaluation method for text data in deep learning

Bibliographic Details
Main Authors: LUO Chunxu, XIONG Haixu, YE Yazhen, DING Yan, ZONG Shize, XIONG Yun, ZHU Yangyong
Format: Article
Language: Chinese
Published: China InfoCom Media Group 2025-01-01
Series: 大数据 (Big Data Research)
Online Access: http://www.j-bigdataresearch.com.cn/zh/article/111999072/
Description
Summary: Text data quality is an important factor affecting the performance of language models, and the way it is evaluated is considered decisive for model training effectiveness. To address the high computational cost and incomplete evaluation metrics of existing text data quality assessment methods, a deep learning-oriented text data quality evaluation (TDQE) method is proposed. Specifically, (1) the Dropout module of a text summarization model is used to generate multiple stochastic sub-networks, which produce embedded representations of each data sample; the semantic consistency of these representations is used to evaluate sample robustness; (2) a text similarity matching model computes the alignment between each data sample and its summary, which is used to assess sample accuracy; (3) weighted robustness and accuracy metrics are designed to quantify overall text data quality. Comparative experiments between TDQE and state-of-the-art methods on public datasets show that TDQE outperforms existing mainstream algorithms.
ISSN: 2096-0271
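
Illustration: the three steps in the summary can be sketched roughly as follows. This is a minimal toy sketch, not the authors' implementation. The record does not give the models, the similarity function, or the weighting formula, so everything here is an assumption: `toy_embed` is a hypothetical hashed bag-of-words encoder standing in for the summarization model, `dropout` perturbs that embedding to imitate stochastic sub-networks, cosine similarity stands in for the text similarity matching model, and the final score is assumed to be a convex combination controlled by a hypothetical `weight` parameter.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def toy_embed(text, dim=16):
    """Hypothetical stand-in encoder: hashed bag-of-words counts.

    A real system would use the hidden states of a summarization model.
    """
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec


def dropout(vec, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]


def robustness(text, passes=8, p=0.1, seed=0):
    """Step (1), assumed form: mean pairwise cosine similarity between
    embeddings from several dropout passes. Requires passes >= 2."""
    rng = random.Random(seed)
    base = toy_embed(text)
    views = [dropout(base, p, rng) for _ in range(passes)]
    sims = [
        cosine(views[i], views[j])
        for i in range(passes)
        for j in range(i + 1, passes)
    ]
    return sum(sims) / len(sims)


def accuracy(text, summary):
    """Step (2), assumed form: similarity between a sample and its summary."""
    return cosine(toy_embed(text), toy_embed(summary))


def quality(text, summary, weight=0.5, **dropout_args):
    """Step (3), assumed form: convex combination of the two metrics."""
    r = robustness(text, **dropout_args)
    a = accuracy(text, summary)
    return weight * r + (1.0 - weight) * a


if __name__ == "__main__":
    sample = "the model was trained on a large public text corpus"
    summary = "model trained on public text corpus"
    print(quality(sample, summary, weight=0.6))
```

Under these assumptions both metrics fall in [0, 1], so the combined score does too, and `weight` sets how much robustness counts relative to accuracy.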