AI for Data Quality Auditing: Detecting Mislabeled Work Zone Crashes Using Large Language Models
Main Authors:
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Algorithms
Subjects:
Online Access: https://www.mdpi.com/1999-4893/18/6/317
Summary: Ensuring high data quality in traffic crash datasets is critical for effective safety analysis and policymaking. This study presents an AI-assisted framework for auditing crash data integrity by detecting potentially mislabeled records related to construction zone (czone) involvement. A GPT-3.5 model was fine-tuned using a fusion of structured crash attributes and unstructured narrative text (i.e., multimodal input) to predict work zone involvement. The model was applied to 6400 crash reports to flag discrepancies between predicted and recorded labels. Among 80 flagged mismatches, expert review confirmed four records as genuine misclassifications, demonstrating the framework’s capacity to surface high-confidence labeling errors. The model achieved strong overall accuracy (98.75%) and precision (86.67%) for the minority class, but showed low recall (14.29%), reflecting its conservative design that minimizes false positives in an imbalanced dataset. This precision-focused approach supports its use as a semi-automated auditing tool, capable of narrowing the scope for expert review and improving the reliability of large-scale traffic safety datasets. The framework is also adaptable to other misclassified crash attributes or domains where structured and unstructured data can be fused for data quality assurance.
ISSN: 1999-4893
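
For readers who want a concrete picture of the auditing step described in the summary, the sketch below shows, in Python, how predicted and recorded work-zone labels might be compared to flag candidate mislabelings for expert review. It is an illustration only, not code from the paper: `CrashRecord`, `fuse_inputs`, `audit`, and the keyword stand-in for the fine-tuned GPT-3.5 classifier are all hypothetical names chosen here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CrashRecord:
    """One crash report: structured fields plus the free-text narrative (illustrative schema)."""
    record_id: str
    structured: dict          # e.g. {"road_type": "interstate", "lighting": "daylight"}
    narrative: str            # officer's free-text description
    recorded_work_zone: bool  # label currently stored in the dataset


def fuse_inputs(rec: CrashRecord) -> str:
    """Combine structured attributes and narrative text into a single multimodal prompt string."""
    attrs = "; ".join(f"{k}={v}" for k, v in sorted(rec.structured.items()))
    return f"Attributes: {attrs}\nNarrative: {rec.narrative}\nWas this crash in a work zone?"


def audit(records: Iterable[CrashRecord],
          predict_work_zone: Callable[[str], bool]) -> List[CrashRecord]:
    """Flag records whose predicted label disagrees with the recorded label."""
    flagged = []
    for rec in records:
        predicted = predict_work_zone(fuse_inputs(rec))
        if predicted != rec.recorded_work_zone:
            flagged.append(rec)  # candidate mislabeling, to be confirmed by expert review
    return flagged


if __name__ == "__main__":
    # Toy stand-in for the fine-tuned classifier: a simple keyword heuristic.
    def keyword_classifier(prompt: str) -> bool:
        return any(w in prompt.lower() for w in ("work zone", "construction", "cone", "flagger"))

    records = [
        CrashRecord("A1", {"road_type": "interstate"}, "Rear-end collision near construction barrels.", False),
        CrashRecord("A2", {"road_type": "local"}, "Vehicle skidded on wet pavement.", False),
    ]
    for rec in audit(records, keyword_classifier):
        print(f"Flag {rec.record_id}: recorded={rec.recorded_work_zone}, possibly mislabeled")
```

In the study itself, the classifier is a GPT-3.5 model fine-tuned on the fused structured and narrative inputs; substituting such a model for the keyword heuristic in `predict_work_zone` would leave the discrepancy-flagging logic above unchanged.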