ICRSSD: Identification and Classification for Railway Structured Sensitive Data
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-06-01 |
Series: | Future Internet |
Subjects: | |
Online Access: | https://www.mdpi.com/1999-5903/17/7/294 |
Summary: | The rapid growth of the railway industry has led to the accumulation of large volumes of structured data, making data security a critical component of reliable railway system operations. However, existing methods for identifying and classifying sensitive data often suffer from limitations such as overly coarse identification granularity and insufficient flexibility in classification. To address these issues, we propose ICRSSD, a two-stage identification and classification method tailored to the railway domain. The identification stage focuses on obtaining the sensitivity of all attributes. We first divide structured data into canonical data and semi-canonical data at a finer granularity to improve identification accuracy. For canonical data, we use information entropy to calculate the initial sensitivity and then update the attribute sensitivities through cluster analysis and association rule mining. For semi-canonical data, we calculate attribute sensitivity using a combination of regular expressions and keyword lists. In the classification stage, to further enhance accuracy, we adopt a dynamic, multi-granularity classification strategy. It considers the relative sensitivity of attributes across different scenarios and classifies them into three levels based on the sensitivity values obtained during the identification stage. Additionally, we design a rule base specifically for the identification and classification of sensitive data in the railway domain. This rule base enables effective data identification and classification while also supporting expiry management of sensitive attribute labels. To improve the efficiency of regular expression generation, we developed an auxiliary tool based on large language models and a well-designed prompt framework. We conducted experiments on a real-world dataset from the railway domain. The results demonstrate that ICRSSD significantly improves the accuracy and adaptability of sensitive data identification and classification in the railway domain. |
---|---|
ISSN: | 1999-5903 |
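
The identification and classification steps named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalization of entropy by the column length, the regular expressions, the keyword list, the way pattern and keyword hits are combined, and the fixed thresholds for the three levels are all assumptions made for illustration. The summary does not give these details, and it describes the actual classification as dynamic and scenario-dependent rather than threshold-based.

```python
import math
import re
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution of a column."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def initial_sensitivity(values):
    """Assumed initial score for a canonical attribute: entropy scaled to [0, 1].

    The scaling by log2(len(values)) is an illustrative choice, not taken
    from the paper. A column of all-distinct values scores 1.0.
    """
    n = len(values)
    if n <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(n)


# Hypothetical patterns and keywords; a real rule base would be domain-specific.
PATTERNS = {
    "phone": re.compile(r"\b1\d{10}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}
KEYWORDS = {"passenger", "password", "id_card"}


def semi_canonical_sensitivity(column_name, values):
    """Assumed score for a semi-canonical attribute.

    Returns 1.0 if the column name contains a keyword; otherwise the fraction
    of values in which at least one pattern matches.
    """
    if any(keyword in column_name.lower() for keyword in KEYWORDS):
        return 1.0
    if not values:
        return 0.0
    hits = sum(1 for v in values if any(p.search(v) for p in PATTERNS.values()))
    return hits / len(values)


def classify(score, low=0.33, high=0.66):
    """Map a sensitivity score to one of three levels using assumed fixed thresholds."""
    if score >= high:
        return 3
    if score >= low:
        return 2
    return 1
```

For example, under these assumptions `classify(initial_sensitivity(["a", "b", "c", "d"]))` places a column of four distinct values in the highest level, while a free-text column in which half of the entries contain a phone-like number lands in the middle level.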