DEVELOPMENT OF DATA MESH DATA PLATFORM WITH ML DOMAIN OF DATA ANALYSIS
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: | Ivan Franko National University of Lviv 2024-09-01 |
Series: | Електроніка та інформаційні технології |
Subjects: | |
Online Access: | http://publications.lnu.edu.ua/collections/index.php/electronics/article/view/4467 |
Summary: | A data mesh model with three input domains A, B, and C is proposed. Each domain has its own operational data source. Each domain team builds a data product for its domain that includes only cleaned, processed, and selected data. The domain data products are then combined into a comprehensive aggregate that holds complete data about the entities in the system. Next, consumer-aligned data products are created: a marketing data product, a company performance data product, and an ML data product. In this way, the data mesh provides a decentralized and distributed data architecture for a project that delivers financial services to clients.
This study provides a detailed analysis of the creation of data products within Domain B and the ML domain, as well as their interactions. The data product for Domain B is constructed using data from the Open Banking API, which provides real-time data on clients' daily transactions, consistent with the information displayed on their bank statements. The data were categorized, aggregated, and anonymized, resulting in fifteen data columns across three sections: categorized expenditures, risky expenditures, and categorized revenues. Additionally, two new columns were derived to represent the net difference between income and expenditures.
The data analysis layer includes the ML model domain, in which the data are classified with several classifiers. The highest classification accuracy (0.98) and the highest ROC AUC (0.98) are achieved with the XGBoost (XGB) and Random Forest (RF) classifiers on Domain B data after balancing and augmentation with a Generative Adversarial Network. The classification results, together with Principal Component Analysis (PCA), confirm that the data product built from Domain B supports high classification accuracy. The classification results were analyzed in detail, and clients were segmented into groups by their probability of obtaining a loan. It is proposed to use the results of the ML data analysis to improve client classification accuracy, analyze financial credit risks, and determine the optimal interest rate. |
---|---|
ISSN: | 2224-087X 2224-0888 |
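
The summary describes two steps that a short sketch may make more concrete: building the Domain B data product (categorize, aggregate, anonymize, derive a net income/expenditure difference) and segmenting clients by loan probability. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The column names, the category mappings, the salted SHA-256 pseudonymization, the single `net_balance` column, and the `0.3`/`0.7` thresholds are all hypothetical; the record does not specify the fifteen columns, the two derived columns, or the segment boundaries.

```python
import hashlib
from collections import defaultdict

# Hypothetical mappings from raw transaction categories to data-product
# columns; the real product has fifteen columns that the record does not list.
EXPENSE_COLUMNS = {"groceries": "exp_groceries", "rent": "exp_housing"}
RISKY_COLUMNS = {"gambling": "risky_gambling"}
REVENUE_COLUMNS = {"salary": "rev_salary"}


def anonymize(client_id: str, salt: str = "demo-salt") -> str:
    """Replace a client identifier with a truncated salted SHA-256 digest.

    This is one possible pseudonymization scheme, assumed for illustration.
    """
    digest = hashlib.sha256((salt + client_id).encode("utf-8")).hexdigest()
    return digest[:16]


def build_data_product(transactions):
    """Categorize and aggregate transactions into one row per anonymized client."""
    rows = defaultdict(lambda: defaultdict(float))
    for tx in transactions:
        row = rows[anonymize(tx["client_id"])]
        category, amount = tx["category"], tx["amount"]
        if category in REVENUE_COLUMNS:
            row[REVENUE_COLUMNS[category]] += amount
            row["total_revenue"] += amount
        elif category in RISKY_COLUMNS:
            # Assumption: risky spending also counts toward total expenditure.
            row[RISKY_COLUMNS[category]] += amount
            row["total_expenditure"] += amount
        elif category in EXPENSE_COLUMNS:
            row[EXPENSE_COLUMNS[category]] += amount
            row["total_expenditure"] += amount
        # Transactions with unknown categories are skipped in this sketch.
    for row in rows.values():
        # Derived column: net difference between income and expenditures.
        row["net_balance"] = row["total_revenue"] - row["total_expenditure"]
    return {client: dict(row) for client, row in rows.items()}


def segment(probability: float, low: float = 0.3, high: float = 0.7) -> str:
    """Assign a client to a group by predicted loan probability.

    The thresholds are hypothetical; the record does not state the boundaries.
    """
    if probability >= high:
        return "high"
    if probability >= low:
        return "medium"
    return "low"
```

In such a setup, `build_data_product` would receive transaction records obtained from the Open Banking API, and `segment` would be applied to the class probabilities produced by a trained classifier such as XGBoost or Random Forest.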