DEVELOPMENT OF DATA MESH DATA PLATFORM WITH ML DOMAIN OF DATA ANALYSIS

Bibliographic Details
Main Authors: M. Fostyak, L. Demkiv
Format: Article
Language: English
Published: Ivan Franko National University of Lviv, 2024-09-01
Series: Електроніка та інформаційні технології (Electronics and Information Technologies)
Online Access: http://publications.lnu.edu.ua/collections/index.php/electronics/article/view/4467
Description
Summary: A data mesh model with three input domains, A, B, and C, is proposed. Each domain has its own operational data source, and each domain team builds a data product for its domain that includes only cleaned, processed, and selected data. The domain data products are then combined into a comprehensive aggregate containing complete data about the entities in the system. Next, consumer-aligned data products are created: a marketing data product, a company performance data product, and an ML data product. Data mesh thus provides a decentralized, distributed data architecture for a project that delivers financial services to clients. This study analyzes in detail the creation of the data products for Domain B and the ML domain, as well as their interactions. The data product for Domain B is built from Open Banking API data, which provides real-time data on clients' daily transactions, consistent with the information shown on their bank statements. The data were categorized, aggregated, and anonymized, resulting in fifteen data columns across three sections: categorized expenditures, risky expenditures, and categorized revenues. Two additional columns were derived to represent the net difference between income and expenditures. The data analysis layer includes the ML model domain, in which data classification is implemented using various classifiers. The highest classification accuracy (0.98) and the highest ROC AUC (0.98) were achieved with the XGBoost (XGB) and Random Forest (RF) classifiers on Domain B data after balancing and augmentation with a Generative Adversarial Network. The classification results and Principal Component Analysis (PCA) confirm that the data product constructed from Domain B ensures high classification accuracy. A thorough analysis of the classification results was conducted. Clients were segmented into groups based on their probability of obtaining a loan. The authors propose incorporating the results of the ML data analysis to improve client classification accuracy, analyze financial credit risks, and determine the optimal interest rate.
ISSN: 2224-087X, 2224-0888
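
Illustrative sketch (not taken from the article): the summary describes transactions being categorized, aggregated, and anonymized into a Domain B data product with a derived income-minus-expenditure column, and clients later being segmented by their probability of obtaining a loan. The Python below shows one way those two steps could look, using only the standard library. Everything specific here is an assumption: the category names, the salted SHA-256 anonymization, the rule that risky spending counts toward total expenditure, the single `net_difference` column (the summary mentions two derived columns but does not define them), and the segment thresholds 0.4 and 0.7.

```python
import hashlib
from collections import defaultdict

# Hypothetical category sets; the article's actual categories are not listed in the summary.
EXPENDITURE_CATEGORIES = {"groceries", "rent", "transport"}
RISKY_CATEGORIES = {"gambling", "payday_loan"}
REVENUE_CATEGORIES = {"salary", "benefits"}


def anonymize(client_id: str, salt: str = "demo-salt") -> str:
    """Replace a client identifier with a truncated salted SHA-256 digest (illustrative)."""
    return hashlib.sha256((salt + client_id).encode("utf-8")).hexdigest()[:16]


def build_data_product(transactions):
    """Aggregate raw transactions into one row of totals per anonymized client."""
    rows = defaultdict(lambda: defaultdict(float))
    for tx in transactions:
        row = rows[anonymize(tx["client_id"])]
        category, amount = tx["category"], abs(tx["amount"])
        if category in RISKY_CATEGORIES:
            row["risky_" + category] += amount
            row["total_expenditure"] += amount  # assumption: risky spending is also expenditure
        elif category in EXPENDITURE_CATEGORIES:
            row["exp_" + category] += amount
            row["total_expenditure"] += amount
        elif category in REVENUE_CATEGORIES:
            row["rev_" + category] += amount
            row["total_revenue"] += amount
        # Transactions with an unknown category are ignored in this sketch.
    for row in rows.values():
        # Derived column: net difference between income and expenditures.
        row["net_difference"] = row["total_revenue"] - row["total_expenditure"]
    return {key: dict(row) for key, row in rows.items()}


def segment(probability: float) -> str:
    """Map a predicted loan-approval probability to a group (hypothetical thresholds)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    if probability >= 0.7:
        return "high"
    if probability >= 0.4:
        return "medium"
    return "low"
```

In this sketch, `build_data_product` takes a list of dictionaries with the keys `client_id`, `category`, and `amount`, and `segment` would be applied to the probability output of whichever classifier is used. A production data product would need a properly managed salt or a stronger anonymization scheme than the one shown.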