Development and Analysis of a Methodology for Selecting Infrastructure Metrics for Predictive Incident Monitoring
Main Author: | |
---|---|
Format: | Article |
Language: | Russian |
Published: | The Fund for Promotion of Internet media, IT education, human development «League Internet Media», 2025-04-01 |
Series: | Современные информационные технологии и IT-образование (Modern Information Technologies and IT-Education) |
Subjects: | |
Online Access: | https://sitito.cs.msu.ru/index.php/SITITO/article/view/1193 |
Summary: | Growing telemetry volume in distributed IT systems creates "information noise" and raises the computational cost of AIOps platforms. This paper proposes a formalized two-stage metric selection procedure designed to improve the accuracy and efficiency of predictive monitoring: (1) a multicriteria correlation filter using the Pearson coefficient (|r| > 0.60), Kendall’s τ (> 0.50), and the Maximal Information Coefficient (MICe > 0.35) to eliminate redundant and non-linearly related features; (2) verification of causal relationships using the Granger test (lag = 5, p < 0.01), the PCMCI algorithm (FDR = 10%), and the Directed Information metric (DI > 0.1 bits/step) to identify the true drivers of the target metric. Experimental validation was conducted on a 14-day fragment of Prometheus metrics from the production cluster of the "Sber Antifraud" system (≈7 billion data points, 1379 initial metrics). The results showed a 43% reduction in the Mean Absolute Error (MAE) of 30-minute CPU utilization forecasts, a 14-fold reduction in the number of input time series, and an 89% reduction in model inference time. The methodology is integrated into a production data processing pipeline (Prometheus → Kafka → Spark 3.5 → MLflow 2.11) and aligns with the data minimization principle outlined in GOST R 57580.1-2017 and the FSTEC guidelines for information protection. |
ISSN: | 2411-1473 |
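
The first stage described in the summary (the correlation filter) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each candidate metric is compared against the target metric and kept when either threshold is exceeded, and that Kendall's τ is taken in absolute value; the summary does not state how the criteria are combined. The MICe criterion is left out because it needs a third-party estimator, and the names `pearson_r`, `kendall_tau_a`, and `select_metrics` are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

# Thresholds quoted in the summary. MICe (> 0.35) is intentionally omitted.
PEARSON_MIN = 0.60
KENDALL_MIN = 0.50


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(vx * vy)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs, O(n^2)."""
    n = len(x)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0
    score = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            score += 1      # concordant pair
        elif s < 0:
            score -= 1      # discordant pair (ties contribute nothing)
    return score / pairs


def select_metrics(candidates: Dict[str, List[float]],
                   target: Sequence[float]) -> List[str]:
    """Keep candidates whose association with the target passes a threshold.

    Assumption: criteria are combined with OR and applied to absolute values.
    """
    kept = []
    for name, series in candidates.items():
        r = pearson_r(series, target)
        tau = kendall_tau_a(series, target)
        if abs(r) > PEARSON_MIN or abs(tau) > KENDALL_MIN:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    target = [float(t) for t in range(1, 9)]
    candidates = {
        "linear": [2 * t + 1 for t in target],
        "alternating": [1.0, -1.0] * 4,
    }
    print(select_metrics(candidates, target))
```

The pairwise Kendall computation is quadratic in series length, so on telemetry of the scale mentioned in the summary a library routine and a distributed engine would be needed; the sketch only shows the thresholding logic.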