TY - JOUR
T1 - Context-aware incremental clustering of alerts in monitoring systems
AU - Turgeman, Lior
AU - Avrashi, Yaniv
AU - Vagner, Gabriella
AU - Azaizah, Nadeem
AU - Katkar, Someshwar
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2022/12/30
Y1 - 2022/12/30
N2 - The highly complex nature of modern hybrid IT applications presents a growing challenge for operations teams that rely on traditional monitoring approaches. In monitoring systems, incidents occur frequently for a variety of reasons, from software and hardware updates to changes in the operating environment. These incidents can significantly degrade the system's availability and customer satisfaction. In many cases, investigating an incident in such an environment can feel like looking for a needle in a haystack, and you may not even know what the needle looks like until you see it. In that regard, one of the main challenges is how to efficiently analyze, in real time, multiple sets of alert messages stemming from disparate monitoring tools and collectors across the application stack. Such an analysis can provide trustworthy detection of system states at various critical points, thus helping teams to detect, frame, analyze, and resolve incidents or failures in a relatively short time, especially when accurate topological dependencies of the system are unavailable. In this work, we suggest a new approach to determining relations among alerts, thereby forming “events”. The suggested approach directly models the event's likelihood by first embedding the alerts’ corresponding metrics into a common latent space, where the distance among metrics can be naturally defined, using a word2vec model, and then clustering the alerts with a tailored incremental clustering algorithm. The suggested approach allows controlling the trade-off between the model's sensitivity and the clusters’ noise-robustness, thus spanning a wide range of clustering mechanisms, as well as adapting the clustering outcomes to the level and properties of the noise expected in the input data.
KW - Alerts
KW - Clustering
KW - Embedding
KW - Metric ID
KW - Monitoring
KW - Negative sampling
KW - Pair-wise similarity
KW - Skip-gram
UR - http://www.scopus.com/inward/record.url?scp=85136465236&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2022.118489
DO - 10.1016/j.eswa.2022.118489
M3 - Article
AN - SCOPUS:85136465236
SN - 0957-4174
VL - 210
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 118489
ER -