Last updated on 18 March 2024

Publication details

G. Lanciano, R. Andreoli, T. Cucinotta, D. Bacciu, A. Passarella. "A 2-phase Strategy For Intelligent Cloud Operations," IEEE Access, September 2023

Abstract

When operating large cloud computing infrastructures, ensuring the health of physical resources and software components is of paramount importance to meet the demanding service levels expected by customers. This is only possible using automations that can detect anomalies and alert the on-call personnel, or trigger healing procedures. In production-grade deployments, such automations are generally based on static thresholds or predefined pattern-matching rules, checked against relevant metrics and logs. Defining and maintaining them is cumbersome and, as the infrastructure grows, they need continuous adjustments. To tackle this problem, we propose an intelligent automation system for cloud operations that learns, from what operators have done in the past, what actions should be applied in response to the observed anomalies. Such a system is designed to operate elastic groups of cloud instances realizing typical (replicated) cloud services. The mechanism is based on a 2-phase machine learning pipeline. A first, lighter model automatically detects anomalous patterns, based on past observations of the normal behavior, and on detection activates a second, more involved model. The second model recommends specific corrective actions, based on historical operational data reporting the actions applied to heal the faulty components. The approach was validated on an OpenStack deployment, where we deployed both a synthetic application and a multi-node Cassandra NoSQL data-store, and injected different types of anomalies while these systems were exercised using synthetic workloads. For both applications, we obtained high accuracy (mostly above 90%, and above 95% in some cases) on the anomaly detection and corrective action recommendation tasks, by applying the models to the respective test sets. This allows us to conclude that the presented mechanism constitutes an efficient and effective technique to help operate cloud services in the presence of a number of faults, although the types and heterogeneity of faulty conditions might be expanded in future evolutions of the framework. The implementation and the material needed to reproduce our results are available under an open-source license.
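
As an illustration of the 2-phase idea only, the following minimal Python sketch chains a cheap anomaly check with a more specific action recommender that runs only when the first phase fires. It is not the authors' implementation: the abstract does not specify the models used, so a z-score detector and a nearest-centroid classifier are stand-ins chosen for brevity. The class names, the threshold, the metrics and the action labels ("scale_out", "restart_instance") are all hypothetical.

```python
from collections import defaultdict
from math import dist
from statistics import mean, pstdev


class AnomalyDetector:
    """Phase 1 stand-in: flags samples far from past normal behavior."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # hypothetical z-score cutoff

    def fit(self, normal_samples):
        columns = list(zip(*normal_samples))
        self.means = [mean(c) for c in columns]
        # Fall back to 1.0 when a metric never varies, so z stays defined.
        self.stds = [pstdev(c) or 1.0 for c in columns]
        return self

    def is_anomalous(self, sample):
        z = [abs(x - m) / s for x, m, s in zip(sample, self.means, self.stds)]
        return max(z) > self.threshold


class ActionRecommender:
    """Phase 2 stand-in: nearest centroid over past (metrics, action) pairs."""

    def fit(self, samples, actions):
        grouped = defaultdict(list)
        for sample, action in zip(samples, actions):
            grouped[action].append(sample)
        self.centroids = {
            action: [mean(c) for c in zip(*rows)]
            for action, rows in grouped.items()
        }
        return self

    def recommend(self, sample):
        return min(self.centroids, key=lambda a: dist(sample, self.centroids[a]))


def handle(sample, detector, recommender):
    """Invoke phase 2 only when phase 1 reports an anomaly."""
    if not detector.is_anomalous(sample):
        return None
    return recommender.recommend(sample)


# Made-up data: (CPU utilization %, request latency in ms).
normal = [(40, 100), (45, 110), (50, 105), (42, 95), (48, 102)]
history = [(95, 120), (97, 130), (50, 900), (45, 950)]
actions = ["scale_out", "scale_out", "restart_instance", "restart_instance"]

detector = AnomalyDetector().fit(normal)
recommender = ActionRecommender().fit(history, actions)

result_normal = handle((44, 101), detector, recommender)
result_cpu = handle((96, 125), detector, recommender)
result_latency = handle((47, 920), detector, recommender)
```

In this sketch the recommender is never consulted for samples that look normal, which mirrors the abstract's point that the lighter model gates the more involved one.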

Open Access under a Creative Commons License (CC BY-NC-ND 4.0).

DOI: 10.1109/ACCESS.2023.3312218

BibTeX entry:

@article{Lanciano2023,
	doi = {10.1109/access.2023.3312218},
	url = {https://doi.org/10.1109%2Faccess.2023.3312218},
	year = 2023,
	publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
	pages = {1--1},
	author = {Giacomo Lanciano and Remo Andreoli and Tommaso Cucinotta and Davide Bacciu and Andrea Passarella},
	title = {A 2-phase Strategy For Intelligent Cloud Operations},
	journal = {{IEEE} Access}
}
