One of the major problems in managing large-scale distributed systems is predicting application performance. The complexity of these systems and the availability of monitoring data have motivated the application of machine learning and other statistical techniques to induce performance models and forecast performance degradation. However, there is a pressing need for additional experimental and comparative studies, since no single method is optimal in all cases. Beyond a deeper comparison of different statistical techniques, existing studies are lacking in two important dimensions: the resilience of the statistical techniques to transient failures, and their diagnostic abilities. In this work, we address these issues and present three main contributions. First, we establish the capability of different statistical learning techniques to forecast the resource needs of component-based distributed systems. Second, we investigate an analysis engine that is more robust to false alarms, introducing a novel algorithm that augments the predictive power of statistical learning methods by combining them with a statistical test to identify trends in resource usage. Third, we investigate the applicability of statistical tests for identifying the nature and cause of performance problems in component-based distributed systems.