Knowledge-based Intelligent System for IT Incident DevOps | IEEE Conference Publication | IEEE Xplore


Abstract:

The automation of IT incident management (i.e., the handling of any unusual events that hamper the quality of IT services) is a main focus of Artificial Intelligence for IT Operations (AIOPS). The success and reputation of large-scale firms depend on their customer service and helpdesk systems, which handle client requests and track customer service agent interactions. In this research, we present a complete knowledge-based system that automates two core components of IT incident service management (ITSM): (1) Ticket Assignment Group (TAG) and (2) Incident Resolution (IR). Our proposed system bypasses the four core steps of the traditional ITSM process, namely data investigation, event correlation, situation room collaboration, and probable root cause. It provides immediate solutions that can save companies key performance indicator (KPI) resources and reduce the mean time to resolution (MTTR). The experiment used an industrial, real-time ITSM dataset from a prominent IT organization comprising 500,000 real-time incident descriptions with encoded labels. The system is then further evaluated on an open-source dataset. Compared to the existing benchmark methodologies, it achieves a 5% improvement in accuracy. The study demonstrates AI automation capabilities in incident handling (TAG and IR) for large real-world IT systems.
Date of Conference: 15 May 2023
Date Added to IEEE Xplore: 25 July 2023
Conference Location: Melbourne, Australia

I. Introduction

Artificial Intelligence for IT Operations (AIOPS) aims to automate IT operations using advances in Machine Learning (ML) and, to a certain extent, Deep Learning (DL). Information Technology Service Management (ITSM) is a subset of Automatic Information Operations focusing on planning, managing, and enhancing client IT services. Due to the massive number of incident reports, most IT service management businesses struggle to optimize and use their resources to prioritize and address the most important incidents, leading to excessive system downtime [1]. Typically, IT workers deal directly with customers to fix difficulties with particular elements of the system and their related procedures. IT incident management is the most critical aspect of IT service management [2], [3]. The goal of the IT help desk is to register user inquiries and provide instant feedback to address them. The most common way to find these answers is to search through the solution database. IT teams must react quickly to customer and employee inquiries by notifying the appropriate departments when a problem is escalated. High serviceability is the main goal, and it is achievable only through rapid resolution of the problem or a complete system restoration [4], [5].

Incident management starts once an incident ticket is raised. These tickets generally come from the organization (e.g., issues commonly associated with system accessibility) or from the system components (i.e., where specific segments of the system issue an alert) [6], [7]. Tickets are then rated as major or minor before they are escalated to the subject matter expert. Tickets are closed once the issue is fixed [8]. A subject matter expert manually assesses the problem's severity and determines whether additional investigation is required. Businesses frequently depend on manual IT ticket assignments, which regrettably leaves room for human mistakes (i.e., inaccurate level assignments) [4], [5].
Furthermore, many large organizations experience higher resource consumption, in the form of longer working hours, to handle the disruptions caused by these human errors. Ultimately, this results in negative customer/employee feedback, directly impacting the organization's reputation [6].

For incident resolution (IR), IT service management aims to efficiently identify the correct solution for an incident/outage. Finding the IR for an outage is a tedious, error-prone, and painstakingly time-consuming process [8]. The manual identification of solutions for IT outages extends the mean time to resolution (MTTR). We plan to resolve this issue by automating the IT resource management process [8]. To do that, we have implemented a state-of-the-art DL algorithm (e.g., the Bidirectional Encoder Representations from Transformers (BERT) model) to predict the solutions associated with each outage more precisely. Additionally, the assignment of IT outage tickets to an irrelevant group can cause deadlock, leading to a Major Incident Record (MIR) [9]. Similarly, this can be addressed by automating the assignment of complicated incidents, using a BERT-based predictive model to autonomously predict the assignment group associated with each outage.
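
To make the assignment-group task concrete, it can be framed as text classification: given an incident description, predict the group that should handle it. The sketch below is only an illustration under stated assumptions. It swaps in a simple bag-of-words overlap baseline in place of BERT so that it runs without any model download, and the tickets, group names, and function names are all hypothetical rather than taken from the paper.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a ticket description and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(tickets):
    """Build one bag-of-words profile per assignment group.

    `tickets` is a list of (description, group) pairs. Both the data and
    the group names are hypothetical and exist only for this sketch.
    """
    profiles = defaultdict(Counter)
    for description, group in tickets:
        profiles[group].update(tokenize(description))
    return profiles


def predict_group(profiles, description):
    """Return the group whose profile overlaps most with the ticket.

    The score is the summed profile count of the ticket's tokens,
    normalized by profile size. This is a stand-in baseline, not BERT.
    """
    tokens = tokenize(description)
    scores = {
        group: sum(counts[t] for t in tokens) / (sum(counts.values()) or 1)
        for group, counts in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical historical tickets with their resolved assignment groups.
HISTORY = [
    ("Cannot log in to VPN, password rejected", "access-management"),
    ("Password reset needed for email account", "access-management"),
    ("Database server disk is full", "infrastructure"),
    ("Server CPU alert on database node", "infrastructure"),
]

profiles = train(HISTORY)
print(predict_group(profiles, "User forgot password and cannot log in"))
print(predict_group(profiles, "Disk alert on production server"))
```

A fine-tuned BERT classifier would replace `train` and `predict_group` while keeping the same interface: incident text in, assignment group out.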
