IT Ticket Classification: The Simpler, the Better

Recently, automatic classification of IT tickets has gained notable attention due to the increasing complexity of IT services deployed in enterprises. There is ongoing discussion, and no consensus, in the research and practitioner community on the design of IT ticket classification tasks, specifically the choice of ticket text representation techniques and classification algorithms. Our study investigates the core design elements of a typical IT ticket text classification pipeline. In particular, we compare the performance of TF-IDF and linguistic-features-based text representations designed for ticket complexity prediction. We apply various classifiers, including kNN and its enhanced versions, decision trees, naïve Bayes, logistic regression, and support vector machines, as well as semi-supervised techniques, to predict the ticket class label of low, medium, or high complexity. Finally, we discuss the evaluation results and their practical implications. As our study shows, the linguistic representation not only proves to be highly explainable but also delivers a substantial increase in prediction quality over TF-IDF. Furthermore, our experiments demonstrate the importance of feature selection. We show that even simple algorithms can deliver high-quality predictions when appropriate linguistic features are used.


I. INTRODUCTION
With today's increased digitization, any enterprise maintains a broad application portfolio, which has often grown historically. Such a portfolio must be supported by large-scale, complex IT service environments [1]. These developments reveal the fundamental role of IT support systems in any organization's support operations. Two essential steps of IT ticket processing, their correct prioritization and assignment, keep attracting the attention of practitioners and the scientific community, especially in the context of the ever-increasing number of tickets, errors, long lead times, and lack of human resources [2]-[7]. While small companies still tend to perform these steps manually, large organizations dedicate large budgets to the implementation of various commercial text classification solutions. Usually, these approaches are sophisticated monolithic software focused on accuracy at the cost of explainability and understandability. In our study, we are guided by this important issue in the choice of text representation and classification techniques.

(The associate editor coordinating the review of this manuscript and approving it for publication was Najah Abuali.)
Being one of the fastest-growing sectors of the service economy, enterprise IT services have recently gained importance both in research and practice. One popular stream of this research is IT service management (ITSM) [8], [9], which focuses on the servitization of the IT function and the organization and management of IT service provision [10]. For a successful establishment of the organizational infrastructure for IT services, IT practitioners implement IT service delivery with a reference process, the IT Infrastructure Library (ITIL) [11], [12]. Separate areas of ITIL, such as Incident, Problem, or Change Management, deal with a large amount of unstructured text data, i.e., tickets issued in relation to incidents, problems, or changes in the IT infrastructure products and services.
In the context of text classification, remarkable research has been done on the advantages and disadvantages of different text classification algorithms [13] and text representation techniques. In the specific context of IT tickets, a considerable number of studies have analyzed ticket text data to address problems such as correct assignment and prioritization, identification of duplicates, and poorly described tickets [14], [15]. The TF-IDF (Term Frequency-Inverse Document Frequency) representation of ticket texts is used predominantly [16], [17]. Various ticket text classification [18], [19] and clustering approaches [20], [21] have been studied.
In our previous research [22]-[24], we performed an in-depth linguistic analysis of ITIL Change Management (CHM) ticket texts originating from the IT CHM ticket processing department of a large enterprise with more than 200,000 employees worldwide. Using the elaborated linguistic representations of the ticket texts, we implemented a rule-based approach to predict the process complexity [25]. The precision of this approach reached approximately 62%. Beyond this moderate precision, establishing the rule set demanded substantial analysis and testing effort from both us and the process experts. Another substantial disadvantage of such an approach, particularly in complex scenarios, is its difficult maintenance and inability to learn and scale, as adding new rules requires revising the existing ones.
In this study, our objective is to address the challenges above: the choice of text representation technique, the choice of a classifier, the growing number of IT tickets, and the need for explainable text classification solutions to support IT helpdesk workers. Furthermore, we aim to develop an understanding of which factors are relevant for prediction quality, with a focus on text representation techniques, feature selection, and text classification algorithms.
In the context of representation of ticket text data, we compare the performance of the commonly accepted and widely used TF-IDF [26] with the linguistic approach described in our previous work [22]-[24]. Despite massive research efforts on IT ticket classification, the representation based on the linguistic features considered in this study has not yet been systematically compared with the TF-IDF representation in the ticket classification context.
In the classification methods, we focus on various classifiers widely used for text [13] and ticket [14] classification, including kNN and its enhanced versions, the so-called hubness-aware classifiers [27], [28], decision trees [29], naïve Bayes [30, pp. 253-286], logistic regression [31], and support vector machines [32], as well as semi-supervised techniques. The latter include kNN with self-training [33], the Semi-sUpervised ClassifiCation of TimE SerieS algorithm (SUCCESS) [34], and its improved version QuickSUCCESS [35]. Although state-of-the-art technology allows capturing a considerable number of instances in many applications (e.g., millions of images may be observed), in our case it is costly and difficult to obtain class labels for a large number of tickets. On the one hand, expert knowledge is required to label the tickets. On the other hand, as various aspects have to be taken into account, even experts need considerable time for this task. Therefore, due to the limited number of labeled tickets, we also experimented with semi-supervised approaches for ticket classification.
The rest of the paper is organized as follows: we give an overview of related works in Section II. Section III presents our methods, followed by the experiments in Section IV. Subsequently, we discuss the implications of the findings and conclude with limitations and future research directions.

II. RELATED WORK
In this paper, as we focus on the problem of CHM ticket classification, the related work is structured as follows: (i) text representation, (ii) text and ticket classification.

A. TEXT REPRESENTATION
Text representation is one of the essential building blocks of approaches for text mining and information retrieval. It aims to transform a text document into a numeric feature vector [36], [37], to allow a computational analysis of textual data. We studied different techniques of text representation and feature extraction and structured them into three main categories: weighted word techniques, word embedding [13], and linguistic features.

1) WEIGHTED WORDS
Weighted words approaches are based on counting the words in the document and computing the document similarity directly from the word-count space [13]. Due to their simplicity, Bag-of-Words (BoW) [38] and TF-IDF [39] can be considered as the most common weighted words approaches. The employed techniques usually rely on the aforementioned text representations rather than in-depth linguistic analysis or parsing [40]. As these models represent every word in the corpus as a one-hot-encoded vector, they are incapable of capturing the word semantic similarity and can become very large and technically challenging [13]. To address these limitations, we use linguistic features in the proposed approach. As described in Section III, the number of our linguistic features is independent of the size of the corpus, and they are capable of capturing relevant aspects of semantics.
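The weighted-words representations above can be sketched in a few lines with scikit-learn; the ticket texts below are invented placeholders, not data from our case study:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented placeholder tickets for illustration only.
tickets = [
    "update firewall rules on server",
    "server outage in production environment",
    "reset user password",
]

# Bag-of-Words: raw term counts over the corpus vocabulary.
bow = CountVectorizer()
X_bow = bow.fit_transform(tickets)

# TF-IDF: the same counts re-weighted by inverse document frequency.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(tickets)

# Both matrices have one column per corpus word, so the feature space
# grows with the vocabulary -- the limitation noted above.
print(X_bow.shape, X_tfidf.shape)
```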

2) WORD EMBEDDING
With the progress of research, new methods, such as word embedding, have emerged to address the limitations of weighted words approaches. Word embedding techniques learn from sequences of words by considering their occurrence and co-occurrence information. Each word or phrase from the corpus is mapped to a high-dimensional vector of real numbers, and the mapping from words to vectors is learned based on the surrounding words in the corpus. Various methods have been proposed, such as Word2Vec [41], Doc2Vec [42], GloVe [43], FastText [44], [45], and contextualized word representations [46], [47]. These methods consider the semantic similarity of words. However, they need a large corpus of text data for training. To mitigate this issue, pre-trained word embedding models have been published [48]. Unfortunately, such models do not work for words outside of the training corpus. Another limitation of word embedding models is that they are trained based on the words appearing in a pre-defined window. This is an inherent limitation for considering different meanings of the same word occurring in different contexts. To address this limitation, contextualized word representation techniques based on the context of the word in a document [46], [47] were developed. Nonetheless, in real-world applications, new words may appear (in the description of a new ticket, the customer may use phrases and specific words that have not been used before). These new words are not included in the corpus at training time. Therefore, word embedding techniques will fail to produce a correct mapping for them. While this problem may be alleviated by retraining the model, retraining requires a lot of data and computational resources (CPU, RAM). In contrast, as discussed next, it is straightforward to use our linguistic-features text representation in the case of new words, as it is not based on word embeddings.

3) LINGUISTIC FEATURES
Text representations based on linguistic features have been introduced to address the mentioned limitations of weighted words and word embeddings. Below, we list some examples.
Lexicon-based sentiment analysis is a well-established text classification technique [49]. Synonymy and hypernymy are known approaches to increase prediction quality [36]. Extensive research on the linguistic analysis of ticket texts has been performed in [50], [51]. The authors use parts-of-speech (PoS) count and specific terms extractions to define the reported problem's severity. Coussement and Van den Poel extract the following linguistic features: word count, question marks, unique words, the ratio of words longer than six letters, pronouns, negations, assents, articles, prepositions, numbers, time indication [52]. The study results indicated a profoundly beneficial impact of combining traditional features, like TF-IDF and singular value decomposition, with linguistic style into one text classification model. However, the authors declare a demand for more experiments and research. This demand is addressed in our work.
There is no unified opinion in the research community on whether linguistic approaches are good enough to substitute or enhance the traditional text representations; critics argue that they are more complex to implement and do not compensate for this complexity with the expected performance increase. Both proponents [53]-[58] and opponents [59] of the linguistic approach provide convincing experimental results. In our work, we amend these research discussions with case-study-based findings while comparing the performance of linguistic features with TF-IDF. Word embedding techniques are not applicable in our study for the following reasons: (i) pre-trained models would not perform well given the domain-specific vocabulary, which contains many new words compared to the training corpus; (ii) at the same time, a limited text corpus would not be enough for training a new model (or retraining an existing one); (iii) industrial applications predominantly demand explainable models so that classification mistakes can be understood and corrected. Table 1 summarizes the strengths and weaknesses of the discussed techniques.

B. TEXT AND TICKET CLASSIFICATION
Text classification, also referred to as text categorization or text tagging, is the task of assigning a text to a set of pre-defined categories [60]. Traditionally, this task has been done by human experts. Expert text classification, for example, remains widely used in qualitative research in the form of coding or indexing [61], [62]. Nevertheless, with the growing amount of text data, the attention of the research community and practitioners shifted to automatic methods. One can differentiate three main groups of automatic text classification approaches: rule-based, approaches based on machine learning (ML), and a combination of both in a hybrid system. Ticket classification in software development and maintenance has been studied to tackle challenges, such as correct ticket assignment and prioritization, predicting time and number of potential tickets, and avoiding duplicate tickets.

1) RULE-BASED APPROACHES
As the name suggests, this kind of text classification system is based on a set of rules determining the classification into predefined categories. In general, rule-based classifiers are a popular data mining method applicable to diverse data types. They became popular due to their transparency, explainability, and relative simplicity [63]. Various types of rule development can be distinguished: ML-based, such as decision trees [64] or association rules [65], [66]; handcrafted rules created by experts; or hybrid approaches [67], [68]. It is essential to mention that rule-based systems work well with small rule sets. However, they become difficult to build, manage, and change as the number of rules grows.
In our previous work, we conceptualized a recommender system for IT ticket complexity prediction using rule-based classification, i.e., handcrafted rules and rules based on decision trees [25]. Nonetheless, due to the complexity of the domain, the process of rule development consumed much time and effort. The system was difficult to manage and change.

2) MACHINE LEARNING APPROACHES
In the context of ML, text classification can be formalized as follows: given a set of classification labels S and a set of training instances F, each labeled using class labels in S, the model must use F to learn to predict the class labels of unseen instances of the same type [69]. In text classification, F is a set of labeled documents from a corpus. The labels can be, for example, extracted topics, themes, writing styles, or judgments of the documents' relevance [36]. Regarding the prediction itself, various techniques such as kNN, its enhanced versions (hubness-aware classifiers), decision trees, naïve Bayes, logistic regression, support vector machines, and neural networks have been introduced [13]. ML approaches have been shown to be more accurate and easier to maintain compared to rule-based systems [70]. At the same time, it is challenging to select the best ML technique for a particular application [13]. Table 2 summarizes the strengths and weaknesses of the main approaches for text classification.
Most ML techniques, including all the approaches above, require a large amount of training data. However, as described previously, in our application, it is challenging to obtain labeled training data. In contrast, semi-supervised learning (SSL) allows inducing a model from a large amount of unlabeled data combined with a small set of labeled data. For an overview of major SSL methods, their advantages and disadvantages, we refer to [71]. For the above reasons, we experimented with semi-supervised ML approaches. In particular, we used kNN with self-training and SUCCESS [34]. Although SUCCESS showed promising results, it does not scale well to large datasets. We addressed this limitation by suggesting a scaling technique for SUCCESS, and we called the resulting approach QuickSUCCESS [35].

III. METHODS
In the choice of the methods, we put a special emphasis on the explainability and understandability of the classification results for the process worker.

A. FEATURE EXTRACTION
To allow for automated analysis, texts are usually transformed into a vector space. Unnecessary characters and words (stop words) are removed. Afterward, diverse feature extraction techniques can be applied. In our study, we compare two types of features: TF-IDF and linguistic features.
Next, we describe the linguistic features in detail. These features are specifically designed for the task of IT ticket complexity prediction. Three levels of text understanding are commonly distinguished: (1) objective (answering the questions who, what, where, when), measured by semantic technologies; (2) subjective (the emotional component of a text), measured by sentiment analysis; and (3) meta-knowledge (information about the author outside of the text), measured by stylometry or stylistic analysis [73]. Accordingly, we develop a set of features aggregated by the respective measures indicating IT ticket complexity. We proposed these features in our initial works [22]-[24]. A detailed explanation of the linguistic features is provided below using an anonymized IT ticket example (see also Table 3).

1) OBJECTIVE KNOWLEDGE ASPECT
Core research in Natural Language Processing (NLP) addresses the extraction of objective knowledge from text, i.e., which concepts, attributes, and relations between concepts can be extracted, including specific relations such as causal, spatial, and temporal ones [73]. Among diverse approaches, taxonomies and ontologies in particular are widely used in the business context [74], [75]. Thus, we suggest a specific approach to objective knowledge extraction using the Decision-Making Logic (DML) taxonomy [24], illustratively presented in Appendix I. It aims to discover the decision-making nature of activities, called the DML level. We use the following DML levels: routine, semi-cognitive, and cognitive (corresponding to the columns of the table in Appendix I). Using the Latent Dirichlet Allocation (LDA) algorithm [76], we identify the most important keywords; see [24] for details (the accompanying materials are available at https://github.com/IT-Tickets-Text-Analytics). Each keyword is associated with a DML level. For example, the keywords user, test, request, etc. are associated with routine, whereas the keyword management belongs to cognitive. We detect these keywords in IT ticket texts. Based on the total number of detected keywords, we calculate the relative occurrence of the words of each category in the ticket text. In the example shown in Table 3, the total number of detected words equals five, out of which four words belong to routine (server, attach, detail, see), one to semi-cognitive (service), and none to cognitive. Thus, the corresponding features are calculated as follows: routine 4/5 = 0.8 and semi-cognitive 1/5 = 0.2.
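The computation of these features can be sketched as follows; the keyword-to-level mapping below is a tiny invented excerpt, as the full taxonomy is derived via LDA [24]:

```python
# Invented excerpt of the DML taxonomy: keyword -> decision-making level.
DML_LEXICON = {
    "user": "routine", "test": "routine", "request": "routine",
    "server": "routine", "attach": "routine", "detail": "routine", "see": "routine",
    "service": "semi-cognitive",
    "management": "cognitive",
}

def dml_features(tokens):
    """Relative occurrence of routine/semi-cognitive/cognitive keywords."""
    hits = [DML_LEXICON[t] for t in tokens if t in DML_LEXICON]
    counts = {level: 0 for level in ("routine", "semi-cognitive", "cognitive")}
    for level in hits:
        counts[level] += 1
    total = len(hits)
    return {level: (c / total if total else 0.0) for level, c in counts.items()}

# The anonymized Table 3 example: five detected keywords,
# four routine and one semi-cognitive.
features = dml_features(["server", "attach", "detail", "see", "service"])
print(features)  # routine 0.8, semi-cognitive 0.2, cognitive 0.0
```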

2) SUBJECTIVE KNOWLEDGE ASPECT
To extract the subjective knowledge aspect, we perform a sentiment analysis [77]. In [23], we suggest a specific business sentiment approach as an instrument for measuring the emotional component of an IT ticket. This latent information is extracted from the unstructured IT ticket texts with the help of a lexicon-based approach. As standard lexicons do not work well in our context of IT ticket classification, we developed a domain-specific lexicon using the state-of-the-art VADER [78] and LDA; see [23] for details. Our lexicon can also be found in Appendix II.
Each of the words is associated with positive, negative, or neutral sentiment. Words with valence scores greater than 0 are considered positive, whereas words with a valence score less than 0 are considered negative. All other words are considered to have a neutral sentiment. We determine the proportion of words with negative, neutral, and positive sentiment for each IT ticket text and use these values as features. In our example, there are no words with positive or negative sentiment. Therefore, the ticket is assigned to be entirely neutral.
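A minimal sketch of these sentiment features follows; the valence scores below are invented stand-ins for the domain-specific lexicon of [23]:

```python
# Invented stand-in valence scores; the real lexicon is given in Appendix II.
SENTIMENT_LEXICON = {"fail": -1.9, "error": -1.4, "resolve": 1.2, "thanks": 1.8}

def sentiment_features(tokens):
    """Proportions of positive, negative, and neutral words in a ticket."""
    pos = sum(1 for t in tokens if SENTIMENT_LEXICON.get(t, 0.0) > 0)
    neg = sum(1 for t in tokens if SENTIMENT_LEXICON.get(t, 0.0) < 0)
    n = len(tokens)
    return {"positive": pos / n, "negative": neg / n,
            "neutral": (n - pos - neg) / n}

# A ticket without lexicon words is entirely neutral, as in Table 3.
print(sentiment_features(["registry", "server", "attachment", "details"]))
```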

3) META-KNOWLEDGE ASPECT
In our case, meta-knowledge is knowledge about the author of an IT ticket. The quality of the written text will likely depend on factors such as the author's professionalism and expertise, level of stress, and various psychological and sociological properties [73]. To extract the meta-knowledge aspect, we use the following features [22]: (1) IT ticket text length, (2) PoS features, and (3) wording style [22], calculated with Zipf's law of word frequencies [79].
By length, we mean the number of words in the IT ticket text. This feature is motivated by the following observation: in most cases, IT ticket texts containing only a few words, such as update firewalls, refer to simple daily tasks. Therefore, a short ticket text may be an indication of low ticket complexity. In the example shown in Table 3, the length of the ticket is 12 words.
As for PoS features, we consider the following PoS tags: nouns, verbs, adjectives, and adverbs. For each of them, we calculate their absolute occurrence, i.e., the total number of words having that PoS tag (for example, the total number of nouns). Subsequently, we calculate their relative occurrence, called occurrence for simplicity, i.e., the ratio of nouns, verbs, adjectives, and adverbs relative to the length of the text. We use these occurrences as features. In the example shown in Table 3, the occurrence of nouns (registry, XYZ-ZZ, YYY, server, attachment, details) in all words is 6/12 = 0.5.
We also calculate the number of unique words having the considered PoS tags (for example, number of unique nouns). Then, we calculate the occurrence of unique nouns, verbs, adjectives, and adverbs relative to the number of all nouns, verbs, adjectives, and adverbs, respectively. We use these occurrences as features as well. In Table 3, the uniqueness of nouns is 6/6 = 1 (no repeating words).
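The PoS occurrence and uniqueness features can be sketched as follows; a real pipeline would obtain the tags from a PoS tagger (e.g., NLTK or spaCy), whereas here they are supplied directly to keep the sketch self-contained:

```python
def pos_features(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs with tags like 'NOUN', 'VERB'."""
    length = len(tagged_tokens)
    features = {"length": length}
    for tag in ("NOUN", "VERB", "ADJ", "ADV"):
        words = [w for w, t in tagged_tokens if t == tag]
        # Occurrence: share of words with this tag relative to text length.
        features[f"{tag}_occurrence"] = len(words) / length if length else 0.0
        # Uniqueness: share of distinct words among all words with this tag.
        features[f"{tag}_uniqueness"] = (len(set(words)) / len(words)
                                         if words else 0.0)
    return features

# Toy version of the Table 3 example: 12 words, 6 of them nouns, all unique.
tagged = ([("registry", "NOUN"), ("XYZ-ZZ", "NOUN"), ("YYY", "NOUN"),
           ("server", "NOUN"), ("attachment", "NOUN"), ("details", "NOUN")]
          + [("see", "VERB")] + [("the", "DET")] * 5)
print(pos_features(tagged)["NOUN_occurrence"])  # 6/12 = 0.5
```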
According to Zipf's word frequency law, the distribution of word occurrences is not uniform, i.e., some words occur very frequently. In contrast, others appear with a low frequency, such as only once. Our wording style feature describes how extreme this phenomenon is in the IT ticket text, i.e., whether the distribution of occurrences is close to being uniform or not. For details, we refer to [22].

B. SUCCESS
After the features have been extracted and the texts are represented in the form of numerical vectors, they can be fed into ML classifiers. To keep our paper self-contained, we briefly review SUCCESS and its scaled version below.
We define the semi-supervised classification problem as follows: given a set of labeled instances L = {(x_i, y_i)}, i = 1, ..., l, and a set of unlabeled instances U = {x_i}, i = l+1, ..., n, the task is to train a classifier using both L and U. We use the phrase set of labeled training instances to refer to L, where x_i is the i-th instance and y_i is its label, whereas we say that U is the set of unlabeled training instances. The labeled instances (elements of L) are called seeds. We wish to construct a classifier that can accurately classify any instance, i.e., not only elements of U. For this problem, we proposed the SUCCESS approach, which has the following phases:
1. The labeled and unlabeled training instances (i.e., instances of L and U) are clustered with constrained single-linkage hierarchical agglomerative clustering [34]. While doing so, we include cannot-link constraints for each pair of labeled seeds, even if both seeds have the same class labels.
2. The resulting top-level clusters are labeled by their corresponding seeds.
3. The final classifier is a 1-nearest neighbor classifier trained on the labeled data resulting at the end of the second phase. This classifier can be applied to unseen test data.
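The three phases can be sketched as follows; this is a simplified illustration, not the reference implementation of [34]. Because cannot-link constraints hold between every pair of seeds, merging stops once each remaining cluster contains exactly one seed:

```python
import numpy as np

def success_fit_predict(X_labeled, y_labeled, X_unlabeled, X_test):
    # Phase 1: constrained single-linkage agglomerative clustering.
    X = np.vstack([X_labeled, X_unlabeled])
    n, l = len(X), len(X_labeled)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Each cluster is (set of point indices, seed label or None).
    clusters = [({i}, y_labeled[i] if i < l else None) for i in range(n)]

    def single_linkage(a, b):  # distance of the closest pair of points
        return min(dist[i, j] for i in a for j in b)

    while len(clusters) > l:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i][1] is not None and clusters[j][1] is not None:
                    continue  # cannot-link: both clusters contain seeds
                d = single_linkage(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] if clusters[i][1] is not None else clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    # Phase 2: every point inherits the label of its cluster's seed.
    labels = np.empty(n, dtype=object)
    for members, label in clusters:
        for i in members:
            labels[i] = label

    # Phase 3: 1-NN over the now fully labeled training data.
    test_dist = np.linalg.norm(X_test[:, None, :] - X[None, :, :], axis=-1)
    return [labels[np.argmin(row)] for row in test_dist]

# Toy example: two seeds and three unlabeled points on a line.
preds = success_fit_predict(np.array([[0.0], [10.0]]), ["low", "high"],
                            np.array([[1.0], [2.0], [9.0]]),
                            np.array([[1.5], [8.5]]))
print(preds)
```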
As we noticed the relatively slow training speed of semi-supervised classifiers, we implemented resampling [80], a technique similar to bagging, to accelerate the classification and possibly improve classification results [81].
In particular, we select a random subset of the data and train the model on the selected instances. This process is repeated several times. When making predictions for new instances, the predictions of all these models are aggregated by majority voting. As the sample size is much smaller than the size of the original dataset and the training has superlinear complexity, training several models on small samples is computationally cheaper than training one model on the entire dataset. While kNN with resampling has long been established in the research community [82], to the best of our knowledge, this is the first attempt to speed up the SUCCESS algorithm using resampling. We call the resulting approach QuickSUCCESS [35]. Next, we explain QuickSUCCESS in detail.
The core component of SUCCESS is constrained hierarchical agglomerative clustering. Although various implementations are possible, we may assume that it is necessary to calculate the distance between each pair of unlabeled instances as well as between each unlabeled instance and each labeled instance. This results in a theoretical complexity of at least O(l(n − l) + (n − l)^2) = O(n^2 − nl), where l denotes the number of labeled instances (l = |L|), while n denotes the number of all instances (labeled and unlabeled) available at training time (n = |L| + |U|). Under the assumption that l is a small constant, the complexity of the distance computations is O(n^2).
Considering only the aforementioned distance computations required for SUCCESS, the computational costs for a dataset containing n/c instances are c^2-times lower than for a dataset containing n instances. Therefore, even if the computations have to be performed on several "small" datasets, the overall computational costs may be an order of magnitude lower. In particular, computing SUCCESS m times on a dataset containing r instances has a total computational cost of O(m · r^2). Under the assumption that r = n/c and m ≈ O(c), the resulting complexity is O(n^2/c). Based on this analysis, we propose to speed up SUCCESS by repeated sampling. In particular, we sample the set of unlabeled instances m times. From now on, we will use U^(j) to denote the j-th sample, 1 ≤ j ≤ m.
Each U^(j) is a random subset of U containing r instances (|U^(j)| = r). For simplicity, each instance has the same probability of being selected. To train the j-th classifier, we use all the labeled instances in L together with U^(j). When sampling the data, the size r of the sample should be chosen carefully so that the sampled data is representative in the sense that the structure of the classes can be learned. This is illustrated in Figure 1. As the sampling is repeated m times, we induce m classifiers, denoted C^(1), ..., C^(m). Each classifier C^(j) predicts the labels of the unlabeled instances in the corresponding sample of the data, i.e., the instances of U^(j). More importantly, each of these classifiers can be used to predict the label of new instances, i.e., instances that are neither in L nor in U. Labels for the instances in U are predicted as follows: for each x_i ∈ U, we consider all those sets U^(j) for which x_i ∈ U^(j) and the classifiers that were trained using these datasets. The majority vote of these classifiers is the predicted label of x_i. Our approach is summarized in Algorithm 1.
The label of a new instance x ∉ L ∪ U is predicted as the majority vote of all the classifiers C^(1), ..., C^(m). We note that the computations related to these classifiers can be executed in parallel, which may result in additional speed-up on systems where multiple CPUs are available, such as high-performance computing (HPC) systems.
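The resampling scheme can be sketched generically as follows; the base learner below is an invented stand-in (nearest-seed labeling plus 1-NN), not the SUCCESS implementation:

```python
import numpy as np
from collections import Counter

def resampled_majority_vote(train_one, L_X, L_y, U_X, X_test, m, r, rng):
    # Train m classifiers, each on the seeds L plus a random sample of
    # r unlabeled instances; aggregate their predictions by majority vote.
    predictors = []
    for _ in range(m):
        idx = rng.choice(len(U_X), size=r, replace=False)
        predictors.append(train_one(L_X, L_y, U_X[idx]))
    votes = np.array([[p(x) for x in X_test] for p in predictors])
    return [Counter(votes[:, i]).most_common(1)[0][0]
            for i in range(len(X_test))]

def nearest_seed_learner(L_X, L_y, U_X):
    # Stand-in base learner: label each sampled unlabeled point by its
    # nearest seed, then classify new points by 1-NN over all points.
    U_y = [L_y[np.argmin(np.linalg.norm(L_X - u, axis=1))] for u in U_X]
    X, y = np.vstack([L_X, U_X]), list(L_y) + U_y
    return lambda x: y[np.argmin(np.linalg.norm(X - x, axis=1))]

rng = np.random.default_rng(0)
L_X, L_y = np.array([[0.0], [10.0]]), ["low", "high"]
U_X = np.linspace(0, 10, 50).reshape(-1, 1)
preds = resampled_majority_vote(nearest_seed_learner, L_X, L_y, U_X,
                                np.array([[1.0], [9.0]]), m=5, r=10, rng=rng)
print(preds)
```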
The same resampling technique can be applied with other classifiers as well. In particular, in our experiments, we used it with a semi-supervised variant of kNN, called kNN with self-training and resampling.
Self-learning is a semi-supervised technique known to improve the learning process in case of a large number of unlabeled and a small number of labeled instances [83]. Therefore, we also use kNN with self-training. Hereby, kNN is iteratively retrained with its own most confident predictions [33].

IV. EXPERIMENTAL EVALUATION
The goal of our experiments is to analyze the contribution of the proposed text representation, study the effect of feature selection and the performance of various classifiers.
Our experimental design is summarized in Figure 2 and includes the following steps: (1) collect the case study data; (2) pre-process the data; (3) label part of the data; (4) split the data into training and test sets; (5) extract two sets of features: TF-IDF and linguistic features; (6) apply the classifier; (7) evaluate the results with standard metrics. All the experiments were conducted using Python 3.6. In the case of naïve Bayes, logistic regression, decision trees, and support vector machines, we used the publicly available implementation from scikit-learn, whereas in the case of hubness-aware classifiers and SUCCESS, we used PyHubs. We implemented kNN with and without self-training, as well as the proposed speed-up technique based on resampling, ourselves. We performed all the experiments on an Intel® Core™ i7 machine with 16 GB RAM.

A. DATASETS
The datasets (see Table 4), in the form of IT ticket texts, originate from an ITIL CHM department of a large enterprise. The datasets were received (step 1) according to their availability and covered the whole period obtainable at the time of this study. The first dataset (Data1) comprised 28,243 tickets created from 2015 to 2018. The data was pre-processed (step 2) by removing stop words and punctuation, lowercasing, and stemming, and converted into a CSV-formatted corpus of ticket texts. The ticket texts were predominantly in English (more than 80% English words, with a small portion of German words). The second dataset (Data2) comprised 4,684 entries, also predominantly in English, from the period January to May 2019. The length of the IT ticket texts varies from 5-10 words to 120-150 words.
As labeling the data is a time-consuming and tedious process, and our data comes from an operational unit fully overbooked with daily work, which is often the case in industry, experts labeled 30 tickets of Data1 and 60 tickets of Data2 (step 3). To provide a correct label, the case study workers needed to analyze all the factors influencing IT ticket complexity, such as the number of tasks; the number of Configuration Items (CIs), specifically applications; whether the ticket had to be executed online or offline (planning of downtime and considering affected CIs); involvement of the Change Advisory Board (CAB); etc. Therefore, labeling a single ticket could take up to 30 minutes. For the complexity class labels, a qualitative scale of low, medium, and high was selected to simplify the classification task, being a well-known scale of priority ratings [90]. Although the two datasets come from the same case, the distinct time periods and different numbers of labels allowed us to test our approaches in two independent settings.

B. EXPERIMENTAL SETTINGS
To evaluate the classifiers, we use leave-one-out cross-validation [91]. We select one of the labeled instances as the test instance and use all the remaining labeled instances as training data (step 4). This process is repeated as many times as there are labeled instances, so that in each round a different instance serves as the test instance. For example, in the case of Data2, we select one of the available 60 labeled instances as the test instance and use the remaining 59 as labeled training data L. The SSL classifiers additionally used the 4,684 - 60 = 4,624 unlabeled instances as unlabeled training data U. We repeat the process 60 times.
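The leave-one-out procedure can be sketched as follows, with synthetic features and labels standing in for the real ticket representations:

```python
# Leave-one-out cross-validation: each labeled instance serves
# exactly once as the test instance. Data here is synthetic.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))     # 60 labeled instances, 5 features
y = rng.integers(0, 3, size=60)  # classes: 0=low, 1=medium, 2=high

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

accuracy = correct / len(X)      # one prediction per labeled instance
```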
As evaluation metrics, we use accuracy, average precision, average recall, and F-score (step 7). To compare the differences in the classifiers' performance, we use the binomial test by Salzberg [92] at the significance threshold of 0.05.
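One common reading of the binomial test for classifier comparison, in the spirit of Salzberg's procedure, looks only at the instances on which the two classifiers disagree; the counts below are invented, not the study's results:

```python
# Binomial test on the disagreement instances of two classifiers A and B.
# Under H0 (equal quality), each disagreement win is a fair coin flip.
from scipy.stats import binomtest

wins_a = 14  # instances A got right and B got wrong (hypothetical)
wins_b = 3   # instances B got right and A got wrong (hypothetical)

result = binomtest(wins_a, wins_a + wins_b, p=0.5)
significant = result.pvalue < 0.05  # significance threshold of the paper
```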
When applying semi-supervised classifiers, we consider all the unlabeled instances available at training time. Unless stated otherwise, we use kNN with k=1 and Euclidean distance, a choice also justified by theoretical and empirical studies [93], [94]. In the case of kNN with self-training, we set the number of self-training iterations to 100.
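A common variant of kNN self-training can be sketched as follows, under the assumption that the most confident unlabeled instance is the one nearest to the labeled set; the data and the iteration count are illustrative:

```python
# Sketch of kNN self-training: in each iteration, the unlabeled instance
# closest to the labeled set receives its 1-NN label and is moved into
# the training data. Synthetic data; one common variant, not the paper's
# exact implementation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_l = list(rng.normal(size=(10, 4)))   # labeled instances
y_l = list(rng.integers(0, 3, size=10))
X_u = list(rng.normal(size=(50, 4)))   # unlabeled pool

for _ in range(20):                    # self-training iterations
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_l, y_l)
    dist, _ = clf.kneighbors(X_u, n_neighbors=1)
    best = int(np.argmin(dist))        # most confident = nearest neighbor
    x_new = X_u.pop(best)
    y_l.append(int(clf.predict([x_new])[0]))
    X_l.append(x_new)

final_labeled = len(y_l)               # 10 original + 20 self-labeled
```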
With respect to support vector machines (SVMs), we used internal leave-one-out cross-validation on the training data to identify appropriate hyper-parameter settings. In particular, we considered polynomial and RBF kernels and used a grid search to determine the best values of the complexity constant C and the kernel coefficient gamma in the range 10^-5, 10^-4, ..., 10^4, 10^5. As for the degree d of the polynomial kernel, we considered all integer values between 1 and 10.
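The grid search with internal leave-one-out cross-validation can be sketched as follows (RBF kernel only, synthetic data; the paper's search additionally covers polynomial kernels and their degree):

```python
# Hyper-parameter search over C and gamma in 10^-5 .. 10^5
# with leave-one-out CV, on synthetic data.
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = rng.integers(0, 2, size=30)

grid = {"C": [10.0 ** e for e in range(-5, 6)],
        "gamma": [10.0 ** e for e in range(-5, 6)]}
search = GridSearchCV(SVC(kernel="rbf"), grid,
                      cv=LeaveOneOut(), scoring="accuracy")
search.fit(X, y)
best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
```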
Regarding linguistic features, we include feature selection tests to identify which features play an important role in prediction quality. Similarly to [95], [96], we use logistic regression to determine the most predictive features based on their weights. We also compare classifiers with a statistical method, i.e., an expert-defined decision rule based on the presence of cognitive words.
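Weight-based feature ranking with logistic regression can be sketched as follows; the feature names mirror some of the paper's linguistic features, but the data is synthetic and deliberately constructed so that the first feature is predictive:

```python
# Rank features by the magnitude of logistic regression coefficients.
# Feature names are illustrative; data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["cognitive_words", "unique_adjectives", "wording_style",
                 "unique_verbs", "positive_sentiment"]
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)  # first feature made predictive on purpose

clf = LogisticRegression(max_iter=1000).fit(X, y)
ranking = sorted(zip(feature_names, np.abs(clf.coef_[0])),
                 key=lambda t: -t[1])
top_feature = ranking[0][0]
```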

C. COMPARISON OF SUCCESS AND QUICKSUCCESS
As an initial experiment, we measured the execution time of SUCCESS and QuickSUCCESS using the linguistic features representation. In the case of the QuickSUCCESS approach, we selected r = 100 instances and trained m = 100 models. Table 5 shows our results: the execution time of one round of cross-validation (in seconds) and the accuracy. As one can see, the proposed technique yields a speed-up of several orders of magnitude on both Data1 and Data2, with negligible loss of prediction quality (the difference corresponds to the misclassification of one additional instance). Hence, we used the QuickSUCCESS algorithm in all further experiments. Table 6 provides the obtained values of accuracy and the averages of precision and recall calculated over the three classes of low, medium, and high complexity. The F-score is calculated from the average precision and recall.
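The resampling idea behind the speed-up (training m models on random subsamples of r instances each and aggregating their predictions by majority vote) can be sketched as follows; a 1-NN classifier stands in for SUCCESS, whose internals are outside this excerpt, and the data is synthetic:

```python
# Resampling-based speed-up sketch: m small models on r-instance
# subsamples, majority vote over their predictions. A 1-NN classifier
# is a stand-in for SUCCESS; data is synthetic.
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 3, size=1000)
x_test = rng.normal(size=(1, 5))

m, r = 100, 100
votes = []
for _ in range(m):
    idx = rng.choice(len(X), size=r, replace=False)  # sample r instances
    model = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
    votes.append(int(model.predict(x_test)[0]))

prediction = Counter(votes).most_common(1)[0][0]     # majority vote
```

Each model trains on r instances instead of the full dataset, which is where the speed-up comes from.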

D. RESULTS
We compare the classifiers' performance using linguistic features with that of classifiers using TF-IDF on Data1 and Data2 and point out the systematic improvement in prediction quality when linguistic features are used. Namely, we observed that the classifiers' performance with linguistic features was almost always higher than with TF-IDF under comparable conditions (the only difference between the two experiments being the text representation technique). In the case of TF-IDF features, most classifiers predicted the dominant class for the vast majority of the instances, resulting in relatively low values of precision and recall, i.e., all classifiers had difficulty predicting the correct class labels.
Due to the higher number of labeled tickets, Data2 showed an expected systematic increase in classification quality compared to Data1, independent of the applied algorithm or text representation technique. When enhancing kNN with self-training, we expected a noticeable increase in prediction quality; nonetheless, the evaluation results showed no improvement.
As stated in the Experimental settings subsection, we identified the most predictive features for Data1 and Data2 using logistic regression. According to the weights of logistic regression, in the case of Data1, the five most important features are: 1) relative occurrence of cognitive words; 2) occurrence of unique adjectives among all adjectives; 3) wording style; 4) occurrence of unique verbs among all verbs; 5) relative occurrence of words with positive sentiment. In the case of Data2, the five most important features are: 1) relative occurrence of cognitive words; 2) occurrence of unique adjectives among all adjectives; 3) occurrence of unique verbs among all verbs; 4) relative occurrence of words with negative sentiment; and 5) relative occurrence of words with positive sentiment. The most essential feature thus appeared to be the relative occurrence of cognitive words, further referred to as the ''cognitive words'' feature. After that, we trained the classifiers with the selected linguistic features. The results are summarized in Table 7.
As can be seen in Tables 6 and 7, the usage of our linguistic features delivers excellent performance with simple algorithms, such as decision trees and naïve Bayes, both of which are statistically significantly better than the other classifiers. Additionally, we have shown that selecting the best set of features consistently improves the performance of classifiers. Moreover, using a smaller set of features simplifies the extraction process and reduces computational costs.
As mentioned above and according to the weights of logistic regression, the most important feature in the ticket text appears to be the ''cognitive words'' feature. To confirm this observation, we tested the algorithms' performance without this feature (see Table 8). When the most predictive feature is excluded, we see a statistically significant decrease in prediction quality both for decision trees and naïve Bayes. In the case of classifiers that treat all features as equally important (e.g., kNN and its variants), the accuracy was generally low, and we did not observe differences when excluding the ''cognitive words'' feature.
Furthermore, we experimented with a statistical method, i.e., the expert decision rule based on the occurrence of the most predictive ''cognitive words'' feature in a ticket text, denoted as COG. Our prediction (ŷ) is based on our previously mentioned research [25] and is determined as a decision rule over COG. This expert rule is competitive with some of the examined classifiers; however, its performance is significantly worse than that of decision trees and naïve Bayes (see Table 9). In the case of kNN, we also tested other values of k, in particular all odd numbers between 1 and 10, and different distance functions, such as Euclidean, Manhattan, and cosine. As a result, we did not observe any substantial differences (see Figure 3).
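The exact expert rule is defined in [25] and is not reproduced in this text; a hypothetical COG-threshold rule of this general shape could look as follows (the thresholds are invented for illustration):

```python
# Hypothetical expert rule mapping the relative occurrence of
# cognitive words (COG) to a complexity class. The thresholds are
# invented placeholders, not the rule from [25].
def expert_rule(cog, low_thr=0.05, high_thr=0.15):
    """Map COG to a complexity class label (illustrative thresholds)."""
    if cog < low_thr:
        return "low"
    if cog < high_thr:
        return "medium"
    return "high"

labels = [expert_rule(c) for c in (0.01, 0.08, 0.30)]
```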

V. DISCUSSION
In this study, our focus was to gain a better understanding of the factors influencing prediction quality in text classification tasks. At the same time, we addressed (i) the need for further experiments in industrial settings, especially with recent classifiers that have not been used for ticket classification before, and (ii) the limitations of our rule-based approach. As stated at the beginning of the study, as Artificial Intelligence (AI) becomes an increasingly large part of our business and daily lives, we recognize the paramount importance of the explainability of AI-based solutions. The need to understand the decisions made by an AI system is crucial for trust and, primarily, for efficiently addressing the errors made by such a system. In our work, we closely study the ITIL Change Management based IT ticketing process of a large telecommunications company. Implementing changes in the IT infrastructure means intervening in the organization's IT environment and its functions every time a change is performed. Below, we review the study findings in light of the experimental results.
Our study findings show that feature selection is an important component of ticket classification pipelines, not only to reduce the dimensionality of the data and the computational costs but also to increase the performance of classifiers. We demonstrate that the five linguistic features we identified based on the weights of logistic regression are enough to deliver accurate predictions. By comparing the algorithms' performance with and without the most important feature, the relative occurrence of cognitive words, we show that this feature strongly influences the prediction results. In the case of the kNN classifier, when experimenting with different values of k and distance functions, we also show that their choice does not significantly impact the prediction, especially compared with the choice of features (TF-IDF vs. linguistic features).
As there is an ongoing discussion on the advantages and disadvantages of various text representation techniques and classification algorithms, our paper offers a new analysis of both design decisions (text representation and classifier) and their interaction. First, we systematically compared the efficiency of linguistic features and TF-IDF representations in the context of IT ticket complexity prediction; to the best of our knowledge, this is the first attempt at such a research setting. Second, using different datasets and text representation techniques, we consistently tested twelve ML classifiers and showed that simple algorithms, such as decision trees and naïve Bayes, achieved accurate prediction with the linguistic features representation. Notably, the five best-performing features alone delivered results of the same quality. Hence, building explainable text classification pipelines, i.e., using linguistic features with simple algorithms, can have great potential in real-world tasks.
Our study offers valuable insights for managers and ML experts by enabling them to understand the interdependencies or their absence between selected text representation techniques, classification algorithms, and prediction quality.
The most remarkable practical implication of this study is improving our rule-based approach. We used the described linguistic features approach and both handcrafted and decision trees-based rules to predict the IT ticket complexity value of low, medium, or high [25]. Using the ML classification pipeline discussed in the paper, we managed to improve prediction quality significantly without the need to define and constantly update the rules with the experts, which is a big hurdle in the management, maintenance, and scalability of such systems.
In real-life scenarios, IT ticket processing teams continuously undergo changes and fluctuations, causing resolver groups to be split, merged, or reorganized. Due to fast technological developments, the problems they have to solve also change. Hence, a trained machine learning model can become outdated very quickly and will require retraining [97]. This again justifies our message, the simpler, the better: we recommend using classifiers that do not demand complex parameter search and are simple and cheap to implement, provided their prediction quality is acceptable.
To sum up, systematic testing of the discussed representation techniques and classification algorithms and their application for the IT ticket classification task of complexity prediction can be of interest for text data scientists and managers who seek to understand the role of factors influencing the prediction.

VI. CONCLUSION
Our work aimed to provide a comparative analysis of text representation techniques and classifiers while developing an IT ticket classification pipeline. Our observations can be useful for the design of decision support systems in enterprise applications, specifically in the IT ticket area.
The contributions of our work can be summarized as follows: (i) our comprehensive comparative analysis of linguistic features with TF-IDF and various ML algorithms confirms the positive influence of linguistic style predictors [52] on the prediction quality; (ii) our observation that simple algorithms work well if using appropriate linguistic features contributes to the general discussion on the advantages and disadvantages of various text classification algorithms [13], [14]; (iii) we showed that ML-based IT ticket classification outperforms our rule-based approach.
As a part of future work, one can: (i) consider further information regarding IT ticket complexity prediction, such as the number of tasks and configuration items per ticket; (ii) test other application cases in the IT ticket area and beyond, i.e., further explore the potential of linguistic features; (iii) as we showed that selecting an appropriate subset of linguistic features can considerably improve the performance of classifiers, one may conduct further experiments with more advanced feature selection techniques [98].

List of attached Appendices:
Appendix I. Taxonomy of Decision-Making Logic Levels
Appendix II. Business Sentiment Lexicon with assigned valences

APPENDIX I. Taxonomy of decision-making logic levels.
Following [24], we consider diverse semantic concepts: Resources, Techniques, Capacities, and Choices, the elements of the RTCC framework. We designed contextual variables [99], based on which experts categorized words into one of the three DML levels and one of the four semantic concepts.

KRISZTIAN BUZA received the diploma degree in computer science from the Budapest University of Technology and Economics, in 2007, and the Ph.D. degree from the University of Hildesheim, in 2011. He is the coauthor of more than 50 publications, including the best paper of the IEEE Conference on Computational Science and Engineering in 2010. He regularly serves as a reviewer for renowned journals, such as Neurocomputing, Knowledge-Based Systems, and Knowledge and Information Systems. He is also a member of the program committees of international conferences, such as the Pacific-Asia Conference on Knowledge Discovery and Data Mining and the International Conference on Computational Collective Intelligence. His research interests include applied machine learning and data mining.
VERA G. MEISTER received the Ph.D. degree in mathematics from St. Petersburg State University, in 1988. She worked as a researcher, a lecturer, and a project manager in vocational training for more than 20 years. She joined the Brandenburg University of Applied Sciences in 2012 as a Full Professor of Business Information Systems in the Department of Economics. Her research interests include knowledge engineering and information retrieval.