System Design for a Data-Driven and Explainable Customer Sentiment Monitor Using IoT and Enterprise Data

The most important goal of customer service is to keep the customer satisfied. However, service resources are always limited and must be prioritized for specific customers. Therefore, it is essential to identify customers who might become unsatisfied and lead to escalations. Data science on IoT data (especially log data) for machine health monitoring and analytics on enterprise data for customer relationship management (CRM) have mainly been researched and applied independently. This paper presents a data-driven decision support system framework that combines IoT and enterprise data to model customer sentiment and predict escalations. The proposed framework includes a fully automated and interpretable machine learning pipeline using state-of-the-art methods. The framework is applied in a real-world case study with a major medical device manufacturer providing data from a fleet of thousands of high-end medical devices. Based on the presented case study, which has interesting and challenging properties, an anonymized version of this industrial benchmark is released for the research community. In our extensive experiments, we achieve a Recall@50 of 50.0% for the task of predicting customer escalations. In addition, we show that combining IoT and enterprise data can improve prediction results and ease troubleshooting. Finally, we propose a practical workflow for end users when applying the proposed framework.


Introduction
Companies are interested in monitoring the performance of their installed systems. The success of a system depends on the health status of the machine (e.g. derived from IoT data like event logs) and customer perception (e.g. derived from ticket data). However, these two perspectives are mostly separated in the literature. The machine health perspective is often considered in disciplines like predictive maintenance or, more generally, prognostic health management [1].
Sipos et al. [2], for example, used a data-driven approach based on multiple-instance learning from event log data for predictive maintenance of high-end medical devices. Additionally, event log data are analyzed for intrusion detection [3,4] or failure detection in data and computing centers [5,6]. On the other side, the customer perspective is emphasized in the framework of customer relationship management (CRM), a broad discipline including strategies and processes for organizations to handle customer interactions and to keep track of all customer-related information [7]. Customer escalations are mostly predicted based on ticket data only [8,9]. In manufacturing companies (e.g. for medical devices), the available data typically falls into two distinct groups. The first is the IoT data/machine logs generated on the system. The second group contains complementary enterprise systems, including ticketing systems for service activities, spare part consumption, and reported system malfunctions. Keeping customers satisfied with the operation of their systems is crucial for the success of a medical device manufacturer. Therefore, it is important to identify unsatisfied customers who might lead to escalations. Hence, a framework making use of both data sources in order to combine these two perspectives is desirable. Such a system should combine existing IoT (log) data and enterprise data.
It could serve as a decision support system for the end user to encourage data-driven and therefore more objective decision-making. In this paper, we present a fully automated end-to-end machine learning framework which combines both data sources to model customer sentiment. We show that customer sentiment can be better estimated when looking at the system performance based on both the machine log data (e.g. to detect system malfunctions affecting the customer) and enterprise data (e.g. ticket data from customer interactions). We use historical data of escalations as labels for our predictive models to continuously learn a probability for an escalation as an estimate for the customer sentiment. The resulting decision support system helps to better prioritize customers and troubleshoot problems. The concrete problem formulation and the proposed solution, which combines log and enterprise data to increase predictive power and interpretability for the real-world case study, serve as our main contributions.
The remainder of the paper is structured as follows: Section 2 describes the problem to be solved as well as the data sources. Section 3 describes the overall methodology. Section 4 presents the experimental results. Then, Section 5 discusses the results and presents the proposed workflow. Section 6 concludes our paper and discusses future research directions.

Business Problem
Customer satisfaction, and hence service resource prioritization, is a key priority in many organizations. Here, we analyze data from a large, worldwide installed fleet of high-end medical devices. Customers, as well as local service entities, naturally differ in the way they communicate and document problems. This inevitably leads to situations where customers facing similar problems address the service provider in vastly different ways. Hence, objectively prioritizing customers and service resources is a hard problem. Combining relevant information from machine log and enterprise data can help to better understand problems in the field and how they affect the customer sentiment. Therefore, we design a data-driven decision support system to help prioritize customers based on an estimated sentiment. This can help to minimize unexpected escalations as a product of more proactive customer support. The case study at hand was conducted with a major medical device company for a fleet of thousands of high-end systems used by customers worldwide. Major challenges are the amount, heterogeneity, and complexity of the different data sources.

Data Description
In order to solve the business problem at hand, we make use of two major data sources, which we describe here in more detail.

Log Data
Log data is a time-based protocol of events recorded by different components of a medical system. An event consists of a timestamp (indicating when the event occurred), an event source (specifying which system component generated the event), an event id (representing a category of similar event types by the given event source), an event severity (typically: information, warning, and error), and a message text (describing the event and giving more details like sensor values). Events are defined and implemented by the developers of each particular system component. The severity and the amount of sensor data logged are decided by each individual developer.
Depending on which combination of event source, event id, and message text we define as unique, we get approximately 10^5 different events. There can theoretically be an unlimited number of distinct message texts depending on the usage. One system typically generates 10^4 to 10^6 of these events per day.
A typical system family having several thousand installed systems worldwide would then generate up to 100 GB of log data per day.
These log files are typically used by customer support centers to diagnose problems as well as by the original system developers to track whether their developed systems work as intended.
Major challenges for analyzing log data are the volume and complexity.We describe later how we automatically extract relevant information from incoming log files.
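As an illustration of the event structure described above, a single raw log line could be parsed into a structured record along the following lines. The concrete line format, separator, and field values are assumptions for illustration; the paper does not specify the raw log syntax.

```python
import re
from dataclasses import dataclass

# Hypothetical line format -- the actual syntax is system-specific:
# "<timestamp> | <source> | <event id> | <severity> | <message text>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s*\|\s*(?P<source>[^|]+?)\s*\|\s*(?P<event_id>\d+)"
    r"\s*\|\s*(?P<severity>\w+)\s*\|\s*(?P<message>.*)"
)

@dataclass
class LogEvent:
    timestamp: str
    source: str
    event_id: int
    severity: str   # typically: information, warning, error
    message: str    # free text, may embed sensor values

def parse_event(line: str) -> LogEvent:
    """Turn one raw log line into a structured event record."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return LogEvent(m["timestamp"], m["source"], int(m["event_id"]),
                    m["severity"], m["message"])

event = parse_event(
    "2021-03-01T08:15:02 | SourceX | 4711 | WARNING | Tube current out of range: 312 mA"
)
```

Structured records like this are the natural input for the downstream feature extraction, since the (event source, event id) pair identifies the event category while the message text carries the details.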

Enterprise Data Sources
Enterprise data sources are mostly collected by and stored in enterprise resource planning (ERP) systems. Types of enterprise data are:
• service activity data / ticket data – documenting all customer interactions and problems which occur
• spare part data – typically related to service activities; includes which spare parts have been used for maintenance/repair of a system
• customer base / contract data – listing all customers and the corresponding relationships, especially what kind of service level has been signed
Major challenges regarding the enterprise data are:
• getting a consistent picture for all customers and service activities worldwide, which is made more difficult by different local ERP systems
• manual data inputs contain errors due to typos and incorrect usage
• worldwide standards differ a lot, especially since there is no precise definition of what a "well running medical system" is; therefore, the interpretation of service data can differ from country to country
• regional ticket data is often written in the local language
Globally operating companies can have several levels of customer service centers, ranging from regional to global, and all of them generate ticket data.
In our case, we consider three different ticket levels, from regional to global. Furthermore, we analyze tickets generated by an information system tracking escalations from customer service to the R&D department.

Requirements
There are special requirements to be met for a successful deployment of a decision support system in a real-world scenario as in the presented case study.
We describe these requirements in this section and adapt all design decisions accordingly. During the whole development life cycle, from proof-of-concept up to deployment, we worked closely together with domain experts and stakeholders from all relevant departments, including potential end users of the implemented decision support system. Thus, we can ensure that we meet all requirements and build a framework which has a real impact for the end users.
• Dynamics and Efficiency: Currently, decisions about escalations are made on a weekly basis. Hence, our framework must process data and provide predictions on a weekly basis as well. The overall framework should be capable of efficiently loading new data, extracting features, training a model, and performing predictions every week. This should be done within a time frame of a few hours, e.g. on Monday mornings.
• Model Performance and Output: The escalation flags (highest escalation level) used as labels in this case study are very sparse and noisy. This poses special challenges for the prediction task. A binary output is not desired, but rather a probability of escalation which models the customer sentiment. Customers with the largest probabilities are then analyzed in more detail by the end users. Therefore, the main goal is not to design a machine learning model which perfectly predicts escalations, but rather a system that helps to identify, based on the designed features, which customers might need special attention.
• Interpretability: This is especially important for real-world applications such as the present case study, since the end user wants to understand the reasoning behind the output of the prediction model, not only to take appropriate actions but also to build trust. We extract specific features from the log and enterprise data. These features were designed together with domain experts and end users to incorporate prior knowledge and interpretable features into the decision support system.
• Usability: An interactive application was developed based on a commercial Business Intelligence tool which is currently in use by the medical device provider. Usability also includes considerations of which data sources need to be provided based on the end user's requests. The ability to interact with the provided decision support system enables continuous feedback for validation. Based on the explanations of the model and features, the end user can decide whether the predictions are valid and thereby provide more and cleaner labels for future prediction cycles. We will later describe an envisioned workflow for the designed decision support system.

Methodology
In this section, we describe the designed framework to solve the business problem at hand. A high-level overview of the implemented framework is depicted in Fig. 1. Hundreds of gigabytes of incoming log and enterprise data from all over the world are automatically analyzed via a log evaluation framework to calculate relevant features designed together with domain experts. We design an automated and interpretable machine learning pipeline to calculate a probability of escalation, which models the customer sentiment. The provided decision support system includes an explanation for the calculated probability, as well as historical feature data based on the extracted log and enterprise data for the end user. There are major benefits when combining both data sources from the user perspective. Depending on which features explain a prediction, it is possible to identify problems as either being more related to R&D (log features) or customer service (enterprise features). Features are extracted based on domain knowledge in order to train a predictive model. The resulting data and predictions are integrated into a decision support system for the end user.

Build Dataset
Algorithm 1 describes how we built the dataset for the experimental setup.
In the following, we will describe the process in more detail.

Labeling
Let I be the set of all customers. Fig. 2 depicts the labeling approach for one example customer i ∈ I using a sliding window approach. The step size is set to 1 week. We set the window size to 10 steps, as proposed by domain experts. Different values were also evaluated but did not yield an improvement.
From this window, a feature vector x_{i,t_pred} is extracted, where t_pred is defined as the last week in a window. Let T_{i,esc} be the set of all escalation flags for customer i and t_{i,esc} a specific time point of an escalation flag for customer i. The predictive interval is set to 2 steps. If there has been an escalation t_{i,esc} ∈ T_{i,esc} in the predictive interval, the label y_{i,t_pred} for the sample is set to 1, and to 0 otherwise (lines 9-12). After an escalation t_{i,esc}, the following 4 steps are defined as an infected interval. All samples whose sliding window contains weeks from the infected interval are excluded (lines 14-15). This was defined together with domain experts: we assume that there is already a special focus on customers for which a recent escalation occurred. As described in lines 2-3, we repeat this procedure for all customers for a fixed time frame of 104 steps, which in our case is equivalent to 2 years. The dataset D_{t_pred} = (x_{t_pred}, y_{t_pred}) contains samples for all customers with complete data (line 4). This results in the distributions depicted in Fig. 3. Note that the number of customers |y_{t_pred}| increases while the number of escalations ‖y_{t_pred}‖₁ remains almost constant over time. This is due to limited service resources, which yields an approximately constant number of customers put into focus each week. We provide an anonymized version of D for the research community as an industrial benchmark [10].
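The sliding-window labeling described above (window of 10 weeks, predictive interval of 2 weeks, infected interval of 4 weeks after an escalation) can be sketched for a single customer as follows. Variable names and the exact exclusion rule are our simplified reading of the procedure, not the authors' code.

```python
WINDOW = 10        # look-behind window size (weeks)
PRED_INTERVAL = 2  # predictive interval (weeks)
INFECTED = 4       # infected interval after an escalation (weeks)

def label_samples(escalation_weeks, horizon=104):
    """Return {t_pred: label} for one customer over `horizon` weeks.

    escalation_weeks: set of week indices carrying an escalation flag.
    Samples whose look-behind window touches an infected interval are dropped.
    """
    infected = {w for t in escalation_weeks
                for w in range(t + 1, t + INFECTED + 1)}
    samples = {}
    for t_pred in range(WINDOW - 1, horizon):
        window = range(t_pred - WINDOW + 1, t_pred + 1)
        if any(w in infected for w in window):
            continue  # window overlaps an infected interval -> exclude sample
        pred = range(t_pred + 1, t_pred + PRED_INTERVAL + 1)
        samples[t_pred] = int(any(w in escalation_weeks for w in pred))
    return samples

labels = label_samples({20})  # one escalation flag in week 20
```

With a single escalation in week 20, the two windows ending just before it (weeks 18 and 19) become positive samples, the weeks whose window overlaps the infected interval are dropped, and everything else is labeled negative.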

Feature Extraction
Weekends and public holidays can introduce noise into calculated features.
Therefore, together with domain experts, we decided to aggregate features on a weekly basis. For all customers i ∈ I and prediction weeks t_pred, we calculate features over a window of 10 weeks (lines 2-8).
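A minimal sketch of such a weekly aggregation, here simply counting raw events per ISO calendar week (the actual features in the paper are richer than plain counts):

```python
from collections import Counter
from datetime import date

def weekly_counts(event_dates):
    """Aggregate raw event dates into per-(ISO year, ISO week) counts.

    Aggregating to whole weeks smooths out weekend and public-holiday
    effects that would otherwise introduce noise into daily features.
    """
    counts = Counter()
    for d in event_dates:
        iso = d.isocalendar()              # (year, week, weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)

# three events: two in ISO week 9 of 2021, one in week 10
counts = weekly_counts([date(2021, 3, 1), date(2021, 3, 3), date(2021, 3, 8)])
```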

Log Data
Due to the high volume and complexity of the existing data sources, feature extraction is required; the machine log data as described in 2.2.1 is not feasible to analyze in its raw format. We use a log evaluation framework to detect the occurrences of specific sequences of events, determined by domain experts.
The extracted features have clear meanings and are related to specific system malfunctions which can affect customers in their daily work routines. Such features include, for example: abort of operation, system delay, user interface (UI) freeze, and UI pop-ups. We also extract whether a software (SW) update was performed on a system.
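Detecting such an expert-defined event sequence in a log stream could look like the sketch below; the event ids and the "UI freeze" signature are hypothetical, standing in for the signatures the paper's log evaluation framework uses.

```python
def count_sequence(event_ids, pattern):
    """Count non-overlapping occurrences of a contiguous event-id pattern,
    e.g. a domain-expert-defined signature for a specific malfunction."""
    count, i, n = 0, 0, len(pattern)
    while i + n <= len(event_ids):
        if event_ids[i:i + n] == pattern:
            count += 1
            i += n  # non-overlapping: skip past the matched occurrence
        else:
            i += 1
    return count

# hypothetical "UI freeze" signature: event 101 followed by 102 then 507
stream = [7, 101, 102, 507, 9, 101, 102, 507]
n_ui_freeze = count_sequence(stream, [101, 102, 507])
```

Counts like `n_ui_freeze`, aggregated per week and per system, are exactly the kind of interpretable log feature described above.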

Enterprise Data
The available enterprise data can be split into two connected groups – sales data and customer service tickets – as described in Section 2. The corresponding ticket features are aggregated on a global level.

Figure 3: Number of customers (|y_{t_pred}|) and positive samples (‖y_{t_pred}‖₁) for t_pred = t_0 + {0, . . ., 104}.

Modeling
In order to meet the requirements described in Section 2.3, we selected a specific approach to model the customer sentiment. Major challenges are the large class imbalance (Fig. 3) and the significant amount of label noise resulting from the manual decision for an escalation. The output of a machine learning model can be helpful in several ways. First, we can identify which specific problems, depicted in the designed features, lead to escalations. Additionally, we can identify customers which have similar problems and might need special attention. We compare different machine learning methods: ensembles of decision trees and deep neural networks (DNNs). For both methods, we calculate post-hoc explanations for each prediction using either a tree explainer [11] or a modification of DeepLift [12]. Both algorithms are implemented in the SHAP library [12], and we refer to the explanatory outputs as SHAP values.

Ensemble of Decision Trees
Ensembles of decision trees have the following benefits:
• The computed feature importance [13,14] helps end users understand which of the designed features "correlate" with escalations/customer sentiment.
• Ensemble methods provide a probability as a model output which can be interpreted as the customer sentiment (probability for escalation).
• Since each combination of time point (week) in a window and designed feature is modeled as a single input variable, we can provide the relevance of each input variable for all predictions to the end user for better troubleshooting.
The decision tree ensemble methods we select are Random Forest (RF) [15] and XGBoost (XGB) [16]. In the original paper [15], and also in our case, decision trees are used as weak classifiers. The decision trees for RF are generated independently and in parallel via a bagging (bootstrap aggregation) approach. This means that each decision tree is generated in two steps:
1. Bootstrapping: Independently sampling the input data set D_train = (x_j, y_j) with j ∈ {1, . . ., m} for each C_i on data points and features. The data points from D are sampled iid (independent and identically distributed) into a subset D_train_i = (x_ij, y_ij) with j ∈ J_i, where J_i ⊂ {1, . . ., m}. Also, the feature space is sampled iid, such that if x_j contains the features F = {f_k : k ∈ {1, . . ., n}}, then x_ij contains the features from F_i ⊂ F.
2. Aggregating: Averaging or, in our case, deciding by a majority vote which class is chosen.
Gradient Boosting [17] also combines many weak classifiers into a strong classifier, but the idea of how to combine those weak learners differs. In contrast to bagging, the decision trees are not built in parallel but sequentially, while results are combined along the way. In our case, we chose XGBoost [16], an improved variant of the Gradient Boosting algorithm using a more regularized formalization of the model, leading to a reduction of over-fitting. In both cases, the output is the predicted class (here: 0 or 1) as well as the probability the model assigns to each prediction. We use the probability the model assigns to class 1 as the predicted customer sentiment ŷ_{t_pred} for each input x_{t_pred}. We address the imbalanced class problem by applying either random oversampling of the minority class, SMOTE [18], or random undersampling of the majority class [19] using the imblearn library [20]. We treat the sampling strategy as a hyperparameter in our model selection approach, which we will describe later. We tested 8 different model configurations, as summarized in Table 1. We applied two different data fusion approaches. For "early" fusion (M1, M2), we simply stack enterprise (x_ent) and log (x_log) features to train a single classifier (RF or XGB). In "late" fusion (M3, M4), we train one base classifier on x_ent and one on x_log. The output of each base classifier is then fed into a subsequent logistic regression layer for the final prediction. Both base classifiers are either RF or XGB. We additionally trained classifiers only on x_ent (M5, M6) or only on x_log (M7, M8).
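The late-fusion scheme (M3, M4) can be sketched with scikit-learn as follows; the data is synthetic, and, unlike the full pipeline, no resampling or hyperparameter tuning is applied here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy stand-ins for the enterprise and log feature blocks
X_ent = rng.normal(size=(200, 5))
X_log = rng.normal(size=(200, 4))
y = (X_ent[:, 0] + X_log[:, 0] > 0).astype(int)

# late fusion: one base classifier per data source ...
clf_ent = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_ent, y)
clf_log = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_log, y)

# ... whose class-1 probabilities feed a logistic-regression fusion layer
Z = np.column_stack([clf_ent.predict_proba(X_ent)[:, 1],
                     clf_log.predict_proba(X_log)[:, 1]])
fusion = LogisticRegression().fit(Z, y)
sentiment = fusion.predict_proba(Z)[:, 1]  # estimated customer sentiment
```

Note that in a real setup, the base-classifier probabilities fed to the fusion layer should come from held-out (e.g. cross-validated) predictions rather than the training data itself, to avoid leakage into the logistic-regression layer.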
{W_{f,i,g,o}, b_{f,i,g,o}} are trainable parameters, σ is the sigmoid activation function, ∘ denotes the Hadamard product (element-wise product), and h^(t) and c^(t) are the hidden state and cell memory of an LSTM cell. An LSTM cell uses four gates to manage its states over time in order to avoid the problem of exploding/vanishing gradients for longer sequences [22]: the forget gate f_g determines how much of the previous memory is kept, the input gate i_g controls the amount of new information (c̃^(t)) stored into memory, and the output gate o_g determines how much information is read out of the memory. The hidden state h^(t) is commonly forwarded to a successive layer. In our experiments, we treat the number of LSTM layers (∈ {1, 2}) as a hyperparameter, as well as whether the LSTM is bidirectional [23] or not. The final output vector of the last LSTM cell, h^(10), is forwarded to a fully connected layer using dropout [24] for regularization and softmax activation for prediction. Fig. 4 depicts the implemented DNN architecture. We address the imbalanced class problem by applying either random oversampling of the minority class or random undersampling of the majority class [19]. We used the PyTorch framework to implement our DNN architecture.
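A minimal PyTorch sketch of the described architecture; the hidden size and dropout rate are placeholder values, since the paper tunes layer count and bidirectionality as hyperparameters.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """(Bi)LSTM over the 10-week feature window, dropout, and a softmax
    output over the two classes {no escalation, escalation}."""
    def __init__(self, n_features, hidden=32, layers=1, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(out_dim, 2))

    def forward(self, x):               # x: (batch, 10 weeks, n_features)
        h, _ = self.lstm(x)
        logits = self.head(h[:, -1])    # use the last time step, h^(10)
        return torch.softmax(logits, dim=-1)

model = SentimentLSTM(n_features=8)
probs = model(torch.randn(4, 10, 8))    # column 1 ~ estimated sentiment
```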

Training and Validation
Our experiments are designed to simulate the real-world performance of our decision support system. Algorithm 2 (lines 1-9) and Fig. 5 depict the experimental setup.

Evaluation Metrics
For a given set of estimated customer sentiment values ŷ_{t_pred} and ground truth escalation labels y_{t_pred}, we calculate the Recall@N over a whole year (t_pred = t_0 + {53, . . ., 104}) as an evaluation metric. ‖y_{t_pred}‖₁ is the number of positive samples and |y_{t_pred}| the total number of samples at t_pred, respectively. We define

Recall@N = 100 · ( Σ_{t_pred} True(N(ŷ_{t_pred})) ) / ( Σ_{t_pred} ‖y_{t_pred}‖₁ ),

where N(·) denotes the N largest elements and True(·) denotes the number of samples which have a positive ground truth value. Furthermore, we define

avg(Recall@N) = (1/100) · Σ_{N=1}^{100} Recall@N

as the average Recall@N, in order to compare different models over a relevant range of values N.
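The Recall@N definition translates directly into code; below is a small self-contained sketch over per-week score and label arrays.

```python
import numpy as np

def recall_at_n(scores_per_week, labels_per_week, n):
    """Recall@N over all weeks: the share of all true escalations that fall
    inside the weekly top-n sentiment estimates (in percent)."""
    hits = sum(int(labels[np.argsort(scores)[::-1][:n]].sum())
               for scores, labels in zip(scores_per_week, labels_per_week))
    total = sum(int(labels.sum()) for labels in labels_per_week)
    return 100.0 * hits / total

# one example week: 4 customers, 2 true escalations
scores = [np.array([0.9, 0.1, 0.8, 0.2])]
labels = [np.array([1, 0, 0, 1])]
r = recall_at_n(scores, labels, n=2)   # top-2 catches 1 of 2 escalations
```

The avg(Recall@N) then simply averages this quantity for N = 1 to 100.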

Model Selection and Evaluation
We perform a weekly analysis for one year (t_pred = t_0 + {53, . . ., 104}) to evaluate our approach. For each week, we use the past year (52 weeks) as training data. We apply the tree-structured Parzen estimator (TPE) [25] approach for hyperparameter tuning. TPE is a Bayesian optimization approach for hyperparameter tuning which can yield better results compared to grid and random search [25]. We use the TPE implementation in the Optuna [26] library for our pipeline.
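The paper uses TPE via Optuna; as a dependency-free illustration of the same search loop, the sketch below swaps in plain random search over a small hyperparameter space. The space and the toy objective are ours, standing in for "train on D_train*, score by avg(Recall@N) on D_val".

```python
import random

def tune(train_fn, space, n_trials=20, seed=0):
    """Stand-in for the paper's TPE search: sample hyperparameter
    combinations at random and keep the best validation score."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = train_fn(params)          # e.g. avg(Recall@N) on D_val
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

space = {"n_estimators": [50, 100, 200, 400], "max_depth": [3, 5, 8]}

def val_score(params):
    # toy objective peaking at n_estimators == 200
    return -(params["n_estimators"] - 200) ** 2

best_score, best_params = tune(val_score, space)
```

TPE improves on this loop by modeling which regions of the space produced good scores and sampling the next trial from those regions, rather than uniformly.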
In deployment (Algorithm 3), we provide information regarding the customer sentiment in the current week (t_pred + 1) based on ŷ_{t_pred}, since we only have the full data available up to and including t_pred. The source code for our experiments is available on GitHub.

Figure 5: Illustration of the proposed training and evaluation setup (Algorithm 2). We evaluate the decision support system on a weekly basis for one year (t_pred ∈ {t_0 + 53, . . ., t_0 + 104}). For each week, we use the data from the previous year (52 weeks) to train a model and to get a probability output for each customer (ŷ_{t_pred}). Finally, we calculate Recall@N to evaluate the performance over the whole year.
Early and late fusion: Comparing M1 vs. M3 and M2 vs. M4 shows that there is a slight overall benefit of late fusion compared to early fusion in terms of avg(Recall@N) (36.89 vs. 39.07 and 42.46 vs. 44.13).
Feature configuration: Using log features only (M7 and M8) yielded the worst results, with 13.79 and 16.36 in terms of avg(Recall@N), respectively. Using enterprise features only (M5 and M6) resulted in 35.99 and 34.11 in terms of avg(Recall@N), respectively. Fusing both feature sets yielded consistently better results for all configurations M1-M4 in terms of avg(Recall@N) (36.89-44.13).
Algorithm 1: Build Dataset
for all customers i ∈ I do
    for t_pred = t_0 : t_0 + 104 do    ▷ for every week for 2 years, starting from an arbitrary time point t_0
        if customer exists since t_pred − 10 then
            for t = t_pred − 10 + 1 : t_pred do    ▷ for every week in the look-behind window of size 10
                . . .
for all escalation flags t_i,esc ∈ T_i,esc do
    D ← D \ D_{i, t_i,esc : t_i,esc+10+4}    ▷ discard infected intervals after escalation flags from the dataset

Fig. 8 shows the feature importance of all resulting models for model configuration M2. One can see that enterprise features are generally more important than log features. Note the small scale, and that some log features do have a relatively large feature importance in some weeks. Furthermore, the applied machine learning models can potentially exploit non-linear relationships between the different features.

Table 2: Numerical results in terms of Recall@N and avg(Recall@N) for all model configurations listed in Table 1.

RF and XGB: XGB ensembles might be more prone to overfitting on the heavily imbalanced dataset with noisy labels. We assume a similar problem with overfitting when using deep learning models for this kind of data. This might explain the inferior performance of M9 despite its potential to better model the temporal structure in the data. Furthermore, XGB and LSTM based models are more than 10 times slower to train compared to RF and are harder to tune. We also noticed that the computation of SHAP values [12] for LSTM based models is significantly slower compared to XGB and RF. In conclusion, we recommend using model configuration M2. In practice, we estimate that one customer support employee can scan around 10 customers in depth per hour with our tool. Hence, if, for example, a team of 5 customer support employees invested one hour each week using the proposed decision support system with model M2, they could potentially prevent around 48.61% of the escalations in a year. Furthermore, the decision support system can help to learn which specific problems, depicted in the designed features, lead to escalations, as well as to identify customers which have similar problems and might need special attention. In the following, we explain in detail how the designed decision support system could be used in practice.

Proposed Workflow
The following section briefly outlines how the data-driven decision support system is integrated into a productive environment and how it could be used by customer support employees.
The envisioned workflow, which describes how this system is mainly used, can be grouped into five distinct steps, outlined in Fig. 9.
• Step 1: Producing new predictions. This step is fully automated, and all relevant processes are triggered at the beginning of each week. The first process within this step is to load the most recent raw data for all monitored customers from the respective data sources and to conduct the necessary preprocessing steps. Afterwards, all available samples up to this point for which labels can be defined are used to train a new model.
Once a new model has been trained, it is used to predict the customer sentiment of all monitored customers.Additionally, the SHAP values for each prediction and each feature are calculated.Predictions and SHAP values are then copied to a database and automatically loaded into an interactive dashboard which serves as a user interface.
• Step 2: Identifying high-risk customers. One element of the user interface is an interactive table showing the most recent predictions for all monitored customers, along with some additional information about each customer, e.g. location and operated system type. Within this step, the user identifies a system within his or her area of responsibility with a particularly high probability of an escalation within the following two weeks. Once a customer has been identified, it can be selected, which reduces the information shown on the user interface to just the relevant information about the customer in question.
• Step 3: Singling out the most relevant features with SHAP values. Knowing only which customers are at high risk of causing escalations, without knowing why, is of limited use. In order to explain why a specific customer has a high probability of escalation (customer sentiment), the SHAP values for each prediction and each feature are displayed in the user interface. With such a visualization, the user can easily single out one or a few features which, according to their respective SHAP values, have a large effect on the customer's sentiment. By selecting these specific features, the information shown on the interface is further reduced, now only displaying information connected to the selected customer and the selected feature or features.
• Step 4: Analyzing specific features. Once a set of few features has been selected, the user is provided with the actual values of these features and how these values have changed over the past weeks. With this information, the user can easily identify open, yet unresolved, tickets and immediately see for how long a specific ticket has been unresolved. Another example could be the accumulation of specific malfunctions reflected in log features.
• Step 5: In-depth analysis of a specific problem. At this point, the experienced user probably has a good idea of where a potential problem with the customer in question might be found (e.g. unresolved tickets, spare parts, software issues). For an in-depth analysis of ticket data or consumed spare parts, other tools are already available which are tailored for such tasks. Therefore, once the user has identified the potential root cause of a bad customer sentiment, he or she is provided with a direct link to these external tools in order to continue the analysis as efficiently as possible, with the goal to act before an escalation occurs.
The main idea of a productive use of a data-driven decision support system is to help customer support employees decide which customers to focus on and where to look.

Conclusion and Future Work
In this paper, we propose a general framework and an interactive workflow with a decision support system. Additionally, we provide a publicly available industrial benchmark dataset, including all code necessary to reproduce or improve our results. Our designed and implemented decision support system is currently deployed to monitor the customer sentiment of thousands of customers of high-end medical devices worldwide. The explainability of the system helps a variety of end users to identify problems in the field. We demonstrate that using both log and enterprise data-based features enables more effective troubleshooting compared to using either of these data sources alone. Furthermore, the gained insights can help to achieve better and more proactive customer relations, as well as to improve product management by focusing on the problems which affect the customer the most. There are some open challenges which could be addressed in future research using the provided benchmark dataset and evaluation framework. For example, more efficient methods for merging log and enterprise data information which preserve explainability can be investigated.
Another challenge is to design models that, within the implemented framework, increase its predictive power without trading off interpretability. Finally, alternative learning problem formulations, like anomaly detection, could be explored for the task. This could help with the heavy class imbalance present in the benchmark dataset.

Figure 1 :
Figure 1: High-level schema of the overall processing pipeline. Systems from all over the world send log data. Additionally, enterprise data (sales and ticket data) are collected.

Figure 2 :
Figure 2: Description of the labeling approach for one example customer i ∈ I. We assume that t_pred (the last week in the window) starts at some point t_0. (a) Sample with negative label (y_{i,t_pred} ← 0), since there is no escalation within the predictive interval of 2 weeks. (b), (c) Samples are labeled positive (y_{i,t_pred} ← 1), since there is an escalation within the predictive interval. (d) We exclude infected samples from the dataset. (e) First valid sample after the infected interval.

Random Forest (RF) [15] and XGBoost (XGB) [16] are ensemble learning techniques which can be used for both classification and regression. In our case, we are interested in classification. In general, a RF is a collection of weak classifiers {C_i}, where each classifier receives the same input x and outputs the most probable class C_i(x) = y_i ∈ S, with S being the set of all possible classes. The output of the Random Forest is then defined as RF(x) = arg max_{y∈S} |{i : C_i(x) = y}|, i.e. the class predicted by the majority of the C_i.

Figure 8 :
Figure 8: Feature importance for all trained models for configuration M2.

Figure 9 :
Figure 9: Schematic overview of the envisioned workflow.
We train models with different hyperparameter combinations suggested by TPE on D_train* and calculate the avg(Recall@N) on D_val. The LSTM model (M9) is trained for 150 epochs on D_train*, and training stops if the validation loss does not decrease for 15 epochs. A model checkpoint with the minimum validation loss is chosen to calculate the avg(Recall@N) on D_val. Finally, the LSTM model with the maximum avg(Recall@N) on D_val over all sampled hyperparameter combinations is chosen for final evaluation. For RF and XGB, we use the best set of hyperparameters to train a model on the complete training data D_train. The resulting model is used to calculate predictions on the current test data (D_test ← D_{t_pred}) in order to obtain the estimated customer sentiment ŷ_{t_pred}. The hyperparameters for the different classifiers are listed in Appendix A. For evaluation, we calculate Recall@N based on ŷ_{t_pred} and y_{t_pred} over the whole year (t_pred ∈ {t_0 + 53, . . ., t_0 + 104}). This measures the percentage of escalations we would have predicted in one year if we looked at the N largest estimated customer sentiments each week.