Predicting Lymphoma Development by Exploiting Genetic Variants and Clinical Findings in a Machine Learning-Based Methodology With Ensemble Classifiers in a Cohort of Sjögren's Syndrome Patients

Lymphoma development constitutes one of the most serious clinico-pathological manifestations of Sjögren's Syndrome (SS). Over the last decades, the risk for lymphomagenesis in SS patients has been studied with the aim of identifying novel biomarkers and risk factors that predict lymphoma development in this patient population. Objective: The current study explores whether the genetic susceptibility profiles of SS patients, along with known clinical, serological and histological risk factors, enhance the accuracy of predicting lymphoma development in this patient population. Methods: The potential predictive role of genetic variants together with clinical and laboratory risk factors was investigated through a Machine Learning-based (ML) framework which encapsulates ensemble classifiers. Results: Ensemble methods strengthen the classification accuracy with approaches that are sensitive to minor perturbations in the training phase. The evaluation of the proposed methodology, based on a 10-fold stratified cross-validation procedure, yielded considerable results in terms of balanced accuracy (GB: 0.7780 ± 0.1514, RF Gini: 0.7626 ± 0.1787, RF Entropy: 0.7590 ± 0.1837). Conclusions: The clinical, serological, histological and genetic findings available at early diagnosis have been exploited in an attempt to establish predictive tools for clinical practice and to further enhance our understanding of lymphoma development in SS.


B. Data preprocessing and curation
The data curation framework enhanced our assessment of the types of variables included in the raw dataset and of their quality in terms of missing values, duplicate measurements and outlier detection. Regarding missing-value handling, we excluded the clinical records that exhibited a percentage of missing values higher than 90%. Variables that exhibited a percentage of missing values higher than 80% were likewise not selected for further analysis. In terms of outlier and duplicate value detection, the respective features were not retained in the cleaned dataset. To complete any missing value remaining after the data preprocessing and curation step, an imputation transformer was developed in Python using the Scikit-learn library [30]. The "SimpleImputer" transformer was adopted with the "mean" strategy, replacing missing values by the mean of each continuous variable, and with the "most_frequent" strategy for each categorical variable. After applying the preprocessing procedure, we ended up with 207 patient records consisting of 23 clinical, laboratory, demographic and genetic variables. Table I and Table II present the four different categories of the preprocessed dataset exploited in the current study. In summary, the dataset of this study was curated by removing the duplicate features, the patient records with a high percentage of missing values and the features with detected outliers. The sample distribution of the categorical features within each class (class 0 = no lymphoma development; class 1 = lymphoma development) was described by the corresponding percentages, whereas the continuous variables were described by their arithmetic mean and standard deviation (SD) as well as their min and max values.
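The imputation step described above can be sketched as follows. This is a minimal illustration, assuming a tiny synthetic array in place of the curated dataset; the column indices marking which features are continuous versus categorical are purely illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Illustrative column split: which features are continuous vs. categorical
continuous_cols = [0, 1]
categorical_cols = [2]

# "mean" strategy for continuous variables, "most_frequent" for categorical ones
imputer = ColumnTransformer([
    ("cont", SimpleImputer(strategy="mean"), continuous_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# Toy data standing in for the curated patient records (NaN = missing)
X = np.array([
    [1.0,    10.0,   0.0],
    [np.nan, 12.0,   0.0],
    [3.0,    np.nan, np.nan],
])
X_imputed = imputer.fit_transform(X)
```

After fitting, the missing entry in the first column is replaced by the column mean (2.0) and the missing categorical entry by the most frequent value (0.0).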

C. Cost-sensitive Random Forest Feature Selection and Ranking
The RF classifier was applied to evaluate the importance of the features with respect to the classification problem. The "balanced" mode of the RF estimator was selected in the current study to automatically adjust the class weights inversely proportional to the class frequencies in the training set. In this way, the most important predictor variables contributing to accurate and unbiased predictions of the response variable were identified. The maximum number of features to select, with the selection threshold disabled (i.e. threshold = −∞), was also specified with reference to the feature ranking results.
According to the background of binary classification and regression trees and RFs [31], the best split $s^*$ at each node $t$ maximizes the decrease

$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$

of some impurity measure $i(t)$ (e.g. the Gini index, the Shannon entropy or the variance of the response variable $Y$), where $t_L$ and $t_R$ denote the left and right child nodes of node $t$ following split $s$, and $p_L$, $p_R$ are the proportions of samples sent to each child. The RF algorithm, instead of searching for the best split over all variables at each node, selects a random subset of variables and subsequently determines the best split over these features only. To evaluate the importance of a variable $X_m$ when considering an aggregation of randomized trees for predicting the response $Y$, the weighted impurity decrease is accumulated over all nodes $t$ where $X_m$ is used and averaged over all $N_T$ trees in the forest [31]:

$\mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T:\, v(s_t) = X_m} p(t)\, \Delta i(s_t, t)$

where $p(t)$ is the proportion of samples reaching node $t$ and $v(s_t)$ is the variable used in split $s_t$.
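A minimal sketch of the cost-sensitive RF feature ranking described above, assuming a synthetic imbalanced dataset of the same shape as the curated one (207 records, 23 features); the class ratio and the choice of keeping the top 10 features are illustrative, not taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the curated dataset (207 records, 23 features)
X, y = make_classification(n_samples=207, n_features=23,
                           weights=[0.85, 0.15], random_state=0)

# Cost-sensitive RF: "balanced" mode re-weights classes inversely
# proportional to their frequencies in the training set
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0)

# Rank features by mean impurity decrease; with the threshold disabled
# (−inf), only max_features limits how many features are selected
selector = SelectFromModel(rf, threshold=-np.inf, max_features=10)
selector.fit(X, y)
top_idx = selector.get_support(indices=True)          # indices of kept features
importances = selector.estimator_.feature_importances_
```

Passing `threshold=-np.inf` mirrors the disabled threshold mentioned above, so the ranking alone (via `max_features`) decides which predictors survive.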

D. Model training and parameter tuning with ensemble classifiers
1) Gradient Boosting (GB) classification model
Boosting is a well-known ensemble method that manipulates the training samples in order to improve classification accuracy. The overall classification accuracy is obtained by aggregating the predictions of multiple base learners [32]. Boosting methods assign a weight to each training sample and, at the end of each boosting round, they may adaptively change these weights [33]. GB for classification approximates a simple parameterized function by sequentially fitting regression trees on the negative gradient of the specified loss function. Furthermore, GB incorporates randomness into the function estimation procedure to improve performance. On this basis, GB constitutes an appealing choice for solving classification problems and building predictive models from an input dataset.
According to (2), the parameters $\{a_m\}_1^M$ and the expansion coefficients $\{\beta_m\}_1^M$ are jointly fit to the training data in a forward stagewise manner. Based on an initial guess $F_0(x)$, for each $m = 1, 2, \ldots, M$ we have:

$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta h(x_i; a)\big)$  (4)

with the update $F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m)$. According to [26], GB solves equation (4) for any differentiable loss function $L(y, F(x))$ with a two-step procedure. In the first step, the function $h(x; a)$ is fit to the current "pseudo"-residuals by least squares:

$a_m = \arg\min_{a, \rho} \sum_{i=1}^{N} \big[\tilde{y}_{im} - \rho\, h(x_i; a)\big]^2$

where $\tilde{y}_{im} = -\big[\partial L(y_i, F(x_i)) / \partial F(x_i)\big]_{F = F_{m-1}}$ are the current "pseudo"-residuals. Then, given $h(x; a_m)$, the optimal value of the coefficient $\beta_m$ is determined based on:

$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta h(x_i; a_m)\big)$.  (6)

Gradient tree boosting follows this approach, considering that the base learner $h(x; a)$ is an $L$-terminal-node regression tree. In the current study, the ensemble model selection procedure reveals the efficiency of the gradient tree boosting classifier, which may well reduce the model's variance, noise and bias, thus minimizing the model's training error. The proposed ML-based methodology was implemented in Python by utilizing the Scikit-learn library and the imbalanced-learn toolbox for classification purposes [30,34]. The gradient tree boosting algorithm was nested into an EasyEnsemble [34] random under-sampling scheme to handle the class imbalance problem in our dataset. The EasyEnsemble approach exploits the samples of the majority class that would otherwise be ignored by under-sampling. We herein selected the combination of EasyEnsemble with the ensemble strategy of gradient boosting for imbalanced data classification. Given a set of minority class samples $P$, a set of majority class samples $N$ with $|P| < |N|$, the number $T$ of subsets to sample from $N$ and the number of iterations to train each gradient boosting ensemble, the EasyEnsemble method randomly samples subsets $N_i \subset N$ with $|N_i| = |P|$ for learning an ensemble of balanced gradient tree boosting classifiers trained on the different balanced bootstrap samples [35].
The output of EasyEnsemble is a single ensemble with gradient boosting classifiers ("ensemble of ensembles") which reduces the bias of model performance while improving generalization of the decision boundary.
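The EasyEnsemble-with-GB scheme above can be sketched by hand as follows. This is a simplified illustration of the idea (the study itself used the imbalanced-learn toolbox); the dataset, subset count and GB settings are illustrative assumptions, and predictions are combined here by averaging class probabilities rather than by the toolbox's internal aggregation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def easy_ensemble_gb(X, y, n_subsets=5, random_state=0):
    """EasyEnsemble-style scheme: train one GB classifier per balanced subset."""
    rng = np.random.RandomState(random_state)
    pos = np.flatnonzero(y == 1)   # minority class P
    neg = np.flatnonzero(y == 0)   # majority class N, with |P| < |N|
    models = []
    for _ in range(n_subsets):
        # Random subset N_i of the majority class with |N_i| = |P|
        sub = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sub])
        gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                        random_state=random_state)
        models.append(gb.fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    # "Ensemble of ensembles": average the minority-class probabilities
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)

# Synthetic imbalanced data standing in for the study cohort
X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=0)
models = easy_ensemble_gb(X, y)
y_pred = easy_ensemble_predict(models, X)
```

Because every subset is balanced against the full minority class, each GB ensemble sees a different slice of the majority class, which is how EasyEnsemble recovers information that plain under-sampling would discard.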

2) Random Forest (RF) classification models
RF is a class of ensemble methods that combines multiple decision trees built from random vectors over the original training data [32]. The predictions made by the multiple decision trees are combined via majority voting, aiming to improve the classification accuracy. RFs constitute a combination of tree learners (predictors), where each tree is generated according to the values of an independently sampled random vector over the training set. Internal estimates of the generalization error of the combined ensemble of trees allow the control and monitoring of error, strength and correlation. Towards this direction, out-of-bag methods have been used to improve the internal estimates of the classification accuracy [27]. Internal out-of-bag estimates also have applications to the understanding and measurement of variable importance.
For the $k$-th tree, a random vector $\Theta_k$ is generated, independent of the past random vectors $\Theta_1, \ldots, \Theta_{k-1}$ but with the same distribution. The training set and $\Theta_k$ are used to construct the tree, resulting in a classifier $h(x, \Theta_k)$, with $x$ being the input vector. According to the definition in [27], a random forest is a classifier consisting of a collection of tree-based classifiers $\{h(x, \Theta_k),\ k = 1, \ldots\}$, where the $\{\Theta_k\}$ are independent random vectors with the same distribution and each tree in the forest casts a vote for the most popular class at input $x$. According to [32], it has been proven that, when the number of trees is sufficiently large, the generalization error of RFs is bounded above by

$PE^* \le \bar{\rho}\,(1 - s^2)/s^2$

where $\bar{\rho}$ is the average correlation between the trees and $s$ is a quantity that expresses the "strength" of the tree classifiers. The strength of a classifier reflects its average performance, computed probabilistically in terms of the classifier's margin [32]:

$mg(X, Y) = P(\hat{y}_{\Theta} = Y) - \max_{Z \neq Y} P(\hat{y}_{\Theta} = Z)$

where $\hat{y}_{\Theta}$ is the predicted class of $X$ based on the random vector $\Theta$ that builds the classifier. The higher the margin, the more likely the classifier correctly predicts a new unseen record.
In the present study, the ensemble model selection procedure reveals that RF is also an appealing choice for ensemble-based classification, in order to learn from imbalanced datasets and achieve better decision boundaries for the predictive model [36]. The RF algorithm was nested into a random under-sampling scheme included in the imbalanced-learn toolbox for handling the imbalanced dataset [30,34]. Specifically, the balanced RF classifier was implemented, which randomly under-samples each bootstrap sample in order to balance it. Each tree of the forest is thus grown on a balanced bootstrap sample, yielding an ensemble with inner balancing samplers.
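A minimal sketch of the RF models with both splitting criteria. Note an assumption: instead of the imbalanced-learn balanced RF used in the study (which under-samples each bootstrap sample), this sketch uses scikit-learn's cost-sensitive `class_weight="balanced_subsample"` as a stand-in, which re-weights rather than re-samples; the dataset and class ratio are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# Synthetic imbalanced stand-in for the study dataset
X, y = make_classification(n_samples=207, n_features=23,
                           weights=[0.85, 0.15], random_state=0)

# Cost-sensitive stand-in for the balanced RF: classes are re-weighted
# within every bootstrap sample instead of being under-sampled
rf_gini = RandomForestClassifier(n_estimators=200, criterion="gini",
                                 min_samples_leaf=2,
                                 class_weight="balanced_subsample",
                                 random_state=0)
rf_entropy = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                    min_samples_leaf=2,
                                    class_weight="balanced_subsample",
                                    random_state=0)

rf_gini.fit(X, y)
rf_entropy.fit(X, y)
bacc = balanced_accuracy_score(y, rf_gini.predict(X))
```

Swapping `criterion` between "gini" and "entropy" is all that distinguishes the two RF variants evaluated in the study; everything else is shared.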

E. Performance evaluation and validation
To evaluate the classification performance of the proposed methodology, six measures were used for both the GB and RF models: balanced accuracy, which accounts for the class imbalance of the dataset, sensitivity, specificity, positive predictive value, negative predictive value and AUC. An external stratified 10-fold cross validation was applied, with the feature ranking and the boosting classification scheme repeated in each iteration, allowing for the reduction of the models' variance. The stratified 10-fold cross validator [30] returns stratified training folds by preserving the percentage of samples of each class within the dataset. We should note that an inner 5-fold cross validation was also applied in the proposed ML-based methodology to assess the exhaustive grid search over the specified parameter values for each classifier. Using grid search with a nested cross validation for parameter estimation ensures the optimization of the model's parameterization. Table III presents the evaluation performance of the GB and RF ensemble classifiers. For the RF classifier, both the Gini and the entropy criterion were applied in order to determine the best way to split the samples. The reported classification results of the proposed methodology are high, with balanced accuracies of 0.7626, 0.7590 and 0.7780 for the RF Gini, RF Entropy and GB estimators, respectively. The respective mean AUCs obtained by the classifiers are considerably high, implying accurate model prediction in terms of sensitivity and specificity, as presented in Table III (RF Gini classifier: 0.7988, RF Entropy classifier: 0.7995, GB classifier: 0.8054) and Fig. 1.
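The nested validation scheme (outer stratified 10-fold evaluation, inner 5-fold grid search on balanced accuracy) can be sketched as follows. The dataset and the parameter grid are illustrative assumptions, kept deliberately small; only the fold counts and the scoring criterion follow the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the study dataset
X, y = make_classification(n_samples=207, n_features=23,
                           weights=[0.85, 0.15], random_state=0)

# Inner 5-fold CV grid search over an illustrative parameter grid,
# optimized for balanced accuracy
inner = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "n_estimators": [50]},
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)

# Outer stratified 10-fold CV: each fold re-runs the inner grid search,
# so the reported scores are not biased by the tuning step
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, scoring="balanced_accuracy", cv=outer)
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: hyper-parameters are chosen only from the outer training folds, never from the held-out evaluation fold.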

II. RESULTS
We evaluated the models' performance in terms of certain metrics and a hyper-parameter optimization criterion (i.e. balanced accuracy). For each prediction model we specified the hyper-parameters to optimize the model's performance. Specifically, for the GB classifier the "learning_rate", "n_iter_no_change", "validation_fraction" and "n_estimators" were set to 0.1, 10, 0.20 and 200, respectively. For the RF models the selected hyper-parameters were "min_samples_leaf" and "n_estimators", which were set to 2 and 200, respectively. Default values were used for the remaining hyper-parameters. The hyper-parameters of the estimators were tuned with the grid search optimizer available in the sklearn Python library.

Table III. RF Gini, RF Entropy and GB classification results. An outer stratified 10-fold cross validation for evaluating the models' performance and an inner 5-fold cross-validated grid search for hyper-parameter optimization were performed to obtain the classification results. Certain evaluation metrics were computed regarding the models' performance. The results are presented as mean ± 2SD. For each input case, balanced accuracy was selected as the criterion within the grid search procedure. The best results obtained by input case 1 (clinical and genetic data), input case 2 (clinical data) and input case 3 (genetic data) are highlighted in bold for each estimator, in order to pinpoint the significant findings of the current study.
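The selected hyper-parameter values reported above translate directly into estimator constructors; this sketch simply instantiates them (all other parameters keep their scikit-learn defaults, matching the study's use of default values for the remaining hyper-parameters).

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# GB classifier with the hyper-parameter values selected by the grid search
gb = GradientBoostingClassifier(learning_rate=0.1, n_iter_no_change=10,
                                validation_fraction=0.20, n_estimators=200)

# RF models with the reported values, one per splitting criterion
rf_gini = RandomForestClassifier(min_samples_leaf=2, n_estimators=200,
                                 criterion="gini")
rf_entropy = RandomForestClassifier(min_samples_leaf=2, n_estimators=200,
                                    criterion="entropy")
```

Setting `n_iter_no_change=10` together with `validation_fraction=0.20` enables early stopping in the GB model: a fifth of the training data is held out, and boosting halts once the validation score stops improving for 10 consecutive iterations.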