Prediction of Cocaine Inpatient Treatment Success Using Machine Learning on High-Dimensional Heterogeneous Data

The high prevalence of drug addiction is a major health challenge that pressures healthcare systems to respond with cost-effective treatments. To improve the treatment success of drug-dependent patients, it is necessary to identify the main risk factors associated with dropping out of treatment. Previous research shows disparate results due to the wide variety of approaches employed, the different and/or poorly defined metrics used, and the different target populations under study. This article presents the design and selection of a predictive model to estimate the success of inpatient cocaine treatment based on a high-dimensional heterogeneous set of characteristics, with the aim of learning new associations between independent characteristics. We evaluated different feature selection techniques and machine learning algorithms to design the best predictive model in terms of accuracy, area under the receiver operating characteristic curve, recall, specificity, F1-score, and Matthews correlation coefficient. Random Forest was the top-performing model with a characteristic set consisting of 11 features selected with a wrapper evaluator and the Best First algorithm, achieving 82% accuracy, an area under the receiver operating characteristic curve of 0.81, a recall of 0.96, a specificity of 0.47, an F1-score of 0.89, and a Matthews correlation coefficient of 0.53. The predictive model's performance was enhanced by combining multiple dimensions with variables referring to previous treatments, mental exploration, cognitive functioning, personality, consumption habits, and pharmacological treatment. We have refined the use of machine learning techniques to predict drug addiction treatment success, which could represent a new step in treatment management, especially when included in clinical decision support systems.


CS WG: Characteristic set obtained by the wrapper evaluator with the Genetic Search algorithm.
CS IWSS: Characteristic set obtained by the IWSS evaluator.
CS CFBF: Characteristic set obtained by the Correlation-based Feature Selection with the Best First algorithm.
CS CFG: Characteristic set obtained by the Correlation-based Feature Selection with the Genetic Search algorithm.

I. INTRODUCTION
The high prevalence of cocaine addiction is a major health challenge that leads to pressure on healthcare systems to respond with cost-effective treatments. The latest European Drug Report by the European Monitoring Centre for Drugs and Drug Addiction [1] shows that, after cannabis, cocaine is the second most widely used illegal drug in Europe, being consumed by 3.9 million adults (15-64 years old) and 2.6 million young adults (15-34 years old) in 2018. Spain is the sixth largest cocaine consuming country in the European Union, with a prevalence of 2.8% among young adults [2]. It is estimated that 18 million Europeans have used cocaine in their lifetime. Cocaine was cited as the primary drug by about 73,000 patients entering specialized drug treatment in 2017 and by more than 33,000 patients in specialized drug treatment for the first time. Estimates of public spending on drugs in Europe range from 0.01-0.5% of gross domestic product. US data from 2018 [3] shows an estimated 5.5 million people aged 12 or older have used cocaine in the past. Among them, about 977,000 had a cocaine use disorder (0.4% of the population) and 19% of those people received treatment for illicit drug use at a specialized facility.
International standards for the treatment of drug use disorders indicate that essential treatment services should be available within different healthcare systems [4] which include outreach services, brief psychosocial interventions, diagnostic assessment, outpatient psychosocial treatment, pharmacological treatment, services for the management of drug-induced acute clinical conditions, inpatient services for the management of severe withdrawal, and long-term residential services. In Europe, most treatments for drug addiction are provided on an outpatient basis whereas only a small portion are delivered as an inpatient service, mainly in hospital-based residential centers (e.g. psychiatric hospitals) [1]. Treatment pathways followed by patients are often characterized by the use of a variety of different services, by relapses that lead them to begin treatment again, and by different lengths of stay.
A high dropout rate is one of the main limitations of treatment success. Some studies report that among outpatient treatments the dropout rate ranges from 23% to 50% [5], [6], whereas in residential treatments it ranges from 17% to 57% [7], [8]. Treatment completion is associated with higher levels of abstinence and fewer relapses [9], as well as lower crime rates [10] and better employment levels [11]-[13]. On the other hand, leaving treatment is associated with greater legal and financial difficulties [14] and entails a high cost to society [15] as well as to loved ones.
According to a systematic review published in 2013, the most consistent dropout risk factors identified were cognitive deficits, low adherence to treatment, personality disorders, and younger age [9]. Cognitive deficits are extensive among patients in addiction treatment and, in addition to an increased risk of dropout [16]-[18], have been associated with some personality disorders [19].
The generation of successful predictive models would lead to a better understanding of the behavior of patients with drug addiction problems and could help to identify the most important risk factors that impede treatment. In medical and psychological research, it is generally found that one size does not fit all [20], and in many situations interactions between independent heterogeneous variables can offer potential outcomes that help to better understand the factors that explain treatment dropout.
Machine learning (ML) methods are useful for learning the relationships that exist among data points, and they can be applied to clinical datasets to develop robust risk models [21]. The development of decision support systems based on predictive ML models has become a subject of enormous scientific interest across various fields of medicine, with systems that allow the identification of prescriptions with a high risk of medication error [22], the prediction of metastasis in gastric cancer [23], the detection of pediatric autism [24], the prediction of coronary artery disease [25], the treatment of large kidney stones [26], or the prediction of heart failure [27], among others.
Recent work has shown the success of ML methods in clinical psychology and psychiatry that have explicitly focused on learning statistical functions from multidimensional datasets to make generalizable predictions about individuals [28]. At the same time, applications of ML in computational psychiatry [29] and computational neurosciences [30] are emerging.
ML methods, particularly supervised learning, are increasingly used in addiction psychiatry for informing medical decisions and further investigations of the potential applications of ML in precision psychiatry and neuroscience are warranted [31]. In fact, there is already evidence in some fields of addiction, such as the case of alcohol-dependent patients, in which machine learning models seem to be more accurate than psychologists in predicting treatment outcomes in abstinence programs [32].
The aim of this research was to develop a predictive model using ML with high-dimensional heterogeneous data to estimate success in inpatient treatment for those with cocaine addiction, as a first step towards developing a decision support system to assist clinicians in charge of referring and guiding patients in their treatment pathway. The main contributions of the research work proposed in this article can be summarized as follows:
• To our knowledge, this is the first research study using ML techniques on a European population sample with cocaine use disorder.
• The patient sample size is the largest that has been used in this context focused on the specific case of patients with cocaine use disorder (253 subjects).
• To our knowledge, this is the first research study that has considered the characteristics of the previous treatments together with mental, cognitive, personality, and pharmacological variables in cocaine addiction. The large number of high-dimensional heterogeneous variables considered in this article allows for a more accurate representation of the context of the patients, as well as the possible synergies between these dimensions for prediction.
• In this work, in addition to evaluating the performance of ML algorithms, different feature selection techniques are also used and compared. The best results are contrasted with those shown in the literature.
• The predictive model developed in this research work is discussed in terms of the possible benefits of its use in clinical practice when included in a decision support system, improving the chances of reducing the treatment dropout rate, the number of future readmissions, and the consequent waiting lists. This would lead to a better use of health care resources and would have a positive economic and social impact.
The paper is structured as follows. Section II reviews the background and work related to the application of ML techniques in drug addictions. Section III describes the materials and methods used, begins with a description of the dataset and the characteristics, continues with the processing and filtering criteria applied to the data, the feature selection methods and the selected ML algorithms, and ends with the proposed validation methodology to evaluate the results. Section IV presents the results obtained with the ML algorithms and the different sets of characteristics. Section V includes the discussion based on the results obtained. Section VI presents the conclusions of the paper.

II. BACKGROUND AND RELATED WORK
The research on ML techniques applied to the field of addictions is still quite limited. According to a recent systematic review about the applications of ML in addiction studies, previous research works could be divided according to the type of addiction, differentiating between smoking cigarettes, alcohol, cocaine, opioids, multiple substance use, internet addiction, and game addiction. Focusing on the papers on drug use, this review includes 14 papers published between 2012 and 2018, most of them developed in North America and to a lesser extent in Europe and Asia, and with sample sizes ranging from 22 to 228,405 subjects [31].
The subtypes of ML methods used in these research works are classification, regression, ensemble, multiple comparison of algorithms, clustering, and direct reinforcement learning, with most being supervised learning methods. Among the algorithms used are Support Vector Machine (SVM), decision trees, Random Forest (RF), artificial neural networks, classification and regression trees, Naïve Bayes, discriminant analysis, logistic regression (LR), penalized regression, nearest neighbors, elastic net, K-means clustering, K-medoids clustering and Q-learning. Among the algorithms and methods of feature selection used are correlations, regression coefficients, analysis of variance, chi-square test, grid search, filters, wrappers, information gain and Pearson's chi-squared test. Model evaluation methods included k-fold cross-validation, the receiver operating characteristic (ROC) curve, chi-squared test, leave-one-out cross-validation, variance analysis and multiple comparisons with Bonferroni correction.
A wide range of characteristics, combined in different ways, are used in these studies, including substance problems, route of administration, frequency, age at first use, demographics, psychopathology, personality, risk, cognitive, peer pressure, motives, attitudes, impulsivity, psychiatric problems, clinical, executive function, functional magnetic resonance imaging, arousal level, tense level, restlessness level, brain regions, family history, life events and genetic heritability.
In the field of cocaine addiction, Deane et al. [7] used a binary logistic regression for predicting dropouts in the first 3 months of residential drug and alcohol treatment in Australia. The final sample consisted of 618 participants with different primary drugs of abuse. Predictor variables used were age, gender, primary drug of concern, criminal involvement, psychological distress, drug cravings, self-efficacy to abstain, spirituality, forgiveness of self and others, and life purpose. The overall model accuracy was 61.6%, with 76.5% accuracy in predicting dropouts and 42.3% in predicting non-dropouts.
Anh et al. [33] used the least absolute shrinkage and selection operator regression to classify US individuals with cocaine dependence by impulsivity variables, both self-reported and from neurocognitive tasks. Despite the small sample size (31 cocaine-dependent and 23 healthy subjects), the area under the receiver operating characteristic curve (AUC) obtained was 0.912 in the test set, with the cocaine-dependent subjects showing higher scores on motor and non-planning trait impulsivity, as well as poor response inhibition, discriminability and decision making.
Acion et al. [34] used a single dataset of 99,013 patients to compare different ML models for prediction of successful outpatient treatment focused on Hispanic Americans with different substance addictions. They considered 28 predictor variables (10 patient characteristics, 3 treatment characteristics, source of referral, summary of type of problematic substance, and mental health problems) and compared 5 ML algorithms (LR, penalized regression, RF, deep learning neural networks, and Super Learning). All the algorithms were evaluated using the AUC. The best results were provided by the Super Learning and RF algorithms, obtaining AUCs of 0.820 and 0.816, respectively.
The research on cocaine use by Mete et al. [35] is focused on medical imaging. It used a SVM algorithm to classify brain images of 93 cocaine-dependent participants and 69 healthy controls in the US population, obtaining F-measures of 0.88 and 0.89, sensitivities of 0.83 and 0.90, and specificities of 0.83 and 0.89 in 10-fold cross-validation and leave-one-out approaches, respectively. Sakoglu et al. [36] also used an SVM algorithm for both feature selection/reduction and classification, based on fMRI data obtained from 58 cocaine-dependent participants and 25 healthy subjects from the US population while performing a stop signal task. The aim was to determine whether dynamic functional connectivity features were more successful than static functional connectivity features in classification of cocaine-dependent patients and healthy controls. Based on dynamic functional connectivity, participants were successfully classified with 95% accuracy, whereas static functional connectivity yielded only 81%. Visual, sensorimotor, default mode, executive control networks, amygdala, and insula played the most significant role in classification.
Rish et al. [37] used fMRI data to investigate the effects of methylphenidate on the brain activity of US individuals with cocaine use disorders. The sample used included 18 subjects with cocaine use disorder and 16 control subjects. They used different classifiers, including naïve Bayes, nearest neighbour, linear discriminant analysis, LR, linear SVM, decision trees, and RF. They found that the classification error was 10-20% lower classifying subjects with cocaine use disorders under placebo than under methylphenidate, obtaining the best result with the linear discriminant analysis algorithm.
Yip et al. [38] used fMRI data to identify a brain-based predictor of cocaine abstinence by using connectome-based predictive modeling, a machine learning approach optimized for neuroimaging data, in order to identify networks that underlie specific behaviors. The sample used included 98 US subjects with cocaine use disorder. The algorithms were applied pre-treatment and post-treatment, and leave-one-out cross-validation was conducted. Abstinence was predicted during treatment with a significant correspondence between predicted and actual abstinence values (r=0.49, df=52). Connectivity strength did not change with treatment, and post-treatment assessment also significantly predicted abstinence during follow-up (r=0.34, df=39). Network strength in the independent sample predicted treatment response with 64% accuracy by itself and 71% accuracy when combined with baseline cocaine use.
Panlilio et al. [39] used unsupervised machine learning techniques on drug test results of patients with opioid and cocaine addiction problems, including hierarchical clustering of categorical results and K-means longitudinal clustering of quantitative results. The sample used included 426 US subjects. They identified four clusters of use, categorized into opioid use, cocaine use, dual use (opioid and cocaine), and partial/complete abstinence. Contingency management increased membership in clusters with lower levels of drug use and fewer symptoms of substance use disorder.
SVM algorithms are widely used in psychiatry [21] due to their origin in early multivariate pattern recognition [43] that aimed to automatically discover regularities in multivariate data to fulfil a goal [44]. The main SVM applications in psychiatry focus on neuroimaging analysis, either for brain disorders [40], schizophrenia [41], or imaging biomarkers for neurological and psychiatric disease [42].
Literature comparisons are difficult due to the diverse nature of the studies undertaken, mainly caused by the inclusion of different combinations of drug abuse with mental problems and other diseases, the number of substances consumed by the patients, the predictor variables used in the analysis, the heterogeneity of the samples and study designs, and the different metrics used to report the performance of the predictive models. This heterogeneity may explain the contradictions and non-significant results that constitute the main gaps in the research [9], and the need to conduct further research on the potential applications of machine learning methods in the field of addiction psychiatry [31].

III. MATERIALS AND METHODS
A. DATA EXTRACTION
The usual therapeutic process for drug addiction patients in Spain begins with outpatient treatment. If abstinence from substance use becomes difficult, the patient is referred to inpatient treatment where he or she will remain in the hospital for a period of time [2]. It is necessary that referring physicians complete a patient referral report to recommend enrolment in the inpatient center, which includes information on health, psychological, pharmacological, family environment, social, and occupational aspects.
In the region of Madrid, the referral reports are formatted as tables and free text fields. For the extraction and structuring of the information, a collaborative effort with the psychiatrists was necessary in which the variables of interest and the metrics used for each of them were defined. A total of 120 characteristics were identified as potential predictors of treatment success. The initial dataset was created manually by a group of 3 researchers who were responsible for reading the referral reports and converting them into a structured database.
The target variable selected for prediction was success in inpatient treatment, which was defined as the completion of inpatient treatment. Cases in which the patient dropped out of treatment prematurely, including voluntary dropout against professional advice or expulsion from the treatment center for failure to comply with rules of conduct, were considered unsuccessful treatment.

B. COHORT DESCRIPTION
The structured dataset included referral reports from 253 patients admitted to the cocaine addiction unit at the Clínica Nuestra Señora de la Paz (CNSP).
The distribution of men-women in the population sample was 80%-20%, respectively (see Table 1), which was lower for men than the average reported for European countries (85%). The proportion of patients using the inhaled administration route was similar to European figures (25%) while the nasal sniffed administration route (74.7%) was higher than the European figures (68%) [1]. Most patients had a previous history of polydrug use: 15.41%, 76.28% and 82.82% had consumed heroin, cannabis, or alcohol, respectively, at some point in their lives.

C. DATA DESCRIPTION
Data extracted from the referral reports was grouped into 8 dimensions:  [46]); and outpatient treatment center from which the patient was referred.
Data on treatment success was extracted from the electronic medical record.

D. DATA PROCESSING AND FILTERING CRITERIA
A reduction in the dimensionality of the characteristics was performed, eliminating those highly affected by missing values or providing little information due to their low variability. Variables were excluded if they had: >25% missing values; >90% of records in a single category; a number of categories >95% of the number of records; or a coefficient of variation below the minimum of 0.1. An additional step was to eliminate highly correlated variables using the absolute value of Pearson's correlation coefficient, taking ≥0.6 as the cut-off point for a high correlation. As a result, the dimensionality of the set of characteristics was reduced from 120 to 60.
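The screening rules above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure actually used: the function names are hypothetical, the coefficient-of-variation rule is assumed to exclude variables that fall below 0.1, and, since the text does not say which variable of a correlated pair was kept, the sketch simply keeps the first one encountered.

```python
from statistics import mean, pstdev

def should_exclude(values, is_categorical):
    """Return True if one variable (column) fails a screening rule.

    `values` holds one entry per record; None marks a missing value.
    """
    n = len(values)
    present = [v for v in values if v is not None]
    # Rule 1: more than 25% missing values.
    if (n - len(present)) / n > 0.25:
        return True
    if is_categorical:
        counts = {}
        for v in present:
            counts[v] = counts.get(v, 0) + 1
        # Rule 2: more than 90% of records in a single category.
        if max(counts.values()) / len(present) > 0.90:
            return True
        # Rule 3: number of categories above 95% of the number of records.
        if len(counts) / len(present) > 0.95:
            return True
    else:
        mu = mean(present)
        # Rule 4 (assumed direction): coefficient of variation below 0.1.
        if mu != 0 and pstdev(present) / abs(mu) < 0.1:
            return True
    return False

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.6):
    """Keep a column only if |r| < threshold against every column kept so far."""
    kept = {}
    for name, col in columns.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return list(kept)
```

The greedy order in `drop_correlated` matters: a different column order can retain a different member of each correlated pair.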
The selected characteristics and the treatment success rate are shown in Table 1. The characteristics marked with ** showed a statistically significant difference between both groups of patients (treatment success and dropouts) with p<0.01, and those marked with * showed a statistically significant difference with p<0.05.
For subsequent steps, outlier values in continuous fields were replaced by a cut-off value at a distance of 3 standard deviations from the mean and continuous fields were then normalized to a common scale with a mean value of 0.0 and a standard deviation value of 1.0. Also, the nominal fields were transformed into numerical values.
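As a rough sketch of the outlier handling and normalization just described, the function below clips a continuous field at 3 standard deviations from the mean and then rescales it to mean 0 and standard deviation 1. The function name is hypothetical, and the use of the population standard deviation and of statistics recomputed after clipping are assumptions; the text does not specify either.

```python
from statistics import mean, pstdev

def clip_and_standardize(values, n_sd=3.0):
    """Clip values beyond n_sd standard deviations, then z-score the result.

    Assumes a non-constant field (standard deviation greater than zero).
    """
    mu, sd = mean(values), pstdev(values)
    low, high = mu - n_sd * sd, mu + n_sd * sd
    # Replace outliers by the cut-off value on the corresponding side.
    clipped = [min(max(v, low), high) for v in values]
    # Standardize using the statistics of the clipped field (assumption).
    mu_c, sd_c = mean(clipped), pstdev(clipped)
    return [(v - mu_c) / sd_c for v in clipped]
```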

E. MACHINE LEARNING METHOD
The large number of available characteristics suggested applying feature selection techniques to obtain the optimal set. We evaluated 4 ML algorithms (RF, LR, multilayer perceptron (MLP) neural network, and SVM) that are widely used in predicting therapeutic outcomes in drug addiction and psychiatry, in combination with 3 different methods of feature selection (wrappers, filters, and hybrid methods). We used the Weka data mining tool [47], version 3.8.3, which provided all the algorithm implementations utilized here.
RF is a recursive partitioning method that consists of a large number of individual decision trees that operate as an ensemble, being able to evaluate a number of predictors even in the presence of complex interactions [48]. RF is more protective against overfitting than other tree algorithms, since the low correlation between the trees of the forest allows them to protect each other from their individual errors.
LR is a regression method similar to the linear regression model, but it is suited to models where the dependent variable is dichotomous. This model uses the maximum-likelihood ratio to determine the statistical significance of the variables [49]. LR imposes less stringent requirements than linear regression, in that it does not assume linearity of the relationship between the explanatory variables and the response variable and does not require Gaussian distributed independent variables [50].
An MLP neural network is a combination of perceptrons stacked in several layers to solve complex problems by generating non-linear classification rules [51]. The relationships between the perceptrons are defined by weights calculated using a given rule. Each layer can have a large number of perceptrons, and there can be multiple layers, so the MLP can quickly become a very complex system, as is the case with shallow or deep neural networks.
SVM is a supervised ML algorithm which can be used for classification or regression problems [52]. It basically finds the hyper-plane that best differentiates the classes within the variable dimensional space. It is effective in high-dimensional spaces, even if the number of dimensions is greater than the number of samples, and it is also memory efficient.
For RF, LR, and MLP we used the basic implementations available in Weka and in the case of SVM we used the sequential minimal optimization algorithm for training a support vector classifier [53].
Feature selection processes combined a searching algorithm and an evaluator that scored each characteristic or set of characteristics. We tested the 3 main approaches of evaluators: filters, wrappers, and hybrid methods.
1. Filter methods score each subset of features using heuristics based on general characteristics of the data rather than using a learning algorithm, as shown in Fig. 1. They are independent of the learning algorithm and computationally very fast, but they consider only the individual characteristics of the features to identify their relative importance. Among the filter models we chose Correlation-based Feature Selection (CFS), which evaluates the worth of a feature subset according to a correlation-based heuristic evaluation function. It works on the assumption that a characteristic is useful if it is class correlated or class predictive. The bias of the evaluation function is toward subsets that contain features that are highly correlated with the class and uncorrelated with each other. Redundant features should be screened out as they will be highly correlated with one or more of the remaining features. The calculation of the merit of a feature subset is based on a ratio, in which the numerator represents how predictive of the class a set of features is, and the denominator represents how much redundancy there is between the features. CFS's feature subset evaluation function is as follows:

$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$

where $M_S$ is the heuristic "merit" of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the mean feature-class correlation ($f \in S$), and $\overline{r_{ff}}$ is the average feature-feature intercorrelation [54].
2. In the wrapper approach, the feature subset selection algorithm is integrated as a wrapper around the learning algorithm, as shown in Fig. 2. The feature subset selection algorithm performs a search for the optimal subset using the learning algorithm itself as part of the feature subset evaluation. In this way the learning algorithm is considered a black box. The learning algorithm is run on different sets of features from the original dataset, usually partitioned into internal training and holdout sets. The feature subset with the highest evaluation is chosen as the final set on which to run the learning algorithm. The wrapper method uses a 5-fold cross-validation as an evaluation function, so that the score assigned to a subset of features is the accuracy obtained in the cross-validation test set [55]. Due to the large number of iterations necessary to evaluate the performance of each candidate set of characteristics, the wrapper method is computationally intensive, but in return it tends to provide the set of characteristics best suited to the learning algorithm.
3. Hybrid methods try to combine the best properties of filters and wrappers, where a filter method reduces the dimensionality of the feature space and a wrapper method finds the optimal subset afterwards. We chose the Incremental Wrapper Subset Selection (IWSS) hybrid method, which first creates a ranking of features based on the correlation-based metric and then runs IWSS over the whole ranking, giving each attribute in the data set the chance of being selected and selecting those attributes that improve performance in at least a given minimum number of the wrapper cross-validation folds. The main advantage of this approach is that it retains a great part of the wrapper advantages while reducing the computational cost of pure wrapper approaches. The main disadvantage of the IWSS algorithm is its greedy behavior, since the algorithm always tries the best ranked features first and, once a feature is included in the selected set, it is maintained therein until the end of the search [56]. Table 2 shows the pseudo-code of the IWSS method.
Weka default parameters were maintained in all the evaluators.
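To make the CFS heuristic concrete, the short sketch below computes the merit of a subset from its size, its mean feature-class correlation and its mean feature-feature intercorrelation, following the standard CFS formulation (merit = k·r_cf / sqrt(k + k(k−1)·r_ff)). The function and argument names are illustrative and are not taken from Weka.

```python
from math import sqrt

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Heuristic merit of a subset of k features under the CFS formulation.

    The numerator rewards subsets that are predictive of the class;
    the denominator penalizes redundancy among the features.
    """
    numerator = k * mean_feature_class_corr
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Same class correlation, increasing redundancy among the four features.
for r_ff in (0.0, 0.5, 1.0):
    print(r_ff, round(cfs_merit(4, 0.5, r_ff), 3))
```

With the class correlation held fixed, the merit falls as the features become more intercorrelated, which is the bias toward non-redundant subsets described in the text.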
Since not all searching algorithms perform equally with different datasets, we decided to explore algorithms belonging to different categories. There are three main categories available for searching algorithms: "Exponential search", "Sequential search" and "Random search" [57]. The main drawback of "Exponential search" is that it requires $2^N$ combinations for $N$ variables, which is too computationally intensive and time consuming, and for this reason we decided to focus on the other two categories.
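To give a sense of scale, the snippet below counts the subsets an exhaustive search would have to score, assuming it were run over the 60 characteristics retained after filtering. The evaluation rate is a purely hypothetical figure chosen for illustration.

```python
def count_subsets(n_features):
    """Number of distinct feature subsets, including the empty set."""
    return 2 ** n_features

n_subsets = count_subsets(60)  # 60 characteristics assumed, as after filtering

# Hypothetical throughput: one million subset evaluations per second.
evals_per_second = 1_000_000
years = n_subsets / evals_per_second / (365.25 * 24 * 3600)

print(f"{n_subsets:,} subsets, roughly {years:,.0f} years at the assumed rate")
```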
In "Sequential search", features are sequentially added to an initially empty set (forward strategy) or sequentially removed from an initially complete set (backward strategy) until the addition or removal of a characteristic does not result in a higher evaluation. Its main problem is the nesting effect: deleted features cannot be reconsidered in later iterations. From this category we tested the Best First (BF) algorithm with a forward searching strategy [47], which searches the space of feature subsets by greedy hill climbing augmented with a backtracking facility and is simple and fast. To prevent the BF search from exploring the entire feature subset search space, a stopping criterion is imposed, so that the search will end if a certain consecutive number of subsets show no improvement over the current best subset. Table 3 shows the pseudo-code of the BF algorithm.
"Random search" methods can be categorized as global search because they aim to generate an approximate solution efficiently, at a lower cost in time and computational resources, instead of the more precise solution obtained with sequential search. From this category we tested Genetic Search (GS). These algorithms are based on the biological process of evolution through natural selection. Each individual in the population is characterized by a set of binary variables called genes, and these are linked together in strings to form chromosomes, which can be understood as solutions. A fitness function assigns a fitness score to each individual, which represents the probability that the individual is selected for reproduction. Based on the fitness score, pairs of individuals are selected for reproduction, taking the role of parents. From the parents, offspring are generated by crossover and mutation, creating new individuals with combinations of the parents' genes. The algorithm ends when the population converges, that is, when the offspring produced are no longer significantly different from the previous generation. At this point the genetic algorithm has provided a set of solutions to the initial problem [58]. Table 4 shows the pseudo-code of the GS algorithm.
In the case of the BF searching algorithm, we increased the allowed number of consecutive non-improving nodes before terminating the search from 5 to 15. For the GS algorithm we increased the number of generations to evaluate from 20 to 40.
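The forward Best First strategy with its stopping criterion can be sketched as below. This is an illustrative reimplementation under simplifying assumptions, not Weka's code: `evaluate` is a hypothetical callable that scores a subset (for example a CFS merit or a wrapper's cross-validated accuracy), and `max_stale` plays the role of the allowed number of consecutive non-improving nodes.

```python
import heapq

def best_first_forward(features, evaluate, max_stale=15):
    """Forward best-first search over feature subsets with backtracking.

    evaluate: maps a frozenset of features to a score (higher is better).
    The search stops after `max_stale` consecutive expansions that fail
    to improve on the best subset found so far, or when nothing is left
    to expand.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    counter = 0  # tie-breaker so the heap never compares subsets directly
    open_list = [(-best_score, counter, start)]  # max-heap via negated score
    visited = {start}
    stale = 0
    while open_list and stale < max_stale:
        # Backtracking: always expand the best subset still on the open list.
        _, _, current = heapq.heappop(open_list)
        improved = False
        for f in features:
            if f in current:
                continue
            child = current | {f}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            counter += 1
            heapq.heappush(open_list, (-score, counter, child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return set(best_subset), best_score
```

Because the open list retains every evaluated subset, the search can return to an earlier, less greedy branch when the current one stops improving, which is what distinguishes it from plain hill climbing.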
The feature selection was made by means of a 10-fold cross-validation in all the techniques used. In the case of the wrapper method, a 5-fold cross-validation for feature selection was performed within each of the 10 folds [55]. For each feature selection method, we included in the final characteristic set those variables that were selected in at least 8 of the folds and we discarded all those not selected in any of the folds. Then we followed a sequential forward strategy adding the rest of the features one by one, selecting the most popular each time and stopping when accuracy decreased.
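One possible reading of this aggregation step is sketched below. The names are hypothetical: `fold_selections` holds the features chosen in each of the 10 folds, and `evaluate` stands in for whatever accuracy estimate was used to decide when to stop. Features never selected in any fold do not appear in the counts and are therefore discarded automatically.

```python
from collections import Counter

def build_final_set(fold_selections, evaluate, min_folds=8):
    """Combine per-fold feature selections into one final characteristic set.

    fold_selections: list of feature-name lists, one per cross-validation fold.
    evaluate: maps a list of features to an accuracy estimate.
    """
    counts = Counter(f for selected in fold_selections for f in set(selected))
    # Features selected in at least `min_folds` folds are always included.
    final = [f for f, c in counts.items() if c >= min_folds]
    # Remaining features, most frequently selected first (ties by name).
    candidates = sorted(
        (f for f, c in counts.items() if c < min_folds),
        key=lambda f: (-counts[f], f),
    )
    best_acc = evaluate(final)
    for f in candidates:
        acc = evaluate(final + [f])
        if acc < best_acc:
            break  # stop as soon as accuracy decreases
        final.append(f)
        best_acc = acc
    return final
```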
By combining the ML algorithms and the different feature selection techniques (evaluator and searching algorithm) we obtained 14 different sets of variables.

F. EVALUATION
A 10 times repeated 10-fold stratified cross-validation was used to evaluate the performance of each ML algorithm with the 14 final characteristic sets obtained during the feature selection process. These results were also compared with a Zero Rule classifier, which is the simplest classifier since it predicts the majority class in the existing data.
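The baseline comparison can be made concrete with a small sketch of a Zero Rule classifier evaluated under repeated stratified k-fold cross-validation. It uses only the Python standard library; the function names and the round-robin fold assignment are assumptions for illustration and do not reproduce the tooling used in the study.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence


def zero_rule_fit(y_train: Sequence[int]) -> int:
    """Return the majority class, which Zero Rule predicts for every case."""
    return Counter(y_train).most_common(1)[0][0]


def stratified_folds(y: Sequence[int], k: int, rng: random.Random) -> List[List[int]]:
    """Split indices into k folds that preserve the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def repeated_cv_zero_rule(y: Sequence[int], k: int = 10, repeats: int = 10, seed: int = 0) -> float:
    """Mean accuracy of Zero Rule over `repeats` runs of stratified k-fold CV."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        for test_indices in stratified_folds(y, k, rng):
            held_out = set(test_indices)
            train_labels = [lab for i, lab in enumerate(y) if i not in held_out]
            majority = zero_rule_fit(train_labels)
            hits = sum(1 for i in test_indices if y[i] == majority)
            accuracies.append(hits / len(test_indices))
    return sum(accuracies) / len(accuracies)
```

Because Zero Rule ignores all features, its accuracy equals the share of the majority class, which is why it yields the same values for every characteristic set.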
For evaluation measures we used accuracy (2), AUC, recall (3), specificity (5), F1-score (6), and Matthews correlation coefficient (MCC) (7), calculating for each the 95% confidence interval for the mean. In these mathematical equations we will refer to the number of true positives as TP, true negatives as TN, false positives as FP and false negatives as FN.
The AUC provides a better measure than accuracy, especially when measuring and comparing classification systems, as it compares the classifiers' performance across the entire range of class distributions and error costs [59].
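One way to see why the AUC does not depend on a single decision threshold is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it this way; the function name `auc_from_scores` and the encoding of the positive class as 1 are assumptions for the example.

```python
from typing import Sequence


def auc_from_scores(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))
```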
F1-score is the harmonic mean of precision and recall, where precision measures the ability of a classifier to predict positive samples as positive and recall measures how many actual positive observations are predicted correctly. This parameter tries to overcome the disadvantage of accuracy against unbalanced sets, where a large majority class can introduce bias. F1-score ranges in [0,1], where the minimum is reached when all the positive samples are misclassified, that is, TP = 0, and the maximum for FN = FP = 0, that is for perfect classification. MCC is an alternative measure unaffected by the unbalanced datasets issue, being a contingency matrix method of calculating the Pearson product-moment correlation coefficient between actual and predicted values [60]. It takes values in the range [−1,+1], where −1 and +1 are the cases of perfect misclassification and perfect classification respectively, and 0 when it has no classification capacity.
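The measures described above can all be derived from the four cells of the confusion matrix. The following sketch uses the standard definitions of these measures; the function name `classification_metrics` and the convention of returning 0 when a denominator vanishes are assumptions for illustration.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, specificity, F1-score, and MCC from TP, TN, FP, FN."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Harmonic mean of precision and recall, written directly in terms of counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention here, MCC is 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }
```

For instance, with TP = 90, TN = 40, FP = 10, and FN = 10, the F1-score is 0.9 while the MCC is 0.7, which illustrates how the MCC penalizes errors on the minority class more visibly than the F1-score.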
An essential point in the management of drug rehabilitation treatments is the need to improve their performance, which means an increase in the number of patients who successfully complete the treatment and achieve rehabilitation. It is essential to accurately identify those patients for whom the treatment would work better. For this reason, among the requirements was to achieve the best possible accuracy while reducing the rate of false negatives so as not to classify a patient as a dropout when he or she could benefit from treatment. The reduction of false negatives leads to an increase in the rate of true positives or recall. Therefore, where applicable and where it would be beneficial, we manipulated the cost matrix in order to reduce the rate of false negatives at the cost of also reducing accuracy.
Table 5 shows to which extent a feature was selected for any combination of ML algorithm and feature selection strategy. Each cell represents the number of folds (from 0 to 10) in which the feature was selected during the 10-fold cross-validation. The first column is the feature name and the additional ones show the possible combinations of the type of evaluator (wrapper, IWSS, and CFS), the searching algorithm (BF or GS), and the learning algorithm (RF, LR, MLP, or SVM). Selected features in each combination are shown in bold and underlined.

A. FEATURE SELECTION
The number of characteristics selected by each method ranges from 3 to 23. There are 10 characteristics selected in at least half of the methods, which are: judgment capacity, personality assessment, tolerance of frustration, cocaine administration route at entry, current cocaine administration route, antidepressants, antipsychotics, topiramate, previous treatments in day centers, and maximum length of stay in inpatient treatments. Of these characteristics, 'Personality assessment' was selected in all the final characteristic sets, and 'Current cocaine administration route' was selected in 13 out of the 14 final characteristic sets.
Table 6 shows the results achieved by each algorithm tested with 10 times repeated stratified 10-fold cross-validation. Results are compared with the Zero Rule classifier, which obtains the same values for all the data sets by considering only the target variable as a classification criterion. CS WBF, CS WG, and CS IWSS correspond to characteristic sets obtained by the wrapper evaluator with the BF (WBF), the GS (WG) search algorithms and the IWSS evaluator, respectively. CS CFBF and CS CFG correspond to characteristic sets obtained by the CFS evaluator with the BF (CFBF) and the GS (CFG) search algorithms, respectively.

B. LEARNING ALGORITHMS PERFORMANCE
The feature selection process and the consequent elimination of redundant features improved the results obtained with the ML algorithms: practically all of them achieved a statistically significant improvement of the AUC and specificity with any of the selected characteristic sets compared to the Zero Rule classifier, except for the AUC of the LR with the CS CFBF and CS CFG. In the case of accuracy, only the characteristic sets obtained with the wrapper-BF combination achieved a statistically significant improvement in all the methods (RF, LR, MLP, and SVM), together with the IWSS and wrapper-GS combinations when used with SVM.
In terms of accuracy, AUC, F1-score and MCC the best result was achieved by the RF algorithm with the CS WBF. In terms of recall the best combination was achieved by the SVM algorithm equally with the CS WG and CS IWSS. In terms of specificity the best result was achieved by the MLP algorithm with the CS WG.
Our top-performing model was the RF algorithm with the CS WBF. It was only 2% lower than the combination with the best recall measure but higher in AUC (20%), specificity (23%), F1-score (3%), MCC (17%), and accuracy (5.3%). Similarly, it was 5% lower than the combination with the best specificity measure but higher in AUC (3%), recall (10%), F1-score (5%), MCC (12%), and accuracy (5.6%). The F1-score achieved by the RF-CS WBF combination obtained a statistically significant improvement over the Zero Rule. Similarly, in the case of the MCC, this combination is in the range of a moderate and higher-level correlation between the predicted and real values of the target variable. Fig. 3 shows the best AUC graphs for the two classes of the target variable (treatment success or dropout).
For the best performing algorithm (RF-CS WBF), we manipulated the cost matrix in order to reduce the rate of false negatives at the cost of also decreasing the accuracy and the rest of the parameters. Table 7 shows the results obtained for different cost balances between false positives and false negatives. The cost balance 1.9/1.0 obtained a recall of 0.98, which would allow the RF algorithm to reduce the false negative rate by 2 percentage points (from 4% to 2%) with an overall accuracy loss of 5.43 percentage points. On the other hand, the specificity, the MCC, and the F1-score were also reduced, and the F1-score lost its statistical significance.
Fig. 4 shows the relative importance of the eleven CS WBF characteristics in the RF predictive model. This importance is calculated based on average impurity decrease and number of nodes using the feature, and ranges from 0.25 to 0.37.
The eleven characteristics from CS WBF are from 5 different dimensions: 3 from mental exploration and cognitive functioning (thought, judgment capacity, and orientation); 1 from personality assessment; 2 from substances of consumption (cocaine administration route at entry and current cocaine administration route); 1 from pharmacological treatment (disulfiram); and 4 from previous treatment pathway (previous outpatient treatments, previous treatments in day centers, previous inpatient treatments in CNSP, and maximum length of stay in inpatient treatments).

V. DISCUSSION
In our testing, the best predictive model combined RF and the CS WBF, which presented the highest accuracy (82.12%), F1-score (0.89), MCC (0.53), and AUC (0.81), as well as a high recall measure (0.96). In general, an AUC from 0.7 to 0.8 is considered acceptable, from 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding [61]. From this point of view, the RF algorithm with the CS WBF can be considered as excellent. The F1-score parameter presents a more optimistic view of the results than the MCC parameter, and this is because MCC gives equal importance to all the parameters of the confusion matrix, or in other words, it gives the same importance to both classes of the target variable. In the case of our top-performing model, probably due to the unbalanced classes, it has better performance in recognizing positive cases (patients who successfully complete treatment) than negative cases (patients who abandon treatment), just as it has a much lower rate of false negatives than false positives. This fits with the initial criterion, which was to give priority to reducing false negatives so that patients who might benefit from treatment would not be classified as dropouts.
The strategy of sacrificing accuracy in favor of recall allowed us to achieve the design criteria of minimizing false negatives, in other words, minimizing the number of patients erroneously classified as dropouts. Depending on the selected point of the RF ROC curve, the algorithm achieved an accuracy in the range of 76.69-82.12%, a recall measure in the range of 0.96-0.98, a specificity in the range of 0.24-0.47, an F1-score in the range of 0.86-0.89 and an MCC in the range of 0.36-0.53.
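The trade-off between accuracy and recall can be illustrated with the minimum expected cost decision rule, which is one common way of applying a cost matrix to a probabilistic classifier. This is a sketch under stated assumptions and not necessarily the mechanism used in this study: the function names are hypothetical, and the example assumes that the larger weight of a 1.9/1.0 balance is placed on false negatives.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability of success above which predicting 'success' is cheaper.

    Predicting success costs (1 - p) * cost_fp on average; predicting
    dropout costs p * cost_fn. The two are equal at the returned threshold.
    """
    return cost_fp / (cost_fp + cost_fn)


def predict_success(p_success: float, cost_fp: float = 1.0, cost_fn: float = 1.0) -> bool:
    """Classify as treatment success when that minimizes the expected cost."""
    return p_success >= cost_threshold(cost_fp, cost_fn)
```

With equal costs the threshold is 0.5; making false negatives more expensive lowers it, so more patients are classified as likely to succeed, which raises recall and lowers specificity.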
The treatment of patients addicted to drugs, including cocaine addicts, presents an opportunity for improvement as can be seen in the high level of dropout. These statistics result from the inherent complexity of this type of patient, most of whom suffer from associated mental problems and unfavorable environments. In a similar population, our model suggests that using the developed predictor to decide the referral of patients to inpatient treatment would result in a significant reduction of the dropout rate by 36.82% (from 28.46% to 17.98% of the total admitted patients), with a false negative rate of 4%. However, if the false negative rate is instead reduced to 2%, a dropout rate reduction of 17.11% (from 28.46% to 23.59%) is still obtained. Dropout reduction has a direct impact on the waiting time for treatment access and also on the number of future treatment re-admissions, as treatment completion is associated with higher levels of abstinence and fewer relapses [9]. From a clinical or management point of view, it would mean a clear improvement in the management of resources, with a greater proportion of patients making profitable use of the treatment and all the subsequent synergies in their reintegration into society.
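The relative reductions quoted above follow directly from the before and after dropout rates. The sketch below reproduces that arithmetic and also shows one possible reading of where the after-referral rates come from, namely the share of predicted successes who nevertheless drop out, computed from the rounded recall and specificity values. The function names and this reading are assumptions; with rounded inputs the second function only approximates the reported figures.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Relative reduction, in percent, between two dropout rates."""
    return 100.0 * (before_pct - after_pct) / before_pct


def dropout_rate_among_referred(base_dropout: float, recall: float, specificity: float) -> float:
    """Dropout share among patients the model would refer (assumed reading).

    Success is the positive class: referred patients are the true positives
    (future completers) plus the false positives (future dropouts).
    """
    base_success = 1.0 - base_dropout
    true_positives = recall * base_success
    false_positives = (1.0 - specificity) * base_dropout
    return false_positives / (true_positives + false_positives)
```

Under this reading, `relative_reduction(28.46, 17.98)` gives about 36.82 and `relative_reduction(28.46, 23.59)` about 17.11, while `dropout_rate_among_referred(0.2846, 0.96, 0.47)` and `dropout_rate_among_referred(0.2846, 0.98, 0.24)` land close to 0.18 and 0.236 respectively.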
The fact that patients who complete treatment are less likely to relapse, or that relapses are less severe, has important and enormous economic and social implications that are difficult to quantify. On the one hand, there are the costs generated by drug users who impact on non-drug users, both the pain and emotional distress generated in loved ones, and that generated in potential crime victims. On the other hand, there are the economic costs associated with medical complications generated by drug use, such as the spread of HIV, sexually transmitted diseases, and hepatitis B and C. The medical consequences are broad and pervasive: infectious diseases, cardiovascular effects, respiratory effects, gastrointestinal effects, musculoskeletal effects, kidney damage, liver damage, neurological effects, mental health effects, hormonal effects, cancer, prenatal effects, other health effects, and mortality. The impact of lost productivity of drug users should also be considered, as the nation loses the productive capacity of that person over what could have been a long career. In the United States, drug-related health care spending was estimated at $12.862 billion in 1998, 69% of which was caused by this loss of productivity. Other related impacts would be the costs of criminal justice, infrastructure, private security, and the social welfare administration [62]. As can be seen, the implications and total costs of drug use are immense and complex to measure, especially those related to emotional well-being and mental health, but an improvement in the opportunities for social reintegration of patients with drug addiction problems would bring relief in all the aspects mentioned above.
The feature selection methodology made it possible to reduce the number of variables from 60 to only 11 in the best characteristic set. This set includes variables from 5 of the 8 possible dimensions. The relevance of high heterogeneous dimensionality can be seen not only in this set but also in the others that have been generated, as all of them include variables related to multiple dimensions and most include variables from all the categories, which confirms the importance of evaluating them together rather than separately. This fact highlights the interactions generated between the heterogeneous variables considered, showing the influence of mental health, cognitive functioning, and the patient's personality on possible treatment success, as well as the importance of previous behaviors in similar treatments, beyond parameters related to substance use. The possibility of employing other methodologies for feature selection or reduction of dimensionality, such as those already mentioned in section II, remains open as possible future work.
The heterogeneous composition of all characteristic sets supports our initial hypothesis about the need to include information related to different areas of the patient in order to optimize the treatment pathway of drug-addicted patients. The 'Personality assessment' characteristic was found to be the most relevant because it was included in all the created sets and, additionally, it was the most important variable in 8 of them. Personality disorder has been one of the main dropout risk factors identified in previous studies [9].
In this work, no relationship was found between treatment success and younger age, as defined in previous research about dropout risk factors [9]. On the other hand, it does confirm the relationship between treatment success and certain disorders of cognitive function [9], in this case orientation, thought, and judgment capacity, and the effect of disulfiram in the treatment of cocaine-addicted patients [63].
Compared to the work of Deane et al. [7] our results are superior in accuracy (82.12% vs 61.6%), both with the logistic regression and with the other models. The most significant predictor variables in that study were the primary drug of concern and forgiveness of self. The first variable cannot be compared with our study since our subjects are all cocaine users, and the second was not among the data available in the referral reports. Even considering the difference in population size employed, which are populations from different regions, and Deane's multi-substance approach, the performance achieved by our machine learning techniques and the set of variables employed is an improvement in the predictive capacity of treatment success.
Ahn et al. [33] attempted to classify US subjects with cocaine dependence and healthy subjects using data on self-reported impulsivity and from neurocognitive tasks. On the one hand, our work tries to classify only individuals with cocaine dependence, so a comparison between the AUC of both works would not be logical or fair. On the other hand, our study does not include direct data on impulsivity, but related variables such as 'Judgment capacity' or 'Thought' are found in the CS WBF used with the RF algorithm.
The work of Acion et al. [34] is the most similar to the one presented in this article since it is also based on the comparison of different ML algorithms to predict success in outpatient treatment. Acion et al. focused on Hispanic Americans with addictions to different substances, used a larger database than ours (99,013), and a smaller number of variables in the initial dataset (28). Despite this, the RF algorithm also showed a great performance, positioning itself as the second option in terms of AUC (0.816), and presenting practically identical values to the best model obtained with Super Learning (0.82). This could imply that the RF algorithm fits better than other algorithms to heterogeneous data with complex interactions such as those that characterize drug dependence within addiction psychiatry. In the study by Acion et al., the length of stay in treatment was also established as one of the most important predictors.
Few studies have considered the variables of the treatments themselves [9] and to our knowledge none have considered the characteristics of the previous treatments together with mental, cognitive, personality, and pharmacological variables in cocaine addiction. We found that among the most important characteristics within the RF predictive model were those related to the patient's previous treatment pathway. This indicates that their previous behavior in similar treatments related to cocaine use can be of great help when selecting the next step in their treatment. Some variables of mental exploration, cognitive functioning, and personality have contributed in a similar magnitude to the predictive result achieved, as well as consumption habits and drugs to a lesser extent. Individually these variables do not have the capacity to provide the same results as when considered together.
The power of the tested classifier is highly dependent upon the population characteristic set used for its creation. To our knowledge, the size of our subject sample is the largest that has been used in this context in the specific case of European subjects with cocaine use disorders. Although it would be very beneficial for the generalization of the data obtained to be able to use larger sample sizes than those presented in this article, due to the inherent high complexity of addiction psychiatry, data of these dimensions are not easily accessible. In fact, this can also be seen in the related works mentioned above, where in many cases they use samples much smaller than ours. This is also related to the growing need for development in the field of psychiatry and specifically in addiction psychiatry, as these are fields of medicine that are not yet as developed as other specialties and still have a great deal of room for improvement. In any case, whenever possible it would be very useful to conduct further research with a different and larger population than that used in this article to determine whether the current results can be generalized.
Longer length of stay in residential drug treatment has been previously associated with more favorable outcomes, but optimal durations are likely to be a function of patient and problem characteristics [64]. Some patients drop out in advanced stages of treatment and represent a great loss due to the health resources consumed. According to the evidence, after 37.37 days of treatment there is a reliable change in psychological recovery and well-being [64] and, in the case of late dropouts, shorter residential treatment with longer aftercare may be beneficial [65], both in terms of cost and clinical outcomes [66]. On the other hand, late dropouts could also be subjected to a pre-treatment phase to improve the stage of change, which has been shown to be a good predictor of treatment completion [67]. For these reasons it would also be of great interest for future research to explore whether predictive models would be able to differentiate between early and late dropouts.
A further step in our research will be to develop a decision support system to assist clinical professionals in charge of referring patients to cocaine addiction inpatient treatment in the next phase of their treatment pathway. The integration of the implemented predictor with a technological platform would facilitate its use by clinical professionals and would allow the final performance of the classifier to be more easily assessed.

VI. CONCLUSION
The use of ML algorithms together with feature selection techniques on a set of heterogeneous variables of different dimensions has shown promising results for the prediction of therapeutic success in cocaine addicted patients. The best performance was achieved by the Random Forest algorithm with an accuracy of 82.12%, an excellent AUC of 0.81, a recall measure of 0.96, an F1-measure of 0.89 and an MCC of 0.53 on average. As seen in previous research works, this algorithm has again demonstrated a great capacity for adaptation to datasets with complex relationships, an inherent condition in the field of psychiatry and drug addiction.
Interactions between variables of different dimensions have proved to enhance the predictor's performance, establishing synergies between characteristics of the previous treatments together with mental, cognitive, personality, and pharmacological variables. This article has confirmed the relationship between treatment success and certain variables also established in the literature, such as cognitive function disorders, personality disorders, length of stay in previous treatments and the effect of disulfiram in the treatment of patients addicted to cocaine.
Reducing treatment drop-out rates for people with drug addiction problems would mean better management and use of medical and care resources, improving opportunities for social reintegration and involving major economic and social benefits. With the increasing use of ML methods in addiction psychiatry for informing medical decisions, the use of these techniques to predict treatment success could represent a new step in treatment management especially when included in clinical decision support systems.
M. ELENA HERNANDO (Senior Member, IEEE) received the Ph.D. degree in Spain. She is currently a Telecommunication Engineer in Spain. She is also a Professor of biomedical engineering topics. Her main research interests include telemedicine services and the definition of decision support tools for patients and healthcare professionals.
VOLUME 8, 2020