Data Driven Classification of Opioid Patients Using Machine Learning–An Investigation

The opioid crisis has led to an increased number of drug overdoses in recent years. Several approaches have been established to predict opioid prescription by health practitioners. However, due to the complex nature of the problem, the accuracy of such methods is not yet satisfactory. Dependable and reliable classification of opioid-dependent patients from well-grounded data sources is essential. The majority of previous studies do not consider the association between users' mental health and opioid intake. Nor do these studies employ recent deep learning techniques, such as attention and knowledge distillation mechanisms, to gain better insights. This paper investigates the opioid classification problem using machine learning and deep learning techniques. We used structured and unstructured data from the MIMIC-III database to identify intentional and unintentional intake of opioid drugs. We selected 455 patient instances and used traditional machine learning and deep learning to predict intentional and accidental users. We obtained 95% and 64% test accuracy in predicting intentional and accidental users from the structured and unstructured datasets, respectively. We also achieved a test accuracy of 76.44% with a knowledge distillation model that integrates the two models above. Our research includes an ablation analysis, and new insights related to opioid patients are extracted.


I. INTRODUCTION
Opioid analgesics are generally used to alleviate severe and chronic pain in patients. Doctors and other health care practitioners prescribe opioids in large numbers, especially in the United States of America (USA). According to the Centers for Disease Control and Prevention (CDC), the approximate cost of opioid abuse in the United States is $78.5 billion per year [1]. The number of opioid prescriptions in the United States is very high; research found that around 153 million opioid drugs were prescribed in 2019 [2]. Opioids are a class of drugs prescribed as painkillers, but they are heavily overused due to their addictive nature. Several studies [3], [4] have described that patients obtain these medications not to control pain, but because they are dependent on them. This can also result in an overdose. In our study, we use machine learning techniques to predict users' opioid misuse patterns from both structured data (e.g., demographic information, gender, and ethnicity) and unstructured data (i.e., chronological medical history and event notes). Barkley and Shin [5] found that intentional overdoses are correlated with depression. Other studies [6], [7] found that the rate of intentional drug use among adolescents is worrying. Prince [8] found that there is a direct connection between taking drugs and mental illness. Jones and McCance-Katz [9] also found that opioid use disorder (OUD) is associated with mental disorders. There appears to be a direct relationship [10], [11] between mental illness and drug abuse which needs further investigation. In the studies mentioned above, most authors conduct research on a specific aspect of the opioid problem, such as particular age groups or demographics [12], [13], [14]. The database we utilize is a good source of data for studying the problem, as it includes demographic, ethnicity, medical condition, and age variables.

The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang.
Previous studies did not apply contextual analysis based on natural language processing (NLP) techniques to patients' event notes and medical history.
Deep learning and machine learning have gained popularity in healthcare applications [15], [16], [17], [18]. However, the current opioid risk assessment tools [19] are insufficient in terms of predictability and automatic contextual analysis based on patients' historical data. Furthermore, clinicians should be offered tools that allow them to determine a patient's risk of misuse before administering opioids. Considering that opioid misuse is a medical problem impacting people's health and the economy, investigating the problem with a machine learning approach can be useful. The database that we work with contains data which can be used to identify opioid patients. As discussed above, previous studies find an association between mental health and opioid intake. In some other studies, researchers consider demographics (e.g., age and ethnicity) for finding opioid associations. It is therefore important to use these features as predictors in early warning systems for opioid intake. In addition, users' historical data provides a contextual cue for their future behavior. Previous studies rarely employ recent deep learning based NLP techniques, such as attention and knowledge distillation mechanisms, to extract contextual signals that could give researchers better insight.
In this paper, we use data from the MIMIC-III database [20], from which we have identified the opioid cases based on keyword identification. We identify relevant tables (i.e., schemas) from the database and select 41 features which are relevant to our study. Based on the keywords and patients' history, we identify which patients take opioids intentionally. In this way, we label our dataset as opioid intake 'YES'/'NO'. Later, we build a structured (i.e., tabular) dataset. To strengthen the model, we also incorporate an unstructured dataset. As training on an unstructured dataset is complex and challenging, we apply deep learning based NLP techniques. For each patient, we analyze their historical data (i.e., event notes/unstructured data), and we convert the data using word embedding and attention based LSTM techniques. Since our patients' data is already labelled, we train on the unstructured data with the deep learning based technique mentioned above. In this study, the model built on the structured dataset achieves higher performance, while the model built on the unstructured dataset shows weaker results.
To build a combined model, we apply a knowledge distillation technique in which the model trained on the structured dataset acts as the higher capacity network, and we transfer its knowledge to the weaker model trained on the unstructured dataset.
Our study further investigates whether a pattern of opioid use has any connection with users' mental health statuses and other socio-economical determinants. Classification of opioid patients and their mental health is important, considering the number of overdose deaths per year and the financial consequences of opioid addiction [21]. Our study may benefit society in a number of ways, such as early detection of intentional and unintentional opioid misuse, reducing the effect of aggressive marketing by pharmaceutical companies which profit from pain medication use, and better surveillance of opioid misuse by authorities and stakeholders.
The main contributions of this study are: 1) We build a dataset by using the MIMIC-III database for predicting opioid misuse. 2) We investigate the relationship between mental health and opioid misuse by patients from their structured and unstructured data (patients' clinical event notes). 3) We develop traditional and deep learning based supervised models to predict intentional and unintentional opioid users, using an attention based mechanism.
The organization of this paper is as follows. In section II, we briefly present existing studies related to our work. Section III discusses the methodology of our study which describes the dataset, data preprocessing, ground truth procedures, feature engineering, correlations, and model architecture, respectively. Section V describes the ablation study and section VI presents the discussion of our study. Section VII concludes our study.

II. RELATED WORKS
Prior research has shown that using opioids together with benzodiazepines increases the risk of an overdose fatality compared to opioid use alone [22]. The authors described trends in intentional abuse of opioid analgesics, benzodiazepines, or both, from 2001 to 2014. They then calculated the increased risk of mortality associated with the abuse or misuse of the combination of opioid analgesics and benzodiazepines relative to opioid analgesic abuse or misuse alone. Barkley and Shin [5] investigated the characteristics of the individuals who died from intentional drug overdoses compared to unintentional overdoses. They found that intentional overdoses are associated with depression. Another study [6] investigated the demographic characteristics related to intentional opioid usage. A study based on information provided by three poison control centers in 2002-2014 in Ohio [7] showed that the rate of intentional drug use among adolescents is alarming, and that there is a need for more research into the misuse and abuse of drugs, especially suicide drug poisoning. Legislative actions may help in controlling drug use among adolescents and young adults and preventing them from getting access to specific drugs.

VOLUME 11, 2023
Mensah et al. [23] investigated factors leading to addiction and abuse of opioids. Prince [8] found that there is a direct connection between drug use and mental illness. People who have a record of past hospitalization or of using prescribed painkillers for mental illnesses (e.g., schizophrenia, bipolar disorder, major depressive disorder) are more likely to seek drugs afterwards. Suicide attempts among people with severe mental illness (SMI) who used prescribed painkillers were 2.40 times higher than for people with other substance use disorders. People with SMI are more likely to have an opioid use disorder (OUD), and OUD was present in 5% of people with SMI. However, the authors were unable to identify the neurobiological risk factors in their Syndrome Model. A related study [9] reports that people who have a record of OUD are also likely to have mental disorders and to use other substances (nicotine, alcohol, tranquilizers, etc.). We may conclude that there is a connection between mental illness and drug use. However, although there is a correlation, causality is less clear. Van et al. [10] identified the connection between mental disorders and opioid overdoses. In their review study, the authors tried to explain the association between mood/depression disorder and opioid overdose. There is also an association between PTSD (post-traumatic stress disorder) and opioid overdoses.
Bohnert and Ilgen [11] found OUD has a strong connection with suicide attempts and overdose deaths. About 40% of suicide and overdose deaths were related to opioid use disorders in 2017 in America. Some factors cause people to use opioids in order to cope with society but this may increase depression, stress, anxiety, pain and eventually lead to suicide or overdose death. Easy availability of opioids and the use of other substances combined with opioids further increases the risk of unintentional overdose deaths.
Several studies [14], [24] have attempted to predict opioid abuse using traditional statistical approaches. Alzeer et al. [25] wrote a review paper attempting to find crucial indicators from literature. Their study identified 75 factors that are connected with opioid abuse. They found age and gender to be the most important indicators. Vunikili et al. [13] also presented a set of statistical models to classify patients at risk of opioid misuse, death, and drug-drug interactions. Machine learning is an emerging field for predictive analysis and is used in multiple sectors. Han et al. [12] developed a prediction model to demonstrate the efficacy of machine learning in predicting opioid patients. The MIMIC-III dataset was used to classify opioid-dependent patients.
In this study, we consider the social determinants and the patient characteristics, particularly using the structured set of data from the MIMIC-III database. In addition, making use of the unstructured data could improve the decision-making process. To the best of our knowledge, no study has investigated the interaction of behavioral health and opioid drugs using domain-specific word embedding. The contribution of our research is to classify opioid patients from structured as well as unstructured data.
It is challenging to classify opioid patients with unstructured data using machine learning algorithms. However, it may be feasible with the aid of knowledge distillation. Ahn et al. [26] introduced the concept of the knowledge distillation which was formulated by Hinton et al. [27]. According to Ahn et al. [26] knowledge from a higher capacity model could be compressed to a lower capacity model by training the weaker model with the logits generated by the stronger model. However, it has been established in the research by Gao et al. [28] that it is feasible to distill dark knowledge from a totally different presentation of data from a strong network to a weaker network.

III. METHODOLOGY
In this section, we present the different parts of our methodology. Figure 1 depicts the different steps of the study. Using data from the MIMIC-III database, we create a structured and an unstructured dataset, which we use to predict opioid misuse. We also analyse the performance of our models using different ablation studies. In the following subsections, we describe all the steps in detail.

The MIMIC-III database [29] has 26 schemas, but only ten schemas are relevant to our study. To create our dataset, 41 relevant features were taken. Within the unstructured data, we identified a total of 37,127 distinct cases, which we filtered systematically. We extracted our cohorts from the MIMIC-III database. MIMIC-III is a massive, publicly available database which contains health-related information of over 40,000 patients from the intensive care units of the Beth Israel Deaconess Medical Center during the period from 2001 to 2012. Demographics, vital sign assessments taken at the bedside, diagnostic test findings, treatments, prescriptions, caregiver observations, imaging documents, and death details (both in and out of hospital) are also stored. The database has information on 53,423 distinct hospital admissions of adult patients (aged 16 years and older) to critical care units. The tables in MIMIC-III [20] are connected by identifiers that ordinarily end in "ID". For instance, SUBJECT_ID refers to a unique patient, HADM_ID refers to a hospital admission, and ICUSTAY_ID refers to an admission to an ICU. A description of MIMIC-III is summarized in Table 1.
In our study, we also use unstructured data to understand users' opioid behavior. Note that the events [30] column from the Lab events schema of the MIMIC-III database contains a huge corpus of text data. To find opioid patients, we performed an initial query with 120 opioid keywords against every prescription in the database. We identified 4,08,130 cases. Table 2 presents the statistics of the keywords, features, and structured and unstructured data instances, which we filtered manually.

B. DATA PREPROCESSING
We selected 121 opioid-related keywords (e.g., Hydrocodone, Methadone, and Fentanyl) from the literature [31], [32], [33]. In this identification process, we verified whether each keyword referred to prescription opioids or to illegal addictions. This qualitative search helps us to remove irrelevant keywords. As a result, we identified 32,152 opioid-dependent patients from the prescriptions table of the MIMIC-III database with the selected keywords. The prescriptions schema records the medications ordered for a given patient. Our queries matched the selected keywords against the drugs administered to each patient and returned the distinct identifiers (subject id) of the patients who received one or more opioid drugs. The purpose of our initial data preprocessing was essentially to find the opioid-related patients in the MIMIC-III database, including the group of patients who were treated for an overdose. As MIMIC-III is a huge dataset of patients with various diseases and treatments, we first segregate the opioid patient cohort so as to include the maximum number of patients.
We applied every opioid-related keyword wherever it could be matched. Out of the 26 tables and 202 features, we found 41 features that are relevant to our research. Therefore, the shape of our initial dataset was (32,152 × 41).
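The keyword query described above can be illustrated with a minimal sketch. It assumes each prescription row exposes a patient identifier and a free-text drug name; the field names `subject_id` and `drug`, the sample rows, and the case-insensitive substring matching are assumptions for illustration, not the authors' actual query.

```python
def find_opioid_patients(prescriptions, keywords):
    """Return the sorted distinct subject IDs with at least one matching drug name."""
    lowered = [k.lower() for k in keywords]
    matched = set()
    for row in prescriptions:
        # Missing drug names are treated as empty strings (assumption).
        drug = (row.get("drug") or "").lower()
        if any(k in drug for k in lowered):
            matched.add(row["subject_id"])
    return sorted(matched)


# Hypothetical rows standing in for the prescriptions table.
rows = [
    {"subject_id": 1, "drug": "Fentanyl Citrate"},
    {"subject_id": 1, "drug": "Methadone HCl"},
    {"subject_id": 2, "drug": "Acetaminophen"},
    {"subject_id": 3, "drug": "HYDROcodone-Acetaminophen"},
    {"subject_id": 4, "drug": None},
]
keywords = ["Hydrocodone", "Methadone", "Fentanyl"]
opioid_ids = find_opioid_patients(rows, keywords)
```

Collecting matches in a set mirrors the "distinct identifiers" step: a patient with several opioid prescriptions is counted once.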
For the two types of datasets, several data preprocessing techniques have been used. For the structured (i.e., tabular) dataset, the label encoding technique has been used initially since many of the attributes had qualitative data. The tabular dataset had a few missing values as well. There are several missing value handling techniques which are available to deal with this, for example, discarding the instance, replacing with mean/mode/median, replacing with next/previous value, imputing the most frequent value, KNNImputer, etc. Among those techniques, KNNImputer [34] was selected for handling the missing values. KNNImputer is widely used for missing value handling [35], [36] and provides the best outcomes for the tabular dataset that we are working with.
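To make the two preprocessing steps concrete, the following sketch shows label encoding followed by a nearest-neighbour imputation in plain Python. It is a simplified stand-in for KNNImputer, not its exact algorithm: the distance here is the root mean squared difference over the columns both rows have observed, and the value of k and the sample rows are assumptions.

```python
import math


def label_encode(values):
    """Map each distinct non-missing string to an integer code (sorted order)."""
    classes = sorted({v for v in values if v is not None})
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] if v is not None else None for v in values], mapping


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""

    def distance(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors are the other rows that have column j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical categorical column and numeric table with one missing value.
codes, mapping = label_encode(["WHITE", "ASIAN", None, "WHITE"])
table = [[1.0, 2.0], [1.0, None], [5.0, 6.0], [1.2, 2.2]]
filled = knn_impute(table, k=2)
```

In the example, the missing entry is filled from the two rows closest to it in the observed column, so the distant third row does not influence the imputed value.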

C. GROUND TRUTH COLLECTION
We manually selected data from patients' prescriptions for the ground truth collection. We asked the following questions: has the patient taken any narcotics, did the patient take any opioids, and did the patient use any controlled substances? After a careful review, we selected 500 samples of opioid overdose and mental health conditions like depression, hypertension, and bipolar disorder. It is important to mention that MIMIC-III is a publicly available data source for research purposes, and we rely only on this source. We have not used or amalgamated any other external data. Furthermore, no personally identifiable information (PII) was disclosed, and HIPAA policy was maintained during ground truth collection.
We obtained the text column from the 'Noteevents' table, where the prescription of the patient is stored. Noteevents is the only table in the MIMIC-III dataset which contains all the notes of the patients, and it is a comprehensive source of the unstructured data. Each note is linked to the specific subject id of the patient and contains information about admission type, past medical history, socio-economic status, and a detailed description of the patient. For each patient, we searched for any opioid-related information in the prescription. As mentioned earlier, every prescription contains specific information for the patient, such as past medical history, social history, and family history, along with the medication which is provided to the patient. We searched the prescriptions manually for any evidence of opioid misuse in the past medical history. We also searched the 'Social History' and 'Family History' categories for information on depression, anxiety, living alone, or a broken family that can potentially cause someone to feel unhappy. For example, we identified a patient (patient id: 7445) as opioid dependent or potentially a future abuser.
According to the objectives of this study, we develop datasets containing the socio-economic characteristics of the patients, in addition to related factors like lab events and vital signs, to identify opioid patients. We performed a qualitative study, based on a brief literature search, to select only the schemas and columns which may be helpful for decision making. After several data preprocessing steps, high level categorization, and feature engineering, we found 41 factors useful for this study. Table 2 shows the selected tables and their columns used to form a structured dataset for our study. The dataset that we created for the analysis has a shape of (454 × 41), which we feed into the model to classify intentional and unintentional opioid patients. As mentioned in our contribution section, this dataset is made publicly available for further research work.

D. FEATURE ENGINEERING
Feature engineering is an integral part of this study. It helps to uncover hidden patterns in the data and boosts the predictive power of a machine learning model. The tabular data consisted of 21 attributes. Among those 21 attributes, 11 categorical attributes were selected manually for classification. Some of these attributes had string data, and each of the attributes had some missing values and some noise as well. Several techniques have been used to preprocess the tabular data. First, label encoding has been used to replace string data with numeric values. Second, missing values have been imputed from the closest instances. Third, feature selection has been used to find the correlations of the attributes. Finally, the dataset is randomly divided into training and testing sets. There are 363 (80%) observations in the training set and 91 (20%) in the testing set.
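The reported split sizes can be reproduced with a short sketch, assuming the test set size is rounded up (as common library implementations do) and that rows are shuffled before splitting. The function name and the seed are illustrative.

```python
import math
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the test size up; the remainder goes to training.
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(454)
```

With 454 instances and a 20% test fraction, this yields 91 test rows and 363 training rows, matching the counts given above.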
In accordance with literature, we divided the age section into three categories: young (15-39 years), middle (40-60 years), and senior (more than 60 years). Moreover, we discretize the 'los (length of stay)' attribute into three categories: short stay (1-50 days), middle stay (50-100 days), and long stay (more than 100 days). These categorical data are later converted into numerical vectors [37].
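The discretization can be sketched as two small functions. The text leaves the shared boundaries (50 and 100 days) ambiguous, so this sketch assumes that a stay of exactly 50 days is "short" and exactly 100 days is "middle"; the integer codes are also an assumption about how the categories are later vectorized.

```python
AGE_CODES = {"young": 0, "middle": 1, "senior": 2}
STAY_CODES = {"short": 0, "middle": 1, "long": 2}


def age_group(age):
    """Bin age in years: young (up to 39), middle (40-60), senior (over 60)."""
    if age < 40:
        return "young"
    if age <= 60:
        return "middle"
    return "senior"


def stay_group(los_days):
    """Bin length of stay in days; boundary handling is an assumption."""
    if los_days <= 50:
        return "short"
    if los_days <= 100:
        return "middle"
    return "long"


# Hypothetical (age, length of stay) pairs.
patients = [(25, 3), (47, 75), (72, 120)]
encoded = [(AGE_CODES[age_group(a)], STAY_CODES[stay_group(d)]) for a, d in patients]
```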
In the dataset, a feature named icd9_code was provided. However, this attribute had 41 categories and was not sufficient to describe the disease or past medical history of a patient. Hence, we performed feature engineering by creating more sections, named 'icd9_code_desc' and 'High level category'. Here, the feature named 'icd9_code' represents the disease code, and the added section, 'icd9_code_desc', is the description corresponding to the ICD-9 codes. The feature titled 'High level category' was added to categorize the ICD-9 code descriptions. The purpose is to find the correlation between dependent and independent variables. The 41 categories in 'icd9_code_desc' were reduced to eight categories according to the higher level categorization:
• Diseases of blood and circulatory system
• Diseases of nervous system and mental disorder
• Diseases of digestive system
• Diseases of genitourinary system
• Diseases of respiratory system
• Skin, subcutaneous tissue and musculoskeletal diseases
• Endocrine, metabolic, immunity disorder and sepsis
• Poisoning and injury

1) AGREEMENT ANALYSIS
For our data analysis purposes, we convert the ICD9 codes into higher level disease categories. We perform this task to produce a lower number of class labels. In the original dataset, if we use each ICD9 code as a single class label, there is a high chance of overlap in the class prediction for the feature. Consider, for example, respiratory system disorder and shortness of breath. We did not treat these diseases as two different classes; rather, we merged them into a single class label, respiratory issues. In this way, we convert the 51 independent class labels to only eight class labels.
To this end, we recruited three physicians: one from the Directorate General of Health Services (DGHS), Bangladesh, and two from Dhaka Medical College Hospital, Bangladesh. We hired three physicians as annotators of the ICD9 codes so that the final label could be decided according to a majority voting scheme. First, we produced three independent copies of the same spreadsheet. We confirmed that the annotators performed the task independently, so that no annotator was influenced by the others.
Since we have three annotators, we use Fleiss' kappa [38] to measure their agreement on the class labels. The extent to which raters assign similar scores is called the agreement in their observations. To obtain a fair percentage of agreement, statisticians create a matrix where columns and rows represent the raters and the objects that are going to be rated, respectively. If two raters agree completely on a rating, we record a zero (i.e., no disagreement); otherwise we record a one, which represents a disagreement. Then, we find the percentage of zeros, which represents the agreement score, i.e., kappa [38]. Kappa can range from −1 to +1. Using Fleiss' kappa, we obtain a kappa score of 0.9 among the physicians, which indicates strong agreement.
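As an illustration of the agreement analysis, the sketch below computes Fleiss' kappa from its standard definition (observed agreement corrected for chance agreement) together with a majority vote per item. The category labels and the four example items are hypothetical, and the code assumes every item is rated by the same number of annotators.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among annotators (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings):
    """ratings: one list per item holding the label chosen by each rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j] = number of raters assigning item i to category j
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical annotations by three physicians for four ICD9 codes.
ratings = [
    ["respiratory", "respiratory", "respiratory"],
    ["circulatory", "circulatory", "circulatory"],
    ["respiratory", "respiratory", "circulatory"],
    ["circulatory", "circulatory", "circulatory"],
]
kappa = fleiss_kappa(ratings)
final_labels = [majority_vote(item) for item in ratings]
```

Note that the formula is undefined when all raters use a single category for every item, because the chance agreement is then 1.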

E. FINDING CORRELATIONS
We apply data engineering to derive the patients' Date of Birth (DOB), which is masked in the database in accordance with HIPAA policy [39]. We involved domain experts to categorize certain features, and we binned the length of stay (LOS) to ensure that the overall dataset is suitable for the model building stage. Initially, we aggregated 41 features from the 13 schemas of the MIMIC-III dataset. However, after high level categorization of certain features, we reduced this to 10 attributes. The dataset with those 10 attributes serves as our structured dataset. These ten features are likely correlated with users' opioid behavior. Some of these attributes are continuous and some are categorical in nature. We discretized the continuous values into categorical attributes. We then computed the chi-square [40] correlation between our independent variables and the dependent variable (i.e., opioid intake intentional YES/NO). In statistics, the chi-square test is used for testing the independence of two events. We treat features with a p-value below 0.05 as significantly associated with users' opioid intake behavior. Table 3 shows the correlation between different features and users' opioid intake. Mental status, ethnicity, and high level category (diagnosis) are significant features in predicting users' opioid intake behavior.
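The chi-square test of independence can be sketched as follows for a contingency table of a categorical feature against the YES/NO label. The counts are hypothetical, no continuity correction is applied, and the closed-form p-value shown holds only for one degree of freedom (a 2 × 2 table).

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: rows = mental status (yes/no), columns = intentional (YES/NO).
table = [[30, 10], [20, 40]]
statistic, dof = chi_square_independence(table)
p_value = p_value_one_dof(statistic)
significant = p_value < 0.05
```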

F. MODEL BUILDING
We build models to predict opioid patients using two different approaches: i) tabular data from correlated attributes and ii) unstructured data from patients' history notes, i.e., eventnotes. In this subsection, we first explain the method of building opioid intake behavior prediction by using the tabular data. Later, we also describe the process of model building by using our unstructured data.

1) STRUCTURED DATA CLASSIFICATION
We build our structured dataset, $D_t$, with the correlated attributes: mental status, ethnicity, and ICD high level category (diagnosis). We then split our dataset (454 instances) into training and testing datasets of 80% and 20%, respectively. We train on our dataset using cross validation with 10 iterations and the following classifiers [41]: AdaBoost Classifier, Logistic Regression, Support Vector Classifier, XGB Classifier, and Random Forest Classifier. Table 4 shows the performance of our models which are developed from the tabular dataset.

2) UNSTRUCTURED DATA CLASSIFICATION
For unstructured data classification, where the patients' histories are the input of the models, several classification approaches can be used. Since the unstructured data do not have any particular format, our study performs the classification with NLP techniques. The unstructured data consists of two attributes, where the first attribute refers to the patient's history and the second attribute is our target attribute, opioid intake intentional YES/NO. We applied several preprocessing techniques to our unstructured dataset. Data cleaning is used to remove rows of incomplete data from the dataset. We discard irrelevant symbols, such as brackets and punctuation, from the patient's history data field. We also replace some abbreviations (e.g., dr./Dr./md./m.d.) with their full forms so that their periods are not mistaken for sentence boundaries at the time of vectorization. Word vectorization [42] is an NLP method of mapping a word or phrase from a vocabulary to a corresponding vector. The unstructured data, $D_{us}$, has been classified using three different well known methods: a 1D CNN-based model, a basic LSTM-based model, and an LSTM and attention-based model.
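The cleaning and vectorization steps can be illustrated with a short sketch. The abbreviation expansions, the lowercasing, and the integer encoding with zero padding are assumptions made for illustration; the actual pipeline and vocabulary are not specified in the text.

```python
import re

# Assumed expansions; the text only lists the abbreviations themselves.
ABBREVIATIONS = {
    r"\bdr\.": "doctor",
    r"\bm\.d\.": "doctor of medicine",
    r"\bmd\.": "doctor of medicine",
}


def clean_note(text):
    """Lowercase, expand abbreviations, and strip brackets and punctuation."""
    text = text.lower()
    for pattern, full_form in ABBREVIATIONS.items():
        text = re.sub(pattern, full_form, text)
    text = re.sub(r"[^\w\s]", " ", text)  # brackets and punctuation become spaces
    return " ".join(text.split())


def build_vocab(notes):
    """Assign each token an integer ID starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for note in notes:
        for token in note.split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(note, vocab, max_len):
    """Map tokens to IDs (unknown tokens become 0), then truncate or pad to max_len."""
    ids = [vocab.get(token, 0) for token in note.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


raw = "Dr. Smith (MD.) noted: pt. takes oxycodone [daily]."
cleaned = clean_note(raw)
vocab = build_vocab([cleaned])
encoded = encode("doctor smith", vocab, max_len=4)
```

The integer sequences produced by `encode` are the kind of input an embedding layer expects; the embedding itself is learned by the model.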

a: LSTM AND ATTENTION TECHNIQUES
Long Short Term Memory Networks, most commonly referred to as ''LSTM'' are a unique class of RNN that can recognize long-term dependencies. When there is a lot of information to summarize, the model performs poorly and produces inaccurate results. It is known as the RNN or LSTM long-range dependency problem. The attention mechanism with the LSTM attempts to address the problem.
The cell state is the foundation of LSTMs. The LSTM can modify the cell state by removing or adding information, a process that is carefully controlled by gates. Gates let information through optionally. Three gates serve to preserve and regulate the cell state in an LSTM architecture. The first stage in our LSTM model is to choose which information to discard from the cell state. The forget gate layer, which uses a sigmoid function, makes this decision. Equation 1 shows how the forget gate layer scans $x_t$ and $h_{t-1}$. The next step is to choose the new information that will be kept in the cell state. Two different components perform this task. The input gate layer $i_t$, a sigmoid layer, first determines which values will be updated. A tanh layer then creates a vector of potential new values, $\tilde{c}_t$, that could be added to the state. These two are combined in the subsequent phase to produce an update to the state. In the next stage, we update the old state: we multiply the previous state by $f_t$, forgetting the information that we earlier decided to discard, and then add $i_t \tilde{c}_t$, the new candidate values, as presented in Equation 4.
Based on the candidate state, we finalize the output of the LSTM cell in the output gate (Equation 5). In a conventional LSTM unit, the sequences are only encoded in one direction (past information). In order to preserve both past and future information, we employ a BiLSTM model [43] to encode the sequences in both directions.
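For reference, the gate computations described above correspond to the standard LSTM formulation, written out below. This is the textbook form and is offered as an assumption about what the numbered equations contain; the weight and bias symbols ($W_f$, $b_f$, and so on) and the exact equation numbering are not taken from the source.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{align}
```

Here $\sigma$ is the sigmoid function, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ denotes element-wise multiplication.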
The attention mechanism has been used successfully in natural language processing, machine translation, and image processing tasks. Given an input sentence, it extracts the pertinent context information about a word. The forward and backward output features of the bidirectional LSTM are concatenated into the vectors $h_t$ before applying attention. The formulation of the attention mechanism is described in Equations 7, 8 and 9. In a global attention model, all of the encoder's hidden states are taken into account when determining the context vector $c_t$. In this model type, the current target hidden state $h_t$ and all source hidden states $h_s$ are compared to derive a variable-length alignment vector $a_t$, whose size is equal to the number of steps (according to Equation 10). The weighted average of all the source states, according to $a_t(s)$, is then calculated to create a global context vector, $c_t$. Therefore, the computation path of this attention mechanism is to calculate $h_t$, then the attention weights $a_t(s)$, then the context vector $c_t$, and finally the attention vector.
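The computation path of global attention can be sketched in plain Python. The dot-product score function, the projection matrix `w_c`, and the toy vectors are assumptions for illustration; the text does not state which score function or dimensions are used.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def global_attention(h_t, source_states, w_c):
    """Return (alignment weights, context vector, attention vector)."""
    # Score the target state against every source state (dot-product score assumed).
    scores = [dot(h_t, h_s) for h_s in source_states]
    # Softmax over the scores gives the alignment weights a_t(s).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector c_t is the weighted average of the source states.
    dim = len(source_states[0])
    context = [
        sum(w * h_s[d] for w, h_s in zip(weights, source_states)) for d in range(dim)
    ]
    # The attention vector combines c_t and h_t through a tanh projection.
    concat = context + list(h_t)
    attention = [math.tanh(dot(row, concat)) for row in w_c]
    return weights, context, attention


# Toy example with two-dimensional hidden states.
h_t = [1.0, 0.0]
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_c = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
weights, context, attention = global_attention(h_t, source_states, w_c)
```

Source states that are more similar to the target state receive larger weights, so they contribute more to the context vector.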
To learn the text's sequence, the LSTM-based model contains one LSTM layer with 400 units. This model has 3.1 million parameters, making it more complex than the other models. Additionally, a hybrid (LSTM + Attention) model was employed to classify the sequence. In this model, a bidirectional LSTM layer with 128 units in each direction is followed by the attention layer, which outputs a 128-dimensional vector. Each of the models ends with a fully connected dense layer that has two units. The output layer includes two units since the model produces logits that are passed through the softmax function. Adam serves as the optimizer for all models, with the same learning rate (0.001), batch size (32), and data split (80/20).

b: 1D CNN
The 1D CNN is a modified version of the 2D CNN. It is widely used in sequence learning and has a shallow architecture, so it needs few resources to train and test. Compact 1D CNNs have shown improved performance in recent research for applications with limited labeled data and high signal variations obtained from various sources [44].
We use a non-causal CNN, since the output y depends on a future sequence of inputs x. Let the input to the convolution layer, of length n, be represented by x, and let the kernel, of length k, be represented by h. Let the kernel window be shifted by s positions (the stride) after each convolution operation. A non-causal convolution between x and h for stride s is then computed over the window of k inputs that starts at each stride position. Our 1D CNN model uses a 128-dimensional embedding layer and 32 kernels in total. In terms of parameters, this model is less complex than the others. All models are constructed starting with an embedding layer of 128 dimensions.
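The non-causal strided convolution described above can be sketched as follows. The sketch assumes no padding and no kernel flip, so that output j is $y(j) = \sum_{i=0}^{k-1} x(j \cdot s + i)\, h(i)$; this indexing convention is an assumption for illustration and may differ from the authors' equation.

```python
def conv1d_noncausal(x, h, stride=1):
    """Non-causal 1D convolution without padding (cross-correlation form).

    Output j is computed from x[j*stride] ... x[j*stride + k - 1], that is,
    from the current position and the k - 1 positions that follow it.
    """
    n, k = len(x), len(h)
    if k > n or stride < 1:
        raise ValueError("need len(h) <= len(x) and stride >= 1")
    return [
        sum(x[start + i] * h[i] for i in range(k))
        for start in range(0, n - k + 1, stride)
    ]


# Hypothetical toy input and kernel.
x = [1, 2, 3, 4, 5, 6]
h = [1, 0, -1]
y_stride1 = conv1d_noncausal(x, h, stride=1)
y_stride2 = conv1d_noncausal(x, h, stride=2)
```

A larger stride visits fewer window positions, so `y_stride2` is shorter than `y_stride1`.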

G. COMBINING MODELS BY KNOWLEDGE DISTILLATION
Let x be an eventnote from a dataset $D_{us}$, where $D_{us}$ is an unstructured dataset consisting of patients' histories as text with corresponding labels. The note consists of the tokens $t_1, t_2, \ldots, t_n$ extracted from the text. Our aim is to train a model $M_S$ on the dataset $D_{us}$ such that it achieves an accuracy a by using knowledge distilled from a model $M_T$. The model $M_T$ is trained on a structured dataset $D_t$ with limited features $t_1, t_2, \ldots, t_k$, where $n \gg k$. We propose a novel method to classify patients who misuse opioids from the unstructured data $D_{us}$ by using the knowledge of a model that is trained on structured data. Figure 2 presents the detailed pipeline that shows how we combine the structured and unstructured datasets by using knowledge distillation from the structured dataset. In our knowledge distillation approach, a model trained on the unstructured data $D_{us}$ learns from the insights extracted from the structured data through a shallow ANN (Artificial Neural Network) model. In this case, the teacher model $M_T$ is trained on the dataset $D_t$, which is a tabular structured dataset with limited features. The dataset $D_t$ is constructed by domain experts (see Section III-D). Therefore, the dataset $D_t$ is trustworthy, and the teacher model learns from a limited set of known features during training. The main purpose of the teacher model ($M_T$) is to provide insights on the unstructured training set ($D_{us}$) of the student model ($M_S$).
Initially, we look at the labels of the dataset $D_{us}$, extract the features from the dataset $D_t$ for the corresponding labels, and create a new dataset. We therefore obtain a new dataset $D_{sn}$ whose labels are identical to the labels of the dataset $D_{us}$. The student model outputs a two-dimensional score vector for each input. The teacher model likewise gives scores of the same form as the output of the student model. These scores can now be used to calculate soft probabilities. To soften the probabilities, we make use of the temperature hyper-parameter τ. When τ = 1, softmax produces its typical output. However, when we raise τ, the softmax output softens and reveals which classes our teacher model found to be more similar to the predicted class. Hinton et al. [27] called this dark knowledge. The teacher model itself acquires the dark knowledge during training, and during the distillation process this dark knowledge is transmitted to the student model, which is built from the unstructured dataset $D_{us}$. According to the experiments of the authors [27], the value of τ can range from 1 to 20. The authors find that using the same value of τ for the student and teacher models tends to give the best results.
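The effect of the temperature τ on the softmax output can be illustrated with a short sketch. The logit values and the choice τ = 5 are hypothetical and serve only to show the softening; they are not taken from the experiments.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau; tau = 1 gives the usual softmax."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-class teacher scores.
teacher_logits = [4.0, 1.0]
hard = softmax_with_temperature(teacher_logits, tau=1.0)
soft = softmax_with_temperature(teacher_logits, tau=5.0)
```

With the higher temperature, the probability of the predicted class moves closer to that of the other class, which exposes more of the teacher's relative preferences.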
Formally, let $(x, Y) \in D_{us}$, where x is an eventnote and Y is the corresponding label. Given an input x, our student model $M_S$ outputs the logits $L_S = M_S(x)$. These logit values are softened by the temperature τ and passed through the softmax function σ to obtain the soft probabilities $\hat{Y}_{S\tau} = \sigma(L_S/\tau)$. On the other hand, $Y_S = \sigma(L_S)$ denotes the hard probabilities that are used by the CE (cross-entropy) loss.
The teacher model $M_T$ outputs a score for each input from the dataset $D_{sn}$. Assume $(I, Y) \in D_{sn}$, where I is the set of structured features gathered from the dataset $D_t$ and Y is the corresponding label, which is identical to the label in the dataset $D_{us}$. The score provided by the model can then be denoted as $L_{T_i} = M_T(I_i)$, the hard probability distribution for each input as $Y_{T_i} = \sigma(L_{T_i}/\tau)$, and the softened probability as $\hat{Y}_{T\tau} = \sigma(L_T/\tau)$. The final loss function is given by Equation 12.
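Equation 12 itself is not reproduced in this text. As a hedged illustration, the sketch below implements a commonly used distillation loss in the style of Hinton et al. [27]: a weighted sum of the cross-entropy on the hard student probabilities and the KL divergence between the softened teacher and student probabilities. The weight `alpha`, the `tau ** 2` scaling, and all numeric values are assumptions of this sketch and may differ from the authors' formulation.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax of logits / tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, tau=2.0, alpha=0.5):
    """alpha * CE(label, Y_S) + (1 - alpha) * tau^2 * KL(Y_T_tau || Y_S_tau)."""
    # Hard-label term: cross-entropy on the unsoftened student output.
    y_s = softmax(student_logits)
    ce = -math.log(max(y_s[label], 1e-12))

    # Soft-label term: KL divergence between the softened distributions.
    y_s_tau = softmax(student_logits, tau)
    y_t_tau = softmax(teacher_logits, tau)
    kl = sum(
        p * math.log(p / max(q, 1e-12))
        for p, q in zip(y_t_tau, y_s_tau)
        if p > 0
    )

    # tau^2 keeps the soft term on a comparable scale across temperatures.
    return alpha * ce + (1.0 - alpha) * (tau ** 2) * kl


# Hypothetical logits: a student that agrees with the teacher, and one that does not.
loss_agree = distillation_loss([2.0, -1.0], [2.0, -1.0], label=0)
loss_disagree = distillation_loss([-1.0, 2.0], [2.0, -1.0], label=0)
```

When the student reproduces the teacher's logits, the soft term vanishes and only the cross-entropy term remains.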

IV. RESULTS
Several classification algorithms and techniques have been used to classify the opioid patients from both structured and unstructured datasets.

A. FROM STRUCTURED DATASET
When applying an ANN (Artificial Neural Network) and traditional machine learning algorithms to classify the tabular dataset, we obtained different accuracy values. Table 4 presents the results of the models used to classify the tabular dataset. The traditional machine learning classification algorithms, namely AdaBoost, logistic regression, support vector machine, XGB and random forest, provide a training accuracy of 95.3% to 96.7% and a testing accuracy of 92.3% to 93.4%. Among the traditional machine learning algorithms, the random forest classifier gives the best outcome in terms of both training and testing accuracy. The ANN, on the other hand, provides a training accuracy of 95.9% and a testing accuracy of 95.9%. Comparing the ANN with the traditional classification algorithms, we found that the ANN provides slightly better testing accuracy when classifying the tabular dataset.

B. FROM UNSTRUCTURED DATA
Several machine learning algorithms and techniques have also been used to classify the unstructured dataset, and they achieve different accuracy values. Table 6 presents the results of the different classification algorithms under the different approaches used to classify the unstructured opioid patient data. The results can be divided into two approaches: classification without KD (knowledge distillation) and classification with KD.
1) Classification without using KD. Without KD, the 1D-CNN, LSTM and Hybrid (LSTM + Attention) models achieved testing accuracies of 64.8%, 54.9% and 61.0%, respectively.
2) Classification using KD. With the KD technique, the outcomes improved for the same classification algorithms: the testing accuracies are 65.9%, 57.2% and 76.44% for the 1D-CNN, LSTM and Hybrid (LSTM + Attention) models, respectively. The 1D-CNN therefore performs best without KD, whereas the Hybrid model performs best with KD.

V. ABLATION STUDY
In an ablation study [45], a component of a machine learning architecture is removed or replaced to determine how the change affects the overall performance of the system. The performance of a model may remain stable, improve or get worse when these components are changed. The accuracy can be improved by experimenting with various hyper-parameters such as optimizers, learning rates, loss functions and batch sizes. Altering the model's architecture also affects overall performance. In this study, we examine our proposed models by arbitrarily removing or changing various components and parameters.
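One way to organize such a study is to change a single setting at a time while keeping all other settings at their defaults. The sketch below enumerates such configurations. The optimizer, learning rate, hidden layer and dropout values follow those mentioned in this paper, whereas the default dropout of 0.0 and the alternative batch sizes are hypothetical placeholders.

```python
# Default settings; the dropout default is an assumption of this sketch.
DEFAULTS = {
    "hidden_layers": 1,
    "batch_size": 32,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "dropout": 0.0,
}

# Alternative values tried one at a time; batch sizes are hypothetical.
ALTERNATIVES = {
    "hidden_layers": [3],
    "batch_size": [16, 64],
    "optimizer": ["sgd", "rmsprop"],
    "learning_rate": [0.1, 0.01],
    "dropout": [0.3, 0.5],
}


def ablation_configs(defaults, alternatives):
    """Yield (changed_key, config) pairs that differ from the defaults in one key."""
    for key, values in alternatives.items():
        for value in values:
            config = dict(defaults)
            config[key] = value
            yield key, config


configs = list(ablation_configs(DEFAULTS, ALTERNATIVES))
```

Each generated configuration can then be trained and its test accuracy compared with that of the default configuration.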

A. ABLATION STUDY 1: CHANGING HIDDEN LAYERS
The hidden layers lie between the input and output layers; in them, artificial neurons receive a set of weighted inputs and generate an output through an activation function. The performance of the model is influenced by the hidden layers. By default, we arbitrarily chose a single dense layer, which is the dense output layer. We observed a considerable change in the results of the CNN and attention-based models when we increased the number of hidden layers. The test accuracy of the LSTM-based model with three hidden layers, however, remains the same as that of the identical (default) model, although the training accuracy changes significantly. Table 7 presents the performance of different models for different numbers of hidden layers.

B. ABLATION STUDY 2: CHANGING BATCH SIZE
The batch size is the number of training samples used in one iteration. If fewer samples are used per iteration, training uses less memory overall. Mini-batches also typically help networks train more quickly, because the weights are updated after each propagation. Our experiments with different batch sizes show that a batch size of 32 appears to be optimal for all three models, and that changing the batch size can reduce test accuracy. Table 7 presents the performance of different models with different batch sizes.

C. ABLATION STUDY 3: CHANGING OPTIMIZER
We used different optimizers to investigate the performance of our models and found that the Adam optimizer performed best. For all three optimizers, we employ the same learning rate and loss function. For this dataset, SGD [46] did not perform well and RMSprop [47] did not outperform the Adam optimizer. Table 7 presents the performance of the different optimizers for the models.

D. ABLATION STUDY 4: CHANGING LEARNING RATE
The learning rate [48] determines how much the weights are changed at each update during training. It is a configurable hyperparameter used when training neural networks, and its value is typically small and positive, in the range between 0.0 and 1.0. The learning rate significantly impacts the performance of our models. For the majority of the models, 0.01 is the best learning rate. However, the accuracy of the attention-based model increased when the learning rate was 0.1 or 0.001. Table 10 shows the performance of the models with different learning rates.

E. ABLATION STUDY 5: CHANGING DROPOUTS
Dropout regularization [49] is a method to prevent neural networks from overfitting. Dropout disables neurons and their associated connections at random during training. This encourages all the neurons to develop their ability to generalize and keeps the network from relying too much on individual neurons.
Because our identical models have only one dense layer, we apply dropout in the primary layers, i.e., the LSTM and CNN layers. When we do not use dropout in the LSTM-based model, the accuracy improves, but it declines to the same level as before when we apply more dropout. The attention-based model achieves comparable test accuracy for nearby dropout rates. It performs slightly better with a dropout rate of 0.3, but worse if the dropout rate is increased to 0.50 or more. Table 11 shows the performance of our models at different dropout levels.
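The random disabling of neurons can be sketched as follows. The example uses the "inverted dropout" convention, in which the surviving activations are rescaled during training; this convention, the seed, and the toy activations are assumptions of the sketch rather than details reported in this study.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a list of activations.

    During training, each unit is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate) so the expected value is unchanged.
    At inference time the activations pass through untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)   # fixed seed so the example is repeatable
hidden = [1.0] * 1000    # hypothetical layer output
dropped = dropout(hidden, rate=0.3, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
```

With a rate of 0.3, roughly 30% of the units are zeroed in each training step, and a different random subset is disabled every time.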

F. SUMMARY OF THE ABLATION STUDY
We use the term identical accuracy for the result of a model with default hyperparameters. The accuracy only drops or increases when we change the hyper-parameters from the default; we therefore use the labels "Accuracy dropped" and "Accuracy improved" in the table.
We get the best accuracy for the 1D CNN-based model when the optimizer is Adam, the batch size is 32, there is 1 hidden layer, the learning rate is 0.01 and the dropout rate is 0.3.
The LSTM-based model performs well when we do not use any dropout, the learning rate is 0.01, the optimizer is Adam, the batch size is 32 and there is 1 hidden layer.
The attention-based model obtained 60.4% accuracy, which is the maximum for this model, when the learning rate is 0.1 with no dropout layers, the optimizer is Adam, the batch size is 32, and there is 1 hidden layer.

G. SELECTING THE BEST MODEL
The best model for approach 1 is the regular 1D CNN-based model, since it obtained the maximum accuracy among all the models applied. The gradient boosting classifier, on the other hand, is the best model for approach 2. Approach 1 is a direct classification approach from raw event note data without the assistance of a pre-trained model. The performance of the LSTM and attention-based models is comparable, and neither surpasses the accuracy of the 1D CNN-based model; we obtained 62.63% accuracy in this case with the 1D CNN-based model. Approach 2 relies on a pre-trained neural model named Stanza. Here, we use the Stanza model to extract the features (tests, problems, treatments) from the same data and train the classifier on these features. In this case, the gradient boosting algorithm obtained the maximum accuracy of 74%.

VI. DISCUSSION
Opioids are a class of drugs used for the relief of pain, and many studies show that the use of opioids in the USA is increasing day by day [50]. Opioids block the pain signals between the brain and the body, and their use may have long-term consequences such as addiction and even death. In our study, we find that ethnicity has a strong correlation with users' opioid behavior (see Table 3). Previous studies also report that the number of opioid-related deaths differs between ethnic groups. In 2018-2019, the distribution was 73% non-Hispanic White, 15% non-Hispanic Black, 7% Hispanic, and 6% other ethnic communities [51]. A significant increase of around 38% in the death rate was observed for non-Hispanic Black individuals from 2018 to 2019, but there was no overall change among the other ethnic groups [51]. Other studies [52] found a relationship between ethnicity and intentional opioid intake behavior. Patients who are Hispanic/Latino, White Russian, White Brazilian, Native Hawaiian or other Pacific Islander are likely to show 'No' intentional opioid intake behavior. On the other hand, Black/African American patients and a few other groups have a larger rate of 'YES' intentional opioid intake behavior.
Several studies [8], [9], [10], [52] show that there is a strong connection between users' mental health status and their opioid intake pattern. Among the 239.4 million U.S. adults, 38.6 million had a mental health disorder. Among the adults with mental health disorders, 18.7% are opioid users, compared with only 5.0% among those without mental health disorders [52]. The study also shows that approximately 115 million opioid prescriptions are distributed each year in the US, 51.4% (60 million prescriptions) of which are received by adults who have a mental health disorder [52]. Our study is consistent with these previous findings. We also found a strong relationship between patients' mental health and the intentionality of their opioid use. Almost every patient who suffers from depression, hypertension, or bipolar disorder has a tendency to misuse opioids intentionally. On the other hand, patients without mental health issues are unlikely to misuse opioids by overdosing.
Another interesting aspect is that mental health and social determinants are controlling factors of drug abuse. Studies [53], [54] have shown that individuals with low levels of education and those who fall into high unemployment and poverty categories are at a greater risk of opioid abuse. These studies additionally suggest that people with higher socioeconomic status are more prone to opioid abuse disorder than those with lower socioeconomic status.
We used several models to classify the opioid intentional intake pattern from the unstructured data: LSTM, CNN, and a hybrid model. Among the three models, the CNN model classifies the unstructured data most accurately. The LSTM and hybrid models need a larger dataset to train accurately, and with only 453 patients' histories available, those two models obtained accuracies in the range of 52% to 54%. The CNN, on the other hand, is a comparatively simple model and captures the word sequence better than the LSTM; the 1D CNN model was therefore able to reach an accuracy of 64%.
Figures 3 and 4 show the word clouds of the histories of patients whose intentional opioid intake is labeled NO and YES, respectively. If we observe the problems, tests, and treatments for both classes, we see many entities that are common to the two categories, so the features of the two categories overlap. The word clouds thus illustrate that classification is not an easy task, as the two classes share almost the same features. Figure 3 shows that intentional 'No' users have an indication of mental health issues. On the other hand, Figure 4 shows that intentional 'YES' users have a strong tendency toward mental health issues in their history, such as depression, mental status, etc.
Our study has a number of shortcomings. We prepared independent models, but a combined model could improve the accuracy. In our study, we identify that opioid usage is associated with users' mental health issues. However, our models do not determine which opioid is associated with which mental health issue (e.g., depression, obsessive compulsive disorder, schizophrenia).

VII. CONCLUSION
Opioid use is a global crisis among both young and older people. In our study, we built a dataset from the MIMIC-III database, in which we narrowed down a total of 26 relational tables to only 41 features. We then identified the features that are correlated with the opioid intentional YES/NO label and found three important features with a strong correlation. In this way, we built a tabular dataset that demonstrated good performance in predicting users' opioid intake behavior. We also built deep learning based models to predict users' opioid intake behavior from their historical information (event notes). With our tabular model, we obtained an accuracy of 93% using a random forest classifier. With our deep learning models (i.e., 1D CNN, LSTM + attention), we obtained an accuracy of 66% on patients' unstructured historical data. After applying the knowledge distillation mechanism from the tabular model to the deep learning based model, we obtained an overall accuracy of 76.44%. We also found some interesting correlations with users' mental health issues. There are a number of avenues to further improve our study. For example, we may increase the size of our dataset, which might require more manual work to discover opioid intake YES users.
SADDAM AL AMIN received the B.Sc. degree in physics from Southeast Missouri State University, Missouri, MO, USA, in 2016, and the M.Sc. degree in computer science from United International University, Dhaka, Bangladesh, in 2022. He is currently pursuing the master's degree in information systems with the University of Maryland. His thesis defense at UIU is on the verge of completion. He is also working as a Federal Contractor with the Health Organization, USA, where he supports the Data and Evaluation Division as a Data Analyst. His research interests include data analytics inclined to the health domain epidemiology, health disparities, and public health.  He is the author of more than ten peer-reviewed publications in international journals and conferences. His research interests include machine learning, natural language processing, data science, image processing, and human-computer interaction.
MOHIUDDIN AHMED (Senior Member, IEEE) has been educating the next generation of cyber leaders and researching to disrupt the cybercrime ecosystem. He has edited several books and contributed articles to The Conversation. His research publications in reputed venues have attracted more than 2500 citations, and he has been listed among the world's top 2% scientists for the 2020-2021 citation impact. He secured several external and internal grants worth more than A$1.3 Million and has been collaborating with both academia and industry. He has been regularly invited to speak at international conferences and public organizations and interviewed by the media for expert opinion. His research interests include ensuring national security and defending against ransomware attacks.
SAMI AZAM is currently a leading Researcher and a Senior Lecturer with the College of Engineering and IT, Charles Darwin University, Casuarina, NT, Australia. He has a number of publications in peer-reviewed journals and international conference proceedings. His research interests include computer vision, signal processing, artificial intelligence, and biomedical engineering. VOLUME 11, 2023