Depression Classification From Tweets Using Small Deep Transfer Learning Language Models

Depression detection from social media texts such as Tweets or Facebook comments could be very beneficial as early detection of depression may even avoid extreme consequences of long-term depression i.e. suicide. In this study, depression intensity classification is performed using a labeled Twitter dataset. Further, this study makes a detailed performance evaluation of four transformer-based pre-trained small language models, particularly those having less than 15 million tunable parameters i.e. Electra Small Generator (ESG), Electra Small Discriminator (ESD), XtremeDistil-L6 (XDL) and Albert Base V2 (ABV) for classification of depression intensity using Tweets. The models are fine-tuned to get the best performance by applying different hyperparameters. The models are tested by classification of depression intensity of labeled tweets for three label classes i.e. ‘severe’, ‘moderate’, and ‘mild’ by downstream fine-tuning the parameters. Evaluation metrics such as accuracy, F1, precision, recall, and specificity are calculated to evaluate the performance of the models. Comparative analysis of these models is also done with a moderately larger model i.e. DistilBert which has 67 million tunable parameters for the same task with the same experimental settings. Results indicate that ESG outperforms all other models including DistilBert due to its better deep contextualized text representation as it gets the best F1 score of 89% with comparatively less training time. Further optimization of ESG is also proposed to make it suitable for low-powered devices. This study helps to achieve better classification performance of depression detection as well as to choose the best language model in terms of performance and less training time for Twitter-related downstream NLP tasks.

Social media posts have been used for several disease detection tasks recently. The early warning symptoms of cancer can be detected from online activities of the people [5]. Depression and post-traumatic stress disorder from social media texts have been investigated in recent studies. The methods for depression classification can have a huge impact on the health of the general public. People use social media to express feelings of depression and loneliness [6], [7]. Studies show that the young generation is more likely to use social media to express suicidal thoughts than their parents or friends [8].
Keeping in view these repercussions, the primary objective of this study is to find the best language model among the transformer encoder language models with a smaller number of trainable parameters. The goal is to obtain better performance and less training time which can predict the intensity of depression in short text similar to tweets. This kind of study has not been done on the dataset of labeled tweets used in this work, as the dataset is newly created. Moreover, the majority of the work on depression detection focused on binary classification where the people suffering from depression are classified. However, this study takes this process one step further and focuses on the multi-class classification, and splits the victims of depression into different classes regarding the intensity of depression.
From this perspective, this study makes several contributions. It evaluates the performance of four language models by classifying labeled tweets into the depression severity classes 'severe', 'moderate', and 'mild'. Four small transformer-based language models (each with fewer than 15 million tunable parameters) are used for depression intensity classification through transfer learning on labeled tweets: Electra small generator (ESG), Electra small discriminator (ESD), XtremeDistil-L6 (XDL), and Albert base v2 (ABV). The four models are evaluated in terms of training time as well as F1 and specificity. We also compare their performance with DistilBert, which has 67 million tunable parameters and is therefore much larger than ESG, ESD, XDL, and ABV. Comparison with a larger model helps validate the reliability of the results. Moreover, the models are analyzed in terms of early over-fitting with respect to the F1 score. Because all four language models have fewer than 15 million parameters, they are suitable for transfer learning and tuning at relatively low computational cost. All models are fine-tuned downstream using different hyperparameters and are evaluated with different evaluation metrics.
The rest of the paper is organized into five sections. Section II discusses several important research works related to this study. The proposed approach is presented and explained in Section III. The experimental setup is elaborated in Section IV which is followed by the discussion of the results. In the end, the study is concluded.

II. RELATED WORK
In this section, we review the latest research regarding the use of transformer-based language models for natural language processing (NLP) applications such as text classification, named entity recognition (NER), as well as different disease predictions using social media texts which are helpful to find research gaps in the existing literature and contemporary state-of-the-art approaches.
Several investigations have been conducted regarding different NLP tasks with transformer-based language models. Bidirectional encoder representation from Transformers (BERT) is a very famous language model that has been used in several state-of-the-art studies for obtaining contextualized embedding of textual data for different NLP tasks. For example, researchers have been using BERT for deep contextualized embedding to perform sentiment analysis by using the downstream fine-tuning of the parameters and attention heads. In addition, deep transfer learning is also used which takes the pre-trained transformer model and further fine-tunes it for specific tasks [9], [10], [11], [12]. BERT is also used in combination with other deep sequence models such as gated recurrent unit (GRU) and long short-term memory (LSTM), etc. Other approaches also compared BERT-based sentiment analysis with lexicon-based study and BERT showed much better results than other models [13]. BERT is also studied in combination with convolution neural networks (CNN) which improved the performance as compared to the original BERT model [14]. Several studies used BERT for biomedical NER which has been done by combining the conditional random field (CRF) layer with multi-lingual BERT [15]. Similarly, Arabic biomedical NER has been done using a variant of BERT named AraBERT and multi-lingual BERT to obtain an F1 score of 85% [16]. In the same way, Chinese clinical NER is done using BERT with bidirectional LSTM and CRF in [17] and [18].
As BERT is a generic language model that can be used in different important NLP tasks such as question answering, text classification, NER, and sequence tagging, different distilled versions of BERT have been introduced to reduce the size of language models [19], [20]. Distillation [21] is the process of training a small model as a 'student' which should mimic, as closely as possible, the performance of a larger 'teacher' model such as BERT. As the original BERT base model has 110 million tunable parameters, which is rather large to be fine-tuned on a machine with limited computational resources, different distilled models based on BERT have also been proposed. For example, DistilBert [22] is a similar model which is smaller in terms of tunable parameters and faster in terms of training performance. Researchers used different techniques to distill the larger models to compress or reduce the parameters. The goal is to reduce the model size with minimal compromise on the performance of the model [23], [24], [25], [26]. BERT distilled variants for multi-lingual text have also been proposed [27]. In [28], the authors proposed a model based on BERT and DenseNet which identifies multimodal tweets containing both image and text data during a disaster. By combining BERT fine-tuning with aspect-based sentiment analysis, a model for event detection using Twitter data is proposed in [29].
Several recent studies used social media data from Twitter, Instagram, Reddit, etc. by applying pre-trained language models like BERT. For example, [30] used social media posts from Twitter and Facebook with BERT and deep learning models for analyzing attitudes towards COVID-19 vaccines. In another study, the authors classified garlic-related misinformation in the context of COVID-19 using different BERT-based pre-trained model variants and a large Twitter corpus [31]. The study [32] proposed a model named DICE which uses the deep contextualized embedding of BERT in addition to a bidirectional LSTM network for sentiment detection. Using the combined text and image features of tweets with BERT fine-tuning and DenseNet, a model has been proposed to detect multimodal tweets in disasters [28]. A hybrid solution for event detection on Twitter is proposed in [29] by fine-tuning BERT with aspect-based sentiment analysis.
Early depression detection using the Beck Depression Inventory is carried out on Reddit posts by applying BiLSTM and the Albert language model in [32]. Using BERT and three of its variants, a study was proposed to detect and classify social media toxicity using a Kaggle dataset [33]. Different variants of transformer language models have been proposed for biomedical text, such as BioElectra, BioAlbert, and BioBert [34]. ELECTRA, BERT, and LSTM are combined for detecting suicidal tendencies from social media text [35]. By combining ELECTRA with LSTM, a model was proposed for the emotional classification of Chinese text by fine-tuning the parameters with a softmax classification layer on top of the network [36]. BERT transfer learning in combination with CNN has been applied for the classification of Twitter posts containing both images and text in [37]. Similarly, a study used BERT feature embedding for binary classification of Twitter-based user depression [38]. For detecting depression in Arabic society, [39] proposed CairoDep, which uses Arabic BERT variants such as AraBERT and MARBert.
Numerous studies have been conducted specifically for depression detection-based tasks using social media data by applying different transformer-based architectures. For example, using Twitter application programming interfaces (APIs), the depression-related tweets are collected and filtered by dividing Twitter users into 'diagnosed' vs 'control' in [40]. With the help of the geolocation field in tweet data, authors separated tweets concerning the country of origin, keeping the diagnosed vs control tweets separated for each country. They applied different existing machine learning and deep learning models to compare the results. Results show that BiLSTM-SELFA gives better performance and obtains up to 68% F1 score for the binary classification task. The study also analyzed the relationship between events like Christmas and COVID-19 with depression. A tweet monitoring framework is proposed to detect the user who is suspected to be at risk of depression using the social media data in [41].
The authors proposed a machine learning model to identify sadness among school students in [42]. The study showed that depression is the second stage of sadness, and often results from excessive stress and anxiety due to school workload. To find different health disorders, specifically depression, a qualitative analysis was performed in [43]. For data annotation, coding schemes of 6 resources were developed based on depression symptoms and psycho-social stress provided by different research articles. Studies [44], [45] use latent Dirichlet allocation (LDA) to find depression among students. A large dataset of tweets is used for experiments with an approach based on the autoregressive integrated moving average (ARIMA). Different depression and suicide-related trends and their corresponding deviations are also identified. To identify suicidal thoughts using Twitter data, the suicide artificial intelligence prediction heuristic (SAIPH) is proposed in [46]. The authors constructed different binary classification models for different use cases such as stress, insomnia, anxiety, loneliness, etc.
Analysis of the existing literature on depression detection and classification using social media data shows that such works predominantly focus on binary classification. The problem of multi-class classification is rarely studied in the context of depression. Often the focus of the studies is to divide the subjects into depressed and healthy using tweets or short texts from Reddit, Facebook, etc. Therefore, this study focuses on depression intensity classification where the intensity is categorized as 'severe', 'moderate', and 'mild'.

III. MATERIALS AND METHODS
This study presents an approach for depression intensity classification using tweets by employing four small transformer encoder-based language models. Performance evaluation of all models is carried out to quantify the best model with a higher F1 score and less training time. Figure 1 shows the workflow diagram of the study.

A. DATASET
For experiments, depression-related tweets are extracted. For this purpose, tweets are gathered using Twitter public APIs with different depression-related hashtags as seed words. Previous studies show that users suffering from depression tend to show negative sentiments in tweets [47]. Further, tweets are annotated using the Python libraries valence aware dictionary and sentiment reasoner (VADER) and TextBlob, which calculate quantitative sentiment polarity and subjectivity scores of tweets. Tweets with low subjectivity are discarded from the dataset to keep opinionated tweets only. Table 1 shows a few sample tweets from the collected data along with their assigned labels.
Here, D denotes the dataset containing the collected tweets, and a scoring function computes a quantitative sentiment score for each tweet. Labels are then assigned based on this score.
Although several social media platforms have been used for depression analysis recently, including Twitter, Reddit, Weibo, etc., this study selects Twitter for two reasons. First, a predominantly large number of studies have used Twitter for data extraction and depression analysis. Twitter and Reddit have been the most popular platforms for depression-related machine learning and NLP problems [50] as compared to other social media platforms. Second, the number of users on the Twitter platform is high compared to other social media platforms that use the English language. Table 2 shows the number of records for each class after the collected tweets are labeled using VADER. It indicates that the 'mild' and 'moderate' classes have an almost similar number of samples while the 'severe' class has a comparatively lower number of samples.
Figure 1 shows the architecture of the classification of depression through deep transfer learning using downstream fine-tuning of pre-trained language models. Contemporary NLP applications and systems predominantly use pre-trained language models with encoder transformers. These models are first trained on a large corpus and then serve as pre-trained starting points that are further fine-tuned for specific NLP tasks such as text classification and NER. Fine-tuning of these pre-trained models can be done by adding extra layers according to the nature of the NLP task the model is trained for. Transformer-based models with the self-attention technique changed the deep contextual text representation for language models.

B. TRANSFORMER ENCODER-BASED PRE-TRAINED LANGUAGE MODELS
As text is a sequence of characters or words, sequence-based deep neural network models such as RNN, GRU, and LSTM are a natural fit for text data. However, an RNN-based architecture feeds the input into the network word by word, whereas a transformer-based model processes the input text as a whole, so the transformer benefits far more from parallel training. Transformer models use a self-attention mechanism that covers the context of long sequences, e.g. long sentences. The basic encoder-decoder structure of a transformer is the same as that of RNN- or LSTM-based models; the main difference is that data can be processed in parallel, which makes it possible to train on a large corpus. Positional encoding supplies word-order information, and the resulting contextual word vectors are not available in static word embeddings such as word2vec or global vectors (GloVe). Each encoder block has a multi-head attention layer and a fully connected layer. To capture the context of a word in a text document, the input is first embedded to obtain a vector representation of the word. Two of the main components of the decoder block are the same as the ones used in the encoder blocks. For each word in the text document, self-attention indicates how strongly it is related to the other words. The attention output is passed to the feed-forward layer, followed by a linear layer and softmax probabilities.
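The self-attention step described above can be illustrated with a minimal sketch of scaled dot-product attention. This is a simplified illustration, not the implementation used in the study: the token vectors are hypothetical toy values, the same vectors serve as queries, keys, and values, and the learned projection matrices and multiple heads of a real transformer are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    # Each output vector is the attention-weighted sum of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return weights, outputs


# Toy sequence of three tokens with 2-dimensional vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, attn_outputs = self_attention(tokens, tokens, tokens)
```

Each row of `attn_weights` sums to one and indicates how strongly one token attends to every token in the sequence. All rows are computed independently of each other, which is what allows the whole sequence to be processed in parallel.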

1) BERT
The BERT model has proved to be very helpful in complex tasks on natural language datasets such as sentiment classification and prediction of masked words. Its architecture is based on a stack of trained transformer encoders. It represents each word together with its context from the sentence or document. The BERT model helps to retain long-term dependencies in sequences of up to 512 tokens by using the self-attention mechanism. A context of 512 tokens is sufficient for most NLP tasks, but inputs with more than 512 tokens may have to be truncated to train the model. BERT uses a loss function based on the score of masked word prediction to get the bidirectional context of masked words. Further, BERT also uses next-sentence prediction during training, in which the model predicts whether the second of two sentences follows the first. Natural language inference and semantic text similarity are expected to be improved with the help of next-sentence prediction.
During the training phase, BERT gains an understanding of a token's context from both the left and right sides to get a deep contextualized representation of the text document. Translation, question answering, text classification, and text summarization are just some of the practical use cases of BERT and BERT-based models. A contextual language understanding is required for all of the mentioned examples. BERT can be trained in English or multiple languages. Downstream training of the model can make it better for a specific dataset. The training of BERT can be done in two parts; in the first part language context is identified using a self-attention mechanism and in the second part, finetuning of tunable parameters can be done to get a high score prediction. Pre-training in BERT makes it fit to learn the deep context of the word within sentences or paragraphs. Next, we discuss other BERT-based language models used in this study which follow the BERT architecture with further reduction in model size by using different distillation techniques e.g. DistilBert, Electra, Albert, and Xtremedistil, etc.

2) ELECTRA
Electra stands for 'efficiently learning an encoder that classifies token replacement accurately'. There are small and large versions of Electra available, but in this study, we only use Electra small which contains 13.5 million tunable parameters. Electra has been trained jointly with two models i.e. generator and discriminator. The generator is trained using a masked language model (MLM) by replacing random words with a mask to fine-tune the model for predicting the masked words [36], [51]. On the other hand, the discriminator is trained to identify which tokens match the original input from the generator samples.

3) XtremeDistil
There are two types of knowledge distillation: task-specific distillation, which compresses a model that has been trained for a particular task, and task-agnostic distillation. The latter has the benefit that the model only needs to be distilled once and can then be fine-tuned for any downstream NLP task, whereas the former has the advantage of comparatively high compression of the model. XDL uses the task-agnostic distillation method, with large BERT and Electra models as teachers and a small version of MiniLM as the student. XDL has multiple variants available on 'huggingface' that differ in encoder layers, hidden size, and attention heads. In this study, Xtremedistil-l6-h256-uncased is used to keep the model small, as it has only 12.7 million tunable parameters [52].

4) ALBERT
Several variants of Albert are available on 'huggingface', i.e. Albert base, large, xlarge, and xxlarge. Larger variants have more tunable parameters. Although the largest Albert model, xxlarge, has fewer parameters than BERT, it is computationally more expensive than BERT due to its bigger structure. Albert uses two model compression techniques to reduce parameters. First, it factorizes the embedding matrix into two relatively small matrices, which decouples the hidden layer size from the embedding size and thus makes it easier to increase the hidden layer size. Second, Albert shares parameters across layers [53]. To improve performance, Albert uses a sentence order prediction (SOP) loss, which gives better performance than the next-sentence prediction loss used in the BERT model.

C. PREPROCESSING OF TWEETS
To minimize the noise in data, pre-processing steps are essential for NLP-based tasks in general. Hashtags and uniform resource locators (URLs) are removed from the collected data. User identities that start with the @ sign are also removed. Non-ASCII words are replaced with white space.
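The cleaning steps listed above can be sketched with regular expressions as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `clean_tweet` is hypothetical, a hashtag is assumed to be removed together with its text, and the final whitespace collapsing is an added convenience.

```python
import re


def clean_tweet(text):
    """Remove URLs, @mentions, hashtags, and non-ASCII words from a tweet."""
    # URLs are removed first so that '#' or '@' inside a link is not treated separately.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove user mentions and hashtags (assumption: the whole token is dropped).
    text = re.sub(r"[@#]\w+", " ", text)
    # Replace any word containing a non-ASCII character with white space.
    text = re.sub(r"\S*[^\x00-\x7F]\S*", " ", text)
    # Collapse repeated white space (an added convenience, not stated in the text).
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical example tweet.
sample = "@user I feel so alone #depression https://t.co/abc café today"
cleaned = clean_tweet(sample)
```

For the hypothetical tweet above, only the plain English words of the message remain after cleaning.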

IV. RESULTS AND DISCUSSIONS

A. TRAIN VALIDATION TEST SPLIT OF DATA
In this step, the labeled tweets are split into training, test, and validation sets while ensuring an equal ratio of each class in all sets. For splitting, the 'train_test_split' function of the Python Sklearn library is used, which supports stratified splitting of data. Table 3 shows the four small models selected for this study, i.e. ESG, ESD, XDL, and ABV. The performance of these models is further compared with the larger model DistilBert for the classification of tweets into the three class labels 'severe', 'moderate', and 'mild'. Tweets are tokenized by the tokenizer provided for each respective model on 'HuggingFace', which maps tokens to their respective IDs. The maximum token length of a tweet within the dataset is found to be 62, so all tweets are padded to a fixed length of 64 tokens. The dataset is divided into three splits, i.e. 70% for the training set with 51348 tweets, 15% for the test set with 11004 tweets, and 15% for the validation set with 11003 tweets.
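The idea of a stratified 70/15/15 split can be illustrated with a dependency-free sketch. The study itself uses scikit-learn's 'train_test_split', whose `stratify` argument serves the same purpose; the helper `stratified_split`, the seed, and the class counts below are hypothetical and are not taken from Table 2.

```python
import random
from collections import defaultdict


def stratified_split(labels, train=0.70, test=0.15, seed=42):
    """Return index lists for train/test/val with equal class ratios in each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "test": [], "val": []}
    for indices in by_class.values():
        # Shuffle within each class, then cut the class into three parts.
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_test = round(len(indices) * test)
        splits["train"] += indices[:n_train]
        splits["test"] += indices[n_train:n_train + n_test]
        splits["val"] += indices[n_train + n_test:]  # remainder goes to validation
    return splits


# Hypothetical, imbalanced label distribution with a smaller 'severe' class.
labels = ["mild"] * 400 + ["moderate"] * 400 + ["severe"] * 200
splits = stratified_split(labels)
severe_in_train = sum(1 for i in splits["train"] if labels[i] == "severe")
```

Because the split is made per class, the minority 'severe' class keeps the same proportion in the training, test, and validation sets.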

B. EXPERIMENTAL SETUP
For depression intensity detection, a classification layer is included at the end of each model as shown in Figure 2. The classification layer consists of a dropout layer followed by a softmax layer of size three, which represents the three intensity levels of depression. The dropout layer is added to avoid early over-fitting of the models during the training phase. Electra is pre-trained using Wikipedia and Bookcorpus [54]. A corresponding TensorFlow-based sequence classification interface, e.g. 'TFElectraForSequenceClassification', is used with each model. By feeding the training data to the pre-trained model, the classification layer is trained together with all tunable parameters on the specific task of depression intensity classification.
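The classification head described above (dropout followed by a three-way softmax) can be sketched in plain Python. This is only an illustration of the computation: the pooled encoder vector, the weight matrix, the biases, and the dropout rate of 0.1 are hypothetical values, and in the study this head is provided by the sequence classification interface of each model.

```python
import math
import random


def dropout(vector, rate, rng, training):
    """Inverted dropout: zero units with probability `rate` during training only."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, biases, rate=0.1, rng=None, training=False):
    """Dropout, then a linear layer with one output per class, then softmax."""
    rng = rng or random.Random(0)
    hidden = dropout(pooled, rate, rng, training)
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, biases)]
    return softmax(logits)


LABELS = ["mild", "moderate", "severe"]
pooled = [0.2, -0.5, 0.9, 0.1]           # hypothetical encoder output
weights = [[0.1, 0.3, -0.2, 0.0],        # hypothetical weights, one row per class
           [0.0, -0.1, 0.4, 0.2],
           [0.5, -0.6, 0.8, 0.1]]
biases = [0.0, 0.0, 0.0]

probs = classify(pooled, weights, biases)  # inference mode: dropout is inactive
predicted = LABELS[probs.index(max(probs))]
```

Dropout is active only when `training=True`; at prediction time the full vector is used, so the output probabilities are deterministic.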
Fine-tuning is done using the following hyperparameters. Three different learning rates, 2e-5, 5e-5, and 8e-5, are used to evaluate the models' performance at low, average, and high learning rates. Each experiment is done using the 'Adam' optimizer. The batch size is set to 64 for all experiments. The deep learning frameworks Tensorflow and Keras are used for model training. For optimal training performance, the one cycle learning policy [55] is used, which adjusts the learning rate during the training process. In the first part of training, the learning rate gradually increases, while it gradually decreases in the second part. An Nvidia Tesla P100 GPU with 12 GB of RAM on an Ubuntu-based machine is used for all experiments. Table 4 shows the experimental settings for the models selected for depression intensity classification.
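The rise-then-fall behavior of the one cycle policy can be sketched as a simple piecewise-linear schedule. This is a simplified illustration rather than the exact policy of [55] or the configuration used in the experiments: the function name, the starting factor of 0.1, the midpoint peak, and the choice of two epochs are assumptions made for the example.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, start_factor=0.1, warmup_fraction=0.5):
    """Linear ramp up to max_lr, then linear ramp back down."""
    peak_step = max(1, int(total_steps * warmup_fraction))
    min_lr = max_lr * start_factor
    if step <= peak_step:
        # First part: the learning rate gradually increases.
        progress = step / peak_step
        return min_lr + (max_lr - min_lr) * progress
    # Second part: the learning rate gradually decreases.
    progress = (step - peak_step) / max(1, total_steps - peak_step)
    return max_lr - (max_lr - min_lr) * progress


# 51348 training tweets with batch size 64, for an assumed two epochs.
steps_per_epoch = math.ceil(51348 / 64)
total_steps = 2 * steps_per_epoch
schedule = [one_cycle_lr(s, total_steps, max_lr=8e-5) for s in range(total_steps + 1)]
peak_index = schedule.index(max(schedule))
```

With these assumptions, the learning rate peaks at 8e-5 at the end of the first epoch and returns to its starting value at the end of the second.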

C. EVALUATION METRICS
The softmax layer predicts class labels by applying the trained model to the test dataset. A confusion matrix of 3×3 dimension is created for each experiment concerning the true label and predicted label. The confusion matrix provides accuracy for each class while also showing misclassification. The evaluation metrics such as accuracy, precision, recall, F1, and specificity are used to evaluate all models using scores from their corresponding confusion matrices.
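These metrics are calculated using the following equations, where TP stands for true positives, FP for false positives, FN for false negatives, and TN for true negatives.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
The 'severe' depression intensity class has a lower number of samples as compared to the two other labels, so F1 is an important metric due to the imbalanced nature of the dataset. Micro average scores are used to depict the performance of multi-class classification of depression detection.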
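The following sketch shows one way these metrics can be derived from a 3×3 confusion matrix, both per class (one-vs-rest) and micro-averaged by summing the counts over classes. The confusion matrix values and the helper names are hypothetical and do not correspond to the results reported in this study; rows are assumed to be true labels and columns predicted labels.

```python
LABELS = ["mild", "moderate", "severe"]


def per_class_counts(cm, k):
    """One-vs-rest TP, FP, FN, TN for class k (rows = true, columns = predicted)."""
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(cm[i][k] for i in range(n)) - tp
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1, and specificity from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Hypothetical confusion matrix for 250 test tweets.
cm = [[90, 8, 2],
      [10, 85, 5],
      [3, 7, 40]]

per_class = {label: metrics(*per_class_counts(cm, k)) for k, label in enumerate(LABELS)}

# Micro average: add up the counts of all classes, then apply the same formulas.
totals = [sum(c) for c in zip(*(per_class_counts(cm, k) for k in range(len(LABELS))))]
micro = metrics(*totals)
```

Per-class values make the behavior on the smaller 'severe' class visible, while the micro average summarizes all three classes in a single score.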

D. RESULTS
Each encoder model is trained using a training dataset with validation performed on the validation dataset during the training. Moreover, the best model weights are saved and further used to predict the labels on test data to get samples of each model's performance for each experiment.
Training and testing accuracy and loss graphs are shown in Figures 3 and 4 for learning rates of 2e-5, 5e-5, and 8e-5. All encoder language models show very good results in terms of training loss, validation and test accuracies, and fast convergence within a few training epochs. In particular, at the learning rate of 8e-5, ESG, ESD, and XDL converge quickly to the highest validation accuracy, which indicates that at this learning rate only two epochs are enough to reach the highest performance of downstream fine-tuning for classification. On the other hand, ABV is not very smooth at a learning rate of 8e-5 but shows very smooth loss and validation curves at the lower learning rates of 5e-5 and 2e-5.
Regardless of test accuracy and training time, ESG and ABV obtain the best F1 score of 89%, which exhibits their better capability of obtaining a deep contextualized representation from short texts like tweets for the multi-class classification task. But if training time is also taken into consideration, ESG is the clear winner, as it reaches the 89% F1 with an average training time of 130 seconds per epoch, which is much less than ABV, which takes 410 seconds per epoch on the same machine and GPU environment with the same hyperparameter settings of learning rate, batch size, etc. XDL, as its name indicates, is the fastest encoder model in the current study; it takes only 75 seconds of training time for one epoch and achieves an F1 score of 88%. If a small compromise on classification performance is acceptable, XDL is an exceptionally good model, which gives appropriate accuracy and F1 at a very small training cost and competes with the advanced models in terms of capturing the sequence features using a contextualized representation of text in depression classification. It is also recommended for low parallel computing resources, as well as CPU-only machines. XDL and ESD yield relatively low F1 scores compared to ESG and ABV. Figure 5 shows the confusion matrices of all the models used in this study to indicate their performance regarding the correct and wrong predictions at different learning rates. It shows that the best results are obtained at the learning rate of 8e-5 for all models regarding the number of correct predictions, while the highest number of correct predictions is obtained by the ESG model, i.e. 9753 correct predictions, followed by ABV, ESD, and XDL with 9750, 9733, and 9632 correct predictions, respectively. The lowest number of correct predictions, 9273, is made by the XDL model when trained using a 2e-5 learning rate. Table 6 shows the results regarding the micro average.
ABV and ESG outperform DistilBert regarding F1 score in the same experimental settings, even though DistilBert is a much larger model with 67 million parameters. Although ABV performs extremely well in terms of F1 and accuracy, ESG is advantageous and preferred over ABV because of its fast training time and early convergence. The highest F1 score is obtained by the ABV model, which is 0.89 when a learning rate of 8e-5 is used, and the same is true for its sensitivity. Table 7 summarizes the model training time regarding the experiments. As previously discussed, if a small compromise can be made regarding accuracy, XDL is the best model as it requires a substantially shorter training time as compared to other models. As we deliberately selected small distilled language models, the number of parameters varies within a narrow range of 12.7 million (XDL) to 13.5 million (ESG). Average training time per epoch ranges from 75 seconds (XDL) to 410 seconds (ABV). Although models with a larger number of parameters tend to show high performance, this is not always the case. For example, in our case, ABV is the smallest model used for experiments regarding the number of parameters but still shows a high F1 score of 89%. At the same time, ABV is the slowest model to train due to its complexity, regardless of its small size.

E. OPTIMIZATION OF MODEL
Optimization of the model is required when a deep learning model has to be deployed on a device that is constrained in terms of computational power, memory usage, internet speed, etc., for example, mobile devices, IoT-based devices, and microcontroller devices. Another use case of optimization is deploying a model on specially designed hardware for smartphones and micro-controllers [56], [57]. There are two types of quantization, i.e. post-training quantization and quantization-aware training. The former is applied after training the model and the latter is applied during the training phase.

F. POST TRAINING QUANTIZATION OF MODEL
Quantization is the process of approximating the floating point numbers that make up the neural network weights, stored as tensors, with lower-precision values to significantly reduce the model size. To further optimize the best-performing model in this study, which is ESG, post-training quantization is proposed to reduce the model size so that it may be deployed on mobile devices or in a client-server architecture with low latency. ESG is already trained and fine-tuned for the classification of depression, but in its full-precision form it is only suitable for deployment on desktop-based systems. This study therefore also aims to produce a lighter version of the model using the post-training quantization technique, which can be deployed on embedded devices as well as smartphones. Tensorflow (TF) Lite has been used for the quantization process [58]. TF Lite is specially designed for compressing deep learning models with different model optimization techniques so that models can be easily deployed on embedded devices with small memory.
ESG is trained with 32-bit floating point precision and the weights of the trained model are stored in the same format, which makes the model quite large to load into the main memory of embedded and low-powered devices [59], [60]. Further, during prediction, the model also needs to perform intensive floating point calculations, which is also impractical for low-powered devices. The trained ESG for depression classification is 57 MB at the full precision of 32-bit floating point. Our aim is to quantize the model without compromising the classification performance of ESG. The workflow of quantization of ESG and its evaluation is shown in Figure 6.
The ESG architecture mainly consists of two layers, i.e., a multi-head attention layer and a classification layer of a simple perceptron followed by a softmax layer. At the attention layer, the attention matrix is calculated by the dot product of the query matrix and the key matrix. The most expensive calculations are the matrix products from the multi-head attention layer through to the classification layer, and these are optimized by quantizing all weights from 32-bit floating point numbers to 16-bit floating point numbers. This reduces the model size as well as the cost of classifying new instances. We chose 16-bit floating point quantization because it is well-suited for GPU-based smartphones. The 8-bit integer quantization option is also viable, but it is not supported on all kinds of hardware.
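The effect of storing weights in 16-bit rather than 32-bit floating point can be illustrated with a minimal sketch that uses only the Python standard library. It is not the TF Lite workflow used in this study: the weight values and the helper names `to_float32_bytes`, `to_float16_bytes`, and `from_float16_bytes` are hypothetical, chosen only to show that half precision halves the storage while introducing a small rounding error.

```python
import struct

def to_float32_bytes(weights):
    """Pack weights as IEEE 754 single precision (4 bytes per value)."""
    return struct.pack(f"<{len(weights)}f", *weights)

def to_float16_bytes(weights):
    """Pack weights as IEEE 754 half precision (2 bytes per value)."""
    return struct.pack(f"<{len(weights)}e", *weights)

def from_float16_bytes(blob):
    """Unpack half-precision bytes back into Python floats."""
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))

# Hypothetical weights, for illustration only (not taken from ESG).
weights = [0.1234567, -0.9876543, 0.0004321, 1.5, -2.25]

fp32_blob = to_float32_bytes(weights)
fp16_blob = to_float16_bytes(weights)
restored = from_float16_bytes(fp16_blob)

# Largest absolute rounding error introduced by the 16-bit representation.
max_abs_error = max(abs(w - r) for w, r in zip(weights, restored))

print("float32 bytes:", len(fp32_blob))
print("float16 bytes:", len(fp16_blob))
print("max abs rounding error:", max_abs_error)
```

In this toy case the 16-bit encoding takes exactly half the bytes, and values such as 1.5 and -2.25 survive unchanged because they are exactly representable in half precision; the remaining values are rounded to the nearest representable number.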
The experiments show that the proposed optimization through quantization saves around 50% of memory by reducing the model size from 57 MB to 27 MB, which is slightly less than half of the original size. The compression ratio of the model is therefore approximately 2. Quantization also reduces the computational cost of prediction while maintaining almost the same classification performance as the original model trained at 32-bit floating point precision, with an accuracy of 92% and an F1 score of 88%. Accuracy remains the same, and the slight reduction in the F1 score is insignificant given the compression ratio achieved.
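The compression figures can be checked from the two reported model sizes. The short sketch below is only a worked calculation; the function name `compression_stats` is an illustrative choice, and the inputs are the 57 MB and 27 MB sizes stated in the text.

```python
def compression_stats(original_mb, quantized_mb):
    """Return (compression ratio, fraction of memory saved)."""
    ratio = original_mb / quantized_mb
    saved_fraction = 1.0 - quantized_mb / original_mb
    return ratio, saved_fraction

# Model sizes reported for ESG before and after float16 quantization.
ratio, saved = compression_stats(57.0, 27.0)

print(f"compression ratio: {ratio:.2f}")  # prints "compression ratio: 2.11"
print(f"memory saved: {saved:.1%}")       # prints "memory saved: 52.6%"
```

The computed values, a ratio of about 2.11 and a saving of about 52.6%, agree with the rounded figures of 2 and roughly 50% quoted above.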

V. CONCLUSION
This study performs depression intensity classification using Twitter data by conducting experiments on four small transformer-based language models. A comprehensive evaluation of these models is performed using transfer learning and downstream fine-tuning for multi-class classification of depression intensity. A dedicated corpus of 73355 tweets is created for the experiments, comprising three levels of intensity, i.e., 'severe', 'moderate', and 'mild'. ESG proves to be the most effective model and outperforms the other models with an F1 score of 89% and a relatively short training time of 130 seconds per epoch. In addition, it converges within two epochs at a slightly higher learning rate of 8e-5. ABV is the best-performing model in terms of accuracy. Further, the performance of transformer models with less than 15 million parameters is compared with DistilBert, a more advanced model with 67 million parameters. The results show that the performance of the small language models is comparable to that of DistilBert, even though it is a much larger model in terms of tunable parameters. This study provides a foundation for the choice of small language models for the classification of tweets in general and depression classification in particular. It also helps researchers and data scientists to choose small language models that give sufficiently good performance in less training time. Further, quantization of the best-performing model, i.e., ESG, is proposed, which reduces the model size to about half of the original size with no reduction in accuracy and an insignificant reduction in F1 score. Quantization enables the model to be deployed on constrained devices with low hardware resources. Moreover, accurate depression intensity classification helps early detection of depression to avoid the worst-case scenario of suicide.
In the future, we will compare the performance of our proposed model with a weighted soft-voting ensemble of conventional machine learning algorithms such as Naive Bayes, Logistic Regression, Support Vector Machine, etc., to seek a model with the same performance but a shorter training time. The area under the Receiver Operating Characteristic (ROC) curve shall also be used as an evaluation metric, in addition to accuracy, recall, F1, and precision, to observe the performance of the models more precisely. Because the model was trained on tweets, which are short texts, it might not be suitable for predicting depression intensity in longer snippets of text. In the future, the model may be trained using Reddit data to make it generalize better to the prediction of depression in both shorter and longer texts.
ARIF MEHMOOD received the Ph.D. degree from the Department of Information and Communication Engineering, Yeungnam University, South Korea, in November 2017. He is currently working as an Assistant Professor with the Department of Computer Science and IT, Islamia University of Bahawalpur, Pakistan. His recent research interests include data mining, mainly working on AI and deep learning-based text mining and data science management technologies.
BENJAMÍN SAHELICES is currently working as a Professor with the Department of Informatics, University of Valladolid, Spain. His research interests include computer architecture and parallel computing.
VOLUME 10, 2022