CBiLSTM: A Hybrid Deep Learning Model for Efficient Reputation Assessment of Cloud Services

The cloud market is characterized by fierce rivalry among cloud service providers. The availability of various services with identical functionalities on the market complicates the selection decision for service requesters. Although objective trust measurements can be used to evaluate the trustworthiness of services, they are not always available and are static in nature. Subjective approaches are not always viable because they often require repeated service invocations to collect client feedback. To overcome these limitations, we propose, in this paper, a reputation-based trust assessment approach that combines the Net Brand Reputation (NBR) measure with a deep learning-based sentiment analysis model using online user reviews. CBiLSTM is the name of the proposed deep learning model that hybridizes the Convolutional Neural Networks (CNN) and the Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN layers deal with text inputs’ high dimensionality, while the BiLSTM layer explores the context of the extracted features in both forward and backward directions. CBiLSTM was trained on a new dataset named CLOSER-DREAM, containing more than 13,000 reviews relating to several emerging cloud services to classify these reviews and assess the overall reputation of the cloud services providers. The results of the series of experiments that were conducted have shown that CBiLSTM outperforms the classic deep learning models with 98% of precision, 99% of recall, 98% as an F1-score, and 99.7% of accuracy. Also, CBiLSTM offered a reasonable training time of about 519ms with the CLOSER-DREAM dataset. The classification obtained by applying CBiLSTM was proven to be an effective method to calculate the NBR measure used for the reputation assessment of cloud service providers. The proposed technique yielded an NBR score of 98.3% for Google cloud services, which is close to the real/actual NBR of 96.25%.


I. INTRODUCTION
Cloud computing is a robust model that enables the delivery of on-demand computing resources over the internet on a pay-as-you-go basis. This technology has been on a rapid upward trajectory in recent years. According to a recent Gartner (https://www.gartner.com/en) report, global public cloud spending would approach 45% of total company IT spending by 2026, up from less than 17% in 2021 [1]. The growing cloud services market is also characterized by fierce competition between service providers [2]. Each service provider claims to offer services that best satisfy user requirements regardless of the latency, privacy, security, or trust-related issues the service may have [3][4][5]. Although the intense competition is a sign of a healthy market, it complicates the selection decision for service requesters.
To make informed decisions, potential users need to assess the trust of the candidate cloud service providers. In this context, trust refers to the level of confidence placed in a provider, which reflects the provider's capabilities, reliability, and honesty [5][6]. To assess trust, users may refer to the formal quality measures published in the Service-Level Agreement (SLA) established between the service provider and the service user, and to the standard audit reports provided by third parties [7]. These objective trust methods evaluate service conformance with promised Quality-of-Service (QoS) attributes like response time, availability, security, robustness, and scalability [8][9][10]. However, the values of these attributes are not always accessible and are static in nature, so they may fail to reflect fluctuations in service performance. Alternatively, service clients may use subjective approaches that employ user feedback and ratings to gradually assess the reputation-based trust of services [5][11]. Such methods often rely on specific acquisition mechanisms to collect feedback on QoS attributes, and the feasibility of these mechanisms might be a concern. In most cases, these subjective approaches depend on users' repeated invocations of services and on the users' willingness to provide feedback on the invoked services. Thus, these approaches are challenged by data sparsity and cold start problems. Moreover, they generally neglect qualitative factors that affect subjective trust, such as aesthetics, affordability, and usability [5][12]. On the other hand, a plethora of user feedback on various cloud services is available on the Internet. These reviews represent a useful source of information that may be utilized to solve the issues outlined above.
Recently, several research studies have employed sentiment analysis techniques to automatically transform unstructured customers' reviews into structured data, which can be particularly useful for service reputation management. In this context, sentiment analysis is used as a procedure for assessing customers' satisfaction toward invoked cloud services by classifying their related reviews. Existing approaches that employed sentiment analysis for reputation assessment can be categorized into three main classes: 1) statistics-based, 2) fuzzy-logic-based, and 3) traditional data mining-based approaches. Even though these approaches provide efficient methods for estimating services' reputation, they present several limitations, including:
• These approaches are not scalable since they do not enable analyzing newly added reviews.
• They are time-consuming and require high computing resources to analyze a large number of customers' reviews.
• They are domain-specific, and the obtained results are highly tied to the data context and features used in the experimentations. They need reengineering adjustments to ensure the reputation assessment of entities other than those considered in the experiments.
• They do not provide a concrete score that helps to assess the overall reputation.
• They are not validated through different performance metrics.
• They are often dependent on specific QoS information and constrained by their associated acquisition and analysis processes.
• They do not consider subjective, trust-based qualitative factors affecting the overall services' reputation score.
This study aims to answer the following question: What techniques may be employed to overcome the staticity, infeasibility, and inefficiency challenges associated with most current trust assessment approaches? To address the limitations of the existing solutions for assessing the reputation of the next generation of IT services, we propose a novel approach that employs deep learning-based sentiment analysis and Net Brand Reputation (NBR) techniques. This approach introduces a novel hybrid deep learning model that hybridizes the Convolutional Neural Networks (CNN) and the Bidirectional Long Short-Term Memory (BiLSTM) layers. It is named Convolutional BiLSTM Deep Learning Model (CBiLSTM), and it allows effective review classification. This work also provides a new dataset of cloud service reviews. The collected dataset serves as input for the CBiLSTM classifier, which then feeds into NBR to compute the overall reputation score. The key contributions of this study are summarized as follows:
• The proposed approach tackles data scarcity and cold start issues by exploiting the knowledge provided by customer feedback available on the Internet. In addition, when processing and classifying this data, our approach considers qualitative criteria that substantially impact subjective trust, such as aesthetics, affordability, and usability.
• A new dataset, named ClOud SErvices Reviews Dataset for REputation AssessMent (CLOSER-DREAM), is collected, cleaned, and labeled. This dataset contains more than 13,000 textual reviews related to various cloud services.
• An efficient and novel hybrid deep learning model, CBiLSTM, is proposed for review classification.
• This work proposes a procedure for applying the NBR formula based on CBiLSTM outputs. It also generates a concrete NBR score to assess the overall services' reputation.
• To validate the proposed approach, Google cloud services' reputation is assessed based on CBiLSTM classification and compared to the real reputation score value. The obtained results show that CBiLSTM is effective for assessing the reputation of service providers.
This paper is organized as follows. Section II introduces the background on the main theoretical concepts of this study and discusses the related work. Section III presents the proposed approach. Section IV describes the experiments that were carried out and discusses the achieved outcomes. Section V summarizes the main contributions and findings and outlines future research directions that will be investigated to extend this study.

II. BACKGROUND AND RELATED WORK
A. BACKGROUND

1) CLOUD COMPUTING
According to the National Institute of Standards and Technology (NIST) [13] "cloud computing is a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction".
Cloud computing offers services in three primary forms [14]: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). IaaS virtualizes the data centers' processing power, storage, and network access. PaaS provides a development platform with a range of services supporting the design, development, testing, deployment, monitoring, and hosting of applications on the cloud. SaaS presents software to the end-users as on-demand services, usually using a browser. Google Apps (https://workspace.google.com/), Microsoft Azure (https://azure.microsoft.com/en-us/), and Amazon Elastic Compute Cloud (EC2) are some examples of SaaS, PaaS, and IaaS, respectively. Cloud computing brings many benefits such as ease of use, cost-efficiency, flexibility, elasticity, on-demand scalability, and economies of scale [4][5]. However, cloud services raise numerous concerns about latency, privacy, security, and trust [3][4][5]. These key aspects pose a challenge for cloud services that must be addressed to build trust among cloud stakeholders, including cloud service users, cloud service providers, and third parties [5].

2) DEEP LEARNING
Deep learning [15] is a representation learning approach that uses artificial neural networks to progressively learn representations like patterns and relationships from a large amount of raw data. Each neural network is a series of biologically inspired algorithms that transform the representation at one level into a higher and more abstract level of representation through dynamic adjustment of weight values [15][16]. The learned representations are eventually used for detection or classification tasks [15]. A neural network can be visualized as a set of connected artificial nodes or neurons. Each neuron receives input values or patterns from other neurons, performs some processing operations, and then produces outputs [16]. Neurons in deep neural networks are generally organized into three types of layers: input layer, hidden layers, and output layer. These layers are connected to allow communication between neurons [16]. Recently, deep learning has gained a lot of traction in a variety of sectors and applications [17][18][19]. For sentiment classification, specific deep learning models are often used. These include CNN and Recurrent Neural Network (RNN) extensions like Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and BiLSTM [20].

B. RELATED WORK

1) OVERVIEW OF RELEVANT STUDIES
Several studies in the literature have addressed the trust and reputation assessment of cloud services. They have adopted different classification models, techniques, quality features, and data sources. Fan et al. [21] proposed a multi-dimensional trust-based mechanism for selecting cloud services using Evidential Reasoning techniques. It combines a perception-based trust value that an active user acquires from direct service interactions with a reputation-based trust value derived from other users' interactions. Noor et al. [22] described the design and implementation of a management framework for reputation-based trust called CloudArmor. This framework provides a set of functionalities and features to offer Trust-as-a-Service (TaaS): i) anonymization techniques are utilized to guard users against privacy breaches, ii) several metrics for detecting feedback collusion and Sybil attacks are suggested, and iii) load balancing techniques are adopted to distribute the workload and maintain a desired level of availability through the deployment of the Trust Management Services (TMS). Ding et al. [23] proposed a ranking prediction model for a personalized selection of cloud services that considers the expectation and attitude of customers towards the quality of service. To enhance the accuracy of the service ranking prediction, the proposed technique employed an enhanced Kendall Rank Correlation Coefficient (KRCC) measure that integrates Jaccard's coefficient to reduce the effect of negative customers' reviews in calculating ranking similarity. It also used a customer satisfaction function named Cloud Service Ranking Prediction (DSRP) to find the preference values on pairs of services. Mao et al. [16] employed Particle Swarm Optimization (PSO) to find the optimal initial connection weights between layers in networks to reduce the impact of initial parameter settings and ultimately achieve more accurate trust prediction of cloud services.
The proposed neural network based on PSO techniques demonstrated its efficiency over both basic classification methods (e.g., Bayesian network and decision tree) and traditional Back-Propagation Neural Networks (BPNN). Somu et al. [24] proposed a multi-level Hypergraph Coarsening-based Robust Heteroscedastic Probabilistic Neural Network (HC-RHRPNN) for cloud service trustworthiness prediction. The proposed model utilizes pruning, a dimensionality reduction technique, to identify good-conditioned samples to be then used for HRPNN training. The proposed model made improvements concerning prediction accuracy and execution time. In [4], Deshpande et al. proposed an Evidence-Based Trust Estimation Model (EBTEM) for cloud services' adaptive and dynamic trust assessment. EBTEM employed evidence factors of various QoS attributes of cloud services derived from the direct interaction between cloud users and the services. The computed cumulative trust value, on which the user's decision to use or not use the service is based, represents the core of the proposed dynamic trust prediction approach. Liu et al. [25] proposed a method that combines clustering-based techniques and a trust-based Collaborative Filtering (CF) approach to improve the prediction accuracy and recommendation quality. The clustering-based technique incorporated explicit textual information, rating information, and implicit context information to identify similar users and provide personalized services. The trust-aware CF approach merges local and global trust values to address user unreliability issues. Rizvi et al. [5] suggested a Fuzzy Inference System (FIS) that returns a quantitative security index for Cloud Service Users (CSUs). For coherent analysis of security, the system addressed the multiple possibilities or uncertainties that CSUs may have when assessing the reliability of a cloud service provider.
The overall security assessment depended on CSUs' evaluation of four main factors: compliance, access controls, auditability, and encryption. Li et al. in [26] proposed a framework, named FASTCloud, that facilitated the selection of trustworthy cloud services by Potential Cloud service Consumers (PCCs) and enhanced the feasibility of the acquisition of QoS information. The model collects information related to QoS attributes of different cloud services (i.e., Service Level Objectives (SLOs) and Actual Monitoring Values (AMVs)) from Cloud Service Providers (CSPs) and Cloud Service Consumers (CSCs), respectively. The trustworthy cloud services selection component of FASTCloud receives the information and evaluates the cloud service trust level. Also, the deviation maximization-based weight assignment method is utilized for an objective determination of QoS attributes' weights. Table 1 summarizes the relevant studies reviewed in the previous subsection. The first column presents the previously mentioned related works. The "Classification" column provides a high-level classification of the reviewed reputation and trust approaches; this classification is inspired by the work of Wahab et al. [11]. The "Techniques" column lists the specific techniques employed by each work. The "Assessed Quality Features" column specifies the attributes that are considered in the assessment process. The "Outcomes" column highlights the added values provided by each work. Finally, the "Limitations" column points to the studies' drawbacks or constraints.

3) DISCUSSION
Based on Table 1, it is noticeable that many studies rely on users' feedback for reputation and trust assessment. However, it is not always practical to request many users to rate services against fine-grained criteria for an overall view of community opinion, as users can be reluctant or unmotivated to spend time evaluating services. As a result, data sparsity becomes a serious problem when relying on such an approach. This explains why some studies resort to prediction methods for trust assessment. Furthermore, some approaches rely on user history information and presume that QoS status monitoring tools/services are available to service customers. Thus, the feasibility of QoS information acquisition mechanisms might be a concern. Considering the performance and security of service platforms, the monitoring techniques used for QoS information acquisition apply only to individual service users. Monitoring of platforms by multiple clients using the same or different services is not supported. Furthermore, the actively gathered QoS information of service providers from open sources can be incomplete and inaccurate due to inconsistent updates made by service providers [5][26]. As shown by Table 1, most studies investigate the quantitative factors of service trust assessment, i.e., QoS attributes such as performance, availability, and response time. The proposed approaches have the advantage of linking subjective user feedback to specific QoS attributes or providing objective trust assessment. However, they overlook the qualitative factors that considerably affect subjective trust, such as aesthetics, affordability, and usability [5][12]. Furthermore, the environment of the new generation services has a dynamic nature, where new services with unpredictable QoS attributes emerge continuously. This can make the service selection problem more complex.
Prediction-based trust assessment approaches can be plausible in solving this issue, especially when the trustworthiness of a newly emerging service needs assessment with minimal knowledge of the QoS characteristics of the service [24]. Because of their self-learning capabilities in modeling complicated and arbitrary relationships, deep learning-based approaches have outperformed traditional methods in trust prediction [24][27]. This work utilizes the deep learning-based approach for service reputation-based trust assessment. Instead of soliciting users' comments at every service invocation, the proposed approach takes advantage of the rich information resources accessible online in the form of free-text user reviews to address the issues previously highlighted. It is worth mentioning that having a vast number of reviews assessed can ensure that various use cases are tested and reduce the effect of unauthentic or misleading reviews. This work aims to improve the existing research in this area. Our primary purpose is to develop a comprehensive, trustworthy, and novel approach that employs deep learning-based sentiment analysis and NBR techniques to ensure reputation assessment for cloud services. This work introduces a hybrid deep learning model named CBiLSTM for reliable review classification. It also offers a new dataset of cloud service reviews. The collected dataset serves as input for the classifier, which in turn generates input information for NBR. The NBR score reflects the Quality of Experience (QoE), i.e., the overall user acceptance of service based on subjective perception [9][10][28]. The proposed approach has the following advantages:
• It is feasible as it does not require direct user intervention to get user feedback on services.
• It is dynamic as CBiLSTM is capable of classifying any newly added service reviews.
• It is time-saving because CBiLSTM uses existing web reviews to accomplish classification in a short period of time.
• It is effective for reputation assessment as it generates NBR scores closer to the actual/real scores.
• It deals with more authentic and rational feedback as it employs a large number of published reviews.
• It delivers a comprehensive reputation assessment that considers all subjective trust-based factors.
• It is generalizable since it may be used to measure the reputation of various entities other than cloud services.

III. METHODOLOGY AND PROPOSED APPROACH
This work aims to adopt deep learning-based sentiment analysis for effective and efficient reputation assessment of cloud service providers. To that end, this paper follows a pipeline of three main phases:
1. Dataset collection and labeling phase.
2. Deep learning classification phase.
3. Deep learning model validation phase.
These phases are illustrated in Figure 1 and detailed in the following paragraphs.

A. Data Collection and Labelling Phase
This work introduces a new dataset consisting of English reviews on a range of cloud services shown in Table 2. The dataset is named CLOSER-DREAM, ClOud SErvices Reviews Dataset for REputation AssessMent. It contains 13,178 reviews including 12,567 (95.37%) "Positive" reviews, 260 (1.97%) "Negative" reviews, and 350 (2.66%) "Neutral" reviews. Figure 2 shows the unbalanced distribution of the dataset based on the number of instances per class. In addition, the reviews in CLOSER-DREAM are relatively long, with a maximum review length of 818 words, a minimum review length of 3 words, and a median review length of 64 words. The number of unique words is estimated at 15,218. Most of the gathered reviews discuss the benefits and drawbacks of the reviewed services. The reviews in CLOSER-DREAM are scraped from multiple review websites. They are also cleaned by removing duplicates and noise. Manual labeling of this dataset is impractical due to the large number of collected reviews. Furthermore, the widely available ratings on review websites are inconsistent, as clients with similar concerns may rate the same service differently. Because of these limitations, we suggest a two-stage labeling technique. First, a sentiment analysis tool is used to label the dataset automatically. Second, the reviews of the minority classes are manually checked and re-labeled depending on specific features.

B. Deep Learning Classification Phase
This phase employs deep learning to classify texts in terms of sentiment. This section discusses the preparation activities required to convert the textual reviews into a machine-understandable format. In addition, it introduces a novel hybrid deep learning model for sentiment classification. The preparation proceeds in four steps:
1. Data is filtered to make the text uniform and to remove unneeded characters.
2. Sentences are tokenized, i.e., divided into smaller units such as words.
3. A word-to-index dictionary is created by mapping each vocabulary word to a unique integer value based on word frequency. Text sentences are then converted to sequences of integers, where each number corresponds to a word in the index, by following a typical process called sequencing.
4. As neural networks require inputs with the same length and dimension, padding is applied to enforce a threshold number of words for all sequences. If a sequence is shorter than the threshold, extra 0s are added, and sequences are truncated if they are longer than the threshold.
Padding is used to increase the computational efficiency and the performance of the neural network model. Most CLOSER-DREAM reviews are long, with a maximum review length of 818 words, a minimum review length of 3 words, and a median review length of 64 words. The padding threshold is usually set to the length of the longest sentence in the training set, which is equal to 818 words in our experiments. This choice is common in works that apply deep learning models to sentiment classification [29][30]. We have chosen to use 'post' padding, which means that the integer indices representing the words of a sentence appear at the left-most positions of the resulting sentence vector, whereas the padding values ('0') appear after the actual data at the right-most positions. Table 3 illustrates a sample review's tokenization, word indexing, sequencing, and padding.
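The tokenization, word indexing, sequencing, and post-padding steps described above can be sketched in plain Python. This is a minimal illustration and not the implementation used in the experiments: the two sample reviews, the function names, the whitespace tokenizer, and the choice to truncate over-long sequences from the end are all assumptions made for the example.

```python
from collections import Counter


def build_word_index(tokenized_reviews):
    """Map each word to an integer rank by frequency (1 = most frequent).

    Index 0 is left unused so that it can serve as the padding value.
    """
    counts = Counter(word for tokens in tokenized_reviews for word in tokens)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(tokens, word_index):
    """Sequencing: replace each known word by its integer index."""
    return [word_index[word] for word in tokens if word in word_index]


def pad_post(sequence, max_len):
    """'Post' padding: keep the data on the left, append zeros on the right.

    Sequences longer than max_len are cut at the end (an assumption; the
    text does not state which end is truncated).
    """
    truncated = sequence[:max_len]
    return truncated + [0] * (max_len - len(truncated))


# Hypothetical sample reviews, already lower-cased and stripped of punctuation.
reviews = [
    "azure is a reliable affordable cloud platform",
    "a reliable platform",
]
tokenized = [review.split() for review in reviews]        # step 2: tokenization
word_index = build_word_index(tokenized)                  # step 3: word index
sequences = [to_sequence(t, word_index) for t in tokenized]
padded = [pad_post(s, max_len=10) for s in sequences]     # step 4: padding

for row in padded:
    print(row)
```

In the experiments the threshold would be 818 rather than 10; the small value here only keeps the printed vectors readable.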

2) WORD EMBEDDINGS
An index value represents each token in the previous preprocessing steps. However, these indices do not reflect any relationship between the tokens, i.e., indices' numerical order does not have much conventional meaning. Therefore, an extra encoding step, known as embedding, is required to create a dense representation of each preprocessed token to reflect their relationships. The preprocessed tokens serve as input to the word embedding layer, which is the first layer in the proposed model. This layer converts the inputs to vector representations that capture the semantic meanings of words and reflect the relationship among them.
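To make the role of the embedding layer concrete, the sketch below shows how integer indices can be turned into dense vectors through a lookup table, and why such vectors can express relatedness that raw indices cannot. The 4-dimensional vectors, the word index, and the function names are hypothetical values chosen for illustration; pretrained vectors such as GloVe have many more dimensions.

```python
import math

EMBED_DIM = 4

# Hypothetical toy vectors standing in for pretrained embeddings.
toy_vectors = {
    "reliable":   [0.8, 0.1, 0.3, 0.0],
    "affordable": [0.7, 0.2, 0.4, 0.1],
    "platform":   [0.0, 0.9, 0.1, 0.5],
}
word_index = {"reliable": 1, "affordable": 2, "platform": 3, "azure": 4}


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i.

    Row 0 is the padding row; words without a pretrained vector
    (here "azure") keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix


def embed(sequence, matrix):
    """Embedding lookup: one dense vector per index in the padded sequence."""
    return [matrix[idx] for idx in sequence]


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


matrix = build_embedding_matrix(word_index, toy_vectors, EMBED_DIM)
embedded = embed([1, 2, 3, 0, 0], matrix)   # a padded sequence of length 5

# Indices 1, 2, 3 are equally spaced, yet the vectors place "reliable"
# closer to "affordable" than to "platform".
print(cosine(matrix[1], matrix[2]), cosine(matrix[1], matrix[3]))
```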

3) HYBRID CONVOLUTIONAL BILSTM DEEP LEARNING MODEL (CBILSTM) FOR SENTIMENT CLASSIFICATION
CNN and RNN are typical models for sentiment classification [16]. The difference between the two models is that CNN can extract local features by examining the spatial relationship within the data but cannot learn sequential correlations [31]. On the other hand, RNN looks for the temporal relationship and can extract global features [32]. While RNNs are suitable for sequential relationships, traditional RNNs are susceptible to gradient explosion or vanishing when exposed to long data sequences. LSTM [31] is an extension of RNN that can prevent these problems. Also, it can remember long-term dependencies with chains of memory cells as hidden units. BiLSTM [33] enhances LSTM by combining two LSTM layers to process information sequences in forward and backward directions in parallel. In this work, we propose a novel hybrid model that combines CNN with BiLSTM layers. This model is named CBiLSTM. At the top of CBiLSTM, three CNN layers are added to extract the most important n-gram features from text vectors and, thus, reduce the dimensionality. The subsequent BiLSTM layer is fed with the features extracted by the top CNN layers and investigates their contexts to capture phrase-level patterns. In addition, a batch normalization layer is added to standardize the outputs of the BiLSTM layer and stabilize the learning process without changing vector dimensions. The output of the batch normalization layer is passed to a global max pooling layer. This layer facilitates the transition to the output prediction layer, also named the dense layer, by downsampling each representation vector to a single value. Figure 3 illustrates the architecture of the proposed CBiLSTM. In a sentence like "azure is a reliable, affordable cloud platform", the words "reliable" and "affordable" express a positive sentiment about the cloud service provider. At the top layers of CBiLSTM, CNN filters capture the features from sequential groups of words (phrases). Therefore, the positive sentiment in the keywords "reliable" and "affordable" could be predicted correctly by a single CNN. However, in sentences like "Do not miss out on Google platform" and "Do not waste your time with Google platform", the two phrases "do not miss" and "do not waste" convey different opinions. As CNN extracts word-level patterns, it could classify both sentences as negative comments, although the former implies a positive sentiment. Adding a BiLSTM layer enables CBiLSTM to remember the past and forward contexts to detect the phrase-level patterns, which could help predict both sentences' classes correctly. BiLSTM could also effectively predict complex sentences with dependencies between features like: "Although this is a good platform, its functions appear to have some delay in revealing data".

Table 3 (excerpt). Sequencing and padding of a sample review.
Sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1]
Padded Sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0, 0, 0, …, 0]
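Two of the building blocks named above, a convolutional filter that scans n-grams and a global max pooling layer that reduces each feature map to a single value, can be illustrated with a small pure-Python sketch. It is a simplified illustration only: it omits the BiLSTM and batch normalization layers, involves no training, and the toy embeddings, filter weights, and function names are hypothetical.

```python
def conv1d(embedded, kernel, bias=0.0):
    """Slide one filter over the token sequence and apply ReLU.

    embedded: list of token vectors (sequence_length x embedding_dim)
    kernel:   list of weight vectors (filter_width x embedding_dim);
              a width of 2 makes the filter respond to bigrams.
    Returns one activation per window, i.e. a feature map.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        total = bias + sum(
            weight * value
            for kernel_row, token_vec in zip(kernel, window)
            for weight, value in zip(kernel_row, token_vec)
        )
        feature_map.append(max(0.0, total))   # ReLU activation
    return feature_map


def global_max_pool(feature_maps):
    """Keep only the strongest activation of each feature map."""
    return [max(feature_map) for feature_map in feature_maps]


# Hypothetical 2-dimensional embeddings for a five-token review.
embedded = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]

# Two hypothetical bigram filters.
filters = [
    [[1, 0], [0, 1]],      # fires when dim 0 is followed by dim 1
    [[-1, 0], [0, -1]],    # never fires on non-negative inputs after ReLU
]

feature_maps = [conv1d(embedded, kernel) for kernel in filters]
pooled = global_max_pool(feature_maps)
print(feature_maps)
print(pooled)
```

The pooled vector has one entry per filter regardless of the review length, which is what allows a fixed-size dense layer to follow.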

C. Reputation Assessment and Model Validation Phase
The goal of this phase is to apply the NBR formula by using the proposed deep learning model results to assess the reputation of cloud service providers [34]. It also aims to validate the effectiveness of using the proposed deep learning model for reputation assessment. NBR is the net value of a brand reputation estimated from published reviews. It employs sentiment analysis to measure clients' satisfaction levels. The NBR index focuses more on the positive feedback from brand promoters than on the negative ones. The output of NBR can be any value in the range [-100,100]. Higher values mean that more positive reviews are considered. The NBR equation is illustrated by Eq. 1.

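$$\mathrm{NBR} = \frac{\text{Positive reviews} - \text{Negative reviews}}{\text{Positive reviews} + \text{Negative reviews}} \times 100 \tag{1}$$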
To substitute the positive reviews and negative reviews values in Eq. 1, the confusion matrix of the proposed deep learning model is used. The confusion matrix is a performance measure that reports the number of "True Positive" (TP), "True Negative" (TN), "False Positive" (FP), and "False Negative" (FN) values. The TP value substitutes the positive reviews' value in NBR, whereas the TN value substitutes the negative reviews' value [35]. TP represents the labels correctly predicted as "Positive", whereas TN denotes the labels correctly predicted as "Non-Positive". The latter includes both the "Negative" and "Neutral" labels.
To validate the effectiveness of using the proposed deep learning model for reputation assessment, the resulting NBR score is compared to the real/actual data-based NBR score. The actual NBR score uses the total numbers of positive and non-positive labels counted from the original dataset. The total number of positive reviews in the dataset substitutes the positive reviews variable in NBR, while the total number of negative plus neutral reviews substitutes the negative reviews variable.
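The procedure described in this phase can be sketched as follows: labels are collapsed to "Positive" versus "Non-Positive", TP and TN are counted from the model's predictions, and both the model-based and the data-based NBR are computed with the same formula. The function names and the ten-review example are hypothetical and serve only to illustrate the arithmetic; they are not results from the paper.

```python
def positive_vs_rest_counts(true_labels, predicted_labels):
    """Count TP, TN, FP, FN after merging negative and neutral into non-positive."""
    tp = tn = fp = fn = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        truth_positive = truth == "positive"
        predicted_positive = prediction == "positive"
        if truth_positive and predicted_positive:
            tp += 1
        elif not truth_positive and not predicted_positive:
            tn += 1
        elif predicted_positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def net_brand_reputation(positive, negative):
    """NBR = (positive - negative) / (positive + negative) * 100, in [-100, 100]."""
    total = positive + negative
    if total == 0:
        raise ValueError("NBR is undefined when there are no reviews")
    return (positive - negative) / total * 100


# Hypothetical ground truth and predictions for ten reviews.
true_labels = ["positive"] * 8 + ["negative", "neutral"]
predicted_labels = ["positive"] * 8 + ["neutral", "positive"]

tp, tn, fp, fn = positive_vs_rest_counts(true_labels, predicted_labels)

# Model-based NBR: TP replaces the positive count, TN the negative count.
model_nbr = net_brand_reputation(tp, tn)

# Data-based NBR: counts taken directly from the labeled dataset.
n_positive = sum(label == "positive" for label in true_labels)
n_non_positive = len(true_labels) - n_positive
actual_nbr = net_brand_reputation(n_positive, n_non_positive)

print(round(model_nbr, 2), round(actual_nbr, 2))
```

Note that a "negative" review predicted as "neutral" still counts as a true negative here, because both labels belong to the non-positive class.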

IV. EXPERIMENTATION AND VALIDATION
This section focuses on the experiments conducted to validate the proposed approach. First, CBiLSTM classification performance is tested and compared to other baseline models. Second, computations are performed to compare the CBiLSTM-based NBR to Google Cloud's actual/real data-based NBR.

A. Validation of CBiLSTM Classification Performance
This subsection presents an experimental study conducted on CLOSER-DREAM to evaluate the classification performance of CBiLSTM.

1) DATASET PREPARATION
The reviews are extracted using the Web Scraper extension available for Google Chrome (https://chrome.google.com/webstore/detail/web-scraper-free-webscra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en). Multiple review websites are scraped, including Capterra (https://www.capterra.com/), g2 (https://www.g2.com/), Gartner (https://www.gartner.com/reviews/home), TrustRadius, Software Advice, GetApp, Trust Pilot, and Spiceworks. Because the reviews are gathered from several websites, some reviewers may submit the same review on more than one website. This causes a data redundancy problem in the dataset.
To solve this issue, Python code is implemented to remove duplicates from CLOSER-DREAM. In addition, some reviews contain noise that must be cleaned. For example, some reviews end or start with sentences like: "This review was collected by g2 website", "show more show less", or "Published on 9/4/2020". Such sentences are useless for categorizing the reviews; thus, they are removed. After cleaning noise, the dataset is automatically labeled using Valence Aware Dictionary for sEntiment Reasoning (VADER) [36], a sentiment analysis Python package.
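A minimal sketch of such a cleaning step is given below, using only the Python standard library. The regular expressions are assumptions modeled on the three noise examples quoted above, and the sample reviews are invented; the actual cleaning code is not reproduced in this paper. In this sketch, noise is stripped before duplicates are removed, so that copies of a review that differ only in site-specific boilerplate are recognized as duplicates.

```python
import re

# Assumed patterns, derived from the noise examples mentioned in the text.
NOISE_PATTERNS = [
    re.compile(r"this review was collected by \w+ website", re.IGNORECASE),
    re.compile(r"show more show less", re.IGNORECASE),
    re.compile(r"published on \d{1,2}/\d{1,2}/\d{4}", re.IGNORECASE),
]


def strip_noise(review):
    """Remove boilerplate fragments and normalize whitespace."""
    for pattern in NOISE_PATTERNS:
        review = pattern.sub(" ", review)
    return re.sub(r"\s+", " ", review).strip()


def deduplicate(reviews):
    """Drop empty reviews and case-insensitive duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for review in reviews:
        key = review.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


# Hypothetical scraped reviews; the first two are the same review from two sites.
raw_reviews = [
    "Great uptime and fair pricing. Published on 9/4/2020",
    "Great uptime and fair pricing. This review was collected by g2 website",
    "Support was slow to respond. show more show less",
]

cleaned = deduplicate([strip_noise(review) for review in raw_reviews])
print(cleaned)
```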
VADER is an open-source, lexicon- and rule-based tool for sentiment analysis. It determines the polarity of reviews and classifies them into multiple sentiment classes. VADER extracts sentiment-bearing words and their corresponding intensity from sentences. It returns a polarity score between -4 and 4 for each word. The closer the score is to -4, the more intense the word's negativity, and the closer the score is to 4, the more intense its positivity. The word scores are then normalized to obtain an overall statement sentiment score, known as the Compound Score. This score reflects the statement's polarity and corresponding intensity, and it falls in the range between -1 and 1. The compound score's formula is given in Eq. 2, where x is the sum of the polarity scores of the constituent words and α is a normalization constant whose default value is 15.
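As a minimal sketch of the normalization just described, the compound score can be computed as x / sqrt(x² + α). The function name `compound_score` is chosen here for illustration, and the sketch covers only the normalization step, not VADER's lexicon lookup or its heuristic rules.

```python
import math


def compound_score(x, alpha=15.0):
    """Normalize the summed word polarities x into the interval (-1, 1).

    alpha is the normalization constant; 15 is the default stated in the text.
    """
    return x / math.sqrt(x * x + alpha)


# A summed polarity of 0 is neutral; larger magnitudes approach -1 or 1.
neutral = compound_score(0.0)
mild = compound_score(1.0)
strong = compound_score(8.0)
```

Because the denominator is always larger than |x|, the result never reaches -1 or 1 exactly, and it grows monotonically with the summed polarity.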
Because VADER's results are not totally accurate, the labels of the minority classes (neutral and negative) are checked and updated manually. Labels with a compound score lower than 0.7 are also checked and updated manually. Reviews with higher compound scores (0.8, 0.9, 1.0) have more intense polarity and are therefore more likely to be classified correctly.

Before implementing CBiLSTM using the CLOSER-DREAM dataset, some preparation tasks are required to convert the textual reviews into a machine-understandable format. Several preprocessing steps, illustrated in Figure 4, are followed in this work. First, punctuation marks, numbers, single characters, and multiple spaces are removed, and all characters are converted to lower case. Then, some replacements are made to make the text more uniform, such as replacing "'ll" with "will" and "I 'm" with "I am". Also, the three labels are vectorized so that "positive" corresponds to "1, 0, 0", "negative" to "0, 1, 0", and "neutral" to "0, 0, 1". Sentences are then tokenized, and a word-to-index dictionary and padded sequences are created. In this work, the maximum sequence length is set to 818. For the word embeddings, this work uses the GloVe [37] representation, which is produced by an unsupervised learning algorithm that leverages word co-occurrence frequencies.

⁵ https://chrome.google.com/webstore/detail/web-scraper-free-webscra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en
⁶ https://www.capterra.com/
⁷ https://www.g2.com/
⁸ https://www.gartner.com/reviews/home
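The preprocessing steps described above can be sketched in plain Python as follows. The helper names are hypothetical, and two details are assumptions rather than statements from the text: contractions are expanded before punctuation is removed (otherwise the apostrophes would already be gone), and sequences are padded with zeros at the front. The label vectors, the vocabulary limit of 20,000, and the sequence length of 818 come from the text.

```python
import re
from collections import Counter

# One-hot label vectors exactly as stated in the text.
LABEL_VECTORS = {"positive": [1, 0, 0], "negative": [0, 1, 0], "neutral": [0, 0, 1]}


def normalize_text(text):
    """Apply the cleaning steps to one review."""
    # Expand contractions first (ordering is an assumption of this sketch).
    text = re.sub(r"\bI\s*'m\b", "I am", text)
    text = re.sub(r"\s*'ll\b", " will", text)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # punctuation and numbers
    text = re.sub(r"\b[a-z]\b", " ", text)  # single characters
    return re.sub(r"\s+", " ", text).strip()


def build_word_index(texts, max_vocab=20000):
    """Map the most frequent words to indices 1..max_vocab-1 (0 is padding)."""
    counts = Counter(word for t in texts for word in t.split())
    ranked = counts.most_common(max_vocab - 1)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_padded_sequence(text, word_index, max_len=818):
    """Convert a cleaned review to a fixed-length list of word indices."""
    ids = [word_index[w] for w in text.split() if w in word_index][:max_len]
    return [0] * (max_len - len(ids)) + ids  # front padding is an assumption


# Made-up reviews, used only to exercise the helpers.
reviews = ["I 'm sure it'll scale to 100 users!", "Support is slow, pricing is unclear."]
texts = [normalize_text(r) for r in reviews]
word_index = build_word_index(texts)
padded = [to_padded_sequence(t, word_index) for t in texts]
```

Note that removing single characters also drops the pronoun "I" after lowercasing, so the first sample becomes "am sure it will scale to users".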

2) BASELINE MODELS
This work compares the classification performance of CBiLSTM to three deep learning models. These are as follows:
• CNN: relies on convolution and pooling layers and applies convolutional filters to capture local features.
• BiLSTM: combines two LSTM layers that process the sequence in opposite directions for context analysis.
• GRU: the Gated Recurrent Unit is an RNN variant with a simpler architecture than LSTM. It has no separate internal memory cell and has fewer gates than LSTM.
Tables 4, 5, 6, and 7 provide summaries of the structures of the experimented models.

The proposed approach was implemented on a desktop computer with the following configuration: an 11th Gen Intel® Core™ i9-11900H @ 2.50 GHz processor and 32 GB of RAM. An NVIDIA GeForce RTX 3080 Ti 16 GB graphics card is used to facilitate the smooth training of the proposed classifier. CLOSER-DREAM is split into 70% for model training and 30% for validation and testing. The shuffle feature is disabled so that Google Cloud-related reviews remain in the validation and testing portions of CLOSER-DREAM. The testing portion includes only reviews related to Google cloud services. This portion is used to assess the reputation of Google as a cloud services provider. Table 8 shows the number of samples in each subset. Moreover, 400,000 pre-trained vectors in the "glove.6B.100d.txt" file [38] are used to prepare the embedding layer. The parameters used for the embedding layer include 20,000 as the maximum vocabulary size, 818 as the maximum sequence length, and 100 as the embedding dimension for all the models. For the training of all models, the batch size is set to 128, and the number of epochs is 70. The gradient descent algorithm is employed to set up the optimal hyperparameters (i.e., batch size, epochs, optimizer, momentum, and weight decay) of the different models deployed for the experimentation.
This technique is widely used to minimize the cost/loss function when developing machine learning and deep learning-based applications. Gradient descent [39] is an iterative first-order optimization algorithm that finds a local minimum (or, with the sign reversed, a local maximum) of a function. In the gradient descent technique, we start with random model parameters, calculate the error at each learning iteration, and then repeatedly change the model parameters to move toward the values that yield the lowest cost. The objective is to take repeated steps in the opposite direction of the function's gradient (or approximate gradient) at the current position, because this is the direction of steepest descent. Stepping in the direction of the gradient, on the other hand, will result in a local maximum of the function.
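The update rule described above can be illustrated with a minimal one-dimensional sketch. The function name, the learning rate, the step count, and the quadratic example are all assumptions chosen for illustration. A fixed starting point replaces the random initialization so that the result is reproducible.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move opposite to the gradient
    return x


# Example cost f(x) = (x - 3)^2 with gradient f'(x) = 2(x - 3);
# its only minimum is at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
```

Flipping the sign of the update (adding the gradient term instead of subtracting it) would move toward a maximum instead, which is the case mentioned at the end of the paragraph above.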

4) RESULTS AND DISCUSSION
Five metrics are used in this work to evaluate the models' performance: precision, recall, F1-score, accuracy, and the confusion matrix [40]. Precision is the fraction of the reviews assigned to a class by the model that truly belong to that class. Recall is the fraction of the reviews that truly belong to a class that the model correctly identifies. The F1-score is a balanced metric defined as the harmonic mean of precision and recall. Accuracy is the fraction of all reviews whose predicted label is correct. Eqs. 3, 4, 5, and 6 give the equations of the performance metrics mentioned above.
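As a small sketch of how these metrics follow from a confusion matrix, the helpers below use the standard definitions. The matrix values are made up for illustration and are not results from this work. Rows are assumed to hold the true classes and columns the predicted classes.

```python
def per_class_metrics(cm, k):
    """Return (precision, recall, F1) of class k; rows = true, columns = predicted."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp  # predicted k, but another true class
    fn = sum(cm[k]) - tp                 # true k, but predicted as another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


# Illustrative 3-class matrix (made-up counts).
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
precision0, recall0, f1_0 = per_class_metrics(toy_cm, 0)
overall_accuracy = accuracy(toy_cm)
```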
As this work deals with a multiclass classification on an imbalanced dataset, both the macro-averaged and weighted average scoring metrics of recall, precision, and F1-score are considered.
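The difference between the two averaging schemes can be seen in a short sketch. The per-class scores and class sizes below are made up to mimic an imbalanced dataset, and the function names are illustrative.

```python
def macro_average(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean weighted by the number of samples (support) in each class."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


# Made-up per-class scores for one dominant class and two minority classes.
class_scores = [1.0, 0.5, 0.5]
class_supports = [80, 10, 10]
macro = macro_average(class_scores)
weighted = weighted_average(class_scores, class_supports)
```

With these values, the macro average is pulled down by the two weak minority classes, while the weighted average stays close to the score of the dominant class. This is why both are worth reporting on an imbalanced dataset.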
The macro-averaged measure is the arithmetic mean of all class scores. It treats all classes equally regardless of their proportions in the dataset. For example, the macro-average precision is the mean of the precision scores of the positive, negative, and neutral classes. On the other hand, the weighted average measure is the average of the weighted class scores. It multiplies each class score by its corresponding class proportion.

The remaining part of this section illustrates the performance of all the DL models that are compared. Table 9 shows a detailed classification report of each class per model. CBiLSTM achieves the highest precision, recall, and F1-scores in nearly every case across the positive, negative, and neutral classes. For positive reviews, CBiLSTM reaches a recall of 100%. For negative reviews, it achieves a precision of 76%. Finally, GRU provides the highest recall of 60% for neutral reviews, whereas our model performs better on this class in terms of precision and F1-score, with 76% and 54%, respectively.

Tables 10 and 11 show the experimented models' macro-averaged and weighted scores, respectively. CBiLSTM has the highest macro-average precision and F1-score, while GRU achieves the highest recall. CBiLSTM outperforms all the other models for the weighted average, with an overall precision of 98%, a recall of 99%, and an F1-score of 98%. Table 12 shows the training time of each model. It indicates that CNN requires the least training time, followed by CBiLSTM. The training time of our proposed classifier remains reasonable compared to the training times of the GRU and BiLSTM models. Figure 5 depicts the models' confusion matrices, normalized by predictions. The diagonal of the CBiLSTM matrix shows the lightest colors and the highest proportions of accurate predictions per class.
Figure 5 shows that our proposed classifier outperforms the other models considered in these experiments by offering the highest accuracy in classifying each review class. The training and validation loss and the training and validation accuracy of all models are depicted in Figures 6 and 7. As shown in these figures, the learning curves of CBiLSTM and CNN present good fits. In the case of a good fit, the training and validation losses decline to a stable point with a minimal gap between the two curves at the end. In comparison to CNN, the loss and accuracy learning curves of our proposed classifier show fewer fluctuations, as demonstrated by Figure 8, which presents the validation accuracy learning curves of all models in one line chart. This conclusion is also confirmed by Figure 9, which provides a closer look at the differences between the CNN and CBiLSTM validation accuracy learning curves.

5) VALIDATION OF CBILSTM FOR REPUTATION ASSESSMENT
To validate the performance of CBiLSTM in reputation assessment, the NBR score of Google Cloud is calculated twice, and the results are compared. First, it is generated based on the confusion matrix of CBiLSTM. Second, it is generated based on the review counts in the original dataset.
To calculate the NBR score defined in Eq. 1, the confusion matrix of the CBiLSTM model applied to the testing set is used. The TP value substitutes the positive reviews' value in NBR, whereas the TN value substitutes the negative reviews' value. However, CBiLSTM classifies reviews into three classes, and its confusion matrix is therefore a 3×3 matrix. To derive the TP and TN values from this multiclass confusion matrix, we need to transform it into a binary confusion matrix [47]. The transformation process is illustrated in Figure 10. The result of transforming the multiclass confusion matrix provided in Figure 11 into a binary confusion matrix is shown in Figure 12. The latter figure is generated through implementation. It classifies the reviews into "Positive" and "not-Positive" classes. The resulting classification is then used to determine the NBR score. Based on the obtained binary matrix, the number of positive reviews is 3,868, which is the TP value, whereas the number of negative reviews is 33, which is the TN value. According to the result obtained by Eq. 7, the NBR score of Google Cloud services is estimated at 98.3%.
\[ \mathrm{NBR} = \frac{3868 - 33}{3868 + 33} \times 100 = 98.3\,\% \tag{7} \]

The testing dataset contains 3,880 positive reviews, 31 negative reviews, and 43 neutral reviews. 3,880 substitutes for the positive reviews in NBR, whereas the sum of the negative and neutral reviews, 31 plus 43, substitutes for the negative reviews in NBR. The reputation score of Google Cloud is estimated as 96.25% based on 3,954 reviews, as calculated by Eq. 8.

\[ \mathrm{NBR} = \frac{3880 - (31 + 43)}{3880 + (31 + 43)} \times 100 = 96.25\,\% \tag{8} \]

Comparing the CBiLSTM-based NBR score of Google Cloud to the NBR score generated from the original dataset, the two values are close. This indicates that CBiLSTM can be considered a reliable technique for the reputation assessment of service providers.
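The two computations compared above can be sketched as follows. The 3×3 matrix is made up purely to illustrate the multiclass-to-binary transformation, since the cell values of the actual CBiLSTM confusion matrix are not reproduced here. Only the counts passed to `nbr` (3,868 and 33; 3,880, 31, and 43) are taken from the text. The function names are illustrative, and the NBR formula is assumed to be (positive - negative) / (positive + negative) × 100.

```python
def to_binary(cm, positive=0):
    """Collapse a multiclass matrix into (TP, FN, FP, TN) for one class.

    Rows are true classes and columns are predicted classes; every class
    other than `positive` is merged into a single "not-Positive" class.
    """
    tp = cm[positive][positive]
    fn = sum(cm[positive]) - tp
    fp = sum(row[positive] for row in cm) - tp
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    return tp, fn, fp, tn


def nbr(positive, negative):
    """Net Brand Reputation as a percentage."""
    return (positive - negative) / (positive + negative) * 100


# Made-up matrix with class order: positive, negative, neutral.
toy_cm = [
    [50, 2, 3],
    [4, 6, 1],
    [5, 2, 7],
]
tp, fn, fp, tn = to_binary(toy_cm)

# Counts reported in the text for the Google Cloud testing set.
model_based_nbr = nbr(3868, 33)       # TP and TN from the binary matrix
data_based_nbr = nbr(3880, 31 + 43)   # positive vs. negative plus neutral
```

The TN value of the toy matrix equals the sum of its lower-right 2×2 block, which is exactly the set of reviews that are neither labeled nor predicted as positive.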

V. CONCLUSION
In recent years, the number of competing services in the cloud services industry has increased. Although this has numerous advantages, certain difficulties occur when clients must choose among a range of services that provide the same functionality. The literature discusses several objective and subjective measurements of service trust. However, the existing approaches are challenged by staticity, acquisition feasibility, data sparsity, and cold start issues. This paper presents a novel approach to dealing with these issues. It develops a new deep learning model to classify cloud service-related reviews based on sentiments derived from the examined reviews. The proposed deep learning model, named CBiLSTM, is a hybrid model that combines CNN and BiLSTM layers. The CNN layers handle the high dimensionality of text inputs by extracting word-level features, and the BiLSTM layer investigates the context of the previously extracted features in the backward and forward directions simultaneously. The CBiLSTM's classification results are utilized to compute the overall reputation score using the NBR formula.

Multiple experiments were carried out to validate the proposed approach. First, the performance of CBiLSTM is compared to that of the CNN, BiLSTM, and GRU models, and the findings show that CBiLSTM surpasses these models. Second, experiments using Google Cloud reviews indicated that CBiLSTM is a reliable method for assessing service providers' reputations. The goal of this work is to provide a reputation score for service providers based on user QoE, which is represented by categorizing reviews as "Positive", "Negative", or "Neutral", and to provide an overall assessment of user sentiment regarding service providers. Despite the numerous contributions made by this study, it presents a number of shortcomings.
Mainly, more in-depth and refined research is needed to investigate multimodal sentiment analysis techniques for reputation assessment. Indeed, content on the web today has evolved from text to multimedia material, including videos and images, as a medium for user expression. For a variety of decision-making applications, these multimodal forms of expression have become the standard information resource. Although these new forms of expression provide richer and more expressive information resources, their dispersion in terms of multimodal emotional expressions requires a more complex analysis to extract relevant and valuable data.
As future work, we plan to extend our approach beyond text-based sentiment analysis techniques by deploying promising multimodal sentiment analysis techniques to ensure the effective assessment of services' reputation. Also, we intend to turn the multiclass classification problem into a multi-label problem to extract additional and more valuable characteristics from reviews. For example, depending on customers' subjective sentiments, we can categorize reviews to reflect service aesthetics, affordability, usability, security, and QoS attributes. Furthermore, we aim to enhance the proposed classifier so that it detects and classifies ironic or sarcastic reviews, which would increase the accuracy of the overall reputation assessment. Finally, to address the unbalanced nature of the CLOSER-DREAM dataset, our future work will investigate resampling strategies to improve the overall performance of the proposed CBiLSTM model.