Adaptive Trust Management and Data Process Time Optimization for Real-time Spark Big Data Systems

Applications supporting businesses, smart systems, social networks, and advanced video applications such as eXtended Reality (XR) require large amounts of data processing to be provided in real-time. Therefore, the processing speed of big data systems is more important than ever. On the other hand, protecting a big data system is not easy, as various types of nodes and clusters are supported by various wired and wireless networks. Commonly security procedures slow down the response time of big data networks, and therefore, enhanced security and performance speed techniques need to be co-designed into the system. In this paper, a trusted streaming adaptive function (TSAF) is proposed that uses a trust management scheme to identify malicious nodes in Spark big data systems, exclude them from job/task processing, and calculate the number of nodes that can satisfy the process’s object completion time. The TSAF scheme shows an improved processing performance when there are attacks on the big data system compared to other existing real-time big data processing schemes. For the case of no security attack, the results show that the processing time of TSAF is faster by about 1 ~ 2% compared to the existing big data processing schemes when the process completion object time is set to 0.5 s. Even when the ratio of malicious nodes performing security attacks on worker nodes reaches 0.5, the results show that TSAF can satisfy over 75% of the tasks within the object time, which is significantly higher compared to the existing big data processing schemes.


I. INTRODUCTION
C ONSIDERING the significant increase of internet of things (IoT) devices, drones, autonomous driving vehicles, and various interactive video services and systems, such as eXtended Reality (XR), the amount of data that needs to be processed in real-time is rapidly increasing. Because big data systems need to collect massive amounts of structured and unstructured data, big data systems are interconnected with many diverse networks, servers, blockchain databases, and clouds. As there are many interfaces to a big data system network, the possibility of a security attack is very high.
To protect a system in which various types of nodes and wired/wireless networks are combined, various security methods have been proposed. Among them, trust management technology has been very effective. Trust management can help maintain the security level by granting trust to members based on previous activities. Trust management technology has shown good protection results in systems where heterogeneous nodes and networks were used.
Several methods have been proposed to utilize trust management in big data systems. In [1]- [4], the security of big data systems are maintained by calculating the trust for data collected from the IoT sensors, unmanned aerial vehicles (UAVs), autonomous driving vehicles, and IIoT network. In [5]- [7], algorithms to manage user or data trust levels using blockchains and authentication were proposed. However, in these studies, trust of the worker nodes for big data processing is not considered. In [8], when big data is processed on the cloud platform, a trust level is assigned to the data, task, and big data processing resources. The scheme proposed in [8] maintains the security level of the data by allowing tasks to be processed differently according to the assigned trust level. In [9], when multiple clouds are connected to a big data system, trust of the corresponding clouds are determined, FIGURE 1. Real-time big data processing system model of TSAF, which includes one master node that manages several worker nodes and is assisted by a trust level database. Worker nodes can be divided into multiple clusters and each cluster is managed by a cluster head.
and an example of parallel big data processing is suggested. However, in this study, the cloud to which worker nodes are connected is not adapted in reference to the security situation, so the scheme may be vulnerable to varying attacks and have issues in computational complexity.
In this paper, a trusted streaming adaptive function (TSAF) big data scheme is proposed that uses trust management to identify malicious nodes in Spark big data systems, exclude them from job/task processing assignments, and calculate the number of nodes that can satisfy the processing object time.
The contributions of TSAF are as follows.
• TSAF is a novel trust-based security technique for Apache Spark big data processing systems that can satisfy predefined object times even when the ratio of malicious nodes performing security attacks on worker nodes is very high.
• TSAF reduces the recovery time caused by malicious nodes being assigned as worker nodes along with errors that occur during the communication and execution process considering past records.
• TSAF minimizes the computational complexity of determining the trust level of worker nodes through clustering. This approach focuses on the concept that parallel clustering based on divided big data nodes can enable faster big data processing.
In section II, the big data processing system model for TSAF is described. In section III, the TSAF scheme is proposed, where details on the trust management process and data processing time estimation are provided along with the trust management scheme's computation complexity analysis. The operational flowchart of TSAF is described in section IV. Furthermore, the performance of TSAF and other realtime big data processing schemes are compared through simulation in section V and the conclusion of this paper is provided in section VI.

II. SYSTEM MODEL
The big data system and network model investigated in this paper is described in Fig. 1. It is assumed that the worker nodes supporting the big data process are connected to a cloud. To process real-time streaming big data, the big data system has one master node and several worker nodes. Worker nodes are divided into several clusters and managed, and each cluster has one cluster head (CH). The master node communicates directly with the user and distributes the data coming into the system to each cluster. The CH performs the role of the master node in the cluster and also the role of the worker node at the same time. The system processes the streaming data in parallel using multiple clusters to satisfy the object time T object , and the result is reported to the user through the CH and master node [10].
Malicious nodes also exist among the worker nodes of the cloud system. Malicious nodes intentionally degrade the system performance inside the cloud (e.g., data export, no communication response, reduction in data processing speed, etc.).
The master node maintains a separate trust level database to manage the trust evaluation of all worker nodes. The trust level database stores the trust-related information of the existing worker nodes and verifies the trust information of the worker nodes reported by the CH based on synthesizing the trust level data (in the database) and the trust information reported by the CHs. In addition, when it is necessary to adjust the clusters (due to a changes in the number of worker nodes), the trust level database is used to designate nodes with the highest trust level to serve as CHs.

III. TRUSTED STREAMING ADAPTIVE FUNCTION
The TSAF scheme is divided into two parts. In part 1, calculation of the optimal number of clusters is conducted while considering the trust calculation of the worker nodes in the system and the computational complexity. In part 2, the scheme calculates the number of worker nodes required to satisfy the object time based on the cluster and honest worker nodes derived from the trust management process.
The trust management process (using the optimal cluster size) and the real-time big data process (that satisfies the object time) are performed simultaneously while the system is running.

A. TRUST MANAGEMENT PROCESS
GlobalTrust [11], [12] was used to calculate the trust level of the worker nodes, where all worker nodes report a local trust opinion (LTO) to the cluster head (CH) for trust evaluation. The reported behavior is observed based on the big data processing time for a certain period. TABLE 1 summarizes the variables and expressions used in the trust management process. LT O j,k is the LTO measured by worker node j for worker node k, which is calculated as where p is the number of times that data is normally processed as a positive event, n counts negative events, which increases whenever an abnormal behavior (e.g., exceeding the object time required for processing or an error occurs) in the data communication process occurs.
where sim(w, j) is the cosine similarity of the LTO reported by worker nodes w and j. HR j represents the hierarchical rank of node j in which non-CH worker nodes are 1, and CH nodes have a higher value. Based on the created SR, the CH creates a trusted quorum (D). Using the SR information of the nodes belonging to D, the behavior reputation (BR) for each node is calculated using (3).
Using the generated BR, the credibility reputation (CR) can be obtained using (4). CR measures the similarity between the LTO reported by a specific node u and the calculated BR, where the more similar the opinion is, a higher CR value is assigned where N is a group of nodes that LT O u,j = null. The global reputation (GR) can be obtained from the BR and CR using where ρ is a value between [0,1]. The trust information of the clusters is reported to the master node. The master node verifies the trust information of each cluster and performs its role such as excluding potentially malicious nodes from the data processing. Computational complexity (C total ) graph based on number of worker nodes (N ) and clusters (k). For a given N value, C total decreases from k = 1 to k = kopt and increases steadily after k = kopt.

B. TRUST MANAGEMENT SCHEME ANALYSIS
When performing trust management, TSAF calculates the number of clusters that minimizes the time it takes to calculate the trust level. Lemma 1 calculates the computational complexity of the trust management process, and Lemma 2 calculates the number of clusters k opt that minimizes the calculated computational complexity, which will effectively minimize the time consumed in the trust calculation process. Proof: When calculating the trust of worker nodes, the master node divides all N worker nodes into k clusters. At this time, there will be a maximum of N/k worker nodes in at least one cluster (where x refers to an integer greater than or equal to x).
The computational complexity inside the cluster is derived as follows. The trust scheme used in this paper creates a trust quorum to judge the trust levels [11], [12]. When all clusters simultaneously compute trust values, the computational complexity of the cluster with the largest number of worker nodes will be O( N/k 2 ). Since the accuracy of all trust values inside the cluster is determined through the created trusted quorum, the computational complexity inside the cluster will The master node performs the accuracy judgment on the k trust values reported from the CH once more as if it was performed inside the cluster, and the computational complexity at this time is O(k 3 ). Therefore, the overall computational complexity is O( N/k 3 +k 3 ).

Lemma 2:
The number of optimal clusters k opt that minimizes the computational complexity is nint( Proof: According to Lemma 1, the trust computational complexity of TSAF is O( N/k 3 + k 3 ). The value k that minimizes computational complexity can be obtained as VOLUME 4, 2016 follows.
The k values that satisfies (6) is ± √ N . Since N > 0, k must be greater than 0, so k = √ N minimizes the computational complexity.
The second derivative of the computation complexity equation is presented in (7).
Since N > 0 and k > 0, the second derivative of the computational complexity is always greater than 0, so it is a convex function.
Since the number of clusters k is a positive integer, the number of clusters that minimizes the computational complexity is k opt = nint( is the nearest integer to x). Fig. 2 is a graph showing the change of computational complexity according to the change of k and N . At the same k, the computational complexity increases as N increases. At the same N , the computational complexity continues to decrease until k reaches k opt , and has a minimum value at k opt . After passing k opt , the computational complexity continues to increase.

C. DATA PROCESSING TIME
Based on the previously calculated number of optimal clusters k opt , the master node divides the worker nodes into k opt clusters to process the data. The big data processing time (T P ) can be divided into two parts. The first is the time consumed by the master node to transmit data to the CHs (T m ), and the second is the time (T k c ) to process the data inside the k clusters. TABLE 2 summarizes the variable and expressions used in the data processing time analysis.
The master node transmits the real-time data to be processed to each CH. The transmission time (T m ) is where A = θ vs T baseline vs , B = θ comm T baseline comm , T m vs is the variable sharing time, T m comm is the data communication time, and T baseline is the time it takes to transmit unit-sized variables and data. In the variable sharing process, variables necessary for data processing are transmitted to k CHs. In the data communication process, data is transmitted as much as the ratio of honest worker nodes that belong to each cluster among all honest worker nodes. T m add is the time it takes to recover a communication error that occurs when data is transmitted to the cluster and is restored by retransmitting the data.
The CH distributes the received data to honest worker nodes. The time it takes to process data in cluster k (T k c ) is where T k vs is the variable sharing time, T k comm is the data communication time, T k exec is the execution time, and T k add is the recovery time of errors that occurred during data processing. Errors can occur in the communication process and data processing inside the cluster. The data processing time of cluster k (T k c ) with n k worker nodes, that also considers the recovery time caused by an error, can be expressed as follows.
T k c = T k vs + T k comm + T k exec + T k add = n k iA + Bs n k n n k + iM a n k n n k + P ecomm Bs n k n n k +P eexec n k iA + Bs n k n n k n k + iM a n k n n k n k = n k iA + Bs n + iM a n + P ecomm Bs n The total big data processing time (T p ) is as follows.
Proof: The number of worker nodes to process the data within the object time can be obtained by deriving the n that satisfies (13).
Using n k n/k and rearrangeing for n, the following can be obtained where n is the number of worker nodes, so n > 0. In addition, n 3 1 k (1+P eexec )iA +n 2 kiA+ Bs k +P ecomm Bs k −T object +n (Bs + iM a + P ecomm Bs) + kP eexec (Bs + iM a ) = xn 3 + yn 2 + zn + w ≤ 0 (16) where x = 1 k (1 + P eexec )iA, y = kiA + Bs k + P ecomm Bs k − T object , z = Bs + iM a + P ecomm Bs, w = kP eexec (Bs + iM a ). The solutions of the above cubic equation are n 1 , n 2 , and n 3 , which are given below (17) The optimal number of worker nodes N opt is the smallest real number among n 1 , n 2 , and n 3 , therefore, where min(x) is the minimum number of x.

IV. OPERATIONAL FLOWCHART OF TSAF
The operational flowchart of the TSAF scheme, which finds the optimal number of clusters (k opt ) needed to manage the trust level, as well as the number of worker nodes (N ) to satisfy real-time data processing, is presented in Fig. 3. In Fig. 3, the input includes the total number of worker nodes (n), the target time to process the received data (T object ), and the input data segment size (s) that is used as the unit level of data processing in Spark DStream. The parameters n, s, and T object can be changed by the administrator at the beginning or during the DStream data processing.
When TSAF starts, the big data scheme first checks whether the parameters have changed by comparing it with the previous parameters. If the parameters have changed, the value of k opt is recalculated. Next the master node distributes the task and data to the worker nodes of the k opt clusters. The master node selects the worker nodes with the highest trust level to serve as CHs. Then the trust level management process and the big data processing are performed simultaneously. In the big data processing, the total number of required worker nodes N opt is calculated using T m (the time for the master node to distribute data to each cluster), T k c (the time to process data inside the cluster), and T object . In the trust level management process, as described in Section III-1, the CH calculates the trust level of the worker nodes in the cluster based on the events that occurred during recent big data processing. The calculated trust level of worker nodes is reported to the master node. The master node performs the process of verifying the reported trust level. Afterwards, the worker nodes are divided into honest nodes and malicious nodes, and the number of honest nodes in the system (N H ) and the number of malicious nodes (N M ) are identified.
The obtained N opt is compared with N H , which is the estimated number of honest worker nodes that can be assigned to data processing. If N H ≥ N opt , the data is processed by N opt nodes. However, if N H < N opt , all N H nodes will be used to process the data.
The number of N opt can change as the input data size changes. As explained in section II, TSAF considers not only directly connected worker nodes, but also external worker nodes connected through the cloud. Therefore, if the number VOLUME 4, 2016 of honest worker nodes required to satisfy the object time is insufficient, the master node will also use worker nodes in the external cloud. However, when new worker nodes in the cloud are connected, there may be changes in the clusters of the Spark big data processing system due to a change in the total number of worker nodes in the overall system.

V. SIMULATION AND DISCUSSION
In this section, real-time big data processing technologies OptEx [13], SAF [14], SRSAF [15], and TSAF are compared using MATLAB 2020a based simulation. Most of the simulation parameters were performed according to the basic environment of Apache Spark Streaming. The big data targeted object time was set to 0.5 seconds, which is the shortest output sequence time for Apache Spark Streaming systems. The number of worker nodes in the Spark big data system is 75 and the error rate of the communication process and execution process was assumed to be 0.05. The batch size of DStream was chosen to satisfy the object time limit of 0.5 based on an errorless status condition. Fig. 4 shows the data processing time for each scheme according to the number of DStreams when there are no malicious nodes. In this case, no recovery time due to malicious nodes is needed. However, recovery time due to errors occurring in the communication and data processing is needed. In the case of OptEx, it was found that the object time could not be satisfied for any case when errors occur. In the case of SAF and SRSAF, a performance close to the object time is obtained, however, SRSAF and SAF each experience an object time violation when the number of DStreams are 4 and 5, respectively. The results show that the TSAF scheme can satisfy the targeted object time for all cases tested, and the processing time was consistently 1 ∼ 2% lower than that of OptEx, SAF, and SRSAF.
In the case of SAF, SRSAF, and TSAF, there is a section where the processing time suddenly decreases, which is due to a change in the number of worker nodes (which are positive integers) used by the schemes. In the case of TSAF, the change occurs more frequently than in the case of SAF or SRSAF, because more error recovery control variables are considered internally when calculating the data processing time based on changes in the number of DStreams. Fig. 5 shows the probability of processing data within the object time T object when malicious nodes exist. In this simulation, conflicting behavior attacks (CBAs) are applied [11]. In CBA, malicious nodes have a probability of 0.5 to perform wrong behavior on half of the honest nodes and pretend to be correct for the other honest nodes. In addition, the malicious nodes report LTOs as 1 for half of the honest nodes and report LTOs as 0 for the remaining honest nodes.
If there are no malicious nodes, all schemes satisfy the processing object time. However, as the ratio of malicious nodes increases in the OptEx, SAF, and SRSAF big data systems, the ratio that satisfies the object time rapidly decreases. This is because data processing is delayed by the malicious nodes, which takes more time to enter the recovery process and reprocess the DStream.   5 shows that when the ratio of malicious nodes performing security attacks on worker nodes increases, the probability of satisfying the object time decreases as there is a lack of honest nodes (i.e., the N H < N opt case) to complete the data process in time. When the ratio of malicious nodes performing security attacks on worker nodes reaches 0.5, Fig.  5 shows that the TSAF scheme can still satisfy over 75% of the tasks within the target object time, which is significantly higher compared to the other big data processing schemes. TSAF can perform at this level because malicious nodes are identified in advance and only honest nodes are used to process the DStreams.
If the ratio of malicious nodes is larger than 0.5, the processing success probability of TSAF may drop sharply. Glob-alTrust was used to make trust decisions in the simulation experiments [11], [12], and when the ratio of malicious nodes exceeds 0.5, the opinions of the malicious nodes become a majority, and when they collude, the efficiency of the trust management scheme drops sharply, making it impossible to distinguish malicious nodes.

VI. CONCLUSION
Network security procedures commonly slow down the response time of big data systems. To overcome this issue, in this paper a scheme to enhance the security and processing speed in networks with errors and malicious nodes is proposed. The proposed TSAF scheme is effective in identifying malicious nodes among the worker nodes and helps to satisfy the targeted object time of real-time big data processing jobs. The TSAF scheme first computes the number of clusters that results in the lowest trust computational complexity, and then the optimal number of nodes is derived and used to process the data within the targeted object time. When compared to OptEx, SAF, and SRSAF, in the case of no malicious nodes, the TSAF scheme shows a superior performance in satisfying the targeted object time. In addition, when the big data network has malicious nodes conducting CBA attacks, FIGURE 5. Probability of successful processing within T object according to the ratio of malicious nodes among the worker nodes. The proposed TSAF shows a significantly higher probability of satisfying T object by using a trust management scheme that defends against malicious node security attacks.
the TSAF scheme results in a significantly higher probability of satisfying the job processing object time compared to the other schemes.
When there are changes in the input data size and security attacks in the Spark big data processing system, TSAF can respond by immediately adjusting its system configuration and number of required worker nodes according to the monitored system circumstances.
In this paper, the proposed TSAF system can only make optimized adjustments corresponding to errors occurring during the communication and data processing of the Spark big data system. However, errors occurring during the initial setup time where Spark system configuration changes are made, and other related initial communication adjustments are conducted, were not considered. In future work, a more comprehensive optimized error recovery scheme needs to be developed so that accurate processing times can be predicted during the initial setup time as well.