A Complexity-Aware Scheduler With Dynamic Slot Allocation for Cloud Video Transcoding

A multimedia cloud computing platform needs to perform video transcoding to provide bandwidth-compatible bitstreams for users. When configuring a MapReduce system, slots are allocated to workers under the assumption of homogeneous worker power, and task scheduling is performed by assuming equal task time complexity. In practice, however, both the computing power of a cluster of workers and the time complexity of tasks are time-varying. The task-scheduling algorithm has to manage these heterogeneous cloud resources and task complexities. We first find a good partition size for a video transcoding job for efficient processing. We then propose a Complexity-Aware Scheduling (CAS) algorithm that reorders task assignment priorities according to task complexity to maintain load-balanced operation. By utilizing a neural network model to refine the task complexity estimation for the CAS (denoted CASNN), the scheduling and transcoding speedup performance can be further improved. Based on the CASNN, we propose a Dynamically Adjusting Slot number allocation (DAS) method, DASCASNN, which adjusts the slots according to the resource utilization status to improve the processing performance. Experimental results show that the proposed DASCASNN reduces the transcoding time by 30% on average as compared to available schedulers and increases the resource utilization rate from 82.7% to 98%.


I. INTRODUCTION
For universal video access, a cloud computing-assisted video processing and streaming platform has to provide variable-bitrate video bitstreams for users [1]. It has to transcode the original video format into another for the target user device. For example, an original HEVC-format video may have to be transcoded into an H.264/MPEG-4 one. Other feasible scalable dimensions comprise spatial, quality, and temporal scalability. The Dynamic Adaptive Streaming over HTTP (MPEG-DASH) [2] service framework proposed by MPEG can dynamically adapt media streaming over HTTP to provide bandwidth-compatible video consumption. After requesting a connection, an MPEG-DASH client downloads a Media Presentation Description (MPD) file from the server, which helps the client select the next video segment to download and play back according to its available bandwidth. Video transcoding is computationally expensive and has to be carried out on a cloud computation platform to accommodate the computation load and support the MPEG-DASH service.

The associate editor coordinating the review of this manuscript and approving it for publication was Fabrizio Marozzo.
The cloud video streaming server is developed based on distributed resources and grid computation frameworks. One transcoding job on a cloud platform has to be partitioned into smaller segments for concurrent processing through task scheduling and assignment operations. The Hadoop MapReduce framework is widely adopted for developing a cloud platform to process parallelizable problems across different datasets through a computing cluster with N_w workers, W = {w_j}, j = 1, 2, ..., N_w. One big job, J, is partitioned into smaller tasks, say N_t tasks {t_i}, i = 1, 2, ..., N_t, which are distributed by a mapper to the Hadoop Distributed File System (HDFS) and to activated workers [3]. After all t_i's have been processed by their respective workers, the results are transmitted to a reducer that integrates the small tasks back into a finished job [4], [5]. Cloud computation is capable of handling tremendous distributed computing and storage resources. The task scheduling algorithm for a J dominates the efficiency of any distributed system, such as Grid, Cloud, and P2P networks [6]. However, minimizing the makespan, T_MS, through task scheduling is NP-complete [7]. There is no universal best task scheduling algorithm for all kinds of applications; scheduling algorithms are usually optimized for specific applications.
A cloud video transcoding framework [8] takes bitrates and encoding speed as costs to perform optimization control of a task scheduler. Another [9] sets highly load-balanced system operation as the control target in designing a parallel video encoding system. Divisible load theory is utilized to distribute video frames among workers, considering the load partition granularity and the associated overhead. A random distribution [10] and a round-robin scheduling [11], [12] approach were proposed to improve processing efficiency. The round-robin approach is easy to implement but less efficient when processing tasks with different time complexities under a heterogeneous W. A weighted round-robin method [13] considers each worker's processing capability when assigning tasks. An optimized scheduler dataflow [14] was proposed to minimize the job completion time and execution cost, and the best-compromise operating parameters were also determined. To avoid overflow in real-time transcoding applications, a dynamic resource allocation method based on prediction for cloud video processing [15] was proposed to allocate virtual machine resources. Some schemes [16], [17], [19], [20] are designed to dynamically adjust the number of slots to increase the system resource utilization rate. Different scheduling algorithms [21], [22], [23] were proposed to deal with task assignments under heterogeneous computer clusters.
When a cloud platform is managing a homogeneous W, it can predict the task time complexity or processing time to perform task scheduling for load-balanced operations [24]. In addition to scheduling algorithms, the input file partition size, z, also dominates the processing performance. It can be determined by mathematically derived models to yield the highest transcoding performance [10]. Although this z would be kept unchanged after system preconfiguration, performing scheduling by assuming homogeneous task processing time would suffer from degraded performance, as the practical execution time for one video segment depends on its complexity. A multi-layer input file decomposition method [25] was proposed to partition an input file F_in with different sizes to yield large and small segments. The small ones can be assigned to workers after all the larger ones to avoid a convoy effect [27] that would lead to load imbalance and a longer job processing time (i.e., makespan T_MS). To serve multiple jobs, the scheduler in [28] maintains an opportunistic environment and uses node availability rates and pairwise availability correlations to improve fairness. Its authors proposed to predict the task time complexity and monitor the storage location based on the fair scheduling algorithm to decide whether to reserve an available slot for other local-storage-supported tasks to reduce data access time. A dynamic predictive execution algorithm [29], [30] was proposed, which monitors delayed tasks and workers for task re-assignment to avoid a possibly longer makespan. A Hyb-SMRP algorithm that utilizes dynamic priority and localization-ID techniques was proposed to increase the data locality rate and reduce the completion time [31]; it is 2.19% faster than the First-In-First-Out (FIFO) scheduler and 0.79% faster than the Fair Scheduler (FS).
Although available scheduling methods can be utilized, how to perform cloud-based video transcoding has not been well addressed, especially when clients request segments of not-yet-transcoded tasks. In general, maintaining highly load-balanced operations when performing resource allocation and task assignment yields high processing speed. The scheduler has to monitor the workers' processing status and avoid excessive use of any particular worker while processing jobs.
In this research, we study how to utilize the resources of a cloud to process a video transcoding job, J, which requires high time-complexity video processing and large data segment access. We first investigate how to best partition an input video file, F_in, to minimize the T_MS of a J. We formulate the T_MS, including the packet upload time, concurrent processing time, and video packet processing time, under a specified cloud computation system configuration with a mathematical model, and derive an optimal partition size, z, that yields the minimum T_MS. Secondly, in addition to determining the best z, we also investigate multi-layer file partition methods [25] to improve cloud video processing efficiency. With the best z as one system configuration parameter, we conduct experiments on cloud video transcoding using different task scheduling algorithms under a heterogeneous W to design the best scheduler. Thirdly, for practical cloud transcoding operations, the online progress may not follow the task scheduling script that records the task assignment order and the worker to assign to. Under this condition, the task scheduler has to compromise between the content complexity, i.e., task time complexity, and the system status, i.e., task processing progress and resource utilization status, both of which can be considered time-varying variables, to best utilize computing resources. Fourthly, we investigate how to best predict the task processing time, which is found to dominate the performance of the online scheduler script, and perform cloud scheduling for load-balanced operations. Finally, combining all the factors, i.e., file partition size, task complexity prediction, scheduler script, and status of computing resources, we propose a Complexity-Aware Scheduler with Dynamic Slot Allocation based on refined task complexity prediction through neural network models, denoted as DASCASNN. In what follows, Section II introduces the research background.
Section III discusses cloud video transcoding operations, including determining the file partition size, ordering task assignments, and load-balancing control of cloud computation, and their impacts on video transcoding efficiency. The framework and operations of the proposed DASCASNN system are described in Section IV. Section V presents the experimental study. Section VI concludes this paper.

II. RESEARCH BACKGROUND
A. CLOUD COMPUTING
The system control flow for performing cloud transcoding is described with the aid of Fig. 1. The processing steps after the user submits a job are carried out in three stages: (1) initialization; (2) scheduling; (3) execution. At the initialization stage, the system partitions an F_in into segments, creates directories, and uploads files to the Hadoop Distributed File System (HDFS) for concurrent processing. This F_in can be a live-streaming source or an existing media file. For the latter, the processing target is to shorten the entire media transcoding time with the help of cloud computing. For the former, it has to shorten the single-job processing time and the inter-job waiting and transmission time to serve real-time processing. For example, one F_in is partitioned into N_t smaller segments, F = {f_i}, i = 1, 2, ..., N_t, in which one video group of pictures (GOP) is the basic decomposition unit, i.e., segments can be processed independently. When one F_in is ready, the system creates one J that requires N_t map tasks for the set of segments {f_i}, which will be assigned to different w_j's for concurrent processing. When dealing with live streaming, the input is a sequence of segments, each comprising a few GOPs. Some initialization procedures have to be carried out before the JobClient sends a job to the JobTracker, ψ_J, such as fetching the Job ID, creating HDFS directories, and transmitting the required JAR (Java Archive) application packets to the distributed file system. It then submits the job to the ψ_J through the Remote Procedure Call (RPC) protocol.
At the scheduling stage, when the ψ_J receives a J submitted by the JobClient, it creates a JobInProgress (JIP) to check the user's access right to queues and whether the allocated memory space can afford to execute the job. The TaskScheduler initializes the execution process. As jobs being initialized occupy memory space, to prevent memory over-occupation due to many initialized jobs queued to wait for execution, the ψ_J does not initialize job processing right after receiving a J but transfers it to the TaskScheduler. The TaskScheduler, in addition to assigning tasks to workers according to their resource utilization status, acts as a load balancer. FIFO, FS, and the Capacity Scheduler (CS) are widely used schedulers.
When the ψ_J receives a J, it uses three description layers, i.e., JIP, TaskInProgress (TIP), and TaskAttempt (TA), to trace the task execution progress. The ψ_J creates a JIP to monitor the overall job processing progress and creates several TIPs to trace the task execution progress. When one task execution fails, the ψ_J will restart it until the task is finished or too many restarts are detected. When all TIPs are finished, the corresponding JIP can be marked as a success. When the assignTasks function gets tasks ready for execution, it creates a task list for the J, represented as T(J) = {t_i}, i = 1, 2, ..., N_t, for the TaskScheduler to assign tasks through the RPC function.
At the execution stage, the TaskScheduler informs the corresponding TaskTracker, denoted as ψ_T^j, through the heartbeat, i.e., the HeartBeatResponse. In general, the ψ_T^j = ψ_T(w_j) monitors the task execution progress of a w_j. When a ψ_T^j receives the HeartBeatResponse and finds a new task assignment, t_i ∈ T, it invokes the task execution of the corresponding file f_i ∈ F. For one t_i, it creates a local working directory to store the uncompressed data from the JAR file and creates a TaskRunner that invokes a Java virtual machine to execute the t_i. Each ψ_T^j reports the task execution status to the ψ_J such that the ψ_J can monitor the progress of all t_i's. When a slot is released from a w_k that finishes a task execution, the w_k informs the ψ_J through the heartbeat, by which the ψ_J can assign a new t_i to the w_k. The proposed cloud-based streaming platform is developed based on this framework.

B. HADOOP SYSTEM RESOURCE MANAGEMENT
Denote the k-th slot allocated in a w_j as s_k^j, which can be considered a certification for the w_j to execute tasks. In a heterogeneous W, powerful w_j's can promise more slots to accommodate and execute more tasks [32]. How many slots can be allocated for one w_j depends on how many tasks it can execute concurrently. In a Hadoop system, each slot s_k^j is configured with Map or Reduce functions. For cloud computing, how to allocate the available computing resources to process jobs efficiently is a task assignment problem. The system has to handle the resource utilization of a heterogeneous W, monitor the CPU utilization rates of workers, P_U(w_j)'s, predict the task complexity, T_c(t_i), etc., to perform task assignments. However, it is NP-hard to perform resource allocation and task scheduling for multi-target optimization.
For system configuration, the processing power of the least capable worker, denoted as p_s, is used as the slot quantization unit. Denote the processing power of a w_j as P(w_j). The system calculates the quantized worker power, P(w_j)/p_s, to allocate a worker-affordable number of slots to each w_j, denoted as n_s(w_j). This slot allocation as a system configuration helps transform a heterogeneous W into one with homogeneous slot units, which makes task scheduling easier. After this pre-configuration, the system will not change the n_s(w_j). This static slot allocation approach may occasionally make the P_U(w_j) too high or too low. A Self-Adaptive Hadoop Scheduler (SAHS) [17] is designed to maximize the P_U(w_j). We modified the SAHS and designed the control steps as follows: (1) it executes a resource monitoring procedure to monitor the P_U(w_j); (2) it informs the ψ_J of the P_U(w_j) through the heartbeat and determines whether to increase or decrease the n_s(w_j). The resource monitoring procedure is carried out through the RPC function for the ψ_J to communicate with the ψ_T^j's. Each ψ_T^j executes the RPC procedure periodically to send the task execution progress and P_U(w_j) to the ψ_J, which can assign tasks to w_j's according to the reported status information. The communication and interaction operations act like the human heartbeat, helping the system: (1) make sure which ψ_T^j is active; (2) enable the ψ_J to update the status parameters of all w_j's; (3) assign tasks to w_j's with reference to the P_U(w_j) status. Each ψ_T^j periodically reports the worker processing status to the ψ_J, which may trigger new task assignments or slot allocation operations. Through the heartbeat, the ψ_J can keep the most updated information from the ψ_T(w_j)'s instead of performing tedious polling operations. The JIP has to record static and dynamic job processing status parameters. The former comprise the n_s(w_j) and the memory space needed for the map tasks.
The latter are the numbers of working and finished map tasks. The ψ_T^j is designed to monitor the task processing progress, calculate the P_U, and transfer these data to the ψ_J. To achieve a high R_U and maintain load-balanced system operation, the task scheduler has to dynamically reallocate slots with reference to the T_c(t_i)'s of different job queues. In this research, we derive and calculate a task partition size and justify the result through experiments, based on which a task scheduler and a dynamic slot allocation algorithm are developed to improve the overall cloud transcoding efficiency.
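The static slot pre-configuration described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the worker-power values and the helper name are hypothetical, and the quantization simply floors P(w_j)/p_s.

```python
def allocate_slots(worker_power, p_s=None):
    """Quantize each worker's power P(w_j) by the least capable
    worker's power p_s to get its slot count n_s(w_j)."""
    if p_s is None:
        p_s = min(worker_power.values())   # slot quantization unit
    return {w: max(1, int(p // p_s)) for w, p in worker_power.items()}

# Hypothetical heterogeneous cluster: powers in arbitrary units.
slots = allocate_slots({"w1": 2.0, "w2": 4.1, "w3": 8.3})
# w1 gets 1 slot, w2 gets 2, w3 gets 4
```

Once allocated, these n_s(w_j) values stay fixed in the static approach, which is exactly the limitation the dynamic slot allocation of Section IV targets.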

C. TASK ASSIGNMENT
Scheduling algorithms [33] for cloud computing are first reviewed for comparison. The FIFO scheduler processes a job queue sequentially. When utilized in a shared W, the FIFO is inefficient because large jobs prevent the execution of small jobs. Under this condition, adopting the CS or the FS helps yield higher R_U's. When adopting the FS, no predefined capacity is reserved for any job queue, and it can maintain balanced resource utilization among jobs. In comparison, the FS can process small jobs in a timely manner and can achieve high R_U's. The CS allows several job queues to share a W and configures each queue to utilize a predefined capacity of the cluster.
A queue elasticity [34] function is designed for the CS to utilize more resources than its capacity. In the Hadoop system, one task assignment is executed in three steps: (1) when one ψ_T^j detects available map slots in w_j, i.e., n_s^free(w_j) > 0, it informs the ψ_J through the heartbeat; (2) the TaskScheduler selects one job from a queue and selects one task for the ψ_T(w_j) to execute the required procedures; (3) the task is assigned for processing through the FS or another scheduler. The concurrently processed tasks of one job may be subject to unpredictable procedure errors, due to either computation or transmission malfunctions, which would reduce the processing speed and extend the job completion time. A speculative execution optimization technique [38] is adopted to eliminate the artifacts of such job disorders. It predicts the task processing time and assigns the task to the w_j with the earliest finish time.

D. VIDEO CODING STRUCTURE
For video transcoding, re-encoding a decoded video is the simplest approach but the most time-consuming. As the motion estimation and coding mode selection of H.264 account for 66% of the overall video encoding time [35], re-using these coding parameters to perform transcoding is widely adopted. Under this condition, the transcoding time would be independent of the video content complexity. Video is coded with a Group of Pictures (GOP) structure that comprises frames of type I (intra-coded), P (predictive-coded), and B (bidirectional-predictive-coded). The system has to partition a video file along GOP boundaries (i.e., closed GOPs) such that the partitioned packets can be processed independently and concurrently among different w_j's. If not, the GOP is called an open GOP and requires accessing data across w_j's for video coding/decoding. For the example shown in Fig. 2, the (k-1)-th GOP, GOP(k-1), is a closed GOP inside which every image can be processed independently. GOP(k) is an open GOP, and it needs to access frame P13 in GOP(k-1) and I16 in GOP(k) to transcode/process the B14 and B15 frame images. In partitioning video files, the segment size may be determined by the video length or the packet size. To solve the open-GOP problem mentioned above, the system can be designed to: (1) transfer the P13 frame to the worker that processes GOP(k); (2) copy the P13 frame to GOP(k); (3) copy GOP(k-1) to the packet that comprises GOP(k). Although these methods enable independent GOP processing, they increase the transmission bandwidth and processing power burdens. In general, a constant partition size is used for efficient packet transmission. We propose to include, before the partition boundary, the first I frame of the next GOP in the current partition. This partitioning method enables independent processing for cloud computing.
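The boundary handling just described can be sketched as below, assuming the stream is represented as a simple list of frame types; the function name and data layout are hypothetical and only illustrate the idea of duplicating the next GOP's leading I frame into the current partition.

```python
def partition_gops(frame_types):
    """Split a frame-type sequence at I-frame boundaries. Each segment
    also carries the first I frame of the next GOP, so trailing B frames
    that reference across the boundary can still be decoded locally."""
    starts = [i for i, t in enumerate(frame_types) if t == "I"]
    segments = []
    for k, s in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else len(frame_types)
        seg = list(range(s, end))            # frame indices of this GOP
        if k + 1 < len(starts):
            seg.append(starts[k + 1])        # duplicate next GOP's I frame
        segments.append(seg)
    return segments

# Two GOPs: "IPBBP" then "IPP"; the first segment also gets frame 5 (the next I).
segments = partition_gops(list("IPBBPIPP"))
```

The duplicated I frame slightly increases the transmitted data, which is the bandwidth cost noted above in exchange for fully independent per-worker processing.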
We conducted an experiment performing transcoding on segments comprising video data with 90 to 95 s of playback time, and it showed nearly uniform processing times, as shown in Fig. 3(a). Conversely, the video complexity showed no direct correlation with the transcoding time in this experiment. We can thus assume that the transcoding times for packets with the same video playback time are the same. Partitioning one video bitstream into 16-megabyte (MB) segments, each comprising a different number of video frames, the transcoding time is observed to be linearly proportional to the video length in time, as shown in Fig. 3(b). We can assume that, for videos with the same resolution, the transcoding time is proportional to the number of frames in a video segment/packet. This experiment on the transcoding time vs. video length relation justifies this assumption.

III. CLOUD VIDEO TRANSCODING
Since the video data amount is much larger than that of other data types and the time complexity of video data processing is non-homogeneous, the system has to manage load-balancing operations to speed up the processing. How to best partition a big video file is described in Sec. III-A. How to achieve load-balanced system operation to speed up processing is discussed in Sec. III-B.

A. FILE PARTITION SIZE AND EFFICIENCY
Assuming that a W comprises N_w workers and a submitted J with a video file F_in is partitioned into N_t segments, the cloud computing efficiency under different N_w and N_t has been investigated [25]. In general, N_w is fixed, and the computation load can be balanced when N_t ≥ N_w and (N_t mod N_w) = 0 in an environment where tasks with homogeneous time complexity are processed by a homogeneous W. However, this homogeneity assumption does not hold for practical cloud video processing: (1) when N_t = N_w, the startup time and packet transmission time are minimal; however, experiments showed that the entire transcoding time would be much longer, because same-size video segments require different processing times, and early-finished tasks have to wait for the latest one to finish the entire job; (2) when N_t > N_w, the partitioned video segment size is smaller and requires a shorter processing time. Under this condition, task scheduling and assignment can be performed with more flexibility to prevent workers from being idle. However, adopting smaller partition sizes results in more task initializations and more packet transmission overhead.
Let l_p denote the video length in time of one packet; the transcoding time of one packet, T_p, can be expressed as:

T_p = l_p / s,    (1)

where s denotes the efficiency of the worker in performing transcoding. The RAM size and CPU core number also affect the transcoding efficiency. In the Hadoop system, the first batch of video segments has to be transmitted to workers before being processed. As shown in Fig. 4, transcoding operations can be conducted only after the first batch of files has been uploaded. The T_MS to finish the transcoding job can be expressed as:

T_MS = L / (s · N_w) + (Z · T_S) / (z · N_w) + (N_w · z) / BW,    (2)

where L is the video length in time of an input video file F_in, N_w denotes the number of overall workers in a W, T_S denotes the task startup time, Z denotes the input file size, BW denotes the uplink bandwidth, and z is the partition size, which is the variable to be optimized. The first term in eq. 2 is the overall transcoding time (cf. eq. 1) divided by the number of workers, which is independent of z. The second term denotes the required overall startup time of the Z/z tasks under the N_w-worker setup. The third term is the upload transmission time of the first batch of packets before processing can start. In eq. 2, different z values lead to different T_MS's. To minimize the T_MS, we take the partial derivative of eq. 2 with respect to z:

∂T_MS/∂z = −(Z · T_S) / (N_w · z²) + N_w / BW = 0.    (3)

Further derivation yields:

z = sqrt(Z · T_S · BW) / N_w.    (4)

Eq. 4 can be used to find a proper partition size, z, for different system setups. By setting z < 64 MB, each video segment can be stored within one HDFS block. However, the video lengths in time of same-size segments will differ. The system has to maintain highly load-balanced operations to minimize the total transcoding time.
Eq. 2 is a simple cloud processing-time model that helps explain the operations. Computing z with the parameter set {BW = 20 Mbps, Z = 1024 MB, N_w = 4, T_S = 5 s} yields z = 80 MB. Different computer clusters with different processing power, data access rates, and bandwidths will demonstrate different processing efficiencies. Experiments on practical cloud video transcoding are needed to evaluate the efficiency under a given z, which will be discussed in Sec. V-A.
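As a quick check, eq. 4 can be evaluated directly. The snippet below plugs in the parameter set above as plain numbers, the same way the paper's example does (unit bookkeeping aside), which reproduces the z = 80 MB figure:

```python
import math

def optimal_partition_size(Z, T_S, BW, N_w):
    """Partition size minimizing T_MS in eq. 2: z = sqrt(Z * T_S * BW) / N_w (eq. 4)."""
    return math.sqrt(Z * T_S * BW) / N_w

z = optimal_partition_size(Z=1024, T_S=5, BW=20, N_w=4)
print(z)  # 80.0
```

Note that z grows with the startup overhead T_S and bandwidth BW but shrinks with the worker count N_w, matching the intuition that more workers favor finer partitions.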

B. TASK ASSIGNMENT FORMULATION
As video segments partitioned by z comprise different numbers of video frames, this leads to load imbalance and more processing time. As shown in Fig. 5(a), tasks in w_1 require longer execution times, and w_2 and w_3 have to wait for t_7 to finish the job. Another example is shown in Fig. 5(b): when the numbers of tasks assigned to different workers differ, workers with fewer tasks have to wait for the one with more tasks to finish the job. The task assignment method needs to be improved to solve this inefficient resource utilization problem. When there are N_t tasks and N_w workers with equal computation power, the estimated execution time of a t_i processed by a w_j can be denoted as

T_E(t_i, w_j).    (5)

The finish time of a t_i processed by a w_j is computed as

T_F(t_i, w_j) = T_F(t_x, w_j) + T_E(t_i, w_j),    (6)

where t_x denotes the task processed by the w_j right before the t_i is assigned. Let U denote the set of tasks not yet assigned; the minimum completion time (MCT) of a t_i can be represented as

T_min(t_i) = min_{w_j ∈ W} T_F(t_i, w_j).    (7)

The MCT algorithm sequentially selects one task t_i ∈ U and finds the w_j that yields the minimum T_min(t_i) to process the t_i. This MCT algorithm is designed to finish a job as early as possible. However, it cannot well process a video transcoding job with non-uniform task complexity. To improve the MCT efficiency in processing a video transcoding job, a Max-MCT [36] algorithm is proposed in which the next task to be assigned is the t_k, where

t_k = arg max_{t_i ∈ U} T_min(t_i).    (8)

In eq. 8, the Max-MCT algorithm estimates the finish times of all tasks t_i ∈ U, i.e., the T_min(t_i)'s, among which the maximum one is selected for execution first. In most cases, high-complexity tasks will be assigned first with this Max-MCT algorithm. The Min-MCT algorithm is carried out similarly to eq. 8, except that it selects the t_k where

t_k = arg min_{t_i ∈ U} T_min(t_i).    (9)

The makespan to finish the job J, denoted as T_MS(J), can be obtained by

T_MS(J) = max_{w_j ∈ W} T_F(t_last(w_j), w_j),    (10)

where t_last(w_j) is the last task assigned to the w_j. The resource utilization rate of a w_j, R_U(J, w_j), can be calculated by

R_U(J, w_j) = (Σ_{t_i assigned to w_j} T_E(t_i, w_j)) / T_MS(J),    (11)

and the system resource utilization rate in processing a J is calculated by

R_U(J) = (1 / N_w) Σ_{j=1}^{N_w} R_U(J, w_j).    (12)

Task scheduling for cloud computing can be considered a Knapsack Problem [37], in which the computer cluster resource is fixed and efficient scheduling methods are needed to increase the R_U. This sort of Knapsack problem, a combinatorial optimization, is NP-hard; it can be solved by a dynamic programming algorithm. However, the estimation error of the task complexity makes it difficult to utilize in practical applications.
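The MCT-style policies and the R_U measure of eqs. 10-12 can be illustrated with a small simulation. This is a sketch rather than the paper's implementation: the task times are hypothetical, and workers are assumed to have equal power, so T_E(t_i, w_j) reduces to the task time itself.

```python
def mct_schedule(task_times, n_workers, policy="fifo"):
    """Greedy assignment to the worker with minimum completion time
    (eq. 7), with optional Max-/Min-MCT task ordering (eqs. 8/9)."""
    if policy == "max":
        task_times = sorted(task_times, reverse=True)  # Max-MCT: longest first
    elif policy == "min":
        task_times = sorted(task_times)                # Min-MCT: shortest first
    finish = [0.0] * n_workers                         # current T_F per worker
    for t in task_times:
        j = min(range(n_workers), key=lambda k: finish[k] + t)  # eq. 7
        finish[j] += t
    makespan = max(finish)                             # eq. 10
    r_u = sum(finish) / (n_workers * makespan)         # eqs. 11-12 combined
    return makespan, r_u

tasks = [2, 8, 3, 7, 1, 6, 2, 5, 4]                    # hypothetical times (s)
mk_fifo, ru_fifo = mct_schedule(tasks, 3, "fifo")
mk_max, ru_max = mct_schedule(tasks, 3, "max")
# Max-MCT yields the shorter makespan on this example
```

On this example the Max-MCT ordering balances the load better than plain arrival-order MCT, foreshadowing the comparison in Sec. III-D.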

C. LOAD-BALANCED CLOUD COMPUTING
To eliminate the convoy effect, a Multi-resolution Division Load Balancing Algorithm (MDLBA) [25] was proposed.
It is designed to partition one video file with two or more size levels. For two levels, the second size is much smaller than the first, so the second-level segments require much shorter transcoding times. As discussed in previous sections, adopting a larger z for cloud transcoding results in a severe convoy effect, in addition to low task assignment flexibility. Both drawbacks can be mitigated by adopting a smaller z: although it increases the number of task initializations, it enables short task execution times and highly flexible task scheduling to achieve load-balanced operation. Transcoding one large segment of size Z_s is faster than transcoding Z_s unit-size small segments. To speed up processing, the scheduler can first assign the large-segment tasks and then the small ones, where the latter help eliminate the convoy effect caused by the former. As too many small-segment tasks bring extra trivial initialization procedures, only a few small segments are used to compensate for the execution time disparity, as shown in Fig. 6. The Max-MCT scheduling algorithm is adopted for this MDLBA experiment, abbreviated as MDMCT. High-complexity tasks are assigned first, where the complexity is determined by the video length of the segment/packet. The size ratio between small and large segments is around 1/4 to 1/5. Practical MCT and MDMCT task assignment examples are demonstrated below. Assume N_w = 3 and N_t = 12, with the individual task execution times shown in Table 2, comprising 9 high- and 3 low-complexity tasks. The MCT sequentially assigns tasks according to eq. 7, and the process is shown in Fig. 7. As shown, after assigning all high-complexity tasks, w_2 and w_3 have to wait for t_9 in w_1 to finish the job (Fig. 7(e)), and the schedule suffers a severe convoy effect even after the low-complexity tasks t_10, t_11, and t_12 are assigned to w_1, w_2, and w_3, respectively, as shown in Figs. 7(f), (g), and (h). This shows that the low-complexity tasks cannot eliminate the convoy effect caused by the uneven execution times of the high-complexity ones. In short, the MCT does not take the entire set of task complexities into consideration, so several high-complexity tasks may be assigned to the same worker.
The first operation of the MDMCT is the same as that of the MCT. Its second operation, for t_6, is shown in Fig. 8(b). Fig. 8(f) shows the assigned tasks with high complexity. As the task complexity has been considered during scheduling, the MDMCT avoids assigning high-complexity tasks to the same worker, for load balance. The low-complexity tasks t_10, t_11, and t_12 were then assigned to w_1, w_3, and w_2, respectively, as shown in Fig. 8(h). The T_MS and R_U before and after assigning the low-complexity tasks, denoted as method_1 and method_2, respectively, are shown in Table 3. In comparison, the MDMCT yields better load-balanced operation and a T_MS that is 11% shorter than that of the MCT. In terms of R_U, the MDMCT demonstrated a higher R_U before assigning the low-complexity tasks, while the MCT demonstrated higher R_U's after assigning the shorter-segment tasks. Overall, the MDMCT yields less convoy effect and better load-balanced operation. In general, adopting a smaller z yields a longer T_MS because more tasks need to be initialized. However, adopting a smaller z can effectively reduce the convoy effect and yield a higher R_U before processing the second-partition segments [26]. When the second-partition segments are involved, the R_U can be increased to more than 98% for all z's.
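The effect of the second-layer assignment on R_U can be sketched with a toy run; the segment times below are hypothetical, workers are assumed to have equal power, and the within-layer policy is a simple Max-MCT, so this only illustrates the mechanism, not the exact Table 3 numbers.

```python
def assign_layer(task_times, finish):
    """Max-MCT within one layer: longest tasks first, each to the
    earliest-available worker (equal worker power assumed)."""
    for t in sorted(task_times, reverse=True):
        j = finish.index(min(finish))
        finish[j] += t
    return finish

large, small = [10, 9, 8, 6, 5, 4, 3], [2, 2, 1]    # hypothetical times (s)
finish = assign_layer(large, [0.0, 0.0, 0.0])       # layer 1: large segments
ru1 = sum(finish) / (3 * max(finish))               # R_U before small tasks
finish = assign_layer(small, finish)                # layer 2 fills the idle gaps
ru2 = sum(finish) / (3 * max(finish))               # R_U after small tasks
```

In this toy run, the small tasks raise R_U from about 88% to about 98% without extending the makespan, mirroring the qualitative behavior reported for the MDMCT.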

D. LOAD BALANCE PERFORMANCE
We take practical examples of the Min-MCT and Max-MCT algorithms to demonstrate and analyze their load-balancing capability. The Min-MCT estimates the MCTs of all t_i's, t_i ∈ U, and records them in a table. After selecting the task whose MCT is minimum by eq. 9, it updates the MCT table for the next task assignment. One Min-MCT example is shown in Fig. 9(a), in which the Min-MCT first selects w_1 to process task t_1, as T_F(t_1, w_1) = 3 is the minimum. After updating the MCT table, since T_F(t_2, w_1) = 11 is the minimum, t_2 is assigned to w_1. It then selects t_3 for w_1 through the same steps. The resultant task execution times on the different workers are shown at the right of Fig. 9(a), which demonstrates that all tasks were assigned to the most capable worker, w_1. One Max-MCT example is shown in Fig. 9(b), in which it assigns t_3, t_2, and t_1 to w_1, w_2, and w_3, respectively, demonstrating better load-balanced operation as compared to the Min-MCT example. Their T_MS's are 23 s and 16 s, respectively.
The three scheduling methods, MCT, Min-MCT, and Max-MCT, are compared under the same system setup shown in Table 5, reconfigured to one master and four workers, each with one slot comprising two cores. Two-layer segment sizes of 60 MB and 15 MB are adopted. The T_MS and R_U performance evaluation is provided in Table 4. As shown, the R_Us of MCT_1 and Min-MCT_1 are 85.5% and 92.5%, respectively. The proposed MDMCT_1 helps eliminate the convoy effect and achieves a higher R_U of 96.7%. Neither MCT_2 nor Min-MCT_2 shortened the T_MS through the two-layer task assignment, but both increased the R_U, from 85.5% to 90.3% and from 92.5% to 97.7%, respectively. The MDMCT_2 increases the R_U to as high as 98.6%.
Because the MCT considers only the current task's complexity and selects the w_j that can finish it earliest, it leads to load-unbalanced operation, especially during the second half of the job, and the convoy effect cannot be eliminated by the two-layer task assignment procedure. The Min-MCT and the MDMCT select, from all unassigned tasks t_i ∈ U, the lowest- and the highest-complexity task, respectively, and both can yield load-balanced operation with less convoy effect. Since the Min-MCT assigns the lowest-complexity tasks first, the higher-complexity tasks assigned later may cause a severe convoy effect, which the second-layer task assignment can largely eliminate. The MDMCT assigns higher-complexity tasks first, so the later low-complexity assignments produce less convoy effect. The two-layer task assignment method thus not only reduces the convoy effect but also increases the R_U and shortens the T_MS. The MDMCT_2 is selected as the baseline for performing cloud computing in the following sections.
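The greedy selection that distinguishes the Min-MCT from the Max-MCT can be sketched as follows. The completion-time table and the toy numbers below are illustrative assumptions, not the paper's exact eq. 7/9 values or the Fig. 9 data; the sketch only shows how picking the maximum-MCT task first spreads work across workers while the minimum-MCT rule can convoy everything onto the fastest worker.

```python
def schedule_mct(task_times, variant="max"):
    """Greedy MCT scheduling sketch.

    task_times[i][j]: processing time of task i on worker j (assumed
    given).  Each round, every unassigned task's earliest completion
    time over all workers is computed; variant="min" assigns the task
    whose best completion time is smallest first (Min-MCT), while
    variant="max" assigns the largest first (Max-MCT).
    Returns (per-worker task lists, makespan T_MS).
    """
    n_workers = len(task_times[0])
    ready = [0.0] * n_workers                  # time each worker frees up
    assigned = [[] for _ in range(n_workers)]
    unassigned = set(range(len(task_times)))
    while unassigned:
        best = {}
        for i in unassigned:
            j = min(range(n_workers),
                    key=lambda j: ready[j] + task_times[i][j])
            best[i] = (ready[j] + task_times[i][j], j)
        pick_fn = min if variant == "min" else max
        pick = pick_fn(best, key=lambda i: best[i][0])
        finish, j = best[pick]
        ready[j] = finish
        assigned[j].append(pick)
        unassigned.remove(pick)
    return assigned, max(ready)
```

On a toy table where worker 0 is much faster, the Min-MCT variant convoys all three tasks onto worker 0 while the Max-MCT variant spreads them and finishes earlier.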

IV. CONTENT-AWARE SCHEDULING WITH DYNAMIC SLOT ALLOCATION
The control target of a cloud task scheduler is to shorten the entire job processing time. As P_U(w_j) is time-varying, the processing efficiency of the static slot allocation approach can be further improved. By monitoring the instantaneous P_U(w_j)s and task complexities through the heartbeat, the system can dynamically adjust the n_s(w_j)s to maintain highly load-balanced operation.

A. COMPLEXITY AWARE SCHEDULER (CAS)
Because tasks from the same job can differ in processing time, procedure errors, load-unbalanced operation, or uneven resource allocation may occasionally cause some tasks to require much longer execution times, such that the scheduler frequently invokes the speculative execution function and the processing speed decreases. When the progress of one task is delayed by more than 20% relative to its counterparts, it is considered to be falling behind. Under this condition, the worker, say w_k, found to yield the earliest finish time T_F(t_i, w_k) through eq. 7 is selected to re-process the task. It is preferable to predict the task complexity T_c(t_i) before speculative execution so that the system can perform the dynamic slot adjustment procedure.
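The falling-behind test and the backup-worker selection described above can be sketched as follows. Interpreting the 20% rule as a lag below the mean progress of the task's counterparts is an assumption on our part, since the text does not give an exact formula; the function names are ours.

```python
def falls_behind(progress, peer_progress, threshold=0.2):
    """A task is flagged as falling behind when its progress lags the
    average progress of its counterparts by more than `threshold`
    (0.2, i.e. 20%, per the rule described in the text)."""
    if not peer_progress:
        return False
    mean_peer = sum(peer_progress) / len(peer_progress)
    return progress < mean_peer - threshold

def pick_backup_worker(finish_times):
    """Select the worker with the earliest estimated finish time
    T_F(t_i, w_k) for speculative re-execution.  `finish_times` maps
    worker id -> estimated finish time."""
    return min(finish_times, key=finish_times.get)
```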
Both the Max-MCT and MDMCT algorithms were developed by assuming that the computing powers of the w_j ∈ W are homogeneous; they are efficient when every task processing time T_E(t_i, w_j) meets the task assignment schedule. To process transcoding jobs in a practical computer cluster, we propose a Complexity-Aware Scheduler (CAS), developed and improved from the MDMCT, the FS, and the LATES [38]. In the LATES, a new speculative execution function is designed to predict the overall job completion time, by which task assignment priorities are reordered for speedup. The CAS is similar to the Max-MCT: it is the dynamic counterpart of the static Max-MCT, i.e., the CAS is designed to practically assign tasks according to the Max-MCT schedule. As shown in Fig. 10, the CAS assigns high-complexity tasks to different w_js for load-balanced system operation, under which low-complexity tasks are automatically assigned last to eliminate the convoy effect, as the MDMCT does. The Max-MCT can be considered the benchmark of the CAS, and the MDMCT acts as an improved Max-MCT; however, the CAS can be easily implemented on a practical heterogeneous W. Experiments showed that the MDMCT slightly outperforms the CAS in T_MS and R_U performance, as will be presented in Section V-C. Section III-D shows that the MDMCT_2 procedure can effectively eliminate the convoy effect and shorten the T_MS of processing a J; from the viewpoint of processing a specific J or a batch of Js, it is an efficient cloud scheduling method for off-line files. For live-streaming services that demand low-delay playback, however, the task set available for scheduling, i.e., the buffering window for task assignment, must be kept small, and there is no specific file boundary.
Under this condition, the MDMCT_2 procedure becomes trivial, while the CAS can still efficiently schedule tasks across buffering-window boundaries. The CAS can be considered a fine-granular version of the multi-layer task assignment method, MDMCT_2. In summary, both the MDMCT and the CAS exploit video content complexity to speed up the transcoding process, but the latter is more suitable for practical live-streaming applications.
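The CAS dispatch rule described above, handing the highest-complexity unassigned task to whichever worker frees a slot first, can be sketched as a small event-driven loop. The complexity values, the speed model (processing time = complexity / worker speed), and the function name are illustrative assumptions.

```python
import heapq

def cas_dispatch(complexities, worker_speeds):
    """CAS sketch: tasks are sorted by complexity T_c(t_i) in
    decreasing order, and whenever a worker becomes free it receives
    the highest-complexity remaining task, so low-complexity tasks
    are naturally assigned last and the convoy effect is reduced.
    Returns the resulting makespan."""
    pending = sorted(range(len(complexities)),
                     key=lambda i: -complexities[i])
    # min-heap of (time the worker becomes free, worker id)
    free_at = [(0.0, j) for j in range(len(worker_speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for i in pending:
        t, j = heapq.heappop(free_at)
        t += complexities[i] / worker_speeds[j]   # assumed time model
        makespan = max(makespan, t)
        heapq.heappush(free_at, (t, j))
    return makespan
```

With two equal-speed workers and complexities [10, 8, 6, 2, 2], the small tasks pad out the tail of the schedule instead of leaving one worker idle.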

B. COMPLEXITY-AWARE SELF-ADAPTIVE SCHEDULER
In configuring Hadoop, each w_j is allocated a certain number of certificated slots, n_s(w_j), for executing tasks, and no n_s(w_j) changes during execution. Because neither task complexity nor worker processing power is homogeneous, the scheduling algorithm has to dynamically adjust the n_s(w_j) to accommodate both sources of variation. The Self-Adaptive Hadoop Scheduler (SAHS) [17] was proposed to increase or reduce the n_s(w_j) of each worker depending on its P_U at run time. For dynamic slot allocation, the system has to set up an initial maximum and minimum number of slots for each w_j, n_s^Max(w_j) and n_s^Min(w_j), and a threshold value Th_PU for P_U(w_j). In general, the system sets the initial n_s(w_j) based on the number of cores that w_j has.
With a large Th_PU, a w_j is likely to be allocated more slots while P_U(w_j) < Th_PU, which results in a heavy workload. On the contrary, a smaller Th_PU prevents the system from utilizing available slots, and the overall P_U will also be lower. Experiments verified that setting Th_PU = 80% yields good cloud computing performance [18]. When the ψ_J receives status information from a ψ_T(w_j), if P_U(w_j) < Th_PU, it checks whether: (1) no free map slots are left in w_j, i.e., n_s^free(w_j) = 0; (2) the current map slots are fewer than the upper limit, n_s(w_j) < n_s^Max(w_j); and (3) there are unassigned map tasks, i.e., |U| > 0. If all three conditions are satisfied, it allocates one more map slot and assigns one map task to this ψ_T(w_j) to improve P_U(w_j) and shorten the T_MS, as demonstrated in the lower-right block of Fig. 11. When P_U(w_j) > Th_PU, it checks whether n_s^free(w_j) > 0 and n_s(w_j) > n_s^Min(w_j); if both conditions are satisfied, it deletes one map slot from w_j to avoid load unbalance. Fig. 11 shows the flowchart of the modified SAHS control steps.
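The add-slot and remove-slot conditions above can be condensed into one per-heartbeat decision function. The function name and argument order are ours, not the SAHS implementation; only the conditions themselves come from the text.

```python
def adjust_slots(p_u, n_s, n_free, n_max, n_min, n_unassigned,
                 th_pu=0.80):
    """One SAHS decision on a heartbeat from worker w_j (Th_PU = 80%
    per the text).  Returns +1 to add a map slot, -1 to remove one,
    or 0 to leave n_s(w_j) unchanged."""
    if p_u < th_pu:
        # CPU under-utilised: add a slot only if every current slot
        # is busy, the upper bound is not reached, and map tasks wait
        if n_free == 0 and n_s < n_max and n_unassigned > 0:
            return +1
    elif n_free > 0 and n_s > n_min:
        # CPU saturated: retire an idle slot to avoid overload
        return -1
    return 0
```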
For both the MDMCT and the CAS, the system measures the complexity of a task, T_c(t_i), by the number of coded frames within the segment and creates a scheduling script for all tasks of the submitted J. According to this script, it assigns all tasks to W, which processes the J efficiently provided that all tasks finish on time. As P_U(w_j) is time-varying, the task processing time T_E(t_i, w_j) may not meet the schedule, so the job processing can exceed the scheduled T_MS. In addition to analyzing video content complexity, the SAHS monitors the P_U(w_j)s through heartbeats and adjusts the n_s(w_j) for better load-balanced operation. We call the method that combines the SAHS with the CAS the Dynamically Adjusting Slot number with Complexity-Aware task Scheduler, abbreviated DASCAS. The diagram of the proposed DASCAS is shown in Fig. 12.

C. PREDICTION COMPLEXITY BY ARTIFICIAL NEURAL NETWORK
In the DASCAS experiment, we used the number of video frames in a segment to represent the task complexity T_c(t_i). Experiments on multi-job schedules revealed that the task processing time depends on more than the number of video frames. Beyond the playback time of one segment (T_p), other video coding parameters, such as the resolution, the encoded bitrate, and the numbers of I, B, and P frames in a GOP, denoted n_I, n_B, and n_P, respectively, may also affect the task processing time. In general, the actual task processing time correlates with each parameter, but the video coding configuration parameters also affect each other. To precisely predict the task processing time from the different video coding parameters, these correlations have to be considered as a whole, which is a complicated task. An artificial neural network (ANN) can implement massively parallel computation for classification and prediction, which effectively solves this kind of problem: by setting proper numbers of layers, neurons, and learning iterations, it can combine the different correlations as a whole to perform precise prediction or classification [39]. We propose to take these six parameters, P_ANN = (resolution, bitrate, n_I, n_B, n_P, T_p), as the input of an ANN whose output is the predicted task processing time. The parameter set P_ANN can be extracted from video codestreams by the FFmpeg video processing tool; the first P_ANN parameter, resolution (H × W), is the total number of pixels in a frame.
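A small helper can assemble the six-element P_ANN input vector from segment metadata. The metadata dictionary keys below are hypothetical names of our own (the actual extraction, e.g. via FFmpeg/ffprobe, happens elsewhere); only the six features themselves come from the text.

```python
def build_p_ann(meta):
    """Assemble P_ANN = (resolution, bitrate, n_I, n_B, n_P, T_p)
    from pre-extracted segment metadata.  `meta` keys are assumed
    names, not FFmpeg output fields."""
    return (
        meta["height"] * meta["width"],          # resolution as H x W pixels
        meta["bitrate"],                         # encoded bitrate (Kbps)
        meta["n_I"],                             # I frames per GOP
        meta["n_B"],                             # B frames per GOP
        meta["n_P"],                             # P frames per GOP
        meta["playback_time"],                   # segment playback time T_p (s)
    )
```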

V. EXPERIMENTAL STUDY
A. VIDEO FILE PARTITION
An experiment on cloud video transcoding with different values of z was conducted to find the best operating parameter. One server computer with 2 CPUs (Intel Xeon E5-2620/2.0 GHz), 96 GB of RAM, and 24 cores is used as the system platform. We set up 1 master, 3 clients, and 4 slave workers comprising 2, 3, 4, and 6 CPU cores, respectively, i.e., W = {w_1(2), w_2(3), w_3(4), w_4(6)}, where w_j(s) means that w_j affords s slot certifications; in simpler notation, |W| = {2, 3, 4, 6}. VMware is installed on the server to provide the required computer cluster, and the FFmpeg video toolbox is adopted to perform video transcoding. Table 5 shows the experimental setup. The master acts as the central management unit, handling the HDFS NameNode and JobTracker functions; it also monitors the workers' progress and allocates resources. The client issues requests to the cloud system and uploads the required video file to the HDFS. It also handles the final output file, which can be DASH-compatible for streaming applications.
The first test video is Batman, with resolution 1920 × 800, 1598 Kbps, 1 h 28 m 11 s in length, and 1024 MB in size in H.264 format; the output format is H.264 at 480 × 360@172 Kbps. The MDMCT scheduler was adopted for this experiment and executed six times to obtain the average practical transcoding time; the processing time estimated by eq. 2 is used for reference. By setting s = 0.72, T_S = 5 secs., and BW = 20 Mbps, and using 4 workers, the practical (solid squares) and estimated (dashed line) T_MSs are plotted in Fig. 13(a). As shown, adopting a smaller z resulted in more task initializations and a longer T_MS, and the T_MS decreases as the system sets a larger z. Although a larger z reduces the task initialization time, it requires a longer transmission time, so the probability of transmission interruption also increases. These two factors cancel each other out, and the T_MS eventually stabilizes. For even larger segments, the transmission time grows and the T_MS increases slightly, as shown in Fig. 13(a). The second test video is MoneyBall, whose experimental result is shown in Fig. 13(b); the time-size plot demonstrates a similar relationship. Since the file size of the second video, 1610 MB, is larger than that of the first, setting a smaller z increases both the number of task initializations and the T_MS; under this condition, the best z is around 100 MB. Another experiment, shown in Fig. 13(c), demonstrated similar results with the FS scheduler and the configuration |W| = {1, 1, 2, 2, 2, 4, 4, 4}. These experiments suggest that the proper segment size for efficient transcoding is around 64~128 MB.
Experiments revealed that when adopting a small z for one J, the system has to perform more task assignment operations and spend more time on task initialization [26], i.e., a smaller z leads to a larger N_t. Although this achieves a higher R_U, it increases the T_MS, i.e., T_MS(14 MB) > T_MS(30 MB) > T_MS(60 MB), where T_MS(z) denotes the makespan when the file partition size is set to z. With the two-layer task assignment method, the system not only yields smaller T_MSs but also improves the R_U performance when adopting a larger z.
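The initialization-versus-transfer trade-off can be illustrated with a rough cost model. This is only a sketch: the paper's eq. 2 is not reproduced here, so the three per-task cost terms (start-up, transfer, transcode) and the default parameter values below are our assumptions, chosen to mirror the reported setup (T_S = 5 secs., BW = 20 Mbps, s = 0.72).

```python
import math

def estimated_makespan(file_mb, z_mb, n_slots,
                       t_init=5.0, bw_mbps=20.0, transcode_mb_per_s=0.72):
    """Rough makespan model versus partition size z.  Each of the
    N_t = ceil(F/z) tasks is assumed to pay a start-up time, a
    transfer time z/BW, and a transcode time proportional to z;
    the total work is spread over n_slots parallel slots."""
    n_t = math.ceil(file_mb / z_mb)              # number of tasks N_t
    per_task = (t_init                           # task initialization
                + (z_mb * 8) / bw_mbps           # transfer over BW (Mbps)
                + z_mb / transcode_mb_per_s)     # transcoding work
    return n_t * per_task / n_slots
```

Under this model the smaller partitions pay more accumulated start-up cost, reproducing the ordering T_MS(14 MB) > T_MS(30 MB) > T_MS(60 MB) noted in the text.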

B. PREDICT TASK PROCESSING TIME BY ANN
The target video format of transcoding is set to resolution 480 × 360 at 176 Kbps. Nine videos partitioned into 184 segments are used as training samples from which P_ANN is extracted to train the ANN model, which comprises six input nodes, one hidden layer with 100 neurons, and one output neuron. Given the P_ANN of a partitioned segment, the ANN output neuron provides the estimated processing time of that segment. The videos for training and testing are shown in Table 6. The Bayesian regularized back-propagation model is adopted to train the ANN parameters. The log-sigmoid function is adopted for the input layer, and the linear transfer function, purelin, is used so that the output can take arbitrary values. The metric for ANN convergence is the Mean Absolute Percentage Error, MAPE = (100%/N) × Σ_k |p_k − p̂_k| / p_k, in which p_k denotes the real output, i.e., the task processing time, p̂_k denotes the predicted output, and N is the total number of samples. The MAPE converges to 2.6% after ten or more iterative NN training steps. Note that the estimated transcoding time may not be accurate enough to represent the original one; however, it is accurate enough to help the system improve the scheduling performance. The task-time prediction accuracy of the trained ANN model is demonstrated in Fig. 14, where the solid black line denotes the original task execution times, ordered monotonically decreasing according to their practical transcoding times. In the DASCAS algorithm, the processing time of a video transcoding task, T(t_i), is determined by the playback time of the corresponding segment, T_p(f_i), i.e., T(f_i) ∝ T_p(f_i). For comparison, we utilize the ANN model to predict the T(f_i) in the DASCAS algorithm, a variant denoted DASCASNN. As shown in Fig. 14, the task execution times predicted by the DASCASNN are more coherent with the original data than those of the DASCAS, with MAPE values of 2.6% and 4.3%, respectively.
The DASCASNN can thus predict task processing times more accurately and helps improve the transcoding efficiency, as discussed in the following section.
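The convergence metric used above is straightforward to compute; a minimal implementation of the MAPE definition follows (the function name is ours).

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent:
    MAPE = (100/N) * sum(|p_k - p_hat_k| / p_k), where p_k is the real
    task processing time and p_hat_k the predicted one."""
    assert len(actual) == len(predicted) and all(a > 0 for a in actual)
    n = len(actual)
    return 100.0 * sum(abs(a - p) / a
                       for a, p in zip(actual, predicted)) / n
```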

C. CLOUD TRANSCODING EFFICIENCY
With the experimental hardware setup shown in Table 5, the numbers of CPU cores of the workers are configured as |W| = {1, 1, 2, 2, 2, 4, 4, 4} to act as a heterogeneous W. The file partition size z is 64 MB, and the test videos comprise MoneyBall, FireDragon, SkyFighter, and ToyStory. The transcoding times of these methods are shown in Fig. 15 and Table 7. To evaluate the speedup performance, we define the time reduction rate T̂ = (T_MS(j) − T_MS(k)) / T_MS(j) × 100%, where j and k denote the two scheduling methods to be compared and evaluated. Both the CS and the FS allocate system resources statically and consider no content complexity, so their average T_MSs are larger. The FS outperforms the CS in the T̂ performance because it manages load-balancing control. Let FS(2) denote the configuration in which two slot certificates are allocated statically to each worker; experiments showed that this leads to load-unbalanced operation. In the FS(p), the n_s(w_j)s are allocated according to the power of w_j, and the operational R_U increases, with T̂ = 14%. The CAS improves task scheduling performance according to video content complexity, assigning high-complexity tasks for processing before low-complexity ones; its R_U and T̂ are 90.3% and 16.5%, respectively. The MDMCT acts as a benchmark of the CAS, performing task scheduling based on both complexity-dependent task assignment and worker processing efficiency; besides, it adopts a two-layer segment size policy to eliminate the convoy effect. Its T̂ = 16.9% is larger than that of the CAS (14.3%). The SAHS improves the processing performance through resource management, dynamically adjusting the number of certificated slots of a worker to increase the worker R_U and the T̂, which are 94.8% and 20.6%, respectively.
The DASCAS, a combination of the CAS and the SAHS, yields a higher T̂ = 23.8%, as it effectively exploits the variation of both video content complexity and the P_U(w_j) status in task scheduling and assignment. The experiments on the CAS, the SAHS, and the DASCAS showed that precisely handling the variation of both content complexity and system P_U status contributes substantially to the system performance. They also revealed that the task complexity estimation error of the CAS can be partly compensated by the SAHS; however, the performance of the DASCAS algorithm still depends on accurate task complexity estimation to shorten the T_MS. In the DASCASNN, the content complexity is predicted in a more sophisticated way, which increases the T̂ to 30.4% and outperforms all the other methods, as shown in Fig. 15 and Table 7. Under static configuration, Hadoop cannot re-allocate slots under load-unbalance conditions, which results in smaller R_Us. Although the FS(p) is better than the FS(2), all slots may be busy while the average P_U(w_j) remains low. Moreover, the static configuration does not allow adjusting the n_s(w_j)s to improve the P_U performance; under this condition, the workers demonstrate quite different P_U(w_j)s, and heavily loaded w_js require longer task processing times. Compared to the FS(p), the proposed DASCAS (DASCASNN) can dynamically adjust the n_s(w_j) of a w_j according to P_U(w_j) and reduce the T_MS by 11.4% (19.1%), as shown in Table 7.
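The time reduction rate used throughout this section is the relative makespan saving of one method over another; a minimal implementation of the reconstructed definition follows (the function name is ours).

```python
def time_reduction_rate(t_ms_baseline, t_ms_method):
    """Time reduction rate T-hat between two schedulers: the relative
    makespan saving of a method over a baseline, in percent."""
    return 100.0 * (t_ms_baseline - t_ms_method) / t_ms_baseline
```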

D. MULTI-USER TASK PROCESSING
We also conducted scheduling experiments with more jobs of heterogeneous content complexity. As shown in Fig. 16(a), the FS(2) demonstrates the longest T_MS in all cases. The FS(p) helps shorten the T_MS by about 20% (4 jobs) and 6% (8 jobs). The MDMCT loosely reorders the task assignment priorities and yields T̂ values of 24% (4 jobs) and 9% (8 jobs). Under this condition, the proposed DASCAS/DASCASNN can save about 29%/35% (4 jobs) and 12%/21% (8 jobs) of the T_MS compared with the FS(2). As demonstrated, the T̂ becomes smaller when more jobs are submitted for processing, which is reasonable because the system computing resources become insufficient. When processing 4 jobs, the total active slots, counted as 20 in this case, have to process 70 tasks overall; the proposed DASCASNN can better utilize the available resources, as demonstrated in Fig. 17(a). For 8 jobs, the processing load on the limited computing resources is doubled, and the operational resource utilization capability of the DASCASNN was limited; also, the CAS becomes less efficient in reducing the convoy effect when more job queues are involved in task assignment. Queuing all job tasks together can help improve the R_U and P_U performance; however, the system design policy for multi-job queueing depends on the application requirements and is beyond the scope of this research.
Another experiment was conducted to investigate the P_U(w_j)s of the different algorithms during processing. The four test videos are the same as those listed in Table 6, and the pre-configured CPU cores of the workers are |W| = {1, 1, 2, 2, 2, 4, 4, 4}. The resultant P_U(w_j)s over the processing time are shown in Fig. 17(a). Compared with the FS(2), the CAS helps increase the average P_U and reduces the T_MS from 1,521 secs. to 1,143 secs. The proposed DASCAS utilizes the advantages of both the CAS and the SAHS and shortens the T_MS to 1,075 secs., i.e., T̂ = 29.7%. By adopting the NN model to precisely estimate the heterogeneous task complexities of the different jobs, re-arranging the task assignment order, and dynamically adjusting the numbers of worker slots, the DASCASNN increased the average P_U to near 100% and yielded the shortest T_MS = 980 secs. The number of total active slots during processing is shown in Fig. 17(b). In comparison, both the FS and the FS+CAS prohibit the scheduler from using available CPU resources and yield lower P_Us. The DASCAS can increase or decrease the n_s(w_j)s of specific w_js to maintain a high P_U and highly load-balanced operation. The n_s(w_j)s with different n_s^Max(w_j)s during the DASCASNN process are shown in Fig. 17(c), which shows that the DASCASNN can detect and adjust the n_s(w_j) of capable workers, i.e., the w_j(s = 4)s and w_j(s = 2)s, to better utilize the available CPU resources. The dynamic configuration design, together with the NN-model-refined CAS, enables the DASCASNN to better utilize computing resources and effectively eliminate the convoy effect; it yielded the highest P_U and R_U among all the mentioned schedulers.

VI. CONCLUSION
In this research, we proposed the DASCASNN method, which dynamically adjusts the number of slots of a worker according to CPU utilization rates and task time complexity to improve cloud computing efficiency. We integrated the cloud concurrent processing framework, video transcoding techniques, and neural-network-based task scheduling algorithms to provide an MPEG-DASH-compatible cloud video transcoding/streaming platform. We investigated the impact of video content complexity and file partition size on cloud video transcoding efficiency, based on which a complexity-aware task scheduler was proposed to cooperate with a dynamic worker slot number adjustment algorithm to enhance resource utilization and increase cloud video processing efficiency. The contributions comprise: (1) a mathematical model relating cloud video concurrent processing, bandwidth, input file size, and partition size, derived to help the system find a task partition size that yields the shortest processing time; (2) a complexity-aware scheduler that exploits the diversity of video content complexity and the time-varying resource utilization status, cooperating with a dynamic worker slot number adjustment method; (3) a neural network model for estimating the video transcoding time of tasks, which improves the accuracy of the complexity-aware scheduling script and helps shorten the overall transcoding time; and (4) the proposed DASCASNN, which improves the load-balancing efficiency of cloud computing and increases the system resource utilization rate from 84.9% to 98.1%, saving up to 30.4% of processing time compared with state-of-the-art Hadoop schedulers. Utilizing deep learning methods to further improve cloud computing efficiency is considered our future research.

HAN-YEN YU received the B.S.E.E. degree from the National Yunlin University of Science and Technology, Yunlin, Taiwan, in 2006, and the Ph.D.
degree in electrical engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 2015. He worked on NSC and ITRI research projects to design smart surveillance systems, face recognition systems, RFID location systems, and Internet streaming systems. In 2015, he joined the Digital Education Institute, Institute for Information Industry (III), Taipei. He conducted a project on the Taiwan Small-School Alliance, providing interactive real-time teaching through e-learning to meet the learning needs of remote areas across different provinces, where teacher shortages had long been a common situation. The plan successfully integrated the resources of the learning industry, charities, and enterprises providing real-time broadcast systems, started cross-school services, and gathered more than ten schools in three provinces to form the alliance. He is currently a Senior Engineer with the Digital Education Institute, Institute for Information Industry (III). His research interests include digital image processing, education technology, and image/video encoding.
CHING-CHENG HWUANG received the B.S.E.E. degree in electrical engineering from Tatung University, Taipei, Taiwan, in 2011, and the M.S.E.E. degree in electrical engineering from the National Taiwan University of Science and Technology, Taipei, in 2013. He is currently a Software Engineer with ASUS Corporation. His research interests include cloud image/video processing and cloud computation.
CHENG-JEI SUNG received the B.S. degree from the Electrical Engineering and Computer Science Department, Chung Yuan University, Taoyuan, Taiwan, in 2014, and the M.S. degree from the Electrical Engineering Department, National Taiwan University of Science and Technology. He is currently a Field Application Engineer with Teradyne, Inc. His research interests include cloud image/video processing and cloud computation.
YOU-SHIN WU received the B.S. degree from the Electrical Engineering Department, Taipei University of Technology, Taipei, Taiwan, in 2021. He is currently pursuing the master's degree with the National Taiwan University of Science and Technology. His research interests include cloud computing, deep image/video compression, and web real-time communication.
VOLUME 10, 2022