Efficient High-resolution Video Compression Scheme using Background and Foreground Layers

Video coding using dynamic background frame achieves better compression compared to the traditional techniques by encoding background and foreground separately. This process reduces coding bits for the overall frame significantly; however, encoding background still requires many bits that can be compressed further for achieving better coding efficiency. The cuboid coding framework has been proven to be one of the most effective methods of image compression which exploits homogeneous pixel correlation within a frame and has better alignment with object boundary compared to traditional block-based coding. In a video sequence, the cuboid-based frame partitioning varies with the changes of the foreground. However, since the background remains static for a group of pictures, the cuboid coding exploits better spatial pixel homogeneity. In this work, the impact of cuboid coding on the background frame for high-resolution videos (Ultra-High-Definition (UHD) and 360-degree videos) is investigated using the multilayer framework of SHVC. After the cuboid partitioning, the method of coarse frame generation has been improved with a novel idea by keeping human-visual sensitive information. Unlike the traditional SHVC scheme, in the proposed method, cuboid coded background and the foreground are encoded in separate layers in an implicit manner. Simulation results show that the proposed video coding method achieves an average BD-Rate reduction of 26.69% and BD-PSNR gain of 1.51 dB against SHVC with significant encoding time reduction for both UHD and 360 videos. It also achieves an average of 13.88% BD-Rate reduction and 0.78 dB BD-PSNR gain compared to the existing relevant method proposed by X. HoangVan [18].


I. INTRODUCTION
The emergence of high-quality video, including ultra-high-definition (UHD) video and 360-degree immersive video, has initiated new and exciting applications in virtual reality (VR), augmented reality (AR) and mixed reality (MR) across education, training, entertainment, and other markets [1] [2]. The bandwidth-intensive nature of this new generation of video content poses challenges for transmission and storage, while low-latency delivery must still be ensured [3]. Furthermore, because users differ in network capacity, display, power, and computing capability, other video formats (e.g. SD, HD) remain in demand for adaptation to N-screen devices, where different terminals are involved in managing video content [4]. In this scenario, scalable video coding has emerged as a viable solution. Scalable High-Efficiency Video Coding (SHVC) [5], the scalable extension of High-Efficiency Video Coding (HEVC), is the latest standard for scalable video content. The SHVC bitstream comprises one base layer (BL) and one or more enhancement layers (ELs). By leveraging inter-layer prediction (ILP) between the BL and the ELs, SHVC achieves a coding performance improvement of 30% over simulcast HEVC at the cost of high computational complexity [6]. However, SHVC needs further compression for UHD/360-degree videos, because their high resolution and frame rate incur additional data overhead.
The purpose of this study is to develop a new compression strategy for high-resolution video by exploiting the layered concept of the SHVC scheme. In a video sequence, the foreground changes over time while the background remains static. A coding gain is therefore possible with a background frame, as the background needs to be encoded only once per scene. Existing background-based video coding techniques [7]- [9] either encode the background frame as an intra-frame with a coding standard or avoid encoding it by modeling it from the already decoded frames at both the encoder and the decoder. If encoded, the background frame takes many bits to reach a decent quality; if modelled, it does not provide a high-quality image because it is built from decoded frames. In practice, these techniques therefore encode the background frame in high quality, which requires a number of bits similar to an intra-coded frame. The existing cuboid-based techniques [10] [11] encode original frames in which both background and foreground exist; thus, they cannot exploit homogeneity to the expected level because of the dynamic nature of the foreground. The background has higher spatial homogeneity than the original frame, where foreground objects exist. It also has high temporal correlation among frames, so it can be compressed efficiently using cuboid coding.
In this paper, we apply cuboid-based partitioning to the background frame, where more homogeneity exists due to the static nature of the content. Thus, the frame can be partitioned more effectively, giving more compression for a given quality. The proposed scheme adopts the structure of the SHVC scheme, with foreground and background encoded in separate layers implicitly. We therefore refer to the BL as a reference layer (RL), which provides the EL with the Static information (background) from a Sequence of original frames ($S_{IF}S_{OF}$) of the video sequence. By exploiting the availability of the static information in the RL, the objective is to combine it with the EL stream to form an improved EL prediction stream and thus further improve the coding efficiency of the EL. To improve the overall coding efficiency, the RL is externally encoded using cuboid partitioning and further compressed with a 2-D Discrete Cosine Transform (DCT) scheme.
The main contributions of the paper are as follows:
• A new video coding strategy has been proposed by adopting the structure of the SHVC scheme, where an externally encoded coarse representation of the background is provided as the RL in one layer, and the original frame is provided in a separate EL layer. The RL is meant to provide the most common information of the frame sequence, while the immediate previous frame of the EL provides the motion information.
• The effectiveness of cuboid coding on the Static information frame of a Sequence of original frames ($S_{IF}S_{OF}$) has been studied to improve the overall coding gain as well as reduce time complexity.
• A modified coarse frame generation scheme has been adopted where, instead of replacing each cuboid by its mean intensity value as in [12], the DCT is applied on each cuboid and the visually most significant information is exploited by truncating high-frequency components.
• A comprehensive analysis of the Rate-Distortion (R-D) performance based on the reconstructed RL has been provided to understand the applicability of the proposed method in the SHVC video coding scheme.

II. LITERATURE REVIEW
A number of works have been carried out in the field of video coding to improve the performance of video compression.
In scalable coding, most works aiming at better coding efficiency focus on inter-layer prediction (ILP) based on decoded BL data. The works in [13]- [15] improved the performance of SHVC by taking an adaptive filtering approach to inter-layer reference prediction. Hoang et al. [16] proposed a joint layer coding mode that linearly combines BL and EL decoded information at pixel level to improve SHVC coding efficiency. In [17], a joint layer prediction (JLP) method was proposed to improve the performance of SHVC; by applying the decoded information obtained from both BL and EL, the JLP method creates a new prediction picture. The work in [18] proposed an adaptive long-term reference selection algorithm for surveillance cameras using scalable video coding. Based on content analysis of a video sequence, this approach selects a coded picture as the long-term reference picture. Though these mechanisms outperformed SHVC in terms of bit saving, they still required many extra bits to encode the ILP picture.
Apart from improving the coding efficiency of scalable video coding, various efforts have been adopted to reduce the computational complexity of SHVC encoder [19]- [22]. Most of these methods introduced fast CU depth level decisions and applied spatial scalability to reduce overall run time but experienced an increase in the bit rate.
However, these methods are still deficient due to having a trade-off between improved coding efficiency and reduced complexity. In the proposed work, our aim is to improve the coding efficiency of SHVC for UHD and 360-degree videos while preserving the quality as well as controlling the time complexity.

III. PROPOSED FRAMEWORK
The proposed method consists of three steps: (i) generation of the background frame, $S_{IF}S_{OF}$, (ii) external encoding of the $S_{IF}S_{OF}$ to reconstruct the RL stream, and (iii) encoding of the EL stream using the regenerated RL stream. The schematic diagram of RL stream generation is depicted in Fig. 1. In this method, we explore the effectiveness of an externally encoded RL with $S_{IF}S_{OF}$ in the SHVC encoding scheme. First, a few $S_{IF}S_{OF}$ frames are extracted from the corresponding chunks of original frames of a video sequence using an existing dynamic background modeling scheme [23]. Then the extracted background frames are segmented using cuboid partitioning [10]. To encode the RL stream externally, the cuboid-partitioned frames are compressed with a DCT scheme. The externally encoded $S_{IF}S_{OF}$ frames are then used to regenerate the RL stream, which is finally used to encode the EL stream of UHD and 360-degree videos.

A. GENERATION OF BACKGROUND FRAME, $S_{IF}S_{OF}$
Given a video sequence, $\nu \in \mathbb{R}^{P \times Q \times \eta}$, of resolution $P \times Q$ and frame number $\eta$, the aim is to extract the most static information over a number of frames, $I_p < \eta$. In a video sequence, scenes captured by a static camera have a steady background, $\beta$, over the frames, apart from the interference of moving objects and illumination changes in the foreground, $\xi$. Thus, a video sequence can be decomposed as $\nu = \beta + \xi$.
As the distribution of background pixels differs from that of the foreground ones, the $\beta$ layer can be formulated by judging the pixel deviation [24]- [26]. The recent state-of-the-art methods for $\beta$ layer subtraction use low-rank subspace learning approaches [27]- [31]. The standard approach used for background subtraction is Low-Rank Matrix Factorization (LRMF) [32]. For $r$-rank matrix factorization, the LRMF factorizes $\beta$ into two smaller matrices, $M \in \mathbb{R}^{PQ \times r}$ and $N \in \mathbb{R}^{I_p \times r}$, where $r < \min(PQ, I_p)$, such that
$$\beta = \mathrm{Fold}(M N^{T}).$$
Here, the operation 'Fold' folds up each column of a matrix into the corresponding frame matrix of a tensor. Thus, at each intra-period interval, $I_p$, a background frame, $\beta$, is generated.
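To make the factorization of the background into $M$ and $N$ concrete, the following is a minimal sketch for the rank-1 case ($r = 1$) using alternating least squares. It is an illustration under stated assumptions, not the background model of [23] or [32]: the function name `rank1_background`, the flat-list frame layout, and the plain squared-error loss are all choices made here for brevity. With a squared-error loss, strong foreground objects leak into the estimate, which is one reason robust variants are used in practice.

```python
def rank1_background(frames, iterations=20):
    """Rank-1 factorization of a video chunk (illustrative sketch only).

    frames: list of I_p frames, each a flat list of P*Q pixel values.
    Returns one background estimate per frame (row t of the rank-1 product).
    Assumption: plain least squares, fitted by alternating updates.
    """
    num_frames, num_pixels = len(frames), len(frames[0])
    n = [1.0] * num_frames   # per-frame weights, plays the role of N (I_p x 1)
    m = [0.0] * num_pixels   # background basis, plays the role of M (PQ x 1)
    for _ in range(iterations):
        # Fix n, solve for m in the least-squares sense.
        nn = sum(w * w for w in n)
        m = [sum(frames[t][i] * n[t] for t in range(num_frames)) / nn
             for i in range(num_pixels)]
        mm = sum(v * v for v in m)
        if mm == 0:          # all-zero video: nothing left to fit
            break
        # Fix m, solve for n in the least-squares sense.
        n = [sum(frames[t][i] * m[i] for i in range(num_pixels)) / mm
             for t in range(num_frames)]
    # "Fold": each row of the product is reshaped back to a P x Q frame by the caller.
    return [[m[i] * n[t] for i in range(num_pixels)] for t in range(num_frames)]
```

For a perfectly static chunk, every returned frame equals the input frame, since a repeated frame is exactly rank 1.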

B. CUBOID PARTITIONING
After $\beta$ extraction, the next step is to encode it externally using cuboid partitioning. The cuboid coding scheme divides $\beta$ of resolution $P \times Q$ into $n \cong n_s$ cuboids based on a user-defined number of segments $n_s$. The value of $n_s$ depends on the resolution of the video as well as its quality. In the segmentation process [33], $\beta_{P \times Q}$ is first separated into two half-cuboids, $\beta^{m_1}_{P \times Q}$ of size $m \times Q$ and $\beta^{m_2}_{P \times Q}$ of size $(P - m) \times Q$, using a vertical line $x = m + 0.5$ in $P - 1$ ways with $m \in \{1, 2, \cdots, P - 1\}$.
Similarly, using a horizontal line $y = n + 0.5$ in $Q - 1$ ways, where $n \in \{1, 2, \cdots, Q - 1\}$, $\beta_{P \times Q}$ can be split into two half-cuboids of size $P \times n$ and $P \times (Q - n)$. The pixel intensity contrast distance between a cuboid split pair, denoted as $D_{s_1} D_{s_2}$, is defined as in [34]. Then, to maximize the objective function, $f(s \mid \beta_{P \times Q})$, a greedy optimization heuristic is applied to find the best split of $\beta_{P \times Q}$ from the possible $P + Q - 2$ ways as
$$s^{*} = \arg\max_{s} f(s \mid \beta_{P \times Q}).$$
By recursively partitioning one cuboid into two halves using the optimal split, $s^{*}$, a hierarchical partitioning algorithm is designed, which terminates when no valid split remains. At the end, the algorithm returns a binary partitioning tree, $\Gamma$, of $\beta_{P \times Q}$ with horizontal or vertical split lines, and a cuboid map, $\lambda = \{s^{*}_{j}\}_{j=1}^{n_s - 1}$, found from the current frame. The partitioning of the image can be reproduced from the indices of $\lambda$. Thus, the index values ($\lambda$) must be sent from the encoder to the decoder so that the decoder can reconstruct the cuboid map and the frame. These indices are encoded with the Exponential-Golomb coding technique [35]. The encoded indices are then appended to the EL bitstream to be transmitted to the decoder.
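The greedy search over the $P + Q - 2$ axis-aligned splits and the recursive partitioning can be sketched as follows. This is a minimal illustration, not the authors' implementation: the contrast-distance objective of [34] is not reproduced here, so the sketch substitutes an assumed objective, namely the reduction in sum of squared error (SSE) obtained by splitting. The names `sse`, `best_split` and `partition` are hypothetical, and the image is a plain list of rows.

```python
def sse(img, r0, r1, c0, c1):
    """Sum of squared deviations from the mean over rows r0..r1-1, cols c0..c1-1."""
    vals = [img[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals)


def best_split(img, cuboid):
    """Try every axis-aligned split of one cuboid; return (gain, half_a, half_b).

    Assumed objective (stand-in for the one in [34]): SSE reduction.
    Returns None when the cuboid is a single pixel and cannot be split.
    """
    r0, r1, c0, c1 = cuboid
    parent = sse(img, r0, r1, c0, c1)
    best = None
    for m in range(r0 + 1, r1):          # splits along the first dimension
        gain = parent - sse(img, r0, m, c0, c1) - sse(img, m, r1, c0, c1)
        if best is None or gain > best[0]:
            best = (gain, (r0, m, c0, c1), (m, r1, c0, c1))
    for n in range(c0 + 1, c1):          # splits along the second dimension
        gain = parent - sse(img, r0, r1, c0, n) - sse(img, r0, r1, n, c1)
        if best is None or gain > best[0]:
            best = (gain, (r0, r1, c0, n), (r0, r1, n, c1))
    return best


def partition(img, n_s):
    """Greedily split until n_s cuboids exist or no valid split remains."""
    cuboids = [(0, len(img), 0, len(img[0]))]
    while len(cuboids) < n_s:
        candidates = [(best_split(img, c), i) for i, c in enumerate(cuboids)]
        candidates = [(s, i) for s, i in candidates if s is not None]
        if not candidates:
            break
        split, idx = max(candidates, key=lambda item: item[0][0])
        cuboids[idx:idx + 1] = [split[1], split[2]]
    return cuboids
```

Each cuboid is stored as a half-open `(row_start, row_end, col_start, col_end)` tuple. On an image whose left half is dark and right half is bright, the first split falls exactly on the boundary between the two halves, which is the behaviour that makes cuboids align with object boundaries better than fixed blocks.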
The number of cuboids, $n_s$, is determined based on the resolution and quality of the video as well as the bits, $n_b$, required by the cuboid partitioning. This is an optimisation problem. A number of video sequences with different texture and motion information have been analysed to determine the number of cuboids. In Table 1, we present only two results as an example to show how a suitable cuboid number has been chosen. From Table 1, it is observed that with the increase of $n_s$, both the PSNR and $n_b$ increase. The computational time increases as well. It is therefore necessary to choose a value of $n_s$ for which both the PSNR and $n_b$ are suitable. Analysing a number of video sequences, we found that, for HD videos, beyond $n_s > 3000$, $n_b$ increases but the PSNR increase is insignificant. The same holds beyond $n_s > 10000$ for UHD videos. Thus, in this study, we used $n_s = 3000$ and $n_s = 10000$ for HD and UHD videos, respectively.

C. COARSE REPRESENTATION OF RL STREAM
At this stage, a set of cuboids, $C = \{\beta^{1}_{P \times Q}, \beta^{2}_{P \times Q}, \cdots, \beta^{n_s}_{P \times Q}\}$, is found. To extract essential information from each cuboid $\beta^{i}_{P \times Q} \in C$, we apply the DCT, which transforms spatial-domain information into the frequency domain. In an image, low-frequency data usually contains the most common information, and high-frequency data contains detailed information. According to the information compaction property of the DCT, the most important information is concentrated in a few of the output data points of the 2-D DCT coefficient matrix. The first element (DC value) of the output data points carries the most important information of the original image. The remaining coefficients (AC values) carry the detailed information in decreasing order if arranged in a zigzag manner [36]. Since the low-frequency data has a much bigger effect than the high-frequency data, the least significant high-frequency coefficients are masked off and removed from the coefficient matrix by applying a data quantization process. After quantization, most of the high-frequency data points (lower right corner of the matrix) are zero. To maintain the balance between the perceptual video quality and the bitrate, we truncated a number of DCT coefficients while keeping the most significant values. It is found that beyond the top four coefficient values, increasing the number of coefficients further makes the number of bits needed to encode the DCT coefficients grow rapidly compared to the PSNR increase. Moreover, in the cuboid-based approach, it is found that as the number of cuboids increases, most of the cuboids contain null high-frequency values after the four top-left coefficient values. Thus, for the reconstruction of the RL frame, we have considered the top four significant coefficient values and truncated the rest.
Since the large DCT coefficients are concentrated in the low-frequency area [37], the top four significant values reside in the low-frequency zone having high energy compactness characteristics. Finally, the RL frame is generated using inverse DCT that uses the truncated coefficients where the high-frequency components are absent.
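The coarse reconstruction of a single cuboid from its top four DCT coefficients can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: it uses an orthonormal DCT-II written with the standard library only, and it assumes that "top four" means the first four positions in an approximate zigzag scan (ordered by diagonal index $j + k$). The names `dct_matrix`, `matmul`, `transpose` and `coarse_cuboid` are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis vector."""
    return [[math.sqrt((1 if k == 0 else 2) / n)
             * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
             for x in range(n)] for k in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def coarse_cuboid(block, keep=4):
    """2-D DCT of one cuboid, keep the first `keep` low-frequency terms, invert."""
    p, q = len(block), len(block[0])
    dp, dq = dct_matrix(p), dct_matrix(q)
    coeffs = matmul(matmul(dp, block), transpose(dq))      # forward 2-D DCT
    # Assumption: approximate zigzag order, sorted by diagonal index j + k.
    order = sorted(((j, k) for j in range(p) for k in range(q)),
                   key=lambda jk: (jk[0] + jk[1], jk[0]))
    kept = set(order[:keep])
    truncated = [[coeffs[j][k] if (j, k) in kept else 0.0 for k in range(q)]
                 for j in range(p)]
    return matmul(matmul(transpose(dp), truncated), dq)    # inverse 2-D DCT
```

With `keep=1` only the DC term survives and every pixel of the cuboid is replaced by the cuboid mean, which corresponds to the mean-based coarse frame of [12]; larger values of `keep` add progressively finer detail at the cost of more coefficients to transmit.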
Algorithm 1 represents the overall process of RL generation. Fig. 2 shows an example scenario of coarse background frame generation. The red marked block in Fig. 2a and Fig. 2b denotes the only moving object between frame 1 and frame 60. Thus, the background extraction model considers the rest a static scene and generates the background shown in Fig. 2c. Fig. 2d shows the cuboid partitioning of the generated background frame. Because the DCT is applied, the overall number of bits required to compress Fig. 2f is larger than that of Fig. 2e. To control the increase of bits, we have kept only the top four significant DCT components from each cuboid. As a result, the PSNR of the generated Fig. 2f is 0.9 dB higher than that of the mean reconstructed image (Fig. 2e).

Algorithm 1: RL Generation Algorithm
Notations: $\nu$ = video sequence, $I_p$ = intra-period, $\nu_{I_p}$ = sequence of frames within an intra-period, $\beta$ = extracted background from a sequence of frames, $C$ = list of cuboids, $\beta^{i}_{P \times Q}$ = a single cuboid of $\beta$, where $P \times Q$ is the dimension of $\beta$, $\Omega^{i}_{P \times Q}$ = matrix of DCT data points for cuboid $\beta^{i}_{P \times Q}$, $\Omega^{i}(j, k)$ = coefficient of the DCT matrix at position $(j, k)$.

Thus, the coarse background frame is generated using the cuboid map and the top four DCT coefficients from each cuboid. A new background frame is generated at each intra-period interval. Finally, all coarse representations of background frames are accumulated to reconstruct the RL stream, where the same frame is repeated between two consecutive I-frames. The information of $\lambda$ and the truncated DCT coefficient matrix are then fed into an Exponential-Golomb coder to obtain the compressed bitstream, which is transmitted to the decoder.
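As an illustration of the final entropy-coding step, the sketch below implements order-0 Exponential-Golomb coding for non-negative integers (suitable for split indices) together with a signed mapping (suitable for coefficients that may be negative). Several points are assumptions made here and not stated in the paper: the order of the code, the rounding of DCT coefficients to integers before coding, the signed mapping (the common one that sends 0, 1, -1, 2, -2 to 0, 1, 2, 3, 4), and the function names. Bits are held in a Python string for readability.

```python
def ue_encode(value):
    """Order-0 Exp-Golomb code word for a non-negative integer."""
    if value < 0:
        raise ValueError("ue_encode expects a non-negative integer")
    binary = bin(value + 1)[2:]
    # Prefix of (length - 1) zeros, then the binary form of value + 1.
    return "0" * (len(binary) - 1) + binary


def se_encode(value):
    """Signed variant: maps 0, 1, -1, 2, -2, ... to 0, 1, 2, 3, 4, ..."""
    return ue_encode(2 * value - 1 if value > 0 else -2 * value)


def ue_decode(bits, pos=0):
    """Decode one code word starting at bits[pos]; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def se_decode(bits, pos=0):
    """Inverse of se_encode; return (value, next_pos)."""
    code, pos = ue_decode(bits, pos)
    magnitude = (code + 1) // 2
    return (magnitude if code % 2 else -magnitude), pos
```

Small values receive short code words (0 is a single bit), which suits truncated DCT data where most quantized values are close to zero. Because every code word is self-delimiting, a decoder can read a concatenated stream by calling the decode function repeatedly from the returned position.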

IV. EXPERIMENTAL RESULTS
In this section, the results evaluating the performance of the proposed method are presented. To evaluate the coding performance of SHVC for two-layer SNR scalability, the most recent SHVC reference software, SHM-12.4 [38], is used. The experiments have been carried out as per the common test conditions on an AMD Ryzen 7 processor (PRO 3700Uw) running at 2.30 GHz with 16 GB RAM. To conduct the evaluation, several benchmark HD and UHD/360-degree test sequences in YUV 4:2:0 format with different motion characteristics and content variation [39]- [41] are used. All test sequences are encoded with the general coding options for the random access (RA) coding structure, with a hierarchical GOP of size 16 and an intra-period of 32. For studying the encoding performance, 120 frames have been selected from each of the sequences. The specifications of the test sequences are summarized in Table 2. In the proposed method, we have one EL along with the RL for SNR scalability. This method differs from traditional SHVC in that the RL contains only $S_{IF}S_{OF}$. The encoded bitstream of the cuboid map and DCT coefficients of the RL is appended to the EL bitstream to evaluate the performance of the proposed scheme. The performance of the proposed method is assessed in terms of Bjøntegaard BD-Rate, BD-PSNR [42] and execution time with respect to the original unmodified SHVC scheme.
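For readers unfamiliar with the BD-Rate metric, the following sketch shows one common way to compute it from four rate-distortion points per codec: fit a cubic polynomial to log10(bitrate) as a function of PSNR, integrate both fits over the overlapping PSNR range, and convert the average difference into a percentage. It is a simplified illustration and not the evaluation script used for the results reported here; the function names are hypothetical, exactly four points per curve are assumed, and some reference tools use piecewise cubic interpolation instead of a single cubic.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cubic_fit(xs, ys):
    """Coefficients c0..c3 of the cubic passing through four (x, y) points."""
    return solve([[x ** p for p in range(4)] for x in xs], ys)


def integral(coeffs, lo, hi):
    """Definite integral of the polynomial sum(c_p * x**p) from lo to hi."""
    return sum(c * (hi ** (p + 1) - lo ** (p + 1)) / (p + 1)
               for p, c in enumerate(coeffs))


def bd_rate(anchor, test):
    """anchor, test: four (bitrate, psnr) pairs each. Returns BD-Rate in percent."""
    lo = max(min(p for _, p in anchor), min(p for _, p in test))
    hi = min(max(p for _, p in anchor), max(p for _, p in test))

    def avg_log_rate(points):
        xs = [p - lo for _, p in points]   # shift PSNR for better conditioning
        ys = [math.log10(r) for r, _ in points]
        return integral(cubic_fit(xs, ys), 0.0, hi - lo) / (hi - lo)

    diff = avg_log_rate(test) - avg_log_rate(anchor)
    return (10 ** diff - 1) * 100
```

A negative result means the tested codec needs less bitrate than the anchor for the same PSNR. For example, a test curve that reaches every PSNR level at exactly half the anchor bitrate yields a BD-Rate of -50%.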
The performance comparison between mean and DCT reconstruction of a frame is analyzed in Fig. 3 for a background frame of the 'Broadway' sequence. The information loss caused by gradually suppressing high-frequency data points in the DCT coefficient matrix is also observed. From the figure, it is observed that the R-D curve for DCT with the top four coefficient values performs better than the R-D curves with five or six DCT coefficients and with the mean. In the proposed method, with the increase of cuboids, many
The performance comparison among the proposed method, traditional SHVC and the reference method [18] is presented in Fig. 4 for the 'Community', 'Shark Encounter', 'KiteFlite' and 'Broadway' video sequences. The proposed method is compared against the reference method [18] because that method uses the background as a long-term reference in the SHVC model. In our case, however, the background frame is further encoded to compress it while preserving quality. From the figure, it is observed that the proposed method outperforms the reference method in terms of bit rate saving, while the PSNR values are almost the same. Furthermore, the proposed method outperforms SHVC significantly in terms of both PSNR and bit rate.
In Fig. 5, the performance of the proposed method is compared for two intra-periods, 16 and 32, on the 'Broadway' and 'Shark Encounter' sequences. The results show an overall improvement in the performance of the proposed method for these two video sequences when the intra-period is doubled.
From the table, it is observed that the proposed method performs better for the normal video sequence ('Library') than for the HD/UHD/360 videos. Overall, the proposed approach achieves an average of 26.69% BD-Rate saving and 1.51 dB BD-PSNR gain over SHVC. Against the reference method, it achieves an average of 13.88% BD-Rate reduction and 0.78 dB BD-PSNR gain.
In terms of time complexity, the proposed method outperforms SHVC with an average time saving of 9.95%. It also outperforms the method proposed by X. HoangVan [18] with an average time saving of 10.02%. As multiple platforms (Matlab, SHM encoder) are used to encode the RL and EL separately, the execution time is approximated. We considered the time required to encode the EL, construct the background, and partition it into cuboids to calculate the runtime of the proposed method. The computational time to encode the RL was estimated from the ratio of RL encoding time to EL encoding time [43]. The ratio differs with the resolution of the test sequences. Finally, it was combined with the EL encoding time to estimate the overall coding time. An estimation of execution time is presented in Table 3, where the time for SHVC encoding is kept unchanged. For the proposed method, the time required for background generation and cuboid partitioning has been added to the encoder time. The proposed method needs extra time for background generation and cuboid partitioning; however, it saves significant encoding time as it does not need to explore the different intra-coding modes evaluated by the SHVC coding scheme.
Figure 6 presents the visual quality comparison of the proposed method, conventional SHVC and the method by X. HoangVan [18] against the original video frame. The example frame was taken from the 'Intersection' video sequence. Here, frames from the proposed method, the SHVC method and the reference method were encoded with adjusted QP values to meet a specific bitrate. The bitrate needed to encode the original frame (Fig. 6a) was 5.1 Mbps for the proposed method, 5.7 Mbps for SHVC and 5.8 Mbps for the reference method. To compare the image quality, let us concentrate on the boundaries of the car, wheel, and tree objects marked by the red squares in Figures 6a, 6d, 6g, and 6j, which are zoomed to larger sizes in the later images.
Indeed, the visual quality of the reconstructed frame obtained from the proposed scheme (Fig. 6g) is better than that of the SHVC frame (Fig. 6d) and the reference frame (Fig. 6j). For example, the boundary of the car object in Fig. 6b (marked by the green rectangle) is better defined in the proposed scheme (Fig. 6h) than in SHVC (Fig. 6e). The SHVC method produced a blurred boundary of the car object and an unclear wheel boundary, whereas the proposed method defined both clearly. The visual quality of the proposed method (Fig. 6h) is also better than that of the reference method: in the reference method (Fig. 6k), the boundary of the car object is better defined than in SHVC, but the boundary of the wheel is still unclear. Similarly, the tree objects in Fig. 6c (marked by the green rectangle) are well defined in the proposed scheme (Fig. 6i), but the SHVC scheme (Fig. 6f) and the reference scheme (Fig. 6l) failed to reconstruct them clearly. Since cuboids are formed and encoded based on homogeneous pixel intensity, the proposed method could define objects more clearly than block-based SHVC.

V. CONCLUSION
In this paper, an efficient video coding strategy for high-resolution video has been proposed using the coarse rep-

DAVID TAUBMAN (Fellow, IEEE) received the B.S. and B.E. degrees in electrical engineering from the University of Sydney, in 1986 and 1988, respectively, and the M.S. and Ph.D. degrees from the University of California at Berkeley, in 1992 and 1994, respectively. From 1994 to 1998, he was with Hewlett-Packard's Research Laboratories, Palo Alto, CA, USA. He joined UNSW, in 1998, where he is currently a Professor with the School of Electrical Engineering and Telecommunications. He has authored the book entitled JPEG2000: Image Compression Fundamentals, Standards and Practice, with M. Marcellin. His research interests include highly scalable image and video compression, motion estimation and modeling, inverse problems in imaging, perceptual modeling, and multimedia distribution systems. He received the University Medal from the University of Sydney. He has received two best paper awards from the IEEE Circuits and Systems Society for the 1996 paper entitled "A common framework for rate and distortion-based scaling of highly scalable compressed video," and from the IEEE Signal Processing Society for the 2000 paper entitled "High performance scalable image compression with EBCOT."