Scalable Video Coding (SVC) [1] has been finalized as an amendment of the H.264/AVC standard. An SVC-coded bitstream contains several self-decodable subset bitstreams, which are referred to as layers. In spatial SVC, the layer with the minimum resolution is referred to as the base layer (BL), while the other layers with higher resolutions are referred to as the enhancement layers (ELs).
With SVC, when the network bandwidth is reduced, some frames of the enhancement layer may be dropped. In this case, for a two-layer spatial SVC system, the video quality is reduced to the base layer quality. The traditional approach to EL error concealment (EC) is to up-sample the corresponding BL frame. However, the up-sampled frame is of poor quality. Error concealment methods other than BL up-sampling have been proposed in [3], but even with the best-performing method, error-concealed frames often look blurry.
Suppose the current frame loses its EL information. Other EL frames in the video sequence are of high quality. However, motion between those frames and the current frame makes the high-quality EL frames difficult to use for compensating the lost current EL frame. To solve this problem, we propose to use the hallucination technique to exploit both BL and EL information. It does not try to use the high-quality EL frames to compensate the lost current EL frame directly. Instead, it uses a training process in which pairs of other BL and EL frames are learned to generate a database that represents the mapping from low-quality BL patches to high-quality EL patches. In the error concealment process, the BL frame and the information retrieved from the database, using an index generated from the BL frame, jointly provide high-quality concealment for the lost EL frame. The overall process is illustrated in Fig. 1.
The rest of the paper is organized as follows. In Section II, we present our proposed EC method for SVC in detail. We show the simulation results in Section III, and conclude in Section IV.
SECTION II
Proposed Error Concealment Using Hallucination for Spatially Scalable Video Coding
A. Previous Work
Without loss of generality, in this paper we consider two spatial layers only: one being the BL, and the other the EL, where the resolution of the EL is twice that of the BL in both the horizontal and the vertical dimensions. Several error concealment algorithms have been proposed in [3] for two-layer spatial SVC. Among these methods, the one with the best performance is motion and residual up-sampling (BLSkip), which up-samples both the motion vectors and the residuals of the BL, and performs motion compensation for the lost EL frame using the up-sampled BL motion vectors and residuals. "BLSkip" stands for "base layer skip", in which the spatial enhancement layer uses the base layer information in a way similar to that of the AVC SKIP mode [2].
The problem with BLSkip is that although the up-sampled BL motion vectors may be good, the up-sampled BL residuals may not be as good, so the error-concealed EL frame might still be as blurry as the up-sampled BL frame. The situation is especially severe when a key frame of the EL is lost, as its error can drift to many other frames. This can produce long runs of low-quality frames, and the video can look as if it goes out of focus from time to time, an artifact to which the human eye is quite sensitive.
Hallucination has become popular and is widely used in areas such as image/video up-sampling [5], [6], [7], and image compression [8]. To our knowledge, this is the first time hallucination has been applied to error concealment for SVC. The hallucination process is discussed in more detail in the following section.
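The motion up-sampling step of BLSkip can be illustrated with a minimal sketch. It assumes a motion field stored as one (mvx, mvy) vector per fixed-size block and a dyadic (2x) resolution ratio, so that each BL block covers a 2x2 group of EL blocks and each vector is scaled by 2. The function name `upsample_motion_field` and the data layout are hypothetical and are not taken from [3] or from the JSVM software; residual up-sampling and motion compensation are not shown.

```python
def upsample_motion_field(bl_mvs, factor=2):
    """Up-sample a block-based BL motion field to EL resolution.

    bl_mvs: 2-D list of (mvx, mvy) tuples, one per BL block (assumed layout).
    Each vector is scaled by `factor` and replicated over the
    factor x factor EL blocks that the BL block covers.
    """
    el_mvs = []
    for bl_row in bl_mvs:
        el_row = []
        for mvx, mvy in bl_row:
            # Scale the vector to EL resolution and repeat it horizontally.
            el_row.extend([(mvx * factor, mvy * factor)] * factor)
        # Repeat the expanded row vertically.
        for _ in range(factor):
            el_mvs.append(list(el_row))
    return el_mvs
```

For example, a single BL block with vector (1, -2) becomes a 2x2 group of EL blocks that each carry the vector (2, -4).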
B. Hallucination
We use the conventional hallucination method of [5]. As shown in Fig. 2, it is divided into two stages: training and synthesizing. Let (H0, L0) be a training frame pair, where H0 is the high-resolution frame decoded with EL information and L0 is the corresponding low-resolution BL-only frame. Let L0h be the up-sampled version of L0 (we use up-sampling method 0 of the JSVM reference software, which is based on a set of 6-tap filters derived from the Lanczos-3 filter), and let HP(L0h) be the output of a high-pass filter applied to L0h. In the training stage, co-located patch pairs, namely a patch pl on HP(L0h) and its co-located patch ph on (H0 − L0h), are extracted and stored in a database D. The database D thus represents the mapping from a BL frame to an EL frame: pl is the index, which can be generated from a BL frame, and ph is the information retrieved from other EL frames to conceal the currently lost EL frame.
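The training stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: frames are plain 2-D lists of luma values, the high-pass filter HP is approximated by subtracting a 3x3 box blur (the actual filter is not specified here), and the patch size and stride are arbitrary. The names `high_pass`, `extract_patch`, and `build_database` are hypothetical.

```python
def high_pass(img):
    """Assumed high-pass filter: subtract a 3x3 box blur (clamped at borders)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = img[y][x] - total / 9.0
    return out


def extract_patch(img, y, x, size):
    """Return the size x size patch with top-left corner (y, x), flattened."""
    return [img[y + dy][x + dx] for dy in range(size) for dx in range(size)]


def build_database(H0, L0h, size=3, stride=1):
    """Collect co-located (pl, ph) patch pairs.

    pl is taken from HP(L0h) and serves as the search index;
    ph is taken from the residual (H0 - L0h) and carries the EL detail.
    """
    h, w = len(H0), len(H0[0])
    hp = high_pass(L0h)
    residual = [[H0[y][x] - L0h[y][x] for x in range(w)] for y in range(h)]
    database = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            pl = extract_patch(hp, y, x, size)
            ph = extract_patch(residual, y, x, size)
            database.append((pl, ph))
    return database
```

Here `L0h` is assumed to be already up-sampled to the size of `H0`; the JSVM up-sampling filter itself is not reproduced.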
In the synthesizing stage (i.e., the process of error concealment), we only have the low-resolution BL frame L1 to hallucinate its high-resolution frame H′1. For every eligible patch pl on HP(L1h), we find its nearest neighbor p′l in D by the Approximate Nearest Neighbor search [9], and add the corresponding patch p′h onto L1h to hallucinate the high-resolution H′1 (note p′l and p′h form a patch-pair in D, and these retrieved p′h patches form the hallucinated EL information).
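A minimal sketch of the synthesizing stage is given below. It makes several simplifying assumptions that are not stated in the paper: the database is a list of (pl, ph) pairs of flattened patches, an exhaustive search under squared Euclidean distance stands in for the Approximate Nearest Neighbor search of [9], every non-overlapping tile is treated as an eligible patch, and pixel values are not clipped. The names `nearest_patch` and `synthesize` are hypothetical.

```python
def nearest_patch(query, database):
    """Exhaustive nearest-neighbour search (a stand-in for the ANN search)."""
    best_pair, best_dist = None, float("inf")
    for pl, ph in database:
        dist = sum((a - b) ** 2 for a, b in zip(query, pl))
        if dist < best_dist:
            best_pair, best_dist = (pl, ph), dist
    return best_pair


def synthesize(L1h, hp_L1h, database, size=2):
    """Hallucinate H'1 by adding retrieved ph patches onto L1h.

    L1h:     up-sampled BL frame (2-D list).
    hp_L1h:  high-pass filtered L1h, same size as L1h.
    The frame is tiled with non-overlapping size x size patches (assumption).
    """
    h, w = len(L1h), len(L1h[0])
    out = [row[:] for row in L1h]
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            query = [hp_L1h[y + dy][x + dx]
                     for dy in range(size) for dx in range(size)]
            _, ph = nearest_patch(query, database)
            # Add the retrieved high-frequency patch onto the up-sampled frame.
            for dy in range(size):
                for dx in range(size):
                    out[y + dy][x + dx] += ph[dy * size + dx]
    return out
```

A practical implementation would replace the exhaustive search with an approximate index, since the cost of the search above grows linearly with the database size for every patch.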
The basic idea of hallucination is similar to that of "image analogy" [4]. One or more pairs of high-resolution and low-resolution images are available. A mapping (or analogy) from the low-resolution image to the high-resolution image is learned by extracting corresponding patches into a database. The mapping can then be applied to another low-resolution image to obtain its high-resolution counterpart.
C. Hallucination for Error Concealment of Spatially Scalable Video
To recover the lost EL frames in SVC, we do not use a general database to hallucinate them from their corresponding BL frames. Instead, the database for each frame is generated adaptively. Specifically, for a lost frame in the sequence to be hallucinated, the database is generated from the two most recently decoded I frames. In this way, the lost frame and the training frames are highly similar, so the mapping from the BL frame to the EL frame is more effective for the lost frame.
There are two ways of utilizing hallucination for error concealment. One way is "out-of-loop" hallucination, in which hallucination is performed outside the decoding loop and works as a pure post-processing tool. The bitstream with frame loss is first decoded, possibly with other error concealment techniques, and then some frames are replaced by hallucinated ones. However, this post-processing approach introduces extra delay. Also, due to error drift, the reconstructed video sequence may contain more erroneous frames than frames with enhancement layer information loss, and one has to decide which frames should be hallucinated and which should not. The other approach is "in-loop" hallucination, in which hallucination is performed inside the decoding loop, and the hallucinated frame can be further used as a reference frame for reconstructing other frames. Because a hallucinated high-resolution frame usually has higher fidelity than one produced by other error concealment methods such as BLSkip, the error drift can be attenuated. There is also no extra delay for in-loop hallucination, as long as the frames used for training are reconstructed before the frame to be hallucinated. Our hallucination algorithm works inside the decoding loop, and we find that in-loop hallucination performs better than the out-of-loop one.
As stated above, the performance of BLSkip is especially poor when an I frame is lost (in this case, the quality of the error-concealed enhancement layer frame is reduced to that of the up-sampled base layer frame). When the lost frame is not an I frame, its residual energy is relatively small, so up-sampling the base layer residual may be enough for error concealment, while hallucination would incur more computational overhead. Thus, we apply hallucination only to lost I frames.
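The adaptive choice of training frames can be sketched with a small buffer that keeps only the two most recently decoded I frames. The sketch assumes that an I frame qualifies for training only when its EL was received, and that frames are passed to the buffer in decoding order; the class name `TrainingFrameBuffer` and its methods are hypothetical.

```python
from collections import deque


class TrainingFrameBuffer:
    """Keeps the (BL, EL) pairs of the most recently decoded I frames."""

    def __init__(self, capacity=2):
        # A bounded deque discards the oldest entry automatically.
        self._frames = deque(maxlen=capacity)

    def on_decoded(self, frame_index, frame_type, bl_frame, el_frame):
        """Record a decoded frame; only I frames with EL data are kept."""
        if frame_type == "I" and el_frame is not None:
            self._frames.append((frame_index, bl_frame, el_frame))

    def training_pairs(self):
        """Return the (BL, EL) pairs from which the database is built."""
        return [(bl, el) for _, bl, el in self._frames]
```

Because the buffer is bounded, the database built from `training_pairs()` always reflects frames that are temporally close to the lost frame.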
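The selection rule described above amounts to a simple dispatch on the frame type. The following sketch is illustrative only; the function name `choose_concealment` and the returned labels are hypothetical.

```python
def choose_concealment(frame_type, el_lost):
    """Pick the reconstruction path for one EL frame.

    frame_type: "I", "P", or "B".
    el_lost:    True when the EL data of this frame was lost.
    """
    if not el_lost:
        return "decode"          # EL data available: normal decoding
    if frame_type == "I":
        return "hallucination"   # lost I frame: use the patch database
    return "blskip"              # other lost frames: motion/residual up-sampling
```

In the in-loop configuration, the frame produced by either concealment path is placed in the reference buffer, so later frames are predicted from it.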
SECTION III
Simulation Results
The proposed method is tested under the standard packet loss model of [10], [11] and testing conditions of [12]. The following encoder settings are used:
JSVM 9.5 reference software with two-layer spatial scalability, where the base layer is QCIF and the enhancement layer is CIF;
GOP size = 16, Intra Period = 16, using hierarchical-B prediction.
We focus on enhancement layer loss only, since this is a common condition and assumption in SVC. All the test sequences are repeated to more than 4,000 frames as specified in [12]. We tested our EC method on nine sequences, including all five sequences recommended in [12] (Foreman, News, Paris, Stefan, Football).
Table I shows the average PSNR (in dB) for these sequences, with a 10% frame loss rate and the QP (Quantization Parameter) fixed to 28 for both BL and EL. A gain is observed on all sequences, and it is especially significant on News, Paris, and Container because of their relatively static scenes. In these cases, the low-resolution to high-resolution frame mapping learned from the nearby frames is particularly effective, due to the high similarity between the frames.
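For reference, the PSNR metric can be computed as in the sketch below. It assumes 8-bit samples (peak value 255), frames given as 2-D lists, and an average taken as the mean of per-frame PSNR values; the exact averaging convention behind the tables is not stated here, so this is an assumption. The function names are hypothetical.

```python
import math


def psnr(ref, test, peak=255.0):
    """PSNR in dB between two equally sized frames (2-D lists)."""
    count = 0
    squared_error = 0.0
    for ref_row, test_row in zip(ref, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


def average_psnr(ref_frames, test_frames):
    """Mean of per-frame PSNR values over a sequence (assumed convention)."""
    values = [psnr(r, t) for r, t in zip(ref_frames, test_frames)]
    return sum(values) / len(values)
```

Note that a frame reconstructed without any error yields an infinite PSNR under this definition, so a practical evaluation would need to cap or exclude such frames before averaging.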
To get a sense of how different QP settings affect the performance, Table II shows the average PSNR for two sequences (Foreman and News) with different QP values (28 and 36 for both BL and EL). The frame loss rate is fixed to 10%. The result indicates a reasonable drop in performance when the QP is high.
Table III shows the average PSNR for the same two sequences with different frame loss rates (3%, 10% and 20%), while the QP is fixed to 28 for both BL and EL. As expected, the higher the frame loss rate, the more frames will benefit from hallucination, and the higher the overall average PSNR will be.
In Fig. 3 we present the PSNR curve of the first 150 frames of News, with QP = 28 and frame loss rate = 10%. Our method clearly outperforms the state-of-the-art BLSkip. The PSNR curve of BLSkip also shows a very large quality variance across the sequence, to which the audience is very sensitive. The PSNR curve of our method has a smaller variance, so the error-concealed sequence has consistently high quality.
For visual quality, Fig. 4 shows the 37th frame of News, with QP = 28 and frame loss rate = 10%, as reconstructed without error, with BLSkip, and with the proposed approach, respectively. With our method, the concealed frame has much better visual quality than that of BLSkip.
SECTION IV
Conclusion & Discussion
In this paper we present a new method of error concealment utilizing hallucination for spatially scalable video coding. Simulation results show that this method achieves a significant performance improvement over the state-of-the-art error concealment for SVC, both in PSNR and in visual quality.
The effectiveness of hallucination lies in the high similarity among nearby frames in a video sequence, such that the information in one frame can be effectively employed for the error concealment of nearby frames. Our method performs particularly well on sequences with low motion, such as News, Paris, and Container. However, its performance on sequences with moderate and high motion is also satisfactory.
Hallucination, or more generally patch-based image and video processing, has gained significant attention recently. Future work includes mathematical analysis of the hallucination process and approaches for improving current hallucination performance.
Acknowledgment
The author would like to thank Yi Guo and Zhiwei Xiong of University of Science and Technology of China, Xiaoyan Sun and Yonghua Zhang of Microsoft Research Asia, and Jian Lou of University of Washington for their helpful discussions and suggestions.