
A Survey on Intelligent Solutions for Increased Video Delivery Quality in Cloud–Edge–End Networks



Abstract:

The digital age has brought a significant increase in video traffic. This traffic growth, driven by rapid Internet advancements and a surge in multimedia applications, presents both challenges and opportunities to video transmissions. Users seek high-quality video content, prompting service providers to offer high-definition options to improve user experience and increase profits. However, traditional end-to-end best-effort networks struggle to meet the demands of extensive video streaming and ensure good user Quality of Experience (QoE), especially in high user mobility scenarios or fluctuating network conditions. Addressing some of these challenges, content delivery networks (CDN) are instrumental in delivering video content, but they are under increased pressure to support high quality and reduce their deployment and maintenance costs. Currently, cloud-edge-end fusion technologies have become one of the optimization directions for network services due to their flexibility and scalability. At the same time, in the context of the recent advancements in computing-focused network paradigms, intelligent enhancement techniques (e.g., super-resolution), commonly utilized in image optimization, have been adopted as a pivotal solution for increasing video delivery quality. To illustrate the essence and employment of the intelligent enhancement solutions for video streaming, this paper first outlines the video streaming process, discusses relevant evaluation metrics, and examines aspects related to the intelligent solutions. Then the paper presents the intelligent enhancement process of video streaming, analyzes various typical intelligent models for content enhancement and highlights their distinct characteristics. This exploration delves deeper into various intelligent quality-improved solutions, scrutinizing their applicability across different transmission scenarios like Video on Demand (VoD) and live streaming, and shedding light on their strengths and weaknesses from...
Published in: IEEE Communications Surveys & Tutorials ( Volume: 27, Issue: 2, April 2025)
Page(s): 1363 - 1394
Date of Publication: 12 July 2024



SECTION I.

Introduction

With the development of Internet infrastructures, network bandwidth capacity has increased from Kbps to Gbps [1], [2], triggering the demand for delivering large-volume content such as video streaming. At the same time, the rise of video-sharing platforms (e.g., YouTube, TikTok and Netflix) has greatly facilitated the dissemination of user-generated and professional content, offering users a wide range of video content choices. However, traditional network architectures, due to their best-effort transmission mode, still face numerous challenges in the transmission of massive high-definition video content.

Traditional network architectures typically include cloud and end sides for content deliveries. Cloud servers deliver content to the wide area network and transmit it to end users. Currently, content delivery networks (CDN), as an important method for accelerated deliveries, essentially play the role of cloud servers. Despite the emergence of technologies such as Software Defined Networking (SDN) [3] and Network Function Virtualization (NFV) [4], which provide better solutions for issues such as single-point failure, scalability limitations and lack of flexibility, challenges related to the lack of controllable relays in the cloud-end network architecture persist in video deliveries. It is worth noting that rapidly developing and widely used edge computing can serve as a reliable relay in the cloud-end network, thereby constructing a cloud-edge-end integrated network architecture.

Nevertheless, possessing the relay capability of compute-store-transmit in the cloud-edge-end network does not necessarily guarantee stable content delivery efficiency. For instance, the temporal and spatial dynamics of backhaul links may lead to unstable content deliveries [5]. During peak hours or in busy areas, network congestion may affect the speed and quality of video transmission, thus undermining user experience. Additionally, traditional network paradigms adopt static caching strategies, unable to swiftly update content cache based on users’ real-time needs and preferences, leading to a low match between content storage and user demands [6]. Moreover, due to the static allocation and rigid utilization of network resources, certain computational resources may remain idle, thereby failing to fully harness network resources to enhance the efficiency and quality of video transmission [7]. Given these challenges, ensuring the quality of video streaming and user satisfaction becomes difficult for traditional network delivery services. Therefore, adopting more advanced network technologies and architectures is necessary to enhance the efficiency and quality of video content deliveries.

Several techniques are employed to enhance video delivery and service quality, notably prefetching and quality-enhancing (e.g., super-resolution). Generally, caching strategically stores video content to expedite its delivery, with its effectiveness hinging on the caching algorithms’ performance, proving particularly beneficial in specific service scenarios. In contrast, prefetching proactively stores data at edge devices or terminals, using available network capacity to boost delivery efficiency. Its success largely depends on the accuracy of predicting future requests, since inaccurate predictions lead to unused data, wastefully consuming storage and bandwidth [8]. Super-resolution (SR) is also utilized to enhance video quality by leveraging computational power at various points in the delivery chain, such as multimedia servers, edge nodes, and client devices. It improves video clarity by processing lower-bitrate streams (in the range of Kbps rather than Mbps), thus reducing the network’s data load. Although this method requires additional computational work at the receiving end, it generally induces only slight processing delays, ranging from tens of milliseconds to a second [9]. SR’s effectiveness stems from utilizing computational resources across the delivery network. This necessitates collaboration among transfer nodes for content or model information sharing, which is essential for the decoding or inferencing processes.

We delve into the innovative approaches mentioned above to improve the effectiveness of current and future network architectures for video delivery, scrutinizing their implications and advantages within networked video delivery frameworks. Traditional methods to accelerate content transmission across networks have predominantly utilized Web caching. This approach involves storing copies of content on specialized servers, which then fulfill user requests, alleviating the load on original servers and core networks. However, traditional caching strategies, particularly those implemented at proxy servers [10], encounter challenges like content mismatch and inefficiencies. Innovations in content deliveries have evolved to embrace peer-to-peer (P2P) networks [11] and, more recently, content delivery networks (CDN) [12], which have become cornerstone technologies for enhancing content deliveries. P2P networks facilitate a decentralized approach, enabling users to store and disseminate content, thereby distributing the traffic load. On the other hand, CDNs alleviate stress on origin servers by redistributing traffic to strategically placed points of presence (PoPs). Major video service providers, including Netflix [13], Amazon Video [14], and YouTube [15], leverage these networks to offer HD content for both VoD and live streaming services, utilizing CDNs and Internet service provider (ISP) infrastructures. However, the intrinsic limitations of IP networks, which offer varying throughputs from Kbps to Mbps, coupled with the substantial costs associated with CDN deployment and maintenance (potentially exceeding 20K dollars for the initial setup of a basic node, plus similar annual operational costs [16]), necessitate continued advancements in network capabilities and the development of more innovative content delivery methodologies.

The strategies discussed above mainly focus on augmenting the number of content replicas, utilizing significant storage capacities, in contrast to the computational resources required by intelligent quality-enhancement techniques. These methods also encounter various implementation hurdles, such as the need for manual setup in dynamic synchronization in proxy servers, privacy and permission concerns in P2P networks, and substantial costs for CDN deployment and maintenance. Currently, multi-access edge computing (MEC) [17] provides scalable cloud capabilities, addressing computationally demanding tasks, notably those associated with intelligent quality-enhancement solutions like SR [18]. MEC shifts processing and distribution closer to users by transitioning these services from central servers to the network’s edge, enhancing data handling, reducing latency, and boosting bandwidth. However, it’s crucial to acknowledge that most existing commercial MEC deployments heavily rely on mobile cellular networks [17], which may lead to various issues concerning reliability, coverage range and costs. Therefore, there is a pressing need to refine or innovate beyond the current models to cater effectively to commercial video streaming’s demands, especially when integrating compute-intensive methods like SR. To sum up, while intelligent quality-enhancement solutions and alternative strategies such as P2P, CDN, and MEC, which utilize different resources, can function independently or in conjunction for streaming enhancement, their synergistic application could significantly accelerate content delivery.

It is important to acknowledge that when HD video content cannot be delivered with optimal quality, a viable alternative is to transmit lower-quality video and then enhance its perceived quality at the receiver’s side through upscaling techniques. While traditional upsampling methods like bicubic interpolation and kernel ridge regression [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30] are available and can somewhat improve image resolution, their scaling capabilities are generally limited. Recent advancements have introduced video intelligent quality-enhancement algorithms (i.e., super-resolution) powered by deep neural networks (DNN) [31], which surpass traditional upsampling techniques in improving video quality. These sophisticated SR methods utilize the computational capabilities at the receiver side (such as user-side edges or user terminals) to enhance low-quality videos, while requiring limited additional bandwidth for transmitting or updating SR models. For live streaming applications, lightweight SR models are preferred due to their rapid inference speeds, achieving up to nearly 100 FPS [32], while more computationally intensive models [33], [34], [35] are better suited for edge computing or longer-buffered VoD scenarios, not directly on client devices. Employing intelligent upsampling technologies offers a significant advantage in network resource efficiency, aiming to provide an optimal viewing experience with less data transmission. However, the integration of SR introduces additional complexities, such as increased training and inference time, necessitating a careful evaluation of the trade-offs between computational demands and network efficiency. The goal is to identify an optimal balance, ensuring that the application of SR maximizes content quality improvement while maintaining transmission effectiveness.

At the same time, viewing experience is important to evaluate the effect of applying assisting methods in different networked video delivery scenarios, especially in relation to various factors including video quality, smoothness and rebuffering. Heavyweight SR models may result in high quality, but the limited computing capacity of terminals might not be able to support processing heavy models over the whole video.

On one hand, the heavyweight models’ inference process also introduces high latency, possibly leading to rebuffering. On the other hand, the lightweight models introduce less latency, but may not achieve the same level of quality enhancement. Meanwhile, different transmission scenarios (e.g., VoD and live) present distinct challenges and requirements, influencing the effectiveness of SR techniques. Therefore, the application of SR DNN in a video delivery system requires careful design to achieve a better user experience. Given the scarcity of exhaustive analyses on SR’s role in video streaming, this article endeavors to fill the gap, offering an in-depth exploration of SR-enhanced video delivery practices across various transmission settings.

There are some existing surveys exploring SR methods and SR-related video delivery systems, as listed in Table I. They are different from our work. Most of the surveys focus on SR models, mainly presenting their characteristics only [31], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47]. Lee et al. analyzed some SR schemes from a system-based perspective [9]. Park et al. [36] provided detailed processes associated with image SR. Hayat [37] compared the structures of different SR neural networks and analyzed related algorithms. Li et al. [38], Chen et al. [47] and Singh and Singh [39] analyzed image SR models while some provided video datasets for SR methods. Wang et al. [40] concretely analyzed the performance of the image SR models. Anwar et al. [31] performed a parameter comparison of different SR algorithms. Liu et al. [41] mainly presented video SR models. Notably, some surveys researched SR methods in special fields such as bio-metrics [42] and face recognition [46]. Some surveys explored SR schemes in the context of special techniques like generative adversarial networks [43], optical flow [44] and blind SR [45].

TABLE I Summary of Existing Surveys on Intelligent-Enhancement Solutions

Specifically, in all the surveys mentioned except that of Lee et al. [9], the authors only discussed the single image super-resolution (SISR) technology from the perspective of neural network design, without focusing on the cost of inference in these models. In contrast, our survey emphasizes the usability of SR models in video streaming systems. Regarding Lee et al. [9], although they briefly introduced the costs of SISR models used in some existing systems, the SR model itself was not a major focus of their review. Our survey, however, provides a comprehensive analysis of existing technologies, including video super-resolution (VSR) models and blind SR, which are crucial for the design of video streaming systems. Additionally, we highlight their availability in video streaming systems, thus providing opportunities for further exploitation of the models’ potential. Moreover, Lee et al. only discussed the components related to SR in the video streaming system, neglecting other crucial components such as video capturing, video codecs, and adaptive bitrate (ABR) algorithms. Our survey addresses these components as well, enhancing the understanding of enhancement technologies from the perspective of the entire video streaming system.

The major contributions of this survey are as follows.

  1. Comprehensive Introduction to Video Streaming: The survey introduces the video streaming process and image/video assessment metrics for evaluating the effect of intelligent-solution-assisted video delivery.

  2. In-depth Exploration of Quality-Enhancement-Supported Video Delivery: The survey presents specific processes of SR-assisted video delivery, including video capturing, downsampling, compression, transmission, decoding, upscaling and rendering, to provide a deep insight into the SR-related workflow.

  3. Detailed Analysis of Intelligent Enhancement Models: We investigate several SR models in terms of transmission and processing-oriented parameters to illustrate the potential of applying them in different network scenarios, indicating their pros and cons.

  4. Comparative Evaluation of Intelligent Enhanced Video Delivery Schemes: The survey presents and discusses relevant intelligent-enhancement-assisted video delivery schemes and compares them based on enhancement effect and transmission efficiency from a network-based communication-oriented perspective.

  5. Challenges and Open Avenues for Future Enhancements of Video Streaming: We investigate main challenges and identify future avenues for enhancement-supported video streaming research and development.

Fig. 1 presents a visual survey overview. The rest of the paper is organized as follows. Section II introduces video streaming and related assessment metrics. Section III details the processes of quality-enhancement-supported video streaming. Section IV investigates the intelligent enhancement models and indicates their pros and cons. Section V presents and discusses relevant intelligent quality-enhanced video delivery schemes. Section VI indicates related challenges in enhancement-supported video streaming and future research avenues while Section VII concludes this work.

Fig. 1. Survey roadmap of Intelligent-Enhancement solutions for increased video streaming quality.

SECTION II.

Background

A. Video Streaming

Video content is the major traffic component in current network environments, mostly due to the recent explosive growth in video-based applications, especially in mobile networks. As the hardware and software supporting video streaming develop rapidly, there’s a growing trend towards the popularity of streaming large-resolution, high-quality video content. This includes HD, ultra high definition (UHD), augmented reality (AR), virtual reality (VR) and mixed reality (MR) [48]. Yet, high quality video delivery still faces significant challenges over existing best-effort IP networks. Important works report research on video streaming for delivery optimization and experience improvement [48], [49], [50]. To match the available network capacity, adaptive bitrate (ABR) solutions are employed to adjust dynamically the video streaming process [51]. Each video is chopped into chunks and each chunk is encoded at multiple discrete bitrates. A client continuously monitors and estimates the available network capabilities, and then requests a chunk with the highest video bitrate that is possible given its estimated network capacity [52]. MPEG DASH has standardised the support for ABR video delivery [53].
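To make the chunk-selection logic concrete, the following minimal Python sketch illustrates a throughput-based ABR decision of the kind described above. The bitrate ladder, the harmonic-mean estimator and the safety factor are illustrative assumptions, not part of any specific DASH implementation.

```python
# Minimal throughput-based ABR sketch (illustrative; values are assumptions).
BITRATE_LADDER_KBPS = [300, 750, 1200, 2350, 4300]  # hypothetical encoding ladder

def estimate_throughput(samples_kbps, window=5):
    """Harmonic mean of the most recent throughput samples (robust to short spikes)."""
    recent = samples_kbps[-window:]
    return len(recent) / sum(1.0 / s for s in recent)

def select_bitrate(samples_kbps, safety_factor=0.8):
    """Request the highest ladder rung that fits under a safety-scaled estimate."""
    budget = safety_factor * estimate_throughput(samples_kbps)
    feasible = [b for b in BITRATE_LADDER_KBPS if b <= budget]
    return feasible[-1] if feasible else BITRATE_LADDER_KBPS[0]

# Recent chunk downloads measured at roughly 2.2-3.0 Mbps lead to the 1200 Kbps rung.
print(select_bitrate([2500, 2800, 2200, 3000, 2600]))
```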

However, video quality is still limited by the available bandwidth, indicating that, for instance, in poor network conditions we are not able to provide a HD video stream at high quality. Meanwhile, although some general network metrics (e.g., throughput, jitter and delay) can characterize viewing experience, the sequential-chunk-based feature of dynamic streaming makes it more efficient to evaluate user experience with QoS-derived metrics (e.g., bitrate, rebuffering and smoothness). Notably, many commercial solutions such as Microsoft Smooth Streaming (MSS) [54], Apple’s HTTP Live Streaming (HLS) [55] and Adobe’s HTTP Dynamic Streaming (HDS) [56] all implement dynamic adaptive streaming, providing different-featured services. However, low-latency high-quality video services are still not able to be always provided as required by users [57], [58].

As VR/AR/MR develop, the current best-effort-based Internet infrastructure struggles to support the increasing volume of video content transmitted remotely. It is known that VR offers a simulated experience that may closely resemble or differ markedly from the real world, with broad application prospects in areas such as video games, medical care and education. AR provides an interactive experience within the real world environment, wherein computer-generated perceptual information enhances the objects present in the physical space, often spanning across multiple sensory modalities. Regarding MR, it amalgamates computer-generated elements with real-life components, creating a unified environment. To sum up, VR/AR/MR impose high video quality requirements, especially when users are in motion. Despite ABR techniques, consistent transmission of vast video content remains challenging. Hence, exploring solutions such as SR, which utilize additional computing and/or storage capacity to enhance video delivery, becomes imperative to ensure low latency and high quality.

B. Image/Video Assessment Metrics

Any processing applied to an image or video may result in the loss of information [59]. When optimization methods are applied to image or video data, it is essential to utilize evaluation metrics to assess video quality or user experience levels. These metrics are also valuable for evaluating the performance of various content processing techniques, including SR. From the perspective of human vision, image/video evaluation indicators can be categorized into objective (metrics-based) methods [60], [61], [62], [63] and subjective (user perceptual-based) methods [64], [65]. Concerning the availability of reference source content, evaluation algorithms can be classified into three types: full-reference, reduced-reference, and no-reference [66]. Comprehensive works have extensively surveyed existing image/video assessment methods [67], [68]. Furthermore, there are established evaluation standards such as ITU-T Rec. P.1203 [69], [70] and ITU-T Rec. P.1204 [71], proposed in 2018 and 2020, respectively. These standards serve as the foundation for implementing video evaluation metrics.

Based on these works, we introduce several generic metrics to assess image quality. Mean Absolute Deviation (MAD) is the simplest pixel-based method for image quality assessment (IQA), calculating the pixel difference for the entire image to reflect its quality. However, MAD can be easily disturbed by noise. Mean Square Error (MSE) is often used for measuring video fidelity [72], but MSE-related models may perform poorly when predicting fidelity or imperceptibility signal quality due to uncertainty in the dataset. Visual Information Fidelity (VIF) [73] is effective, but its performance depends on the confidence of the source and distortion models. Peak Signal to Noise Ratio (PSNR) [74] is widely used for image assessment, especially for evaluating Super-Resolution (SR) effects, providing an approximate estimation of human perception of the reconstructed content quality. However, [75] found that PSNR-oriented metrics tend to reflect low-frequency regions well but lose high-frequency details, which are crucial for human perception. Similarly, the Structural Similarity Index Method (SSIM) [66] is a perception-based model but sometimes delivers unexpected or invalid results [76]. Both PSNR and SSIM are commonly used SR metrics. These metrics are common objective methods for measuring image quality, with PSNR and SSIM being more widely adopted in quality assessment due to their user perception capability and efficiency. However, single-frame quality assessment may not fully reflect the user experience of video streaming.
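As a concrete reference for the pixel-based metrics above, the following NumPy sketch implements MSE and the standard PSNR formula, PSNR = 10·log10(MAX²/MSE); SSIM is omitted here because it requires a windowed, structure-aware computation.

```python
import numpy as np

def mse(ref, dist):
    """Mean squared error between a reference frame and its distorted version."""
    return np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)

def psnr(ref, dist, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher values indicate closer frames."""
    err = mse(ref, dist)
    return float("inf") if err == 0 else 10.0 * np.log10(max_val ** 2 / err)

# Example: compare an HR frame with a slightly perturbed reconstruction.
hr = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
rec = np.clip(hr.astype(int) + np.random.randint(-5, 6, hr.shape), 0, 255).astype(np.uint8)
print(f"MSE = {mse(hr, rec):.2f}, PSNR = {psnr(hr, rec):.2f} dB")
```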

Next, we elaborate on how video quality is evaluated, including employing subjective assessment, to best estimate or measure user QoE. A simple and direct method is to use bitrate as a video quality metric. However, video bitrate is often linked with its complexity (e.g., SR effect), which cannot fully indicate the viewing experience level due to users’ request patterns and content awareness. Viewers’ experiences are influenced not only by the encoded bitrate of the video but also by QoS-derived aspects such as delay, bitrate switching, and rebuffering [49], which are usually included in a typical video QoE model. When implementing SR, which involves training costs, model delivery, and processing delays, it’s essential to strike a balance between content improvement, transmission efficiency, and inference speed. It is recognized that the general QoE model for adaptive streaming utilizes these chunk-based objective parameters to subjectively evaluate video streams. Meanwhile, subjective methods, based on human judgment without explicit criteria references, are also acknowledged. Viewers evaluate each delivered video from their individual perspective, and these evaluations contribute to calculating a mean opinion score (MOS), which reflects the subjective experience of viewers. Despite subjective methods being the most effective in capturing user experience, their widespread adoption is hindered by high costs and low efficiency [77]. To address this, Netflix introduced the Video Multi-Method Assessment Fusion (VMAF) [78], which correlates objective evaluations with MOS scores. While VMAF offers a more accurate reflection of subjective quality opinions compared to metrics like PSNR and SSIM, it tends to overemphasize contrast distortion [79].
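For illustration, a commonly used linear chunk-based QoE formulation combines a per-chunk quality term with rebuffering and bitrate-switching penalties. The sketch below is a generic example of this class of model, not the formulation of any particular standard, and the penalty weights are assumptions that would need calibration against MOS data.

```python
def chunk_qoe(bitrates_kbps, rebuffer_s, rebuf_penalty=4.3, smooth_penalty=1.0):
    """Generic linear QoE model: reward bitrate, penalize stalls and quality switches."""
    quality = sum(b / 1000.0 for b in bitrates_kbps)          # Mbps as a quality proxy
    stalls = rebuf_penalty * sum(rebuffer_s)                  # rebuffering penalty
    switches = smooth_penalty * sum(                          # smoothness penalty
        abs(bitrates_kbps[i + 1] - bitrates_kbps[i]) / 1000.0
        for i in range(len(bitrates_kbps) - 1)
    )
    return quality - stalls - switches

# Four chunks with one 1.2 s stall on the third chunk.
print(chunk_qoe([1200, 2350, 2350, 750], [0.0, 0.0, 1.2, 0.0]))
```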

While there exist diverse image/video evaluation metrics, official standards play a crucial role in providing unified and scientifically-backed evaluation references. These standards are recommended to guide researchers and developers in designing their quality and experience testing processes more effectively. The ITU Telecommunication Standardization Sector (ITU-T) has proposed standard video evaluation methods, such as ITU-T Rec. P.1203 [70], [80] and ITU-T Rec. P.1204 [71], [81], to assist users in conducting quality assessments. ITU-T Rec. P.1203 was specifically developed to assess user QoE for HTTP adaptive streams (HAS), taking into consideration both video quality and QoS metrics, including factors like initial loading delay and rebuffering. P.1203 stipulates specific constraints, such as a maximum resolution of 1920×1080 and a frame rate of 30 FPS for segmented H.264-encoded video sequences. Empirical testing with real data has shown that P.1203 accurately predicts user experiences [80]. However, it is worth noting that P.1203 may be more suitable for evaluating super-resolved, small-scaled video streams due to its aforementioned limitations in scope.

ITU-T Rec. P.1204 is recommended for use in the 4K and reliable transport scenarios [81] based on various codecs including H.264, H.265, or video payload type 9 (VP9). It has three major components: Bitstream-based (P.1204.3), Pixel-based (P.1204.4), and Hybrid (P.1204.5). P.1204.3 identifies three types of degradation, i.e., Quantization degradation, Upscaling degradation, and Temporal degradation. Additionally, P.1204.4 supports full-reference and reduced-reference evaluation while P.1204.5 is a non-reference model. The aspects of upscaling and temporal degradation render P.1204 a suitable choice for evaluating SR approaches. This is due to its ability to assess both the quality of super-resolved frames and the retention of inter-frame information. In conclusion, given its flexibility in evaluating diverse streaming scenarios and network conditions, P.1204 is recommended as the preferred method for assessing SR video streaming.

SECTION III.

The Process of Quality-Enhancement-Supported Video Streaming

The main stages of quality-enhancement-supported video streaming include capturing, downsampling, compression/encoding, transmission, decoding, upscaling and rendering. In live streaming scenarios, downsampling the original high-resolution (HR) videos before encoding/compression is to adapt to limited uplink bandwidth. In VoD scenarios, bandwidth fluctuations and competing traffic require traffic reduction. To tackle this, sender-side downscaling and receiver-side upscaling from low-resolution (LR) are employed. This section elaborates on these stages in detail, and Fig. 2 visually depicts them in the context of a quality-enhancement-supported video delivery system.

Fig. 2. Enhancement-supported video streaming chain.

A. Capturing

Before delivering content to viewers, it must first be captured, making video content production the initial step. Video content can be categorized into professionally produced content (PPC) and user-generated content (UGC) [82]. PPC, produced by professionals, typically ensures high-quality content. On the other hand, UGC, created and published by users, presents challenges in maintaining quality. Both UGC and PPC benefit from SR techniques when faced with bandwidth limitations or high competition with other traffic. Section III-D discusses the effects of network conditions. Additionally, video capture devices with limited capabilities may struggle to capture HR videos, which impacts video quality from the outset. Factors such as optical distortions (e.g., out of focus and diffraction limit) and motion blur due to limited shutter speed can lead to a loss of spatial resolution. For example, mobile phone cameras, while convenient, may produce low-resolution videos or introduce noise in low-light conditions. A Single Lens Reflex (SLR) camera, considered medium/high-end equipment, typically captures higher-quality videos compared with mobile phones. However, it lacks an electric zoom function, resulting in less smooth zoom during recording. Business-level cameras offer independent aperture, shutter, white balance, and other settings, enhancing recording convenience. However, their image quality may not match that of SLR cameras. Different capture devices yield varying video qualities. If videos are of low resolution due to device limitations or network constraints, additional processing (e.g., super-resolving) may be required at the client side to ensure a high user experience.

B. Downsampling

In an SR-assisted video system, clients with limited uplink bandwidth for live streaming or content servers (i.e., live relay servers or VoD content servers) may need to downsample videos. This downsampling enables clients to perform live video transmission and allows receivers with limited network bandwidth to access video content while preserving the appearance of high-resolution content as much as possible [83], [84], [85], [86], [87]. Table III illustrates that supporting high-quality video distribution theoretically requires at least a 2 Mbps bandwidth. Typically, the traffic consumed by downloading is greater than that consumed by uploading, resulting in a higher capacity demand for the downlink [88], [89], [90]. Therefore, broadcasters or service providers may need to create LR content to alleviate the traffic load on the downlink. This LR content can then be super-resolved at the receiver side to ensure QoE.

TABLE II Abbreviation List
TABLE III Theoretical Required Bandwidth

Video downsampling involves reducing the resolution of each frame, typically achieved through image downsampling on a frame-by-frame basis. Traditional downsampling methods, as demonstrated in NTIRE 2017 [91], exhibit relatively low computational intensity, with running times typically in the order of 10⁻² seconds. In contrast, neural network (NN) based scaling methods require significantly more computing capacity, with running times ranging from 10⁰ to 10³ seconds. NN-based downsampling models, which rely on GPUs for both training and inference [92], [93], [94], offer the advantage of producing more realistic and accurate images compared to traditional methods like bicubic and Lanczos. In both VoD and live streaming scenarios, source servers or media servers can assist in performing downsampling tasks. Typically, the optimization of NN-based downsampling is carried out on the server side. Additionally, it’s worth noting that the construction of LR videos directly impacts the efficiency of SR techniques.

Traditional downsampling algorithms [19] lack the ability to adaptively downsample video frames based on specific content characteristics. This limitation arises from the inherent inability of smoothing filters and resampling techniques, like bicubic interpolation [20], to effectively retain image information in an adaptive manner. Downsampled frames may lose some texture information beneficial to the upscaling process. Another option is to perform Gaussian blur on HR videos and then downsample the videos [95]. However, both interpolation and Gaussian blurring are not enough to effectively model the degradation process, because the real-world degradation process is complex and uncertain.
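The classical degradation pipeline discussed above (blur, downsample, add noise) can be simulated in a few lines with OpenCV, which is useful for generating LR training pairs. This sketch is a simplification and, as noted, does not capture real-world degradations such as compression artifacts; the file names are placeholders.

```python
import cv2
import numpy as np

def degrade(hr_frame, scale=4, blur_sigma=1.5, noise_sigma=2.0):
    """Classical degradation sketch: Gaussian blur -> bicubic downsampling -> noise."""
    blurred = cv2.GaussianBlur(hr_frame, (0, 0), blur_sigma)      # kernel size derived from sigma
    h, w = hr_frame.shape[:2]
    lr = cv2.resize(blurred, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    noisy = lr.astype(np.float32) + np.random.normal(0.0, noise_sigma, lr.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

hr = cv2.imread("frame.png")                     # placeholder HR frame
cv2.imwrite("frame_lr.png", degrade(hr, scale=4))
```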

To address this issue, various content-aware downsampling algorithms have been proposed in the literature. These algorithms aim to ensure that LR images retain more detailed information, thereby enhancing the performance of SR DNNs [92], [93], [94], [96], [97], [98], [99], [100]. Kim et al. [92] proposed a task-aware image downscaling method, using an auto-encoder to jointly train downsampling DNNs and SR DNNs. This approach allows the downsampling DNNs to preserve detailed information in LR images, enhancing subsequent SR processes. Sun and Chen [93] introduced a learned content adaptive resampler, employing kernel weights and offsets for downsampling. While effective for enhancing SR, this method does not explore the reversible relationship between image upscaling and downsampling. Xiao et al. [94] developed an Invertible Rescaling Net (IRN), performing downsampling and upscaling using a bijective transformation approach. IRN utilizes a wavelet transform to generate multiple LR images from each input image, facilitating an efficient reverse process for image upscaling. However, these techniques focus primarily on image downsampling and upscaling and do not address real-world content delivery constraints in video systems, such as bandwidth fluctuations and computational limitations.

C. Encoding

Efficient storage and transmission of original video data pose significant challenges due to their massive size. Therefore, video encoding/compression is essential to address these challenges. There are two major coding avenues, i.e., entropy coding and lossy coding. Entropy coding aims for lossless compression up to the Shannon limit, ensuring no information loss. However, its compression rate is limited. In contrast, lossy coding is typically more effective as it removes redundant or less important data. Human vision can tolerate the loss of some details well, making lossy compression a preferred choice in many scenarios.

Video coding typically involves three main components: predictive coding, transform coding, and entropy coding, as depicted in Fig. 3. Videos consist of completely-encoded frames and incrementally-encoded frames. Completely-encoded frames only undergo compression through transform and entropy coding, while incremental frames require predictive coding to reduce temporal and spatial redundancy. Predictive coding, involving motion estimation and compensation, achieves both intra-frame and inter-frame compression. Intra-frame compression treats each frame independently, similar to image compression. On the other hand, inter-frame compression utilizes neighboring frames to predict the content of the current frame, resulting in higher efficiency due to the removal of temporal redundancy.

Fig. 3. Video encoding process.

In encoded videos like MPEG, different types of frames are utilized, including Intra frames (I-frames), Predicted frames (P-frames), and Bi-directional frames (B-frames). The I-frame, typically the first frame in a Group of Pictures (GoP), relies solely on the redundancy within the frame for compression. It can be encoded or decoded independently without referencing other frames, and its quality directly impacts overall video quality as it serves as a reference for subsequent frames. P-frames require reference to previously encoded frames. They employ motion estimation and compensation to exploit inter-frame redundancy, effectively reducing the size of encoded frames. B-frames offer even greater flexibility as they can reference frames occurring both before and after the current frame during compression. They play a crucial role in minimizing video size while maintaining quality. However, decoding B-frames may require significant computational resources and could increase processing latency. Additionally, video frames are divided into non-overlapping blocks and encoded on a block-by-block basis, improving encoding efficiency and reducing the resulting data volume.

Several widely-used video coding formats include H.264/AVC [101], H.265/HEVC [102], VP6 [103], VP9 [104], VC-1 [105], AV1 [106], and AVS2 [107]. Moreover, Scalable Video Coding (SVC) and Scalable High Efficiency Video Coding (SHVC) have been developed to support adaptation to network dynamics and user requirements, serving as scalable extensions of H.264/AVC and H.265/HEVC, respectively.
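As a practical illustration of the encoding choices above (sender-side downscaling, GoP length and B-frame usage), the following example invokes ffmpeg with libx264 from Python. The input/output names and parameter values are assumptions chosen for demonstration only.

```python
import subprocess

# Downscale to 540p and encode with a 2-second GoP (60 frames at 30 FPS)
# and up to 2 consecutive B-frames, using H.264/AVC via libx264.
subprocess.run([
    "ffmpeg", "-i", "input_hr.mp4",
    "-vf", "scale=960:540",           # sender-side downscaling before encoding
    "-c:v", "libx264", "-crf", "28",  # lossy coding; higher CRF -> smaller stream
    "-g", "60",                       # GoP length: one I-frame every 60 frames
    "-bf", "2",                       # allow up to 2 consecutive B-frames
    "output_lr.mp4",
], check=True)
```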

D. Transmission

Video deliveries mainly involve three types of entities, i.e., cloud, edge and end. Typically, video content is transmitted from a cloud server to end users, possibly undergoing acceleration at the edge. However, in live video scenarios, broadcasters often face challenges due to insufficient uplink bandwidth to deliver high-resolution (HR) content [108], [109]. This limitation becomes particularly evident when streaming videos with complex textures or attempting to provide end users with HD content. Reports indicate a significant disparity between uplink and downlink bandwidth, with differences of up to tenfold [110]. For effective delivery of 1080p HD online streams, Apple Inc. recommends a minimum bandwidth of 6 Mbit/s [111], while Netflix suggests a bandwidth of 5 Mbit/s for the same resolution [112]. Additionally, IMCCA provides theoretical bandwidth requirements for various video formats, as outlined in Table III. Despite these recommendations, research shows that the average uplink transmission rate of commercial live streaming servers was less than 1 Mbit/s in 2020 [113], with a median uplink bandwidth observed to be less than 2.1 Mbit/s in 2021 [109]. The uplink throughput measurements conducted by Zhu et al. [109] over LTE networks indicated that the uplink bandwidth exhibited significant variability and could be lower than the sustained bandwidth requirements of commercial broadcasting apps. Additionally, broadcasters will also compete for the limited transmission resources.

Meanwhile, fluctuations in available downstream bandwidth during video transmission from the server to remote clients can also have a negative impact on the process. These fluctuations are typically caused by burst traffic, network jitters, or congestion. Barakat et al. [114] conducted research on the Sprint IP backbone, revealing average rates ranging from 26 Mbps to 262 Mbps. It has also been found that traditional loss/delay-based TCP algorithms suffer from remarkably low bandwidth utilization (10% to 30%) over 5G networks [2]. Furthermore, available bandwidth becomes significantly more variable in heavily utilized paths or those with limited capacity [115]. Even lightly loaded networks with higher available bandwidth may not consistently provide the stable, predictable and smooth throughput needed by applications such as streaming audio/video. Hence, dynamic networks with common bandwidth fluctuations do not always guarantee stable transmission capacity. Even in 5G scenarios, TCP-related delivery, like HTTP-based video streaming, faces challenges with throughput dynamics and bandwidth utilization uncertainty.

From a network perspective, optimization techniques should align with architectural features (e.g., communication methods) and network conditions (e.g., throughput and jitter) to ensure compatibility and effectiveness. To address concerns about quality assurance over the Internet, various network-assisted approaches have been developed, e.g., CDNs [13], [14], [15], Information Centric Networking (ICN) [116], [117], [118] and edge computing [48], [119], [120]. CDNs play more of a role akin to cloud servers, strategically deploying servers at network junctions to improve response times by deploying content replicas. While CDNs primarily focus on storage offloading rather than computational capacity, this may pose compatibility challenges for computation-intensive applications like SR. Despite having computational power, CDNs may lack specific functionalities required for intelligent solutions, such as proper allocation of computing resources for SR tasks, implementation of cache strategies for storing original and enhanced content, and execution of assisting policies for online training. Furthermore, reports indicate that hardware resources on CDN servers may be inferior compared to servers at the origin station, potentially leading to performance issues or even SR task disablement when essential computational elements like RAM and CPU processing power are inadequate [121]. Therefore, introducing computational offloading strategies becomes crucial for effective SR execution in CDN environments.

As for ICN, it fundamentally severs the bond between content and specific locations, facilitating storage, multi-party communication, and other services through the Publish/Subscribe Paradigm. ICN’s primary objective is to host frequently accessed content within the network, a concept commonly referred to as in-network caching. This vision involves routers equipped with local caches, capable of storing a portion of the data that traverses the network. Prior to forwarding an interest message, which seeks specific data, to the next network hop, these routers perform a check to determine whether the requested content is available within their local high-speed cache. If the content is present, they promptly deliver the requested data to the interested user and discard the interest message. Alternatively, if the content is not found, the routers forward the data request to the next hop. However, the establishment of ICN within existing network infrastructures presents challenges. This arises from the fact that ICN’s communication mechanism necessitates substantial upgrades and redesigns in network devices, applications, and even network protocols [116]. Moreover, one of ICN’s notable drawbacks is its requirement for substantial network reconfiguration, which can lead to significant deployment and migration challenges. The need for widespread adoption and compliance across various network components makes its implementation complex and resource-intensive. Therefore, the practical implementation of ICN on existing networks can be impeded by these technical and logistical challenges, let alone when used in conjunction with SR technology.
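The in-network caching behavior described above can be summarized by the following toy sketch of a router's interest handling. The class and method names are hypothetical and the sketch omits real ICN machinery such as pending-interest tables and forwarding strategies.

```python
class IcnRouter:
    """Toy ICN router: answer interests from the local content store when possible."""
    def __init__(self, next_hop):
        self.cache = {}            # content name -> data (local content store)
        self.next_hop = next_hop   # upstream router or content producer

    def on_interest(self, name):
        if name in self.cache:                     # cache hit: serve locally, drop interest
            return self.cache[name]
        data = self.next_hop.on_interest(name)     # cache miss: forward the interest upstream
        self.cache[name] = data                    # store returning data for later requests
        return data
```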

Edge computing brings cloud services closer to mobile devices, enabling low-latency responses by deploying cloud services like caching and computing on edge servers located at exit points, such as gateways, connected to network providers. This setup allows edge managers to directly optimize services for their covered users. For example, platforms like the iQIYI Open Cache Program offer customizable edge functions [122]. Alternatively, virtual data centers based on public cloud services and super clients can also serve as smart edges through virtual machines or applications. Although MEC offers both storage and computation capabilities suitable for super-resolution methods, it lacks unified practical standards. In traditional mobile networks, the core network and edges remain strongly coupled, and decoupling is essential to unlock edge capabilities. Meanwhile, the emergence of multi-edge collaboration and edge-cloud collaboration poses new challenges, highlighting the significance of cooperative edge-SR solutions. This aspect represents a key point for exploring edge optimization based on super-resolution techniques in this article.

E. Decoding

Video decoding is the process of decompressing encoded digital video data, essentially reversing the encoding process. This restoration operation allows the compressed data to be played back in its original format. A video decoder can take the form of either hardware microchips or software, both of which are responsible for converting encoded video stream signals into digital video files. These files typically encompass video, audio, subtitles, and captions. During decoding, the decoder retrieves various components from the compressed stream, including motion vectors, residual coefficients, and predictive parameters. The residual data is obtained according to the residual coefficients, while the motion vectors aid in generating prediction frames using reference frames. Then the residual data is superimposed onto the prediction frame to reconstruct the actual frame.
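The reconstruction step can be summarized by the toy sketch below, which applies a single global motion vector to a reference frame and adds the decoded residual. Real decoders perform this per block with sub-pixel motion compensation, so this is only a conceptual illustration.

```python
import numpy as np

def motion_compensate(reference, mv_y, mv_x):
    """Form the prediction by shifting the reference frame by a (global) motion vector."""
    return np.roll(np.roll(reference, mv_y, axis=0), mv_x, axis=1)

def reconstruct(reference, residual, mv):
    """Decoder-side reconstruction: motion-compensated prediction plus decoded residual."""
    prediction = motion_compensate(reference, *mv).astype(np.int16)
    return np.clip(prediction + residual, 0, 255).astype(np.uint8)
```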

F. Upscaling

To better understand video super-resolution (VSR), we begin by introducing the concept of single-image super-resolution (SISR), which focuses on recovering HR images from LR ones, without considering the inter-frame information. Formally, let I_LR denote an LR image and I_HR its corresponding HR image. The degradation process can be mathematically expressed as I_LR = Φ(I_HR; θ_α), where Φ represents the degradation model and θ_α includes parameters like the scaling factor, blurring kernel, and noise. In SISR models, the commonly used loss function is pixel-wise MSE. However, in practice, the degradation model is often unknown and influenced by factors like noise, compression, and blurring. To simulate this deterioration process more accurately, the degradation model can be specified as I_LR = (I_HR ⊗ k)↓_s + n, where k is the blurring kernel, I_HR ⊗ k represents the convolution of the HR image with the blurring kernel, n is noise, and ↓_s denotes downsampling with scaling factor s [40], [41].

In VSR, deep neural networks (DNNs) are employed to upscale the images. Figure 4 illustrates the major steps of a VSR model, including alignment, feature extraction and fusion, and reconstruction. Given the jitter between video frames, alignment is essential before feature extraction. Existing alignment methods typically estimate optical flow between the reference frame and neighbor frames, deforming neighbor frames accordingly. Implicit motion compensation can also be achieved via dynamic filtering or deformable convolution. After alignment, the frames undergo feature fusion, which can be done using convolutional neural networks (CNNs) to fuse all frames or recurrent networks to gradually fuse multiple frames. Based on the extracted fusion features, the video is effectively reconstructed using deep CNNs. However, two major issues persist in DNN-based SR: efficient alignment of multiple frames and accurate feature fusion to avoid artifacts caused by complex motion.
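To make the alignment, fusion and reconstruction pipeline tangible, the sketch below warps neighbor frames with a precomputed optical flow, fuses them with the reference frame, and reconstructs an HR output via sub-pixel convolution. It is a deliberately tiny, hypothetical skeleton in PyTorch, not any published VSR architecture, and it assumes two neighbor frames and externally estimated flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Warp a neighbor frame toward the reference using dense optical flow.
    frame: (N, C, H, W); flow: (N, 2, H, W) pixel offsets in (x, y) order."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                              # (N, 2, H, W)
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

class TinyVSR(nn.Module):
    """Alignment -> feature fusion -> reconstruction, for 2 neighbors and 4x upscaling."""
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(channels * 3, 64, 3, padding=1), nn.ReLU(True))
        self.recon = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, ref, neighbors, flows):
        aligned = [flow_warp(nb, fl) for nb, fl in zip(neighbors, flows)]   # alignment
        feats = self.fuse(torch.cat([ref] + aligned, dim=1))                # feature fusion
        return self.recon(feats)                                            # reconstruction
```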

Fig. 4. VSR flowchart.

G. Rendering

Video rendering is a critical stage in the multimedia pipeline, where encoded data from a decoder is transformed into a visually coherent and understandable stream. This process involves a series of intricate steps performed by computer systems to decode, process, and display the content accurately. During video rendering, various elements such as edited video clips, audio tracks, and animation effects are seamlessly integrated to create a cohesive and engaging visual experience. These elements are combined into a single stream of video that can be viewed by the end-user. One of the key challenges in real-time video rendering is the demand for significant hardware resources. Modern graphics processing units (GPUs) and specialized rendering hardware play a crucial role in accelerating these tasks and enabling real-time rendering of complex multimedia content. Moreover, real-time rendering often involves optimizing performance to maintain a consistent frame rate and minimize latency.

SECTION IV.

Intelligent Enhancement Models

To support intelligent-enhancement-supported video deliveries, servers generally possess high computing power, such as powerful central processing units (CPUs) and GPUs, capable of training enhancement models (i.e., SR models). However, considerations must be given to the features of these models (e.g., model size, inference speed, and training overhead) to adapt to different video streaming types like VoD and live streaming. We also need to address related considerations, such as where to perform enhancement inference (e.g., at the client or the network edge), defining the approach for training SR models (e.g., pre-training generic models at the server, training specialized update models for popular content, or updating models at the client), and striking a balance between transmission overhead and computational expenses. When applying SR models to various video streaming scenarios, tailored solutions should be chosen based on the network architecture and conditions. For instance, high-performance models are often resource-intensive and may not be compatible with low-capacity devices, leading to increased resource consumption and longer inference times. Additionally, different SR models utilizing various network architectures, such as CNN, RNN and Transformers, offer varying performance and are suitable for different training approaches. Factors such as neural network type, model complexity, network depth, and code framework can influence the performance of SR models, affecting their compatibility with different video streaming scenarios and delivery paradigms. Here, we will discuss the characteristics of some SISR and VSR models, providing insights into enhancement application in different scenarios.

A. Single Image Super Resolution (SISR)

The SISR methods can be broadly classified into two categories: 1) traditional non-CNN methods [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [37], [131] that are favored for benchmarking and 2) learning-based ones [33], [34], [35], [123], [124], [125], [126], [127], [129], [130], [132], [133], [134], [135], [136], [137], [128], [138], [139], [140], [141]. These methods are listed in Table IV. In recent years, following the success achieved with deep learning in various domains, many learning-based SR methods were proposed. Recently, their performance was improved significantly compared with the traditional SR models. Next, we introduce several widely-used models and discuss their features that influence their application to streaming systems. They include SR NN developed from general CNN [123], [124], [125], [126], [127] to generative adversarial network (GAN) [33], [34]. Some can be based on residual networks (ResNet) [35], [128], [140], RNN [132], distillation [129], [133], [134] and transformer [130], [135], [136], [137], [138], [139], or even achieve blind SR [141].

TABLE IV Representative Image Enhancement Models

1) General CNN:

Dong et al. [123], [142] were the first to use deep CNN in SR and proposed the SR Convolutional Neural Network (SRCNN), an end-to-end SR framework, to learn the mapping between HR and LR images directly via CNN. This is a pioneering work for SISR. In the pre-processing step, SRCNN employed a bicubic interpolation to upscale the input LR images to a desired size. After that, a convolutional layer was used to extract overlapping patches and characterize the patches as high-dimensional vectors which comprise as many feature maps as their dimensions. Then the vectors were mapped non-linearly to another high-dimensional vector that represents an HR patch. In the final step, the HR patch-wise representations were aggregated to generate the reconstructed HR image. Although the model of SRCNN is simple and explicit, the computational cost is rather heavy (i.e., 1.3 FPS) [31]. In fact, the convolution operation of SRCNN was performed on the upsampled images instead of the original LR ones, resulting in low efficiency.
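The three-layer structure is simple enough to sketch directly. The PyTorch snippet below follows the commonly cited 9-1-5 filter configuration applied to a bicubic-upscaled input, as in the original formulation; the original network operates on the luminance channel, whereas three channels are used here for simplicity.

```python
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Patch extraction, non-linear mapping, reconstruction (9-1-5 configuration)."""
    def __init__(self, channels=3):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.map = nn.Conv2d(64, 32, kernel_size=1)
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, upscaled_lr):          # input: bicubic-upscaled LR image
        x = F.relu(self.extract(upscaled_lr))
        x = F.relu(self.map(x))
        return self.reconstruct(x)
```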

To overcome this limitation, the authors [124] re-designed the SRCNN structure and came up with a new model named Fast SRCNN (FSRCNN), taking the original LR patches as input. FSRCNN aims to bring the rate of computation to real-time video frames (24 FPS) as compared to SRCNN (1.3 FPS) for upsampling a 240×240 image by a factor of 3, when relying on an Intel i7 CPU 4.0 GHz processor. The authors also adopted the strategy of reducing filter sizes and introducing additional mapping layers. The refined model is claimed to be 40 times faster than the original SRCNN.

Similar to FSRCNN, ESPCN [125] also provides a fast approach for directly operating convolution on LR images to produce HR ones. It is reported to perform real-time SR on 1080p video using a K2 GPU device [38]. The core of ESPCN is its interpolation function (i.e., the sub-pixel convolutional layer), which scales images from LR to HR. The interpolation function is implicitly included in the preceding convolutional layer and can be learned automatically. The convolution process was performed on LR images while the upscaling of images was performed only in the last layer.
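The key idea, running all convolutions at LR resolution and rearranging the final feature channels into the HR grid, can be expressed with PyTorch's PixelShuffle. The layer widths below are illustrative rather than the exact published configuration.

```python
import torch.nn as nn

class ESPCNLike(nn.Module):
    """ESPCN-style network: convolutions at LR resolution, sub-pixel upscaling at the end."""
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, 3, padding=1), nn.Tanh(),
            nn.Conv2d(32, channels * scale * scale, 3, padding=1),
        )
        self.upscale = nn.PixelShuffle(scale)   # (C*r^2, H, W) -> (C, r*H, r*W)

    def forward(self, lr):
        return self.upscale(self.body(lr))
```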

Though efficient, ESPCN treats super-resolution of different scale factors as independent problems and overlooks the mutual relationships among different scales. To better utilize the redundant information, Kim et al. [126] proposed VDSR, which uses scale augmentation during training to deal with the multi-scale upsampling problem in a single framework. The VDSR model is a very deep convolutional network inspired by VGG-net [143], which utilizes contextual information over large image regions. To cope with the slow convergence rate in deep networks, the authors adopted residual-learning and gradient clipping. However, VDSR also needs to upscale LR images with bicubic interpolation to a desired size, leading to a high computation and memory cost. As Li et al. reported [38], with a 3.4 GHz Intel i7 CPU (64G RAM) and Nvidia Titan X GPU (12G Memory), VDSR only super-resolved video to 2× at a speed of 1.7–11.0 FPS, which was far slower than FSRCNN at a speed of 31–92 FPS.

Although the performance of CNN-based SR models is high, these models suffer from high complexity as the depth and width of the neural network increase in practice. Li et al. [127] proposed a convolutional anchored regression network (CARN), which is inspired by locally linear regression methods such as A+ [24] and anchored neighborhood regression [144]. In CARN, the features were extracted from input LR images by convolutional layers, and the regressors were used to map features from low dimension to high dimension. All the operations were transformed into a convolution process so that the components in CARN could be learned jointly. Experimental results show that CARN struck a balance between time consumption and performance improvement. Although CARN is heavier than SRCNN as shown in Table V, it achieves better image recovery by about 10% in terms of PSNR [31].

TABLE V Representative Video Enhancement Schemes

2) GAN:

The above-mentioned methods are all dedicated to improving the performance in terms of machine-centric criteria (e.g., PSNR), but rarely put the visual quality as their first priority. Due to the excellent performance of GANs, an increasing number of works adopt GANs to improve the SR effect, although this may also make the models heavier [33], [34], [43], [145]. These works are divided into three types based on the training employed: supervised [33], semi-supervised [145] and unsupervised [34]. The adversarial loss pushes the network to prefer solutions that are close to natural image textures. To recover finer texture details of images, Ledig et al. [33] proposed a perceptual-driven method named super-resolution generative adversarial network (SRGAN). Different from most of the previous works, SRGAN defined a perceptual loss function which consists of an adversarial loss and a content loss. It has a similar parameter size and achieves a similar SR effect as CARN in terms of quality. However, as the NTIRE 2017 Challenge shows [91], SRGAN’s running time is longer than VDSR’s and it is far from real-time inference.
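For reference, the perceptual objective reported for SRGAN combines a content loss computed on VGG feature maps with a lightly weighted adversarial term, schematically:

```latex
% SRGAN-style perceptual loss (weighting as reported in [33]); D and G denote
% the discriminator and generator, and the VGG term is the feature-space content loss.
l^{SR} \;=\; l^{SR}_{VGG} \;+\; 10^{-3}\, l^{SR}_{Gen},
\qquad
l^{SR}_{Gen} \;=\; \sum_{n=1}^{N} -\log D_{\theta_D}\!\big(G_{\theta_G}(I^{LR}_{n})\big)
```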

3) ResNet:

Some works design SR DNNs based on residual learning to decrease training time and avoid the loss of high-frequency details in reconstruction, which ultimately accelerates learning and convergence. Lim et al. [35] used a deep ResNet to develop the enhanced deep super-resolution network (EDSR), removing the batch normalization layers from the residual blocks. Experiments confirmed that this simple modification increased overall performance and saved approximately 40% of memory usage during training. The authors also proposed a multi-scale deep super-resolution network (MDSR), which provides multi-scale EDSR support through different upsampling layers, as Fig. 5 shows. In the multi-scale architecture, scale-specific pre-processing modules consisting of two residual blocks with 5×5 kernels are adopted at the beginning of the network. However, EDSR/MDSR are heavy models, and it is difficult to achieve real-time processing with them in a video delivery system. As tested, training EDSR and MDSR takes 8 and 4 days respectively on NVIDIA Titan X GPUs. This is very computing-intensive and does not meet the needs of online model updating and fast inference, especially for real-time streaming.
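
A minimal sketch of an EDSR-style residual block is shown below, assuming the widely reported design of conv-ReLU-conv without batch normalization plus residual scaling; the feature width and scaling factor are illustrative.

```python
import torch.nn as nn

class EDSRResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv without batch normalization,
    with residual scaling to stabilize very deep/wide models."""
    def __init__(self, feats: int = 64, res_scale: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(feats, feats, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(feats, feats, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.res_scale = res_scale

    def forward(self, x):
        res = self.conv2(self.relu(self.conv1(x)))
        return x + res * self.res_scale   # identity skip, no BN layers
```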

Fig. 5. MDSR.

The above SR DNNs ignore the varying importance of feature information. LR inputs and features abundant in low-frequency information should not be treated equally across channels, as this limits the representational ability of the CNN. Zhang et al. [128] proposed very deep residual channel attention networks (RCAN), which contain a residual-in-residual (RIR) structure. RCAN is composed of several residual groups with long skip connections, and these residual groups contain residual blocks with short skip connections. RIR enables the main network to focus on learning high-frequency information by skipping some connections. Although RCAN has fewer parameters than EDSR, further work is needed to make it light enough for real-time SR.
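
The following sketch illustrates the channel attention mechanism underlying RCAN: global average pooling followed by a 1×1 bottleneck produces per-channel weights that rescale the feature maps. The reduction ratio and feature width are illustrative assumptions.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """RCAN-style channel attention: global average pooling followed by a 1x1
    bottleneck produces per-channel weights that rescale the feature maps, so
    channels carrying high-frequency information are emphasized."""
    def __init__(self, feats: int = 64, reduction: int = 16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                              # (B, C, 1, 1)
            nn.Conv2d(feats, feats // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feats // reduction, feats, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.attn(x)   # per-channel rescaling
```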

4) Distillation:

Knowledge distillation is a common model compression method. Different from pruning and quantization, knowledge distillation trains a lightweight model under the supervision of a larger model with better performance, allowing the small model to approach the accuracy of the large one. Hui et al. [129] proposed an information distillation network (IDN), which reconstructs HR images directly from the original LR images through an information distillation mechanism. The average inference time of IDN for 3× upscaling on Set5 [146] is less than 0.01 seconds, based on a 4.2 GHz Intel i7 CPU (32 GB RAM) and an Nvidia TITAN X (Pascal) GPU (12 GB memory). IDN consists of a feature extraction block, stacked information distillation blocks and a reconstruction block. Feature maps are extracted directly from the LR image by the feature extraction block. After that, multiple cascaded distillation blocks generate the residual representations in HR space, which is the key step of IDN. The group convolution employed greatly reduces the number of parameters and the computation of the SR DNN. IDN achieves real-time SR and is adopted by some video delivery schemes such as Supremo [147], but IDN training is very time-consuming (i.e., it takes a day with a TITAN X GPU for 2× upscaling).
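
To illustrate the teacher-student form of knowledge distillation described above (not IDN's internal information distillation), a minimal sketch of a distillation-style SR objective could look as follows; the L1 losses and the alpha weighting are assumptions for illustration.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def distillation_sr_loss(student_sr, teacher_sr, hr, alpha: float = 0.5):
    """Teacher-student distillation objective for SR (illustrative): the compact
    student is supervised by both the ground-truth HR frame and the output of a
    larger teacher model; alpha balances the two terms."""
    return (1.0 - alpha) * l1(student_sr, hr) + alpha * l1(student_sr, teacher_sr)
```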

5) Transformer:

The Transformer is an encoder-decoder model based entirely on an attention mechanism, discarding the use of CNNs. Transformers achieved great success in natural language processing (NLP), and an increasing number of works have since applied them to computer vision tasks such as image classification and object detection. The majority of SISR methods are based on convolutional neural networks using MSE or MAD as their objective function, ignoring human perception. Recent advances in vision transformers [130], [135], [136], [137], [138], [139] have brought new vitality into this field, harvesting more accurate details from reference images and improving perceptual quality. An innovative transformer-based solution for image restoration tasks was proposed by Yang et al. [130], who first proposed a Texture Transformer network for image SR (TTSR). The solution transfers the texture features of the reference image in LR and uses a transformer for super-resolving. However, the implementation of Transformers is difficult, requiring specific chips to run the operators, and may suffer from low inference efficiency.

6) Blind SR:

The latest blind SR methods follow two major avenues, degradation prediction and conditional restoration, implemented as single-branch or two-branch networks. Xie et al. [141] proposed a Filter Attribution method based on Integral Gradients (FAIG) to investigate how single-branch networks automatically learn to distinguish degradations. Using the proposed FAIG, a simple and effective method was developed to predict the degradation of the input image. Unfortunately, due to their principle, the performance of these methods is not the best.

B. Video Super Resolution (VSR)

Apart from recovering HR images from LR ones, SR techniques play a key role in reconstructing high-quality videos using the same principle. Since videos can be treated as sequences of images, an explicit way of performing VSR is to upsample the video frame by frame. VSR algorithms can be classified into traditional [41], [148], [149], [150] and learning-based ones [151], [152], [153], [154], [155], [156], [157], [158], [159], [160], [161], [162], [163], [164], [165], [166], [167], [168], [169], [170], [171]. It is noteworthy that inter-frame relations can be utilized not only for content recovery (i.e., super-resolution), but also for content filtering (e.g., frame discarding [172], [173]).

1) Aligned Methods:

These methods align the neighboring frames with the target frame before video reconstruction using particular preprocessing techniques such as motion estimation and motion compensation [151], [152], [153], [154], [155], [156], [157], [158], [159], [160], [161], [162].

Kappeler et al. [151] proposed a video super-resolution network, VSRnet, which extended the SRCNN model [123] by using several consecutive frames (compensation frames) as input. As pioneers of using CNNs to restore videos, the authors investigated different options of combining video frames within a CNN architecture. Similarly, Caballero et al. [152] proposed a motion compensation transformer module for spatial motion correction in the video efficient sub-pixel convolutional neural network (VESPCN), which extended the ESPCN model [125]. Although the VSRnet and VESPCN models, with hundreds of parameters, can directly enhance video streaming and introduce only slight traffic for model delivery, they achieve only around 25 dB PSNR [41], providing limited video enhancement.

A GAN-based VSR model, TecoGAN [153], was proposed to employ adversarial and cyclic training methods to extract spatial high-frequency details and temporal correlations. TecoGAN's processing speed is 41.92 ms/frame on an Nvidia GeForce GTX 1080Ti GPU, and it achieves a good inference effect of more than 32 dB in terms of PSNR. However, it is still difficult to make the training converge due to the adversarial loss, which may consume a great amount of computing resources on the server side.

To better utilize inter-frame motion information, Sajjadi et al. [154] proposed FRVSR, a frame-recurrent video super-resolution approach based on optical flow. It passes the previously estimated HR frame as an input to the following iteration, reducing the computational cost and propagating information from past frames to later ones, which is well suited to temporally-correlated video streaming.

To better characterize video content features and achieve a better SR effect, deformable convolutional networks were first proposed by Dai et al. [174] and later improved by Zhu et al. [175]. Typical SR methods using deformable convolution are EDVR [157], DNLN [158] and TDAN [159]. Wang et al. [157] introduced the Video Restoration framework with Enhanced Deformable convolutions (EDVR) to align multiple frames and effectively fuse frames with diverse motion and blur. It also uses a ResNet to super-resolve LR video frames, indicating that the model may be somewhat heavy but exploits inter-frame information well. Although EDVR won the NTIRE19 video restoration and enhancement challenges by a large margin [176], its parameter size is about 20M, which makes it more suitable to be pre-trained for VoD streaming than for live streaming that may need model updating.

Similar to SISR, distillation-based VSR methods were also introduced. Xiao et al. [160] proposed space-time distillation (STD) to transfer knowledge from complicated models to compact student models, exploiting both spatial and temporal knowledge without affecting the compact models' original architecture. Spatial attention maps of high-frequency video information are extracted simultaneously from the two networks and used to transfer spatial modeling capabilities, providing content-aware-like SR that better safeguards viewer experience.

Although various methods have been proposed to better learn image/video information, their blur kernels in the degradation process are generally assumed to be known. As the blur kernel is not modeled again in the reconstruction phase, an over-smoothed image/video results. Blind SR methods [162] are therefore adopted to restore HR content and better simulate the real world with unknown and complex degradations, which may particularly benefit VR/AR/MR live streaming.

2) Non-Aligned Methods:

The non-aligned methods do not perform any alignment operations before video reconstruction and directly feed the frames into the developed models [163], [164], [165], [166], [167], [168], [169], [170], [171]. The unaligned approaches can be divided into two categories: spatially unaligned and spatio-temporally unaligned. The spatially unaligned techniques feed input frames straight into a two-dimensional convolutional network for spatial feature extraction, fusion, and super-resolution, without alignment processes like inter-frame motion estimation and motion compensation. The spatio-temporally unaligned methods utilize the spatio-temporal information in the input video for the SR task. These models usually have the capability to cope with both spatial and temporal dependencies in a given video, and work well in extremely dynamic video scenes and with unpredictable viewing behaviors.

To eliminate motion estimation and motion compensation, DUF [164] employed a dynamic upsampling filter that combines filtering and upsampling operations with spatio-temporal information gained by three-dimensional (3D) convolution. DUF uses the network to estimate the residual map of the target frame and improve the high-frequency details of the super-resolution results. The final output frame is the sum of the residual map and the frame processed by the dynamic upsampling filter. However, this strategy overlooks inter-frame information and is bound to be sub-optimal [164]; the approach was confirmed to be inferior to EDVR [157].

This discussion on SR models enhances our comprehension of when and how they can be effectively applied across diverse network environments with varying resources and viewer requirements.

SECTION V.

Intelligent Enhanced Video Delivery System and Schemes

As computer vision technologies evolve, the video content transmitted over the network has also changed dramatically. From the initial two-dimensional (2D) VoD/live videos to 360° panoramic videos and 6 degrees of freedom (DoF) videos, network conditions and the performance of video delivery systems face increasingly high demands. For VoD streaming, SR models based on DNNs are usually trained offline on the server side [86]; the client then uses the SR model to upscale and enhance the video stream. In live video streaming, the upstream bandwidth from the broadcaster to the server is lower than the downstream bandwidth from the server to the viewer [88], [89], [90], so the SR DNN is trained online and then used to upscale the uploaded videos at the receiver side (e.g., media servers or viewers). Different SR schemes fit different video types and transmission systems. Here, we analyze the deployment details of intelligent enhancement solutions based on video types and transmission scenarios, to explore application insights of SR in cloud-edge-end video deliveries.

A. Video on Demand

As shown in Fig. 6, in a VoD streaming system, videos are pre-coded and stored in a media server. As edge computing develops, network operators or content providers [122], [192] may deploy edge services to accelerate video delivery by caching popular content and enhancement models. Due to the limited computing capacity of clients, existing VoD systems generally pre-train SR models in the media server and send them to the clients [85], [86].

Fig. 6. The delivery architecture of VoD with SR.

Yeo et al. [193] proposed to apply SR DNNs in a CDN network with server and client sides. In [193], the analysis of adaptive video streaming revealed that clients with computing capacity can play an active role in video enhancement. Meanwhile, there is redundancy between video frames over time, so a content-aware SR DNN trained on videos with similar content was adopted to improve the performance of the SR model. Yeo et al. quantified the SR performance using a desktop GPU (i.e., NVIDIA Titan Xp), achieving 7.87 FPS when recovering a 720p image; larger image sizes take longer to recover. Although mobile devices with power-efficient mobile GPUs outperform older-generation game consoles (e.g., XBox 360 or PlayStation 3) [194], their performance is not comparable to a desktop GPU, indicating that real-time SR is a challenge. In addition, Yeo et al. argued that PSNR does not reflect human perception well and that more suitable evaluation metrics such as VMAF [78] should be used.

• NAS: For a VoD scenario, Yeo et al. introduced NAS, the first practical system utilizing an SR DNN for DASH video streaming [86]. NAS leveraged real throughput traces from Norway's 3G network [195] and U.S. broadband [196], ranging from 0.38 Mbps to 4.69 Mbps. This pioneering work combined SR DNNs with dynamic video streaming, effectively exploiting frame redundancy over larger time scales through the inherent overfitting property of DNNs. NAS operates by pre-training a DNN for each video on the server and subsequently training content-aware DNNs for individual video episodes. The content-aware DNN is then transmitted to the client, resulting in superior performance compared to generic SR models in terms of enhancing video quality. Unlike MDSR [35], shown in Fig. 5, NAS supports multi-scale super-resolution while utilizing only selected CNN layers for efficient SR processing, ensuring an inference speed faster than 24 FPS. Remarkably, NAS demonstrates the potential to improve video quality by 1-5 dB while conserving bandwidth, leading to a 43% improvement in average QoE for the same bandwidth consumption. In the traffic analysis presented in Fig. 7, NAS incorporates both video and model streaming; the size of the SR DNN, ranging from several KB to hundreds of KB, has a negligible impact on data transmission. To assess the impact of SR on QoE, NAS integrates a Reinforcement Learning (RL)-based ABR algorithm, Pensieve [197], using the output quality of the SR model as the video quality metric. However, when the video is long or exhibits significant content variations over time, a single pre-trained DNN may struggle to perform well across the entire duration. To address this, the video is segmented into multiple clips, and a dedicated DNN model is trained for each segment. Unfortunately, this approach generates a large number of DNN models, leading to increased bandwidth consumption during model transmission.

Fig. 7. NAS vs. SRAVS.

• SRAVS: Zhang et al. addressed NAS's oversight of the delay introduced by SR model inference with SRAVS [87]. Tested on 4G network traces characterized by poor bandwidth conditions [198], SRAVS enhances the ABR algorithm for VoD delivery. Unlike NAS, SRAVS makes SR processing optional, allowing unnecessary SR inference to be avoided. The RL-guided ABR algorithm not only selects the bitrate of the current video chunk but also decides whether to execute SR processing on that chunk. Additionally, to better model SR processing, SRAVS considers the delay introduced by SR and incorporates a playback buffer, ensuring alignment between the video content and the SR model. However, unlike NAS, SRAVS does not account for the traffic consumption associated with model delivery, as shown in Fig. 7. This approach may lead to the use of generic SR models, potentially impacting inference effectiveness.

To be clear, let $t_{i}^{H}$ denote the moment at which reconstruction of the $i$th chunk begins, $\psi_{i}(q_{i}, q_{i}^{\prime})$ the duration needed to reconstruct the $i$th chunk from resolution $q_{i}$ to resolution $q_{i}^{\prime}$, and $T$ the length of a single video chunk. The time $t_{i}$ at which the client begins playing the $i$th chunk is computed as $t_{i}=\max\{t_{i-1}+T,\; t_{i}^{H}+\psi_{i}(q_{i}, q_{i}^{\prime})\}$. The rebuffering time of chunk $i$ is then $\tau_{i}=(t_{i}-t_{i-1}-T)_{+}$, and the average rebuffering time is $Q_{3}=\frac{1}{N}\sum_{i=1}^{N}\tau_{i}$.
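
A small sketch of this timing model is given below; it assumes playback of a (virtual) chunk 0 finishes at time zero, which is an illustrative simplification.

```python
def average_rebuffering(t_H, psi, T):
    """Compute per-chunk play times t_i, rebuffering tau_i and the average Q3
    following the timing model above. t_H[i] is the moment SR reconstruction of
    chunk i starts, psi[i] its reconstruction duration, T the chunk length."""
    taus = []
    t_prev = 0.0                                   # assumed finish time of chunk 0
    for start, duration in zip(t_H, psi):
        t_i = max(t_prev + T, start + duration)
        taus.append(max(t_i - t_prev - T, 0.0))    # (x)_+ = max(x, 0)
        t_prev = t_i
    return sum(taus) / len(taus)
```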

This rebuffering time is introduced into the state space of the ABR agent to replace the download duration in the QoE evaluation. The QoE metric is improved by considering the SR-processed chunks and the scenes in the playback buffer. However, it should be noted that SRAVS adopts one of the simplest SR models (i.e., SRCNN [123]), whose real-time inference is not confirmed by SRAVS's experiments. SRAVS also does not specify how the SR models are trained, deployed or transmitted; it seems that SRAVS uses a generic SR model embedded in the client in advance, while the SR effect is only evaluated based on PSNR and appears to be limited.

• NEMO: In some NAS-based studies [86], [87], end-users are assumed by default to have good computing capacity, such as gaming computers with GPUs. However, a smartphone, which is a more common end device, has much lower computing capacity than other end devices such as personal computers. To provide real-time per-frame playback of super-resolved video, clients should super-resolve at a real-time framerate (i.e., 24-30 FPS) [85], [86]. However, existing SR models cannot achieve processing speeds above 24 FPS on smartphones [85], [199]. Although a client in NAS with a powerful GPU (i.e., Nvidia Titan Xp) could complete 24 FPS inference, per-frame super-resolution on real mobile devices supports only 11 FPS, as NEMO [85] and BiSR [199] demonstrated. Additionally, it is unacceptable for a smartphone to become overheated or consume too much power in order to complete SR inference.

To address the limitations related to executing SR models on smartphones, Yeo et al. proposed NEMO [85], the first system to implement real-time inference of SR models on smartphones (i.e., Xiaomi Redmi Note7, Xiaomi Mi9 and LG GPad 5). As shown in Fig. 8, NEMO achieves real-time chunk-dependent SR inference on smartphones by selectively super-resolving certain video frames and leveraging the SR results for other intra-chunk frames using inter-frame motion vectors and residual information. It also reduces the computing and energy requirements of end devices in SR-based VoD delivery systems.

Fig. 8. NEMO vs. Supremo vs. FOCAS.

NEMO operates in two phases: offline preparation and online streaming. In the offline phase, NEMO uses a content-aware approach to select a minimal set of anchor-point frames, considering both the video content characteristics and the player's computational capabilities. This involves comparing the quality difference between inferring only a subset of frames and inferring all frames. The server prepares a range of DNNs of varying sizes and qualities tailored to the specific video content and player capabilities. In the online phase, when a mobile client requests a DNN model and cache profile within the NEMO framework, it specifies its processor type. The cache profile consumes approximately 0.3 KB per minute of video on average, while the DNN models range in size from 118 KB to 1085 KB. The client caches the super-resolved frames and uses them to reconstruct the remaining frames, optimizing content delivery and processing resources for an enhanced viewer experience.

NEMO's performance has been validated through QoE experiments using real network traces with an average bandwidth of 1.3 Mbps, providing insights from a networking perspective. However, it is important to note that the NEMO algorithm requires significant computational resources, due to the need to train multiple neural networks for each video and the time-consuming frame selection process. These computational demands and time considerations are essential factors when assessing the feasibility and efficiency of the NEMO approach.

• Supremo: In addition to utilizing the computing capacity of a client, it is also possible to offload some client tasks to a server, for example by analyzing important parts of the video at the server and only partially super-resolving at the client. Different offloading approaches are utilized to achieve fine-grained super-resolution and reduce transmission/computation overhead, as depicted in Fig. 8: NEMO employs frame-level offloading [85], Supremo focuses on block-level processing [147], and FOCAS targets the regions within the user's visual focus [177]. The server which stores the video content needs to identify its key parts (e.g., certain blocks, frames or chunks). This is achieved by manual or automatic strategies (e.g., crowdsourcing [77], [200], user retention [201], learning-based highlight detection [202], [203]). Then, the server sends lower-quality content together with auxiliary information (e.g., residual data, block region, frame index or chunk number) to clients to enable super-resolving and restore the high-quality content.

Yi et al. [147] proposed Supremo, comprising three key modules: Priority Ordering, Residual Encoding and Offload-Inference Pipelining, achieving about a 3× reduction in computational complexity while introducing only a minimal 0.1 dB PSNR loss in reconstruction performance. For a single image, the mobile-side Priority Ordering module, based on the Canny edge detector, selects the patches with the highest edge intensity in the whole image. Residual Encoding, operating on the cloud/server side, reduces the downlink traffic by exploiting the sparsity of residual signals and relies on IDN [129] (improved to IDN-Lite) and bicubic interpolation. Offload-Inference Pipelining performs the offloading of multiple patches in parallel. For the case of video frames in Fig. 8, Supremo caches the SR results, finds matching patches between consecutive LR frames, and reuses the SR results of these patches to ensure reasonable use of computational resources. Supremo utilizes the computing capacity of the server in the VoD/live streaming system and highly parallelizes the offloading process. However, the patch selection approach of Supremo is simplistic, with negligible latency on commodity devices (e.g., 2 ms for a 640×360 image on a Google Pixel 2), and it is not integrated with video coding, so it does not fully exploit image information. If more advanced content selection approaches are adopted, the end-to-end latency of client-server SR cooperation (i.e., identifying at the server and partially super-resolving at the client) should not be longer than full offloading or full on-device processing. Additionally, the system aims to enhance the client's local image/video content without considering the efficiency of the entire transmission system.
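
As a hedged sketch of the edge-intensity-based patch priority ordering described above (not Supremo's exact implementation), the following OpenCV snippet scores fixed-size LR patches by Canny edge density and returns the highest-ranked patch coordinates; the Canny thresholds and the number of selected patches are assumptions.

```python
import cv2

def rank_patches_by_edge_intensity(lr_frame_bgr, patch: int = 17, top_k: int = 64):
    """Score each fixed-size patch of an LR frame by Canny edge density and
    return the top-k patch coordinates (row, col) as candidates for SR offloading."""
    gray = cv2.cvtColor(lr_frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)            # thresholds are illustrative
    h, w = edges.shape
    scored = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            scored.append((edges[r:r + patch, c:c + patch].mean(), (r, c)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [coord for _, coord in scored[:top_k]]
```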

• FOCAS: Wang et al. [177] proposed FOCAS, a novel approach designed to super-resolve only the frame regions within a user's visual focus rather than the entire frame. This approach takes into consideration the key observation [204] that users tend to concentrate on the central portion of each video frame. To better align with the user's visual perception, FOCAS employs the quality metric EW-PSNR [205] instead of traditional PSNR, which places greater emphasis on the middle section of the video frame. FOCAS adopts RRN [171], a VSR model with multiple ResBlocks, divides a video frame into three regions from the edge to the middle, and uses SR models at three different scales by adjusting the number of ResBlocks. FOCAS performs differentiated SR in different frame regions, ensuring a good visual experience while reducing the overhead of SR processing, and ultimately reduces the overall latency of the system. Nevertheless, the effectiveness of EW-PSNR is not universally confirmed. As shown in Fig. 8, FOCAS primarily aligns with computer vision research and does not fully address real deployment issues related to client-server communication and SR model delivery. Notably, both model training and SR inference are executed on high-performance computing units, such as a GTX 2080Ti GPU, which may not be compatible with resource-constrained mobile devices. From a networking and communication perspective, these considerations are essential when assessing the feasibility and adaptability of the FOCAS approach, especially when delivering key regions at high resolution for training purposes.

• Liu et al.: To further reduce the bandwidth budget, Liu et al. [178] proposed a novel training framework for specified DNNs based on Content-aware Feature Modulation (CaFM) that shares model parameters. Liu et al. noted that the deep learning features of video clips from different moments of the same video have a linear relationship. When simultaneously training specified DNN models for multiple clips of a single video, the different video chunks share most of the convolution layers of the DNN model, while each chunk has its own CaFM layer to adapt to its specific content. Therefore, for a video with n clips, only one shared SR model is needed for all clips, plus n CaFM layers specified to the corresponding clips. The shared model requires less than 1% of the parameters needed when training separate SR models for every video chunk, without weakening performance, which significantly reduces the traffic consumption of model streaming. The CaFM layer is a 1×1 convolution layer with a small number of network parameters (0.6% of the entire SR model). It is confirmed that CaFM-based SR achieves better video quality than H.264 and H.265 under the same storage cost, helping to reduce terminal storage overhead, but the quality improvement from SR processing is still limited because the specified DNN parameters of each video clip form a fixed convolutional layer. Although this research [178] contributes to making the SR model lightweight, the video delivery system still consumes more than 40 Mb of bandwidth for sending a generic EDSR model and a non-negligible 250 Kb for sending the CaFM parameters of each video chunk.
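
A minimal sketch of such a per-chunk modulation layer is shown below, assuming CaFM behaves as a channel-wise (depthwise) 1×1 convolution applied to shared backbone features; the feature width is illustrative.

```python
import torch.nn as nn

class CaFMLayer(nn.Module):
    """Per-chunk content-aware feature modulation sketch: a channel-wise
    (depthwise) 1x1 convolution that adapts shared backbone features to one
    video clip while adding very few parameters."""
    def __init__(self, feats: int = 64):
        super().__init__()
        self.mod = nn.Conv2d(feats, feats, kernel_size=1, groups=feats, bias=True)

    def forward(self, shared_features):
        return self.mod(shared_features)   # only this layer is chunk-specific
```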

• SRVC: Similar to Liu et al. [178], Khani et al. [179] proposed SRVC, which efficiently compresses video via content-adaptive SR. It includes a flexible training method for the specified DNN parameters of each video clip, with each clip serving as its own training dataset. The specified DNN parameters are those with large gradients and are updated when they change. These parameters take up a small share of the SR model and are encoded and transmitted together with their location in the SR DNN model. In this way, the specified DNN parameters of each video clip can reside anywhere in the DNN model instead of in a fixed convolutional layer, so the specified parameters in [179] are more flexible and further improve the performance of the DNN models. Its modified EDSR model reduces the typical 43M parameters to fewer than 10M. However, the extra model stream (i.e., requiring 82 Kb/sec to encode the model stream when generating 1080p video) still consumes considerable bandwidth. Even though the content and model streams are well compressed, the computation required for training and inference is fairly high, requiring a top-end data center GPU (i.e., NVIDIA V100) to implement the pipeline.

• dcSR: Utilizing content-aware SR models means that an SR model needs to be downloaded at the beginning of every video chunk, consuming a significant bandwidth share for model delivery and space for model storage. At the same time, using the same SR model for a whole video leads to fluctuations in video quality due to content variations between video chunks. Thus, Baek et al. proposed dcSR [180], a data-centric VoD system that applies one model to several similar video clips. The data-centric approach clusters video chunks according to content texture and then selects a micro model for the video chunks in the same category, reducing model transmission overhead. dcSR trains multiple micro SR models for a whole video and, within a chunk, only infers on I frames with these micro models. At the server side, in order to adapt different micro SR models to different scenes, dcSR divides the videos into variable-length chunks according to scene changes. It clusters video chunks with similar content by extracting the features of the chunks through a variational auto-encoder [206] and feeding these features into the K-means [207] algorithm. Afterwards, the video chunks in the same category can be trained and inferred with the corresponding micro SR model. At the client side, the P frames and B frames of the video chunks are decoded according to the enhancement results of the I frames, and the micro SR models are cached. Compared to NAS and NEMO, dcSR consumes 25% less bandwidth on average but has a limited recovery effect. It is notable that content-aware models are associated with a better SR effect, while data-centric models reduce traffic consumption due to the reduced size of SR models. However, clients may need to save the received models for reuse; a chunk-based model is about 5 MB for a 720p video, and larger for higher-resolution videos. It can be concluded that EDSR-based models are still not lightweight enough to significantly reduce traffic or storage consumption.
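
The chunk-grouping step can be sketched as follows, assuming per-chunk feature vectors (e.g., from a variational auto-encoder) are already available; the number of clusters/micro models is an illustrative parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_chunks(chunk_features, n_models: int = 8):
    """Cluster per-chunk feature vectors so that one micro SR model is trained
    per cluster instead of per chunk, reducing model transmission overhead."""
    features = np.asarray(chunk_features)            # shape: (num_chunks, dim)
    km = KMeans(n_clusters=n_models, n_init=10, random_state=0).fit(features)
    return km.labels_                                # micro-model index per chunk
```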

• VISCA: In line with the emerging edge computing paradigm, Zhang et al. proposed VISCA [18], which makes comprehensive use of the edge server's cache and selects different levels of SR models for videos of different popularity. However, VISCA predicts video popularity using a Zipf distribution [208], which may not reflect content providers' real situations or the real needs of edge-covered users. It selects FRVSR [154], an SR model with optical flow modules, for the most popular videos and removes the optical flow modules from FRVSR for low-popularity videos to accelerate inference. To make use of the edge server's limited cache, VISCA defines the cache utility of a video chunk as its contribution to QoE. Video chunks are stored in a short-term cache managed with the least recently used (LRU) algorithm, which is not optimized and results in numerous memory/disk I/O operations when handling a substantial volume of user requests. The other, long-term cache is not updated according to cache utility, which is an important disadvantage given the inflexible long-/short-term space allocation. For the ABR algorithm, with cache information, VISCA considers the time needed to obtain the video chunk from the origin server and from the edge server, the time consumed by SR inference, and the possibility of fetching higher-quality content directly from the server instead of low-quality content from the edge. However, this scheme only addresses the issue of insufficient backhaul bandwidth, without considering the real bandwidth dynamicity of the last mile. Additionally, content popularity prediction should be further analyzed based on the developing 5G infrastructure with its densely-deployed base stations and edge servers.
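
For reference, a minimal LRU cache for video chunks, the policy named for VISCA's short-term cache, can be sketched as follows; a real edge cache would additionally track per-chunk cache utility, which is omitted here.

```python
from collections import OrderedDict

class LRUChunkCache:
    """Minimal least-recently-used cache for video chunks."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, chunk_id):
        if chunk_id not in self.store:
            return None
        self.store.move_to_end(chunk_id)      # mark as most recently used
        return self.store[chunk_id]

    def put(self, chunk_id, data):
        if chunk_id in self.store:
            self.store.move_to_end(chunk_id)
        self.store[chunk_id] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used chunk
```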

B. Live Video Streaming

YouTube has reported that it cannot support live video streaming at 1440p and 4K resolutions in real time [209]. As illustrated in Fig. 12, the video is provided by a broadcaster, who is also the sender of the live video stream. The video is encoded by the broadcaster at a fixed resolution and uploaded to a media server. Subsequently, the media server decodes the video, processes/re-encodes it, and ultimately delivers appropriate content versions to different end users. In essence, a media server requires both a decoder and an encoder to transcode the original video content into multiple representations. Depending on the demands of network operators or content providers, the delivery paths of the content may also involve CDN PoPs or edge servers.

Fig. 9. Comparison between Liu et al.'s approach and SRVC: a general super-resolution-based network communication mechanism with server-side model training and client-side SR inference.

Fig. 10. dcSR Principle.

Fig. 11. VISCA Principle.

Fig. 12. The delivery architecture of live videos with SR.

In this process, a live video streaming system faces two main challenges. 1) First, it may be difficult for the media server to obtain high-quality original videos due to the device capacity of broadcasters. There are situations in which the capture device of a broadcaster has limited capacity or the uplink bandwidth from the broadcaster to the media server is limited, especially for UGC. A low-capacity network can only transmit low-quality, low-volume content in real time. 2) Second, the media server needs to process the video content and respond to end users quickly. The encoded limited-quality video is uploaded to the server by the broadcaster in real time, so it is difficult for the server to obtain prior knowledge about the high-quality original video. This requirement for high-quality reference videos makes it challenging to implement content-aware SR models for live video streaming. Additionally, although several seconds of delay may be acceptable for one-way (non-interactive) live streaming [210], one-way (mouth-to-ear) transmission delay should not exceed 150 ms, as recommended by ITU-T Rec. G.107 [211], since too much delay negatively influences user retention [212]. It was noted that a 100 ms increase in video delay is similar to a 1.0 dB quality decrease expressed in terms of SSIM [32]. If intelligent enhancement approaches are adopted, it should be considered that the media server has a short time to process the video, while SR processing consumes time in training (e.g., hours), updating (e.g., minutes to hours) and inference (e.g., tens of milliseconds on powerful GPUs to seconds on normal mobile devices) [84], [91]. Training content-aware SR models in advance and achieving real-time inference for live streaming can be challenging. Generic SR models can be utilized, but their performance may not always meet the required standards. Therefore, it is essential to explore existing SR methodologies in live streaming scenarios to determine the most effective application of the proposed SR techniques within the current network environment, ensuring the fulfillment of users' quality requirements. Fig. 13 presents several typical SR-assisted live streaming schemes, comparing their training and enhancement methods tailored to adapt to dynamic networks.

Fig. 13. Different SR-assisted solutions for live video streaming.

• LiveNAS: Kim et al. proposed LiveNAS [83], the first system integrating content-aware SR methods into live video streaming by super-resolving key regions of video frames, similar to Supremo [147] and FOCAS [177]. Firstly, to provide the server with HD material for training the content-aware SR model under severely limited bandwidth, LiveNAS selects only some patches from a subset of video frames at the broadcaster side. These patches maximize the quality gain of using content-aware SR; their high-quality versions are compressed in JPEG format and uploaded to the server together with the encoded LR video. The transmission bandwidth consumed by the HR patches and the LR video is allocated according to a gradient-ascent method built on WebRTC [213]. Secondly, to achieve real-time training and inference of the SR model, LiveNAS adopts the lightweight SR model provided in NAS [86] (i.e., an extended MDSR) and realizes real-time SR processing that upscales the key patches of all frames to 4K through parallel inference on multiple GPUs. Finally, LiveNAS also designs an algorithm based on a real-time quality gain threshold of the SR processing to determine when the SR model needs to be updated by online learning. It is innovative that LiveNAS selects patches for updating the content-aware SR model. It provides 1080p video with a minimum of 30 dB PSNR, which requires approximately 6 Mbps of bandwidth, while the HR patches provided as ground truth involve an additional 124 Kbps of bandwidth consumption. However, LiveNAS is not computation-saving as it super-resolves all frames, which may also result in long reconstruction times; it will be hard to satisfy the stringent delay requirements of live streaming.

• LiveSRVC: Although LiveNAS reduces the bandwidth consumption from the broadcaster to the media server, the uplink is still under a heavy burden due to the transmission of HR patches. Chen et al. therefore proposed LiveSRVC [84] to further reduce uplink traffic and decrease the SR inference time. LiveSRVC designs a new coding module at the broadcaster side which increases the compression ratio of key frames to accommodate the limited uplink bandwidth: the key frames are downsampled while the remaining frames keep their original resolution, and all frames are then transmitted to the server. Compared with traditional encoding methods, e.g., H.264, LiveSRVC achieves approximately a 50% improvement in video compression. On the media server, the key frames are upscaled by the SR DNN while the remaining frames are restored with reference to the SR-processed frames. In this way, the compression efficiency of videos is improved and the server can restore a video chunk with only one SR-processed frame. Meanwhile, LiveSRVC modifies the ABR algorithm based on MPC [214] to provide the broadcaster with a target encoding bitrate, with the purpose of improving the quality gain from the SR processing. The server only applies SR processing on the compressed key frames to reduce the workload. Furthermore, the improved LiveSRVC, i.e., FlexSRVC [215], achieves a balance between bandwidth and computing power by super-resolving more low-resolution frames. Similar to LiveNAS, LiveSRVC also trains the content-aware SR model online, but the training data differ: LiveSRVC uses the first frame of every 1-second chunk and the frames after scene changes as training data, while LiveNAS takes some important patches as ground truth. Compared with LiveNAS, LiveSRVC effectively reduces the SR workload and GPU usage, improving the performance of the live video system. However, only applying the SR model to key frames limits the overall quality gain, since the enhanced image quality cannot be perfectly propagated to non-key frames via the original motion vectors.

• DAVE: In live video streaming, latency is a crucial metric alongside high video quality. However, integrating SR DNNs into a live video system typically increases latency due to encoding and decoding processes. To address this challenge, Huang et al. introduced DAVE [181], aiming to minimize overall latency. DAVE is the first solution to integrate SR processing with RL-based coding methods in a live video system. Despite the limited computing resources of broadcasters, the encoding process accounts for 14% of the overall execution time, indicating that encoding latency cannot be ignored. Therefore, DAVE focuses on reducing latency by optimizing the video encoding process, while using the generalized FSRCNN-S model to improve video quality at the receiver (e.g., high-performance clients). Concretely, DAVE uses a reinforcement learning (RL) method to optimize the video encoding. For each video chunk (of 1-second length), the network state and the parameters of the previous video coding decision form the state space of the RL agent. The parameters of the H.264 video format, such as resolution, coding speed, constant rate factor and frame rate, form the action space, and QoE is the reward used to select the most suitable video coding parameters. Previous studies generally use fixed video coding parameters except for the encoding bitrate, while DAVE dynamically adjusts these parameters. However, DAVE appears to super-resolve whole frames based on a generalized model, so high-quality recovery results cannot always be ensured.
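
A hedged sketch of such an encoding action space is given below; the specific resolutions, x264 presets (as a stand-in for "coding speed"), CRF values and frame rates are assumptions for illustration, not DAVE's actual parameter grid.

```python
from itertools import product

# Hypothetical per-chunk encoding action space: each action fixes a set of
# H.264 parameters (resolution, coding-speed preset, CRF, frame rate) for the
# next 1-second chunk; an RL agent would pick one action per chunk.
RESOLUTIONS = [(1280, 720), (960, 540), (640, 360)]
PRESETS = ["ultrafast", "veryfast", "medium"]   # x264 coding-speed presets
CRF_VALUES = [23, 28, 33]                       # constant rate factor
FRAME_RATES = [30, 24, 15]

ACTION_SPACE = list(product(RESOLUTIONS, PRESETS, CRF_VALUES, FRAME_RATES))

def decode_action(action):
    """Translate one action tuple into encoder settings for the next chunk."""
    (width, height), preset, crf, fps = action
    return {"width": width, "height": height, "preset": preset,
            "crf": crf, "fps": fps}
```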

• CrowdSR/LiveSR: Although the device performance of broadcasters is generally weak, it is necessary to consider the heterogeneity among different broadcaster devices. Broadcasters with low-performance devices can only provide low-quality videos, and a low uplink bandwidth may also hinder HR video delivery. In such a case, it is impossible for the online learning method proposed in LiveNAS to acquire HR patches as training data.

Luo et al. proposed CrowdSR and LiveSR [182], [183], based on EDSR, to deal with the device heterogeneity of broadcasters and enhance LR video content for low-performance broadcasters. CrowdSR employs a broadcaster selector to fetch representative frames from different broadcasters and measures similarity between live video streams using hash values. According to the video similarity (i.e., MSE), CrowdSR periodically fetches HR video frames from high-performance broadcasters to train the SR model for the corresponding LR video stream. Concretely, the most representative regions of these HR frames, e.g., regions of interest, are cropped into patches for online learning. Additionally, LiveSR sends numerous content-aware SR models from multiple bandwidth-sufficient devices to the server side, aggregating them to generate a global model akin to FedAvg [216]. As the hardware performance and network conditions of different broadcasters vary widely, Luo et al. provide a more general and practical approach. However, CrowdSR's periodic training of content-aware SR models consumes substantial computing capacity, and despite providing a quality gain slightly better than that of a generalized SR model, the SR gain defined by CrowdSR imposes limitations.

• Dejavu: Conference video streaming can also be regarded as a kind of live video streaming that requires low latency. Hu et al. proposed Dejavu [184] for conference video, using an SR DNN to address the problem that conference videos are often transmitted at low quality to guarantee low latency. Dejavu observes a strong similarity between multiple historical conferences launched by the same user, so some frames from past video conferences can be utilized to enhance the video frames of the current conference. Based on this observation, Dejavu designed offline and online modules. In the offline part, the meeting initiator sends some high-quality frames from historical video conferences to the server. The training data is selected by a DNN that scores the PSNR gain of each patch and is used to train the SR model. In the online part, the meeting recipient downloads the SR model directly from the server to enhance the received video. Dejavu helps to improve meeting video systems but lacks practicality, as it assumes that the user has a powerful GPU. Besides, Dejavu is not very applicable to QoE-based live video streaming.

• CloudSeg: For real-time video streams such as those used in video analytics, SR models help to improve analytics efficiency by improving video quality [217]. Wang et al. proposed CloudSeg [185], which sends an LR video to the server and recovers HR video frames at the server using an SR model customized for video analytics, reducing bandwidth consumption by 6.8×. CloudSeg is an SR-based edge-to-cloud scheme targeting both model optimization and system optimization, and it is not suited to low-capability terminals. First, it trains a basic CARN [127] model with the quality loss between the original HR frames and the super-resolved ones as the target, which is computation-consuming and should be done offline. The CARN model is then fine-tuned with the performance differences produced by running the semantic segmentation model, i.e., the image cascade network (ICNet) [218], on the HR frames and the reconstructed ones separately. Since the ICNet segmentation inference works better on larger patches, the CARN in CloudSeg uses the same patches for fine-tuning so that online model updating is possible. Second, CloudSeg also achieves a system-level optimization. Based on video semantic segmentation [219], CloudSeg designs an edge-side 2-level frame selection algorithm for filtering the key frames and useful frames, respectively. After the two types of frames are found, they are tagged and downsampled before being sent to the server. CloudSeg also keeps the LR frames on the server side to simplify the ICNet computation. Besides, CloudSeg designs an adaptive bitrate controller to determine the downsampling rate, frame selection threshold and frame rate appropriately, or to give up the SR processing entirely. Nevertheless, CloudSeg's approach does not account for the latency introduced by the filtering and downsampling stages, which may compromise real-time content delivery requirements. Additionally, CloudSeg relies on an existing neural network for assessing frame similarity and filtering out key frames and pertinent frames. In the networking context, recent research has explored alternative, more efficient frame filtering techniques to enhance video analytics and video recovery processes, e.g., Reducto [220], VOXEL [173] and Reparo [221].

C. 360 Degree Video Delivery

When watching 360° videos, there are three rotational degrees of freedom, around the x, y and z axes respectively. A 2D video is divided temporally into multiple video chunks, whereas a 360° video is further divided into multiple tiles on top of 2D video chunks, depending on the user's viewing angles. Due to the huge volume of 360° video, it is impractical to transmit all parts of a 360° video at high quality. Some existing studies [222], [223], [224] predict the tiles that are likely to be watched by predicting viewing behaviors, i.e., viewport prediction. Then, only the relevant tiles are delivered at high quality, saving bandwidth. Similarly, it is also unrealistic to apply SR to the whole 360° video, and the limited computational resources of the client need to be utilized more rationally. It should also be noted that viewport prediction algorithms have relatively limited accuracy, so many problems remain to be solved for SR-based delivery of 360° videos.

• PARSEC: Dasari et al. proposed PARSEC [186], adopting SR DNNs in 360° video systems. To better adapt to the characteristics of 360° video, PARSEC performs SR processing on each video tile and achieves superior results compared to the HEVC codec. Additionally, PARSEC employs a newly-designed heuristic ABR algorithm and QoE evaluation metrics applicable to 360° video systems; the errors in viewport prediction are considered in the QoE metrics. PARSEC was the first solution to introduce SR processing into 360° video delivery systems, and the characteristics of 360° video are utilized to improve the video delivery process. However, similar to dcSR, PARSEC trains a large number of micro-models for every chunk and every tile and transmits them to the client side, which brings considerable transmission cost.

D. 6DoF Video Distribution

Compared to 360° videos, a six degrees of freedom (6DoF) video provides users with higher spatial freedom [187], [225]. In a 360° video, also known as a 3DoF video, users stay in a fixed position and watch the video by turning their heads (see Fig. 14). In a 6DoF video, users can also move along the x, y and z axes. Since the user can watch objects in a 6DoF video from different angles, the content needs to be presented in the form of a point cloud [226] rather than a classic image. Each frame of the 6DoF video is reflected as a change in the 3D coordinates and colors of certain points in the point cloud. Therefore, a 6DoF video has a larger size than a 360° video, putting additional pressure on the content delivery system. This also indicates that the quality enhancement brought by SR methods is indispensable for 6DoF video. Unfortunately, the SR algorithms employed for 6DoF video differ from those used for image-based video: SR in 2D video maps LR images to HR images, whereas existing SR approaches for 6DoF video make the point cloud denser for better visualization. This makes the solutions heavier than SISR and VSR DNNs. Consequently, the inference speed of some typical point cloud SR networks, e.g., PU-Net and PU-GAN, is only 0.1 FPS, making it almost impossible to introduce SR algorithms into a real 6DoF video delivery system.

Fig. 14. From 3DoF to 6DoF.

• VoluSR: Zhang et al. proposed VoluSR [187], attempting to use SR DNNs in a 6DoF video delivery system. VoluSR lightens the SR GAN by removing some convolutional layers and features, and optimizes the patch generation process of the point cloud using Voronoi diagrams [227]. Similar to 2D video delivery systems, VoluSR also reuses the SR results across similar point cloud frames via motion vectors to avoid the visual bias caused by inefficiently super-resolving each point cloud frame. VoluSR also takes advantage of the unique properties of point clouds by combining SR inputs with SR outputs to reduce the upsampling scale. Besides, the colors of simple patches are obtained by interpolation, while those of complex patches are super-resolved by inferring both position (X, Y, Z) and color (R, G, B) information. To support 6DoF video delivery, VoluSR performs viewport prediction and proposes an ABR algorithm for 6DoF video. The ABR algorithm utilizes a user visibility factor to determine the patches that need to be transmitted and their sampling rates. VoluSR is the first to adopt SR methods in 6DoF video delivery. Unfortunately, this solution only provided a novel idea without a complete implementation. Later, Zhang et al. proposed Yuzu [188], which included a complete implementation of VoluSR. However, it is still difficult to ensure good quality when delivering high-volume videos under insufficient bandwidth and limited computing capacity.

E. Other Solutions

• MobiSR: SR inference on mobile devices is a hardware-dependent task. The works above mainly focus on accelerating SR training and inference through algorithm design from a software-based/network-oriented perspective. Exploiting the heterogeneity of mobile devices, SR tasks can be allocated to different local computing units. Lee et al. therefore designed MobiSR [189], a novel scheduling algorithm that allocates image patches to different SR DNNs (based on RCAN [128]) and hardware processing units. Specifically, for a given mobile device, MobiSR traverses the design space of SR DNNs using model compression techniques such as group convolution and depthwise separable convolution, to find a lightweight SR DNN that achieves the optimal tradeoff between SR image quality and processing speed. MobiSR divides the images to be super-resolved into smaller patches and assigns them to different compute-model pairs according to the difficulty of the super-resolving task. The computation engines can be digital signal processors (DSPs), CPUs and GPUs. Experiments show that MobiSR increases the speed of SR inference by 4.79 times compared to a highly-optimized single compute engine. Nevertheless, MobiSR requires traversing a large design space to find the optimal SR DNN, which can be computationally intensive.

• ClassSR: Similar to MobiSR, Kong et al. [190] proposed ClassSR to accelerate SR by decomposing large images into small sub-images/patches. ClassSR aims at accelerating the SR process on large images (e.g., 2K or 8K) that consume more storage space and require high delivery traffic. As the patches of these images are classified into different categories, patch-/frame-/chunk-based classification can also be applied with generic SR models to accelerate network video streaming. After decomposition, different SR networks are employed to process different classes of sub-images, with multiple base networks each trained on all data of the corresponding class. It is found that smooth areas are easier to restore than complex textures, so the sub-images are divided into distinct classes based on the difficulty of restoration: a classification module assigns sub-images to categories while an SR module super-resolves the patches of each category. ClassSR claims that it reduces computation by 50% to 70%. Unfortunately, the success of ClassSR depends heavily on accurately classifying image patches, which can be challenging and error-prone. Additionally, maintaining multiple SR networks for different classes of patches increases complexity and requires more storage and training resources.

• SplitSR: Liu et al. proposed SplitSR [191] to lighten the CNN-based RCAN [128] model and address hardware limitations. The researchers observed that different channels in the SR network have different importance and proposed the SplitSRBlock accordingly. SplitSR can be tuned for a computational budget by introducing lightweight residual blocks (i.e., SplitSRBlocks), balancing the tradeoff between accuracy and latency. A SplitSRBlock splits the channels of a feature block into two parts: one part performs depthwise or pointwise convolution while the other is directly concatenated to the output of that convolution. The order of the two parts is reversed at concatenation, which ensures that every channel participates in computation across the stacked CNN layers. Nevertheless, the performance of SplitSR is only evaluated in the scenario of image SR for mobile devices; it may be difficult to apply to video delivery architectures due to content dynamicity.
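
A rough sketch of a SplitSRBlock-style channel split, under the assumptions that a fixed fraction of channels is convolved (pointwise then depthwise) and the two parts swap order at concatenation, could look as follows; the split ratio and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class SplitSRBlockSketch(nn.Module):
    """Channel-split block sketch: only a fraction of the channels is convolved
    (pointwise then depthwise); the remaining channels pass through untouched and
    the two parts swap order at concatenation, so stacked blocks eventually
    process every channel."""
    def __init__(self, feats: int = 64, split_ratio: float = 0.25):
        super().__init__()
        self.active = int(feats * split_ratio)
        self.conv = nn.Sequential(
            nn.Conv2d(self.active, self.active, kernel_size=1),            # pointwise
            nn.Conv2d(self.active, self.active, kernel_size=3, padding=1,
                      groups=self.active),                                 # depthwise
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        a, b = x[:, :self.active], x[:, self.active:]
        return torch.cat([b, self.conv(a)], dim=1)   # reversed order on concat
```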

SECTION VI.

Challenges and Future Research Avenues for Enhanced Video Delivery

The previous section presented intelligent solutions for quality enhancement for different types of video delivery, including VoD, live streaming, 360° video and 6DoF video. We discussed these works in terms of their advantages and disadvantages from a network-based, communication-related perspective. Some solutions also optimized the video streaming process by considering the training overhead, traffic load, processing delay and other aspects. In this section, we explore additional challenges and discuss the potential for further improvements of the video delivery process from a system-based, network-oriented perspective. Fig. 15 illustrates some of these possible avenues and the location of their associated solutions. On the one hand, clients want to play high-quality content smoothly, while on the other hand, servers prefer to minimize both traffic and content processing. The delivery system suffers from limited uplink bandwidth, especially in live content distribution scenarios, whereas the goal for downlink communications is to provide very high content transfer capacity and to reduce latency and loss. Finally, for the overall video distribution system, data processing must be optimized from diverse points of view. Next, we discuss some open challenges and potential research avenues to solve the associated problems, especially in the context of the cloud-edge-end continuum.

Fig. 15. The problems located in SR-assisted video streams.

A. Model Optimization

In the context of video streaming optimisation, it was noted that the use of intelligent enhancement models is highly beneficial. However, it is very important for research to focus on improving the models employed. There are various targets for model optimization, especially improving model effectiveness and making models lightweight. In a video delivery system, effectiveness relates to delivering high content quality and achieving high user QoE. Being lightweight refers to reducing computation complexity and inference time, which decreases the server load and improves the user experience. From a communication perspective, the optimisation goal is to improve the system efficiency in terms of data exchange, considering different network conditions. To strike a balance between effectiveness and lightweight design, it is useful to share information among models. This approach adds some network traffic, but greatly reduces the amount of data associated with model transmission. It is challenging to find the right balance between the data to be exchanged and the model instances required in diverse scenarios. Special training strategies can be adopted to reduce the number of model instances, which helps to alleviate the system load in terms of delivering training data, maintaining models and exchanging relevant parameters.

1) Training With Selective Data:

For VoD streaming, it is desirable to use a large dataset and consume some computing resources for training in order to obtain a more efficient model. Both generic models and content-aware models work well for VoD streaming. Unlike VoD, live streaming usually needs online updating with real-time data. Unfortunately, the uplink from the receivers to the broadcasters is usually very limited [108], [109]. A good avenue for optimisation is related to providing training data selectively. The data exchanged can differ greatly in type, size, etc., and can relate to the whole video or to some frame(s), or consist of various delivery-related metrics or model parameters. For instance, SRAVS [87] trains a generic model for all the videos, embedded at the client side. The authors collected 123 videos stored in 2K resolution, of which 100 videos are taken as the training dataset for a CNN model. The live scheme Dejavu [184] also uses entire historical videos to update the SR model used.

NAS [86] trains content-aware models for every video episode, which consumes a great amount of computing, storage and transmission resources. It pre-trains a generic model and then uses a single video as the training dataset for further content-aware training. This approach is very resource-intensive as it is applied at the level of each frame. Similarly, NEMO [85] also trains multiple SR models for every video, but it executes SR processing based on anchor points instead of for every frame as NAS does. LiveSR [183] is a live video streaming scheme which employs some HR frames to upgrade its model.

Considering a finer-grained avenue, Supremo [147] and FOCAS [177] both only need patch-level information for model training, further reducing the traffic caused by sending training data. Supremo crops the LR images into 17×17 patches, while FOCAS obtains 64×64 patches from downsampled images of 448×256 resolution used as training input. Additionally, Supremo [147] utilizes an existing dataset to train an offline model, which accelerates the model fine-tuning process. It also selects the key frames and other useful frames to fine-tune the SR model for video analytics. This reduces both the real-time computing consumption and the traffic for network video streaming. The live streaming scheme LiveNAS [83] also employs some HR patches for improving the online model. Liu et al. [178] simultaneously train specified DNN models with multiple video clips of a single video. They provide chunk-level DNNs that share most of the convolution layers.
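
As a simple illustration of patch-level training data preparation (a generic sketch under assumed patch size, scale and sampling count, not the exact Supremo or FOCAS pipeline), matched LR/HR patch pairs can be cropped so that only a small amount of data per frame needs to be exchanged for fine-tuning:

import numpy as np

# Crop matched LR/HR patch pairs from a frame pair; only these small patches
# (instead of whole frames) would be sent for online fine-tuning.

def sample_patch_pairs(lr_frame, hr_frame, lr_patch=17, scale=2,
                       num_patches=16, rng=None):
    rng = rng or np.random.default_rng(0)
    h, w = lr_frame.shape[:2]
    pairs = []
    for _ in range(num_patches):
        y = rng.integers(0, h - lr_patch + 1)
        x = rng.integers(0, w - lr_patch + 1)
        lr_p = lr_frame[y:y + lr_patch, x:x + lr_patch]
        hr_p = hr_frame[y * scale:(y + lr_patch) * scale,
                        x * scale:(x + lr_patch) * scale]
        pairs.append((lr_p, hr_p))
    return pairs

lr = np.random.rand(180, 320, 3)        # toy LR frame
hr = np.random.rand(360, 640, 3)        # corresponding HR frame
pairs = sample_patch_pairs(lr, hr)
print(pairs[0][0].shape, pairs[0][1].shape)    # (17, 17, 3) (34, 34, 3)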

SRVC [179] also performs the same chunk-level model training by sharing DNN parameters. Both of these solutions use smaller datasets to train finer-grained models. Different from the typical training of generic or content-aware models, dcSR [180] classifies video chunks into different categories and uses these categories as training datasets.

Summarising, in this context, some schemes are designed for VoD, others for live streaming; some use large amounts of data, others very little; some solutions rely on generic training, others on content-specific training. However, all these solutions train their models with selectively chosen data, targeting different processing goals with different resource consumption and achieving various results. It remains challenging to find the best balance between efficiency and user QoE while also properly managing the model-related data transmission.

2) Reduce Model Size:

Across a large number of videos, content similarities and differences coexist. If a model is trained for each video or segment, this requires massive computing, storage and transmission overhead. Training a single generic model on diverse video content increases resource efficiency, but negatively impacts model effectiveness. In the context of SR, there are several avenues for training models according to scene switching (i.e., scene changes between different videos/segments/frames).

The first avenue refers to training content-aware SR models [86], [193], which involves training a dedicated model for specific content. Content-aware techniques help to provide differentiated services by reusing models based on scene similarity and updating models based on scene differences. The second avenue involves building data-centric SR models [180]. It refers to clustering videos according to their content and selecting the same model for all the videos classified in the same category. It is reasonable to train both generic and specific models to provide differentiated services for different video content. The performance of a generic model is inferior to that of a specific one, but it scales better. The third avenue involves aggregating models. Unlike the content-aware models that customize the SR process for the content and the data-centric models that choose the SR process for the given content, SR model aggregation means finding the right combination (i.e., assigning appropriate weights) of selected models, which can work better than using a single SR model. This approach saves resources by exploiting already trained models instead of training a new one. The fourth avenue considers sharing model parameters. In [178], the shared model has less than 1% of the parameters needed when training separate SR models for every video chunk. SRVC [179] includes a flexible training method that updates only specified DNN parameters for each video clip. These specified parameters are the ones with large gradients, and they account for only a small fraction of the SR model size.
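
As a rough sketch of the fourth avenue (with illustrative assumptions on the model, data and selection fraction; this is not the exact SRVC procedure), the parameters with the largest gradient magnitudes on a chunk can be marked as chunk-specific, while all others stay shared:

import torch
import torch.nn as nn

# "Share most parameters, adapt a few": after one backward pass on a chunk's
# LR/HR pairs, only the fraction of weights with the largest gradient
# magnitudes is left trainable per chunk. Model and data are toy values.

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3 * 4, 3, padding=1), nn.PixelShuffle(2))

lr_frames = torch.rand(2, 3, 24, 24)
hr_frames = torch.rand(2, 3, 48, 48)

loss = nn.functional.l1_loss(model(lr_frames), hr_frames)
loss.backward()

masks = {}
keep_fraction = 0.01
for name, p in model.named_parameters():
    g = p.grad.abs().flatten()
    k = max(1, int(keep_fraction * g.numel()))
    threshold = g.topk(k).values.min()            # k-th largest gradient magnitude
    masks[name] = (p.grad.abs() >= threshold)     # True = chunk-specific parameter

def apply_masks(model, masks):
    # During per-chunk fine-tuning, zero the gradients outside the mask so
    # that only the small selected set of parameters diverges from the
    # shared model.
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad *= masks[name]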

B. Improving Video Compression

From a video system perspective, SR processing should save as much video traffic as possible while providing a good restoration effect in real time. Video compression utilizes the servers' computational resources to reduce the amount of data transmitted and the subsequent network traffic load. This is usually performed at the servers, as shown in Fig. 15. It is important to improve existing codecs to achieve a higher compression ratio as well as faster processing. Additional information can be employed to ensure better restoration results [174], [228], [229]. Researchers usually compress the key frames heavily, as these frames are large. They also achieve efficient video compression by leveraging inter-frame and intra-frame codec information for the video restoration.

For instance, Zhang and Sze proposed FAST [230], which upscales the key frames using SR processing; this consumes a lot of computing power. The remaining frames (i.e., non-key frames) are upscaled using the already up-scaled key frame and the temporal correlation embedded in the compressed video (i.e., motion vectors and inter-frame residuals). The decoding process of the remaining frames is therefore very lightweight. However, when the encoded video is long or has drastic content changes, the SR benefit of the key frame cannot be transferred to the other frames.
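
The transfer idea can be sketched as follows, in the spirit of FAST but with simplifying assumptions (fixed block size, scale factor and nearest-neighbour residual upsampling; not the exact FAST pipeline): the super-resolved key frame is reused for a non-key frame by copying blocks at scaled motion-vector offsets and adding the decoded residual:

import numpy as np

def transfer_sr(sr_key, motion_vectors, residual_lr, block=8, scale=2):
    h_lr, w_lr = residual_lr.shape[:2]
    out = np.zeros_like(sr_key)
    for by in range(0, h_lr, block):
        for bx in range(0, w_lr, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # source block location in the super-resolved key frame
            sy = np.clip((by + dy) * scale, 0, sr_key.shape[0] - block * scale)
            sx = np.clip((bx + dx) * scale, 0, sr_key.shape[1] - block * scale)
            out[by * scale:(by + block) * scale,
                bx * scale:(bx + block) * scale] = \
                sr_key[sy:sy + block * scale, sx:sx + block * scale]
    # add the decoded LR residual, upsampled by nearest-neighbour repetition
    out += np.kron(residual_lr, np.ones((scale, scale, 1)))
    return out

sr_key = np.random.rand(128, 128, 3)      # SR'd key frame (x2 of a 64x64 LR frame)
mvs = np.zeros((8, 8, 2), dtype=int)      # one (dy, dx) per 8x8 LR block
residual = np.zeros((64, 64, 3))
frame = transfer_sr(sr_key, mvs, residual)
print(frame.shape)                        # (128, 128, 3)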

Li et al. [231] proposed a CNN-based block upsampling method for video frames, which relies on intra-frame coding in video compression. The upsampling process of the video frame blocks is divided into two stages. In the first stage, the downsampled blocks are restored to their original resolutions by a suitable SR algorithm. In the second stage, the block boundaries are refined based on the intra-frame information for an improved visual result.

Chen et al. [232] achieved a more lightweight VSR DNN by leveraging the interaction between the decoding prior and the deep prior. Specifically, to effectively utilize spatial-temporal cues between frames, a VSR DNN needs to align frames from different time instances. Unfortunately, this frame alignment consumes much computing power. In the video encoding process, in order to better exploit inter-frame dependency and increase compression efficiency, pixel blocks from different time instances are aligned through motion matching, and the resulting motion information (i.e., motion vectors) is sent along with the video. Therefore, Chen et al. proposed a motion vector-guided multi-scale local attention module to accelerate frame alignment in the VSR DNN, extract deep feature information at multiple scales and reconstruct HR video frames. It can be concluded that enhancing the codec is very important for SR-assisted video systems, not only to improve the compression ratio and processing speed, but also to provide useful codec information for SR processing.

C. Degradation Optimization

In the context of SR, degradation refers to obtaining an LR image by blurring, downsampling, or adding noise. The goal is for servers to achieve flexible and content-aware degradation, making the LR images retain more information. This helps to improve the performance of SR processing at the client. In general, it is assumed that the degradation function is known; that is, in the training process, a specific degradation kernel is used to obtain LR frames from HR frames. However, assuming a fixed degradation kernel is not realistic. Therefore, blind SR has been proposed, which assumes an unknown degradation process. There are two typical types of blind SR methods, i.e., supervised and unsupervised. Some works design a particularly complex degradation kernel to simulate the degradation process and then let a neural network learn to super-resolve; since the ground-truth HR frames are known, these are supervised methods. Other works directly super-resolve the LR frames without the corresponding HR images; these are referred to as unsupervised methods.
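
The classical degradation model commonly assumed when generating SR training pairs can be sketched as below; the Gaussian kernel width, scale factor and noise level are illustrative values only:

import numpy as np
from scipy.ndimage import gaussian_filter

# y = (x convolved with a blur kernel), downsampled by s, plus additive noise.

def degrade(hr, sigma_blur=1.0, scale=2, noise_std=0.01, rng=None):
    rng = rng or np.random.default_rng(0)
    blurred = gaussian_filter(hr, sigma=(sigma_blur, sigma_blur, 0))   # blur
    lr = blurred[::scale, ::scale]                                     # downsample
    lr = lr + rng.normal(0.0, noise_std, lr.shape)                     # noise
    return np.clip(lr, 0.0, 1.0)

hr_frame = np.random.rand(64, 64, 3)
lr_frame = degrade(hr_frame)
print(lr_frame.shape)       # (32, 32, 3)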

When there are not enough learning samples, it is reasonable to adopt meta-learning for SR model training [233]. Meta-learning, known as "learning to learn", refers to using previous knowledge and experience to guide the learning of new tasks, so that the network acquires the ability to learn. It is one of the commonly used methods to solve few-shot learning problems (i.e., learning with a small number of samples). Such methods may be compatible with live video streaming, where only a small amount of real-time generated content is available. However, the inference speed of meta-learning approaches is slow and needs further improvement before they can be employed for real-time video streaming.
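
To make the idea concrete, the following is a first-order, Reptile-style meta-learning sketch (an illustrative stand-in under toy assumptions, not a specific SR meta-learning scheme from the literature): the SR model initialization is nudged toward weights adapted on each small "task", e.g., a few LR/HR pairs from one live scene, so that only a few gradient steps are needed for new content:

import copy
import torch
import torch.nn as nn

def tiny_sr():
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 3 * 4, 3, padding=1), nn.PixelShuffle(2))

meta_model = tiny_sr()
meta_lr, inner_lr, inner_steps = 0.1, 1e-3, 5

for task in range(3):                                   # toy "scenes"
    lr = torch.rand(4, 3, 16, 16)
    hr = torch.rand(4, 3, 32, 32)
    task_model = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                        # inner adaptation
        loss = nn.functional.l1_loss(task_model(lr), hr)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                               # Reptile-style meta-update
        for p_meta, p_task in zip(meta_model.parameters(),
                                  task_model.parameters()):
            p_meta += meta_lr * (p_task - p_meta)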

D. Transmission Optimization

Content delivery is performed over the public network and requires a balance between reducing traffic and fully utilizing the bandwidth. There are multiple network-assisted approaches, e.g., CDN, ICN and edge computing, to improve delivery efficiency. CDNs can cache popular/useful content or even support full caching. It is also possible for CDNs to provide space for saving multiple copies of the original content, processed content and some content-related information (e.g., SR models or key frames). As edge computing develops, abundant edge resources can efficiently improve SR-based video delivery architectures, especially with edge computation deployed at the base stations. Additionally, other implementations of edge computing are also feasible, such as exit points (e.g., gateways) connected to network providers, virtual data centers based on public cloud services, and super clients, which can act as edges (i.e., with intelligent functions like SR) by using virtual machines or applications. These transmission infrastructure elements are associated with abundant resources (e.g., computing, storage and bandwidth) and can support diverse transmission-related functions such as caching, transcoding and model training.

Related to SR techniques, for instance, VISCA [18] utilizes edge caching and computing capabilities to support a deployable SR-based video delivery system. It uses the computing resources of the edges to super-resolve some popular content and the storage space to cache super-resolved content or SR models with maximal utility. CloudSeg [185] recovers HR frames at a cloud server with SR processing for video analytics. It offloads computation from a low-capability terminal to a high-performance server, which consumes some traffic to deliver LR content while fully utilizing the available computing power to generate HR content. Supremo [147] offloads a portion of the client tasks to an edge server in a real SR-assisted video delivery scenario and selects important parts of the video at the origin server. This helps reduce the inference overhead by super-resolving only the key video parts, which not only speeds up the inference, but also improves the SR effect.
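
A utility-driven edge caching decision of this kind can be sketched as a simple greedy rule; the utility definition, item names and numbers below are hypothetical and not taken from VISCA or any specific scheme:

from dataclasses import dataclass

# Candidate items (SR models, super-resolved chunks) are ranked by an assumed
# utility (e.g., popularity x expected quality gain) per byte and admitted
# until the edge cache budget is exhausted.

@dataclass
class Item:
    name: str
    size_mb: float
    utility: float            # assumed popularity x quality-gain score

def greedy_cache(items, budget_mb):
    chosen, used = [], 0.0
    for it in sorted(items, key=lambda i: i.utility / i.size_mb, reverse=True):
        if used + it.size_mb <= budget_mb:
            chosen.append(it.name)
            used += it.size_mb
    return chosen

catalog = [Item("sr_model_A", 12, 8.0), Item("chunk_17_hr", 45, 20.0),
           Item("sr_model_B", 20, 6.0), Item("chunk_03_hr", 30, 9.0)]
print(greedy_cache(catalog, budget_mb=60))    # ['sr_model_A', 'chunk_17_hr']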

With the development of network infrastructures and the latest changes in the network paradigm, SR processing can rely on the storage, computing and transmission capacity of the end device-edge-cloud continuum and not on the capability of a single device. Intelligent optimization functions can be deployed by network operators, service providers and even clients who have special demands. The network providers are responsible for improving communications and can also customize transmission-related protocols or functions. The service providers may cooperate with the network providers or implement edge-based optimization themselves with self-managed data centers, hoping to increase the revenue from the high-quality services provided to end users. Clients who prefer a better user experience may choose to use the edge-based optimization functions provided by the service providers. Additionally, solution designers, developers and researchers should also consider combining problems such as edge/function allocation, single-edge optimization and multi-edge cooperation with the SR techniques.

In conclusion, there are diverse transmission-related avenues for improving SR methods, but edge-based optimizations appear to offer the greatest potential benefit to clients as well as to service and network providers.

E. Selective Super-Resolving

To reduce the inference overhead at the client or improve the SR effect for the locally played video, it is efficient to perform SR processing selectively. Selective SR can also be achieved based on pre-processing; existing network paradigms (e.g., CDN, edge computing) can be utilized to save some models or execute processing tasks in advance. For instance, VISCA [18] caches some SR models at the edges based on content popularity. Unlike NAS [86], which super-resolves all the frames, some works only process important parts of the video to ensure the SR effect. SRAVS [87] uses reinforcement learning at the client side to decide whether or not to execute SR processing on the current video chunk. This could be further improved by offloading some decision-making tasks (e.g., whether to cache SR models and which model to prefetch) to the edge when adopting the edge-computing paradigm. NEMO [85] employs a frame-based SR method by selectively super-resolving some video frames and caching frames based on quality rewards. Similarly, LiveSRVC only super-resolves the key frames, which are compressed at the sender side, to save delivery traffic as well as ensure video quality. From a finer-grained perspective, both Supremo [147] and FOCAS [177] make patch-level SR decisions based on edge intensity and user vision, respectively.
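
A simple client-side selective-SR rule could look as follows; this is an illustrative heuristic only (not the SRAVS reinforcement-learning policy), and all thresholds and estimates are hypothetical:

# Super-resolve a chunk only when the estimated inference time fits within
# the playback buffer headroom and the expected quality gain is worth it.

def should_super_resolve(buffer_s, est_sr_time_s, est_quality_gain_db,
                         min_gain_db=1.0, safety_margin_s=0.5):
    fits_deadline = est_sr_time_s + safety_margin_s < buffer_s
    worth_it = est_quality_gain_db >= min_gain_db
    return fits_deadline and worth_it

# With 4 s of buffer, SR estimated at 1.2 s for a ~2.5 dB gain -> apply SR;
# with only 1 s of buffer the same chunk is played without SR.
print(should_super_resolve(4.0, 1.2, 2.5))    # True
print(should_super_resolve(1.0, 1.2, 2.5))    # False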

In conclusion, it is useful to perform local SR processing on certain content in a video streaming system. This not only reduces model/data delivery traffic, but also alleviates the inference load, especially in a poor network scenario or with a low-performance client.

F. Task Assignment

Task assignment refers to a high-level operation, including 1) across-network task offloading (e.g., Supremo [147], CloudSeg [185] and VISCA [18]) and 2) in-device task assignment (e.g., MobiSR [189]). Based on the latest deployments of edge-based delivery and the development of the network infrastructure, some computing tasks can be moved to edge servers or even remote servers. For instance, Supremo [147] utilizes a server to perform the inference, identifying important parts of the video and performing partial super-resolving for the client. A key contribution of Supremo is not finding the key regions of frames or super-resolving them, but the client-server cooperation mechanism. VISCA [18] uses the edge closer to the users to support the SR processing by caching some SR models and super-resolving high-utility content before the last mile. These solutions all depend on the servers' computing capacity, which is reasonable given the rapid development of edge computing. It is also promising for service providers to implement multi-edge cooperative SR, based on model sharing, content sharing or computation sharing for SR tasks.

Although some schemes propose client-side SR, it is doubtful that low-performance user devices can achieve real-time inference. To accelerate the inference speed at the client, in-device assignment of SR tasks can be employed, which refers more to the optimization of hardware utilization. The target is running an appropriate model on a suitable computing unit. Taking the hardware conditions into account, namely properly scheduling the computing units, can also speed up the inference of the SR DNN and enhance the user experience. It is efficient to assign SR tasks to different computing units [189], by first allocating image patches to different SR DNNs and then assigning computing resources to the different inference tasks according to task complexity. There are also other possibilities for accelerating the SR process, such as choosing a suitable edge computing server for offloading computation or designing more scalable SR neural networks that fit into the computation units of mobile devices. It is possible both to adjust/split SR models from a software perspective and to schedule the computation units of the devices from a hardware perspective.
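
A sketch of such in-device assignment is given below, in the spirit of MobiSR but with illustrative heuristics; the difficulty measure, threshold and compute-unit names are hypothetical:

import numpy as np

# Patch "difficulty" is approximated by local variance; easy patches are
# routed to a cheap model/compute unit and hard patches to a heavier one.

def assign_patches(patches, threshold=0.02):
    plan = []
    for i, p in enumerate(patches):                   # p: H x W x 3 in [0, 1]
        difficulty = float(np.var(p))                 # crude texture proxy
        unit = "gpu_heavy_model" if difficulty > threshold else "dsp_light_model"
        plan.append((i, difficulty, unit))
    return plan

patches = [np.random.rand(32, 32, 3) * 0.1,           # smooth-ish patch
           np.random.rand(32, 32, 3)]                 # textured patch
for idx, d, unit in assign_patches(patches):
    print(f"patch {idx}: difficulty={d:.4f} -> {unit}")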

G. Evaluation Metrics

The evaluation metrics for an SR-based video delivery system should be more comprehensive, instead of only focusing on image quality or inference speed, because clients care not only about video definition but also about smoothness and rebuffering. Besides, for an integrated video system, a slight delay may be affordable if there is no rebuffering. SR processing introduces overhead in terms of computation, delivery and storage. Model training and inference are time-consuming processes that may cause playback rebuffering. For live streaming, the low-delay requirement is stricter, indicating that online updating of the SR model should not negatively influence playback smoothness. It is necessary to keep a balance between quality and smoothness when adopting SR, instead of only focusing on improving the SR effect. For a single image, it is reasonable to use traditional metrics to evaluate the SR effect. However, in a video delivery system, the service evaluation should not rely only on image quality. It could also involve the inference speed of SR models, resource consumption for training and inference, video smoothness, important texture details and other parameters related to subjective experience. Considering some special video service scenarios, e.g., 360° and 6DoF, we could also consider human interaction with the surrounding environment, from both objective and subjective perspectives.
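
As an illustration of how these aspects can be combined, a commonly used session-level QoE formulation from the adaptive streaming literature sums per-chunk quality and penalizes rebuffering time and quality switches; the weights are application-dependent, and q(\cdot) could be computed on the super-resolved chunks (e.g., via PSNR, SSIM or VMAF):

QoE = \sum_{k=1}^{K} q(R_k) \; - \; \mu \sum_{k=1}^{K} T_k \; - \; \lambda \sum_{k=1}^{K-1} \left| q(R_{k+1}) - q(R_k) \right|

where R_k is the bitrate/version of chunk k, q(R_k) its perceived quality after enhancement, T_k the rebuffering time incurred while obtaining chunk k, and \mu and \lambda the rebuffering and smoothness penalty weights.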

H. Application-Oriented Solutions

As different applications introduce different limitations, in the context of SR, improving an SR-assisted video system will have different targets for different applications. For any SR function deployed in practice, there is a need to concretely analyze the corresponding service and make the SR compatible with the goals of the delivery system. For example, medical diagnosis has high requirements for image definition and accuracy; however, it is very difficult to upscale images of the internal structure of the human body accurately. A slight image error may lead to serious consequences, so the main challenge of SR methods in a medical imaging context lies in precise reconstruction with low errors and high robustness. For AR/VR/MR games, the major requirements of the SR methods may be fast inference and flexible interaction between wearable devices and users. There are also other SR-assisted applications, like biometrics and remote sensing, that may be very interesting and may require emphasis on different aspects.

I. Generative AI-Based Approaches

Generative AI offers outstanding opportunities for enhancing the video streaming process, both in terms of improving user experience and optimizing resource utilization. These avenues include high-quality content generation, quality enhancement, bandwidth optimization, content personalisation and adaptation, and interactive content generation. High-quality content generation refers to the creation of video and other rich media content at very high quality levels, which can then be tailored to user preferences and network conditions in real time and easily adapted to different deployment scenarios. Quality enhancement is a prime example where SR techniques, for instance based on CNNs, can be used to improve the quality of multimedia content under given bandwidth conditions and with the available device capabilities. Bandwidth optimization refers to the use of AI-driven compression algorithms, such as those based on neural networks, to dynamically adjust compression levels based on network conditions and user preferences; this ensures optimal quality while minimizing bandwidth usage. Reinforcement learning algorithms can learn during the streaming session and be used to adaptively allocate bandwidth resources to diverse streams such that overall network performance and user satisfaction are optimised. Personalization and adaptation refer to adjusting content delivery based on user preferences and viewing habits, and on the network and device operational environment, respectively. Generative AI models can be trained to generate recommendations for personalized and/or adapted multimedia streaming sessions in order to improve user satisfaction and increase their overall quality of experience. Finally, interactive generative models, such as those based on reinforcement learning or sequence generation, can enable users to actively create content, improving the immersiveness of their experience.

SECTION VII.

Conclusion

The recent developments in terms of innovative networked multimedia applications have resulted in an exponential increase in video traffic. Unfortunately, network-related aspects such as congestion and bandwidth fluctuations affect the content delivery and often result in an unsatisfactory viewing experience (e.g., watching low-quality content, suffering from dizziness, etc.). To ensure high user QoE levels, diverse innovative solutions were proposed, including super-resolution (SR), an intelligent quality-enhancement technique recently adopted by remotely delivered networked video services to maintain a good balance between computation and transmission efficiency. As lightweight models work better for live content delivery while heavyweight ones are more likely to ensure better quality, this work discussed the advantages and disadvantages of existing solutions from a cloud-edge-end fusion perspective.

This paper first introduced the video streaming process and some related assessment metrics used for evaluating the effects of intelligent-enhancement-based video delivery. Then, it presented avenues for a streaming enhancement process based on intelligent models, highlighting their pros and cons. These models are especially useful for selecting the most appropriate approach for improved quality streaming. Next, the paper analyzed SR-supported video delivery, as an innovative quality improvement avenue, considering a network-based perspective. It discussed solutions to both optimize transmission efficiency and improve the associated content enhancement effects. Finally, challenges and research opportunities for designing further intelligent solutions to support video streaming were presented and discussed.
