Measuring, Modelling and Integrating Time-varying Video Quality in End-to-End Multimedia Service Delivery: A Review and Open Challenges

The multimedia delivery chain consists of multiple stages, such as content preparation and content delivery via Over-The-Top (OTT) delivery networks and Internet Service Provider networks. Within the multimedia service chain, each stage influences the Quality of Experience (QoE) of the end user. The objective of this work is to provide a comprehensive literature survey, together with future research challenges and opportunities, in the field of time-varying video quality in multimedia service delivery. The contribution of this work is twofold: 1) Survey: we review state-of-the-art video quality models that quantify multiple artifacts into a single QoE metric, pooling strategies for global quality measurements, and Continuous Time-Varying Quality (CTVQ) models; 2) Future Challenges and Directions: we investigate ten major research challenges and future directions, based on the state-of-the-art, for QoE modelling, QoE-aware encoding/decoding and QoE monitoring/management of multimedia streaming in next-generation networks.


I. INTRODUCTION
High-resolution multimedia content and applications are proliferating due to advancements in consumer electronics, communication and compression technologies. It is expected that over 80% of Internet data traffic will consist of video data, and that over 66% of connected TV sets will support 4K [1]. This rapid increase in video content will place very high bandwidth demands on communication networks. To this end, it is envisaged that 10% of the global mobile devices connected to the network will be 5G capable (13.1 billion) by 2023. The enormous growth of multimedia content in communication networks makes it crucial for service providers, content distributors and creators to monitor and measure the Quality of Experience (QoE) for content transmitted over unreliable and time-varying channels. More importantly, QoE modelling, measuring and monitoring should take place within different stages of the multimedia service delivery chain. Thus, an in-depth understanding of the elements in the end-to-end multimedia service delivery chain, the mechanisms for QoE modelling, measuring and monitoring, and the QoE-aware operations at each stage of the chain is important for improving the end-user experience in multimedia applications.
To this end, Fig. 1 shows the End-to-End (E2E) multimedia service delivery chain together with the development of the QoE model and its integration into the multimedia service pipeline. As illustrated in Fig. 1, the E2E multimedia service delivery chain for video streaming consists of multiple stages, including content generation, content compression (encoding), content distribution through media servers, content delivery over the Internet, and content consumption by users through the client-side implementation of the service applications. The development and integration of the QoE model into the service delivery chain comprises three phases: 1) multimedia service delivery to end users; 2) development of the QoE model using subjective assessments; and 3) model integration for QoE-aware multimedia service delivery at different stages of the multimedia service [2]-[5]. To develop a quality (QoE) prediction model for a video streaming service, subjective assessments are performed to collect feedback from multiple users together with Key Performance Indicators (KPI), which are further analyzed to develop a QoE prediction model [6]-[9]. Thereafter, the developed QoE model is integrated into several stages of the multimedia service delivery chain [10]-[13]. Such integration converts a traditional multimedia service delivery chain into a QoE-aware one. As illustrated in Fig. 1, this can lead to 1) QoE-aware client-side content adaptation, 2) QoE-aware content generation, 3) QoE-aware video encoding and 4) decoding, and 5) QoE-aware network service monitoring and management [14].

A. BACKGROUND AND MOTIVATION
Adaptive streaming has become the norm (as opposed to progressive streaming with a constant bit rate) when delivering multimedia content to end users through unreliable and time-varying channels. This is typically achieved through the use of 1) scalable video streams or 2) bitstreams with multiple bit rates coded into layers of content, such that they are adapted according to the network bandwidth available to the end user. Scalable video streams [15], [16] are generally used with Media Aware Network Elements (MANE) [17] that operate on the server side (or in edge network nodes). MANEs are capable of adjusting the quality of the video stream by removing or compounding different layers in the bitstream depending on the available network bandwidth [18]. Hypertext Transfer Protocol (HTTP) Adaptive Streaming (HAS), on the other hand, operates on the client side [19] and adapts the bit rate of the content dynamically to suit the network capacity at a given time [19]. HAS is now supported by major vendors such as Apple [20], Adobe [21], and Microsoft [22] with their proprietary streaming protocols. Furthermore, the Dynamic Adaptive Streaming over HTTP (DASH) standard introduced by ISO MPEG [19] has made adaptive streaming ubiquitous in the OTT streaming market. In addition, the recent developments in Scalable High Efficiency Video Coding (SHVC) have promoted the use of Scalable Video Coding (SVC) together with HTTP adaptive streaming [23]-[25] to provide further adaptation capabilities.
Even though the existing HAS solutions are promising, maintaining a consistently high end-user QoE is a compelling challenge. For example, the frequency of dynamic rate adaptation during media playback has a significant impact on the user experience. Furthermore, associative memory conditions of end users, such as primacy, recency, and hysteresis [26], should be carefully considered when designing effective adaptation algorithms [27]. For example, a poor-quality video segment will affect the user's perception of the overall video quality for a certain duration afterwards [28]. In addition, the work in [29] discusses prospect theory, which states that a quality loss carries more weight than a gain in terms of human visual perception. Awareness of these behaviors of the Human Visual System (HVS) is important when designing efficient rate adaptation algorithms [30], [31].
One of the important elements that enable effective rate adaptation is the ability to accurately predict the time-varying user's QoE. Such prediction algorithms should operate in real-time, account for primacy / recency properties as well as the nonlinear behavior of the HVS [32]. Another factor that impacts end user QoE is the underlying video encoding and decoding algorithms. The modern video coding standards such as High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) demonstrate significant coding efficiency improvements over their predecessors. However, the complexity of the algorithms demands excessive use of computational and energy resources at both ends of the multimedia service delivery chain. Hence, certain optimization strategies are carried out both at the encoders and decoders to reduce the energy/compute resource requirements while keeping the user QoE intact. In addition, QoE-aware video compression, bit allocation and rate control, QoE-aware error resilience and concealment, and QoE monitoring and management in communication networks contribute immensely towards maintaining a high end-user QoE in multimedia applications, yet at the same time present compelling research and engineering challenges.
In this context, engineering solutions for measuring, modelling and integrating time-varying video quality in E2E multimedia delivery require an in-depth, holistic understanding of the key parameters that operate in all stages of the multimedia service delivery chain illustrated in Fig. 1.

B. RELATED WORK
To this end, a number of recent surveys and reviews have analyzed the state-of-the-art as well as the challenges and opportunities for QoE measuring, modelling and integration. However, these surveys typically overlook the time-varying aspects of QoE and other stages in the E2E multimedia service delivery chain. These related surveys, their focus areas, and the properties that they overlook compared to this review are summarised in Table 1. More importantly, as illustrated in Table 1, there is a gap in surveys and reviews that discuss the challenges of QoE management in 5G/6G networks and the impact on QoE of the complexity and energy consumption of the encoding and decoding algorithms that play a significant role in the multimedia service delivery chain. In addition, application-layer techniques such as QoE-aware video encoding and QoE monitoring in next-generation communication networks have been overlooked in recent surveys and tutorials that discuss QoE modelling.

C. SCOPE AND CONTRIBUTION
Therefore, the objective of this work is to equip the reader with a comprehensive survey of state-of-the-art works and of future directions in the domain of time-varying video quality in the E2E multimedia service delivery chain. The contribution of the work consists of the following two parts: 1) Survey: We present a detailed review of existing quality models for quantifying different artifacts into a single QoE metric, pooling strategies employed to obtain global quality measures, and CTVQ models. We also review the standardization efforts related to time-varying quality estimation.
2) Future challenges and directions: Based on state-of-the-art works, we provide ten major future challenges and research directions in the following categories: • QoE modelling - The challenges and opportunities for time-varying video quality measurement, modelling and prediction are investigated including 1)

D. PAPER STRUCTURE
This article is organized as follows: Section II discusses the primacy and recency effects experienced by users during viewing. The global quality models based on different pooling strategies are discussed in Section III. This section also elaborates on different temporal, spatial, and hybrid pooling strategies. Section IV presents the CTVQ models. A generic CTVQ model is introduced in Section V. This section also elaborates on the existing challenges in designing CTVQ models and on standardization efforts. Sections VI and VII illustrate, respectively, the challenges in QoE-aware video encoding strategies and encoder/decoder optimization, and in QoE-aware network and service monitoring and management in 5G/6G networks. Fig. 2 presents an overview of the organization and structure of this paper, and a list of abbreviations used in this article is provided in Table 2.

II. EFFECTS OF SERIAL POSITION OF ARTEFACTS/QUALITY CHANGES
In order to model the QoE for a particular application, user experience over time has to be considered. The effect of memory in assessing the impact of service interruptions, delay and jitter on QoE has been studied, for instance, in [39]. In general, it has been observed that the same quality value for a frame or portion of a video sequence has a different impact on the global perception of quality depending on the temporal position of that frame or portion within the video. This is due to features of the HVS, including recency, primacy, and its asymmetric response to quality variations. These major features of the HVS ([26], [39], [40]) are highlighted in Fig. 3, and the rest of the section discusses each of them in detail.

A. RECENCY, PRIMACY AND HYSTERESIS EFFECTS
Recency is a characteristic of human subjects, and exploring it can reveal the true nature of human perception and behavior. According to the serial position effect, a viewer tends to remember the first and last items of a series. Previous research [26] suggests that recall accuracy varies as a function of time. The ability to recall experiences that occurred recently is called the "recency effect", whereas the ability to recall information from the very beginning of the content is called the "primacy effect". These effects are said to be caused by the storage of serial information in long-term memory (primacy effect) and working memory (recency effect). The influence of primacy and recency effects on identifying and memorizing certain stimuli is studied in the Short Term Memory (STM) research presented in [41]. Studies in the specific context of quality degradations and improvements are reported in [42], [43] for audio and in [28], [40] for video. For instance, the study described in [40] observes that global quality ratings are lower when the worst-quality video segment occurs at the end rather than at the beginning of a 30 s video sequence. Furthermore, the authors claimed that the recency effect was eliminated when subjects were asked to continuously evaluate picture quality. The duration of the impairment was found to have little impact on quality ratings. A regression analysis carried out in this study also found that quality ratings are best predicted by the peak impairment intensity. These findings have been integrated into QoE prediction models which mimic the exponential decay or rise of user QoE over time (e.g., [44], [45]). An audio quality model based on recency, ITU-R BS.1387, has been standardized [46]. Not all studies have shown a significant effect of recency. For instance, the time-varying speech quality study described in [43] states that no significant effect is observed during quality switching under the considered experimental conditions. The authors therefore concluded that the fundamental human integration operation is not confounded by speech quality history [43]. Even though these claims depend on the setup and the environment of the tests, similar studies are necessary to clearly understand the true effects of recency for time-varying quality adaptations.
The memory of poor-quality elements in the past causes subjects to give lower quality scores immediately afterward, even after the time-varying video quality returns to acceptable levels. This has been denoted as the "hysteresis" effect [32].

B. FREQUENCY OF QUALITY SWITCHING
Due to the increase in multimedia delivery over time-varying channels, audio-visual QoE fluctuations over time have become one of the main focal points of recent research and development in this field. Recent studies have found that temporal fluctuations of media quality on a time scale from 15 s up to several minutes are governed by short-term or working memory [42], [47]. The memory effect is linked to the adaptation frequency of the multimedia service. If the service is adapted frequently (e.g., adaptive HTTP live streaming allows video segments as short as one second), the effect could be much worse than when the service degradation/improvement frequency is significantly lower. For instance, if the quality of a media stream changes rapidly, users do not have enough time to settle into a particular viewing condition, which could lead to discomfort due to the high cognitive load associated with switching. In the case of slow media adaptations, the HVS has a considerable amount of time to adjust to the current viewing experience. Therefore, it is important to study the memory effect in both fast and slow quality rise/fall situations. Studies in the literature have found optimum perceptual bounds on the frequency of adaptation for certain applications. For example, perceptual bounds on segment sizes for adaptive HTTP streaming are studied in [31], [48].

C. ASYMMETRIC RESPONSE OF HVS TO QUALITY VARIATIONS
The asymmetric nature of users' responses to quality improvements and degradations is explored and discussed as a mechanism to model the HVS response to time-varying video quality. For instance, users respond more adversely to a drop in quality than they respond favourably to a quality improvement of a similar degree. This phenomenon is described by the prospect theory introduced by Daniel Kahneman and Amos Tversky in 1979 [29]. Fig. 4 illustrates the dynamics of user response according to the prospect theory. This concept is applied in [30], [31] and [49] to model the time-varying quality of rate-adaptive video. The detailed subjective studies conducted in these works show that the resulting predictions correlate highly with Mean Opinion Scores (MOS) for rate-adaptive video applications.
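The asymmetry described above can be illustrated with a prospect-theory-style value function. The following is a minimal sketch only: the function name `perceived_change` is hypothetical, and the `loss_aversion` and `curvature` values are illustrative figures borrowed from the decision-making literature, not parameters calibrated for video quality or taken from [30], [31] or [49].

```python
def perceived_change(delta, loss_aversion=2.25, curvature=0.88):
    """Map a quality change (e.g., in MOS units) to a perceived change.

    Gains are compressed by a power law. Losses are compressed the same way
    and then amplified by `loss_aversion`, so a drop in quality weighs more
    than an improvement of the same size. Both parameters are assumptions.
    """
    if delta >= 0:
        return delta ** curvature
    return -loss_aversion * ((-delta) ** curvature)


gain = perceived_change(+1.0)  # perceived effect of a one-unit improvement
loss = perceived_change(-1.0)  # perceived effect of a one-unit drop
print(gain, loss)
```

With these illustrative parameters, a one-unit drop is perceived as more than twice as strong as a one-unit improvement.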

D. PERCEPTUAL SATURATION EFFECT
Perceptual saturation effects also affect time-varying quality. For instance, quality improvements or degradations beyond a certain threshold will not be distinguished by end users (due to the masking effect at these extreme quality levels). As illustrated in Fig. 4, beyond a certain threshold users are not able to differentiate quality changes, and their QoE saturates. Therefore, these effects can be taken into account when designing quality models. This also makes it possible to optimize the usage of system resources: if the user QoE has passed the saturation level, allocating more resources will not improve it any further. The model described in [50] applies a predetermined threshold to account for these perceptual saturation effects.

E. SUMMARY
As discussed in this section, it is observed that a number of factors (i.e., hysteresis, the asymmetric response of our HVS, perceptual saturation, etc.) are influencing our HVS when perceiving time-varying quality video. It is paramount to understand these effects and emulate them in CTVQ models. It will be challenging to capture all these perceptual aspects of HVS within a single model. As an alternative, further research can be conducted to evaluate and prioritize these effects, so that new quality models can at least integrate more prominent features.

III. GLOBAL OBJECTIVE QUALITY MODELS BASED ON POOLING
The overall experience or QoE of users is affected by several factors which include capturing artifacts (e.g., lens distortions), compression artifacts, transmission and decoding artifacts. Some of these distortions are spatial artifacts and some are temporal artifacts. The integration of these artifacts into a single quality model is challenging due to their distinguishable and independent characteristics and to how they affect our HVS. Several quality metrics for objective quality assessment exist, such as Mean Square Error . . .

[Figure: temporal pooling of single-frame (M x N) quality measures over time instants t_{n-1}, t_n]

capture temporal distortions in the video, hence the need for sophisticated temporal pooling methods. For example, the SSIM metric developed for images takes into account image degradation, luminance masking and contrast masking. In the MOVIE index, a family of bandpass Gabor filters is used to filter both the reference and distorted videos and the output is then used to measure spatial quality degradation. The output of the spatio-temporal Gabor filter family is then used to calculate the spatial MOVIE index, which primarily captures spatial distortions such as blur, ringing, etc. The temporal MOVIE index then captures temporal distortions (e.g., motion compensation mismatch) in the video by tracking video quality along the motion trajectories of the reference video. The spatial and temporal indices are then pooled to obtain a final MOVIE index representing the visual quality of the entire video. Some video quality assessment metrics such as MOVIE correlate quite well with human subjective judgment by evaluating video quality not only in space and time separately but also spatio-temporally. Table 3 provides a summary of the objective quality assessment metrics proposed in the literature, and the reader is referred to [51] for a more detailed discussion of the basics of popular terms, metrics, and other related literature.
Pooling can be used to predict the subjective quality of a video from objective metrics by combining the separate effects of spatial and temporal artifacts into a global quality score. Hence, a proper choice of the pooling method is crucial for improving the prediction capability of a video quality metric. Pooling can be performed spatially, temporally, or in a combined fashion (known as spatio-temporal pooling). Spatial pooling computes the space-varying quality parameters of a video sequence at a single time instant and pools them to obtain a single quality index for that particular time instant. Most approaches perform spatial pooling at the frame level. Temporal pooling is then used to combine these periodic measures over time to obtain a final measurement for the whole video sequence. Fig. 5 illustrates a general temporal pooling algorithm that combines periodic quality measures obtained using spatial pooling.

A. SPATIAL POOLING
Common spatial pooling methods include simple spatial averaging, Minkowski pooling, Local Quality/Distortion-Weighted Pooling, Information Content-Weighted Pooling [64], Content Adaptive Spatial Pooling [65] and the Attention Model (detection of the attention regions in every video frame and averaging over the distortion map in the attention regions only [66]). Spatial pooling methods usually consist of two stages. In the first stage, image quality is evaluated within local regions (resulting in a quality/distortion map). In the second stage, the spatial pooling algorithm combines the quality/distortion map into a single quality score. For example, Minkowski pooling for a given quality/distortion map can be defined as

$$ M = \frac{1}{K} \sum_{i=1}^{K} m_i^{p} $$

where $m_i$ is the quality/distortion value at the $i$-th spatial location, $K$ is the number of samples and $p$ is the Minkowski power. As $p$ increases, more emphasis is put on image regions with high distortions. A suitable value of $p$ should provide a good approximation of human quality perception. Details on other spatial pooling techniques can be found in [64]-[66].
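The second stage of spatial pooling can be sketched in a few lines. The example below assumes the common form of Minkowski pooling, M = (1/K) Σ m_i^p, applied to a non-negative distortion map stored as a list of rows; the function name `minkowski_pool` and the toy maps are hypothetical and serve only to show how a larger p emphasises concentrated distortions.

```python
def minkowski_pool(distortion_map, p=2.0):
    """Pool a 2-D distortion map into one score: M = (1/K) * sum(m_i ** p).

    Assumes every entry is non-negative (a distortion magnitude), so that
    fractional powers are well defined.
    """
    values = [m for row in distortion_map for m in row]  # flatten the map
    return sum(m ** p for m in values) / len(values)


uniform = [[1, 1], [1, 1]]       # mild distortion spread over the frame
concentrated = [[0, 0], [0, 4]]  # same total distortion in one location

# With p = 1 both maps pool to the same score (plain averaging).
print(minkowski_pool(uniform, p=1), minkowski_pool(concentrated, p=1))
# With p = 2 the concentrated distortion dominates the pooled score.
print(minkowski_pool(uniform, p=2), minkowski_pool(concentrated, p=2))
```

Setting p = 1 reduces the method to simple spatial averaging, which is why the two toy maps only separate once p exceeds 1.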

B. TEMPORAL POOLING
A review of early temporal pooling techniques is presented in [67]. The authors compare the performance of six basic pooling methods applied to five different objective quality metrics using the Pearson Linear Correlation Coefficient (PLCC). Out of the six temporal pooling methods tested in the study (histogram, Minkowski summation, exponentially-weighted Minkowski summation, mean value across a sequence, mean value of scores in the last F frames and local maximum or minimum of mean values of scores in L successive frames), the exponentially-weighted Minkowski summation and the mean value of the last F frames are found to have the best correlation with the subjective ratings. It is observed that the best performing pooling methods are those taking into account the recency effect and the influence of the worst quality section.
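Three of the basic temporal pooling methods named above can be sketched as follows. This is an illustration under assumptions: the exact definitions used in [67] may differ (for instance in normalisation), and the function names and the sample scores are hypothetical.

```python
def mean_pool(scores):
    """Mean value of the per-frame scores across the whole sequence."""
    return sum(scores) / len(scores)


def minkowski_summation(scores, p=2.0):
    """Minkowski summation over time, normalised so it stays on the score scale.

    Assumes non-negative scores; the 1/p root is one common convention.
    """
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


def mean_of_last_frames(scores, f):
    """Mean of the scores in the last `f` frames (a simple recency model)."""
    if f < 1:
        raise ValueError("f must be at least 1")
    tail = scores[-f:]
    return sum(tail) / len(tail)


# Hypothetical per-frame quality scores: good quality, then a drop at the end.
scores = [4.0, 4.0, 4.0, 4.0, 2.0, 2.0]
print(mean_pool(scores))               # whole-sequence average
print(mean_of_last_frames(scores, 2))  # dominated by the final drop
```

Pooling only the last frames lets a late quality drop dominate the global score, which is the behaviour the recency effect predicts.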
All works mentioned so far evaluate the final video quality by computing local quality information and then combining it temporally, trying to match subjective human scores. In [32], the authors propose a hysteresis-based temporal pooling strategy for Quality Assessment (QA) algorithms. Their model uses the average of frame-level quality scores obtained from objective QA algorithms while also taking into account the memory effect of the users (by modelling the quality scores over a certain time duration) as well as the fact that users respond sharply to drops in quality (by sorting the quality scores in ascending order and combining them using a Gaussian weighting function). Some of the latest work on temporal pooling based quality prediction can be found in [68], [69], [70], and [71].
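A hysteresis-style temporal pooling can be sketched as below. This is only loosely modelled on the description of [32] given above and is not the authors' implementation: the window length (in frames), the mixing weight `alpha`, the Gaussian width and the function name `hysteresis_pool` are all assumptions made for illustration.

```python
import math


def hysteresis_pool(scores, window=5, alpha=0.8):
    """Pool frame-level quality scores with a simple hysteresis model.

    For each frame, a memory term (worst score in the previous `window`
    frames) is mixed with a current term (upcoming scores sorted in
    ascending order and weighted by the falling half of a Gaussian, so
    the lowest scores count most). The per-frame results are averaged.
    """
    n = len(scores)
    sigma = window / 3.0  # assumed Gaussian width
    pooled = []
    for i in range(n):
        past = scores[max(0, i - window):i]
        memory = min(past) if past else scores[i]
        upcoming = sorted(scores[i:i + window])  # ascending: worst first
        weights = [math.exp(-(j ** 2) / (2.0 * sigma ** 2))
                   for j in range(len(upcoming))]
        current = sum(w * v for w, v in zip(weights, upcoming)) / sum(weights)
        pooled.append(alpha * current + (1.0 - alpha) * memory)
    return sum(pooled) / n


# Hypothetical trace: steady quality with a short drop in the middle.
trace = [4.0] * 10 + [1.0] * 2 + [4.0] * 10
print(hysteresis_pool(trace), sum(trace) / len(trace))
```

For this trace the pooled score falls below the plain average, because the short drop lowers the per-frame values of its neighbouring frames as well.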

C. SPATIO-TEMPORAL POOLING
The spatio-temporal approaches allow us to combine both local spatial and temporal image/video features into a single quality metric. The work presented in [78] uses eleven Image Quality Metrics (IQM) for quality assessment on lossy video sequences using temporal pooling methods such as Minkowski summation with different exponents and averaging over distorted frames. Regardless of the IQMs and exponents used, the latter is found to be better than the former. The authors further evaluate different spatial pooling methods such as Minkowski summation over all pixels, averaging over distorted regions and averaging over attention region using average over all the frames as the temporal pooling method [79]. Furthermore, they evaluate a spatio-temporal scheme using averaging over different distorted spatial regions and frames. Using temporal pooling, spatial pooling and spatiotemporal pooling methods, it is observed that users are more sensitive to distorted spatial regions and temporal segments.
In [76], the authors propose a video quality assessment model which determines the overall quality of a distorted video as a weighted average of global quality and local quality. The global quality is calculated using an IQM and direct spatio-temporal averaging, while the local quality takes into account visual attention and the frequency of quality variations over video frames. The three temporal pooling methods evaluated in this work are Minkowski summation with exponent 2, the direct average of quality values over all frames, and a newly proposed temporal pooling function obtained by filtering a function with a Gaussian filter several times (typically 8). The proposed scheme is shown to be better than the other two for all four IQMs that were evaluated (PSNR, SSIM, multi-scale SSIM [80] and PSNR-HVS-M [81]). The model proposed in [77] first calculates the fidelity scores for the video sequence and then pools them into a representative quality score using perceptual motion models. The fidelity measurements quantify the similarity of pixel values between two images. Pooling is performed at the frame level and the sequence level using user-defined perceptual weights. The weights are calculated based on the frame type (moving or stationary). The proposed strategy is reported to outperform other VQA algorithms except for MOVIE.
The authors of [65] propose a new content-adaptive pooling strategy called Video Quality Pooling (VQPooling) based on the distribution of local spatio-temporal quality scores from objective VQA algorithms and effects on the perception of large motion, cohesive motion fields such as egomotion (presence of optical flow or motion fields induced by camera motion). Taking into account the slope of the SSIM curve sorted in rank order (steep increasing sections of the curve indicate severely degraded quality section) and presence (or absence) of egomotion, a frame-level quality index is computed in a content adaptive manner. Temporal pooling is then performed by performing low and high quality classification using the K-means clustering method. The scores are then used to calculate the global quality using a weighting function (ratio of low and high quality scores).
The main drawbacks common to the schemes discussed so far are:
• The test video sequences are of very short duration (usually 10-15 s).
• Only the classical approach to video streaming (no quality adaptation over time) is taken into consideration.

Table footnotes: ... [72]; the EPFL-PoliMI database consists of 12 original sequences of 10 s each and 156 distorted sequences [73]; the set from http://trace.eas.asu.edu/yuv/ consists of 4 original video sequences of 12 s each, and 64 progressive CIF video sequences were used in the evaluation [74]. c) Year 2009 indicates the year of study and not necessarily the date the techniques were proposed.
All the mentioned temporal pooling schemes so far were evaluated using classical streaming (fixed-bit rate applications) and for short duration videos (typically less than 15s). With the advancement of streaming technologies such as HAS, it is necessary to validate the pooling methods and possibly develop new ones for these latest technologies. The work in [82] evaluates and compares the performance of different temporal pooling mechanisms to validate their application for the recent HAS protocol while also considering long duration video sequences (100s). Based on the results, the authors advocate the use of the simple mean of objective metrics over complicated pooling mechanisms given that the simple mean performs quite well for longer duration sequences. Authors in [83] performed an evaluation of eight different temporal pooling strategies for various objective Video Quality Assessment (VQA) metrics for gaming video streaming applications. Similar to [82], they also observed that no temporal pooling strategy provided a considerable gain over the simple averaging across different VQA metrics. Similar results are also reported by Netflix in [84] where their results suggest that simple arithmetic mean is the best method of averaging per-frame quality scores resulting in a high correlation with the subjective scores. They also observe that Harmonic Mean (HM) produces similar results as simple arithmetic mean but helps to emphasize the impact of small values in the presence of outliers.
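The difference between the arithmetic and harmonic means of per-frame scores can be seen in a small example. The sketch below uses the textbook definitions of both means and hypothetical scores on a 0-100 scale; it is not the implementation used in [84].

```python
def arithmetic_mean(scores):
    """Simple average of per-frame quality scores."""
    return sum(scores) / len(scores)


def harmonic_mean(scores):
    """Harmonic mean of per-frame quality scores (all scores must be > 0)."""
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical per-frame scores: three good frames and one poor outlier.
frame_scores = [90.0, 90.0, 90.0, 10.0]
print(arithmetic_mean(frame_scores))  # 70.0
print(harmonic_mean(frame_scores))    # about 30, pulled toward the poor frame
```

When all frames have similar scores the two means nearly coincide; they diverge only when a few low-quality frames are present, which is where the harmonic mean emphasises the small values.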

D. SUMMARY
This section first describes the state-of-the-art objective image and video quality assessment metrics, which are summarised in Table 3. We also highlight the need for global quality models based on pooling. In this regard, state-of-the-art methods based on spatial, temporal or spatio-temporal pooling approaches, which yield a global quality value for the whole video segment, are discussed. The strategies discussed above are mainly tested on the LIVE and EPFL-PoliMI databases. Thus, an interesting direction for future work is to test the performance of all pooling methods for different objective metrics, for both classical and modern streaming technologies, using a single database with various distortions and content types. However, the reported methods (which are summarized in Table 4) are not capable of predicting instantaneous quality values (i.e., CTVQ). In this context, Sec. IV provides an overview of the need for and use of Continuous Time-Varying Quality models and their state-of-the-art.

IV. CONTINUOUS TIME-VARYING QUALITY MODELS
This section first discusses the importance and difficulty of modelling continuous QoE in a time-varying image/video streaming application. This allows us to not only anticipate the video's global quality but also to derive QoE parameters at various time intervals. The presence of such models aids us in making image/video stream adaptation decisions. In adaptive HTTP streaming, for example, a choice is made on which segment should be requested for delivery next. Because of memory effects, it's best to pick the highest-VOLUME 4, 2021 quality segment for the next slot. If we can model the timevarying quality with the hysteresis effect and other nonlinearities, we can answer this question. Since the inception of video transmission through unreliable communication channels such as wireless channels, the research of timevarying quality has been pursued [28]. The early research was on forecasting overall quality based on quality measurements taken at various points in time. In this scenario, the total quality is predicted using a temporal pooling technique, as explained in Section III [32] [85]. In [32], for example, a temporal pooling method is used to transfer instantaneous objective quality measures to overall video quality using a model that takes into consideration the HVS's recency effect. Even though several of these methods anticipate instantaneous video quality as an intermediate step in calculating overall video quality, the results have not been confirmed against recorded subjective Time-Varying Subjective Quality (TVSQ).
The second generation of continuous quality prediction models can anticipate the rate-adaptive video's instantaneous quality. At the moment, few metrics can model ongoing user responses. The studies in [86]- [89] describe models that incorporate complex low-level vision models. These models are intended to produce a single quality metric for a video stream. Although these algorithms may give frame-byframe estimates, their temporal summation approaches are not intended to imitate a human subject's continuous quality estimation process.However, the time-varying QoE Indexer proposed by [90] manages to capture interactions between stalling events and then analyze the spatial and temporal content of a video to predicts the perceptual video quality. It has also been identified that QoE of an IP transmission can change according to the usage situations. The work presented in [91] suggest that priors related to user attributes can be used to trade-off spatial and temporal quality of an IP stream. Other ways are discussed in [30] and [92]. The method described in [30] combines an objective quality evaluation model at the frame level with a cognitive emulator to account for human viewers' slow temporal responses to image quality changes as well as asymmetric behaviour when picture quality changes from bad to good and vice versa. The distortion masking effect, perceptual saturation effect, and our HVS liking for poor quality experiences over good quality experiences are all modelled in the suggested cognitive model [29]. These methods, however, have drawbacks. Some of these measurements, for example, are designed for slowchanging videos (e.g., video segments lasting 30s or 40s) and so are not ideal for frequent rate modifications [30], [93]. For example, existing HTTP streaming portions could alter in as little as 1-second [94]. Some of the proposed measures are only applicable to low-bitrate videos and do not account for a wide variety of bitrates [85], [92], [95].
Prospective continuous time-varying quality predictors, on the other hand, must be simple to deploy in real time for "on the fly" quality prediction. If the algorithm is too complicated, it will not be able to respond as quickly as needed to adjust the bitrates [93]. As a result, new methods for TVSQ must be more responsive and comprise basic computations that can be executed on the fly. To consider the hysteresis effect of the HVS, the methodologies available in the literature use a variety of approaches. To predict time-varying quality, most techniques use Infinite Impulse Response (IIR) filters. They can often forecast quality, but the complexity of the procedures prevents them from being used in real-time rate adaptation approaches in new technologies like HTTP based streaming. For example, [50] proposes an approach that includes an IIR filter as well as two non-linear filters, one before and one after the IIR filter. The two non-linear filters at the input and output stages account for the perceptual saturation effects and the non-linear response of the HVS. If the rate of adaptation is high, this may make real-time prediction of TVSQ difficult. Hewage and Maria provide a time-varying quality metric based on a moving average filter that accounts for the recency and primacy effects of the HVS in [97]. In comparison to the state of the art, the results of this method indicate good performance. In comparison to other approaches proposed in the literature, the method's simple computations and accuracy allow it to be employed in real-time HTTP based streaming applications. Using a sliding window method, the Cumulative Quality Model (CQM) proposed in [98] predicts the cumulative quality of streaming sessions based on the last window quality, the average window quality, the minimum window quality, and the maximum window quality. Table 5 summarizes the continuous time-varying quality models reported in the literature to date.
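The sliding-window idea behind the CQM can be sketched in a few lines. The sketch below only illustrates how the four window statistics named above (last, average, minimum and maximum window quality) could be extracted and combined; defining window quality as the mean score in a window and using equal `weights` are assumptions, not the fitted model of [98].

```python
def window_qualities(scores, window):
    """Mean quality of every sliding window of `window` consecutive scores."""
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def cumulative_quality(scores, window=3, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine last/average/min/max window quality into one score.

    `weights` are hypothetical placeholders; a deployed model would
    learn them from subjective data.
    """
    wq = window_qualities(scores, window)
    features = (wq[-1], sum(wq) / len(wq), min(wq), max(wq))
    return sum(w * f for w, f in zip(weights, features))


print(cumulative_quality([5, 5, 5, 1, 1, 1], window=3))
```

Because every feature is a running statistic over a short window, the cost per new segment is small, which is the property that makes such models attractive for on-the-fly prediction.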
The basic building blocks used by the proposed CTVQ models in the literature mimic the hysteresis effect and other HVS responses in addition to the main approach in chronological order. It can be observed that most of the proposed models rely on a dynamic system model and an objective quality model to predict the time-varying quality. This leads the authors to present a generic representation for continuous time-varying quality measurement, which is discussed in Section V.

V. GENERIC MODEL FOR CONTINUOUS TIME-VARYING QOE MODEL, CHALLENGES AND STANDARDIZATION EFFORTS
The analysis of continuous time-varying quality models (in Sec. IV), and pooling-based global objective quality modelling (Sec. III) enabled authors to identify key building blocks for such a model. Based on this, we advocate that a generic CTVQ model should account for spatial/temporal artifacts, perceptual saturation/smoothing and hysteresis effects. Fig. 6 illustrates the proposed conceptual quality model for predicting CTVQ. This model reflects the main concepts (building blocks) employed in the relevant literature discussed in previous sections.
The first module of the model in Fig. 6 quantifies the short term quality (i.e., instantaneous video quality) based on both spatial and temporal distortion measures. The objective of this module could be achieved by employing an accurate image/video quality metric that accounts for both spatial and temporal distortions. This module measures instantaneous video quality in given time intervals. The second module is responsible for the perceptual saturation of the short term quality ratings [50]. The HVS is not able to recognize quality gains or drops beyond a certain threshold (as discussed in Section II.D). Therefore, this module performs distortion averaging and thresholding to account for perceptual saturation. Finally, the dynamic system module in the proposed model is responsible for mimicking the hysteresis or memory effect of the HVS. In most cases, the dynamic system model could be a Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filter that performs the temporal decomposition. The pros and cons of using FIR and IIR filters to model hysteresis effects of time-varying video are discussed in [95]. The selection of a particular type of filter may depend on other operational parameters such as delay and complexity. To obtain true user perception under different applications and parameter settings, individual parameters of these components may need fine-tuning.
[Fig. 6 block labels: short term quality rating; distortion masking and perceptual saturation; dynamic system model.]
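The three building blocks of the conceptual model can be chained as in the following minimal sketch. The short term scores are assumed to come from any objective metric; the clipping thresholds `low` and `high` and the first-order IIR coefficient `memory` are hypothetical values used only to show the data flow, and would need tuning against subjective data.

```python
def saturate(q, low=1.5, high=4.5):
    """Perceptual saturation: changes beyond the thresholds are not perceived."""
    return min(max(q, low), high)


def predict_ctvq(short_term, memory=0.8, low=1.5, high=4.5):
    """Short term quality -> saturation -> first-order IIR (memory effect).

    y[t] = memory * y[t-1] + (1 - memory) * x[t]
    """
    out = []
    state = None
    for q in short_term:
        x = saturate(q, low, high)
        # The first sample initialises the filter state.
        state = x if state is None else memory * state + (1 - memory) * x
        out.append(state)
    return out


# A sudden quality drop is reflected only gradually in the prediction.
print(predict_ctvq([4, 4, 4, 2, 2, 2]))
```

A larger `memory` makes the predicted quality react more slowly to changes in the instantaneous score, which is one simple way to express hysteresis.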

A. CTVQ MODELLING CHALLENGES
Over the past few years, quite a few models have been proposed for HAS based applications that try to predict the continuous time-varying quality and, in most cases, also the overall QoE score for the whole media session. A review of HAS models provided by the authors in [106] highlights such models. The review also highlights various open-source datasets that capture such time-varying subjective quality. In addition, it is not always possible to obtain short term or long term quality by comparing the processed video with the original content. Especially in real-time applications, the gateway which processes the adaptation may not have access to the original content. In these scenarios, other approaches such as Reduced Reference (RR) or No-Reference (NR) metrics based on scene characteristics and network characteristics could be used [93], [107]. For instance, the method described in [108] tries to predict quality based on the bitrate. The recently proposed ITU-T Rec. P.1203 also predicts the quality scores on a per-second basis, but its performance has not been evaluated on a continuous time scale. An interesting direction for future work would be the performance evaluation of such time-varying quality estimation models on a comprehensive dataset.
To build these databases and establish ground truth data, there should be good quality evaluation tools to capture the input of human observers at each time interval. The existing quality evaluation tools for conventional video are neither user friendly nor of sufficient fidelity to record user inputs at a fast rate. A multimodal interface to obtain 2D/3D quality inputs from the user over time is described in [109]. This tool utilizes external stimuli such as vibration, flickering and sound to improve human concentration during 2D/3D QoE evaluation. The EU FP7 CONCERTO project proposed a comprehensive CTVQ evaluation application interface for mobile phones and tablet computers [110]. This tool can be used to evaluate the quality of stored video as well as of real-time video streaming applications over a wired/wireless network. A snapshot of this application interface is shown in Figure 7. The application enables control of video playout and the collection of quality scores (from 1 to 5) every second via a smartphone/tablet. This is done via a slider positioned horizontally on the device's display. A plot of the provided scores over time is visualised above it on the display. A psychology-based subjective interface is provided by Bampis et al. in [111], which can generate and record visual stimuli with high precision for the collection of continuous, per-frame subjective data. Further research and development should be undertaken to design similarly effective CTVQ measurement tools to duly capture human visual response in future efforts.

Real-time CTVQ prediction
Even though a number of dynamic system models have been proposed in the literature to predict CTVQ, the application of these models in the end-to-end multimedia chain is limited. Deployment and operational issues (time complexity and real-time resource requirements) are impeding the integration. Furthermore, with regard to learning based QoE modelling, training and execution times have been identified as the main challenges. Therefore, further research and development should be carried out to integrate the proposed dynamic system models in real-time applications to maximize end user QoE.

Datasets and tools
Open and comprehensive time-varying video quality databases available to the research community are limited. Only a limited number of public datasets contain instantaneous subjective image quality measurements. The majority contain overall image quality for the whole video (summative image quality). This is a challenge given the different types of video content (e.g. SD, HD, 4K, UHD, 3D, etc.). Therefore, further efforts should be made to create good video quality databases with instantaneous subjective image quality measurements or QoE as well as the overall QoE for the whole video. Furthermore, the tools and applications for conducting, measuring, visualizing, analysing and storing subjective quality ratings at different time intervals during subjective quality measurement experiments are rare, as discussed in this section. This is a challenge due to the numerous factors that need to be measured over time to understand the QoE as perceived by the end users. This needs to be overcome in order to facilitate efficient subjective quality experiments in the CTVQ domain.

Standardization efforts
Standardization of technologies related to adaptive HTTP streaming is required to provide conformity among different vendors. Further efforts should be made in standardizing subjective quality evaluation of time-varying video applications. This will facilitate a standard way of conducting subjective quality measurement experiments in this domain. Furthermore, standardization work in the networking domain will support efficient integration of emerging QoE models in the end-to-end delivery chain for quality/bitrate adaptation over time.

B. STANDARDIZATION
There are many standardization activities taking place around adaptive bitrate video streaming. For HTTP adaptive streaming, DASH is the main standardization effort to date [19]. MPEG-DASH was developed by MPEG and became an International Standard in November 2011. Since then, MPEG-DASH has been revised as ISO/IEC 23009-1:2019. This standard defines a media representation framework for dynamic adaptation of media content in an encoder agnostic manner. However, this standard does not address the QoE or time-varying quality of the video. In 2017, Server and Network assisted DASH (SAND) was published as an extension of the MPEG-DASH standard [112]. This standard focuses on content-awareness and QoE/service-awareness through server/network assistance, analytics and monitoring of DASH-based services, and unidirectional/bidirectional, point-to-point/multipoint communication with and without a session (management) between servers/CDNs and DASH clients. It is observed that even though there are standardization efforts on media representation and reference decoders, there is no standardization effort on monitoring and measuring the quality of video delivered using adaptive streaming. Often, custom made tools are used to evaluate the quality (e.g. Joystick). These limitations hinder the progress of research in this domain. There are also standardization activities in Study Group 12 at ITU-T, such as ITU-T P.1203 for HD HTTP adaptive video streaming [114] and ITU-T P.1204 for 4K video streaming [115]. ITU-T P.1203 is a parametric bitstream model with four different modes of operation, where each mode utilizes different application layer Key Quality Indicators (KQIs). The works in [116], [117] provide an open-source Python implementation of ITU-T P.1203. Table 8 summarizes the standardization efforts in the domain of HTTP adaptive video streaming.

C. SUMMARY
As discussed in this section, even though the basic building blocks for the generic continuous time-varying QoE model can be envisaged, there are many challenges to be addressed in the future. Some of the key challenges and future research directions are discussed in Table 7. It is also important to deploy these CTVQ models in the end-to-end multimedia QoE pipeline that spans from encoding to delivery. Therefore, it is important to discuss the methodologies followed to integrate these QoE models with the networking infrastructure and the challenges associated with QoE monitoring and management. Similarly, the distortions introduced during video compression, and optimizations at the encoder and decoder, have a significant impact on the overall end-user QoE. Hence, QoE modelling should consider the impact of compression and other application-layer processes. In this regard, Section VI elaborates on compression artefacts, the impact of encoder and decoder optimizations on the QoE, and QoE-aware encoding strategies. Section VII then discusses QoE model integration and the monitoring and management challenges in 5G/6G. The following two sections thus place CTVQ modelling and deployment within the content creation and communication infrastructure of the end-to-end multimedia delivery chain illustrated in Fig. 1.

VI. QOE-AWARE ENCODER AND DECODER OPTIMIZATION, AND VIDEO ENCODING STRATEGIES
The QoE modelling aspects discussed in Sections III-V can be applied within the communication infrastructure (discussed in Sec. VII) as well as within the content creation stages (as defined in Fig. 1). For instance, the knowledge and understanding of end-user QoE at the content generation stage influence the video compression and multimedia adaptation techniques. As such, the QoE-aware multimedia adaptation techniques can be broadly categorized into application layer adaptations and methods that operate on the network layer. Network layer-based techniques constitute a range of sophisticated approaches such as network resource allocation [118] and QoE management methods using SDN and NFV, which are discussed in Section VII. Application layer methodologies can be further categorized into adaptations within the media distribution chain, encoder level adaptations, and decoder (and media playback device) adaptations. Multimedia content adaptation within the media service delivery chain typically focuses on selecting the most suitable video segment with a bit rate that matches the prevailing network bandwidth. These techniques, discussed in [119]-[121], work well with HTTP adaptive media streaming technologies such as MPEG-DASH to improve the end-user's QoE in fluctuating network conditions [71]. However, these methods do not consider the algorithms that operate at the encoder and decoder level and their impact on the end-user QoE. These aspects are of prime importance when designing theory, technologies, and applications for 5G multimedia communications [122]. Compressing the raw image and video contents leads to compression artifacts and visual quality degradation. This is evident especially in low bit rate transmission cases. As the number of bits per frame decreases, both the objective and subjective quality of the image sequence will decrease.
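The segment selection step described above, choosing a representation whose bit rate matches the prevailing bandwidth, can be sketched as follows. This is a generic throughput-based rule, not the algorithm of any cited work; the `safety` factor and the example bitrate ladder are assumptions.

```python
def select_bitrate(ladder_kbps, throughput_kbps, safety=0.9):
    """Pick the highest representation that fits the estimated throughput.

    `safety` (hypothetical) leaves headroom for throughput estimation
    errors. If nothing fits, fall back to the lowest representation.
    """
    budget = safety * throughput_kbps
    feasible = [b for b in ladder_kbps if b <= budget]
    return max(feasible) if feasible else min(ladder_kbps)


ladder = [500, 1000, 2500, 5000]  # illustrative bitrate ladder in kbit/s
for estimate in (400, 3000, 10000):
    print(estimate, "->", select_bitrate(ladder, estimate))
```

A rule of this kind reacts only to bandwidth; a QoE-aware variant would additionally consult a time-varying quality model before switching representations.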
On the other hand, another major challenge when working with modern video coding standards is the exorbitant increase in their encoding and decoding time complexity. Yet, the adaptation techniques that operate on the distribution pipeline do not address the QoE implications of the optimization strategies adopted in modern encoder and decoder implementations. This section provides an overview of the encoding and decoding complexity of modern video coding standards, optimization strategies proposed in the literature, and their impact on the end-user QoE. Furthermore, QoE-aware encoding techniques proposed in the literature to generate bitstreams that can potentially improve end-user QoE in lossy communication channels are also discussed.

A. IMPACT OF IMAGE/VIDEO COMPRESSION ON QOE
It is estimated that over 80% of the internet traffic will consist of video data by 2023 [123], and efficient video compression and communication strategies must continue to be investigated. The challenge for any video and image encoding algorithm is to compress the visual data while maintaining a target quality level [124]. Modern hybrid video coding standards achieve this by reducing the redundancies seen in the video signals: namely, the spatial and temporal redundancies, and the redundancies between data symbols, which are reduced by entropy coding (achieved through variable length coding) [125]. The recent video coding standards are proving to be more efficient compared to their predecessors. For instance, HEVC, which was standardized in 2013, shows 40-50% coding efficiency improvement compared to H.264 [126], [127]. The demand for coding efficiencies beyond HEVC was immediately felt with the exponential increase in UHD and HDR video contents and immersive media applications [128]. Thus, the Joint Video Experts Team (JVET), composed of ITU-T and MPEG members, introduced Versatile Video Coding (VVC) in 2020 [129], which demonstrates 30-40% bit rate reduction for the same quality level compared to HEVC. In addition, the popularity of open-source, royalty-free codecs such as VP9 [130] and AV1 [131] has increased in the recent past. Experimental data reveal comparable compression efficiency between VP9/AV1 and their MPEG counterparts. In any case, the aforementioned standards are all based on the hybrid block-based video encoding approach. Hence, the Discrete Cosine Transform (DCT) based compression in block-based video coding, achieved through quantization of transformed coefficients, generally results in compression artifacts such as blurring, ringing distortions, and boundary artifacts (blockiness) [132]. The degree of visibility of these artifacts affects the end-user's QoE.
Hence, reducing the bit rate to cater to the increasing demands in video communications without significant impacts on the perceived visual quality [133] is a compelling research challenge that will continue to dominate this space for the foreseeable future. Under this umbrella of research, Region of Interest (ROI) based video coding and efficient rate control algorithms are attracting interest in the research community. The proliferation of machine learning algorithms and emerging low-complexity neural network architectures are facilitating new research avenues beyond the traditional rule-based algorithms. Rate control is a complex problem, especially in real-time communication, and any inaccurate bit rate estimation or allocation coupled with a sub-optimal coding parameter selection can lead to unacceptable QoE issues in multimedia applications. The λ domain rate control algorithm proposed in [134] and adopted in the HM Test Model reference encoder [135] attempts to dynamically model the RD relationship using a Least Mean Squares (LMS) based adaptation model. A rate control algorithm requires 1) allocating a certain amount of bits to a frame or a CTU based on the overall bit rate and 2) selecting the appropriate coding parameters to meet the bit rate constraint set at the beginning of the encoding process. The dynamic nature of the content, and the fact that bit allocation takes place at an early stage of the encoding pipeline, make it difficult for encoders to achieve both tasks simultaneously. Hence, advanced CTU level and frame-level bit allocation methods have been proposed in the recent literature using a range of techniques. Game theory-based methods are presented in [136]-[138] targeting video conferencing applications. If the rate estimation is incorrect for a particular frame or a CTU, the overall rate control performance is degraded, impacting the end-user QoE.
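The two rate control tasks named above (bit allocation, then parameter selection) can be illustrated with a λ domain sketch. The hyperbolic form λ = α·bpp^β, the λ to QP mapping, and all numeric constants below are the values commonly quoted for the HM reference encoder; they are stated here as assumptions and should be checked against [134], [135] before use. The update rule is a simplified LMS-style correction, not a verbatim copy of the reference software.

```python
import math


def lambda_from_bpp(bpp, alpha=3.2003, beta=-1.367):
    """R-lambda model: fewer bits per pixel implies a larger lambda."""
    return alpha * bpp ** beta


def qp_from_lambda(lam):
    """Map lambda to a QP, clipped to the 0..51 range (assumed constants)."""
    return max(0, min(51, round(4.2005 * math.log(lam) + 13.7122)))


def update_model(alpha, beta, lam_used, bpp_actual, d_alpha=0.1, d_beta=0.05):
    """LMS-style update of (alpha, beta) after a unit has been encoded.

    The error is the log-domain gap between the lambda that was used and
    the lambda the current model predicts for the bits actually spent.
    """
    lam_model = alpha * bpp_actual ** beta
    err = math.log(lam_used) - math.log(lam_model)
    return (alpha + d_alpha * err * alpha,
            beta + d_beta * err * math.log(bpp_actual))


target_bpp = 0.05                      # hypothetical bit budget per pixel
lam = lambda_from_bpp(target_bpp)
print(qp_from_lambda(lam))
```

If the encoder spends more bits than targeted at the chosen λ, the update shifts the model so that the next unit with the same target receives a larger λ and hence a coarser QP.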
Hence, the adoption of machine learning models and Bayesian estimation methods is prominent in recent rate control algorithms [139]. The use of Convolutional Neural Networks (CNNs) for extracting spatial and temporal saliency features, and perceptual priority-based QP selection algorithms, are becoming popular within the video coding domain [140]. Identifying ROI areas through a saliency map is important in this regard, as it allows the bit allocation algorithms to allocate more bits to these regions of the frame compared to static regions. Deep CNNs are utilized to generate saliency maps to identify ROIs in a frame in [141], [142], and the algorithm is then extended to introduce a modified RDO process that performs an optimum bit allocation based on the importance of the region. Machine learning-based ROI extraction has been shown to be more consistent with the HVS compared to the traditional texture based extraction techniques. This is attributed to the fact that these prediction models are specifically trained using large datasets to recognize areas with objects as regions of interest within a video frame. The popularity of machine learning technologies has given rise to a range of deep learning-based image and video coding solutions where the encoder and decoder are composed of neural networks. It has been identified that learning-based image coding solutions offer significant compression performance, yet the decoded images are affected by compression artifacts that are typically not seen with DCT-based coding methods [143]. Therefore, further investigations are needed in assessing the overall QoE impact of compression algorithms, rate control methods, and ROI-based coding techniques with state-of-the-art video coding standards.
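A minimal sketch of saliency-driven bit allocation is given below. It assumes a per-CTU saliency score is already available (for example from a saliency map) and splits a frame budget so that salient CTUs receive more bits while every CTU keeps a minimum share. The `floor_share` parameter and the proportional rule are assumptions for illustration and do not reproduce the modified RDO of [141], [142].

```python
def allocate_bits(frame_bits, saliency, floor_share=0.2):
    """Split a frame's bit budget across CTUs according to saliency.

    A fraction `floor_share` of the budget is spread evenly so that
    non-salient CTUs are never starved; the rest is proportional to
    each CTU's saliency score.
    """
    n = len(saliency)
    if n == 0:
        raise ValueError("saliency must not be empty")
    total = sum(saliency)
    if total == 0:
        return [frame_bits / n] * n      # no saliency information
    base = floor_share * frame_bits / n
    rest = (1 - floor_share) * frame_bits
    return [base + rest * s / total for s in saliency]


print(allocate_bits(1000, [1, 3]))
```

The allocations always sum to the frame budget, so the scheme changes only the distribution of bits within the frame and not the overall rate.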

B. IMPACT OF THE ENCODER OPTIMIZATION ON QOE
Modern video coding standards such as High Efficiency Video Coding (HEVC) [127], Versatile Video Coding (VVC) [144], VP9 [145], and AV1 [146] have shown significant coding efficiency improvements compared to their predecessors such as the H.264 and VP8 video coding standards. For instance, HEVC, which was introduced in 2013, shows 50% [147] coding efficiency improvement compared to its predecessor H.264. Similarly, VVC, which was standardized in 2020 [148], demonstrates 30-35% [149], [150] bit rate reduction compared to HEVC for a similar video quality level. However, these improvements in bit rate reduction come with a greater increase in the encoding complexity, which demands a significant amount of computational and energy resources at the encoding servers/devices [150]-[152]. For example, the use of larger coding blocks (64 × 64 in HEVC, and 128 × 128 in VVC), increased intra- and inter-prediction modes, cross-component prediction, advanced motion compensation and estimation methods, and improvements in transform coding have resulted in a ≈250-400% encoding time increase in VVC compared to its predecessor HEVC [150]. In this context, a significant amount of research effort is dedicated to optimizing the encoders to reduce the encoding time complexity with minimal impact on the coding efficiency.
Encoders typically follow a Rate-Distortion (RD) optimization using a Lagrangian cost function and go through all possible encoding parameter combinations in a brute force fashion to determine the optimal encoder parameter combination (the one that minimizes the RD cost) for a given content [153]. Therefore, state-of-the-art methods that focus on encoder complexity reduction attempt to skip all or certain stages of the compute intensive RD optimization when determining the best possible encoding parameter combinations for a given content. In this regard, the majority of the encoder complexity reduction methods can be categorized into two approaches: statistical feature-based methods and learning-based approaches [154]. Statistical feature-based methods attempt to use texture complexity [152], [155] and motion complexity details [156], [157], combined with statistics from previously encoded blocks [158], to generate probabilistic models (e.g., Naive-Bayes) that determine the best coding structure/parameters for a given image segment at an early stage. On the other hand, learning-based approaches use data sets composed of previously encoded image segments and associated encoding parameters to train machine learning models, which can then be used to predict the coding structures/parameters for arbitrary content. Supervised learning algorithms such as Support Vector Machines (SVM) have been used predominantly in recent literature due to their lower complexity and ability to handle binary classification effectively [154], [159]. In addition, techniques such as random forests [160], decision trees for data mining [161], and various deep learning-based methods [162], [163] have been applied to predict coding parameters at various stages in the encoding toolchain.
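The Lagrangian cost J = D + λR and the effect of skipping part of the exhaustive search can be sketched as follows. The candidate list, the distortion and rate numbers, and the `early_exit_cost` threshold are all hypothetical; in the cited works the early decision comes from a statistical or learned model rather than from a fixed threshold.

```python
def rd_cost(distortion, rate, lam):
    """Lagrangian RD cost J = D + lambda * R."""
    return distortion + lam * rate


def best_mode(candidates, lam, early_exit_cost=None):
    """Search (name, distortion, rate) candidates for the lowest RD cost.

    With `early_exit_cost` set, the search stops as soon as a candidate
    is "good enough", trading optimality for fewer evaluations.
    Returns (name, cost, number_of_candidates_evaluated).
    """
    best = None
    evaluated = 0
    for name, d, r in candidates:
        evaluated += 1
        cost = rd_cost(d, r, lam)
        if best is None or cost < best[1]:
            best = (name, cost)
        if early_exit_cost is not None and best[1] <= early_exit_cost:
            break
    return best[0], best[1], evaluated


modes = [("skip", 120, 2), ("merge", 90, 10), ("inter", 60, 30), ("intra", 40, 80)]
print(best_mode(modes, lam=1.0))                       # exhaustive search
print(best_mode(modes, lam=1.0, early_exit_cost=105))  # early termination
```

In this toy example the early exit halves the number of evaluated modes but settles on a candidate with a higher RD cost than the exhaustive search, which is the efficiency loss that complexity reduction methods try to keep small.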
The main challenge with any encoder optimization algorithm is to achieve a significant encoding complexity reduction while maintaining the coding efficiency achieved by the benchmark encoding algorithms presented in reference implementations [135], [164]. Any optimization strategy that attempts to reduce the encoding complexity by skipping the exhaustive RD optimization tends to select a certain number of less optimum coding parameters for a given image segment [152], [156], [161]. These less optimum selections cause the encoder to operate slightly below the benchmark RD curve, as illustrated in Fig. 8. As a result, a bitstream generated at a particular rate by such an encoder will have a lower objective (as well as subjective) quality level. In such cases, the encoder will have to generate a bitstream at a higher bit rate to achieve a similar visual quality (Fig. 8). These scenarios either result in low quality video streams or demand a higher network bandwidth, which ultimately impacts the end-user QoE. Hence, some of the proposed encoder optimization algorithms tend to provide engineering design parameters to trade off the coding efficiency against the encoding complexity depending on the end-user's QoE requirements, network conditions, and computational resource constraints of the encoding servers [154], [156].

FIGURE 8. An illustration of the RD curve in a typical encoder (vertical axis: PSNR (dB); curves: "Benchmark encoder" and "Optimized encoder"). The "Benchmark encoder" curve represents the coding performance of an encoder that achieves the best coding efficiency performance. The "Optimized encoder" curve represents the RD performance of an encoder that skips the computationally intensive RD optimization to select the coding modes for a given content.
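The bit rate penalty of operating below the benchmark RD curve can be quantified by comparing the rates the two encoders need for the same PSNR. The sketch below interpolates linearly in the log-rate domain between measured RD points; the RD points are invented for illustration and the interpolation is a simplification of the curve fitting normally used for such comparisons.

```python
import math


def rate_at_quality(curve, target_psnr):
    """Bit rate needed to reach `target_psnr` on an RD curve.

    `curve` is a list of (rate_kbps, psnr_db) points with strictly
    increasing PSNR. Interpolation is linear in log(rate).
    """
    pts = sorted(curve, key=lambda p: p[1])
    for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
        if q0 <= target_psnr <= q1:
            t = (target_psnr - q0) / (q1 - q0)
            return math.exp(math.log(r0) + t * (math.log(r1) - math.log(r0)))
    raise ValueError("target PSNR lies outside the measured curve")


# Hypothetical RD points (kbit/s, dB) for the two encoders.
benchmark = [(1000, 34.0), (2000, 37.0), (4000, 40.0)]
optimized = [(1000, 33.5), (2000, 36.5), (4000, 39.5)]

r_bench = rate_at_quality(benchmark, 37.0)
r_opt = rate_at_quality(optimized, 37.0)
print(f"extra rate for equal PSNR: {100 * (r_opt / r_bench - 1):.1f}%")
```

The resulting percentage is the additional bandwidth the optimized encoder would demand to match the benchmark quality at that operating point.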

C. IMPACT OF DECODER OPTIMIZATIONS ON THE QOE
The assortment of new coding tools and features in novel video coding standards increases the complexity of the resulting bitstreams. Complex bitstreams increase the time complexity of the decoders and demand more computational and energy resources to achieve real-time decoding for smooth video playback. For example, decoding time for VVC bitstreams has increased by ≈130-170% compared to HEVC bitstreams [150], making decoding of these bitstreams a major source of energy consumption in resource constrained consumer electronic devices. This is further compounded by the ever-increasing resolution of the video frames (e.g., HD, 4K, and 8K) and novel media formats such as High Dynamic Range (HDR), 360 degree videos, etc. [165], [166]. Thus, the overall energy consumption at the decoder (particularly in the case of mobile devices) has become a vital parameter that affects the overall end-user QoE in multimedia applications [167], [168].
Reducing the energy consumption of resource constrained consumer electronic devices during media playback is a compelling research and engineering challenge. This is typically addressed across all layers of the TCP/IP stack (physical layer, link layer, network layer, and application layer) [169], [170]. Physical layer approaches mainly focus on changes to the modulation scheme or dynamic modulation scaling techniques [171], whereas the link layer and network layer techniques operate on managing the wireless interface and energy-aware scheduling algorithms [172]-[174]. The application layer techniques, on the other hand, use a range of approaches: manipulating the decoder operations (changes to the loop filtering and motion compensation) [175], changing the video stream (by manipulating coding parameters such as resolution, frame rate, and Quantization Parameter (QP)) [176], and considering decoding complexity-aware video coding at the content preparation stage at the encoders [165]. It has been identified that changes to the video resolution and frame rate can result in severe perceived video quality degradation in mobile video broadcasting [177]. However, in the case of the latter, encoders can be designed to consider the corresponding decoding complexity (or energy consumption) associated with a particular coding mode or coding block size for a given content. For example, the decoding energy parameter is considered within the RD optimization to jointly select the coding modes that attain a minimum distortion for a given bit rate and decoding energy constraint. This is particularly important as any attempt to reduce the decoding energy consumption impacts the coding efficiency and eventually the QoE. Furthermore, the use of Dynamic Voltage and Frequency Scaling (DVFS) in both software and hardware decoders is evident in modern mobile devices to manage the device's power consumption. For example, the CPU's operating frequency and voltage level are adjusted depending on the complexity of the bitstream [178], [179].
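Including a decoding energy term in the RD optimization, as described above, can be expressed as an extended cost J = D + λR + μE. The following sketch is a hedged illustration of that idea: the multiplier `mu`, the per-mode energy figures and the mode names are hypothetical, and the formulation is not taken from a specific cited method.

```python
def select_mode(modes, lam, mu):
    """Pick the mode minimising J = D + lam * R + mu * E.

    Each mode is a dict with 'name', 'distortion', 'rate' and 'energy'
    (estimated decoding energy). mu = 0 gives conventional RDO.
    """
    return min(
        modes,
        key=lambda m: m["distortion"] + lam * m["rate"] + mu * m["energy"],
    )


modes = [
    {"name": "simple_pred", "distortion": 50, "rate": 20, "energy": 10},
    {"name": "complex_pred", "distortion": 45, "rate": 20, "energy": 30},
]
print(select_mode(modes, lam=1.0, mu=0.0)["name"])  # conventional RDO
print(select_mode(modes, lam=1.0, mu=0.5)["name"])  # energy-aware RDO
```

Raising `mu` steers the encoder towards modes that are cheaper to decode at the price of slightly higher distortion, which makes the coding efficiency versus energy trade-off explicit.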
Application layer methods that alter the decoding process to reduce the energy consumption typically skip certain operations within the decoding process. For example, Green-MPEG (a standardization effort from MPEG to reduce the energy consumption of the decoders) proposes to send metadata specifying decoding-complexity requirements for a video frame (or a video segment) [180], [181]. This allows the decoders to skip certain high-complexity steps in the decoding pipeline to reduce the overall energy consumption [182]. However, it is reported that such alterations (especially within the motion compensation phase) severely affect the video quality, and hence the QoE [165].
DVFS, on the other hand, attempts to reduce the decoding energy consumption by altering the operating CPU frequency and voltage levels depending on the complexity of the current video frame [178], [183]. The main challenge with DVFS is to estimate the decoding complexity of subsequent video frames in order to set the appropriate CPU frequency level. Inaccurate estimates lead to frame drops, and sub-optimal CPU frequency/voltage levels adversely affect the overall system performance, degrading the end-user's QoE in multimedia consumption. Emerging Green-MPEG specifications such as C-DVFS attempt to mitigate these challenges by incorporating decoding complexity/energy requirement metadata into the bitstreams to assist accurate frequency/voltage scaling [184].
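The frequency selection problem in DVFS can be sketched as choosing the lowest available CPU frequency that still decodes the next frame before its display deadline. The cycle estimate would come from a complexity predictor or from bitstream metadata; the frequency table, the `margin` factor and the cycle counts below are assumptions for illustration only.

```python
def pick_frequency(est_cycles, deadline_s, freqs_hz, margin=1.1):
    """Lowest CPU frequency that decodes the frame within its deadline.

    `margin` (hypothetical) pads the cycle estimate against prediction
    errors. If no frequency is fast enough, the highest one is returned
    and the frame is likely to miss its deadline.
    """
    needed_hz = margin * est_cycles / deadline_s
    candidates = [f for f in sorted(freqs_hz) if f >= needed_hz]
    return candidates[0] if candidates else max(freqs_hz)


freqs = [600e6, 1.2e9, 1.8e9]      # illustrative frequency levels
frame_deadline = 1 / 30            # 30 frames per second
for cycles in (10e6, 30e6, 100e6):
    print(cycles, "->", pick_frequency(cycles, frame_deadline, freqs))
```

An underestimated cycle count selects a frequency that is too low and causes late frames, while an overestimate wastes energy, which is why accurate complexity metadata is valuable.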

D. ERROR RESILIENCE AND CONCEALMENT-AWARE VIDEO ENCODING
Video transmission over a lossy channel is a compelling challenge and requires cross-layer approaches to protect the media streams against network vulnerabilities. In this context, QoE-aware media protection strategies such as forward error correction should be supported by both the encoder and decoder that operate in the application layer to recover the lost information [185]. The authors in [186] classify the transmission challenges on DCT compressed images as bit-error, desynchronization, packet loss, packet delay, and packet intrusion. In these cases, the encoder incorporates additional information into the bitstream to support error resiliency, whereas the decoder utilizes a range of error concealment strategies to recover or conceal the lost data in the reconstructed video frame. Frame or slice copy is the simplest error concealment strategy supported by the majority of video decoders [187]. However, reconstructing a video frame with a lost slice in a static, low textured background section of the frame has a low impact on the QoE, compared to a case where the lost slice lies in a highly textured, motion rich section of the frame. In the latter case, the degradation of the visual quality is easily noticeable to the users and has a high impact on the end-user QoE [188]. As a result, strategies such as the Boundary Matching Algorithm (BMA) are adopted to estimate the motion vectors for the pixels of the lost slice from the motion information available in the neighboring slices [189]-[191]. Recovery of fully lost frames through texture analysis and motion vector extrapolation has also been attempted as a decoder-side error concealment technique [192]. However, handling and reducing the error propagation remains a crucial challenge.
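Slice copy, the simplest concealment strategy mentioned above, can be sketched in a few lines. Frames are modelled here as plain lists of pixel rows and a lost slice as a set of row indices; this representation is an assumption made to keep the example self-contained and does not reflect how a real decoder stores pictures.

```python
def conceal_by_copy(current, previous, lost_rows):
    """Replace lost rows of `current` with co-located rows of `previous`.

    Works well for static regions; in motion-rich regions the copied
    content is misaligned and the error becomes visible.
    """
    repaired = [row[:] for row in current]   # do not modify the input
    for r in lost_rows:
        repaired[r] = previous[r][:]
    return repaired


previous = [[10, 10], [20, 20], [30, 30]]
current = [[11, 11], [0, 0], [31, 31]]       # row 1 was lost in transit
print(conceal_by_copy(current, previous, lost_rows=[1]))
```

Motion-compensated methods such as BMA refine this idea by copying from a displaced position in the previous frame instead of the co-located one.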
The focus has also been directed towards priority-based slice protection schemes to improve the error correction capability of the decoders. For example, non-Video Coding Layer (VCL) Network Abstraction Layer (NAL) units such as Video Parameter Set (VPS), Sequence Parameter Set (SPS), and Picture Parameter Set (PPS) NAL units are given additional protection within the lower layers of the TCP/IP stack [193], [194]. One of the prominent application layer strategies to obstruct error propagation in the decoders is the use of intra-predicted blocks or intra-coded frames within the encoded bitstreams. Intra-coding predicts the current coding block from previously encoded, spatially adjacent pixels, as opposed to inter-coding, which predicts from previously encoded, temporally adjacent video frames. Three approaches have been identified to use intra-coding within a sequence: intra coding of several blocks selected randomly or regularly, intra coding of some specific blocks selected by an appropriate cost function, or intra coding of a whole frame. Intra-coding increases the bit rate and thus reduces the compression ratio, but experimental results demonstrate that periodical I-frame coding is preferred over coding only several blocks in intra mode in P-frames [192]. Encoders can configure the number of intra-frames within the Group of Pictures using configuration parameters such as the intra-refresh interval. However, frequent injection of intra-frames increases the bit rate of the encoded stream. Hence, dynamic and intelligent model-based methods to adjust the intra-frame frequency are heavily investigated in the literature [195], [196]. Duplication of macroblocks inside Regions of Interest (ROI) within a video frame introduces additional redundancies, yielding high quality reconstruction of frames with lost data packets [197]. These attempts are further extended by Flexible Macroblock Ordering (FMO) and ROI-based rate controlling to improve the error resiliency against bursty packet losses and the end-user QoE in lossy channels [198]-[200]. Modern video coding standards such as HEVC and VVC introduce features such as Temporal Motion Vector Prediction (TMVP) to improve their coding efficiency [127]. However, TMVP is a major source of error propagation in lossy communication channels, and works such as [201], [202] attempt to address this by intelligently turning TMVP on and off depending on the channel conditions and the content being encoded.
In this context, QoE-aware encoding parameter selection at the encoder can play a vital role in supporting the decoders in reconstructing the encoded bitstreams such that end-user QoE is maximized. For instance, probabilistic models have been proposed that predict the overall distortion at the decoder based on the impact on motion vectors, the pixels in the reference frames, and the clipping operations. The predicted error is then used in the encoder to select the optimum coding parameters to facilitate robust video transmission in lossy channels [203]. A similar approach is undertaken in [204], where the optimal error concealment strategies at the decoder are identified in the encoder by simulating the transmission errors. These are then signaled to the decoder as supplemental enhancement information messages. Two-pass coding strategies are also considered to determine the motion vectors and coding modes such that both source coding and channel propagation errors are reduced while improving the error concealment capabilities of the decoders [205].

TABLE 9. Summary of key challenges and potential research directions.

Image/video compression
Efficient rate controlling through intelligent bit allocation and appropriate coding parameter selection are compelling challenges, especially with new standards such as VVC and AV1. The use of machine learning to identify ROI within a frame is becoming more prominent with these algorithms. Furthermore, standardization activities are ongoing for client-specific manifest files for DASH through Session Based Description (SBD). The demand for efficient compression technologies is rising constantly, and emerging MPEG-5 standards are generating opportunities for more efficient and low-complexity video coding algorithms.

QoE-aware encoder optimizations
Reducing the encoding complexity of state-of-the-art video coding standards (HEVC, VVC, AV1) while keeping the coding efficiency intact is a compelling research challenge. Novel standards such as MPEG-5 LCEVC call for further research into low-complexity video coding and its impact on end-user QoE. Furthermore, transcoding between different standards and the use of machine learning for low-complexity video coding without impacting coding efficiency will require further investigation.

QoE-aware decoder optimizations
The energy profiling of the latest and upcoming video coding standards is crucial. Investigations into embedding decoding complexity data into the bitstream, efficient use of DVFS under different operational conditions, and the consideration of the energy consumption of the whole end-to-end multimedia streaming pipeline will be crucial.

Error concealment and resilience-aware video coding
As the coding efficiency improves (with standards such as HEVC and VVC) and redundancies are removed from the source signal, the compressed signal becomes more prone to transmission errors. Intelligent concealment-aware picture partitioning schemes and communication channel-aware encoding strategies need to be further investigated.
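The encoder-side use of a predicted distortion to choose coding parameters, as discussed above, can be illustrated with a deliberately simple toy model for selecting the intra-refresh interval. The sketch assumes independent frame losses with a fixed probability, assumes that a loss corrupts every frame up to the next intra-frame, and combines the expected share of corrupted frames with the average bits per frame in a Lagrangian-style cost. The model, the function names, and all numeric values are illustrative assumptions and do not reproduce the methods of [203]-[205].

```python
from typing import Sequence


def corrupted_fraction(interval: int, loss_prob: float) -> float:
    """Expected share of frames affected by loss or error propagation.

    The frame at position j after an intra-frame is clean only if frames
    0..j were all received, which happens with probability (1 - p)^(j + 1).
    """
    clean = sum((1.0 - loss_prob) ** (j + 1) for j in range(interval))
    return 1.0 - clean / interval


def average_bits(interval: int, intra_bits: float, inter_bits: float) -> float:
    """Average bits per frame with one intra-frame every `interval` frames."""
    return (intra_bits + (interval - 1) * inter_bits) / interval


def best_intra_interval(candidates: Sequence[int], loss_prob: float,
                        intra_bits: float, inter_bits: float,
                        distortion_weight: float, lam: float) -> int:
    """Pick the interval minimizing expected distortion plus lam times rate."""
    def cost(interval: int) -> float:
        distortion = distortion_weight * corrupted_fraction(interval, loss_prob)
        return distortion + lam * average_bits(interval, intra_bits, inter_bits)
    return min(candidates, key=cost)


# Hypothetical frame sizes (bits), distortion weight, and Lagrange multiplier.
CANDIDATES = (8, 16, 32, 64)
clean_channel = best_intra_interval(CANDIDATES, 0.0, 120_000, 20_000, 100.0, 1e-3)
lossy_channel = best_intra_interval(CANDIDATES, 0.05, 120_000, 20_000, 100.0, 1e-3)
```

Under these assumptions, a loss-free channel favors the longest interval because intra-frames only add bits, whereas a lossy channel favors a shorter interval that limits error propagation at the cost of a higher bit rate.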

E. SUMMARY
This section describes the challenges and potential research directions in application layer compression domain techniques for QoE-aware multimedia adaptation; the key challenges and potential research directions are summarized in Table 9. The QoE impact of image/video compression and potential strategies to improve the end-user QoE during compression are discussed in Section VI-A. These include the use of ROI-based bit allocation and intelligent rate controlling algorithms using state-of-the-art machine learning models. Section VI-B discusses the impact of the complexity of modern video coding standards on the end-user QoE. Encoder optimization strategies that reduce the complexity also impact the coding efficiency of the encoders. In this case, the resultant bitstreams produce low quality video content or demand higher bit rates. In this context, effectively trading off complexity against coding efficiency and producing bitstreams with low visual quality impact is a compelling challenge. Similarly, the impact of decoder energy optimization strategies is discussed in Section VI-C. In this case, various application layer methods, DVFS strategies and decoding complexity-aware encoding algorithms and their challenges are discussed. The increasing complexity of the encoded streams demands additional computing and energy resources, which are scarce on mobile handheld consumer electronic devices. Hence, energy consumption during multimedia consumption is a crucial element that affects the overall end-user QoE in future multimedia applications. Section VI-D discusses the error resilient and concealment-aware video coding methodologies available in the state-of-the-art. In this regard, modelling the quality of the reconstructed video sequence of a bitstream transmitted over a lossy channel is important to determine the optimum coding parameters to be used in the encoder. This can assist the encoder in generating a bitstream that is more resilient in an error-prone channel and can support the decoder's error concealment strategies in order to maximize the end-user QoE.

VII. QOE-AWARE NETWORK AND SERVICE MONITORING/MANAGEMENT IN 5G/6G NETWORKS
Once the QoE prediction model is developed by following the considerations discussed in Sections III-VI, it can be integrated into the multimedia service delivery over the Internet for QoE-aware network and service management/monitoring. QoE-aware multimedia streaming service delivery in next-generation networks relies on novel solutions for integrating the QoE models into the future Internet architecture and the network and service management chain [12], [13], [206]-[208]. Effective QoE model integration in 5G/6G networks for video streaming requires a QoE monitoring solution to be deployed to monitor Key Quality Indicators (KQIs) based on the features/parameters contributing to the QoE model [209]-[212]. The monitored KQIs are then utilized by the QoE models to measure the user-perceived quality, leading to QoE-aware management of the network and service resources [10], [213]-[216]. This section provides the future challenges and research directions from the perspective of QoE-aware network/service management and monitoring in 6G and beyond networks through the integration of the QoE model into the service chain and the future Internet architecture.

A. QOE MANAGEMENT IN 5G/6G NETWORKS
Network enablers including Software-Defined Networking (SDN), Network Function Virtualization (NFV), and Mobile Edge Computing (MEC) can play an important role in QoE-aware network management [14], [217]-[219]. QoE management in 6G and beyond networks using these network enabling technologies promises network programmability, scalability, agility, distributed computing, dynamic resource optimization and automation, which will allow next-generation networks to fulfil the user-perceived quality for emerging video streaming applications while being cost-effective [12], [215], [220], [221]. The QoE management of softwarized and virtualized next-generation networks requires the deployment of QoE monitoring and measurement solutions on top of the SDN controller/NFV Management and Orchestration (MANO) [222]-[225]. Moreover, the deployment of QoE-aware management approaches in 6G networks requires information exchange among the major players in multimedia streaming, which include Internet Service Providers (ISPs) and Over-The-Top service providers (OTTs) [13], [215]. The ISP and the OTT have different roles in multimedia service delivery, and each has access to and control of different information and resources. For example, an ISP has control over the network infrastructure and access to network-level information, based on which it can optimize the network resources. OTTs, being service providers, have access to application-level KQIs and can optimize the video streaming delivery at the client and server side. To date, multimedia services are encrypted by OTTs, which prevents ISPs from retrieving the OTTs' application-level KQIs [226]-[228].
The exchange of application-level KQIs among OTTs and ISPs is important for QoE management of video streaming services, as QoE is a multidisciplinary concept and most QoE models are mainly composed of application KQIs such as video quality layers, average bitrate, stalling events, and the duration of the stalling events. Being blind to these KQIs, the ISPs cannot effectively perform QoE-aware network management [11], [229], [230]. The exchange of information among OTTs and ISPs will require Service Level Agreements (SLAs) or Experience Level Agreements (ELAs) [5]. ELAs based on the QoE model will also be needed for offering services to the end customers [210], [231]. Furthermore, E2E QoE management in future networks will require collaborative network and service management by both the ISP and the OTT provider, based on an agreed QoE model and the integration of the QoE model into the whole network and service delivery/optimization [11], [12], [231]. In the literature, several state-of-the-art works have been proposed for collaborative QoE-aware network and service management, in which different business-centric QoE management strategies such as joint-venture, QoE-fairness based on the fairness metric defined in [232], Customer Lifetime Value (CLV) and zero-rated QoE have been proposed [11], [210], [215], [229], [231]. However, more research on E2E collaborative QoE management of multimedia services through OTT-ISP collaboration is needed. Ongoing OTT-ISP collaboration in the industry can be seen in projects such as T-Mobile Binge On, where the ISP limits the data rate of the OTT application and provides zero-rated data for the collaborating OTTs to its customers [233].
Google Global Cache [234] and Netflix Open Connect [235] are other examples of industrial OTT-ISP collaborative service management, where Google and Netflix allow collaborating ISPs to host the OTT's surrogate servers at the network edge to reduce content retrieval latency and unnecessary traffic in the ISP's core network. However, these approaches are not QoE-aware. Network Neutrality, which remains important in some geographical locations such as Europe [236], can be raised as an argument against the formulation of collaborative E2E QoE management approaches. In the United States, Network Neutrality was repealed in June 2018 [237], as it hinders the basic concepts of application-aware network slicing in 5G [238]. Therefore, a better definition of Network Neutrality in terms of QoE is required in the geographical locations where it is enforced [215].

B. QOE MONITORING
In the literature, many state-of-the-art works propose different passive and active QoE monitoring solutions based on user-end probes for the collection of the application-level QoE KQIs at the user end [239]-[247]. However, few works consider a cross/multi-layer model approach towards QoE monitoring [239], [240]. Therefore, future research on cross/multi-layer QoE monitoring solutions is needed. The collection of QoE-related KQIs from the customer premises/user device/terminal also raises concerns about user privacy and security, which need further research in the domain of QoE monitoring [2], [226], [248]. There are few efforts in the state-of-the-art where QoE KQIs are extracted from encrypted OTT video streaming sessions [248]-[253]. However, further research in this direction in line with user privacy and security for QoE monitoring is needed.
The deployment of QoE-aware network management approaches using network enabling technologies requires standardized secure interfaces for QoE monitoring and the exchange of information from the end-user equipment, among Virtual Network Functions (VNFs), among inter/intra slice SDN controllers, and among ISPs/MNOs and OTT providers [215], [254], [255]. The extension of WebSocket-based interfaces and Remote Procedure Call (RPC) protocols can support the development of new standardized interfaces for data monitoring and information exchange to deploy data-driven network management. Additionally, finding the optimal trade-off between the control plane traffic and the QoE monitoring frequency for optimization in 6G and beyond remains an open challenge. Future research in the direction of ultra-compressed data representation formats can decrease the volume of control data [241], [256]. The QoE monitoring frequency depends on the time interval of optimization, the accuracy/data requirements of the optimization algorithm(s), and the time period for the optimization [257]. For example, if the optimization is performed at the O-RAN near-real-time RIC, extremely low latency is required, which needs secure and fast information retrieval compared to optimization performed at the O-RAN non-real-time RIC [258], [259].
The virtualized and softwarized network infrastructure in 5G/6G networks allows flexible deployment of QoE monitoring solutions using SDN and NFV, but the security, reliability, scalability and placement of the QoE monitoring solutions remain open questions for the research community [226], [260]-[262]. Furthermore, the cost-effectiveness, in terms of OPerational EXpenditures (OPEX) and CAPital EXpenditures (CAPEX), of deploying QoE monitoring solutions in next-generation 5G/6G networks using softwarized/virtualized probes remains an open challenge [12], [218], [263]-[265].

TABLE 10. Summary of research challenges and future research directions.

QoE Management
Network enabling technologies such as SDN, NFV and MEC allow dynamic QoE-aware network resource allocation in next-generation networks for multimedia services. The deployment of QoE management solutions on top of the SDN controller/NFV MANO will deliver cost-effective optimization and automation of the network infrastructure. Indeed, the QoE-aware network and service management of multimedia applications requires information exchange among OTT and ISP. Therefore, future research requires the development of standardized and secure interfaces among OTT and ISP for sharing QoE KPIs, collaborative QoE-aware service management approaches, and the integration of ELAs into the service delivery/business model. Additionally, standardization efforts are required to include QoE management in softwarized and virtualized next-generation networks.

QoE Monitoring
In the literature, state-of-the-art works propose different passive/active QoE monitoring solutions using user-end probes, and few studies consider multi-layer QoE monitoring solutions. As QoE is a multidisciplinary concept, future research towards multi-layer QoE monitoring solutions in line with user privacy and security for monitoring QoE KPIs is needed. Furthermore, future research requires the extension of the existing interfaces/data representations for secure and ultra-compressed data monitoring, and finding an optimal trade-off between monitoring frequency and control plane traffic.

QoE Model Integration & QoE measurements
Most of the QoE models in the literature consider only application-level KQIs while ignoring the fact that QoE is a multidisciplinary concept. Therefore, future research should consider the development of cross-layer QoE models. Further research is required to propose long-term and more accurate QoE models that can be easily deployed in the network management framework enabled by SDN and NFV MANO in next-generation networks.
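The trade-off between QoE monitoring frequency and control plane traffic raised in this subsection can be made concrete with a back-of-the-envelope sketch. It assumes that every probe sends one fixed-size KQI report per reporting interval and that the operator sets a budget for the aggregate monitoring traffic. The function names, probe count, report sizes, and budget are hypothetical values chosen only for illustration.

```python
from typing import Optional, Sequence


def control_traffic_bps(num_probes: int, report_bytes: int, interval_s: float) -> float:
    """Aggregate monitoring traffic when each probe reports once per interval."""
    return num_probes * report_bytes * 8 / interval_s


def shortest_feasible_interval(intervals_s: Sequence[float], num_probes: int,
                               report_bytes: int, budget_bps: float) -> Optional[float]:
    """Most frequent reporting interval whose traffic stays within the budget."""
    for interval in sorted(intervals_s):
        if control_traffic_bps(num_probes, report_bytes, interval) <= budget_bps:
            return interval
    return None  # even the slowest reporting rate exceeds the budget


# Hypothetical deployment: 10,000 probes and a 5 Mbit/s monitoring budget.
INTERVALS_S = (0.1, 1.0, 5.0, 10.0)
plain_reports = shortest_feasible_interval(INTERVALS_S, 10_000, 200, 5e6)
compressed_reports = shortest_feasible_interval(INTERVALS_S, 10_000, 50, 5e6)
too_small_budget = shortest_feasible_interval(INTERVALS_S, 10_000, 200, 1e6)
```

Under these assumptions, shrinking the report from 200 to 50 bytes allows reporting five times more often within the same budget, which illustrates why compact data representation formats matter for optimization loops that require fresh KQIs.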

C. QOE MODEL INTEGRATION & QOE MEASUREMENTS
QoE model integration is one of the most crucial parts of QoE-aware multimedia service delivery in the future Internet, as QoE monitoring, measurement, management and optimization depend completely on it. For example, the QoE KQIs to be monitored by QoE monitoring solutions depend heavily on the parameters of the QoE model being used. Most QoE models in the literature consider only application-level KQIs while ignoring the fact that QoE is a cross-disciplinary field that also includes multiple influencing factors such as network, business and system (end-user device) factors [266], [267]. Therefore, future research is required to develop cross-layer QoE models that cover most of the factors influencing the user-perceived quality. Moreover, the video streaming QoE models found in state-of-the-art works are mainly developed for short time scales, leading to less accurate QoE prediction for long video streaming sessions. Hence, long/multi-time scale QoE model development for long video streaming sessions requires future research [268]-[270]. The deployment of QoE models on top of the SDN controller/NFV MANO is another open issue when it comes to network resource management in 5G/6G networks, as monitoring the KQIs needed to predict QoE effectively may require additional interfaces in the future network architecture and may create unnecessary control traffic overhead in the control plane [215], [241], [271]. Future research should consider the time-varying nature of QoE and the time scales of different network resource optimizations when developing new QoE models to be integrated into next-generation networks for QoE monitoring, measurement and management purposes, leading to E2E QoE-aware multimedia service delivery.

D. SUMMARY
This section discusses the research challenges and future research directions for QoE-aware network and service management of multimedia streaming in 5G/6G networks. Section VII-A highlights the major research challenges and opportunities related to QoE management of video streaming applications in 5G/6G networks, including QoE-aware network management using SDN/NFV, deployment options for QoE-aware network optimization, collaborative network and service management by ISP and OTT, QoE KQI information exchange among OTT and ISP, ELA-based service agreements, and network neutrality and beyond. Section VII-B investigates the QoE monitoring challenges and future research directions in next-generation networks, such as multi-layer user-end probe-based QoE monitoring solutions, QoE monitoring from encrypted OTT services and user privacy, secure interfaces for KQI information retrieval, the trade-off between control traffic and monitoring frequency, and the deployment of QoE monitoring solutions in virtualized and softwarized 5G/6G networks considering CAPEX/OPEX. Section VII-C discusses the QoE model integration challenges in future networks, including the development of cross/multi-layer and long/multi-time scale models considering different optimization intervals and different layers of the network/radio protocol stack for network resource management in 5G/6G networks. Table 10 provides the summary of the research challenges and future research directions of this section.

VIII. CONCLUSION
With the advancement of technologies such as adaptive HTTP streaming, the time dynamics of user QoE have become a crucial issue. In this paper, we provided a survey on video quality models and state-of-the-art works for global video quality measurement and CTVQ models. In addition, we highlighted ten research challenges and future directions that impact QoE modelling, monitoring, optimization, and management in the end-to-end multimedia service delivery chain involving next-generation networks. These research challenges are grouped under three categories: 1) Continuous Time-Varying QoE modelling, 2) QoE-aware video encoding and decoder optimization, and 3) QoE-aware network monitoring and service management.
Unlike predicting the overall quality of an image sequence, measuring time-varying subjective quality is a significant challenge. However, such a quality model is a must to achieve maximum QoE for end-users with emerging adaptive HTTP streaming applications. To create a predictive quality model considering the time-varying nature of the user-perceived quality, this paper first provides a discussion on state-of-the-art works considering spatial/temporal distortions, the memory effect of the HVS, the effects of hysteresis on QoE, and the use of pooling methods to combine spatial-temporal distortion measures. Moreover, a generic quality model is presented to account for the main factors affecting user QoE in time-varying application scenarios, followed by the research challenges for time-varying quality model development.
Next, we discuss research challenges and future directions in QoE-aware multimedia creation, encoding and decoder optimization. We identify image/video compression, encoder/decoder optimization, error resilience, and concealment-aware video encoding as major factors that affect end-user QoE in multimedia applications. With the emergence of novel video coding standards and multimedia technologies, their compression capabilities, complexity, and energy consumption have a significant impact on the end-user QoE. Hence, the adoption of machine learning technologies for QoE-aware video coding and green multimedia technologies that reduce overall energy consumption in the multimedia delivery chain are discussed. Furthermore, we emphasize the importance of QoE modelling and its integration in the application layer at the content creation stage. This helps compressed video signals become resilient to severe quality degradation due to transmission errors.
Finally, state-of-the-art works on QoE management, QoE monitoring and measurement, and model integration in 5G/6G networks are discussed, followed by their respective research challenges and future directions. We emphasize the importance of QoE-aware network optimization and collaborative network and service management in 5G/6G networks. We also discuss probe-based QoE monitoring and cross-layer model integration challenges in next-generation networks.
Thus, this paper presents a comprehensive review of state-of-the-art works on time-varying user-perceived quality (QoE) modelling for video streaming applications. It covers the full spectrum of QoE modelling across all stages of the multimedia service delivery chain, from content creation, encoding and decoding to network and service management, and discusses the future research challenges and directions at each stage.