No-Reference Video Quality Assessment Using Distortion Learning And Temporal Attention

The rapid growth of video consumption and multimedia applications has increased interest in academia and industry in building tools that can evaluate perceptual video quality. Since videos might be distorted when they are captured or transmitted, it is imperative to develop reliable methods for no-reference video quality assessment (NR-VQA). To date, most NR-VQA models in prior art have been proposed for assessing a specific category of distortion, such as authentic distortions or traditional distortions. Moreover, models developed for both authentic and traditional distortion video databases have so far performed poorly. This has made service providers reluctant to adopt multiple NR-VQA approaches, as they prefer a single algorithm capable of accurately estimating video quality in all situations. Furthermore, many existing NR-VQA methods are computationally complex and therefore impractical for various real-life applications. In this paper, we propose a novel deep learning method for NR-VQA based on multi-task learning where the distortion of individual frames in a video and the overall quality of the video are predicted by a single neural network. This enables training the network with a greater amount and variety of data, thereby improving its performance in testing. Additionally, our method leverages temporal attention to select the frames of a video sequence which contribute the most to its perceived quality. The proposed algorithm is evaluated on five publicly available video quality assessment (VQA) databases containing traditional and authentic distortions. Results show that our method outperforms the state-of-the-art on traditional distortion databases such as LIVE VQA and CSIQ video, while also delivering competitive performance on databases containing authentic distortions such as KoNViD-1k, LIVE-Qualcomm and CVD2014.


I. INTRODUCTION
For manufacturers and telecommunications service providers, the recent increase in video-driven data consumption has led to the challenge of delivering better video services. It has also created a pressing need to monitor and control video quality to maximize their benefits [1]. As a result, video quality assessment (VQA) has drawn increasing attention from researchers in the field. VQA, which aims to predict the perceptual quality of a video, remains a fundamental problem in many video processing tasks such as video acquisition, compression and transport [2]-[4]. As with image quality assessment (IQA), there are subjective and objective VQA approaches. Subjective VQA is the more reliable of the two; however, the high cost and complexity of preparing and running tests involving humans make this approach impractical for automated quality assessment. On the other hand, objective VQA uses computational models to predict the video quality in line with the perception of the human visual system (HVS). Existing objective VQA methods can be classified into full-reference VQA (FR-VQA) [5]-[8], reduced-reference VQA (RR-VQA) [9], [10] and no-reference VQA (NR-VQA) [11]-[15] based on the accessibility of the corresponding reference when estimating a video's quality. Compared to FR-VQA and RR-VQA, which require all or part of the information from reference videos, NR-VQA is highly beneficial as the reference video is not required [16]. Therefore, NR-VQA models are better suited for many practical applications, such as real-time monitoring of the received video quality at a streaming client.
According to recent studies, legacy NR-VQA models do not perform well on videos containing authentic or natural distortions, such as the videos in the KoNViD-1k dataset [17]. As in [18], we call authentic/in-capture distortions those occurring during acquisition, and traditional/post-capture distortions those generated in a controlled lab, such as compression and transmission distortions (e.g., packet loss). Authentic videos are also referred to as in-the-wild videos.
Consequently, there is a need to design approaches that can perform well on a broader range of data. To design a robust objective VQA method, it is thus important to consider the different types of distortions that can impact video quality. However, the complexity of temporal visual characteristics and content-dependent video compression artifacts make NR-VQA a very challenging task.
In this paper, we propose a novel deep learning NR-VQA method based on multi-task learning and temporal attention to predict the video quality for both authentic and traditional distortion databases. Our method combines content-aware and distortion features extracted in two different CNN branches of the network (see Fig. 1) and incorporates them into a Gated Recurrent Unit (GRU) network coupled with a temporal attention mechanism.
Additionally, we perform an ablation study to further validate our approach and verify the advantage of its main components. Finally, we evaluate our model's computational complexity and observe a good trade-off between high accuracy and computational efficiency for our deep learning method.
The main contributions of this work are as follows:
• We present a novel deep learning approach for objective NR-VQA, based on multi-task learning where the distortion of individual frames in a video and the overall quality of the video are predicted by a single neural network. Our approach also leverages temporal attention to select the frames of a video sequence which contribute the most to its perceived quality. Compared to recent models such as CNN-TLVQM [23] and RAPIQUE [24], which mix hand-crafted and CNN feature extraction, our method uses a deep learning strategy for extracting all features. Moreover, while recent models such as CoINVQ [25] use several branches for feature extraction and aggregate temporal features using average pooling, our method uses a GRU that can better model temporal dynamics in the video and their impact on overall quality. Experiments showed that standard pooling strategies like average pooling are not well suited for distortions that are not temporally uniform, such as transmission errors.
• We propose a special pooling designed as a weighted sum of mean and attention pooling. This enhanced pooling mechanism considers both temporally-local quality and global quality. This is achieved via a novel combination of attention-based pooling, focusing on frames having a greater impact on perceived quality, and average pooling. The benefit of our pooling mechanism, compared to three other techniques that reflect the human judgement of quality (minimum, temporal and recency pooling), is confirmed in Table 5 of this manuscript.
• We introduce a distortion network designed as a complementary feature extraction branch to improve the video quality prediction, especially in the case of traditional distortions. Relative to the state of the art, we present one of the first deep learning models for a range of databases containing traditional and authentic distortions. Recent models are mostly designed for authentic video distortions and do not take into consideration traditional video distortions such as those related to wireless transmission. The authors of [18] have recently designed a hand-crafted model for a wide variety of databases containing traditional and authentic distortions, but the performance of their model on authentic databases is not competitive with state-of-the-art approaches.
• The proposed method achieves state-of-the-art performance for both authentic and traditional distortion databases, outperforming existing approaches for traditional distortion databases (LIVE VQA, CSIQ video) while also providing accuracy on par with top-ranked approaches for authentic distortion databases (CVD2014, KoNViD-1k and LIVE-Qualcomm).
The rest of this paper is organized as follows. In Section II, we present related works. In Section III, we describe our proposed NR-VQA method. Our experiments and results are presented and discussed in Section IV. Finally, in Section V, we conclude and suggest some future works.

II. RELATED WORKS
Recent NR-VQA methods, which are mostly learning-based, can be roughly divided in two groups: those based uniquely on spatial image-level features and those that also account for temporal information between the frames in the video.
Image-based NR-VQA methods share roots with image quality assessment methods and thus involve the analysis of natural scene statistics (NSS) [26], [27]. The supporting theory of NSS is that certain statistical properties of natural images are highly related to how the HVS processes these images; thus, image quality can be obtained by measuring the deviations from these statistics [28]. These image-level approaches have been extended to videos by evaluating such statistics at the frame level and aggregating them to get a quality score for the entire video.
Examples of such approaches include Naturalness Image Quality Evaluator (NIQE) [11], COdebook Representation for No-Reference Image Assessment (CORNIA) [29], Blind/Referenceless Image Spatial QUality Evaluator (BRISQUE) [15], Feature maps based Referenceless Image QUality Evaluation Engine (FRIQUEE) [30], and High dynamic range Image GRADient based Evaluator (HIGRADE) [31].
On the other hand, few approaches in the literature directly take into consideration the temporal aspect of videos. The best known learning-based model in this domain is Video BLIINDS (V-BLIINDS) [32], which extends the image-based metrics designed for NIQE [11] by adding temporal motion information and time-frequency characteristics of the video. The temporal features are extracted using block-based motion estimation and Discrete Cosine Transform (DCT) coefficients computed from frame differences. This approach has been referred to as the baseline method against which most NR-VQA methods are compared. Another well-known machine-learning NR-VQA model is the Video COdebook Representation for No-reference Image Assessment (V-CORNIA) [33]. This frame-feature learning approach uses Support Vector Regression (SVR) to first predict quality at the frame level, and then applies temporal pooling on frame-level qualities to obtain the overall video-level score.
Recently, Deep Neural Networks (DNN) have been applied to NR-VQA. For example, SACONVA [34] uses a three-dimensional (3D) shearlet transform to extract frame-level features, which allows capturing spatio-temporal quality features. A Convolutional Neural Network (CNN) and a logistic regression model are then employed respectively to expand these features and obtain the quality scores. The COnvolutional neural network and Multi-regression based Evaluation (COME) [35] approach separates the problem of extracting spatio-temporal quality features into two parts. First, a CNN is used on the CSIQ database to extract spatial quality features, based on max pooling and the standard deviation of activations in the final layer. Temporal quality features are then obtained as standard deviations of motion vectors in the video. Lastly, the predictions of two SVR models are combined with a Bayes classifier to predict the final quality score. The Video Multi-task End-to-end Optimized neural Network (V-MEON) [36] approach, which is the video version of MEON [37], predicts video quality with a multi-task framework that jointly estimates the perceptual quality of a video and predicts its codec type using spatio-temporal features extracted from a 3D CNN.
Two recently-developed methods against which most methods are compared, VSFA [13] and TLVQM [38], are used as state-of-the-art baselines in our study. VSFA [13] extracts content-aware features from a CNN pre-trained on ImageNet [39] and uses a GRU to model the long-term dependencies between different frames in the video. Additionally, a subjectively inspired temporal pooling model is proposed to consider the hysteresis effect observed in human judgments of time varying video quality [40]. Our proposed method differs from VSFA in two important ways. First, while VSFA predicts the video quality directly, we employ a multi-task learning strategy where the distortion type of individual images is predicted in a separate branch of the network. As in self-supervised representation learning approaches [41], this branch is trained with easy-to-obtain labels, i.e. images corrupted with different types of distortion, to learn a representation in a pre-training phase. In a second phase, this image-wise representation is combined with the features computed in another branch of the network for predicting the quality of the whole video. Compared to VSFA, which uses a pre-defined strategy to combine the quality scores of separate frames, our method uses learned attention to find frames in the sequence having the greatest impact on perceived quality.
The Two Level Video Quality Model (TLVQM) [38] adopts a hierarchical feature extraction approach for predicting the video quality. Specifically, two types of features are extracted: low complexity features characterizing global information of the full sequence, and high complexity features such as spatial activity, exposure or sharpness, which are extracted from a small representative subset of frames. Compared to TLVQM, where spatial and temporal features are extracted together using hand-crafted techniques, our method learns the spatial and temporal characteristics of videos with two neural network branches.
The above-mentioned studies have focused on a specific type of distortion and trained models on a single category of distortion databases. For example, recent state-of-the-art methods like VSFA and TLVQM are designed specifically for in-capture distortion databases and tested only on this type of data. This limits the adoption of these algorithms in application settings where distortion can arise during the acquisition as well as transmission processes. Thus, there is a need to develop novel NR-VQA algorithms that consider all types of distortions, i.e., in-capture/authentic as well as post-capture/traditional distortions. Apart from the two state-of-the-art baselines used in this study, we compared the performance of our method to some recently published NR-VQA approaches such as RAPIQUE [24], PVQ [42], CoINVQ [25] and CNN-TLVQM [23], where the latter is an improvement of the TLVQM model [38].

III. PROPOSED METHOD
In this section, we present the details of our deep learning NR-VQA method, which combines content-aware and distortion features. These combined features are sent to a fully-connected (FC) linear layer that reduces their dimensionality, and are then fed to a GRU network which models the temporal inter-dependencies. Finally, a temporal attention mechanism and a traditional average pooling are coupled to this GRU network to select the frames of a video sequence that contribute most to the perceived quality and to improve the robustness of our model. The overall architecture of the proposed method is shown in Fig. 1. We detail each part in the following sections.

A. FEATURE EXTRACTION
Recent studies on image distortion and the HVS have motivated our approach for feature extraction. Comparing NR-IQA to NR-VQA models, deep learning is widely used for the former task while only a few NR-VQA deep learning models have been proposed to date [43]. Moreover, improvements are often observed when combining the distortion and content-aware features for NR-IQA [37], [44] and NR-VQA [73]. Thus, we take advantage of these studies and extract these two types of features in our study.

1) Distortion Features
It is well known that image distortion strongly affects the final quality score. The more severe the image distortion is, the lower the quality score will be. Recently, it was revealed that deep neural network (DNN) features are distortion-sensitive [45], [46], and NR-IQA/VQA methods began to incorporate networks for predicting distortion in their model [37], [47], [73]. Additionally, it was shown that DNN layers of increasing depth learn features of growing complexity. Hence, features computed in the first layers typically resemble the output of Gabor filters or color blobs, while features in deeper layers correspond to semantic entities such as circular objects with a specific texture or even faces [48].
To implement our distortion prediction network, we leverage the high performance of CNNs trained on ImageNet [39], and choose the ResNet-50 architecture [49] as backbone network. We also adopt a transfer learning strategy and fix the parameters of the first layer of the pre-trained ResNet-50, training only the following layers to learn the type of distortion. This fine-tuning strategy, which reuses visual features from early layers, accelerates the training and also leads to better generalization.
We employ a self-supervised approach to learn a useful representation from low-cost labels in a pre-training phase. We train our distortion network with the most apparent distortion database (CSIQ), in contrast to the work of [73], where the authors used in-capture/authentic distortion data to train their model. We selected this image database because of its excellent results with Most Apparent Distortion (MAD), i.e., what is most apparent to the human observer [50]. The distortions used in the CSIQ image database are called the most apparent distortions and are JPEG compression, JPEG-2000 compression, global contrast decrements, additive pink Gaussian noise, additive white Gaussian noise, and Gaussian blurring. The CSIQ image database contains 30 reference images distorted with six types of distortions, each at four to five different levels of distortion. Moreover, we tested our model with the KADID-10k [51] image database (which contains 10,125 distorted images grouped in 25 distortion types) and obtained almost the same performance as with the CSIQ image database. KADID-10k contains distortions selected from the TID2013 [52] image database and some new authentic distortions. However, as supported by studies from MLSP [53], CNNs pre-trained on ImageNet are already robust enough to predict the quality of videos impacted by authentic distortions. Our distortion network is trained to predict the type of distortion that was applied to an input image. As mentioned above, distorted images can be generated in large quantity and at almost no cost compared to having humans rate the quality of videos. Since the distortion in images affects their perceived quality, we use the representation learned in the self-supervised pre-training phase to boost the learning of the downstream NR-VQA task. Toward this goal, we truncate the distortion prediction network at the last convolutional layer and use the output of this last layer as additional features for the NR-VQA network.
Similar to [73], during training, we do not resize the input images to avoid introducing additional artifacts. Thus, our network is trained on images having the same resolution as those used for collecting the subjective quality in the CSIQ image database [50]. Fig. 2 shows the architecture of the distortion prediction network. Additional details are provided in Section IV-A2.
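The idea of obtaining distortion-type labels at almost no cost can be illustrated with a small sketch. It is not the CSIQ generation pipeline: it works on a toy grayscale grid, covers only three distortion families, substitutes a 3 × 3 box blur for Gaussian blurring, and all function names and parameter values are hypothetical.

```python
import random

# Class index = position in this tuple (hypothetical label scheme).
DISTORTIONS = ("white_noise", "contrast_decrement", "blur")


def clip(value):
    """Keep pixel intensities inside [0, 1]."""
    return min(1.0, max(0.0, value))


def white_noise(image, rng, sigma=0.1):
    """Additive white Gaussian noise with standard deviation sigma."""
    return [[clip(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def contrast_decrement(image, factor=0.5):
    """Shrink every pixel toward the global mean intensity."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[clip(mean + factor * (p - mean)) for p in row] for row in image]


def box_blur(image):
    """3 x 3 box blur (a simple stand-in for Gaussian blurring)."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(height, y + 2))
                      for i in range(max(0, x - 1), min(width, x + 2))]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


def make_labeled_samples(image, seed=0):
    """Return (distorted_image, class_index) pairs for one reference image."""
    rng = random.Random(seed)
    distorted = {
        "white_noise": white_noise(image, rng),
        "contrast_decrement": contrast_decrement(image),
        "blur": box_blur(image),
    }
    return [(distorted[name], label) for label, name in enumerate(DISTORTIONS)]


# Toy 4 x 4 checkerboard used as the reference image.
reference = [[((x + y) % 2) * 1.0 for x in range(4)] for y in range(4)]
samples = make_labeled_samples(reference)
```

Each pair could then feed a classifier trained with a cross-entropy loss, with the label coming directly from the distortion that was applied rather than from a human rating.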

2) Image Content-Aware Features
Numerous studies have shown that the human judgment of visual video quality is content related. For example, it was found that two compressed images with the same compression ratio may have different subjective quality if they contain different scenes [54], [55]. Hence, it is important to take into consideration the content of images when designing the NR-VQA model. Furthermore, our ablation study also confirms the improvement brought by the content-aware network. As in the distortion prediction network, we use the ResNet-50 network pre-trained on ImageNet as backbone for extracting our content-aware features. For a given video containing $T$ frames, we feed a frame $I_t$ ($t = 1, \ldots, T$) simultaneously to the distortion prediction and content-aware CNN networks. The output feature vectors $f^{dist}_t$ and $f^{cont}_t$ from each of these CNNs are obtained by truncating the networks to the last convolutional block (see Fig. 2) and applying spatial global average (mean) pooling. This generates a representation of size m × n × 2048, where 2048 is the number of feature maps and m = n = 1 after pooling:
$$f^{dist}_t = \mathrm{GP}\big(\mathrm{CNN}_{dist}(I_t)\big), \qquad f^{cont}_t = \mathrm{GP}\big(\mathrm{CNN}_{cont}(I_t)\big).$$
Here, GP is the global average pooling, $\mathrm{CNN}_{dist}$ and $\mathrm{CNN}_{cont}$ denote the truncated distortion prediction and content-aware networks, and t ∈ {1, . . . , T}. Finally, we concatenate the distortion features $f^{dist}_t$ and the content-aware features $f^{cont}_t$. The resulting feature vector $f^{dc}_t$ is thus obtained as
$$f^{dc}_t = f^{dist}_t \oplus f^{cont}_t,$$
where ⊕ is the concatenation operator and t ∈ {1, . . . , T}.
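The pooling and concatenation steps described above can be sketched in a few lines. This is only an illustration on toy data: the feature maps are hand-written lists rather than ResNet-50 outputs, three channels stand in for the 2048 feature maps, and the function names are hypothetical.

```python
def global_average_pool(feature_maps):
    """Average each channel's m x n map down to a single value.

    feature_maps: list of channels, each a list of rows of floats.
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def concat_features(f_dist, f_cont):
    """Concatenate distortion and content-aware vectors (the ⊕ operator)."""
    return list(f_dist) + list(f_cont)


# Toy feature maps: 3 channels of size 2 x 2 per branch.
dist_maps = [[[1.0, 3.0], [5.0, 7.0]],
             [[0.0, 0.0], [0.0, 4.0]],
             [[2.0, 2.0], [2.0, 2.0]]]
cont_maps = [[[1.0, 1.0], [1.0, 1.0]],
             [[0.0, 2.0], [4.0, 6.0]],
             [[8.0, 0.0], [0.0, 0.0]]]

f_dist = global_average_pool(dist_maps)   # one value per distortion channel
f_cont = global_average_pool(cont_maps)   # one value per content channel
f_dc = concat_features(f_dist, f_cont)    # length = sum of both channel counts
```

With 2048 feature maps per branch, the same concatenation would yield a 4096-dimensional vector per frame, which motivates the dimension reduction applied before the GRU.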

B. MODELING TEMPORAL EFFECTS
Unlike in IQA, another important challenge in designing the VQA model is effectively modeling the temporal information of videos. In this study, we achieve this by implementing two separate techniques. First, we use GRU layers to capture long-term dependencies between frames in the video. Second, we employ a temporal attention mechanism to select the most relevant frames for predicting the overall video quality.

1) Temporal Modeling
Recurrent Neural Networks (RNNs) have shown great potential for tackling various sequence modeling tasks in machine learning [56]. In our study, we select the GRU model, which is a simplified version of the Long Short-Term Memory (LSTM) model. A GRU merges the input and forget gates of the LSTM into a single update gate. Thus, GRUs have fewer parameters than LSTMs, which makes their training easier and lowers the computational requirements [57]. Like LSTMs, GRUs can alleviate the vanishing and exploding gradient problems of the traditional RNN model. Since the features extracted and combined from the two CNN branches (i.e., $f^{dc}_t$) are of high dimension, they cannot be used directly as input to the GRU network. To alleviate this problem, we perform a dimension reduction step using a linear (fully-connected) operation:
$$x_t = W_{xf} f^{dc}_t,$$
where $W_{xf}$ are the learned parameters of the linear model. After this dimensional reduction to a size of 128, the features $x_t$ ($t = 1, \ldots, T$) are sent to the GRU. We consider the hidden states of the GRU as the integrated features, where the initial state is given by $h_0$ and the previous state by $h_{t-1}$. The current hidden state $h_t$ is computed as
$$z_t = \sigma(W_{zx} x_t + U_{zh} h_{t-1}),$$
$$r_t = \sigma(W_{rx} x_t + U_{rh} h_{t-1}),$$
$$c_t = \tanh\big(W_{cx} x_t + U_{ch} (r_t \otimes h_{t-1})\big),$$
$$h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes c_t,$$
where σ is the sigmoid function and ⊗ is an element-wise multiplication operator. $z_t$, $r_t$ and $c_t$ are respectively the update gate, reset gate and candidate activation. $W_{zx}$, $W_{rx}$, $W_{cx}$, $U_{zh}$, $U_{rh}$ and $U_{ch}$ are the related weight matrices. The GRU captures the temporal dependencies among the features extracted from each frame of the sequence. At each time step, the GRU receives the features $x_t$ as input and outputs its hidden state. The GRU output can be seen as a selective memory of past hidden states, which underlines the long-term dependencies of the final output. Moreover, the spatial and temporal correlations are jointly learned through optimization. Thus, the quality of each frame is predicted from the spatio-temporal features.
In our study, for simplicity in terms of computations and memory, we select the standard GRU with a single layer, i.e., no stacked GRU. The hidden size, which is the amount of information stored, is set to 32.
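To make the recurrence concrete, the following sketch runs a linear projection followed by a single-layer GRU over a short sequence of frame features. It rests on several assumptions: the bias-free gate equations and the convention that the update gate weights the candidate activation are the commonly used GRU form rather than details confirmed by the text, the weights are random rather than learned, and the toy sizes replace the 128-dimensional input and 32-dimensional hidden state used in the paper.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def gru_step(x_t, h_prev, p):
    """One GRU update (bias-free form, assumed)."""
    # Update gate z_t and reset gate r_t.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_zx"], x_t), matvec(p["U_zh"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_rx"], x_t), matvec(p["U_rh"], h_prev))]
    # Candidate activation c_t uses the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(a + b) for a, b in zip(matvec(p["W_cx"], x_t), matvec(p["U_ch"], gated))]
    # Interpolate between the previous state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, c)]


def run_temporal_model(frame_features, reduced_size, hidden_size, seed=0):
    """Project each frame feature, then run the GRU over the sequence."""
    rng = random.Random(seed)
    input_size = len(frame_features[0])
    W_xf = random_matrix(reduced_size, input_size, rng)  # dimension reduction
    params = {name: random_matrix(hidden_size, reduced_size, rng)
              for name in ("W_zx", "W_rx", "W_cx")}
    params.update({name: random_matrix(hidden_size, hidden_size, rng)
                   for name in ("U_zh", "U_rh", "U_ch")})
    h = [0.0] * hidden_size  # initial state h_0 (zeros assumed)
    states = []
    for f_dc in frame_features:
        x_t = matvec(W_xf, f_dc)
        h = gru_step(x_t, h, params)
        states.append(h)
    return states


# Toy run: T = 5 frames, 8-dim features, reduced to 6, hidden size 4.
data_rng = random.Random(1)
frames = [[data_rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
hidden_states = run_temporal_model(frames, reduced_size=6, hidden_size=4)
```

Because each new state is a convex combination of the previous state and a tanh output, every hidden value stays strictly inside (-1, 1) when the initial state is zero.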
Finally, the attention mechanism described in the next subsection is used through the video sequence to select important frames for the quality prediction.

2) Attention Mechanism and Pooling
Soft attention [58] has been used with great success in various vision tasks such as image captioning and emotion classification [59]-[61]. Following this principle, we add a temporal attention mechanism to the GRU network, which estimates the importance of each frame for predicting the overall perceptual video quality. This is achieved by computing attention weights $\alpha_t$ that define how much each frame should be considered in the final output.
The predicted quality $q_t$ for each frame $t$ is then calculated as
$$q_t = W_{qh} h_t + b_q,$$
where $W_{qh}$ and $b_q$ are respectively the weight and bias parameters, which are jointly learned with all the other components of the system. Denoting as $T$ the total number of frames, an attention weight $\alpha_t$ is computed for each frame at instant $t$.
A common problem with attention mechanisms is that they often focus on a limited set of temporal or spatial characteristics, which can make the model less robust in terms of generalization performance when the contents of the training and testing videos are quite different. To improve our model's robustness, we combine temporal attention with a standard mean pooling strategy which considers all frames in the video. For the $T$ frames in the video, the overall video quality $Q$ is finally calculated as
$$Q = \beta \sum_{t=1}^{T} \alpha_t q_t + (1 - \beta) \frac{1}{T} \sum_{t=1}^{T} q_t,$$
where β ∈ [0, 1] determines the relative importance of temporal attention and mean pooling. In our experiments, we empirically set β = 0.5, giving equal importance to these two techniques.
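A minimal sketch of the combined pooling is given below. Two points are assumptions rather than details taken from the text: the attention weights are obtained with a softmax over per-frame scores (here called `attention_logits`, a hypothetical name), and β is taken to weight the attention term while 1 − β weights the mean term.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_quality(frame_scores, attention_logits, beta=0.5):
    """Blend attention pooling and mean pooling of per-frame scores q_t."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    alphas = softmax(attention_logits)  # assumed form of the attention weights
    attention_pool = sum(a * q for a, q in zip(alphas, frame_scores))
    mean_pool = sum(frame_scores) / len(frame_scores)
    return beta * attention_pool + (1.0 - beta) * mean_pool


# Four frames; the third is badly degraded and receives most of the attention.
q = [0.8, 0.7, 0.2, 0.75]
logits = [0.0, 0.0, 2.0, 0.0]
Q = pooled_quality(q, logits, beta=0.5)
mean_only = pooled_quality(q, logits, beta=0.0)
```

In this toy case the attention term pulls the video score below the plain mean, which is the behavior one would want for a short but severe transmission error; with uniform attention scores the result reduces to mean pooling for any β.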

IV. EXPERIMENTAL RESULTS
This section first describes the experimental setup, including the VQA databases and evaluation criteria as well as the implementation details of our method. To validate the proposed method, we then conduct four experiments: comparison on individual databases, cross-database evaluation, ablation study and computational efficiency analysis.

A. EXPERIMENTAL SETUP

1) Databases
Five publicly available databases, namely CVD2014 [19], KoNViD-1k [17], LIVE-Qualcomm [20], LIVE VQA [21] and CSIQ video [22], are employed to validate the performance of the proposed method. We present the characteristics of each one in this subsection.
… [50]). The average differential MOS ranges from 14.48 to 82.80.
The performance of our proposed model is evaluated on the datasets described above. Similar to the state-of-the-art methods, it is evaluated in terms of SROCC (Spearman Rank Order Correlation Coefficient), PLCC (Pearson's Linear Correlation Coefficient), and RMSE (Root Mean Square Error). As recommended by the Video Quality Expert Group (VQEG) [64] for adjusting scaling and non-linearity effects between subjective scores and objective scores, PLCC and RMSE are calculated after performing a non-linear logistic fitting between subjective scores ($s$) and objective scores ($o$). The non-linear transform $f(o)$ used in this study is given by
$$f(o) = \tau_2 + \frac{\tau_1 - \tau_2}{1 + \exp\left(-\frac{o - \tau_3}{\tau_4}\right)}.$$
The parameters $\tau_1$ to $\tau_4$ are fitting parameters initialized with $\tau_1 = s_{max}$, $\tau_2 = s_{min}$, $\tau_3 = \mu_o$ and $\tau_4 = \sigma_o / 4$, where $s_{min}$ and $s_{max}$ are the minimum and maximum subjective scores, and $\mu_o$ and $\sigma_o$ are the mean and standard deviation of the objective scores.
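The three evaluation criteria and the logistic mapping can be sketched with the standard library alone. The sketch assumes the common four-parameter logistic form and a population standard deviation for the initialization; it evaluates the mapping at its initial parameters only, since the least-squares fit itself would need an optimizer that is omitted here. The score values are made up for illustration.

```python
import math


def rank(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def plcc(a, b):
    """Pearson's linear correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def srocc(a, b):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return plcc(rank(a), rank(b))


def rmse(a, b):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def logistic(o, t1, t2, t3, t4):
    """Four-parameter logistic mapping of an objective score o (assumed form)."""
    return t2 + (t1 - t2) / (1.0 + math.exp(-(o - t3) / t4))


def initial_parameters(subjective, objective):
    """Initialization described in the text: s_max, s_min, mu_o, sigma_o / 4."""
    n = len(objective)
    mu = sum(objective) / n
    sigma = math.sqrt(sum((o - mu) ** 2 for o in objective) / n)
    return max(subjective), min(subjective), mu, sigma / 4.0


# Made-up subjective (s) and objective (o) scores.
s = [20.0, 35.0, 50.0, 65.0, 80.0]
o = [0.10, 0.30, 0.45, 0.70, 0.90]
params = initial_parameters(s, o)
mapped = [logistic(x, *params) for x in o]
```

SROCC is computed on the raw objective scores because it depends only on rank order, whereas PLCC and RMSE would be computed between `s` and the mapped scores once the parameters have been fitted.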

2) Implementation Details
Our model is implemented using the PyTorch framework [65] and comprises two training blocks: a feature extraction block and a temporal modeling block (see Fig. 1). The feature extraction block contains two CNN branches, the content-aware CNN and the distortion prediction CNN. We use a ResNet-50 pre-trained on ImageNet as backbone for both CNNs. The content-aware features are extracted using the content-aware CNN. The distortion prediction network is trained with the CSIQ image database to estimate the type of distortion, keeping the parameters of the first layer (conv1) fixed. We use cross-entropy as loss function and the Adam optimizer [66], training the model for 200 epochs with a learning rate of 0.0001. The temporal modeling block receives as inputs the concatenated features of the two CNNs (content-aware and distortion) and is trained to estimate the quality score of the entire video. Inside this temporal modeling block, a dimension reduction step is first performed using a linear projection. These reduced-size features are then fed to the GRU for estimating the quality score of each video frame. Finally, a temporal attention mechanism and a mean pooling are employed to aggregate the scores predicted for each frame. To train the temporal modeling block, we use an L1 loss between the aggregated score predicted for the video and the ground-truth score, and employ the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 4.
The GRU in our model has a single layer and a hidden state size of 32. The ground-truth MOS are scaled in the range [0, 1] using the min-max scaling.
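The target scaling and training loss mentioned above can be sketched as follows. Averaging the absolute errors over the batch is an assumption about the reduction used, and the numeric values are hypothetical.

```python
def min_max_scale(scores):
    """Map ground-truth MOS values linearly onto [0, 1]."""
    low, high = min(scores), max(scores)
    if high == low:
        raise ValueError("scores must not all be equal")
    return [(s - low) / (high - low) for s in scores]


def l1_loss(predicted, target):
    """Mean absolute error between predicted and ground-truth scores."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical raw MOS values for a batch of four videos.
mos = [1.0, 3.0, 5.0, 2.0]
targets = min_max_scale(mos)
predictions = [0.1, 0.4, 0.9, 0.25]  # made-up network outputs
loss = l1_loss(predictions, targets)
```

Scaling the targets keeps the regression range identical across databases whose raw MOS scales differ.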

B. PERFORMANCE ON INDIVIDUAL DATABASE
To facilitate the analysis, we separate the results on the five databases into two groups based on the categories of distortions, i.e. authentic and traditional distortions databases. We used 80% of the data for training and the remaining 20% for testing.
For a fair comparison with the state-of-the-art methods selected in this study, following [38], we run 100 different random splits with 80% of the sequences for training and 20% for testing in each simulation. The same random splits were used to evaluate all tested methods, and we followed the same procedures as used in our baselines for training our model.
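The repeated random split protocol can be sketched as below. The fixed seed is an assumption used to show how the same splits could be reused across methods; the database size and the per-split scores in the usage lines are made up.

```python
import random
import statistics


def random_splits(num_videos, num_runs=100, train_ratio=0.8, seed=0):
    """Generate reproducible train/test index splits."""
    rng = random.Random(seed)
    n_train = int(round(train_ratio * num_videos))
    splits = []
    for _ in range(num_runs):
        indices = list(range(num_videos))
        rng.shuffle(indices)
        splits.append((sorted(indices[:n_train]), sorted(indices[n_train:])))
    return splits


def summarize(scores):
    """Mean, standard deviation and median of per-split scores."""
    return statistics.mean(scores), statistics.stdev(scores), statistics.median(scores)


# Toy database with 50 videos.
splits = random_splits(num_videos=50)
train_idx, test_idx = splits[0]

# Made-up per-split SROCC values, only to show the aggregation step.
example_scores = [0.70, 0.75, 0.80]
mean_score, std_score, median_score = summarize(example_scores)
```

Generating the splits once from a fixed seed and storing them lets every compared method be trained and tested on exactly the same partitions.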

1) Authentic VQA databases
In authentic VQA databases, the distortions are introduced by the camera and the processing software during capture. These types of distortions include blurriness, insufficient color representation, over/under-exposure, focus, sharpness and stabilization related distortions. Three authentic distortion databases, namely CVD2014 [19], KoNViD-1k [17] and LIVE-Qualcomm [20], are selected in this part. We compared the performance of our proposed method with popular state-of-the-art methods: NIQE [11], BRISQUE [15], V-BLIINDS [32], HIGRADE [31], FRIQUEE [30], TLVQM [38] and VSFA [13]. Additionally, CNN-TLVQM [23], which is an improvement of TLVQM [38], recently reported top performance on the KoNViD-1k database. We included this method in all our comparison tables. For other recently published methods such as RAPIQUE [24], PVQ [42] and CoINVQ [25], we only reported their performance on KoNViD-1k in this section, as the authors did not report their performance on the other databases.
Also, to have a fair comparison, we re-simulated the VSFA method using the 80:20 split adopted in our study. TLVQM and the other state-of-the-art methods were already simulated using this split. We also re-simulated CNN-TLVQM on the three authentic databases using the authors' pre-trained CNN model and MATLAB R2020a [67]. The source code of MLSP [53] is not publicly available, thus we cannot reproduce their model with the 80:20 split adopted in our work. As in recent studies, we evaluated the performance of our proposed method in terms of SROCC, PLCC and RMSE. In Table 1, we report the mean performance (and standard deviation) of the compared methods for each database. We also computed the overall performance of the tested methods following the strategy of the original VSFA paper [13], where the performances for each database are combined using a weighted average, and the weight of a database is proportional to its number of videos. As can be seen, our proposed method achieves the second-best overall performance in terms of prediction correlation and accuracy (SROCC, PLCC and RMSE), not far behind CNN-TLVQM, which yields the best overall performance, and with SROCC and PLCC comparable to those of VSFA. Although our method and CNN-TLVQM present a similar RMSE, CNN-TLVQM gives slightly higher SROCC and PLCC than our approach. Moreover, on the KoNViD-1k database, RAPIQUE has a performance comparable to our method in terms of SROCC, while presenting better PLCC and RMSE. In summary, from Table 1, our method achieves SROCC, PLCC, and RMSE results placing it among the top three compared methods. Thus, our approach is competitive with the state-of-the-art methods on authentic distortion databases.
VOLUME 4, 2016
TABLE 1. Performance results on in-capture distortion databases. In each column, the best and second-best values are respectively marked in boldface and underlined. Note that * are performances taken from paper [38] and † from the methods' original papers. Other results were reproduced using the authors' code.
[Table body not recovered; surviving column headers: Method, Overall Performance, CVD2014.]

2) Traditional VQA databases
The distortions found in the traditional VQA databases are introduced during a compression or transmission process, which is why they are also called post-capture distortion databases. In this study, we have selected CSIQ video [22] and LIVE VQA [21] as post-capture databases. We used the same procedure as with the authentic databases to perform the evaluation, i.e. running 100 different random splits with an 80:20 ratio between training and test examples. For comparison, we selected state-of-the-art methods from the literature that performed well on traditional VQA databases, such as V-BLIINDS [32], SACONVA [34], V-MEON [36], VIIDEO [12] and NIQE [11], together with CNN-TLVQM, RAPIQUE and our proposed method.
As the source codes for SACONVA and V-MEON are not publicly available, we only reported the results from their articles. Table 2 gives the performance of tested methods in terms of median SROCC and PLCC.
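The repeated random-split protocol described above can be sketched as follows. This is a simplified illustration, not our actual evaluation code: `train_and_score` is a hypothetical callable that trains a model on the training indices and returns a score (for example SROCC) on the test indices, and the sketch splits at the level of individual videos, without grouping videos by source content.

```python
import random
import statistics


def random_split(num_videos, train_ratio=0.8, rng=None):
    """Shuffle the video indices and split them into training and test sets."""
    rng = rng or random.Random()
    indices = list(range(num_videos))
    rng.shuffle(indices)
    cut = int(round(train_ratio * num_videos))
    return indices[:cut], indices[cut:]


def evaluate_over_splits(num_videos, train_and_score, num_splits=100, seed=0):
    """Run `train_and_score(train_idx, test_idx)` on repeated random splits.

    Returns the median, mean and standard deviation of the returned scores.
    """
    rng = random.Random(seed)  # fixed seed so the splits can be reproduced
    scores = []
    for _ in range(num_splits):
        train_idx, test_idx = random_split(num_videos, rng=rng)
        scores.append(train_and_score(train_idx, test_idx))
    return {
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
    }
```

Fixing the seed makes it possible to reuse exactly the same splits for every compared method, so that differences in the reported statistics are not caused by the partitioning.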
We observe that our method outperforms all other approaches by a large margin on both the CSIQ video and LIVE VQA databases. For CSIQ video, our method gives SROCC and PLCC improvements of 0.05 and 0.04, respectively, compared to the second-best approach, SACONVA. Likewise, for LIVE VQA, it achieves a boost of 0.03 in SROCC and PLCC with respect to the second-ranked approach, which is again SACONVA. As expected, approaches designed for this type of database, in particular SACONVA, V-BLIINDS and V-MEON, perform better than VSFA, TLVQM, CNN-TLVQM and RAPIQUE. We also observe a good performance improvement for CNN-TLVQM compared to TLVQM on the traditional distortion databases, while the performance of RAPIQUE is not satisfactory.
Unlike these approaches, our method performs very well on traditional and authentic VQA databases due largely to its self-supervised representation learning step based on distortion prediction. Specifically, for traditional distortions, our proposed method largely outperforms other state-of-the-art approaches.

C. PERFORMANCE ACROSS DATABASES
An important challenge for NR-VQA models based on deep learning is generalizing to data with different characteristics than the training database. Thus, we evaluated the generalization performance of our method in a cross-database scenario using training and testing databases with different contents and types of distortions. For each training database, we took the trained models and used them to estimate the quality scores of the videos from the other databases. We evaluated the cross-database results in terms of SROCC and reported the best performance value obtained for each test database. The performance of our method in this scenario is compared with that of VSFA, TLVQM and CNN-TLVQM. Table 3 reports the performance of the four tested methods in terms of SROCC, when trained on one authentic distortion database and tested on the remaining two. The proposed method obtains the best performance in four of the six training-testing scenarios of this table. Moreover, our method obtains good generalization performance (SROCC) when trained on the KoNViD-1k or the LIVE-Qualcomm database. From Tables 1 and 3, although CNN-TLVQM performs slightly better than our proposed model on the KoNViD-1k and LIVE-Qualcomm databases, our model generalizes better. We believe that this is because it learns all features, while CNN-TLVQM also uses hand-crafted features which may not be optimal for all settings. As described in the literature, KoNViD-1k comprises natural videos with a wide diversity of contents while LIVE-Qualcomm contains videos with rich scenes. Hence, these results illustrate the robustness of our method to training with data having very different characteristics.
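The cross-database protocol described above can be summarized by the following sketch. The data layout is an assumption made for illustration: each model is a hypothetical callable mapping a video to a predicted score, and `metric` is any function taking predictions and subjective scores and returning a value where higher is better (such as SROCC).

```python
def cross_database_best(models_by_db, videos_by_db, labels_by_db, metric):
    """For every (train, test) database pair, keep the best metric value.

    models_by_db: {train_db: [model, ...]}, models trained on train_db
    videos_by_db: {db: [video, ...]}
    labels_by_db: {db: [subjective_score, ...]}, aligned with videos_by_db
    """
    results = {}
    for train_db, models in models_by_db.items():
        for test_db, videos in videos_by_db.items():
            if test_db == train_db:
                continue  # only evaluate on databases unseen during training
            scores = [
                metric([model(v) for v in videos], labels_by_db[test_db])
                for model in models
            ]
            results[(train_db, test_db)] = max(scores)
    return results
```

Because the best value over several trained models is kept for each test database, the resulting numbers should be read as an upper envelope of cross-database performance rather than an average.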
Although not presented in Table 3, we also evaluated the generalization ability of the tested methods on the CSIQ video and LIVE VQA databases. However, we observed a low performance for those scenarios. For example, when our method is trained on the CSIQ video database and tested on LIVE VQA, it obtains an SROCC of 0.30.
Similarly, we observed a poor generalization performance when the models are trained on an authentic VQA database and tested on a traditional/post-capture VQA database (and vice versa). Thus, as concluded by some previous studies [38], learning-based VQA models perform poorly when the distortion type in the testing database is almost absent from the training database.
Finally, we compared the generalization performance of our proposed method with that of some deep learning NR-IQA models. The authors of [68] tested deep learning NR-IQA models such as WaDIQaM [69] and SPAQ [70] on VQA databases (KoNViD-1k, LIVE-Qualcomm and CVD2014) and concluded that their overall performance was not satisfactory because temporal information is discarded. Furthermore, that article shows that NR-VQA models such as VSFA and CNN-TLVQM present better generalization performance than those deep learning NR-IQA models.

D. ABLATION STUDY
In this section, we analyze the impact on performance of the different components of our model. Firstly, we evaluate the importance of the content-aware network in the proposed model. Secondly, we evaluate the benefit of the proposed strategy for aggregating the quality scores of individual images into a global video score, based on temporal attention and mean pooling. Toward this goal, we compared this strategy against three pooling mechanisms found in the literature. Finally, we compare the performance of our GRU-based method with two competitive temporal memory networks, RNN and LSTM. To avoid bias in the results, we selected 20 random splits and reused them for all test scenarios. As our study covers both traditional and authentic distortion databases, our ablation study is conducted using both authentic and traditional databases.

1) Study of Content-Aware Network
In Table 4, we performed an ablation study by using only the distortion network (i.e. the proposed method without the context-aware network) for predicting the video quality. The results show an improvement when adding the context-aware network. In this study, this improvement is more notable for the LIVE VQA and CSIQ video databases (traditional distortion databases).

2) Study of Pooling Methods
We compared the performance of our pooling strategy, which combines temporal attention with mean pooling, against the temporal pooling (TP) designed by the authors of VSFA [13] and two other pooling techniques that reflect human judgements of quality [71]:
• Min pooling, which selects the minimal score across the different frames of a video. This strategy is based on the idea that users rate the overall video quality based on the worst degradation.
• Recency pooling, which is based on the temporal hysteresis effect where users remember poor quality frames in the past and lower the perceived quality scores for the following frames, even when the frame quality has returned to acceptable levels. We fine-tuned the recency parameters for the KoNViD-1k database by setting the frame rate to 30 fps and the memory intensity effect parameter α_r to 0.01. More details can be found in [72].
In Table 5, we report the average (and standard deviation) SROCC and RMSE obtained for the compared pooling methods, when applied on the CSIQ video and KoNViD-1k databases. We see that our pooling strategy achieves the highest SROCC and the lowest RMSE on both databases. Notable improvements are especially observed for the CSIQ video database, where our method achieves a 0.02 higher SROCC and a 0.83 lower RMSE than the second-best method (temporal pooling of VSFA). The better performances of temporal and recency pooling compared to min pooling confirm the benefit of considering human behaviour in the pooling strategy (i.e., the temporal hysteresis effect). However, as shown by its better generalization performance, our model's learned attention is more robust to cross-database differences in terms of content and distortions than the pooling strategy of VSFA, which relies on hand-tuned hyper-parameters.
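To make the difference between these pooling rules concrete, the sketch below aggregates per-frame quality scores with min pooling, mean pooling and an attention-weighted average. The attention variant is only an assumption for illustration: it normalizes hypothetical per-frame attention logits with a softmax and is not meant to reproduce the exact formulation of our temporal attention module. Recency pooling is omitted because its definition is given in [72].

```python
import math


def min_pooling(frame_scores):
    """Video score driven by the worst frame."""
    return min(frame_scores)


def mean_pooling(frame_scores):
    """Video score as the plain average of the frame scores."""
    return sum(frame_scores) / len(frame_scores)


def attention_pooling(frame_scores, attention_logits):
    """Softmax-weighted average of the frame scores (illustrative assumption).

    Frames with larger logits contribute more to the video score.
    """
    shift = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in attention_logits]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(frame_scores, exps))


frame_scores = [3.0, 1.0, 2.0]
uniform = attention_pooling(frame_scores, [0.0, 0.0, 0.0])  # equals the mean
```

With equal logits the attention-weighted average reduces to mean pooling, while strongly peaked logits make it approach the score of a single frame, which is the behaviour that lets learned attention interpolate between the two extremes.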

3) Study of Temporal Network
Next, we evaluate our temporal network by replacing the proposed GRU with a basic RNN or an LSTM. The SROCC and RMSE of our method using different temporal networks, on the KoNViD-1k and CSIQ video databases, are reported in Table 6. We find that RNN and LSTM yield a similar performance to our method on the KoNViD-1k database, which may be due to the short and fixed length (8 seconds) of the videos in this database. However, the proposed GRU-based network achieves a higher SROCC and a considerably lower RMSE than the other approaches on the CSIQ video database (5.7% lower RMSE than the second-best approach, LSTM).

E. COMPUTATIONAL EFFICIENCY
Another important consideration when designing NR-VQA methods is computational efficiency. Some existing NR-VQA approaches offer suitable performance, but they cannot be used in real-life applications due to their high computational complexity. To complete our study, we evaluate the computational performance of our proposed method. We selected twenty representative video sequences from the CVD2014 database, ten with low resolution (640 × 480) and ten with high resolution (1280 × 720). The length of the sequences varies approximately from ten to twenty seconds. We compare the computation times of our proposed method with those of the state-of-the-art baselines selected in this study (VSFA, TLVQM), and with CNN-TLVQM and RAPIQUE. The simulations for all methods were performed on a desktop computer with an NVIDIA Quadro RTX8000 with 4608 CUDA cores. VSFA and our method were implemented in Python and exploit the PyTorch framework, while the CNN-TLVQM implementation uses only MATLAB and the two other implementations (TLVQM and RAPIQUE) use both MATLAB (for feature extraction) and Python (for regression). In Table 7, we report the average computation times for TLVQM, VSFA, CNN-TLVQM, RAPIQUE and our method. As shown, our method (based on ResNet-50) is about 3 times faster than CNN-TLVQM, which itself is faster than TLVQM. Additionally, our method has average computation times relatively close to those of VSFA for low-resolution videos. It is however slower than VSFA for high-resolution videos, for which VSFA requires almost 26.5% less runtime than our method. This is due to the fact that our method has to compute features in two different CNN branches (distortion prediction network and context-aware network), whereas VSFA only has a single feature extraction branch.
Nevertheless, this runtime difference could be reduced by performing the computations of both branches in parallel, at the cost of additional hardware, or by having the two network branches share some of their layers. Compared to VSFA, RAPIQUE shows significantly better computation times (see Table 7). However, as can be seen in Table 2, VSFA and RAPIQUE do not perform well on traditional distortion databases. Moreover, our method generalizes better than VSFA (see Table 3).

V. CONCLUSION
In this paper, we have proposed an objective NR-VQA method for videos affected by both authentic and traditional distortions. The main contributions of our method are threefold: first, a deep learning approach based on multi-task learning, where the distortion of individual frames in a video and the overall quality of the video are predicted by a single neural network; second, a pooling strategy that combines a temporal attention mechanism, which selects the frames of a video that contribute the most to its perceived quality, with average pooling, which accounts for all uncertainties; third, a distortion network designed and used as a complementary network to improve the quality prediction for some VQA databases.
Experiments on five different databases containing videos with authentic and traditional distortions demonstrate the effectiveness of our proposed method in highly different settings. While state-of-the-art NR-VQA approaches such as TLVQM, CNN-TLVQM, RAPIQUE and VSFA only perform well on videos with authentic distortions, and give unsatisfactory results on videos with traditional distortions, our method provides competitive performance in both settings. For a deep learning model, it also has a reasonable computational complexity.
Despite these promising results, the performance of our model on authentic databases could still be improved. In particular, we did not take video motion features into consideration in this study, although they could further boost performance on databases such as LIVE-Qualcomm. In future work, we plan to investigate DNN models which can efficiently extract motion features from videos.