Study of the Subjective and Objective Quality of High Motion Live Streaming Videos


Abstract-Video livestreaming is gaining prevalence among video streaming services, especially for the delivery of live, high motion content such as sporting events. The quality of these livestreaming videos can be adversely affected by any of a wide variety of events, including capture artifacts, and distortions incurred during coding and transmission. High motion content can cause or exacerbate many kinds of distortion, such as motion blur and stutter. Because of this, the development of objective Video Quality Assessment (VQA) algorithms that can predict the perceptual quality of high motion, live streamed videos is greatly desired. Important resources for developing these algorithms are appropriate databases that exemplify the kinds of live streaming video distortions encountered in practice. Towards making progress in this direction, we built a video quality database specifically designed for live streaming VQA research. The new video database is called the Laboratory for Image and Video Engineering (LIVE) Livestream Database. The LIVE Livestream Database includes 315 videos of 45 source sequences from 33 original contents impaired by 6 types of distortions. We also performed a subjective quality study using the new database, whereby more than 12,000 human opinions were gathered from 40 subjects. We demonstrate the usefulness of the new resource by performing a holistic evaluation of the performance of current state-of-the-art (SOTA) VQA models. We envision that researchers will find the dataset to be useful for the development, testing, and comparison of future VQA models. The LIVE Livestream database is being made publicly available for these purposes at https://live.ece.utexas.edu/research/LIVE_APV_Study/apv_index.html
Index Terms-live streaming, video quality assessment, video quality database, objective VQA algorithm evaluation

I. INTRODUCTION
Video traffic now occupies more than 70% of all downstream Internet traffic and is still expected to grow [1], [2]. Major content providers such as Amazon Prime Video, YouTube, Netflix, and Hulu are providing increasing amounts of video on demand (VoD) content, as well as live streaming videos, to an expanding audience. Live streaming, which is real-time audio and video transmission of live events, is gaining popularity very rapidly, especially for sporting events like the Super Bowl [3].
Although significant efforts have been made to enable the delivery of high-quality, high-resolution VoD, little effort has focused on live, high motion video streaming. A variety of factors can still adversely affect the quality of live streamed videos. For example, limited bandwidth and network instability may degrade the received video, causing distortions such as blocking, banding, deinterlacing motion mismatches, local flicker [4], aliasing, and interpolation artifacts [5]. If the network connection is unstable or the bitrate is inadequate, then frame drops may also occur. The videos may be distorted by stutter or motion blur, especially when there is rapid motion. By contrast with VoD streaming, a large portion of live streamed content is still interlaced and then deinterlaced, causing combing effects, flicker, or noticeable line movements.
Zaixi Shang and Joshua Ebenezer contributed equally to this work.
Video impairments like these can severely degrade the delivered video quality and users' holistic levels of visual satisfaction. This is a pressing problem for high motion, action content such as sports videos. High motion videos generally contain richer temporal information and are harder to compress, hence compression artifacts are often more severe in sports videos. Other distortions can also be exacerbated by high motion. For example, at lower frame rates, high motion sports may appear discontinuous over time, and may exhibit obvious judder. Likewise, high motion can worsen the visual appearance of interlacing, by causing jagged moving edges.
While it is highly desirable to create algorithms that can successfully predict the visual impacts of these distortions, subjective data is necessary to understand these perceptual phenomena and to design and test the underlying objective models. Human subjective studies make it possible to better understand and model the specific factors that contribute to the perceived quality of streaming videos. This data can be used to design or learn objective Video Quality Assessment (VQA) models that are consistent with subjective human evaluations of quality. The development of subjective video quality assessment datasets has been an ongoing effort for two decades [6]-[16], yet none of these are specific to live streaming distortions. Among existing datasets, most include fewer than 20 pristine source video contents of Standard Definition (SD) or High Definition (HD) resolutions, along with various distorted versions of them. The distortions in these resources are largely limited to compression and aliasing, and the datasets lack other live streaming distortions. What is needed is a database of higher resolution (UHD), high-quality source videos that have been processed to include distortions characteristic of those encountered in live streaming scenarios.
Towards filling this gap, we have created a new resource that we call the LIVE Livestream Database, which includes a large number of high motion sports videos, impaired by the most common distortions that impact the perceptual quality of live streamed videos. The new database contains 315 videos, built from 45 source sequences from 33 original contents impaired by six types of common processing distortions. Unlike prior, legacy VQA databases, the LIVE Livestream database consists of Full High Definition (FHD) and Ultra High Definition (UHD) videos of high motion sports content captured by professional videographers. Using these videos, we conducted a large human subjective study, whereby we presented the videos to a large pool of volunteers to obtain Mean Opinion Scores (MOS). To demonstrate the usefulness of the new dataset, we used it to perform a holistic evaluation of current state-of-the-art VQA models, to compare their performance and to gain insights into potential future live streaming VQA problems.
The rest of the paper is organized as follows. In Section II we introduce prior work related to our study, and in Section III we discuss the relevance and novelty of the work. In Section IV, we explain the details of the construction of the new database and the protocol of the human study. Section V elaborates on the processing and analysis of the obtained subjective scores. Section VI compares the performances of various state-of-the-art (SOTA) VQA models on the new database. Finally, Section VII concludes the paper with thoughts regarding future efforts.

II. RELATED WORK
Over the past decade, there have been many efforts to build subjective video quality databases. Among those, the LIVE VQA Database [6] includes 10 pristine videos processed with compression and packet loss distortions. Similarly, the later database in [17] contains 156 videos modified by H.264 compression artifacts and wireless packet losses. The LIVE QoE Database for HTTP-based Video Streaming [13] studies the quality of experience (QoE) of users who viewed compressed videos with simulated video stalls, which can arise when there is low channel throughput. This database models the perception of video quality on mobile devices, and the human study was performed on mobile phones and tablets. Another QoE database proposed in [23] aims to motivate QoE prediction in video streaming, with different bitrate levels and stalling events. Among its 20 1080p source sequences, 5 videos contain high motion content. Another database [28] studied H.264 compressed videos transferred through an error-prone network, including 156 sequences at CIF and 4CIF spatial resolutions. The LIVE Mobile Video Quality Database [19] consists of 200 distorted videos created from 10 RAW HD reference videos, including compression and wireless packet losses, with dynamically varying distortions. The MCL-V database [29] was designed for streaming video quality assessment, and contains 12 source video clips and 96 distorted video clips impaired by H.264 compression, as well as compression followed by spatial scaling. The TUM databases [30], [31] contain several synthesized videos with H.264 compression. Other exemplars include the MCL video quality database [32], ECVQ and EVVQ [33], and the Poly@NYU Video Quality Databases [34], [35].
More recently, novel databases have been introduced that contain user-generated-content (UGC) videos with authentic distortions. The LIVE-VQC database [8] contains 585 videos, all of unique contents captured by a large group of users deploying various camera devices, including smartphones of all brands. The LIVE-VQC videos cover a wide range of qualities, and include complex, often commingled authentic distortions. The large KoNViD-1k [36] video quality database contains 1,200 video sequences, covering a wide variety of contents and authentic distortions. The YouTube UGC Dataset [37] contains 1500 20-second video clips covering popular UGC video categories, including gaming and sports.
A number of deficiencies limit the usefulness of all of these databases for the study of the quality of live video streams. Older, legacy databases contain only limited numbers of SD source contents, which are not representative of current high-resolution live streaming. Although most databases consider compression distortions and packet loss, other prevalent distortions common to live streaming videos are rarely found in them. Given exploding interest in live streaming video, a comprehensive database that includes both ample video content and representative live streaming distortions is needed.
UGC video quality databases usually include a large number of contents, but there is a lack of professionally captured content, and the distortions encountered in live streaming often significantly differ from those caused by typical casual social media users. The only existing publicly available VQA database designed for live streaming is the LIMP Video Quality Database [38]. The LIMP database consists of nine high-quality videos taken from the LIVE Video Quality Database [6], with simulated compression modeling transmitted in a controlled network. However, it suffers from the same problems mentioned above. Motivated by an apparent dearth of live streaming databases containing enough high-resolution video contents and sufficiently representative live streaming distortions, we have created a large new resource intended to address modern aspects of the live sports streaming video quality problem.

III. RELEVANCE AND NOVELTY
In recent years, the streaming of live high motion video content such as sports has exploded [1]. Live streamed high motion videos often suffer from severe distortions that are less often encountered in the streaming of generic content. In live streaming, network instability and bandwidth limitations make it more challenging to control video quality. Moreover, the real-time requirement greatly limits the time available for post-processing to compensate for defects. The unique nature of live streaming introduces many obstacles that differ from those encountered in generic on-demand video streaming. For example, sports videos usually include content containing complex, large motions. Rapid and irregular camera motions occur frequently when tracking moving objects, such as balls or players. Temporal distortions often arise that are annoying and that adversely affect the viewing experience.
The new psychometric database that we describe here has a number of unique attributes. It contains a larger number of unique source contents and distortion types. We summarize the attributes of public video quality databases in Table I. The new database includes 45 source sequences taken from 33 unique contents. All of the videos contain complex, fast motions, which are rarely included in existing databases. The new resource contains a wide variety of distortion classes

IV. DETAILS OF SUBJECTIVE STUDY
We constructed a new video quality database that consists of 315 video sequences, including 45 reference videos and 6 synthetically distorted versions of each reference video. These videos were used as stimuli in the subjective study.

A. Source Sequences
We collected 33 uncompressed, high-quality source videos with sports content. These videos are freely available online from multiple sources, including from Tampere University [39], the MCML Group [40], the Netflix Public Dataset [41], the VQEG HD3 Dataset [42], the Consumer Digital Video Library (CDVL) [43], and the SJTU Media Lab [44]. All of the selected videos were captured with professional, high-end camera equipment and are distortion-free. The original pristine videos all have resolutions of 1920x1080 or 3840x2160 pixels, and were progressively scanned in YUV 4:2:0 format with audio components removed. The videos have frame rates at 30 fps. The video contents include 10 different types of sports, including running, football, and soccer, and one video of the audience in a stadium, as exemplified in Fig. 1.
The original 33 videos that we collected are of durations ranging from 5s to 26s. However, viewing videos of such different durations could bias both subjective and objective judgments, and longer videos may exhibit visible changes of distortion over time. While the effects of video duration are interesting and worthy of study, including them would also increase the dimensionality of the study. Thus, we manually cropped the longer videos along the temporal dimension into one or two shorter clips of about 7 seconds, with no overlap or close proximity between the clips. Based on internal studies at UT-LIVE, it has been observed that very short sports videos may cause annoying content disruptions, such as an incomplete "play," although such events are usually shorter than 8s. To avoid unpleasant cuts during action scenes, we allowed some flexibility of the video durations, hence the final set of original videos had lengths in the range 5s-8s, averaging 7.88s with a standard deviation of 1.36s. In this way, 45 video clips were created from the 33 originals, of which 22 clips are of resolution 1920x1080 and 23 clips are of resolution 3840x2160.

B. Synthetic Distortions
We created 6 distorted video sequences from each of the pristine sequences, using six different distortion processes. These included H.264 compression, aliasing, judder, flicker, frame drops, and interlacing. Since our primary goal is to model the visual quality of high motion live sports videos, the distortions chosen were judged to be the most common and salient ones that are encountered during live sports events. During live streaming of high motion contents, certain distortions may produce more severe effects than on more generic video content. For example, a moving object may cause large pixel offsets between neighboring frames or fields. If the video is interlaced, then severe edge combing and blur may occur. If the frame rate is too slow, then judder from 3:2 conversion [45], [46] may be visible in high motion regions, which can seriously and adversely impact the appearances of sports videos. Purely temporal distortions, such as frame drops, which cause discontinuities and motion stalls, are difficult to detect.
When applying different levels of each distortion type, we sought to ensure that the distorted videos would be both perceptually separable and also cover a wide range of perceptual qualities, following successful practice in numerous previous studies [6]-[8]. However, given the large number of source sequences, it is not practical to include multiple copies of the same content, which would greatly increase the duration of the human study. Moreover, having a larger number of unique contents can contribute to improved model building. Hence, given the fairly large number of source videos, we dictated that each would only have a single level of severity of each distortion type applied to it. For example, four levels of H.264 compression, corresponding to different constant rate factors (CRF), were defined. These were assigned in a "round robin" sequential manner: the first reference video was only compressed using the first CRF level, the second reference was only compressed using the second CRF level, and so on. The fifth source video then had the first level of distortion applied. However, to ensure that there would be no content-related quality bias, the first video in the quality level cycle was also sequenced as subsequent distortions were applied. In this way, each of the 45 clips taken from the original 33 pristine source videos has 6 associated distorted versions of it, yielding 315 videos including the 45 reference videos.
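The round-robin assignment described above can be sketched as follows. This is a minimal illustration rather than the script used to build the database: the function name `assign_levels`, the zero-based indexing, and the `offset` argument (one plausible reading of how the start of the level cycle was shifted between distortion types) are assumptions.

```python
def assign_levels(num_videos, num_levels, offset=0):
    """Return one severity level index (0-based) per source video.

    Video k receives level (k + offset) mod num_levels, so the levels are
    cycled in order across the contents. `offset` is a hypothetical knob
    that shifts which video starts the cycle for a given distortion type.
    """
    return [(k + offset) % num_levels for k in range(num_videos)]


# Four CRF levels cycled over 45 clips: the fifth clip reuses the first level.
crf_level_per_clip = assign_levels(num_videos=45, num_levels=4)
print(crf_level_per_clip[:6])
```

With this scheme every clip appears exactly once per distortion type, and each severity level is spread nearly evenly over the contents.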

1) H.264 Compression
H.264 remains the most widely accepted and used video compression standard. A 2020 streaming industry survey [47] found that 91% of streaming services use H.264. Although newer codecs exist, such as HEVC, VP9 and AV1, they are not yet as widely adopted. Browsers and devices also do not fully support all codecs. The Apple Safari browser supports HEVC, but not VP9, while Chrome and Firefox support VP9 and AV1, but not HEVC. All browsers support H.264. Hence, when designing this VQA database, we deemed H.264 to be most representative of current practice. Moreover, even emerging standards still follow the basic hybrid codec method of distortion, viz., quantization of DCT blocks, while several of the distortions in the database are not compression-related. Hence, we believe that the new database will retain its usefulness as compression standards evolve. We fixed four levels of H.264 compression using the criteria described earlier, by varying the CRF values. Similar to other successful VQA databases [6], [17], [29], we included a wide range of compression CRFs to ensure that the distorted videos cover a wide range of perceptual qualities, while also ensuring perceptual differences between the applied compression levels, to allow for improved model-building. Since, in practice, the compression parameters differ on videos of different resolutions, we selected different sets of CRFs for the 4K videos and the 1080p videos. The CRF values selected for the 4K videos were 9, 27, 39, and 43, while those for the 1080p videos were 9, 25, 35, and 39. All of the compressed videos were generated using FFmpeg.
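As an illustration of how such encodes might be produced, the sketch below assembles an FFmpeg command line for a raw YUV 4:2:0 input and a given CRF level. Only the CRF values come from the text; the helper name `x264_crf_command`, the file names, and the choice to leave every other encoder setting at its default are assumptions, and the exact options used to build the database are not stated here.

```python
# CRF values quoted in the text, indexed by severity level (0 = mildest).
CRF_LEVELS = {
    "3840x2160": [9, 27, 39, 43],
    "1920x1080": [9, 25, 35, 39],
}


def x264_crf_command(src_yuv, dst, size, level, fps=30):
    """Build (but do not run) an FFmpeg command for a libx264 CRF encode.

    The input is described with the rawvideo demuxer options, since a raw
    YUV file carries no header. All names here are illustrative.
    """
    crf = CRF_LEVELS[size][level]
    return [
        "ffmpeg",
        "-f", "rawvideo", "-pixel_format", "yuv420p",
        "-video_size", size, "-framerate", str(fps),
        "-i", src_yuv,
        "-c:v", "libx264", "-crf", str(crf),
        dst,
    ]


print(" ".join(x264_crf_command("clip01.yuv", "clip01_crf.mp4", "1920x1080", 2)))
```

The returned list can be passed to `subprocess.run` on a machine where FFmpeg with libx264 is installed.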

2) Aliasing
Aliasing was simulated by first downscaling each video, then upscaling it back to its original dimensions. The downscaling was performed by spatially downsampling the video to half the original size without the use of an anti-aliasing filter, while the upscaling was performed using a Lanczos filter.

3) Judder
Motion judder is an artifact that is introduced when scenes shot at 23.976 fps are converted to 29.97 fps by a process called 2:3 pulldown. The ratio of these frame rates is 4:5, so 5 output frames are created for every 4 input frames. We first temporally downsampled each video to 23.976 fps, then converted the frame rate to 29.97 fps by 2:3 pulldown. The odd video field of every 2nd frame and the even video field of every 3rd frame of each group of 4 frames were combined to form an additional frame for each group of 4 frames. This process is shown in Fig. 2. Classic 2:3 pulldown followed a slightly different pattern where the 2nd and 3rd frames of the original video would be interlaced to form the 3rd frame of the juddered video, and the 4th and 5th frames of the original video would be interlaced to form the 4th frame of the juddered video. This had the disadvantage of producing two "dirty" frames, which were the 3rd and 4th frames in each group, but was used in legacy systems where the buffer could not hold fields from more than one frame at a time. The version we use here is a more advanced pulldown, supported by cameras released after 2000 such as the Panasonic DVX100 [48] or the Canon XL2 [49]. The more advanced version of pulldown generates only one "dirty" frame and also allows for better compression and easier conversion back to 23.976 fps.
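The frame pattern of the advanced pulldown can be sketched in a few lines. This is a toy model under stated assumptions: frames are represented as lists of rows, the "even field" is taken to be rows 0, 2, 4, ... and the "odd field" rows 1, 3, 5, ..., the mixed frame is placed between the 2nd and 3rd frames of each group, and any trailing frames that do not fill a group of 4 are dropped. None of these details are confirmed by the text, and the function names are hypothetical.

```python
def weave(even_src, odd_src):
    """Build a frame with even rows from even_src and odd rows from odd_src."""
    return [even_src[r] if r % 2 == 0 else odd_src[r]
            for r in range(len(even_src))]


def advanced_pulldown(frames):
    """Turn every group of 4 frames into 5 frames with one mixed frame."""
    out = []
    for g in range(0, len(frames) - 3, 4):
        a, b, c, d = frames[g:g + 4]
        # Odd field of the 2nd frame combined with the even field of the 3rd.
        mixed = weave(even_src=c, odd_src=b)
        out.extend([a, b, mixed, c, d])
    return out


# Toy frames: four rows each, every row labeled with its frame name.
frames = [[name] * 4 for name in "ABCD"]
for frame in advanced_pulldown(frames):
    print(frame)
```

Only the third output frame mixes rows from two source frames, which corresponds to the single "dirty" frame per group mentioned above.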

4) Flicker
We simulated flicker distortion from compression by alternating the H.264 quantization parameter (QP) applied to the video. The QP was fixed at a constant value by passing this parameter to libx264. These QP values were applied to each frame, regardless of the frame type, content, and motion. Three pairs of QPs were chosen to form three flicker distortion levels: QP26 and QP32, QP26 and QP38, and QP26 and QP44. The flicker rate, which is the number of QP alternations per second, was kept constant at roughly 5 Hz, i.e., by alternating the QP every 3 frames. This process is depicted in Fig. 3.
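The per-frame QP pattern can be written down directly. The sketch below only generates the schedule; how the schedule was fed to libx264 is not described in the text, and the function name and the choice to start with the lower QP are assumptions.

```python
def flicker_qp_schedule(num_frames, qp_low=26, qp_high=32, hold=3):
    """Return one QP per frame, switching between two values every `hold` frames."""
    return [qp_low if (n // hold) % 2 == 0 else qp_high
            for n in range(num_frames)]


schedule = flicker_qp_schedule(12)
print(schedule)

# One full low/high cycle spans 2 * hold frames, so at 30 fps the
# cycle rate is 30 / (2 * 3) = 5 Hz.
print(30 / (2 * 3))
```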

5) Frame Drops
We simulated video frame losses that occur when a source video is transmitted over a channel, such as a wireless network. We simulated frame drop clusters of adjacent frames to account for 10%-30% of a group of pictures (GOP). When a cluster of frames was removed from a video, the previous frame was repeated as many times as needed so that the total video duration remained unchanged. Three levels of frame drop densities were chosen: 3, 6 and 9 frames per cluster, yielding a slight to severe impact on the perceptual qualities of the videos.
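A minimal sketch of one dropped cluster is given below. Frames are represented by their indices, and the function name `drop_cluster` and the position of the cluster are illustrative assumptions; the text does not say where in each GOP the clusters were placed.

```python
def drop_cluster(frames, start, cluster_len):
    """Replace frames[start:start + cluster_len] with repeats of the frame
    just before the cluster, so the sequence length stays the same."""
    if start < 1:
        raise ValueError("a previous frame is needed to fill the gap")
    end = min(start + cluster_len, len(frames))
    return frames[:start] + [frames[start - 1]] * (end - start) + frames[end:]


frames = list(range(10))
print(drop_cluster(frames, start=4, cluster_len=3))
```

In the example, frames 4 to 6 are lost and frame 3 is shown four times in a row, which appears as a brief motion stall.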

6) Interlacing
On each frame of the video, the even and odd lines were separated to form two fields, field A and field B. Field B from each current frame and field A from each next frame were then combined to create interlaced frames. In the presence of motion, combing effects become evident. Since interlaced video fields are captured at different moments in time, interlaced frames often exhibit motion combing artifacts, when objects move quickly enough to be at different positions in each field.
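The field weaving described above can be sketched as follows, again with frames modeled as lists of rows. Taking field A to be the even rows and field B the odd rows is an assumption, since the text does not say which parity each field has; the function name is hypothetical.

```python
def interlace(frames):
    """Weave field B (odd rows) of each frame with field A (even rows)
    of the next frame. N input frames give N - 1 output frames."""
    out = []
    for cur, nxt in zip(frames, frames[1:]):
        out.append([nxt[r] if r % 2 == 0 else cur[r]
                    for r in range(len(cur))])
    return out


# Toy frames: four rows each, every row labeled with its frame index.
frames = [[n] * 4 for n in range(3)]
for frame in interlace(frames):
    print(frame)
```

Because each output frame needs a following frame, the result has one frame fewer than the input.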

C. Subjective Testing Environment and Display
The human study was carried out in the LIVE subjective study room at The University of Texas at Austin. The lab was arranged to simulate a living room environment. The windows were covered, and background distractions were removed. A Samsung UN65RU7100FXZA Flat 65-Inch 4K UHD TV was used to display all of the videos. All advanced motion optimization options on the TV, including the anti-judder and anti-flicker functions, were disabled. The viewing distance was about 2H, where H is the height of the TV, so that the subjects could comfortably view the videos and assess the video distortions. The level of illumination was set to be similar to a living room, using one stand-up incandescent lamp and two indirect white LED studio lights behind the viewer. The lights were positioned to eliminate reflections on the screen.
Since the TV is able to upscale 1080p content using an unknown algorithm, all of the 1080p videos were instead upscaled using the Lanczos resizing function in OpenCV [50], to avoid any unpredictable effects. The 1080p videos were upscaled to 4K, after the distortions were applied. To ensure perfect playback, all of the videos were stored as raw YUV 4:2:0 files. The powerful Venueplayer application developed by VideoClarity was used to guarantee smooth playback of the 4K videos, without introducing any additional artifacts that could impact the perception of video quality.
After displaying each of the test videos, a continuous rating bar was displayed on the screen with a randomly placed cursor. The quality bar was marked with labels "Bad," "Poor," "Fair," "Good," and "Excellent" quality to facilitate the subjects in making decisions. The scores given by the subjects were sampled as integers from [0, 100] although numerical values were not made visible to the subjects. A Palette gear console was provided to enable the subjects to move the cursor without distraction. After moving the cursor to each desired scoring position, the subject depressed the button next to the sliding bar to confirm the score, which was then recorded without any further change. After each score was stored, the system immediately began to play the next video on the playlist.

D. Subjective Testing Design
In the human study, a single-stimulus (SS) method was employed, as described in the ITU-R BT.500-13 recommendation [51]. The reference videos were included as "hidden references," not explicitly marked as "distorted" or "reference." The subjects used a rating bar to record their subjective opinion scores. Video rating scores were given after watching each video on an (invisible) scale ranging from 0 to 100, where 0 indicates the worst quality and 100 indicates the best quality. Due to the large number of video sequences, each subject participated in two sessions. The 45 contents associated with the pristine videos were divided into two sessions, where the reference videos and their corresponding distorted versions were grouped into the same session. The playlists within each of the two sessions were placed in randomized order for each subject, where videos of the same content were separated by at least one video. This was done to counter any visual memory effects that might affect the subjective quality judgments, or any bias caused by playing the videos in a particular order. Each session required about 40 minutes.
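One simple way to generate such a playlist is to reshuffle until no two consecutive videos share a content. The sketch below uses this rejection approach; it is an assumption about how the constraint could be met, not a description of the software used in the study, and the function name and the `(content_id, version_id)` representation are hypothetical.

```python
import random


def randomized_playlist(videos, rng, max_tries=100000):
    """Shuffle (content_id, version_id) pairs until adjacent items differ
    in content, i.e. same-content videos are separated by at least one video."""
    order = list(videos)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(a[0] != b[0] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid ordering found; increase max_tries")


# Toy session: 4 contents, each with 3 versions.
videos = [(content, version) for content in range(4) for version in range(3)]
playlist = randomized_playlist(videos, random.Random(0))
print(playlist)
```

Using a separate random generator per subject gives each subject a different order while keeping the session reproducible.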

E. Subjects and Training
A total of 40 human subjects were recruited from the student population at The University of Texas at Austin. The male/female gender ratio of the subject pool was 4.0. The mean and standard deviation of the ages of the participants were 23.47 and 1.78 years, respectively. Each subject participated in two sessions separated by at least 24 hours. Two of the subjects finished only one of the two sessions, while the remaining 38 subjects finished both sessions. 180 of the videos were rated by 40 subjects, while 187 videos were rated by 38 subjects. The subject pool was inexperienced with the topic of video quality assessment and video distortions.
The Snellen test and the Ishihara test were performed to validate each subject's vision. Two subjects were found to have 20/30 visual acuity, while one subject was found to have a color deficiency. However, these subjects were allowed to participate since the overall subject pool was deemed to be a good representation of the general population, following our common practice [52]. We conducted the tests as a screen against an unusual percentage of deficient subjects. Before the study, each subject was presented with a brief introduction to the study. The introduction described the study's goals, and gave detailed instructions on how to operate the system and assign scores. Each subject was asked to rate each video by quality only, without regard to the appeal of the content. Before the actual study commenced, each subject participated in a training session on two videos, to familiarize themselves with the system. The training videos and their scores were not included in the final database.

V. PROCESSING OF SUBJECTIVE SCORES
Subjective Mean Opinion Scores (MOS) were computed as follows. Let $s_{ij}$ denote the score given by subject $i$ to video $j$. The subject scores were first converted into Z-scores $z_{ij}$ for each subject:

$$z_{ij} = \frac{s_{ij} - \mu_i}{\sigma_i},$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the scores given by subject $i$. Subject rejection was performed based on the ITU-R BT.500-11 recommendation [51]. The scores $z_{ij}$ for each video were tested against the normal distribution using the $\beta_2$ test:

$$\beta_{2j} = \frac{m_{4j}}{(m_{2j})^2},$$

where

$$m_{xj} = \frac{1}{N_j} \sum_{i=1}^{N_j} (z_{ij} - \bar{z}_j)^x$$

for subject $i$ and video $j$, and where $N_j$ is the number of subjects that viewed video $j$. The scores of a video were regarded as normally distributed if $\beta_{2j}$ fell between 2 and 4. We calculated the quantities $P_i$ and $Q_i$ for each subject $i$ by comparing $z_{ij}$ with the mean $\bar{z}_j$ and standard deviation $\sigma_j$ of video $j$. If the scores for video $j$ were found to be normally distributed, then:

if $z_{ij} \geq \bar{z}_j + 2\sigma_j$, then $P_i = P_i + 1$;
if $z_{ij} \leq \bar{z}_j - 2\sigma_j$, then $Q_i = Q_i + 1$.

If the scores for video $j$ were found to not be normally distributed, then:

if $z_{ij} \geq \bar{z}_j + \sqrt{20}\,\sigma_j$, then $P_i = P_i + 1$;
if $z_{ij} \leq \bar{z}_j - \sqrt{20}\,\sigma_j$, then $Q_i = Q_i + 1$.

A subject $i$ was rejected if the following two conditions held:

$$\frac{P_i + Q_i}{N} > 0.05 \quad \text{and} \quad \left| \frac{P_i - Q_i}{P_i + Q_i} \right| < 0.3,$$

where $N$ is the number of videos scored by the subject. In our study, 8 of the 40 subjects satisfied these two conditions. However, since most of the rejected subjects fell close to the decision boundaries, we decided to revisit how the rejection criteria should be used. Given that the intent of subject rejection is to eliminate the outcomes of less engaged, distracted, or otherwise deficient subjects, we believed it worth considering whether any of the high-deviation subjects were actually representative, as we have done in other recent studies [53]. We therefore computed the correlations between each subject's scores and the MOS calculated using three different variations of the rejection criterion: 8 subjects rejected, none rejected, and 1 (most anomalous) subject rejected, as shown in Fig. 4a. Specifically, the subjects were divided into three groups: Group 1 included all subjects not excluded by the ITU method, Group 2 included the 8 subjects that were rejected, and Group 3 consisted only of the single subject having the worst correlation against MOS. In the end, we chose to report all results after excluding only the single subject in Group 3. Table II shows our analysis of the data's internal consistency. Our modification of the typical outlier rejection criterion finds support in the analysis, and allows for a larger amount of likely representative data for model-building. We randomly divided the subjects into two equally sized groups and computed the Pearson linear correlation coefficient (PLCC) between the two groups' scores. We repeated this calculation over 1000 random splits, and report the mean and median correlations in Table II. As may be seen, the best results were attained by removing the single very anomalous subject. We also observed a negligible effect of the choice of rejection criteria on the objective algorithm performances reported later.
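The screening procedure based on the β2 test can be sketched in plain Python as below. This is a simplified illustration, not the code used in the study: it assumes that every subject rated every video (which was not the case for two subjects here), that Z-scores are computed once per subject, and that the sample standard deviation is used for each video. The function names are hypothetical. Recall also that the study ultimately excluded only a single subject rather than all subjects flagged by this rule.

```python
import math
from statistics import mean, stdev


def to_zscores(scores):
    """Convert raw scores[i][j] (subject i, video j) to per-subject Z-scores."""
    z = []
    for row in scores:
        mu, sd = mean(row), stdev(row)
        z.append([(s - mu) / sd for s in row])
    return z


def flag_subjects(z):
    """Return indices of subjects flagged by the beta_2 / (P_i, Q_i) screening."""
    num_subjects, num_videos = len(z), len(z[0])
    P = [0] * num_subjects
    Q = [0] * num_subjects
    for j in range(num_videos):
        col = [z[i][j] for i in range(num_subjects)]
        zbar = mean(col)
        m2 = mean([(v - zbar) ** 2 for v in col])
        m4 = mean([(v - zbar) ** 4 for v in col])
        if m2 == 0:
            continue  # all subjects agree exactly; nobody can be an outlier
        beta2 = m4 / m2 ** 2
        # 2 sigma if the scores look normal, sqrt(20) sigma otherwise.
        k = 2.0 if 2.0 <= beta2 <= 4.0 else math.sqrt(20.0)
        sigma = stdev(col)
        for i, v in enumerate(col):
            if v >= zbar + k * sigma:
                P[i] += 1
            elif v <= zbar - k * sigma:
                Q[i] += 1
    flagged = []
    for i in range(num_subjects):
        total = P[i] + Q[i]
        if total / num_videos > 0.05 and abs((P[i] - Q[i]) / total) < 0.3:
            flagged.append(i)
    return flagged
```

A typical use would be `flag_subjects(to_zscores(raw_scores))`, where `raw_scores` is a subjects-by-videos table of ratings on the 0 to 100 scale.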
The Z-scores were then linearly rescaled from [-3, 3] to [0, 100]:

$$z'_{ij} = \frac{100\,(z_{ij} + 3)}{6}.$$

Finally, the Mean Opinion Score (MOS) of each video was calculated:

$$MOS_j = \frac{1}{N_j} \sum_{i=1}^{N_j} z'_{ij}.$$

The converted MOS are shown in Fig. 4b. As may be seen in Fig. 6, the MOS ranges of different distortion levels are mostly well-separated, but there are overlaps between distortion levels, largely because of the different interactions that occur between content and distortion. The perceptual quality of distorted (e.g., compressed) videos is affected by content masking, e.g., in regions containing significant high frequency spatial energy or high motion. While spatial masking is well-understood, temporal masking is less so, although it is known that motion has a silencing effect on flicker [4]. Tables III and IV show measurements of the consistency of human scores for each of the different distortion types. The tables list the Spearman's Rank Order Correlation Coefficient (SROCC) and the Pearson Linear Correlation Coefficient (PLCC) computed on the entire database and for each distortion type, again by randomly dividing the subjects into two groups. It may be observed that the SROCC was slightly lower than the PLCC, which might be explained by subjects having difficulty supplying correctly ordered ratings of videos of very similar quality, while still generally being able to make judgments in a linear manner. Overall, the results indicate a very high degree of internal consistency and agreement among the human subjects on all of the distorted video types.
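The rescaling and averaging steps are simple enough to state in code. In this sketch a missing rating is marked with `None`, which is an assumed convention for subjects who did not view a given video; the function names are likewise hypothetical.

```python
def rescale(z):
    """Map a Z-score linearly from [-3, 3] to [0, 100]."""
    return 100.0 * (z + 3.0) / 6.0


def mean_opinion_scores(z):
    """z[i][j] is the Z-score of subject i on video j, or None if not rated.

    Returns one MOS per video, averaging only over subjects who rated it.
    """
    num_videos = len(z[0])
    mos = []
    for j in range(num_videos):
        ratings = [rescale(row[j]) for row in z if row[j] is not None]
        mos.append(sum(ratings) / len(ratings))
    return mos


print(mean_opinion_scores([[0.0, 3.0], [0.0, None]]))
```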
Although MOS is a good representation of the subjective quality of videos and is necessary for the development and evaluation of NR VQA algorithms, the Difference MOS (DMOS) is more commonly used in the development and evaluation of FR VQA models, since it provides a way to reduce the content dependencies of quality labels. Since we are supplying this resource for the study of both NR and FR models, we also calculated the DMOS of the videos having references. The DMOS of each distorted video was calculated from $\mathrm{MOS}_j$, the MOS of video $j$, and $\mathrm{MOS}^{\mathrm{ref}}_j$, the MOS of the reference of video $j$, which is regarded as a "hidden reference," since it is not identified as such to the subjects.

A. Performances of FR VQA Models
Here we present the results for the following seven popular FR VQA models: PSNR, SSIM, MS-SSIM, SpEEDQA, ST-RRED, FAST, and VMAF. The distorted versions of the 45 reference contents (270 videos in total) were processed to produce predictions that were compared against the DMOS. Note that most FR VQA models require an equal number of frames in each reference video and its corresponding distorted video. However, the videos subjected to interlacing distortions have one less frame than the originals they derive from. Hence, the final frame of each interlaced video was duplicated to match the reference. The predicted scores $s$ were passed through a five-parameter nonlinear logistic regression function before the PLCC and RMSE were computed:

$$f(s) = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2 (s - \beta_3)\right)} \right) + \beta_4 s + \beta_5,$$

where $s$ are the predicted scores produced by the tested algorithm and $f(s)$ is the mapped score. The parameters $\beta_i$ ($i = 1, 2, 3, 4, 5$) were fit by minimizing the MSE between the mapped and subjective scores. The SROCC, PLCC, and RMSE for each category of distortions were calculated by comparing the predictions made by the FR models against the ground truth for each of those distortions separately. Tables V, VI, and VII show the performance metrics of the compared algorithms, which will be discussed shortly.
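The evaluation protocol described above can be sketched with SciPy as follows. The sketch assumes the commonly used five-parameter logistic form (a scaled logistic term plus a linear term); the function names, the initial parameter guess, and the exponent clipping are illustrative choices, not details taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic5(s, b1, b2, b3, b4, b5):
    """Assumed five-parameter logistic mapping from predictions to subjective scores."""
    z = np.clip(b2 * (s - b3), -500.0, 500.0)  # avoid overflow in exp
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(z))) + b4 * s + b5

def evaluate(pred, subj):
    """Return (SROCC, PLCC, RMSE); PLCC and RMSE use the fitted mapping."""
    pred = np.asarray(pred, dtype=float)
    subj = np.asarray(subj, dtype=float)
    # Heuristic starting point (an assumption; other initializations are possible).
    p0 = [np.ptp(subj), 1.0 / (np.std(pred) + 1e-12), np.mean(pred), 0.0, np.mean(subj)]
    beta, _ = curve_fit(logistic5, pred, subj, p0=p0, maxfev=20000)
    mapped = logistic5(pred, *beta)
    srocc, _ = spearmanr(pred, subj)   # rank-based, so computed on raw predictions
    plcc, _ = pearsonr(mapped, subj)
    rmse = float(np.sqrt(np.mean((mapped - subj) ** 2)))
    return srocc, plcc, rmse
```

Because the SROCC depends only on rank order, the monotonic mapping is needed only for the PLCC and RMSE.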

B. Performance of NR VQA Models
We compared the quality predictions made by a variety of NR models against the MOS. The NR VQA algorithms that were tested include NIQE [54], BRISQUE [55], HIGRADE [56], CORNIA [57], TLVQM [58], V-BLIINDS [59], and ChipQA [60], [61]. BRISQUE, HIGRADE, CORNIA, TLVQM, V-BLIINDS, and ChipQA are supervised learning algorithms that use a support vector regressor (SVR) to learn mappings from 'quality-aware' features to mean opinion scores. These algorithms were tested on 1000 random train-test splits. On each split, 80% of the data was used for training, and 20% for testing. Following common practice, 5-fold cross-validation was applied within each training set to find the best parameters for the SVR. Care was taken to ensure that no content could appear in both the training and testing sets, or in both the training and validation sets.
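The content-separated splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: `content_ids` is a hypothetical array giving the source content of each video, and the 80/20 division is applied at the content level, which only approximates an 80/20 division of the videos when contents contribute unequal numbers of videos.

```python
import numpy as np

def content_split(content_ids, test_frac=0.2, seed=0):
    """Boolean (train, test) masks such that no content appears in both sets."""
    rng = np.random.default_rng(seed)
    content_ids = np.asarray(content_ids)
    contents = np.unique(content_ids)
    rng.shuffle(contents)
    n_test = max(1, int(round(test_frac * len(contents))))
    # Every video of a held-out content goes to the test set.
    test_mask = np.isin(content_ids, contents[:n_test])
    return ~test_mask, test_mask
```

Repeating the call with 1000 different seeds yields 1000 random splits. The same grouping idea applies to the inner cross-validation folds; scikit-learn's `GroupShuffleSplit` and `GroupKFold` offer equivalent group-aware splitting.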
NIQE, BRISQUE, and HIGRADE are image quality assessment (IQA) algorithms, so they were used to extract features frame by frame, followed by temporal average pooling.
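Temporal average pooling of frame-level features can be sketched as below. Here `frame_feature_fn` is a hypothetical placeholder for any per-frame IQA feature extractor, and the optional `stride` argument, which subsamples frames, is an illustrative addition rather than part of the described protocol.

```python
import numpy as np

def pool_video_features(frames, frame_feature_fn, stride=1):
    """Average per-frame feature vectors over time to get one vector per video."""
    feats = [np.asarray(frame_feature_fn(f), dtype=float) for f in frames[::stride]]
    return np.mean(feats, axis=0)
```

The pooled vector has the same length as a single frame's feature vector, so it can be passed to the same regressor that an image-level model would use.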
For the unsupervised method (NIQE), the scores s were passed through the same nonlinear logistic regression process described earlier before the PLCC and RMSE were computed. The performances of the compared VQA models on the entire database, as well as on each synthetic distortion, are shown in Tables VIII, IX, and X, where the best performing model on each distortion category is boldfaced. The results for each specific distortion were obtained by training the SVR on the reference sequences and the sequences having that specific distortion. Scatter plots of some selected objective VQA models against MOS are shown in Fig. 8.

C. Statistical Evaluation
A one-sided t-test was performed on the 1000 SROCC scores of the NR VQA models computed on the LIVE Livestream Database, using the 95% confidence level, to evaluate whether one VQA algorithm was statistically superior to another. The results are shown in Table XII. Results on the entire database and on the individual distortions are both included. Each entry in the table consists of 7 symbols corresponding to the entire database and the 6 distortions, in the order of compression, aliasing, judder, flicker, frame drop, and interlacing. A symbol '1' indicates that the performance of the row algorithm was statistically superior to that of the column algorithm, while a symbol '0' indicates that the column algorithm was statistically better than the row algorithm. A symbol '-' indicates that the performances of the row and column algorithms were statistically equivalent.
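One way to produce a single symbol of such a table is sketched below. The choice of Welch's unequal-variance t-test and the function name `compare` are assumptions made for this example; the text above specifies only a one-sided t-test at the 95% confidence level.

```python
import numpy as np
from scipy import stats

def compare(row_srocc, col_srocc, alpha=0.05):
    """Return '1' if the row model is significantly better, '0' if the column
    model is significantly better, and '-' if neither test is significant."""
    _, p_row = stats.ttest_ind(row_srocc, col_srocc, equal_var=False,
                               alternative="greater")
    _, p_col = stats.ttest_ind(col_srocc, row_srocc, equal_var=False,
                               alternative="greater")
    if p_row < alpha:
        return "1"
    if p_col < alpha:
        return "0"
    return "-"
```

Calling `compare` once for the entire database and once for each of the 6 distortions, then concatenating the results, gives a 7-symbol entry of the kind described above.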

D. Computational Cost
Since we are interested in live streaming use scenarios, we studied the computational cost, the number of giga floating point operations (GFLOPS), and the complexity of each of the compared models, as shown in Table XI. The O(·) figures make clear that all of the compared algorithms could be realized in real-time hardware. To measure computation time, we used a single 4K video having 210 frames. Of the compared algorithms, V-BLIINDS and ChipQA were implemented in Python. All other algorithms were implemented in MATLAB®. All the algorithms were run on an Intel Xeon E5-2620 CPU with a maximum frequency of 3 GHz.
While none of the tested algorithms runs in real time in its current implementation, each may be optimized to do so. In most of the algorithms, the most expensive step is filtering. For example, in BRISQUE the most expensive step is computing the mean subtracted contrast normalized (MSCN) coefficients. However, the cost of filtering scales linearly and filtering is highly parallelizable. Frame-based algorithms can be applied at a lower frame rate with little loss of prediction efficacy [62]. While V-BLIINDS expends considerable computation on motion estimation, motion vectors can be re-used from those produced by the involved codec. The complexity of CORNIA, which computes dot-products between local descriptors and visual codewords, is affected by the codebook size, which can be quite large.
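To make the filtering cost concrete, the MSCN computation can be sketched with two Gaussian filtering passes per frame. The window parameters (`sigma = 7/6`, truncated to a 7×7 support) and the stabilizing constant `c = 1` are commonly used defaults assumed here, not values stated in this paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(frame, sigma=7.0 / 6.0, c=1.0):
    """Mean subtracted contrast normalized coefficients of a grayscale frame."""
    frame = np.asarray(frame, dtype=np.float64)
    # truncate=2.5 gives a filter radius of 3 pixels, i.e. a 7x7 window.
    mu = gaussian_filter(frame, sigma, truncate=2.5)
    second_moment = gaussian_filter(frame * frame, sigma, truncate=2.5)
    # Clip tiny negative variances caused by floating point error.
    sd = np.sqrt(np.maximum(second_moment - mu * mu, 0.0))
    return (frame - mu) / (sd + c)
```

Both passes are separable convolutions whose cost grows linearly with the number of pixels, and each output pixel depends only on a small neighborhood, which is what makes this step amenable to parallel implementation.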

E. Discussion of Results
The results presented in Tables V, VI, and VII suggest that, other than PSNR, the compared FR VQA models generally delivered similar overall performances on the entire database, but some algorithms yielded better performances on certain distortions. For example, SSIM, which performed well overall, obtained the highest correlation against DMOS on the compressed videos, but low correlation on the judder videos. The main reason that SSIM delivers low performance on judder videos is that it is a frame-based model. Judder is a temporal distortion that arises when high motion is present in a video. The greater the magnitude of the motion, the more apparent the distortion is likely to be. While SSIM effectively captures spatial distortions (like compression), it is unable to capture the temporal effects of judder. ST-RRED does include limited temporal information, expressed as NSS features of adjacent frame differences, but this is inadequate to model complex or longer-duration temporal distortions, hence it does not outperform the other compared FR models. VMAF yielded the highest correlation on the aliased and flicker videos, but low correlations on the interlaced videos. The FR VQA models tended to deliver decent performances on common distortions found in other VQA databases, such as compression, and also flicker, which is compression based. These distortions are better studied and are easier to detect when the reference is available. However, when tested on the purely temporal distortions, all of the compared FR VQA models delivered low correlations against DMOS. This suggests ample room for research on developing better models of temporal and motion-related distortions.
From Tables VIII, IX, and X, it may be observed that ChipQA performed the best among the compared NR VQA algorithms, while TLVQM and V-BLIINDS also achieved relatively high correlations against the human judgments. TLVQM achieved the top performance on flicker and frame drops, likely because of the large number of temporal features it uses. ChipQA builds a statistical representation of local spatiotemporal data that is attuned to local orientations of motion over large spatial fields, motivated by processes in areas V1 and MT of the brain. The explicit modeling of deviations from statistical regularity in the spatiotemporal domain allows it to perform well on both spatial and temporal distortions. NIQE and BRISQUE are similar methods, but BRISQUE is trained while NIQE is completely blind, hence BRISQUE can usually deliver predictions having higher correlations against human quality judgments. Similar statistical features are used in V-BLIINDS and HIGRADE. The frame-based models NIQE, BRISQUE, HIGRADE, and CORNIA do not access any motion information, which greatly limits their performance. CORNIA yielded top performances on compression, aliasing, and interlacing, all of which present strong spatial aspects of distortion. However, the overall performance of CORNIA was lower than that of V-BLIINDS, TLVQM, and ChipQA, due to its lack of temporal information.

VII. CONCLUSION
We created a large-scale video quality database targeting high motion, live streaming scenarios. The new resource includes 45 source sequences from 33 original contents and 6 different distortion types. The new database can be used to create, test, and compare both NR and FR VQA models. We are making the new LIVE Livestream database publicly available. Future steps include developing new NR VQA models using the proposed database.