Content-Adaptable ROI-Aware Video Storage for Power-Quality Scalable Mobile Streaming

The demand for mobile video streams is constantly increasing. With this demand comes a need for mobile devices to receive more videos at ever-increasing quality. However, due to the large size of video data and intensive computational requirements, video streaming requires frequent memory accesses that consume a substantial amount of mobile device power; as a result, the battery life of mobile devices is limited. In this paper, we present a video content-adaptable Region-of-Interest (ROI)-aware video storage technique that promotes power savings. During the video encoding process on the transmitting server, based on the macroblock variance and ROI characterization, the “macroblocks of interest” are identified and embedded in the encoded bitstream. In the decoding process, a new frame buffer with dynamic power-quality trade-off is presented to adapt to the macroblock characteristics during run-time. Results from the system-level and circuit-level simulations show that the proposed technique enables substantially more truncated bits and significant power savings while delivering similar or better video quality as compared to other state-of-the-art solutions.


I. INTRODUCTION
Mobile video streaming on YouTube, Vimeo, and Netflix has increased on average 70% per year and will consume approximately 79% of the total internet traffic by 2022 [1]. At the same time, power-efficient video storage has proven to be a very challenging problem to solve. This is due to the large data sizes involved and the intensive computational requirements, which demand frequent data access. With the advancement of computing technologies, more video streaming services deliver content to battery-powered mobile devices, such as smartphones and Internet-of-Things (IoT) devices. On one hand, these devices would benefit greatly from low power consumption, as this would extend their battery life. On the other hand, the mobile video streaming process (receiving, decoding, and displaying a video bitstream) consumes considerable power and limits the mobile devices' battery life. For example, in a video decoding chip, embedded memories contribute over 50% of the decoding power consumption [2]. This share is only expected to grow for the next-generation video formats, H.265/HEVC and H.266/VVC, which have 2x-3x greater memory demands when compared to H.264 [3].
Today's mobile hardware designers, including memory designers, are focusing on hardware-level energy-efficient design techniques in order to accommodate the large amount of video data. However, these design techniques usually come with significant implementation overhead (e.g., silicon area, delay) to solve failure problems in memories. We have recently explored viewer-aware video memory design by investigating the impact of illuminance levels in different viewing surroundings on the viewer's experience [4,5,6,7], as shown in Fig. 1. Our previous studies illustrate a new dimension of power savings for hardware design through the introduction of viewer awareness, but the developed memories lack runtime adaptation across a wide variety of mobile videos. To enable an optimized trade-off between power efficiency and video quality, this paper aims to develop a video content-adaptable Region-of-Interest (ROI)-aware memory for general videos. Specifically, this paper makes the following contributions:
• An intelligent ROI-aware and content-adaptive framework is proposed to determine video frame regions to preserve (output quality) or truncate bits for power savings. The truncation is applied for all Luma and Chroma video data (i.e., Y, U, and/or V components) (Section III).
• The system-level implementation scheme of the proposed technique is developed and discussed (Sections IV-A, IV-B, and IV-C).
• A low-power low-cost frame buffer with dynamic power-quality trade-off is developed to adapt to the video content (i.e., macroblock characteristics) during run-time (Section IV-D).
• A comprehensive suite of simulations on the proposed technique is performed and the enriched results are discussed, including the performance, circuit-level power efficiency, video-level power efficiency, number of truncated bits, and output quality of various mobile videos (Sections VI-A, VI-B, VI-C, and VI-D).
• An extensive statistical analysis demonstrates the effectiveness of the proposed technique in achieving significant bit truncations and power savings as compared to the state of the art, particularly for the videos with medium or high variance (Section VI-E).
To the best of the authors' knowledge, this is the first work that seamlessly integrates ROI knowledge, i.e., "macroblocks of interest", into the hardware design process.
The organization of the paper is as follows. A review of low-power video memory designs is provided in Section II, Section III presents the macroblock variance and ROI study, and Section IV discusses the proposed technique. We discuss the evaluation methodology and results in Sections V and VI respectively, and finally, we conclude the paper in Section VII.

II. STATE OF THE ART
A vast amount of research has been conducted to improve the power efficiency of video data storage. State-of-the-art power-efficient video memories consist of either approximate memories that use application-level information [8,9,10,11,12] or viewer-aware memories that take the viewer's experience into account [4,5,6,7]. In this section, some of the existing work related to the proposed technique is briefly reviewed, and the detailed comparison analysis will be provided in Section VII.

A. APPROXIMATE VIDEO-SPECIFIC MEMORY
Researchers have presented various low-power video memory design techniques. Chang et al. [8] presented a hybrid 6T+8T SRAM to achieve quality-power optimization. Gong et al. [9] developed a hybrid 8T+10T memory for power savings based on the correlation between most-significant-bits (MSBs) of video data. In [10], a heterogeneous sizing scheme was presented to reduce the failure probability of conventional 6T bitcells. The video memory presented in [11] used the Least-Significant-Bits (LSBs) of video data to store the MSBs' error-correction code (ECC). Kazimirsky et al. [12] developed a hybrid SRAM+DRAM memory to store MSBs in robust SRAM bitcells and LSBs in error-prone DRAM bitcells, leading to a tolerable output quality with power reduction. However, all of those video memory designs were developed without considering the viewer's experience.

B. VIEWER-AWARE VIDEO MEMORY
We have investigated viewer-aware low-power video memory techniques in [4,5,6], where an increased amount of ambient luminance allows a larger number of bits to be truncated without noticeable degradation to the viewers. Very recently, we studied the impact of video content characteristics on the viewer's experience to enable video content-adaptive memory with dynamic energy-quality trade-off [7]. However, the technique determined the number of truncated LSBs based on the averaged plain macroblock percentage of an entire video sample; therefore, it was only effective for storing low-motion videos, such as those shot with a stationary camera or those containing a reporter in a newscast. Additionally, this technique may result in noticeable distortion, e.g., a banding distortion caused by bit truncation, which negatively influences the viewer's experience.
The common feature of these viewer-aware storage techniques is that the same number of truncated bits is applied to an entire video. In contrast, the technique proposed in this paper realizes content adaptation and ROI awareness within each video frame, thereby maximizing the number of truncated bits while maintaining the video quality.

III. OVERVIEW OF THE PROPOSED TECHNIQUE
In this section, we first present the motivation of the proposed technique that introduces ROI awareness as bit truncation is applied for power savings. Then, the high-level overview of the proposed technique is shown.
Researchers have conducted studies on the human visual system's (HVS) performance and concluded that viewers usually pay more attention to one or a few areas of a video; the region of concentration is called the Region-Of-Interest (ROI) [13]. For example, in video conferencing applications, viewers typically pay more attention to the face regions than to other areas. In video surveillance, the facial regions are what viewers concentrate on most in consecutive frames. Accordingly, ROIs contribute more to the overall visual quality than other areas. Consequently, if truncation-caused banding distortion appears in ROIs, it will negatively influence a viewer's experience. Fig. 2 shows one example. The output quality of the video (Video tag: wF6lvdXXwc4 [14]) using the technique in [7] is shown in Fig. 2 (a). Since the banding distortion caused by bit truncation appears on the reporter's face, viewers were less likely to accept the displayed degradation due to this particularly noticeable distortion, as emphasized in [7]. Therefore, the motivation for this work arises from the following two observations:
1) In a video frame, the distortion in ROIs is more noticeable to viewers. Accordingly, if ROIs can be extracted and protected from truncation, the video quality would be improved from the viewer's perspective (Fig. 2 (b)). A comparison of the reporter's face using the technique in [7] and the proposed technique with ROI awareness is shown in Fig. 2 (c).
2) There is a positive correlation between power savings and the number of bits truncated in a video decoder's frame buffer memory [7]. To optimize the power efficiency, it would be beneficial to increase the number of truncated bits in the regions that are not ROIs: the truncation regions.
FIGURE 2. (a) Output quality using [7] (at 3 truncated bits). (b) Output quality of the proposed technique (at 3 truncated bits). (c) [7] (left) vs. proposed technique (right). Observer-discernible flaws in the facial region due to a "banding effect" on the face when comparing (a) and (b) caused the overall quality of the frame to become unacceptable at 3 truncated bits (Video tag: wF6lvdXXwc4 from [14]).
Fig. 3 shows the proposed content-adaptable ROI-aware video storage technique. During the traditional mobile video streaming process, first, from (1) in Fig. 3, the mobile device requests a video for display from the cloud. Then, the streaming servers process the requested video by encoding and transmitting the encoded bitstream to the mobile device for decoding and display, (2) in Fig. 3. During this process, multiple memories are needed for storing the intermediate and final results of the frame data. In particular, the reference macroblock, frame memory, and display memory, which store the decoded video frames, are accessed very frequently, and they have a profound impact on the system's overall cost and power consumption. The proposed technique extracts ROIs in the cloud server and transmits the truncation region data together with the encoded bitstream to the mobile device, (3) in Fig. 3, to further reduce the mobile device's power consumption from computational overhead. The mobile device hardware video decoder receives the truncation region data and makes memory bit-truncation decisions for greater power savings with less perceived quality loss than [7].
To simplify the truncation decision logic of the mobile device hardware, which further reduces power consumption, either no truncation or 3-bit truncation is applied to the truncation regions. Explicitly, the proposed technique is detailed as follows.

1) ROI AWARENESS
ROI has recently been applied in different research areas for video system optimization, such as wireless transmission [15], virtual reality (VR) [16], and video summarization [17]. The proposed technique introduces ROI awareness into video storage. Specifically, to minimize the complexity and computational overhead, we focus on faces as ROIs in our analysis, based on the basic machine learning facial detection OpenCV model [18]. Different algorithms, such as the user attention model [13], motion-based models [17], and machine learning models [19], can be applied in our future investigations to extract different ROIs. It should be noted that the choice of ROI extraction algorithm is a trade-off between video quality, computational complexity, and power consumption. A simple ROI extraction algorithm will save compute resources and power during video encoding. Also, the algorithm may transmit fewer truncation region bits to mobile devices, and thus more pixel bits will be truncated for power savings in the mobile devices. The drawback is that it will influence the video quality negatively. Alternatively, a more complex ROI algorithm will identify additional regions and can therefore convert a video without ROI to a video with ROI, which will benefit the video quality, but it will reduce the power savings due to the fewer truncated bits and increased computational complexity.
1 The word vector means all bytes of a component (Y, U, or V) in one frame.

2) VIDEO CONTENT ADAPTATION
After the ROIs to preserve are detected and captured by the framework's ROI Identifier, the framework searches for regions of low variance, measured by the percentage of plain macroblocks (MBs). Specifically, an MB defines an area of 16x16 pixels within a frame. An attribute associated with MBs is how "Textured or Plain" they are. A Plain MB is one in which the variance of intensity within the MB is less than or equal to the threshold value. It has been concluded in [7] that textured MBs are less susceptible to bit-truncation. We adopt the pre-established method from [20] for determining the variance in an MB, as given in (2), where ρ is the average brightness within the MB, the computed quantity is the texture variance within the MB, and Th is traditionally set to 1.25 [21].
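The plain/textured classification described above can be sketched as follows. This is a minimal illustration, not the exact formulation of [20]: it assumes the texture measure is the population variance of the 16x16 intensities around their mean ρ, and the helper names (`macroblock`, `is_plain`, `plain_percentage`) are hypothetical.

```python
from statistics import fmean, pvariance

MB_SIZE = 16   # macroblock edge length in pixels, as stated in the text
TH = 1.25      # threshold value quoted in the text from [21]


def macroblock(frame, top, left):
    """Return the 16x16 block of `frame` (list of pixel rows) at (top, left)."""
    return [row[left:left + MB_SIZE] for row in frame[top:top + MB_SIZE]]


def is_plain(mb, th=TH):
    """Assumed rule: an MB is plain when its intensity variance is <= th."""
    pixels = [p for row in mb for p in row]
    rho = fmean(pixels)                   # average brightness within the MB
    return pvariance(pixels, mu=rho) <= th


def plain_percentage(frame):
    """Percentage of plain MBs over all complete MBs of one component plane."""
    rows, cols = len(frame), len(frame[0])
    flags = [is_plain(macroblock(frame, r, c))
             for r in range(0, rows - MB_SIZE + 1, MB_SIZE)
             for c in range(0, cols - MB_SIZE + 1, MB_SIZE)]
    return 100.0 * sum(flags) / len(flags)
```

A frame whose plain-MB percentage is high has large low-variance areas; how that percentage maps to a truncation decision is left to the framework described in the text.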

3) TRUNCATION REGION EXTRACTOR
After ROIs are identified on the server, a truncation region extractor encodes the truncation region data per frame using a proprietary protocol and transmits it to the mobile device in synchronization with the encoded video transmission. The truncation region data is decoded onboard the mobile device's hardware video decoder in a novel Memory Bit Truncation Manager (MBTM) hardware unit, which controls bit truncation in a novel frame buffer memory through dedicated YUV truncation control signals. The video decoding and bit truncation processes occur in lockstep.

4) 3-BIT TRUNCATION
Truncation is performed in the YUV (Y'CbCr) color space [22], meaning that any truncation is applied to the YUV color values. The memory designed in [7] applied the same number of truncated bits to all frames within an entire video as a blanket truncation. The proposed technique enables a different number of truncated bits for each region within each frame of a video. To minimize the implementation overhead, only 3-bit truncation is adopted in the new frame buffer, which will be discussed in Section IV-D. Meanwhile, the proposed technique can apply bit truncation to the Y, U, and V vectors separately for each truncation region in each frame, instead of only truncating the Y vector as a blanket truncation across the entire video as in the existing techniques [4,5,7]. Furthermore, the proposed technique is expected to enable additional bit truncations as compared to existing techniques. Also, to minimize the video quality degradation caused by bit truncation, the developed frame buffer truncates the three LSBs to the optimal value "100" [7], instead of truncating the values to "000".
Fig. 4 shows the Akiyo video sample using the proposed technique. The extracted preserved ROI region is highlighted in pink. All truncation regions within a frame are identified, including the following seven possible truncation combinations: (1) Green, Y vector truncation; (2) Blue, U vector truncation; (3) Yellow, V vector truncation; (4) Dark blue, YU vectors truncation; (5) Dark yellow, UV vectors truncation; (6) Dark green, YV vectors truncation; and (7) Grey, YUV vectors truncation. Each of these combinations is encoded in the truncation region data for the MBTM to generate control signals for memory bit truncation in the video decoding process.
To conclude, our proposed technique truncates the chroma samples within each frame as well as the luma samples, i.e., the Y, U, and V vectors. Previous research only targeted the luma component, Y, of a video for truncation, while chroma samples were disregarded for the entire video. Also, our technique preserves the ROIs that impact viewer perception most, while enabling greater truncation for each Y, U, and V vector in the truncation regions with textured MBs. Accordingly, the proposed technique will realize a greater number of truncated bits while preserving visual quality. The system-level and circuit-level implementations of the proposed technique will be discussed in Section IV.
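To illustrate why replacing the three LSBs with "100" degrades quality less than replacing them with "000", the following sketch compares the mean squared error of both fill values. It assumes 8-bit samples that are uniformly distributed, which is an assumption made here for illustration only; the optimality of "100" itself is taken from [7]. The function names are hypothetical.

```python
def truncate_lsbs(value, fill=0b100):
    """Replace the three LSBs of an 8-bit sample with the 3-bit pattern `fill`."""
    return (value & 0xF8) | fill


def mean_squared_error(fill):
    """MSE of 3-LSB truncation over all 256 sample values (uniform assumption)."""
    return sum((v - truncate_lsbs(v, fill)) ** 2 for v in range(256)) / 256


mse_mid = mean_squared_error(0b100)   # LSBs forced to "100"
mse_zero = mean_squared_error(0b000)  # LSBs forced to "000"
```

Under this uniform assumption, the mid-range fill "100" keeps the reconstruction error within -4 to +3, whereas "000" always underestimates the sample by up to 7, so `mse_mid` is smaller than `mse_zero`.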

IV. PROPOSED TECHNIQUE: SYSTEM LEVEL AND CIRCUIT LEVEL IMPLEMENTATION
This section presents the system-level and circuit-level implementation of the proposed technique.

A. SYSTEM-LEVEL IMPLEMENTATION: VIDEO STREAMING PLATFORM
Fig. 5 shows the developed system-level video streaming platform. As shown, a Raspberry Pi [23] microcontroller was used to serve as a video streaming server with which a mobile device would communicate and retrieve video data. Also, we utilized a Z-Turn 7020 [24]
The corresponding block diagram for Fig. 5 is illustrated in Fig. 6. The video streaming process is kicked off by a command from the mobile device to the server to retrieve an encoded H.264 video stream over Secure Copy Protocol (SCP) [26]. The mobile device sends the initial kick-off command to the server over a serial terminal on a PC interfaced with the mobile device over USB. The server then processes the video stream requested by the mobile device by both transmitting an H.264 encoded format of the video stream over SCP to the mobile device and parsing the frames for truncation region information.
After the video frame is parsed on the server, the truncation region information is transmitted over GPIO per frame. In our developed system, the protocol is defined in Table I. Only the truncation region information of the frames that would be truncated is transmitted. The preserved ROI information is not transmitted, as these regions are identified and preserved on the server prior to the transmission. As listed in Table I, the first index, index 0, denotes the current frame number parsed. The second index, index 1, denotes the number of truncation regions to truncate. The next indices denote the three YUV truncation signal bits plus two sets of XY coordinates denoting the top-left and bottom-right corners of the rectangle grouping the affected truncation region. These three indices repeat for each region called out by the "Number of Regions", index 1. The GPIO interface data width of the developed system is 22 bits per index. The 22-bit width accounts for a maximum of 2^11 × 2^11 pixel addressing (covering a maximum resolution of 1,920 × 1,080), totaling 22 bits. There are 2 additional handshaking bits between the server and the mobile device to confirm data reception in order to transmit the next index.
This truncation region information is transmitted to an MBTM for processing on the mobile device side, as discussed in Section IV-B. The MBTM generates control signals for the frame buffer memory, thereby determining which sub-pixels (from Y, U, and/or V) shall be truncated for each frame written to the frame buffer memory, which will be detailed in Section IV-D. Finally, the decoded and bit-truncated frame is output over HDMI from the mobile device and captured by the Video Capture & Display Device.
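The per-frame index sequence described above can be sketched as a simple pack/unpack routine. Table I is not reproduced here, so the bit layout is an assumption: each coordinate pair is packed as an 11-bit X followed by an 11-bit Y, and the three truncation signal bits are placed in the low bits of their own 22-bit index in the order T_Y, T_U, T_V. All function names are hypothetical, and the handshaking bits are omitted.

```python
COORD_BITS = 11                      # 2^11 x 2^11 addressing, per the text
COORD_MASK = (1 << COORD_BITS) - 1
WORD_MASK = (1 << 22) - 1            # 22 bits per index


def pack_xy(x, y):
    """Pack one XY coordinate into a 22-bit index (assumed layout: X high, Y low)."""
    if not (0 <= x <= COORD_MASK and 0 <= y <= COORD_MASK):
        raise ValueError("coordinate does not fit in 11 bits")
    return (x << COORD_BITS) | y


def unpack_xy(word):
    return word >> COORD_BITS, word & COORD_MASK


def encode_frame(frame_num, regions):
    """regions: list of (t_y, t_u, t_v, (x0, y0), (x1, y1)) tuples."""
    words = [frame_num & WORD_MASK, len(regions)]     # index 0 and index 1
    for t_y, t_u, t_v, top_left, bottom_right in regions:
        words.append((t_y << 2) | (t_u << 1) | t_v)   # YUV truncation signals
        words.append(pack_xy(*top_left))
        words.append(pack_xy(*bottom_right))
    return words


def decode_frame(words):
    frame_num, count = words[0], words[1]
    regions = []
    for i in range(count):
        sig, tl, br = words[2 + 3 * i: 5 + 3 * i]
        regions.append(((sig >> 2) & 1, (sig >> 1) & 1, sig & 1,
                        unpack_xy(tl), unpack_xy(br)))
    return frame_num, regions
```

With this layout, a frame with n truncation regions occupies 2 + 3n indices, which is why merging adjacent macroblocks into larger rectangles reduces the transmitted overhead.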

B. MEMORY BIT TRUNCATION MANAGER (MBTM)
The MBTM implemented into the H.264 decoder parses the protocol data that is transmitted by the server's Truncation Region Extractor. The flow is broken down as follows. First, from Fig. 7 (a), the encoded frame is transmitted via SCP to the mobile device. Fig. 7 (b) illustrates the truncation regions determined to be bit-truncation capable on a sub-frame vector level: Y vector, U vector, and V vector each encompassing all the sub-frames summing to a frame. From Fig. 7 (b), the gray areas denote the truncation regions determined to be bit-truncation capable for all Y, U, and V vectors. The areas in boxes are regions where only 1 or 2 vectors were determined to be bit-truncation capable. Two coordinates, top-left and bottom-right, are highlighted in Fig.  7 (b) for each of these regions to show how the truncation region data was used to determine the regions to truncate using the protocol in Table I. A total of 61 regions to truncate are shown in Fig. 7 (b). Fig. 7 (c) shows the resultant frame after Fig. 7 (a) is decoded using the identified truncation region information. As shown, the preserved ROI around the face, pink region from (b), is not truncated to avoid visual quality degradation. The frame is decoded normally, but when it is written into the frame buffer, the transmitted truncation region information is used to control the T_Y, T_U, and T_V control inputs to truncate the frame buffer memory as it is written. These control inputs are provided to the proposed frame buffer in Fig. 8, which will be discussed in detail in Section IV-D.

C. H.264 DECODER AND MBTM INTEGRATION
An H.264 video decoder is implemented based on the open-source Osenlogic OSD10 decoder IP [27]. This decoder is capable of decoding baseline profile level 3.1 encoded bitstreams. The slice types supported are I-Slice, SI-Slice, P-Slice, and SP-Slice [28]. The entropy coding mode supported is Context-Adaptive Variable-Length Coding (CAVLC). The decoder takes an H.264 Network Abstraction Layer (NAL) bitstream and outputs YUV 4:2:0.
During the NAL bitstream parsing process, the bitstream is parsed into raw bytes of syntax elements from the Raw Byte Sequence Payload (RBSP). The slice layers lie within the RBSP. Ignoring the Sequence Parameter Set (SPS) and the Picture Parameter Set (PPS), the Instantaneous Decoder Refresh Access Unit (IDR Slice(s)) and the slice layer include all slice headers and slice data for the frames that shall be truncated using the MBTM. H.264/AVC defines a frame as an array of luma samples and two corresponding arrays of chroma samples, denoted as YUV.
Specifically, the slice header includes the parameters first_mb_in_slice, which indicates the position of the first macroblock in the slice data, and frame_num, which represents the order in which a video decoder shall decode the encoded frames. This is not the same as the display order or Picture Order Count (POC), which is the order in which the frames are displayed. The frame_num parameter is used to determine which frames during the decoding process would be susceptible to YUV bit-truncation by the MBTM and the first_mb_in_slice is used to determine the starting coordinates of the macroblocks susceptible to bit-truncation. The slice data included all the macroblocks of the slice.
After the MBTM determined that a frame would be truncated, through a conditional match between the frame number parameter from Table I (a) and frame_num, the MBTM kept an internal running count of the current macroblock index from the slice data, starting with the index of first_mb_in_slice. After the MBTM determined that a macroblock would be truncated, through a conditional match of the running macroblock index and the truncation region given by the two indices from   Request index is used to fetch any frame index truncation information for macroblocks that required multiple frames for prediction. This process is repeated until the end of the NAL bitstream.
The trade-off with utilizing the MBTM is the additional GPIO parallel bitstream overhead required to truncate the macroblocks in each frame. Each frame parsed had an absolute worst-case overhead of approximately 380,738 additional bits to transmit using the protocol from Table I. This worst case is calculated assuming every macroblock with 16 × 16 pixels in a maximum resolution of 1,920 × 1,080 would be truncated differently per frame in a video. On average, however, the number of additional bits transmitted per frame is 1,200, because the maximum resolution of each frame is 1920 × 1080 and the truncation regions are combined to encompass a greater area in the video to save on bits transmitted: on average 50 truncation regions per frame. With a 1920 × 1080 video at 30 frames per second progressive (1080p 30fps) or a 1280 × 720 video at 60 frames per second progressive (720p 60fps), i.e., a 5,000 kbps bit rate, the worst-case percentage overhead would be 7.62%, with an average of 0.02% per frame. The protocol utilized is one of the simplest methods to implement the proposed technique.
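The matching step performed by the MBTM can be sketched in software as follows. This is a behavioral illustration under stated assumptions, not the hardware implementation: macroblocks are assumed to be numbered in raster order, a macroblock is treated as belonging to a region when its top-left pixel lies inside the region rectangle, and the names `mb_origin` and `control_signals` are hypothetical.

```python
MB = 16  # macroblock edge length in pixels


def mb_origin(mb_index, frame_width):
    """Top-left pixel (x, y) of a macroblock, assuming raster-order numbering."""
    mbs_per_row = (frame_width + MB - 1) // MB
    return (mb_index % mbs_per_row) * MB, (mb_index // mbs_per_row) * MB


def control_signals(frame_num, mb_index, frame_width, table):
    """Return (T_Y, T_U, T_V) for one macroblock.

    `table` maps a frame number to its list of truncation regions, each given
    as (t_y, t_u, t_v, (x0, y0), (x1, y1)) with inclusive pixel coordinates.
    A frame that is absent from the table is not truncated.
    """
    x, y = mb_origin(mb_index, frame_width)
    t_y = t_u = t_v = 0
    for r_y, r_u, r_v, (x0, y0), (x1, y1) in table.get(frame_num, []):
        if x0 <= x <= x1 and y0 <= y <= y1:
            t_y, t_u, t_v = t_y | r_y, t_u | r_u, t_v | r_v
    return t_y, t_u, t_v
```

In the decoder, `mb_index` would correspond to the running count that starts at first_mb_in_slice, and `frame_num` to the slice header field of the same name.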

D. CIRCUIT-LEVEL IMPLEMENTATION OF THE PROPOSED FRAME BUFFER MEMORY
During the video decoding process, multiple memories are needed. In particular, the frame buffer memory is accessed very frequently and it has a profound influence on the system's overall cost and power consumption [7]. In this paper, a new frame buffer is designed, and the circuit-level implementation is shown in Fig. 8. Specifically, the logic in the truth table highlighted in yellow was designed to be supported by the MBTM. Here, T_Y, T_U, and T_V are utilized to truncate the Y, U, and V bytes of the word. Each word consists of a Y, U, and V byte. During the Write Enable (WE) phase of the frame buffer memory access, if any of the control lines T_Y, T_U, and/or T_V is asserted, the memory truncates the 3 LSBs of the corresponding vector to the optimal value "100" [7].
The proposed frame buffer has M words and each word consists of N bits. To evaluate the functionality and measure the average power consumption of the proposed circuitry, a 128-word by 24-bit memory array is designed. Here, input and output pins are denoted as data[23:0] and out[23:0] respectively. Bits 23-16 are named the Y byte, bits 15-8 the U byte, and bits 7-0 the V byte. The memory has a driver and sense amplifier for writing and reading data. These enable bit truncation according to T_Y, T_U, and T_V control signal activation. If T_Y, T_U, and T_V are all de-asserted as logic '0', then the frame buffer operates as a traditional memory device, where the sense amplifier operates with a supply voltage (VDD) and pre-charge signal phi2b. When the T_Y signal is asserted as logic '1', the peripheral circuitry generates two signals: y!, which is the inverted value of T_Y, and y_pre!, which is the inverted value of the pre-charge enable signal. These two signals are used to control the sense amplifier for the Y byte's 3 LSBs, thereby enabling truncation. During this process, the VDD for this sense amplifier remains grounded and the pre-charge signal would be reactivated. As a result, the power consumption of this portion of the circuitry is reduced as compared to normal operation. During the read-back operation, the 3 LSBs are generated as "100" through the use of three 2:1 multiplexers in place of the regular data output. When bit truncation is asserted, these multiplexers select "100" through control signals y!, u!, or v!. Otherwise, these multiplexers pass the normal readout data values. In addition, the VDD of the 3 LSBs of each byte is also controlled by the corresponding control signals y!, u!, and v!. During truncation, the VDD for the LSBs can be powered off to save power and the multiplexers select "100" as the output data, thereby achieving low-power operation.
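The word-level behavior of the frame buffer can be summarized with a small software model. It is only a functional sketch of the read/write behavior described above and says nothing about the circuit itself; the class name and the choice to model powered-off LSB cells as not retaining data are assumptions.

```python
BYTE_SHIFT = {"Y": 16, "U": 8, "V": 0}   # Y = bits 23-16, U = 15-8, V = 7-0


class FrameBufferModel:
    """Behavioral model of a 128-word x 24-bit frame buffer with 3-LSB truncation."""

    def __init__(self, words=128):
        self.cells = [0] * words

    def write(self, addr, data, t_y=0, t_u=0, t_v=0):
        word = data & 0xFFFFFF
        for name, t in zip("YUV", (t_y, t_u, t_v)):
            if t:
                # Assumption: truncated LSB cells are powered off, so their
                # contents are not retained (modeled here as zeros).
                word &= ~(0x7 << BYTE_SHIFT[name])
        self.cells[addr] = word

    def read(self, addr, t_y=0, t_u=0, t_v=0):
        word = self.cells[addr]
        for name, t in zip("YUV", (t_y, t_u, t_v)):
            if t:
                # The 2:1 multiplexers output "100" instead of the stored LSBs.
                shift = BYTE_SHIFT[name]
                word = (word & ~(0x7 << shift)) | (0b100 << shift)
        return word
```

For example, writing 0xA5A5A5 with only T_Y asserted and reading it back with T_Y asserted yields 0xA4A5A5: the U and V bytes are untouched, while the Y byte's three LSBs read as "100".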
The detailed timing diagram and power efficiency of the proposed memory will be discussed in Section VI.

V. EXPERIMENTAL METHODOLOGIES
This section discusses the metrics, methods, and strategies used to evaluate the effectiveness of the proposed technique. The testing and analysis setup used to generate the experimental results is also discussed.

A. VIDEO SELECTION
To evaluate the proposed technique in terms of video quality, power savings, and number of truncated bits, 74 videos with diverse characteristics were selected from Xiph.org Video Test Media [29]. Of those videos, 60 contain facial features to enable ROI preservation using the proposed technique. All videos were converted to the YUV 4:2:0 chroma subsampling standard for ease of bit-truncation. The results are presented in Sections V-B to D. We further conduct statistical analysis using all 889 videos of 1080p resolution or lower selected from the YouTube UGC dataset [30] and Xiph.org Video Test Media [29]. Of those 889 videos, 699 contain facial features for ROI analysis, which will be detailed in Section VI-E.

B. VIDEO FRAME QUALITY METRICS
Existing video quality metrics such as Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) [31], which are used widely to evaluate video quality, fail to incorporate the importance of ROI. This is because these metrics weigh all pixels of the video equally, regardless of the ROI impact on user awareness [32]. For this reason, an additional video quality metric, Weighted PSNR (WPSNR), is also used in this paper to evaluate the quality of videos with ROI [22]. It is defined as in [33], where MSE stands for the Mean Squared Error between the original frame and the frame after truncation, and α (alpha) is the weight assigned to the ROI. The value of α is set to a constant 0.9, following the previous research in [22]. WPSNR is used to evaluate quality for videos with ROI. Videos without ROI information are evaluated using both PSNR and SSIM [32].
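As an illustration of how an ROI weight changes the score, the sketch below computes PSNR and one common ROI-weighted variant. The weighted form used here, a PSNR over α·MSE_ROI + (1 − α)·MSE_non-ROI, is an assumption chosen for illustration and is not necessarily the exact definition in [33]; the function names and the flat-list frame representation are hypothetical.

```python
import math


def _mse(pairs):
    """Mean squared error over (original, truncated) sample pairs."""
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs) if pairs else 0.0


def psnr(orig, test, peak=255):
    err = _mse(list(zip(orig, test)))
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


def wpsnr(orig, test, roi_mask, alpha=0.9, peak=255):
    """Assumed ROI-weighted PSNR: alpha weights the ROI error, 1 - alpha the rest."""
    roi = [(a, b) for a, b, m in zip(orig, test, roi_mask) if m]
    rest = [(a, b) for a, b, m in zip(orig, test, roi_mask) if not m]
    err = alpha * _mse(roi) + (1 - alpha) * _mse(rest)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)
```

With α = 0.9, the same distortion costs far more when it falls inside the ROI than outside it, which is the behavior that plain PSNR cannot express.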

C. SYSTEM-AND CIRCUIT-LEVEL IMPLEMENTATION
The hardware system platform from Fig. 5 implemented an H264 decoder synthesized into a Xilinx Zynq XC7Z010 FPGA fabric. The H264 decoder IP Core was designed using the Xilinx Vivado 2019.2 [34] software design suite. This same decoder is modified to include an MBTM. The FPGA was commanded via an ARM Cortex-A9 Processor running a Linux Operating System through a custom baseband driver.
The circuit-level frame buffer is implemented using a 45nm CMOS technology [35]. The supply voltage is 1.0V. The memory size is 128 words at 24 bits per word.

D. VIDEO QUALITY EVALUATION
All selected videos were analyzed using an in-house custom software tool. The tool operated in the following three-step process: (i) Load one original video frame from memory; (ii) Apply both the method in [7] and the proposed method to the original frame and generate the truncated frame using each method; and (iii) Compare the frames generated against the original frame and calculate the PSNR, SSIM and WPSNR values. With data points collected on a per-frame basis, the average PSNR, SSIM and WPSNR of each video stream was calculated and compared.

E. STATISTICAL HYPOTHESIS VALIDATION
From the proposed method, the following hypothesis was conjectured: the differences between the method in [7] and the proposed method follow a Weibull distribution. This should hold true for power savings and different video quality parameters including PSNR, SSIM, and WPSNR. To support this hypothesis, a goodness-of-fit regression test was performed to determine if the data falls within the probability plot of a Weibull distribution. The intention behind this analysis was to identify patterns in the output videos that serve to estimate the differences in quality and noise for any given input video. If the data follows this hypothesis, this would suggest that the sample set of videos is of adequate size and, as a result, no more videos would need to be tested.
Fig. 10 and Fig. 11 show the post-implementation results of the baseline H264 decoder and the H264 decoder modified to include an MBTM. When comparing both figures, one observes that the Lookup Table (LUT) overhead, which is the additional logic required for the proposed design over the baseline, was 204 LUTs or a 0.38% increase in area. The I/O, which was used for the server-to-mobile device interface, increased by 37, or 29.6%. The power consumption of the modified decoder also increased by 0.068 watts or 0.03%, most of which was attributed to the increased number of I/O. Finally, the Worst Negative Slack (WNS) increased by 0.011ns, which was within acceptable tolerance for this system, as any positive value means that the critical path passes timing constraints. Overall, this additional overhead was negligible when compared against the benefits in power savings and quality improvements achieved using the proposed technique. Table II summarizes the implementation overhead of the proposed technique with a video resolution of 1920 × 1080, which is the maximum resolution supported by the system.
It can be seen that the proposed technique comes at a cost in bitrate, network overhead, and logic gate overhead. Specifically, the mobile device needs to receive 1,200 additional bits on average per frame from the server, which is a 0.02% increase on average per frame and increases network uptime by 240 μs to transmit the protocol's additional bits. It also needs 204 additional LUTs to implement the proposed method, which results in a 0.38% area overhead. The primary advantage of the proposed method is the power savings enabled by the decoder frame buffer memory, which is discussed in detail below.

B. CIRCUIT-LEVEL FRAME BUFFER TIMING DIAGRAM
The proposed frame buffer is shown in Fig. 8 and the simulation timing diagram is shown in Fig. 9. In this waveform, phi2b, T_Y, y!, and y_pre! denote the pre-charge (for un-truncated bits), the bit truncation enable for the Y byte, the power supply for the truncated bitcells (last 3 LSBs of the Y byte), and the pre-charge deactivated signal for the truncated bitcells, respectively. T_U and T_V controlled the bit truncation for the U and V bytes, respectively. Write and read enable signals initiated the write and read operations for the memory accordingly. Data [23:0] were the three bytes of each word of the proposed memory buffer. Here, blue to red lines stand for "don't care" regions. The red lines denote where the rising clock edge was initiated for write and read operations. Finally, the green lines denote that the write and read operations were enabled. All 8 truncation permutations and traditional read and write operations were presented in the timing diagram as an exhaustive simulation of the frame buffer circuit.
It should be noted that, if bit truncation was initiated, the 3 LSBs were truncated from the selected byte or bytes based on the control signals T_Y, T_U, and T_V. During read operations, the 3 LSBs of the truncated bytes were output as "100" through 2:1 multiplexers instead of being read from memory, which saves power. As discussed in Section III-B, the three LSBs of the truncated vectors are set to "100" to maximize power savings. All 8 truncation cases presented in Fig. 12 were tested in 6 ways: when written ('0' to '0', '0' to '1', '1' to '0', '1' to '1') and when read back ('0' and '1'). The power consumed in each case was calculated, and the average is presented. First, a random word was initialized with (A5A5A5)_16; then the same memory word was immediately read back with (F0F0F0)_16, so that all the '1's and '0's written and read received the same priority in the power consumption calculations. The same word consumed 3.90E-4 W without any bit truncation. When any one of the T_Y, T_U, or T_V control options was selected, so that the 3 LSBs of one byte were truncated, 6.67% power was saved compared with no bit truncation. When two options (T_Y & T_U, T_Y & T_V, or T_U & T_V) were selected, so that the 3 LSBs of each selected byte were truncated, 13.33% power was saved. Finally, when T_Y, T_U, and T_V were all selected, 19.74% power was saved. The supply voltage for this simulation was 1 V, at which the proposed frame buffer circuit operated to specification and had no faulty bits.
Table III shows visual frame comparisons for three selected videos with ROI between the proposed method and [7]. It can be seen that the proposed technique enables significant visual quality improvement as compared to [7].
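The frame buffer read-path behavior and the per-case power figures reported in the circuit-level discussion above can be sketched as follows. This is a behavioral illustration only, not the circuit: the byte order Y[23:16], U[15:8], V[7:0] within Data[23:0] is an assumption, and the function names are hypothetical. The savings percentages and the baseline power are the values quoted in the text.

```python
def read_word(stored, t_y=False, t_u=False, t_v=False):
    """Model the read of one 24-bit word, assumed laid out as Y[23:16], U[15:8], V[7:0].

    For each byte whose truncation enable is set, the 3 LSBs are not taken
    from the array; a 2:1 multiplexer outputs the constant '100' instead.
    """
    out = stored & 0xFFFFFF
    for enabled, shift in ((t_y, 16), (t_u, 8), (t_v, 0)):
        if enabled:
            out &= ~(0x07 << shift)   # drop the stored 3 LSBs of this byte
            out |= 0x04 << shift      # mux in the constant '100'
    return out

# Power savings reported in the text, indexed by the number of truncated bytes.
SAVINGS_PERCENT = {0: 0.0, 1: 6.67, 2: 13.33, 3: 19.74}
BASELINE_W = 3.90e-4  # power for the test word without truncation, from the text

def word_power(t_y=False, t_u=False, t_v=False):
    """Estimated power (W) for the test word under the given truncation enables."""
    n = sum(bool(flag) for flag in (t_y, t_u, t_v))
    return BASELINE_W * (1 - SAVINGS_PERCENT[n] / 100)
```

For example, `read_word(0xA5A5A5, t_y=True)` returns `0xA4A5A5`, since only the Y byte has its 3 LSBs replaced by "100".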
Specifically, for the Foreman_cif video, due to the truncated LSBs in [7], the man's cheeks, forehead, and hat shadows experience noticeable banding distortion, negatively affecting video quality. In contrast, the proposed ROI-aware technique effectively reduces the banding distortion and improves the visual quality. Similarly, with [7], the mother_daughter_cif video demonstrates banding distortion around the cheeks and hair, and the carphone_qcif video suffers from discoloration around the cheeks and chin. The ROI awareness introduced by the proposed technique effectively avoids this loss of video quality. Another observation from Table III is that the proposed technique achieves a much higher WPSNR value for all three videos. A more detailed analysis of WPSNR is provided in the next subsection.
Table IV compares the WPSNR values and the number of truncated bits of 60 videos with ROI using the proposed technique to the state-of-the-art [7]. As shown, the proposed technique can enable 26.46% additional truncated bits as compared to [7]. Meanwhile, with the ROI awareness, the proposed technique can effectively enhance the quality of the majority of videos. On average, the proposed technique can increase the WPSNR values by 20.17% as compared to [7].

E. OBJECTIVE VIDEO QUALITY AND BIT TRUNCATION ANALYSIS
We further analyzed the impact of the MB variance characteristics (low, medium, and high variance) on the effectiveness of the proposed technique. The results are shown in Fig. 13. As can be seen, the WPSNR improvement strongly depends on the MB variance of videos. Specifically, videos with high variance achieve the most significant quality improvement using the proposed technique, with a 47.31% WPSNR increase on average. All videos with medium variance also demonstrate quality improvement, with a 13.74% WPSNR increase on average. However, the proposed technique shows little video quality improvement for videos with low variance and even results in minimal video quality degradation (a 1.75% WPSNR loss on average). This suggests that the proposed technique is particularly effective for videos with high and medium MB variance.
We also compared the video quality of 14 videos without ROI using the proposed technique and the state-of-the-art [7]. As shown in Table V, the proposed technique enables a much larger number of truncated bits, at the cost of some quality degradation as compared to [7]. On average, 44.61% additional truncated bits can be achieved for videos without ROI, with a 3 dB PSNR loss. To further assess the video quality loss induced by the proposed technique, the SSIM value of each video was evaluated. As observed in Table V, the average SSIM with the proposed technique and with the design in [7] is 0.9726 and 0.9891, respectively. Using the proposed technique, for the majority of videos (13 out of 14), the SSIM value is above 0.95, which is the acceptable SSIM threshold value [36]; only one video (i.e., bridge_far_cif) results in an SSIM of 0.9453, which is very close to the acceptable SSIM value (i.e., 0.95 [36]). It can be concluded that the proposed technique can yield a significant number of truncated bits while delivering an acceptable perceived quality.

F. VIDEO-LEVEL POWER SAVING ANALYSIS
To compare the power effectiveness of the proposed ROI-aware technique to the traditional memory design and the state-of-the-art [7], we model the power consumption of the memory for a video as:

$$P_i = \sum_{j=1}^{N_i} P_k(j)$$

where N_i is the total number of bytes for video i, and P_k(j) is the normalized power consumption to store byte j with k truncated bits. For the proposed memory, k = 3; for the traditional memory, k = 0; for the memory in [7], k = 0, 1, 2, or 3. For a fair comparison, the normalized power consumption P_k(j) is based on the power consumption reported in [9]. The results are listed in Table VI and Table V. As observed, the proposed technique consumes only 83.79% and 76.56% of the total power on average for videos with ROI and videos without ROI, respectively, as compared to the traditional memory. Also, the proposed technique achieves 3.06% and 8.26% power savings for videos with ROI and videos without ROI, respectively, as compared to [7]. It is worth mentioning that our analysis considers only facial features as the ROI of videos; integrating advanced ROI identification algorithms will convert videos without ROI to videos with ROI, thereby further increasing the effectiveness of our proposed technique for general videos.
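The video-level power model above, a sum of per-byte normalized power over all bytes of a video, can be sketched as follows. The per-byte values in `P_NORM` are placeholders and are not the figures from [9]: P_0 = 1 by normalization, P_3 = 0.80 loosely follows the 6.67% word-level saving reported for one truncated byte out of three (assuming the three bytes share word power equally), and P_1 and P_2 are assumed by linear interpolation. The split between protected and truncated bytes is likewise a toy assumption.

```python
def video_power(truncated_bits_per_byte, p_norm):
    """Sum the normalized power P_k(j) over all bytes j of one video.

    truncated_bits_per_byte: iterable giving k (0..3) for every stored byte.
    p_norm: mapping from k to the normalized per-byte power P_k.
    """
    return sum(p_norm[k] for k in truncated_bits_per_byte)

# Placeholder values for illustration only (NOT taken from [9]).
P_NORM = {0: 1.00, 1: 1 - 0.20 / 3, 2: 1 - 0.40 / 3, 3: 0.80}

roi_bytes, other_bytes = 200, 800   # toy split: protected bytes vs. truncated bytes
proposed = video_power([0] * roi_bytes + [3] * other_bytes, P_NORM)
traditional = video_power([0] * (roi_bytes + other_bytes), P_NORM)
ratio = proposed / traditional      # fraction of the traditional memory's power
```

With these placeholder numbers the ratio is about 0.84; the reported percentages in the text come from the actual per-video byte counts and the power values based on [9].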

G. STATISTICAL ANALYSIS
In order to verify that the selected video analysis results are a representation of the full population of all videos, we further carried out a statistical analysis based on 889 different videos, as discussed in Section V-A. Specifically, the Pearson's Chi-square test [37], which is also known as the Chi-Squared goodness-of-fit test, is used in our analysis.
The goodness-of-fit test checks whether the sample data is likely to be from a specific theoretical distribution, and therefore represents the data expected in the actual population. The idea is, if the sample data does fit an expected distribution, then it shows that the sample data represents the full population of the video data in existence.
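As a minimal sketch of the statistic behind the goodness-of-fit test described above, Pearson's chi-square compares observed bin counts with the counts expected under the candidate distribution. The bin counts below are toy values chosen for illustration, and the function name is hypothetical; converting the statistic to a P-Value (which requires the chi-square distribution) is not shown.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over all bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 30, 30]   # toy bin counts, illustrative only
expected = [20, 20, 30, 30]   # counts predicted by the candidate distribution
stat = chi_square_statistic(observed, expected)  # 4/20 + 4/20 + 0 + 0
```

A small statistic means the sample counts sit close to the expected counts, which corresponds to a large P-Value.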
The statistical results will either reject or accept the working statement called the null hypothesis, H_0, which is the opposite of the alternative hypothesis, H_1. Several methods exist to reject or accept the null hypothesis, one of which is the probability value method (i.e., the P-Value method). The P-Value measures the evidence against the null hypothesis: the smaller the P-Value, the stronger the evidence that the null hypothesis should be rejected. The P-Value method is based on a critical value, which is determined based on the distribution. For example, if we are dealing with a normally distributed population (which we are, according to our statistical results shown later), this critical value is a z-score. The z-score is then used to look up the P-Value in a Standard Normal z-table, and the P-Value is used to test the null hypothesis. If a P-Value is greater than an alpha (α) value of 0.05, then the statistical results are "not significant" and thus the null hypothesis is accepted. However, if the P-Value is less than or equal to 0.05 or 0.01, then the results are "significant" or "highly significant", respectively, and thus the null hypothesis is rejected in favor of the alternative hypothesis. The rejection regions depend on the confidence level that the results are significant; e.g., if the confidence level is 95%, then an α value of 5% or 0.05 is chosen (100% − 95%).
In our analysis, the null hypothesis for the Chi-Squared goodness-of-fit test, H_0, is, "For the given set of video data points, a specified distribution accurately represents the data", and therefore the alternative hypothesis, H_1, is, "For the given set of video data points, a specified distribution does not accurately represent the data." Hence, the goal of the statistical analysis is to validate the null hypothesis and thus deduce that the specified distribution would fit the data.
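The decision rule described above can be written as a short sketch. The function names are hypothetical, and the thresholds of 0.05 and 0.01 are the ones quoted in the text.

```python
def interpret_p_value(p_value, alpha=0.05, strict_alpha=0.01):
    """Map a P-Value to the wording used in the text."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value > alpha:
        return "not significant"        # null hypothesis H_0 is retained
    if p_value <= strict_alpha:
        return "highly significant"     # H_0 rejected in favor of H_1
    return "significant"                # H_0 rejected in favor of H_1

def alpha_from_confidence(confidence_percent):
    """Significance level as a fraction, e.g. 95 -> 0.05 (100% - 95%)."""
    return (100 - confidence_percent) / 100
```

For instance, a P-Value of 0.98 maps to "not significant", so the null hypothesis that the specified distribution represents the data is not rejected.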
To achieve this statistical result, P-Values were calculated for three different video categories (low, medium, and high variance) for power savings and for three different video quality metrics (i.e., PSNR, SSIM, and WPSNR). In our analysis, we used the MathWave Technologies EasyFit software [38] to run the Chi-Squared goodness-of-fit test in order to determine the type of distribution.
The results are shown in Fig. 14-16. Specifically, Fig. 14 demonstrates how categorizing the data creates clear groupings when comparing the truncation method in [7] to the proposed method. The figure shows three distinct 3-parameter Weibull distributions that describe the quality improvement between the proposed method and [7]. These Weibull distributions are within the 99% confidence interval required. All distributions report a P-Value greater than 0.98, implying that we cannot reject the null hypothesis and can accept this distribution as a possible representation of the data. Fig. 15 shows the power savings distribution for each video type as a 3-parameter Weibull distribution. Power savings is reported as a percentage increase, using the total number of bits truncated in each video and the power consumption shown in Fig. 12. All of these distributions pass the 99% confidence interval. Fig. 17 shows the probability of noise increase in a random video stream. All distributions shown fall into the category of Weibull distributions with a 99% confidence interval. Fig. 14 shows the probability of quality drop measured in SSIM for a random video stream, with a 99% confidence interval that the probability lies within a Weibull distribution. The most notable differences between Fig. 17 and Fig. 14 comparison between the entire image without considering the user's perception. It was determined that, because all videos are compared to themselves for improvement (e.g., the video after the proposed method is applied versus the original video), video resolution has no statistical impact on the data set. Power consumption is presented as a percentage improvement, thus ignoring the linear growth in watts saved for larger-scale videos. Similarly, it is statistically sound that a larger dataset is not needed to affirm the distributions. As all distributions shown fall within the 99% confidence interval, there is only a 1% chance that the data collected is far from the specified distribution.
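For readers unfamiliar with the 3-parameter Weibull distribution referred to above, the following sketch evaluates its cumulative distribution function and the bin counts it predicts. This uses the standard textbook form with shape, scale, and location parameters; the parameter values passed in are up to the caller, and nothing here reproduces the EasyFit fitting procedure or the fitted values behind the figures.

```python
import math

def weibull3_cdf(x, shape, scale, location=0.0):
    """CDF of the 3-parameter Weibull: 1 - exp(-((x - location) / scale) ** shape)."""
    if x <= location:
        return 0.0
    return 1.0 - math.exp(-(((x - location) / scale) ** shape))

def expected_counts(bin_edges, n_samples, shape, scale, location=0.0):
    """Expected number of samples in each bin under the given Weibull parameters."""
    cdf = [weibull3_cdf(edge, shape, scale, location) for edge in bin_edges]
    return [n_samples * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]
```

Expected counts of this kind are what a chi-square goodness-of-fit test compares against the observed per-bin counts of the measured differences.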
In summary, videos categorized as high variance show the biggest improvements in WPSNR quality and the most power saving by percentage, and they introduce the least noise as measured by PSNR and SSIM. Medium-variance videos also save on power consumption, but with a more noticeable drop in quality and increase in noise. Videos classified as low variance often have little to gain from this method and sometimes even suffer video quality degradation.

VII. COMPARISON WITH PRIOR WORK
Table VI compares this work against state-of-the-art low-power video memory designs. As shown, our proposed memory enables more flexible run-time power-quality adaptation according to the video content characteristics of each frame, while considering the important region within a frame from a perceptual point of view.

A. COMPARED TO STATE-OF-THE-ART APPROXIMATE VIDEO MEMORIES
To enhance the power efficiency of video storage, approximate video-specific memories have been developed to store the MSBs of video data in more robust memory bitcells, such as more-than-6T SRAM bitcells [8,9] or upsized 6T bitcells [10]. In order to minimize the video quality loss, those techniques typically store MSBs in reliable bitcells (e.g., 8T, 10T, or upsized 6T), thereby leading to a tolerable output quality degradation. However, those approximate video memory designs usually bring large implementation overhead (e.g., up to 52% [9]). More importantly, for those techniques, the achieved video quality is fixed at design time, so they lack run-time adaptation to meet the different requirements of a variety of video applications.

B. COMPARED TO STATE-OF-THE-ART ADAPTIVE VIDEO SRAM
Recently, in order to enable run-time power-quality adaptation, several video SRAM designs have been presented, such as data-dependent memory [39], SRAM with selective Hamming (15,11) [11], and SRAM with error-correction-code (ECC) adaptation [40]. The data-dependent SRAM consists of 10T bitcells and associated conditional pre-charge circuitry to adapt to the stored data's statistical dependencies. The SRAM with selective Hamming (15,11) [11] can switch between no ECC and Hamming (15,11) based on the quality requirement of the video applications. The very recent SRAM design with ECC adaptation [40] supports three different power-quality trade-off levels, including Hamming (7,4), Hamming (15,11), and no ECC. However, those adaptive memory designs focus on efficiency optimization while maintaining an acceptable video quality, such as keeping the PSNR values above 30 dB [11,40], and they did not consider the viewer's experience in the memory design process. Therefore, they may cause large and inefficient design margins.
Table VI (column headings): Heterogeneous sizing SRAM [10]; Split-data SRAM [9]; Data-dependent SRAM [37]; SRAM with Hamming [11]; SRAM with ECC [40]; Viewer-aware memories [4,5,6]; Content-aware memory [7].
VOLUME XX, 2022

C. COMPARED TO STATE-OF-THE-ART VIEWER-AWARE VIDEO MEMORY
By introducing the viewer's experience into the video memory design process, our prior studies showed that memory failures can be leveraged to improve video system power efficiency without sacrificing the viewer's experience [4,5,6]. The basic idea is that, in high noise-tolerance viewing contexts with high-illuminance levels, memory failures are intentionally introduced by adaptively disabling LSBs of the video data stored in memories. This line of studies illustrates a new dimension of power savings for hardware design through the introduction of memory failures. However, those designs did not consider the variance of different videos, and they are not sufficient to support videos with various content characteristics.

D. COMPARED TO STATE-OF-THE-ART CONTENT-AWARE VIDEO MEMORY
The content-aware SRAM presented in [7] is another recent viewer-aware memory design that can enable run-time power-quality adaptation based on the content characteristics of video applications. However, it adapts the number of truncated LSBs of video data based on the average plain macroblock percentage of an entire video sample, so it is not suitable for videos with frame-level differences. Section VI provided detailed comparison results and analysis between the proposed memory and the memory presented in [7] in terms of power savings, video quality, and number of truncated bits. The analysis concludes that the proposed memory enables an average of 26.46% additional truncated bits, 3.06% power savings, and 20.17% WPSNR improvement for ROI videos, as compared to [7].

E. COMPARISON SUMMARY
In our developed video memory technique, the ROI regions are identified and utilized to enable an intelligent trade-off between video quality and power efficiency of video storage in mobile devices. Accordingly, the proposed memory enables run-time quality adaptation with significantly reduced pixel bits and further power savings, as compared to existing techniques. To the best of our knowledge, this is the first work that can adapt the video storage to frame-level video content and to important regions from the point of view of the viewer's perceptual experience. Our proposed ROI-aware video memory is orthogonal to existing viewer-aware or data-dependent schemes and can therefore be utilized simultaneously to further optimize power efficiency.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we have presented a video content-adaptable Region-of-Interest (ROI)-aware video storage technique to optimize power efficiency. The ROI of videos is identified and protected to preserve the video quality, while other regions undergo 3-LSB truncation for power savings. To support the proposed method, a low-power frame buffer that implements 3-LSB truncation was developed, enabling run-time quality and power adaptation. Our results show that the proposed technique uses only 83.79% and 76.56% of the power on average for videos with ROI and without ROI, respectively, as compared to the traditional memory. Meanwhile, for videos with ROI, the proposed technique can increase the quality (i.e., WPSNR values) by 20.17% on average and enable 26.46% additional truncated bits as compared to [7]. For videos without ROI, the proposed technique can realize 44.61% additional truncated bits and 8.26% power savings as compared to [7], while maintaining a healthy PSNR above 40 dB and an SSIM of approximately 0.95 or higher. This paper focuses on facial features as the ROI of videos; our future investigations will extend ROI identification to deal with general videos. Additionally, psychological experiments will be conducted to assess the visual experience of viewers for hardware optimization.