A Superpixel-Wise Just Noticeable Distortion Model

The just noticeable distortion (JND) model characterizes the visibility limits of the human visual system. Human eyes pay different attention, and show different sensitivity, to different regions because those regions contribute differently to the overall perceptual quality. In this paper, a superpixel-wise, region-based JND model (RJND) is proposed. First, an image is segmented into superpixels by simple linear iterative clustering (SLIC). Then, the color contrast of each region is calculated and foveation regions are selected for the image. Based on human visual perception, a region weighting model is established by incorporating the region color contrast and the foveation region modulation. Since the contrast masking (CM) effect alone is imperfect, we combine a texture coarseness factor with the CM effect to obtain a more accurate visual masking estimate. Finally, a new region JND model is established by combining the region weighting model and the coarseness modulation. The experimental results demonstrate that, at nearly the same subjective perceptual quality, the proposed RJND model decreases PSNR more than several existing JND models; that is, it removes more visual redundancy.


I. INTRODUCTION
It is known that the visual sensitivity of the human visual system (HVS) is limited. Human eyes cannot perceive content changes when pixel differences are lower than a certain threshold. The just noticeable distortion (JND) threshold refers to the greatest distortion in an image that human eyes cannot perceive. Currently, JND is widely used in perceptual coding [1], watermarking [2], visual quality assessment [3], and so on.
A number of JND models have been proposed during the past decade. Existing JND models can be divided into two categories according to the domain of the JND thresholds: pixel-wise JND and subband-domain JND. Yang et al. [4] proposed the nonlinear additivity model for masking (NAMM), including the background luminance adaptation (LA) and contrast masking (CM) effects. Liu et al. [5] decomposed an image into texture regions and structure regions to estimate edge masking and texture masking. In [6], Wu et al. exploited the disorderly concealment effect to estimate the JND thresholds of disordered regions more accurately. In [7], Wu et al. introduced the concept of pattern complexity to estimate the total masking effects and proposed a new spatial masking estimation function. Zeng et al. [8] decomposed the image into a structural image, an ordered texture image and a disordered texture image, then estimated their individual JND thresholds, respectively.
Since most videos and images are compressed in the DCT domain, JND models are often established in the DCT domain as well [1]. Wei et al. [9] proposed a classic DCT-domain JND model by incorporating the spatial contrast sensitivity function (CSF), the luminance adaptation effect, and the contrast masking effect. In [10], a saliency-based JND model was proposed by combining a visual attention model with a visual sensitivity model. Utilizing the DCT coefficients to analyze directional information, Wan et al. [11] proposed a JND model based on directional regularity. Wang et al. [12] proposed a novel adaptive foveated weighting JND model considering both the foveated masking effect and the visual attention effect.
With the development of deep learning, novel JND models have also been proposed. Ki et al. [13] proposed a learning-based just noticeable quantization distortion (JNQD) model with a convolutional neural network (CNN), which can adjust JND levels automatically. Liu et al. [14] proposed a deep learning based picture-wise JND prediction model and a sliding window based search strategy to predict the JND thresholds.
The contributions of different regions to the image quality differ, and the HVS holds different susceptibilities for different contents within one scene [15]. Inspired by this, this paper proposes a superpixel-wise JND model based on regions (RJND). A region with uniform features can usually be seen as a visual unit, and we establish a region weighting model to estimate the visual importance of each region: because of its limited perception mechanisms, the HVS pays more attention to important regions and less attention to unimportant ones. Besides, we propose a texture coarseness modulation factor to estimate the texture features of each region and improve the masking effect. Unlike pixel-based models, the proposed model operates on superpixels. The input image is first segmented into superpixel regions by simple linear iterative clustering (SLIC) [16], [17]. Then, the color contrast of each region is calculated. According to the HVS, we select the foveation regions and establish the foveation modulation factor, which, together with the color contrast, estimates the visual importance of each region. Besides, we propose the texture coarseness modulation factor according to the different noise masking abilities of texture regions, which is used to obtain a more accurate estimate of the visual masking effect.
The rest of paper is organized as follows. Section II presents the basic framework of the proposed RJND model. In Section III, the experimental results are shown and discussed. Section IV concludes the paper.

II. THE PROPOSED JND MODEL
Fig. 1 depicts the schematic of the proposed model. First, the input image is segmented into regions by SLIC [16], [17]. The color contrast is calculated to show the different attractiveness of the regions according to the HVS. Then, by incorporating the region-based color contrast and the foveation region modulation, we propose the region weighting model. Besides, the texture coarseness modulation factor is proposed to estimate the masking ability more accurately in combination with the CM effect. Finally, considering the different attention and sensitivity paid to regions with individual features, the proposed model is established by incorporating the above two parts. The experimental results demonstrate that the proposed model is effective.

A. THE SUPERPIXEL SEGMENTATION
A superpixel is a block of adjacent pixels with similar texture, color, brightness and other features, which carries a certain visual significance [16], [17]. With SLIC, each pixel of the image is represented by a 5-dimensional feature vector of LAB color values and XY coordinates. Each pixel is then merged with the nearest cluster center through an iterative clustering process.
The algorithm produces compact superpixels, and each pixel is labeled accordingly. The outlines of superpixels are usually regular in smooth regions and irregular in texture regions, as shown in Fig. 2. The total number K determines the size of the superpixels in an image: the larger K is, the smaller each superpixel is. The superpixel size affects the experimental results only slightly, and we ignore its influence since it is not a factor we investigate here. The value of K is defined as:

K = round(w × h / n)

where K is a positive integer, w and h are the width and height of the image, respectively, and n is set to 1000 in this work.
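As a rough illustration of this setup (assuming K = round(w × h / n) and the usual regular-grid placement of initial SLIC cluster centers; `slic_grid_seeds` is a hypothetical helper, not part of the paper):

```python
import numpy as np

def slic_grid_seeds(width, height, n=1000):
    """Compute the superpixel count K = round(w * h / n) and place
    initial SLIC cluster centers on a regular grid (illustrative only;
    real SLIC then refines these centers by iterative clustering)."""
    k = max(1, round(width * height / n))
    step = int(np.sqrt(width * height / k))   # approximate grid spacing
    xs = np.arange(step // 2, width, step)
    ys = np.arange(step // 2, height, step)
    return k, [(int(x), int(y)) for y in ys for x in xs]

k, seeds = slic_grid_seeds(512, 512, n=1000)
```

In practice a library implementation such as scikit-image's `segmentation.slic` would be used; the sketch only shows how the region count scales with the image size.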

B. THE REGION WEIGHTING MODEL
The region weighting model represents the visual importance of each region: the more important a region is, the more attention human eyes pay to it. We estimate the most attractive regions with the color contrast in the image and the sensitivity of the other regions with the foveation modulation. The region weighting model is then established from the color contrast and the foveation modulation.

1) THE COLOR CONTRAST BASED ON SUPERPIXELS
The color contrast shows the different attractiveness of regions. A region is considered visually important when it looks distinct from its surrounding regions [18]. A superpixel is a region carrying a certain sense of visual perception, and according to the HVS, a region with higher color contrast draws our attention more easily. The calculation is as follows. First, the image is converted to the LAB color space, which is close to the physiological mechanism of perception and is perceptually uniform [19]. Then, according to the superpixel segmentation map, the average values of l, a and b of each region are calculated and written as (l_k, a_k, b_k), and the average coordinates of x and y are written as (x_k, y_k). The color difference and spatial distance between regions k and i are calculated as:

d_lab(k, i) = sqrt( (l_k − l_i)² + (a_k − a_i)² + (b_k − b_i)² )
d_xy(k, i) = sqrt( (x_k − x_i)² + (y_k − y_i)² )

where k and i are region indices, d_lab(k, i) is the Euclidean distance between regions k and i in the LAB color space, and d_xy(k, i) is the positional Euclidean distance. The color contrast is high when the color appearance of a region is distinct from the background. Besides, its surrounding regions should also be considered: a region is more distinct when similar regions are nearby and less distinct when resembling regions are far away [20]. Therefore, the color contrast is proportional to the color Euclidean distance and inversely proportional to the positional Euclidean distance. The color contrast of region k is calculated as:

c_k = Σ_{i≠k} c(k, i),  c(k, i) = d_lab(k, i) / (1 + c_1 · d_xy(k, i))

where c(k, i) is the color contrast between regions k and i. The d_lab(k, i) and d_xy(k, i), normalized to [0, 1], are the Euclidean distance in the LAB color space and the positional distance, respectively. The parameter c_1 is set to 3, the same as in [20].
The output map is shown in Fig. 3. The highlighted regions, which possess higher color contrast, attract the attention of human eyes more easily. They are visually important and should be protected.
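The contrast computation above can be sketched as follows, given per-region mean LAB colors and centroids; the pairwise form c(k, i) = d_lab / (1 + c_1 · d_xy) is an assumed instantiation of "proportional to the color distance, inversely proportional to the positional distance":

```python
import numpy as np

def region_color_contrast(feats, c1=3.0):
    """Per-region color contrast from mean LAB colors and centroids.

    feats: (K, 5) array of per-region means (l, a, b, x, y).
    Both distance matrices are normalized to [0, 1]; the pairwise form
    c(k, i) = d_lab / (1 + c1 * d_xy) is an assumed instantiation.
    """
    feats = np.asarray(feats, dtype=float)
    lab, xy = feats[:, :3], feats[:, 3:]
    d_lab = np.linalg.norm(lab[:, None] - lab[None, :], axis=-1)
    d_xy = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)
    if d_lab.max() > 0:
        d_lab = d_lab / d_lab.max()
    if d_xy.max() > 0:
        d_xy = d_xy / d_xy.max()
    c = d_lab / (1.0 + c1 * d_xy)
    np.fill_diagonal(c, 0.0)        # exclude i == k from the sum
    return c.sum(axis=1)            # c_k = sum over i != k of c(k, i)
```

A region whose LAB color differs strongly from nearby regions receives the largest contrast value, as in the text.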

2) THE FOVEATION MODULATION BASED ON SUPERPIXELS
Foveated masking plays an important role in estimating JND thresholds. Foveations refer to the positions where human eyes gaze in an image. Psychophysical experiments show that visual acuity decreases with increased retinal eccentricity [21]. Pixels in the same superpixel have the same or similar features, so we treat them equally: we consider a superpixel a basic visual unit carrying visual information and establish the foveated masking model at the region level.
There is usually more than one salient region in an image. In this work, we select the five regions with the highest color contrast as foveation regions, and they attract most of the attention. The selected foveation regions are highlighted in Fig. 4. We can see that the selected regions are sensitive to the HVS, including the head, basketball and body regions. We take the average spatial coordinates of a foveation region as a fixation point. With increasing eccentricity, the sensitivity of a region decreases. The eccentricity is defined as:

e_i = tan⁻¹( sqrt( (x − x_i)² + (y − y_i)² ) / d )

where e_i is the eccentricity using the selected region i as the foveation, i = 1, 2, . . . , 5. The x and y are the average spatial coordinates of a region, x_i and y_i are the average spatial coordinates of the selected foveation region, and d is the viewing distance [22]. The position of each region is represented by the average spatial coordinates of the pixels in it. We calculate the foveation modulation factor based on the method in [22].
The foveation modulation factor f_i is computed from the eccentricity e_i, with the parameter λ set to 1 [22]. Since the limited number of superpixels is far smaller than the number of pixels, the calculation process is fast. The final modulation factor is obtained by taking the average of the f_i:
f_fov = (1/m) Σ_{i=1}^{m} f_i

where m is the total number of selected foveation regions. The foveation modulation maps are shown in Fig. 5.
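A minimal sketch of the foveation modulation: the eccentricity follows e_i = tan⁻¹(dist/d), while the per-foveation factor f_i = 1 + λ·e_i is an assumed stand-in for the modulation function of [22] (monotonically increasing with eccentricity, since thresholds grow as sensitivity drops):

```python
import numpy as np

def foveation_modulation(region_xy, fovea_xy, d=3.0, lam=1.0):
    """Average foveation modulation factor for one region.

    region_xy: (x, y) mean coordinates of the region.
    fovea_xy:  (m, 2) mean coordinates of the selected foveation regions.
    d:         viewing distance (the value here is an assumption).
    """
    diffs = np.asarray(fovea_xy, dtype=float) - np.asarray(region_xy, dtype=float)
    dist = np.linalg.norm(diffs, axis=1)   # distance to each fixation point
    e = np.arctan(dist / d)                # eccentricity e_i = arctan(dist / d)
    f = 1.0 + lam * e                      # assumed stand-in for f_i of [22]
    return f.mean()                        # f_fov: average over foveation regions
```

A region coinciding with a fixation point gets the smallest factor; distant regions get larger factors, reflecting their lower visual sensitivity.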

3) THE PROPOSED REGION WEIGHTING MODEL
To keep the effect consistent with the JND model, the visual importance of a region is inversely proportional to the value of the weighting model; in other words, the more visually important the region, the smaller the weighting value. We establish the region weighting model with the color contrast and the foveation modulation. The higher the color contrast, the more visually important the region. According to the HVS, regions with high color contrast should be protected and their corresponding JND thresholds should be suppressed. So the region weighting model is defined as the product of the foveation modulation factor and the opposite of the color contrast:
W_k = f_fov · (c_2 − c_k)

where c_k is the color contrast and f_fov is the foveation modulation factor; the parameter c_2 is set to 1.75 through a number of experiments. The response map of the region weighting model is shown in Fig. 6. Obviously, the important, meaningful regions have lower weighting values, which demonstrates the validity and accuracy of the model.

C. THE TEXTURE COARSENESS BASED ON SUPERPIXELS
Texture coarseness is one of the most important high-level visual features of texture; in a narrow sense, texture sometimes means coarseness [23]. According to the HVS, the sensitivity of human eyes to a region differs with its texture features. For example, texture regions, which have more visual redundancy than smooth regions, can endure more distortion, so texture coarseness can be used as a visual masking factor in the calculation of the JND thresholds of each region. As shown in Fig. 7, the same level of noise is injected into every region of an image; the ability to endure noise differs according to the individual texture features of the regions. We select three patches from the contaminated image and their corresponding patches from the original image for a more detailed visual comparison in Fig. 7; the original patches are on top and the noise-injected patches below. In Fig. 7 (b), the subjective qualities are similar because of the complex texture. However, we can easily perceive the presence of noise in the contaminated patches in Fig. 7 (c) and Fig. 7 (d) compared with Fig. 7 (b), because the texture is smooth in Fig. 7 (c) and the face region in Fig. 7 (d) is salient and sensitive.
According to [23], when two texture patterns differ only in scale, the one with a larger primitive size or fewer repeating units has a larger coarseness value. The Tamura texture coarseness [23] of each region is calculated as follows.
First, take averages around each pixel (x, y) in neighborhoods of different sizes. Different from [23], the size of the neighborhoods is 2n × 2n for a finer size granularity:

A_n(x, y) = ( Σ_{i=x−n}^{x+n−1} Σ_{j=y−n}^{y+n−1} g(i, j) ) / (2n)²

where g(x, y) is the gray value at pixel (x, y) in a region, n = 1, . . . , L_max, and L_max represents the maximum size and is set to 5 here.
For each pixel of a region, take the differences between pairs of averages corresponding to pairs of non-overlapping neighborhoods in the horizontal and vertical orientations:

E_n,h(x, y) = |A_n(x + n, y) − A_n(x − n, y)|
E_n,v(x, y) = |A_n(x, y + n) − A_n(x, y − n)|
At each pixel, pick the best size that gives the highest output value [23]:

E_max(x, y) = max_{n, h/v} E_n,h/v(x, y),  S(x, y) = 2n*

where n* is the value of n that maximizes E in either direction at pixel (x, y). Finally, take the average of S over the region as its coarseness F_crs:
F_crs = (1/n_k) Σ_{(x,y)∈k} S(x, y)

where n_k is the number of pixels of region k and (x, y) are pixel coordinates. The larger the texture primitive size is, the larger the coarseness value is. Regions with a large texture primitive size usually have a sparse and simple texture structure, which is more sensitive for human eyes. Therefore, the final modulation output should be inversely proportional to the texture coarseness value. Since the variability of pixel intensities in a local region also affects the visual quality, we further consider the average gray value difference and incorporate the highest output value E_max as a coefficient. Therefore, for region k, the coarseness modulation factor is defined as:

F_k = σ · E_max / F_crs

where E_max is the maximum of the average gray differences, and σ, related to the reciprocal of the best size, is an adjusting parameter smaller than 1 that prevents excessive distortion. The map of the coarseness modulation factor F_k is shown in Fig. 8. The larger the gray value is, the less sensitive human eyes are, as in the basketball hoop region. This demonstrates that the factor conforms to visual perception.
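The coarseness steps above can be sketched as follows; the zero-padded integral image and the wrap-around shifts at the borders are simplifications of this sketch, not part of [23]:

```python
import numpy as np

def tamura_coarseness(g, l_max=5):
    """Tamura-style coarseness with 2n x 2n neighborhoods, n = 1..l_max.

    Returns (F_crs, E_max): the mean best window size over the region and
    the largest average-gray difference observed.
    """
    g = np.asarray(g, dtype=float)
    h, w = g.shape
    best_e = np.zeros((h, w))               # highest difference seen per pixel
    best_s = np.full((h, w), 2.0)           # S(x, y) = 2n of the best n
    # zero-padded integral image for fast box sums
    ii = np.pad(g, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

    def box_avg(n):
        """Mean of g over the (border-clipped) 2n x 2n window at each pixel."""
        a = np.zeros((h, w))
        for y in range(h):
            for x in range(w):
                y0, y1 = max(0, y - n), min(h, y + n)
                x0, x1 = max(0, x - n), min(w, x + n)
                s = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
                a[y, x] = s / ((y1 - y0) * (x1 - x0))
        return a

    for n in range(1, l_max + 1):
        a = box_avg(n)
        # differences of non-overlapping neighborhoods, horizontal/vertical
        e_h = np.abs(np.roll(a, -n, axis=1) - np.roll(a, n, axis=1))
        e_v = np.abs(np.roll(a, -n, axis=0) - np.roll(a, n, axis=0))
        e = np.maximum(e_h, e_v)
        better = e > best_e                 # pick the size with highest output
        best_e[better] = e[better]
        best_s[better] = 2 * n
    return best_s.mean(), best_e.max()
```

A flat region yields the minimum coarseness (all best sizes stay at 2), while a blocky pattern picks larger best sizes and hence a larger F_crs.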

D. THE PROPOSED MODEL BASED ON SUPERPIXELS
There are two other HVS-related factors: luminance adaptation (LA) and contrast masking (CM) [4]. The LA reflects the different sensitivities of the HVS to different background luminance levels, and the CM indicates the reduction of visibility caused by spatial gradients [4]:

LA(x, y) = 17 · (1 − sqrt(Ī(x, y)/127)) + 3,  if Ī(x, y) ≤ 127
LA(x, y) = (3/128) · (Ī(x, y) − 127) + 3,  otherwise

CM(x, y) = β · G(x, y) · W(x, y)

where Ī(x, y) is the average background luminance at pixel (x, y) [4]. G(x, y) is the maximum weighted average of gradients around pixel (x, y) over four directions θ, and W(x, y) is an edge-related weight computed by edge detection followed by a Gaussian low-pass filter [4]. The control parameter β is set to 0.117.
The CM masking effect is a crucial component, and the existing model is not perfect [5]. According to [24], more quantization noise can be hidden in texture regions without compromising perceptual quality. In order to exploit the region texture feature and achieve a more accurate visual masking estimation, we take both the coarseness modulation factor and the CM effect into account, both computed on the Y channel. In general, the two masking effects are concurrent, and we combine them into the final spatial masking effect. The NAMM model is established as follows:

JND_NAMM(x, y) = LA(x, y) + VM(x, y) − α · min{LA(x, y), VM(x, y)}
VM(x, y) = CM(x, y) + F_k

where α is the gain reduction parameter due to the overlapping effect between the LA and VM effects; its value is set to 0.3, the same as in [4]. The F_k is the coarseness modulation factor, and the VM is the sum of CM and F_k, according to a large number of experimental results. As described above, different regions contribute differently to the perceptual quality of an image, and the importance of the visual content in different regions differs. With the region weighting model obtained in the previous section, the proposed RJND model is defined as the product of JND_NAMM and the region weighting model:

RJND(x, y) = W_k · JND_NAMM(x, y),  (x, y) ∈ region k

where W_k is the region weighting modulation model.
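The NAMM combination and the final region scaling reduce to a few element-wise operations; `rjnd_threshold` is a hypothetical helper name for this sketch:

```python
import numpy as np

def rjnd_threshold(la, cm, f_k, w_k, alpha=0.3):
    """Combine luminance adaptation (la) and visual masking per the NAMM
    rule, then scale by the region weight. la and cm are per-pixel maps
    for one region; f_k and w_k are that region's coarseness modulation
    factor and weighting value."""
    vm = cm + f_k                                   # total visual masking
    jnd = la + vm - alpha * np.minimum(la, vm)      # NAMM: subtract the overlap
    return w_k * jnd                                # region-weighted threshold
```

For example, with la = 4, cm = 2 and f_k = 1, the overlap term removes 0.3 · min(4, 3) = 0.9 from the plain sum 7, giving a threshold of 6.1 before region weighting.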

III. EXPERIMENTAL RESULTS AND ANALYSIS
An excellent JND model injects more noise into regions with higher visual redundancy and less noise into regions with lower perceptual redundancy without compromising the perceptual quality. To evaluate the performance of a model, we inject JND-guided noise into each pixel as follows:

F̂(x, y) = F(x, y) + ϕ · rand(x, y) · JND(x, y)

where F̂(x, y) is the noise-contaminated image and F(x, y) is the input image. The ϕ is the noise level adjuster and rand(x, y) randomly takes +1 or −1.
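A sketch of this evaluation protocol, pairing the noise injection with a standard 8-bit PSNR (the clipping to [0, 255] and the fixed seed are implementation choices):

```python
import numpy as np

def inject_jnd_noise(img, jnd, phi=1.0, seed=0):
    """F_hat = F + phi * rand(+-1) * JND, clipped to the 8-bit range."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=np.shape(img))  # rand(x, y)
    return np.clip(np.asarray(img, float) + phi * signs * jnd, 0, 255)

def psnr(ref, test):
    """Peak signal-to-noise ratio for 8-bit images, in dB."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```

A model whose thresholds are larger in insensitive regions lets more noise in there, so a lower PSNR at equal subjective quality indicates more redundancy removed.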

A. THE ANALYSIS OF PROPOSED FACTORS
In order to verify the contribution of the region weighting modulation and the texture coarseness modulation to the performance improvement, controlled experiments are carried out. Four versions of an image are processed by different JND models: the original model, the model based only on coarseness modulation, the model based only on region weighting modulation, and the proposed model. As can be seen from Fig. 9, these contaminated images achieve nearly the same subjective quality although their PSNR values differ; the difference between any two of them is hard to perceive by human eyes. The original JND model achieves the highest PSNR value while the proposed model achieves the lowest: the PSNR of the proposed model is 3.69 dB, 2.15 dB, and 2.29 dB lower than the other three models, respectively. For the models based only on region weighting modulation or only on texture coarseness modulation, the PSNR values are clearly lower than that of the original JND model and higher than that of the proposed model. The results demonstrate that both factors contribute to estimating the JND thresholds and removing more visual redundancy.

B. THE COMPARISON OF SUBJECTIVE QUALITY
In order to further evaluate the effectiveness of the proposed model, three classic JND models are selected for comparison: the model Liu [5], the model Wu2013 [6] and the model Wu2017 [7]. Since the proposed JND model is built on conventional methods, it is compared only with classic JND models for a fair comparison, without involving deep learning models. Noise is injected into the input images with these JND models.
A model is more consistent with the HVS if it achieves a lower PSNR at the same perceptual quality, meaning it removes more visual redundancy without noticeable visual distortion. In Fig. 10, all the images are contaminated by the different JND models. As can be seen, their perceptual qualities are the same, and the difference between them is hard to perceive by human eyes. However, the proposed JND model achieves the lowest PSNR value at the same perceptual quality: its PSNR is 3.81 dB, 2.86 dB, and 1.87 dB lower than that of the models Liu, Wu2013, and Wu2017, respectively. The result demonstrates that the proposed model conforms to the HVS more accurately. As shown in Fig. 11, five representative patches are selected from Fig. 10 for a more detailed comparison. The head patch, face patch and basketball patch are more easily noticed by human eyes because of their important and meaningful visual content. For the head patch, the proposed model injects less noise into the hair area and achieves the best visual perception. For the face patch, there is obvious distortion in the eye area for the model Wu2017. For the basketball patch, the models achieve nearly the same subjective quality; the proposed model even achieves better visual quality because it protects this smooth and salient region. The hoop patch can endure more noise because of its complex texture, and the floor patch can be injected with an appropriate amount of noise because human eyes pay less attention to the background. As the enlarged local patches show, the proposed model injects more noise into insensitive regions without noticeable visual distortion, approaching an optimal noise allocation strategy.
Moreover, we further tested the image ''Airplane'' for comparison in another scene, as shown in Fig. 12. It can be observed that the proposed model achieves the lowest PSNR value. Fig. 13 shows more detailed visual information. In Fig. 13 (a), the model Liu maintains the edge information and achieves the highest PSNR value. In Fig. 13 (b), produced by the model Wu2013, the disorderly concealment effect is utilized but the numbers region, which is relatively sensitive to human eyes, is overvalued. In Fig. 13 (c), the model Wu2017 considers the pattern complexity, but it injects too much noise into complex texture regions and causes noticeable distortion. With the lowest PSNR value, the proposed model achieves comparable, even better, perceptual quality, as shown in Fig. 13 (d). The experimental results demonstrate that the proposed model conforms to the HVS more accurately and removes visual redundancy more effectively.

C. THE COMPARISON OF PSNR AND MOS
For a more comprehensive analysis, images taken from sequences that are often used in video codec and JND tests are selected to compare the models. The peak signal-to-noise ratio (PSNR) and the mean opinion score (MOS) are recorded as evaluation criteria. Twenty assessors were invited to give a MOS score from 1 to 5 for each contaminated image compared with the original image; twelve of them have experience in image processing and the other eight do not. The higher the score, the better the subjective quality. The viewing conditions follow the ITU-R BT.500-11 standard [25]. Table 1 shows the PSNR and MOS results of the test images with the four models. A JND model is considered better when it achieves a lower PSNR at the same or better perceptual quality level. From Table 1, we find that the PSNR values differ while the perceptual qualities are similar. The proposed model achieves the lowest average PSNR and the highest average MOS compared with the other models.
Comparing average values, the proposed model achieves a PSNR 1.984 dB lower than the model Liu at nearly the same perceptual quality. Compared with the models Wu2013 and Wu2017, the PSNR is reduced by 1.199 dB and 0.730 dB, respectively, while the proposed model achieves better perceptual quality. The results demonstrate that the proposed model outperforms the other models comprehensively and is more consistent with subjective perception.

IV. CONCLUSION
In this paper, we propose a novel superpixel-wise JND model based on superpixels and region features. The image is first segmented into regions by SLIC. According to the HVS, the region-level color contrast and the foveation modulation are calculated to establish the region weighting modulation, which estimates the visual importance of each region. Besides, we propose the coarseness modulation factor for a more accurate visual masking effect, utilizing the different sensitivities of human eyes to different texture regions. Finally, the proposed model is established by incorporating the above two parts on top of the CM and LA effects. The model considers not only the attention that human eyes pay to local regions, but also their ability to endure noise. Experimental results show that the proposed model can endure more distortion at the same perceptual quality. In the future, we will study deep learning based pixel-wise JND models as more related public databases are built.