Eye Tracking, Saliency Modeling and Human Feedback Descriptor Driven Robust Region-of-Interest Determination Technique

Region of interest (ROI) analysis is widely used in image analytics, video coding, computer graphics, computer vision, medical imaging, nuclear medicine, computed tomography and many other areas of medical application. The ROI determined by a subjective method (e.g. using human vision) often differs from that of an objective one (e.g. using mathematical modelling). However, there is no existing method in the literature that can provide a single decision when ROI data from both methods are available. To address this limitation, a robust algorithm is developed that combines human eye-tracking (subjective) and graph-based visual saliency modelling (objective) information to determine a more realistic ROI for a scene. To carry out this process, on the one hand, several independent human visual saliency factors such as pupil size, pupil dilation, central tendency, fixation pattern, and gaze plot are collected from a group of twenty-two participants watching a set of eighteen publicly available video sequences. On the other hand, the features of graph-based visual saliency (GBVS) highlight conspicuity in the scene. Gleaned from these two pieces of information, the proposed algorithm determines the final ROI based on some heuristics. Experimental results show that, for a wide range of video sequences and compared to the existing deep learning-based (MxSalNet) and depth pixel-based (DP) ROI, the proposed ROI is more consistent with the benchmark ROI, which was previously decided by a group of video coding experts. As the subjective and objective options frequently create ambiguity in reaching a single decision on the ROI, the proposed algorithm can determine an ultimate decision, which is eventually validated by experts' opinion.

…, [17], [18], [19]. Most of the contributions presented here use some statistical correlation to determine fixation mapping, saliency-based visual prediction, object tracking and human attention in a scene. However, the literature shows that a more accurate approach to determining actual gaze locations is to use a gaze-tracking device (e.g. an eye tracker) [20]. … the IKN [19] and Itti-Baldi (IB) [18] visual attention models, observing that IKN showed better accuracy than IB when eye-tracker data were used as the benchmark ground truth. Jia et al. [33] proposed a no-reference video quality assessment algorithm based on eye-tracking data for four different videos.

Therefore, our motivation is to draw a close comparison between eye-tracking and GBVS-generated salient points, acquire knowledge of their similarity-divergence relationship, employ a number of visually sensitive features to highlight conspicuous regions in the scene, apply some heuristics to determine the final ROI gleaned from the subjective and objective information, and finally compare it to the benchmark ROI determined by experts' opinion to derive a more realistic ROI of a scene. Besides this, to demonstrate consistency, the proposed method is compared with recent deep learning-based and depth pixel-based ROI estimation methods. This work can be applied in the areas of video compression, medical image analysis, image segmentation and many more.

The main contributions of this paper can be summarized as follows:

(i) Carry out a comprehensive analysis of eye-tracking data and associated parameters for visual saliency modelling.

(ii) Investigate the similarity-divergence relationship through a close comparison between eye-tracking (i.e. subjective) and graph-based visual saliency (i.e. objective) modelling information to develop a robust ROI.

(iii) Mathematically analyze the parameters that fix the in-focus region by analyzing eighteen videos watched by twenty-two people.

(iv) Develop an algorithm to determine a more realistic ROI of a scene when both subjective and objective information are available.

(v) Compare the proposed algorithm with recent deep neural network-based and depth pixel-based ROI approaches.

(vi) Incorporate a group of video coding experts' opinions to justify and validate the finally decided ROI.
The remainder of the paper is organized as follows: Section II focuses on the proposed experimental set-up; Section III presents the experimental details; the experimental results are broadly discussed in Section IV, while Section V concludes the paper.

It is noticed from FIGURE 1 that for almost all the sequences used in this experiment, regardless of their duration and emotional sensitivity, the right-eye pupil sizes are always greater than the left ones for all participants. It is also noticed that in almost all videos, the average left and right pupil size of each individual participant is equal to or greater than 3.5 mm. Normal pupil size tends to range between 2.0 and 5.0 mm depending on the lighting. As the effect of lighting was not taken into account in this experiment, the recorded pupil sizes would be suitable to capture relevant information while watching videos [9].

Now, we analyse every video in the context of human visual system data, which is captured using the eye tracker.

FIGURE 6. Depth pixel-based saliency point. Here the red and green marked points are candidates, but the green marked area is the finally selected saliency region.

From FIGURE 8, it is observed that the human visual system (i.e. subjective estimation) for most of the videos is closely related to the expert opinion, while MxSalNet, DP, and GBVS do not always become identical with the subjective ones because of their operational dependency on high contrast, object motion, brightness, and resolution.
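The per-participant pupil-size screening described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function names and the sample lists are hypothetical, and only the 2.0-5.0 mm normal range and the 3.5 mm observation come from the text.

```python
# Sketch (hypothetical helper names): screening per-participant average pupil
# size, assuming eye-tracker samples are pupil diameters in millimetres and
# None marks a blink or tracking dropout.
def mean_pupil_size(samples):
    """Average pupil diameter (mm) over valid samples."""
    valid = [s for s in samples if s is not None]
    return sum(valid) / len(valid) if valid else 0.0

def within_normal_range(mean_mm, low=2.0, high=5.0):
    """Normal pupil size tends to range between 2.0 and 5.0 mm."""
    return low <= mean_mm <= high

left = [3.6, 3.7, None, 3.8]   # hypothetical left-eye samples (mm)
right = [3.9, 4.0, 4.1, None]  # hypothetical right-eye samples (mm)
```

A recorded participant would be retained when both eye averages fall in the normal range, matching the observation that averages stayed at or above 3.5 mm.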

As the human visual system data is closely related to expert opinion, and as this group has specialization in video analysis, coding, compression, quality and image processing, we take expert opinion as the ground truth of our analysis. From the analysis of a particular frame of the videos, we observe that there is a good correlation between the human visual system and expert opinion, while DP, MxSalNet, and GBVS are far from expert opinion. Thus, it is evident that a software-based ROI is not always steadfast in defining the actual ROI of a human.
In (2) and (3), γ = 0.125. The value ranges from 0 to 1, where 0 is considered the centre of the image and 1 the last outer focusing tyre. When we provide the eye-tracker data, i.e. ∀, ∂, µ and ϕ, it yields the focusing tyre of ETRD, EO and GBVS for that video. For any sports video like Soccer, the position of the ball keeps changing while the game is in progress. Hence the focal point needs to be considered; it may be the surroundings of the ball. As in GBVS, areas dominated by high motion and brightness, such as red, yellow and green light symbols, also provide significant information which may be considered. The ROI modification parameter S_M is determined from the centre sensitivity S_C, the in-focus region R_I, and the context-based saliency C_S.
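The tyre mapping above can be sketched as follows. This is an assumption-laden illustration, not the paper's equations (2)-(3): we assume γ = 0.125 is the width of one tyre in normalised radial distance, so the frame is divided into 1/γ = 8 concentric focusing tyres, and we assume the distance is normalised by the centre-to-corner distance.

```python
import math

# Sketch (assumptions, not the paper's exact formulation): normalised radial
# position of a gaze point, 0 at the image centre and 1 at the outermost
# focusing tyre; with gamma = 0.125 this yields 8 concentric tyres.
GAMMA = 0.125

def radial_position(x, y, width, height):
    """Distance of (x, y) from the image centre, normalised to [0, 1]."""
    cx, cy = width / 2.0, height / 2.0
    d = math.hypot(x - cx, y - cy)
    d_max = math.hypot(cx, cy)  # centre-to-corner distance (assumption)
    return d / d_max

def tyre_index(x, y, width, height, gamma=GAMMA):
    """Index of the focusing tyre (1 = innermost) containing the point."""
    r = radial_position(x, y, width, height)
    return min(int(r / gamma) + 1, int(round(1.0 / gamma)))
```

Applying this to the ETRD, EO and GBVS points of a frame gives a comparable tyre index for each, as used when satisfying the centre sensitivity S_C.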

GBVS alone cannot provide the actual ROI of a frame, and when both GBVS and ETRD are available, there is no method to reach an ultimate single decision on the actual ROI. For this reason, a new algorithm is proposed in this paper for the determination of the ROI based on the subjective (ETRD) and objective (GBVS) models, which can meet the ROI of humans. The distance from the centre to the ETRD-ROI, ρ, the distance from the centre to the GBVS-ROI, τ, and the tyring concept are used to satisfy the centre sensitivity S_C.

To determine the in-focus region R_I, the fixation f is considered, which measures when and for how long a participant gazes at a region. Besides this, motion detection is used to understand the context C_S of the video.
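The motion-based context cue can be sketched as below. The paper does not specify its motion detector, so this stands in with simple frame differencing; the function names and the threshold value are assumptions.

```python
import numpy as np

# Sketch (assumption: plain frame differencing stands in for the paper's
# unspecified motion detector) to flag a high-motion context C_S.
def motion_score(prev_frame, curr_frame):
    """Mean absolute luminance difference between consecutive frames."""
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    return float(diff.mean())

def is_high_motion(prev_frame, curr_frame, threshold=10.0):
    """Hypothetical threshold: treat the scene as a high-motion context."""
    return motion_score(prev_frame, curr_frame) > threshold
```

For a Soccer-like sequence, such a score would stay high while the ball is in play, signalling that the context-based saliency C_S should favour the moving region.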

The determined point and its surrounding 20% area are considered as the ROI for better visualization. To verify the proposed algorithm's output, the distance of the ETRD- and GBVS-based ROIs from the EO-based ROI for each video is presented in FIGURE 17.
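The final cropping step can be sketched as follows. This is one plausible reading, labelled as an assumption: "surrounding 20% area" is taken here as a window whose sides are 20% of the frame dimensions, centred on the determined point and clipped to the frame; the function name is hypothetical.

```python
# Sketch (assumption: the "surrounding 20% area" is a box with sides equal to
# 20% of the frame width and height, centred on the determined point and
# shifted inward when it would fall outside the frame).
def roi_window(x, y, width, height, fraction=0.2):
    """Return (x0, y0, x1, y1) of the ROI box around the determined point."""
    w, h = int(width * fraction), int(height * fraction)
    x0 = min(max(x - w // 2, 0), width - w)   # clip horizontally
    y0 = min(max(y - h // 2, 0), height - h)  # clip vertically
    return (x0, y0, x0 + w, y0 + h)
```

The centres of such boxes for the ETRD-, GBVS- and EO-based ROIs would then give the per-video distances compared in FIGURE 17.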