Modeling Temporal Visual Salience for Human Action Recognition Enabled Visual Anonymity Preservation

This paper proposes a novel approach for visually anonymizing video clips while retaining the ability to perform machine-based analysis of the clips, such as human action recognition. The visual anonymization is achieved by a novel method that generates the anonymization silhouette by modeling the frame-wise temporal visual salience. These temporal salience-based silhouettes are then analysed by extracting the proposed histograms of gradients in salience (HOG-S) to learn the action representation in the visually anonymized domain. Since the anonymization maps are based on temporal salience maps represented in gray scale, only the moving body parts related to the motion of the action are represented with larger gray values, forming highly anonymized silhouettes. This results in the highest mean anonymity score (MAS), the least identifiable visual appearance attributes and a high human-perceived utility in action recognition. In terms of machine-based human action recognition, the proposed HOG-S features result in the highest accuracy rate in the anonymized domain compared to those achieved with the existing anonymization methods. Overall, the proposed holistic human action recognition method, i.e., temporal salience modeling followed by HOG-S feature extraction, achieves the best human action recognition accuracy rates for the DHA, KTH, UIUC1, UCF Sports and HMDB51 datasets, with improvements of 3%, 1.6%, 0.8%, 1.3% and 16.7%, respectively. The proposed method outperforms both feature-based and deep learning-based existing approaches.


I. INTRODUCTION
Vision-based human action recognition (HAR) plays an important role in surveillance [1], [2], human computer interaction [3], human object interactions [4], healthcare monitoring [5], assisted living [6], [7], smart homes [8], [9], etc., since vision sensors are informative [10]-[12]. Such fusion of vision sensors and computer vision has become essential for monitoring daily human actions in ambient assisted living (AAL) [6], [7], [13], [14], although human action recognition is a challenging task [15]. However, the use of vision sensors for in-home monitoring often raises concerns about protecting visual privacy [6], [16]-[18]. Current solutions to address visual privacy concerns in video are mainly based on processing the pixel intensity values spatially to cover the identity details. These cover the face or the whole body by means of masking [19], blurring [20] and pixelation [21]. However, after visual anonymization, the utility of such sequences in visual analysis, such as action recognition, is severely affected. Some applications, such as assisted living, require analyzing such visually anonymized video for tasks like human action recognition. Therefore, new algorithms that can visually anonymize monitoring video while retaining the utility of the video for automated analysis are required. In this paper, we propose a new method for visual anonymization of video that retains the salient features important for human activity recognition in the visual anonymity domain.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.
Visual anonymization in monitoring applications usually adopts image processing techniques, such as Gaussian blurring [20], pixelation [21], blocking [22], cartooning [23] and masking with a solid silhouette [19], to obfuscate the sensitive information. However, these methods must consider the trade-off between the visual anonymity and the utility of the anonymized sequences for monitoring tasks [24]. Achieving this trade-off is one of the major challenges associated with using video cameras in AAL. In the case of privacy concealment, the existing filtering-based models lose the accuracy of the low-level features needed for modeling the most dominant human body parts that are responsible for representing the action. Thus, discrimination among the actions tends to be inaccurate from the perspective of both human vision and computer vision. Therefore, exploiting the spatial content to obfuscate the identity leads to inaccurate modeling and loses the discrimination among the actions in HAR.
Recently, visual saliency detection for video has been proposed to highlight the most dynamic salient content in video sequences [25]-[30]. The outcome of video saliency is a useful abstraction of the most dominant visual information in the scene without showing the details, since the salient content is represented by highlighting the essential content, simulating perception in the human vision system (HVS). Visual saliency can arise from spatial attentive cues, as in images, as well as from temporal saliency due to the motion in a video sequence. Although salience estimation for video has become a widely addressed topic recently, the existing methods consider joint spatial and temporal salience modeling. However, since our focus is on utility tasks, such as HAR, in this paper we propose a novel temporal salience estimation and demonstrate the use of such salience maps for visual anonymization and HAR. Temporal saliency also appears to be a useful tool for addressing challenges such as background clutter, often seen in computer vision, since the spatial content is excluded when modeling the temporal salience. Also, we aim to compute the temporal salience as a gray scale map that highlights regions from the least salient to the most salient using gray values from 0 to 255, respectively.
Our proposal is to replace the video sequences with the computed temporal salience map sequences and then explore the salience sequence for utility tasks, such as HAR. The computed temporal salience sequences not only capture the temporal events, as in emerging neuromorphic (event-based) cameras [31], but also record the significance of those events by recording the magnitude of pixel-wise salience in a 0-255 range. Early results of our work were presented as conference papers [17], [32]. This paper extends the model with an analysis of new HAR descriptors, an extension to visual anonymization and an evaluation of visual anonymity using both objective and subjective metrics. The main contributions of this work are:
1) A new methodology for estimating the temporal saliency based on modeling the intensity changes between successive frames.
2) Exploring the temporal saliency maps for achieving visual anonymity, addressing privacy concerns in video-based monitoring.
3) A methodology for exploring the anonymity domain by extracting new Histogram of Oriented Gradients in Salience (HOG-S) features for HAR.
The rest of this paper is organized as follows: Section II reviews the related work in the literature. Section III presents the proposed method for extracting the temporal visual salience maps for visual anonymization and extracting features in the anonymized domain for HAR. The performance evaluation of the proposed methodology in terms of both visual anonymization and anonymized domain HAR is presented and discussed in Section IV, followed by the conclusions in Section V.

II. RELATED WORK
In this section, we briefly present the recent work on both privacy preservation and HAR.

A. PRIVACY PRESERVATION
Besides the work in this paper, other anonymization methods have emerged as valuable efforts to preserve privacy. However, these methods mostly focus on covering the identity silhouette using image processing in the spatial domain [19]-[23] or on the use of low-resolution visual sensors [33]-[36], where less information for visual recognition is present. The low-resolution approach adopts a network of extremely low-resolution cameras [33]-[35] or low-resolution colour sensors [36] to capture low-resolution visual images. These sensors have been successfully exploited in activity recognition [33], behaviour understanding [34] and object localisation [36]. However, these sensors are more sensitive to local changes in the lighting conditions [34], [36], which affects the reliability of HAR.
The second category of solutions adopts image processing techniques, such as blocking [22], cartooning [23], blurring [20] and pixelation [21], to obfuscate the sensitive information. Their main characteristics are summarised in TABLE 1. These image filtering-based methods destroy the original intensity magnitudes and, with them, the valuable features. Therefore, exploring the anonymity domains of these methods for HAR reduces the recognition accuracy. Furthermore, the trade-off between the privacy protection and the utility of the anonymized sequences for monitoring tasks has to be considered [24]. Often, a higher level of privacy protection means a lower level of utility and vice versa. This trade-off is one of the major challenges associated with using video-based vision sensors in AAL applications. Therefore, our proposed approach is a valuable contribution to the development of algorithms that preserve privacy while enabling subsequent utility tasks, such as HAR.

B. ACTION RECOGNITION USING HAND-CRAFTED FEATURES
Several recent works represent actions based on hand-crafted feature extraction. One of the most widely considered algorithms is the local dense trajectories representation using the Histogram of Oriented Gradients (HOG) [37], due to its robustness [38]. The existing works on HOG-based HAR fall into two themes: 2D HOG [39]-[41] and 3D HOG [42]-[44] representations. In the first category, the dense features are extracted from a single image/frame that shows the motion history. In the second category, a volumetric representation in space-time is exploited to represent the action. However, in both categories, redundant data, such as the background, is used when extracting the features that describe the actions. This redundancy reduces the discriminating power of the descriptor, increases the storage requirements and raises the complexity. These problems can be addressed by determining candidate local interest points [45], although interest point-based learning also has many problems. The existing methods for HAR are based on the raw data domain, such as colour video, and do not perform well on sequences that are visually anonymized by image processing.
Recently, saliency estimation has attracted much attention in image and video processing [25]-[30]. Visual saliency estimation algorithms highlight the most important visual content, i.e., the foreground, and attenuate the rest, i.e., the background. This representation substitutes the intensities with the salience magnitudes and reduces the redundancy by modeling the saliency map. Thus, visual saliency offers a tool for addressing the above-mentioned problems with visual information [46], [47], and makes the saliency-based representation useful and accurate for feature learning applications.
The existing video salience algorithms focus on joint spatiotemporal salience. However, for our work we intend to use temporal salience only. Hence we propose a new approach for temporal salience estimation for video. There is also an added advantage in exploring temporal salience maps for HAR, as such maps have already abstracted the original sequences into a motion-driven event map sequence in which the significance of the events is highlighted in gray scale.

III. THE PROPOSED METHOD
Our proposed method is two-fold: 1) temporal visual salience mapping for visually anonymizing the video sequences and 2) human action recognition in the visually anonymized domain. For the former, we propose a novel method for estimating temporal visual saliency, as detailed in Section III-A. For the latter, we propose the Histograms of Gradients in Salience (HOG-S) features extracted from the anonymity domain, i.e., the temporal visual salience map sequence, as presented in Section III-B. It should also be noted that many traditional HAR methods [48] begin with temporal shot segmentation [49]-[51]. However, the method detailed in this paper mainly focuses on action recognition from the temporal salience maps of a given temporal window of video frames.

A. TEMPORAL VISUAL SALIENCE MODELING FOR VISUAL ANONYMIZING
Let the action dataset contain $V$ video sequences and a set $Q$ of action classes, where $s_i$ is the sequence with index $i$, containing $F$ frames and carrying an action label $q \in Q$. The proposed algorithm starts by calculating the frame difference, $D_t$, between each two consecutive frames, $f_t$ and $f_{t-1} \in s_i$, where $t$ is the frame index, to define the change in the pixel intensity over time, as
$$D_t(x, y) = f_t(x, y) - f_{t-1}(x, y), \tag{1}$$
for all $(x, y)$ spatial coordinates. The difference at a given pixel can occur for several reasons, for example, illumination change and global motion. Therefore, the frame difference is compared with a user-defined threshold, $\tau$, in order to eliminate the small changes and maintain the dominant moving pixels as follows:
$$\tilde{D}_t(x, y) = \begin{cases} D_t(x, y), & |D_t(x, y)| > \tau, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
where $\tilde{D}_t(x, y)$ is the frame difference at location $(x, y)$ filtered with respect to the threshold $\tau$. Note that $|\cdot|$ denotes the absolute value.
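The differencing and thresholding steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `thresholded_frame_difference` is hypothetical, and both the signed difference and the strict comparison against the threshold are assumptions, since the text only states that small changes are eliminated.

```python
import numpy as np

def thresholded_frame_difference(prev_frame, curr_frame, tau):
    """Frame difference with changes no larger than tau set to zero."""
    # Cast to a signed type so that uint8 subtraction cannot wrap around.
    diff = curr_frame.astype(np.int32) - prev_frame.astype(np.int32)
    # Keep only pixels whose absolute change exceeds the threshold
    # (strict inequality is an assumption of this sketch).
    return np.where(np.abs(diff) > tau, diff, 0)

# Tiny made-up frames, used only to illustrate the behaviour.
prev_frame = np.array([[10, 10], [10, 10]], dtype=np.uint8)
curr_frame = np.array([[12, 60], [10, 0]], dtype=np.uint8)
filtered = thresholded_frame_difference(prev_frame, curr_frame, tau=4)
```

In this toy case the change of 2 gray levels is suppressed, while the changes of +50 and -10 survive the threshold.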
Next, for each pixel location $(x, y)$, we compute the Shannon entropy, $E_t(x, y)$, of the normalised power spectral density (PSD) of the thresholded frame difference values, $\tilde{D}_t$, within an $N \times N$ pixel block centred at $(x, y)$. Let $b_m \in B$ be the corresponding $N \times N$ block, with $B = \{b_1, b_2, \cdots, b_M\}$, where $M$ is the total number of blocks and $m$ is the block index. To form the blocks for pixels at the frame borders, the frame borders are padded with the number of zero values required by the chosen $N$. The PSD for each block, $S_{b_m}$, is defined as
$$S_{b_m}(u, v) = \frac{1}{N^2} \left| \mathcal{F}_{b_m}(u, v) \right|^2, \tag{3}$$
where $\mathcal{F}_{b_m}$ is the 2D FFT of the block $b_m$ and $(u, v)$ are the frequency indices. $S_{b_m}$ is normalised to suppress the high variation among the PSDs of different blocks. This is achieved by normalising with respect to the sum of all PSD components of a given block. This is followed by the computation of the Shannon entropy, $E_{b_m}$, of the normalised PSD of $b_m$ in order to obtain $E_t(x, y)$. The computation of $E_t(x, y)$ captures the contribution of the difference values in the neighbourhood of $(x, y)$. The entropy $E_t(x, y)$ is proportional to the amount of variation in the magnitudes of the corresponding $S_{b_m}$: the higher the variation in the magnitudes in $S_{b_m}$, the higher the value of $E_t(x, y)$. This local spectral entropy value, $E_t(x, y)$, captures the variations in the frame difference and thereby identifies the temporal salience in a frame. It exploits the source of the most dominant intensity changes to model the underlying motion (with respect to the action). In most cases, it is difficult to determine the ideal value of $\tau$ in Eq. (2) that maintains the desired changes and suppresses other noisy changes, because the motion levels vary according to the actions in the sequences. To make this representation more robust and generalised, we further vary $\tau$ by defining a set of thresholds, $\tau_h = 2^h$, where $h = 1, \cdots, H$, with $H$ being the user-defined maximum number of threshold levels. For each pixel location $(x, y)$, a set of entropy values, $E_t^{\tau_h}(x, y)$, for the corresponding block, $b_m$, is computed over all $\tau_h$.
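The local spectral entropy computation can be illustrated with the following sketch. It assumes a PSD of the form $|\mathrm{FFT}|^2 / N^2$, a base-2 logarithm for the Shannon entropy, and an entropy of zero for blocks with no energy; none of these details is fixed by the text. The function names are hypothetical, and the explicit double loop favours readability over speed.

```python
import numpy as np

def block_spectral_entropy(block):
    """Shannon entropy (in bits) of the normalised PSD of one N x N block."""
    n = block.shape[0]
    spectrum = np.fft.fft2(block.astype(np.float64))
    psd = (np.abs(spectrum) ** 2) / (n * n)
    total = psd.sum()
    if total == 0:
        return 0.0  # no temporal activity in this block
    p = psd / total          # normalise by the sum of all PSD components
    p = p[p > 0]             # skip empty bins, since 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def spectral_entropy_map(diff, n=3):
    """Entropy of the N x N block centred on every pixel, with zero padding."""
    pad = n // 2
    padded = np.pad(diff.astype(np.float64), pad, mode="constant")
    out = np.zeros(diff.shape, dtype=np.float64)
    for y in range(diff.shape[0]):
        for x in range(diff.shape[1]):
            out[y, x] = block_spectral_entropy(padded[y:y + n, x:x + n])
    return out
```

Under these assumptions a block with no change maps to zero entropy, while a block containing a single isolated change has a flat spectrum and therefore reaches the maximum entropy of $\log_2 N^2$ bits.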
Finally, the weighted entropy, $\hat{E}_t(x, y)$, across all entropy maps, $E_t^{\tau_h}(x, y)$, over all $H$ thresholds is computed according to Eq. (4). This entropy map is then normalised to the gray level range [0, 255] and smoothed by applying a 2D Gaussian kernel in order to fill in the small holes and obtain the final temporal visual salience map based silhouette, $S_t$. The smoothing links neighbouring pixels that are close to each other to construct the temporal silhouette region. The generation of the silhouette of the human in action based on the proposed temporal salience estimation algorithm is shown as a block diagram in FIGURE 1 and summarized in Algorithm 1.

Algorithm 1 (excerpt)
Filter $D_t$ to get $\tilde{D}_t$ using Eq. (2).
for each location $(x, y)$ do
    Consider the $N \times N$ pixel block centred on $(x, y)$.
    Compute the 2D FFT of block $b_m$.
    Compute the PSD, $S_{b_m}$, of $b_m$ using Eq. (3).
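The final stage, which combines the per-threshold entropy maps into one silhouette, might look like the sketch below. The weights of the weighted entropy are not recoverable from this section, so uniform weights are used as a placeholder assumption. The helper names are hypothetical, and scaling by the maximum value is only one possible way of normalising to [0, 255].

```python
import numpy as np

# Threshold set tau_h = 2^h for h = 1..H, here with H = 7.
thresholds = [2 ** h for h in range(1, 8)]

def gaussian_smooth(image, sigma):
    """Separable 2D Gaussian smoothing with zero padding at the borders."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()

    def blur_1d(line):
        # Pad so that the "valid" convolution returns the original length.
        return np.convolve(np.pad(line, radius), kernel, mode="valid")

    rows_blurred = np.apply_along_axis(blur_1d, 1, image)
    return np.apply_along_axis(blur_1d, 0, rows_blurred)

def temporal_salience_map(entropy_maps, weights=None, sigma=6.0):
    """Weighted sum of entropy maps, scaled to [0, 255] and smoothed."""
    stack = np.stack(entropy_maps).astype(np.float64)  # shape (H, rows, cols)
    if weights is None:
        # Placeholder assumption: every threshold level counts equally.
        weights = np.full(stack.shape[0], 1.0 / stack.shape[0])
    weighted = np.tensordot(np.asarray(weights, dtype=np.float64), stack, axes=1)
    peak = weighted.max()
    scaled = weighted * (255.0 / peak) if peak > 0 else weighted
    return gaussian_smooth(scaled, sigma)
```

Each entry of `entropy_maps` would be the local spectral entropy map obtained for one threshold in `thresholds`. Because the Gaussian kernel sums to one, the smoothed output stays within the [0, 255] range.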
FIGURE 2(c) shows $D_t$ for a chosen threshold, $\tau_h$. FIGURE 2(d) shows $E_t^{\tau_h}(x, y)$ for pixels along two lines, $x = 114$ (in blue) and $x = 350$ (in red). There is no temporal activity along $x = 114$, hence the $E_t^{\tau_h}(114, y)$ values are zero. On the other hand, $E_t^{\tau_h}(350, y)$ consists of non-zero values at pixels corresponding to locations where temporal activity is present.
The distribution of the temporal visual salience magnitudes in a frame is essentially determined by the magnitude of the intensity changes caused by the motion present in the action. If the intensity changes significantly, this produces strongly highlighted temporal salience, and vice versa. Furthermore, Eq. (4) has another essential goal: suppressing the global changes, i.e., global motion, that can come from background objects or camera motion. FIGURE 3 shows an example of silhouettes generated using the proposed method. It demonstrates the benefit of using multiple thresholds to compute the weighted entropy, $\hat{E}_t(x, y)$. It can be seen in FIGURE 3(c) that the generated silhouette further highlights the most dynamic body parts used in the action compared to the rest, since the moving parts are represented with high temporal visual salience magnitude values.

B. HUMAN ACTION RECOGNITION IN THE VISUALLY ANONYMIZED DOMAIN
Our proposed silhouette generation for visual anonymization in Section III-A produces a gray scale map corresponding to the temporal visual salience due to the motion in the sequence. In this section, we present the proposed methodology for analysing these silhouette maps for HAR. Our approach aims to construct a compact descriptor by exploiting the temporal visual salience captured in the silhouettes. Most current HAR descriptors rely on the original or raw video data and on motion estimated from the video to extract important features. Since motion information is already encapsulated in our silhouettes, our approach can effectively analyze the video without needing access to the original, visually non-anonymized video and without computing complex motion estimations. To achieve this, we propose histograms of gradients in salience (HOG-S), a local descriptor that explores the temporal visual salience captured in our visually anonymizing silhouettes.
HOG-S focuses on the salience region, $R_t$, spanning a rectangular bounding box of $K \times L$ pixels, from the silhouette in frame $t$. The major steps of our approach are HOG-S feature vector extraction from the bounding boxes, HOG-S feature vector processing and classifier training, as illustrated in the block diagram in FIGURE 4. We start by computing the gradients, $\nabla R_t = (d_x, d_y)$, for each pixel in the region $R_t$, where $d_x$ and $d_y$ represent the horizontal and vertical components approximated by finite differences. The gradient magnitude, $G_t$, and the direction, $\theta_t$, are computed as follows:
$$G_t = \sqrt{d_x^2 + d_y^2}, \qquad \theta_t = \arctan\left(\frac{d_y}{d_x}\right).$$
$R_t$ is partitioned into $B_K \times B_L$ blocks, each containing $pn \times pn$ pixels. Each block is further partitioned into $p \times p$ patches, with each patch containing $n \times n$ pixels. The gradient magnitudes and the corresponding directions in each patch are formed into 9-bin histograms, and all histograms are concatenated into a single feature vector, $v_t$, of length $9 p^2 B_K B_L$. This is followed by normalizing the vector as follows:
$$\hat{v}_t = \frac{v_t}{\|v_t\|_2},$$
where $\|\cdot\|_2$ denotes the $\ell_2$-norm. However, considering only the individual $\hat{v}_t$ of individual frames cannot fully separate the features of one frame from those of other frames, given the variations within an action and the similarities between different actions. This is addressed by considering the accumulated temporal changes to the feature vectors, $\hat{V}_t = \{\hat{v}_0, \hat{v}_1, \hat{v}_2, \cdots, \hat{v}_t\}$, up to frame $t$ to compute the final feature vector, $\tilde{v}_t$, at the time instant $t$ from the absolute values of those changes, where $|\cdot|$ denotes the absolute value of the vector elements. This is followed by applying principal component analysis (PCA) to the set of feature vectors $\tilde{v}_t$ of the sequence in order to reduce the dimensionality of the HOG-S descriptor and to maximise the variance, which improves the discrimination of the HOG-S descriptors. Finally, a classifier is trained using these feature vectors to recognise the human actions in the video.
We have considered two classifiers, support vector machine (SVM) and K-nearest neighbour (KNN) for evaluating our proposed method as presented in Section IV.
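The per-frame HOG-S extraction described in this subsection can be sketched as follows. The sketch makes several assumptions that the text does not fix: orientations are treated as unsigned angles in $[0, \pi]$, histograms are weighted by gradient magnitude, the block and patch hierarchy is flattened into a plain tiling of $n \times n$ patches (which yields the same number of histograms, in a different order), and the accumulated temporal change is taken as the sum of absolute differences between successive normalised vectors. The function names are hypothetical.

```python
import numpy as np

def hog_s_frame(region, n=8, bins=9):
    """L2-normalised orientation histogram vector for one salience region."""
    region = region.astype(np.float64)
    dy, dx = np.gradient(region)            # finite-difference gradients
    magnitude = np.hypot(dx, dy)
    # Unsigned orientation folded into [0, pi] (an assumption of this sketch).
    direction = np.mod(np.arctan2(dy, dx), np.pi)
    rows, cols = region.shape[0] // n, region.shape[1] // n
    histograms = []
    for r in range(rows):
        for c in range(cols):
            patch = (slice(r * n, (r + 1) * n), slice(c * n, (c + 1) * n))
            hist, _ = np.histogram(direction[patch], bins=bins,
                                   range=(0.0, np.pi),
                                   weights=magnitude[patch])
            histograms.append(hist)
    v = np.concatenate(histograms)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def accumulate_changes(frame_vectors):
    """Sum of absolute changes between successive frame vectors (assumed form)."""
    stacked = np.stack(frame_vectors)
    return np.abs(np.diff(stacked, axis=0)).sum(axis=0)
```

For a $K \times L$ region with both sides divisible by $n$, `hog_s_frame` returns $9 \cdot (K/n) \cdot (L/n)$ values, which matches the stated length $9 p^2 B_K B_L$. The accumulated vectors would then be passed to PCA and to an SVM or KNN classifier, which are omitted here.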

IV. PERFORMANCE EVALUATION
In this section, we present the evaluation of the proposed method in terms of its performance in both visual anonymization and human action recognition in the visually anonymized domain. The datasets used and the experimental parameters are described in Section IV-A. First, for completeness, we evaluate the performance of our proposed temporal visual salience modeling and compare it with existing video salience modeling in Section IV-B to justify the suitability of our approach for the considered application. Then, we evaluate the effectiveness of the proposed anonymization method by evaluating the recognizability of the humans in the video sequences and the utility of such anonymized video for recognizing the activities they perform. We evaluate both of these objectives using human observers by conducting subjective evaluations, as shown in Section IV-C. Finally, the performance of HAR using the proposed HOG-S features in the visually anonymized domain is presented in Section IV-D.
The Weizmann dataset contains V = 93 low resolution (144 × 180) video sequences at 50 frames per second (fps), showing nine different people, each performing Q = 10 different actions, namely bend, run, walk, skip, jack, jump, pjump, side, one hand wave and two hands wave. This dataset is recorded using a single static camera.
The KTH dataset contains V = 597 video sequences showing Q = 6 action classes, namely boxing, handwaving, handclapping, jogging, running and walking. There are 25 different subjects performing the actions in four different scenarios, of which three are outdoor and one is indoor. This dataset is recorded with four different cameras to capture the action of the subject in the scene from different views: three static cameras and one that records the actions with zooming. The sequences are captured over a homogeneous background with a static camera recording 25 frames per second. Each video has a resolution of 160 × 120.
The Depth-included Human Action (DHA) dataset contains Q = 23 action classes performed by 21 different individuals (12 males and 9 females). It is recorded using a static Kinect camera in three different scenes with   running, jumping, waving, jumping jacks, clapping, jump from situp, raise one hand, stretching out, turning, sitting to standing, crawling, pushing up and standing to sitting. These actions are performed by 8 persons and recorded using a single static camera.
The UCF Sports dataset includes a total of V = 150 sequences with a resolution of 720 × 480, representing Q = 10 actions. This dataset represents a natural collection of actions with a wide variation in scenes and viewpoints. The actions included in this dataset are: Diving, Golf Swing, Kicking, Lifting, Riding Horse, Running, Skate Boarding, Swing-Bench, Swing-Side and Walking.
Finally, the Human Motion Database (HMDB51) dataset, which is one of the largest datasets used in HAR, contains V = 6849 clips distributed over Q = 51 action classes. Each video clip has around 20-1000 frames. The action categories of this dataset can be grouped into five types based on the body movements. This dataset is considered challenging because its clips are collected from the Internet and YouTube. Thus, it can be regarded as a real-world video clip collection.
In the experiments, we use $N = 3$ and $H = 7$ for evaluating the proposed visual anonymization algorithm. The weighted entropy maps, $\hat{E}_t(x, y)$, are smoothed using a 2D Gaussian kernel with $\sigma = 6$. All maps are resized to a resolution of 256 × 256 so that the same parameters can be applied to all datasets. We adopt a bounding box approach with $K = 168$.

FIGURE 7. Comparison of salience maps for different actions using the proposed method and the existing methods. Row 1: original RGB frames, Row 2: corresponding temporal salience maps using our proposed method, Row 3: corresponding salience maps using Kim et al. [28] and Row 4: corresponding salience maps using Fang et al. [25]. Column 1-4: four actions from the DHA dataset.

FIGURE 5 shows the generated temporal visual salience maps for visually anonymizing a few sequences of actions in the Weizmann dataset. As can be seen, the silhouette for a specific action changes from frame to frame as the motion content due to the action varies. For instance, in the case of the jacking action, in the third row of FIGURE 5, the silhouette has a different pattern every time, as some parts are attenuated and others gain extra highlighting. In addition, the algorithm generates different salience maps for the one-hand waving and two-hands waving actions, as can be seen in row 5 and row 6, respectively, since the patterns of these two actions are different. This representation is crucial for creating a useful abstraction at each frame from which an efficient action description can be extracted, one that accurately identifies the variation between the actions while the video is visually anonymized.
Although our method focuses only on temporal visual salience, for completeness we compare our temporal salience modeling with three existing works, Fang et al. [25], Kim et al. [28] and Wang et al. [26], which model full video visual salience by considering both temporal and spatial salience cues. Our algorithm, in contrast, considers temporal salience cues only. In this way, we can make sure that the full video is visually anonymized (using black pixels for non-salient areas) while showing only the gray scale salience map silhouette corresponding to the temporally salient regions related to the action. TABLE 2 and FIGURE 6 show the Area Under Curve (AUC) values, which measure the accuracy of salient region detection, and the average computation time for our proposed method and the existing work on three datasets. These results show that the proposed method, which only models temporal salience, has comparable accuracy in terms of AUC to the existing methods, while requiring little computation time. Examples of salience maps for various action sequences from the DHA and Weizmann datasets using our proposed method and existing work are shown in FIGURE 7. It is evident that our proposed salience maps capture only the body parts relevant to the action, whereas other methods capture other spatial information and the full body, which are not relevant to the action.
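For readers unfamiliar with AUC as a salience metric, the sketch below shows one common way of computing it: the probability that a randomly chosen salient pixel receives a higher salience value than a randomly chosen non-salient pixel, with ties counted as one half. The specific AUC variant used for TABLE 2 is not stated in this section, so this is an illustrative assumption; the function name `salience_auc` and the binary ground-truth `mask` are hypothetical.

```python
import numpy as np

def salience_auc(salience, mask):
    """Rank-based AUC of a salience map against a binary ground-truth mask."""
    scores = salience.ravel().astype(np.float64)
    labels = mask.ravel().astype(bool)
    positives, negatives = scores[labels], scores[~labels]
    if positives.size == 0 or negatives.size == 0:
        raise ValueError("mask must contain both salient and non-salient pixels")
    # Compare every salient pixel with every non-salient pixel.
    # This needs O(P * N) memory, so it only suits small maps.
    greater = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return (greater + 0.5 * ties) / (positives.size * negatives.size)
```

A map that ranks every salient pixel above every non-salient pixel scores 1.0, a constant map scores 0.5, and a fully inverted map scores 0.0.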

C. EVALUATION OF THE PROPOSED VISUAL ANONYMIZATION ALGORITHM
We evaluated the effectiveness of the proposed visual anonymization using human observers. A survey with 30 participants was conducted to evaluate the proposed method and state-of-the-art filtering algorithms for visual anonymization. In this survey, the participants were divided into four groups, where each group evaluated a specific dataset anonymized using the proposed method and the existing methods. The DHA, KTH, Weizmann and UIUC1 datasets were used in this subjective evaluation.
In total, 108 anonymized video sequences for different actions were selected equally from five existing methods (blurring with σ = 5, blurring with σ = 8, pixelation, solid silhouette and binary silhouette) and the proposed method. These sequences were divided into four groups and each group was allocated to a separate set of participants for evaluation. TABLE 3 shows the details of each evaluation group and the number of sequences assigned to it. FIGURE 8 shows a few example frames from the sequences used in the survey and their corresponding anonymized frames using the existing methods and the proposed method.
The purpose of the survey is two-fold. Firstly, it aims to determine the effectiveness of the proposed method in visual anonymization. Secondly, it evaluates whether the utility of the video is affected by the anonymization. Here, utility is defined as the ability of an observer to accurately recognize the action present in the sequence. Three questions, shown in TABLE 4, were included in the survey to achieve these two purposes.
The first question aims to evaluate the level of visual anonymization achieved by a particular method as perceived by the observer. The participants are asked to score the level of anonymity on a discrete scale from 0 (no anonymization) to 5 (perfect anonymization). The score reflects the extent to which they thought the method provides enough protection and reduces concerns about privacy. The second question collects the identity attributes, such as gender, apparent age, facial features, clothes, hair and race, that can be recognised by the participants. These attributes are considered sensitive information that has to be protected by a visual privacy preservation model. Unmeasurable attributes were not considered because they are difficult to determine in the visual domain. The response to this question needs to be consistent with that for the first question. For example, a score of 5 for the anonymization level means that none of the identity clues can be recognized from the anonymized video. Finally, the third question estimates the ability of the anonymization method to retain useful information that can be used to identify the human action present in the video. This quality depends on the level of anonymity. In other words, increasing the anonymity comes at the cost of the quality of the retained information, and vice versa. The participants were asked to label the action presented in the obfuscated sequence using the information that was retained by the concealment model.
At the beginning of a survey session, the purpose of the evaluation is conveyed to the survey participants. The region of anonymity of a scene is restricted to the human in the scene and does not include the background. The test video set used in the survey consists of various people performing various actions. We aimed to minimize repetitions of the same person performing different actions. Showing the same video sequence in several anonymized versions could allow the participants to recall missing details from memory and/or bias them towards the same answer regardless of the differences between the models. However, in a few cases we use two different models for the same sequence in order to analyse whether the participants can distinguish between them and whether the method makes a difference for the participant.
As shown in TABLE 3, this evaluation uses 108 video sequences, distributed as follows: DHA=30, UIUC1=30, KTH=24 and Weizmann=24. The number of video sequences used in the evaluation depends on the size of the dataset and the number of action labels in each dataset. The video sequences of each dataset are distributed among the different anonymization models evaluated. Six models were evaluated for the DHA, Weizmann and UIUC1 datasets, while four models were evaluated for KTH, as Silhouette and Binary masks were not available for the actions in the KTH dataset. With three questions per sequence, the number of responses collected for each dataset is as follows: DHA=720, UIUC1=720, KTH=504 and Weizmann=504, with a total of 2448 responses. The rest of this sub-section presents an analysis of the survey responses to all three questions.
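The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the sequence and response numbers given in the text; the number of participants per group that it derives is implied by the arithmetic and is not stated in this section, so it should be read as an inference that TABLE 3 would confirm or correct.

```python
# Sequence and response counts as quoted in the text.
sequences = {"DHA": 30, "UIUC1": 30, "KTH": 24, "Weizmann": 24}
responses = {"DHA": 720, "UIUC1": 720, "KTH": 504, "Weizmann": 504}
QUESTIONS_PER_SEQUENCE = 3

total_sequences = sum(sequences.values())
total_responses = sum(responses.values())

# Implied participants per group: responses / (sequences * questions).
implied_participants = {
    name: responses[name] // (sequences[name] * QUESTIONS_PER_SEQUENCE)
    for name in sequences
}
```

The totals agree with the 108 sequences and 2448 responses reported, and the implied group sizes of 8, 8, 7 and 7 add up to the 30 participants mentioned for the survey.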

1) ANALYSIS OF RESPONSES TO QUESTION 1
The responses include an anonymity score for each visually anonymized video sequence. We define the Mean Anonymity Score (MAS) of a given method on a given dataset as the mean score of all the responses received for that dataset using that method. TABLE 5 shows the MAS for the six methods on the four datasets, together with the average MAS over all datasets for each method. A MAS of 0 corresponds to the least anonymity and a MAS of 5 corresponds to the highest anonymity. According to these results, the proposed method has achieved the highest MAS compared to all other methods for all datasets.
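The MAS definition amounts to a grouped mean, which the following sketch makes explicit. The function name and the tuple layout of the responses are hypothetical, and any scores passed to it in an example would be illustrative values rather than survey data.

```python
from collections import defaultdict
from statistics import mean

def mean_anonymity_scores(responses):
    """Mean score per (dataset, method) pair.

    `responses` is an iterable of (dataset, method, score) tuples,
    where score is an integer on the 0 to 5 scale of Question 1.
    """
    grouped = defaultdict(list)
    for dataset, method, score in responses:
        if not 0 <= score <= 5:
            raise ValueError("anonymity scores must lie between 0 and 5")
        grouped[(dataset, method)].append(score)
    return {key: mean(values) for key, values in grouped.items()}
```

Averaging the per-dataset values of one method then gives the per-method average reported alongside the individual MAS values.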

2) ANALYSIS OF RESPONSES TO QUESTION 2
The second question aims to collect more details about the appearance attributes recognizable in the anonymized sequences. Question 2 specifically asks the participants about the recognizability of six attributes, i.e., gender, apparent age, facial features, clothes, hair and race of the humans in the test sequences. We have also included the option ''none'' to indicate that none of the above attributes is identified. FIGURE 9 and FIGURE 10 summarize the responses as stacked bars of percentages for each visual anonymization method and for the different datasets. The proposed anonymization method has recorded between 89% and 100% of non-recognizable attributes (shown in green in FIGURE 9), which is the highest compared to the existing anonymization methods. This high level of anonymization shows that our proposed temporal visual salience modeling achieves better anonymity than the existing spatial (frame-based) approaches for visual anonymization. This result is also consistent with the highest MAS reported for Question 1.

3) ANALYSIS OF RESPONSES TO QUESTION 3
This question evaluates the utility of the anonymized sequence in action recognition as perceived by the participants in the survey. TABLE 6 shows, for each anonymization method and all four datasets, the number of accurately recognized actions in the video sequences normalised by the total number of responses given by the participants. It is evident from the table that the action recognition rates by participants for some anonymization methods are better than those for the sequences that use the proposed anonymization method. For instance, the blurring model with σ = 5 achieves better accuracy rates from the viewpoint of the participants. However, this also means that the quality of its visual anonymity is low, since it has not distorted the perception of the motion present in the action. For some methods there is thus a trade-off between the anonymity and the utility of the anonymized sequence. For this reason, we evaluate the anonymization methods using the joint performance in anonymization and utility, as shown in FIGURE 11. It is clear from FIGURE 11 that the proposed temporal visual salience-based anonymity maps achieve the highest level of visual anonymity, outperforming the existing methods. It is also evident that the higher the anonymity, the lower the utility, as can be seen for the blurring-based methods.
In conclusion, the proposed approach results in excellent visual anonymity while transforming the original colour pixels into the temporal visual salience, leading to an action-related informative domain that provides a good indication of the actions in sequences as perceived by the human participants in the survey. In the following section, we demonstrate the utility of our approach in machine-based HAR.
Here we evaluate the HOG-S features for HAR as proposed in Section III-B.
We report its performance on five datasets using both KNN and SVM classifiers with five-fold cross-validation and compare it with the existing methods. TABLE 7 shows the number of PCA components used with the KNN and SVM classifiers to obtain the HAR accuracy rates reported in this section. Note that these numbers are much less than the original feature length, which is 23040. Note also that for the HMDB51 dataset, we evaluated only the KNN classifier. The accuracy percentages for the proposed method are shown in bold font in the table under each dataset. We also show the results with and without PCA. Overall, the proposed method has resulted in the best performance for all but one dataset, outperforming both feature-based and deep learning based methods. Only for the Weizmann dataset is the proposed method the second best, just 0.19% lower than the best method. Without PCA, KNN has shown better performance than SVM. Both classifiers have shown improved performance when PCA is used prior to classification to reduce the dimensionality of the feature space. However, the SVM classifier has benefited the most from PCA. FIGURE 12 - FIGURE 17 show the corresponding confusion matrices for the proposed method for the evaluated datasets, respectively.
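The five-fold cross-validation protocol with a KNN classifier can be sketched as follows. This is a minimal, standard-library-only illustration and not the implementation used in the experiments: the fold assignment, the value of k, the Euclidean distance and the two-dimensional toy features are all assumptions made here, and the PCA reduction and the SVM classifier are omitted for brevity. In a full pipeline, PCA would typically be fitted on the training folds only and then applied to the held-out fold.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k nearest training vectors (Euclidean distance)."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(features, labels, n_folds=5, k=3, seed=0):
    """Mean accuracy over n_folds disjoint held-out folds."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th shuffled index goes to the same fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_x = [features[i] for i in indices if i not in held_out]
        train_y = [labels[i] for i in indices if i not in held_out]
        correct = sum(
            knn_predict(train_x, train_y, features[i], k) == labels[i]
            for i in fold
        )
        scores.append(correct / len(fold))
    return sum(scores) / n_folds


# Toy stand-in for feature vectors: two well-separated clusters (not HOG-S data).
rng = random.Random(1)
features = (
    [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(20)]
    + [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(20)]
)
labels = ["walk"] * 20 + ["run"] * 20
accuracy = cross_val_accuracy(features, labels)
```

Each sample is used for testing exactly once, so the returned value is the average of five held-out accuracies rather than a training accuracy.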
Though the DHA dataset includes several actions with high similarity, our proposed method discriminates them accurately and outperforms the existing methods with an improvement of approximately 3%, as can be seen in TABLE 8. The confusion matrix using the KNN classifier in FIGURE 12 shows that 8 out of 23 actions, i.e., approximately 35%, have been fully recognised by the proposed modeling method. Similarly, for the KTH dataset, the proposed method shows an improvement of 1.6% compared to the existing methods. For the Weizmann dataset, as shown in FIGURE 14, the proposed method has recognised 50% of the actions with 100% accuracy. For the UIUC1 dataset, the proposed method has shown around 0.8% improvement compared to the existing methods. It can be seen in FIGURE 15 that the proposed method recognises the jumping action with 100% accuracy in spite of the similarity between this action and other actions in the dataset. In addition, 79% of the action classes have been recognised with more than 99% accuracy. For the UCF Sports dataset, the proposed method has shown an improvement of around 3.7% compared to the existing feature-based methods, and of 1.3% compared to the deep learning based methods. In addition, 77% of its action classes are recognised perfectly, while the rest have accuracy rates higher than 99.6%. For the HMDB51 dataset, which is regarded as a complex dataset, our method has outperformed the existing methods, which are mainly deep learning-based, by 16.71%. The confusion matrix in FIGURE 17 shows that 21 out of 51 action classes, i.e., 41%, have achieved 100% accuracy using the proposed method. All these datasets contain complex actions with high similarity, yet the proposed method has resulted in excellent recognition rates.
The accurate discrimination between the actions in all these datasets demonstrates the superiority of the proposed approach of exploiting the temporal visual salience modeling for visual anonymization followed by learning HOG-S features.
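Statements such as "8 out of 23 actions have been fully recognised" follow from the diagonal of a row-normalised confusion matrix. The sketch below shows one way to derive them; it assumes that rows index the true class and columns the predicted class (the orientation of the matrices in the figures is not stated here), and the 3-class matrix is a toy example rather than data from any of the datasets.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    accuracies = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accuracies.append(row[i] / total if total else 0.0)
    return accuracies


def fraction_fully_recognised(confusion):
    """Share of classes whose per-class accuracy is exactly 100%."""
    accuracies = per_class_accuracy(confusion)
    return sum(1 for a in accuracies if a == 1.0) / len(accuracies)


# Toy confusion matrix: rows are true classes, columns are predicted classes.
confusion = [
    [10, 0, 0],
    [1, 8, 1],
    [0, 0, 10],
]
class_acc = per_class_accuracy(confusion)
fully = fraction_fully_recognised(confusion)
```

In this toy case two of the three classes are fully recognised, so `fully` is two thirds, in the same way that 21 of 51 classes corresponds to about 41%.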
Finally, we revisit the utility of the visually anonymized sequence as perceived by the participants in terms of the action recognition rate vs. the level of visual anonymity (measured by MAS) shown in FIGURE 11. Here we evaluate the utility of the visually anonymized streams in terms of machine-based HAR, as shown in TABLE 9. The utility in terms of machine-based HAR with respect to the visual anonymizing methods is summarized in FIGURE 18. It is evident that the proposed anonymization method combined with the proposed HOG-S based HAR provides the best HAR accuracy rate as well as the highest MAS, resulting in the best joint anonymizing and HAR methodology. It can also be highlighted that the machine-based utility is much higher than the human-perceived utility for the proposed anonymity silhouettes. This confirms the effectiveness of modeling temporal visual salience to obtain a salience-driven silhouette for anonymization and the ability of the proposed HOG-S features to learn the important features in such anonymity maps.
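One way to make the notion of "best joint" performance concrete is a Pareto-dominance check over (MAS, HAR accuracy) pairs: a method is jointly best if no other method is at least as good on both axes and strictly better on one. The sketch below is an illustration added here, not the procedure used to produce FIGURE 18; the function name, the method names and all numeric values are placeholders.

```python
def pareto_front(methods):
    """Return the names of methods not dominated on (anonymity, accuracy)."""
    front = []
    for name, (mas, acc) in methods.items():
        dominated = any(
            o_mas >= mas and o_acc >= acc and (o_mas > mas or o_acc > acc)
            for other, (o_mas, o_acc) in methods.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values only; these are not the figures of TABLE 5 or TABLE 9.
methods = {
    "method_a": (4.8, 0.97),
    "method_b": (3.0, 0.90),
    "method_c": (1.5, 0.95),
}
best = pareto_front(methods)

# With a genuine trade-off, more than one method remains on the front.
trade_off = pareto_front({"x": (4.0, 0.5), "y": (2.0, 0.9)})
```

When a single method remains on the front, as for `methods` above, it is best on both anonymity and utility at once; when several remain, as for `trade_off`, anonymity and utility have to be traded against each other.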

V. CONCLUSIONS
In this paper, we have presented a methodology for visually anonymizing video clips by modeling the temporal visual salience while retaining the computer-based utility of human action recognition. The novel temporal salience model proposed in this paper encapsulates the intensity of the motion dynamics of the action into the anonymization maps. This is followed by extracting the newly proposed HOG-S features for human action recognition in the visually anonymized domain. The proposed visual anonymization method has achieved the highest MAS compared to the existing methods for visual anonymization. The human observer surveys conducted have confirmed that none of the six appearance attributes were recognizable for all sequences tested for the KTH and UIUC1 datasets anonymized using the proposed method. Similarly, for the DHA and Weizmann datasets, the participants were not able to recognize any of the attributes in around 97% and 89% of the sequences, respectively. These results support the high MAS of the proposed method. In terms of the utility of the anonymized clips, our proposed anonymization method coupled with the proposed HOG-S feature learning approach has achieved the best machine-perceived human action recognition accuracy rates compared to those of existing anonymizing methods. The proposed HAR method has also exceeded the performance of human-perceived action recognition from videos anonymized using the proposed temporal salience-based anonymization method. Overall, when the proposed work is considered as a holistic human action recognition method, i.e., the temporal salience modeling followed by the HOG-S feature extraction, it has resulted in the best human action recognition accuracy rates for the datasets DHA, KTH, UIUC1, UCF Sports and HMDB51 with improvements of 3%, 1.6%, 0.8%, 1.3% and 16.7%, respectively, outperforming both feature-based and deep learning based existing approaches.
It has also shown the second-best accuracy rate for the Weizmann dataset, just 0.19% less than the best method. This superior performance is the result of the way in which the actions are modelled using the proposed temporal salience modeling, which generates a silhouette that captures the dynamics of the motion present in the action at a specific time. This work provides a very useful tool for human action recognition in vision-based assisted living.