Semantic Recognition of Human-Object Interactions via Gaussian-Based Elliptical Modeling and Pixel-Level Labeling

Human-Object Interaction (HOI) recognition, due to its significance in many computer vision-based applications, requires in-depth and meaningful details from image sequences. Incorporating semantics in scene understanding has led to a deeper understanding of human-centric actions. Therefore, in this research work, we propose a semantic HOI recognition system based on multi-vision sensors. In the proposed system, the RGB and depth images, de-noised via Bilateral Filtering (BLF), are segmented into multiple clusters using a Simple Linear Iterative Clustering (SLIC) algorithm. The skeleton is then extracted from the segmented RGB and depth images via Euclidean Distance Transform (EDT). Human joints, extracted from the skeleton, provide the annotations for accurate pixel-level labeling. An elliptical human model is then generated via a Gaussian Mixture Model (GMM). A Conditional Random Field (CRF) model is trained to allocate a specific label to each pixel of the different human body parts and the interaction object. Two semantic feature types are extracted from each labeled body part and labeled object: fiducial points and 3D point clouds. Feature descriptors are quantized using Fisher's Linear Discriminant Analysis (FLDA) and classified using K-ary Tree Hashing (KATH). In the experimentation phase, the recognition accuracy achieved is 92.88% with the Sports dataset, 93.5% with the Sun Yat-Sen University (SYSU) 3D HOI dataset and 94.16% with the Nanyang Technological University (NTU) RGB+D dataset. The proposed system is validated via extensive experimentation and should be applicable to many computer vision-based applications such as healthcare monitoring, security systems and assisted living.


I. INTRODUCTION
Understanding Human-Object Interaction (HOI) is formulated on Human Action Recognition (HAR) [1]. However, HOI is not limited to identifying human actions; it can also detect relationships between humans and objects [2]. This relationship is called the verb or the interaction between a human and an object. Hence HOI is called the identification of triplets (human, verb, object) [3], [4]. It is a challenging field and is of particular interest in research. Due to the complex nature of HOI, there is a need for a very thorough understanding of each movement involved in an interaction. Semantic segmentation has proved to be very effective in multiple domains of image processing and computer vision, such as intelligent transportation, medical imagery, object detection and human-computer interaction [5], [6]. Semantic segmentation is the clustering of pixels that belong to the same class and labeling them individually [7]. Therefore, we semantically segment different human body parts and their interaction object. Traditional HOI recognition systems based on semantic segmentation only consist of human and object labeling [8], [9]. However, in the proposed system, different human body parts along with their respective object are semantically segmented and labelled. In this way, movements performed by each body part are recorded individually, resulting in the development of an accurate HOI recognition system.
The associate editor coordinating the review of this manuscript and approving it for publication was Haiyong Zheng.
Human-object interaction is a very popular research area due to its wide applicability to, e.g., healthcare monitoring [9], assisted living [10], surveillance [11], motion sensing games [12], content-based video indexing and retrieval, etc., in the field of computer vision [13], [14]. Thus, there is a need for a more reliable and accurate system. A lot of research work has been performed in recent years in this field. Nevertheless, there remain some challenges that need to be tackled, such as variation in lighting and occlusion of different objects [15], [16]. In order to overcome these challenges, we propose a fusion of RGB and depth sensors. Depth sensors overcome the problem of occlusion and prove to be very effective in action recognition by providing the extra depth information of each object involved in an interaction [17]. Moreover, we incorporate semantics in each action class which results in deeper understanding of each movement performed by each body part during interaction.
The proposed HOI recognition system consists of four major modules: image normalization; human and object segmentation; human body parts and object detection via elliptical modeling and pixel-level labeling; and HOI recognition via semantic feature extraction, dimensionality reduction and classification. First, RGB and depth images are normalized by removing noise via Bilateral Filtering (BLF). The de-noised images are subjected to a segmentation phase performed via a Simple Linear Iterative Clustering (SLIC) algorithm. After the segmentation of both humans and objects from the backgrounds, the Euclidean Distance Transform (EDT) is used to trace the human skeleton. The branch points of the skeletons are given as centroids to form clusters of different body regions by a Gaussian Mixture Model (GMM). The orientation and region under each cluster are represented by an ellipse. Hence an elliptical model representing different body parts and their respective objects is produced.
After modeling each human and object, pixel-level labeling of each region under an ellipse is performed. Conditional Random Field (CRF) is trained to label each RGB and depth image. Two unique features from each labelled human body part and object are extracted. After feature extraction, Fisher's Linear Discriminant Analysis (FLDA) is used for the dimensionality reduction of feature descriptors. In the end, each action class is classified and recognized via a K-ary Tree Hashing (KATH) classifier. The efficiency of the proposed work is validated via experimentation over three datasets: the Sports dataset, the Sun Yat-sen University (SYSU) 3D HOI dataset and the Nanyang Technological University (NTU) RGB+D dataset.
The major contributions of this work can be summarized as follows.
• Improved silhouette segmentation for both RGB and depth images via a SLIC algorithm.
• Precise detection of human body parts and an ellipsoidal model generated from the detected body parts.
• Pixel-level labeling of each detected human body part and object from both RGB and depth image sequences via CRF.
• The main contribution is accurate HOI detection via unique semantic feature extraction from each labeled body part and object.
The rest of the paper is structured as follows: Section II provides the related work. Section III presents the details of each module of the proposed HOI system. Section IV describes the experiments performed to validate the performance of the system and compares the recognition rate of our work with other systems. Finally, Section V provides the conclusion with some future directions.

II. RELATED WORK
Many HOI recognition systems have been proposed in recent years, comprising both deep learning [18]-[20] and machine learning based approaches [21]. In our proposed work, we have developed a machine learning based multi-vision sensor system that incorporates a semantic segmentation technique. Therefore, we divide the related work into two sections. The first section describes related work on multi-vision sensor based HOI. The second section covers action recognition systems based on different semantic segmentation techniques.

A. MULTI-VISION SENSORS BASED HOI SYSTEMS
Data acquisition in vision-sensor based action recognition systems comprises RGB [22], [23], depth and skeletal data [24], [25]. In this section, related work in the field of HOI systems based on all three aforementioned vision sensor techniques is presented. Yao and Fei-Fei [26] proposed an HOI system that consists of a mutual context for human and object. The two types of contextual data used in this method are: co-occurrence context models and the co-occurrence statistics between objects and human poses. Furthermore, to represent the relationships between humans and objects, a spatial context is also represented. The efficacy of the system is proved with two publicly available RGB datasets, but the system lacked annotation of human body parts and objects. Yan et al. [27] proposed a multitask neural network based HOI recognition system based on a combination of human body and hand motion. A digital glove called Wise-Glove was used to record the motion of the hands. A neural network based technique was used to identify the object and the HOI. Experimental results with both RGB and skeletal data achieved a better recognition rate, but testing was performed with a very limited data range of eight action classes.
Meng et al. [28] proposed a system based on the distances of skeletal joints. This is a depth sensor-based system in which inter-joint and joint-object distances are calculated. The features in this system were pose invariant and classified by a random forest. A good recognition rate was achieved but the system was tested on only one dataset. In [29], Wan et al. proposed a pose-aware system for HOI detection. A global spatial configuration of HOI is captured to focus only on the action-related parts of humans. In order to incorporate pose, a multi-branch network is used to represent the relationships between semantic parts, objects and interaction contexts. Two publicly available RGB datasets were used for experimentation. Robust predictions were made via fine-grained HOI recognition rates. Qi et al. [30] proposed Graph Parsing Neural Network (GPNN) based HOI detection. The graph structure includes the adjacency matrix and node labels. Results on two RGB datasets and one depth dataset proved the validity of the system. In [31], Gkioxari et al. proposed the detection of triplets, i.e., human, verb and object detection. They exploit the appearance of a person to determine the object and the interaction. An action-specific density is calculated to detect the targeted object. They proved the effectiveness of their approach through extensive experimentation on two RGB datasets. Li et al. [32] proposed a 3D pose based system and a new benchmark named Ambiguous-HOI. To mine features, a 2D and 3D representation network is proposed. To represent both humans and objects, a cross-modal consistency task and a joint learning structure were proposed. Experimentation on two RGB datasets proved the better performance of the system.

B. SEMANTIC SEGMENTATION AND LABELING
Semantic segmentation is a significant part of our proposed methodology. In recent years, different methodologies have been adopted by researchers in the field of semantic segmentation [33], [34]. In this section we describe some semantic segmentation based scene understanding and human action identification systems. In [35], Zhou et al. proposed cascaded parsing network based HOI. An instance detection module and an interaction reasoning module were proposed. The HOI representation, in the form of instance and relation features, is parsed via GPNN. The detection of interaction is not limited to a bounding box but extends to pixel-level segmentation of humans and objects. Experimental results demonstrate better performance than prior methods on two RGB datasets (V-COCO and HICO-DET). In [36], J. Ji et al. provided semantic segmentation for different action classes instantaneously. They used the concepts of multitask learning and contextual data. Region Based Convolutional Neural Networks (R-CNN) were used for pixel-level labeling. They demonstrated the performance of the system through both detection and segmentation experiments on one RGB dataset. Khowaja and Lee [37] proposed a semantic analysis of videos by applying localized sparse segmentation using global clustering. Through experimentation they showed that semantic images produce better activations by focusing on regions that are significant for action recognition. The use of approximate rank pooling from Long Short Term Memory (LSTM) showed better performance. A high recognition rate with three public datasets validated the performance of the system.
In [38], Arnab et al. proposed semantic segmentation based scene understanding via Conditional Random Field (CRF) and neural networks. They used deep neural networks to automatically learn features. The mean field algorithm of CRF was used as a Recurrent Neural Network (RNN) layer. They improved the segmentation performance on a public RGB dataset (Pascal VOC). Paisitkriangkrai et al. [39] proposed a pixel labeling technique consisting of CRF and Convolutional Neural Networks (CNN). They proposed robust features by combining both hand-crafted and CNN extracted features. Then, to label probabilities, CRF was applied. As a result, segmentation was improved with the ISPRS labelling contest dataset. In [40], A. Jalal proposed a depth silhouette based HAR system that labeled human body parts and identified the centroid of each part. A random field was used to label and train the images. A motion vector comprising magnitude and direction was computed via the identified centroids. Experiments performed on six daily life activities show a better recognition rate than many state-of-the-art methods. All of these methodologies showed improvement in the recognition of human actions, so we propose an HOI recognition system based on semantic segmentation of human body parts.

III. THE PROPOSED APPROACH
The proposed approach consists of four major modules: image normalization; human and object segmentation; human and object detection, modeling and labeling; and HOI recognition. The overall architecture of the proposed system is shown in Fig. 1. Details of the techniques used for each of the aforementioned modules are explained in the following subsections.

A. IMAGE NORMALIZATION
During pre-processing, the raw images from both RGB and depth datasets are fed into the system. In order to keep the dimensions of images from all three datasets similar, they are cropped to a fixed dimension of 560 × 350. After identifying the initial region of interest, BLF is applied. BLF removes noise, smooths the images and preserves the edges of all the objects in the images [41]. All the images are de-noised by Gaussian smoothing kernels. The intensity value of each pixel $x$ of the image $I$ is replaced by a weighted intensity obtained from neighboring pixels [42]. The range kernel $f_r$ smooths differences in intensities and the spatial kernel $g_s$ smooths differences in coordinates. The filtered image $I_{fil}$, obtained after applying the bilateral filter, is defined as;
$$I_{fil}(x) = \frac{1}{W_p} \sum_{x_i \in \Omega} I(x_i)\, f_r\left(\lVert I(x_i) - I(x) \rVert\right) g_s\left(\lVert x_i - x \rVert\right)$$
where $x_i$ is one of the neighboring pixels from the specified neighborhood window $\Omega$ centered at $x$ and $W_p$ is defined as;
$$W_p = \sum_{x_i \in \Omega} f_r\left(\lVert I(x_i) - I(x) \rVert\right) g_s\left(\lVert x_i - x \rVert\right)$$
The normalization factor $W_p$ is assigned by spatial closeness and differences in intensity values.
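As a rough illustration of the weighting described above, the following Python sketch applies a bilateral filter to a small grayscale image stored as a list of rows. It assumes Gaussian forms for both the range and the spatial kernel; the function name, the window radius and the two sigma values are illustrative choices and are not taken from the paper.

```python
import math


def bilateral_filter(img, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Bilateral filter for a 2D list of grayscale intensities.

    sigma_s controls the spatial kernel and sigma_r the range kernel.
    Both kernels are assumed to be Gaussian (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = 0.0  # weighted sum of neighboring intensities
            w_p = 0.0  # normalization factor
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue  # skip neighbors outside the image
                    # spatial kernel: penalizes distance in coordinates
                    g_s = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    # range kernel: penalizes difference in intensity
                    diff = img[yy][xx] - img[y][x]
                    f_r = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                    num += img[yy][xx] * f_r * g_s
                    w_p += f_r * g_s
            out[y][x] = num / w_p
    return out
```

With a small `sigma_r`, pixels across a strong edge receive almost no weight, which is why the filter smooths flat regions while keeping object boundaries sharp.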

B. HUMAN AND OBJECT SEGMENTATION
After the normalization phase, both humans and their respective interaction objects are segmented from the background using a SLIC algorithm [43]. In this section, we propose a linear iterative clustering based super-pixels approach. In this approach the k-means algorithm is used to generate super-pixels [44]. First, all RGB images are converted to the lab color space, in which $l^*$ specifies lightness and $a^*$ and $b^*$ encode the four colors red, green, blue and yellow. After that, the number of super-pixels $k$ is specified. Then, initial centers $C_i = [l_i\ a_i\ b_i\ x_i\ y_i]^T$, which are $S$ pixels apart, are initialized for each cluster. The grid interval $S = \sqrt{N/k}$, where $N$ is the number of pixels in the image, produces nearly equal sized super-pixels. The centers of the super-pixels should not be at the edges of objects; for this reason the centers are moved to the lowest gradient positions in a 3 × 3 neighbourhood. After specifying the clusters, searching starts, where each pixel $i$ is assigned to its nearest cluster center. Compared to traditional k-means, the search space of the SLIC algorithm is very limited [45]. The search space is reduced by measuring the distance $D$ to define the nearest centers for each pixel. This distance $D$ is a 5D Euclidean distance in the labxy space and is given by the 3D color distance $d_c$ and the 2D spatial distance $d_s$ as;
$$d_c = \sqrt{(l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2}, \qquad d_s = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}$$
As the depth images are given in grayscale, they only have the $l$ component, so the distance $d_c$ for grayscale is given as $d_c = \sqrt{(l_j - l_i)^2}$. The normalized distance $D$ is given by the maximum distances within the cluster in color and space proximity, $N_c$ and $N_s$ respectively, as;
$$D = \sqrt{\left(\frac{d_c}{N_c}\right)^2 + \left(\frac{d_s}{N_s}\right)^2}$$
At the final stage, after each pixel is assigned to the nearest neighbor, the cluster centers are updated. At the end there are still some pixels that are not assigned to their respective clusters. So, a super-pixel merging algorithm [46] is performed to further refine the segmentation process. Different visual features are used to define $n$ super-pixels $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$ in an image.
These image features describe $l$ semantic labels in an image, and the similarity of any two super-pixels $x_i$ and $x_j$ is given by a weighted combination in which $\delta$ is the weight factor for distance adjustment and the distance terms are the Euclidean distances between the color, texture, SIFT and SURF features of super-pixels $i$ and $j$. The relationships between super-pixels are stored for the merging step. Fig. 2 shows the results of SLIC segmentation over an RGB image while Fig. 3 shows the results of the SLIC algorithm on depth images from the SYSU dataset.
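The assignment step of SLIC can be sketched in a few lines of Python. The sketch below assumes the usual SLIC form of the normalized distance, in which the color distance is divided by `n_c` and the spatial distance by `n_s` before the two are combined; the function names are hypothetical, and the restriction of the search to a local window around each center is omitted for brevity.

```python
import math


def grid_interval(num_pixels, k):
    """Grid interval S between initial cluster centers for k super-pixels."""
    return math.sqrt(num_pixels / k)


def slic_distance(color_i, pos_i, color_j, pos_j, n_c, n_s):
    """Normalized SLIC distance between a center i and a pixel j.

    color_* holds (l, a, b) for RGB images or (l,) for grayscale depth
    images; pos_* holds (x, y).
    """
    d_c = math.sqrt(sum((cj - ci) ** 2 for ci, cj in zip(color_i, color_j)))
    d_s = math.sqrt(sum((pj - pi) ** 2 for pi, pj in zip(pos_i, pos_j)))
    return math.sqrt((d_c / n_c) ** 2 + (d_s / n_s) ** 2)


def nearest_center(color, pos, centers, n_c, n_s):
    """Index of the center with the smallest normalized distance.

    centers is a list of (color, pos) pairs. A full SLIC implementation
    would only test centers whose search window contains the pixel.
    """
    return min(
        range(len(centers)),
        key=lambda i: slic_distance(centers[i][0], centers[i][1],
                                    color, pos, n_c, n_s),
    )
```

Because the color tuple may have one or three components, the same function covers both the lab case for RGB images and the single-channel case for depth images.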

C. HUMAN BODY PARTS DETECTION, MODELING AND LABELING
In this section human body parts are detected, elliptically modeled and labeled. This section is divided into three phases. In the first phase, human body parts are detected and in the second phase an ellipsoidal model of human body parts is generated. In the third phase, detected human body parts and the respective interaction object are labelled using CRF. Each of these phases is described in the following sub-sections.

1) HUMAN BODY PARTS DETECTION
In order to provide accurate human joint annotations, a human skeleton is first extracted via Euclidean Distance Transform (EDT) [47]. All segmented RGB and depth images are converted to binary images. The binary images are then converted to grayscale images in which each foreground pixel $p$ takes its minimum distance from the background pixels $q$ [48]. This grayscale image is called the Distance Transform (DT) and its pixel values are given by:
$$DT(p) = \min_{q} d(p, q)$$
where $d(p, q)$ is the Euclidean distance between $p$ and $q$. Then a morphological thinning operation is applied to extract continuous skeleton pixels [49]. The operation of skeletonization further reduces the image to a single line without destroying the structure of the human. The skeletonization process is demonstrated in Fig. 4.
In order to determine skeletal joint points, first, a Critical Point (CP) is determined [50]. This is the point with a minimal DT from the boundary. Keeping the CP as the root node, a tree is traversed in four directions, i.e., upward, downward, left and right. In each of these directions, a constraint is followed, i.e., only the foreground pixels having a value of 1 are searched in the 8-connected neighborhood [51]. There are two types of points in this search.
• Endpoints (EP): those points for which there is only one skeletal point among the 8 neighbors.
• Bifurcation points (BP): those points for which there are three or more skeletal points among the 8 neighbors.
By keeping the root node as the parent node, the first child is searched in the upward direction. The direction of search is guided by the slope of the line that is connecting the root. The EP in the upward direction is the head and the BP is the neck. In the search from the root node to the left, the EP is the left hand. The mean of the EP and the CP is the left elbow and the mean of the left elbow and the neck is the left shoulder. Similarly, these three joints of the right arm are located by a search towards the right. The root or the CP is the torso of the human body. Searching from the CP in the downward direction, the first BP is the torso base. The EP on the left side is the left foot and, on the right side, it is the right foot. The mean point between the torso base and each foot is the left knee and the right knee respectively. Similarly, the mean point between the torso base and each knee is the left and right upper leg, and the mean point between each knee and the corresponding foot is the left and right ankle joint respectively. In this way a total of eighteen human joints are located as shown in Fig. 5. During all this searching, the constraint of the pixel value being 1 is followed.
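The endpoint and bifurcation tests, together with the midpoint rule used for the intermediate joints, can be sketched as follows. This is a minimal Python illustration on a binary skeleton stored as a list of rows; the function names are hypothetical, and the directional tree traversal from the CP is not reproduced here.

```python
def count_neighbors(skel, y, x):
    """Number of skeleton pixels (value 1) among the 8 neighbors of (y, x)."""
    h, w = len(skel), len(skel[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w and skel[yy][xx] == 1:
                total += 1
    return total


def classify_points(skel):
    """Return (endpoints, bifurcation_points) as lists of (row, col)."""
    endpoints, bifurcations = [], []
    for y, row in enumerate(skel):
        for x, value in enumerate(row):
            if value != 1:
                continue  # only foreground skeleton pixels are searched
            n = count_neighbors(skel, y, x)
            if n == 1:
                endpoints.append((y, x))
            elif n >= 3:
                bifurcations.append((y, x))
    return endpoints, bifurcations


def midpoint(p, q):
    """Mean of two points, e.g. elbow = midpoint(hand endpoint, CP)."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
```

On real skeletons, pixels adjacent to a junction can also have three or more neighbors, so a practical implementation would typically merge nearby bifurcation candidates before naming joints.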

2) GAUSSIAN-BASED ELLIPTICAL MODELING
The binary image $I$ with skeleton $S$ and $k$ joint annotations (i.e., $k = 18$) is fed to the elliptical modeling phase. As the object is already detected in Section B, in this section a Gaussian Mixture Model-Expectation Maximization (GMM-EM) algorithm [52] is implemented to represent each body part with an ellipse. First of all, the human is represented by non-overlapping or partially overlapping circles $CC$. A graph $G = (V, W)$ is constructed in which $V$ is the set of nodes that represent the endpoints of the skeleton and $W$ is the set of edges. The skeleton is segmented into parts $l_i$ by $W$, where $i \in \{1, \ldots, |W|\}$ and $|W|$ is the number of edges. From these segmented parts, the 16-bin histogram of the radii of the circles and the shape complexity $C$ are computed, where $C$ is an entropy function given as;
$$C = -\sum_{b=1}^{16} h_b \log h_b$$
where $h_b$ is the normalized count of the $b$-th histogram bin. Through these circles, ellipsoidal fitting within the boundary is initiated. Each joint location is the centroid $c$ of a circle and the line joining two consecutive joints in the skeleton is the radius $R$ of each circle. Each of these circles is tangent to the boundary at two points. The ellipse fitting process is carried out according to the GMM-EM algorithm. The GMM-EM algorithm is initialized by specifying the centroids and the number of clusters, because random initialization may lead to a suboptimal local minimum of the problem [53]. The shape skeleton obtained by EDT is exploited for a more informed decision at the GMM-EM initialization stage. The 2D Gaussian function used for clustering of foreground pixels is given by;
$$P_i(p) = A_i \exp\left(-\tfrac{1}{2}\,(p - c_i)^T M_i\,(p - c_i)\right)$$
where $P_i$ gives the probability of a foreground pixel $p$ belonging to an ellipse $E_i$ with origin $c_i$, and $M_i$ is a 2 × 2 matrix with the eccentricity and orientation information of $E_i$. Moreover, the amplitude $A_i = 1$ keeps the values of $P_i(p)$ at the ellipse boundary the same for all ellipses. In this way the probability of a particular point belonging to an ellipse depends only on its position, orientation and eccentricity and not on the area of the ellipse.
The object is already detected in Section B using SLIC. In this section the object is enclosed with a bounding box using connected components and blob analysis on the segmented image. The elliptical models for some sample images from SYSU and NTU datasets are shown in Fig. 6.
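To make the role of the amplitude and of the 2 × 2 shape matrix more concrete, the following Python sketch evaluates a Gaussian membership value for a pixel with respect to an ellipse and assigns the pixel to the most likely ellipse. It assumes that the matrix is built from the semi-axes `a`, `b` and the rotation angle `theta` as a rotated diagonal matrix with entries 1/a² and 1/b²; this parameterization and all function names are assumptions of the sketch, and the EM update of the ellipse parameters is not shown.

```python
import math


def ellipse_matrix(a, b, theta):
    """2x2 shape matrix for an ellipse with semi-axes a, b rotated by theta."""
    c, s = math.cos(theta), math.sin(theta)
    inv_a, inv_b = 1.0 / a ** 2, 1.0 / b ** 2
    m11 = c * c * inv_a + s * s * inv_b
    m12 = c * s * (inv_a - inv_b)
    m22 = s * s * inv_a + c * c * inv_b
    return ((m11, m12), (m12, m22))


def membership(p, center, m, amplitude=1.0):
    """Gaussian membership of pixel p for an ellipse (center, m)."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    q = m[0][0] * dx * dx + 2.0 * m[0][1] * dx * dy + m[1][1] * dy * dy
    return amplitude * math.exp(-0.5 * q)


def best_ellipse(p, ellipses):
    """Index of the ellipse with the highest membership for pixel p.

    ellipses is a list of (center, shape_matrix) pairs.
    """
    return max(range(len(ellipses)),
               key=lambda i: membership(p, ellipses[i][0], ellipses[i][1]))
```

With unit amplitude, every point on an ellipse boundary gets the same value exp(-1/2) regardless of the ellipse size, so a large torso ellipse does not dominate a small hand ellipse simply because of its area.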

3) PIXEL-LEVEL LABELING OF HUMAN BODY PARTS AND THEIR RESPECTIVE OBJECTS VIA CRF
After detecting different human body parts, the results of the elliptical modeling phase are fed to the pixel-level labeling phase, as each human body part is already segmented with an ellipse in the previous phase. In this phase, a fully-connected CRF [54] is used to assign a label to each pixel in each of the detected human body parts and the object. The relationship between the output variables (specified labels), represented as $y = (y_1, y_2, \ldots, y_N)$, and the observed features (such as pixel intensities), represented as the input variable $x$, is described by the CRF in the form of the conditional probability $P(y \mid x)$ [55]. During pixel labeling, $N$ is the total number of pixels and $y_i$ is the label assigned to the $i$-th pixel. The modeling of $P(y \mid x)$ in CRF is approached by representing $y$ as a Markov random field. The CRF is represented in the form of an undirected graph $G = (V, E)$ with $V$ as the set of vertices or nodes and $E$ as the set of edges of the graph. Each label $y_i$ corresponds to a node [56]. In order to assign a label to each pixel, the probability distribution of the CRF is defined in the form of an energy function as;
$$P(y \mid x) = \frac{1}{Z_x} \exp\left(-E(y; x)\right)$$
where $Z_x$ is the partition function that normalizes the probability distribution and $E(y; x)$ is the energy function, which is the sum of smaller clique potentials $\psi_c(y_c; x)$ represented as;
$$E(y; x) = \sum_{c} \psi_c(y_c; x)$$
So, to define a label for each pixel, an energy function is defined as;
$$E(y) = \sum_{i} \psi_U(y_i) + \sum_{(i,j) \in \varepsilon} \psi_P(y_i, y_j)$$
where $\psi_U$ is the unary energy component associated with each pixel and $\psi_P$ is the pairwise energy component associated with a set of pixel pairs $\varepsilon$. The most probable label assignment requires minimization of the energy function as;
$$y^* = \arg\min_{y} E(y; x)$$
For inference, a mean-field algorithm that approximates energy minimization is used, as given in Algorithm 1 [57]. This algorithm is initialized with the Gibbs distribution and iterates in a loop, updating $Q$ for energy minimization. The weighted Gaussian is computed in a message passing step.
In order to train the model, maximum likelihood is used. The parameters that produce the training data with the highest probability under the model are chosen. On a training sample $((x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T))$ the conditional log likelihood is maximized with respect to the unknown parameter $\theta$ as;
$$\theta^* = \arg\max_{\theta} \sum_{t=1}^{T} \log P(y_t \mid x_t; \theta)$$
Now a CRF model is trained to predict a correct label for each segmented body part and the interaction object. Some of the results of the CRF in the form of labelled body parts and objects are demonstrated in Fig. 7.
Algorithm 1 Mean-Field Inference
while not converged do
Message passing from all $X_j$ to all $X_i$, where $k^{(m)}(f_i, f_j)$ is the Gaussian kernel and $f_i$, $f_j$ are the feature vectors for pixels $i$ and $j$
Adding weight to filter outputs
Adding unary potentials
Normalizing: $Q_i(l) \leftarrow \frac{1}{Z_i} \exp(Q_i(l))$
end while
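The mean-field loop can be illustrated with a deliberately naive Python sketch for a handful of pixels. It assumes a single Gaussian kernel over the pixel features, a Potts-style compatibility (a label is penalized by the kernel-weighted belief that neighbors hold a different label) and synchronous updates; these choices, the function names and the parameter values are assumptions of the sketch, not details confirmed by the paper. The cost is quadratic in the number of pixels, whereas practical fully-connected CRFs rely on fast filtering.

```python
import math


def normalized_exp(energies):
    """Return exp(-energy) normalized to sum to 1 (numerically stable)."""
    lowest = min(energies)
    exps = [math.exp(-(e - lowest)) for e in energies]
    z = sum(exps)
    return [e / z for e in exps]


def mean_field(unary, feats, weight=1.0, sigma=1.0, iters=10):
    """Naive mean-field inference for a small fully-connected CRF.

    unary[i][l] is the unary energy of label l at pixel i and feats[i]
    is the feature vector of pixel i. Returns the marginals Q[i][l].
    """
    n, num_labels = len(unary), len(unary[0])
    # Gaussian kernel between every pair of distinct pixels
    kernel = [
        [0.0 if i == j else math.exp(
            -sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
            / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    q = [normalized_exp(u) for u in unary]  # initialization from unaries
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # message passing: kernel-weighted beliefs of all other pixels
            msg = [sum(kernel[i][j] * q[j][l] for j in range(n))
                   for l in range(num_labels)]
            total = sum(msg)
            # weighting, Potts compatibility and adding the unary energy
            energy = [unary[i][l] + weight * (total - msg[l])
                      for l in range(num_labels)]
            new_q.append(normalized_exp(energy))  # normalization
        q = new_q
    return q
```

In this toy setting a pixel with a weak unary preference is pulled towards the label of a nearby confident pixel, while a distant pixel has practically no influence because its kernel value is close to zero.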

D. HUMAN-OBJECT INTERACTION RECOGNITION
HOI recognition is the identification of triplets (human, verb and object). As the human, along with its body parts and interaction object, has already been detected and labelled, in this section the verb, i.e., the interaction between the human and the object, is identified. For HOI recognition, this section is sub-divided into three modules. The first is semantic feature extraction, the second is dimensionality reduction via FLDA, and the third is classification with KATH.

1) SEMANTIC FEATURE EXTRACTION
The two types of features extracted from each semantic region, including human body parts and objects, are fiducial points and 3D point cloud.

a: FIDUCIAL POINTS
The boundary pixel values on one side of each labeled region are denoted $RB = \{rb_1, rb_2, \ldots, rb_m\}$ and those on the other side are denoted $LB = \{lb_1, lb_2, \ldots, lb_n\}$, where $m$ and $n$ are the total numbers of pixels in RB and LB respectively. After identifying RB and LB, peak and valley points are detected on each side of the boundary. The first-order derivative is used to find local maxima and minima, which give the peaks and valleys respectively. A change in the slope of a boundary from negative to positive is referred to as a minimum while a change in slope from positive to negative is referred to as a maximum. The first order derivative of RB for $i = 1, \ldots, m - 1$ is $drb_i = rb_{i+1} - rb_i$ and its vector is given as;
$$dRB = [drb_1, drb_2, \ldots, drb_{m-1}]$$
The first order derivative of LB for $j = 1, \ldots, n - 1$ is $dlb_j = lb_{j+1} - lb_j$ and its vector is given as;
$$dLB = [dlb_1, dlb_2, \ldots, dlb_{n-1}]$$
The peaks $P_r$ of RB are given as;
$$P_r = \{\, rb_i : drb_{i-1} > 0 \ \text{and} \ drb_i < 0 \,\}$$
Similarly, the peaks $P_l$ of LB are calculated from dLB. On the other hand, the valleys $V_r$ of RB are calculated as;
$$V_r = \{\, rb_i : drb_{i-1} < 0 \ \text{and} \ drb_i > 0 \,\}$$
Similarly, the valleys $V_l$ of LB are calculated from dLB. If the contour of any body part is flat, i.e., if it has consecutive zeros, then the median point of the consecutive zeros is taken. The coordinate values of each fiducial point (FP) are recorded in a feature vector and tracked with each changing frame. Peak and valley points detected on the boundaries of some body parts are displayed in Fig. 8.
VOLUME 9, 2021
The Peak Points (PP) and Valley Points (VP) of all the body parts and the object may not be the same in every interaction class. So these points are fixed for each body part. Table 1 shows the number of peaks and valley points for each body part and object.
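Peak and valley detection from the first-order derivative of a boundary can be sketched in Python as follows. The sketch treats a boundary as a plain list of values, reports positions as list indices and places the extremum of a flat run at the middle of that run; the function names and the integer rounding of the middle index are illustrative assumptions.

```python
def first_derivative(boundary):
    """First-order differences d[i] = boundary[i + 1] - boundary[i]."""
    return [boundary[i + 1] - boundary[i] for i in range(len(boundary) - 1)]


def peaks_and_valleys(boundary):
    """Return (peak_indices, valley_indices) of a 1D boundary profile."""
    d = first_derivative(boundary)
    peaks, valleys = [], []
    last_sign = 0  # sign of the most recent non-zero slope
    last_idx = 0   # boundary index at which that slope ended
    for i, slope in enumerate(d):
        if slope == 0:
            continue  # flat run: wait for the next non-zero slope
        sign = 1 if slope > 0 else -1
        if last_sign != 0 and sign != last_sign:
            # sign change; with a flat run this picks its middle index
            mid = (last_idx + i) // 2
            if last_sign > 0:
                peaks.append(mid)    # positive to negative: maximum
            else:
                valleys.append(mid)  # negative to positive: minimum
        last_sign = sign
        last_idx = i + 1
    return peaks, valleys
```

For example, the profile `[0, 2, 2, 2, 0]` has a flat top covering indices 1 to 3, and the sketch reports a single peak at index 2.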

b: 3D POINT CLOUD
In this feature, humans along with their interaction objects are represented in the form of point clouds. The RGB labelled images are converted into 3D point clouds with xyz coordinates [59]. Let $K$ be a point cloud; then the coordinates of its $k$-th point are given as $X_p^k = (x_k, y_k, z_k)$. The pixels in the RGB image are converted into 3D points. This conversion is made on the basis of pixel coordinates and their corresponding intensity values. In order to extract features from the point cloud, these points are down-sampled using a Voxel Grid (VG) filter [60]. A voxel is a cell of a grid defined over the 3D point cloud. A spatial average is taken inside each voxel to down-sample the points. Those points which lie inside the voxel bounds are joined to form one output point. The points inside the voxel are replaced by their centroid as;
$$(\bar{x}, \bar{y}, \bar{z}) = \frac{1}{S} \sum_{(x, y, z) \in A} (x, y, z)$$
where $S$ is the number of points in a voxel $A$. In every interaction class, each human body along with the interaction object is down-sampled to 6000 cloud points while maintaining the posture or shape of the human and the interaction object, as shown in Fig. 9. The coordinate values along with the intensity of each point in the down-sampled point cloud are stored in a feature descriptor. The feature descriptors of the two extracted features of both the human and the object are concatenated at the end (see Algorithm 2).
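A voxel grid filter of the kind described above can be sketched with the Python standard library alone. The sketch assumes that each pixel becomes a point whose third coordinate is its intensity and that voxels are axis-aligned cubes of a fixed edge length; the function names and the voxel size are hypothetical, and the sketch does not target a fixed output size such as 6000 points.

```python
import math
from collections import defaultdict


def image_to_points(img):
    """Convert a 2D intensity image into (x, y, intensity) points."""
    return [(x, y, img[y][x])
            for y in range(len(img))
            for x in range(len(img[0]))]


def voxel_downsample(points, voxel_size):
    """Replace all points inside each voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        # integer voxel index along each axis
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    # spatial average of the points that fall into the same voxel
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in buckets.values()]
```

Increasing `voxel_size` merges more points into each centroid, so in practice the edge length would be tuned until the down-sampled cloud reaches the desired number of points.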

Algorithm 2 Semantic Feature Extraction
Input: Labelled images
Output: Feature Vectors containing Semantic Features

2) FISHER'S LINEAR DISCRIMINANT ANALYSIS
After combining the feature vectors of all the interaction classes, a complex matrix is generated. FLDA is used as a dimensionality reduction algorithm before classification. The objective is to reduce intra-class variance and to increase inter-class variance [23]. FLDA is applied to the interaction classes of each dataset individually. In a multiclass discriminant analysis, let $\mu_i$ be the mean of each of the $C$ HOI classes, $\Sigma$ be the covariance shared by all classes and $\mu$ be the mean of the class means; then the scatter between the classes is defined as;
$$\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T$$
where $T$ denotes the transpose, and the separation of classes in a direction $w$ is given as;
$$S = \frac{w^T \Sigma_b\, w}{w^T \Sigma\, w}$$
The rows represent the number of images in the training set of each dataset. The final dimension of the Sports dataset after feature reduction is 6120 × 250, that of the SYSU dataset is 6120 × 360 and that of the NTU dataset is 6120 × 380. The scatter plots for the Sports and NTU datasets are displayed in Fig. 10.
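The two quantities used by FLDA, the between-class scatter and the class separation along a direction, can be illustrated with a small pure-Python sketch. It assumes the common multiclass formulation in which the between-class scatter is the average outer product of the class-mean deviations and the separation is a ratio of two quadratic forms; the function names are hypothetical, and solving for the optimal directions (an eigenvalue problem) is not shown.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def between_class_scatter(class_means):
    """Average outer product of (class mean - mean of class means)."""
    mu = mean_vector(class_means)
    c, dim = len(class_means), len(mu)
    return [[sum((m[r] - mu[r]) * (m[k] - mu[k]) for m in class_means) / c
             for k in range(dim)]
            for r in range(dim)]


def quadratic_form(w, matrix):
    """Compute w^T * matrix * w."""
    dim = len(w)
    return sum(w[r] * matrix[r][k] * w[k]
               for r in range(dim) for k in range(dim))


def separation(w, scatter_b, sigma):
    """Class separation along direction w for a shared covariance sigma."""
    return quadratic_form(w, scatter_b) / quadratic_form(w, sigma)
```

Directions with a large separation value are the ones kept after dimensionality reduction, since projecting onto them spreads the class means apart relative to the within-class spread.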

3) K-ARY TREE HASHING
The optimized vectors of all three classes are fed to a KATH classifier. It is a graph-based classifier given as $G = \{g_i\}$ with $i = 1, \ldots, N$, where $N$ is the total number of objects in the graph [61]. The graph consists of vertices $V$, undirected edges $E$ and a label function $l : V \to L$ that assigns labels to the nodes in $g_i$ from a label set $L$. Based on the structure of the graph and the node labels, a class label $y_i$ is also given to each graph $g_i$. Furthermore, a size $K$ of the traversal table and MinHashes $\{D^{(r)}\}_{r=1}^{R}$ for $R$ iterations are also specified [62]. Random permutation functions $\pi$ are applied, with an indicator $1(\text{state})$ that is 1 if the state is true and 0 otherwise. The KATH algorithm consists of three steps, namely, traversal

IV. EXPERIMENTAL SETUP AND RESULTS
This section gives the details of each experiment performed to validate the proposed system. All the processing and experimentation is performed in MATLAB (R2018a). The hardware system used is an Intel Core i5 with 64-bit Windows 10.
The system has 8 GB of memory and a 5 GHz CPU. We divided the experiments into two sections. The first section covers HOI recognition performance, in which the recognition accuracy of each interaction class is given in the form of a confusion matrix; the precision, sensitivity, specificity and F1 scores are also measured, along with comparisons of the proposed method with other state-of-the-art (SOTA) methods. In the second section, i.e., pixel-level labeling, the label accuracies for each human body part and object are measured using CRF. All these experiments are performed using a Leave One Subject Out (LOSO) cross-validation scheme. Each dataset is divided into N subsets containing k images. First, all subsets except one are used to train the system and the remaining subset is used for testing. The system is then validated by taking another subset for testing and the remaining subsets for training. The images of the subsets that are used for training are not included in the testing set. In the case of the Sports dataset, there is a different subject in each image of each interaction class, so LOSO cannot be applied. The Sports dataset is therefore split by taking 50% of the images of each interaction class for training and the remaining 50% for testing. In the case of the SYSU and the NTU RGB+D datasets, LOSO is applied, in which the actions performed by one subject are used for testing and the actions performed by the rest of the subjects are used for training. This section is further divided into two sections: dataset description and experimental results.
Algorithm 3 K-Ary Tree Hashing
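The two evaluation protocols can be sketched in Python as follows. The sketch assumes that every sample carries a subject identifier for LOSO and a class label for the 50/50 split; the function names and the data layout are illustrative, and the split simply takes the first half of each class without shuffling.

```python
def loso_splits(samples):
    """Yield (held_out_subject, train_items, test_items) for LOSO.

    samples is a list of (subject_id, item) pairs.
    """
    subjects = sorted({subject for subject, _ in samples})
    for held_out in subjects:
        train = [item for subject, item in samples if subject != held_out]
        test = [item for subject, item in samples if subject == held_out]
        yield held_out, train, test


def half_split(items_by_class):
    """Split each class 50/50 into train and test (first half trains)."""
    train, test = [], []
    for label, items in items_by_class.items():
        cut = len(items) // 2
        train += [(label, item) for item in items[:cut]]
        test += [(label, item) for item in items[cut:]]
    return train, test
```

Each LOSO fold keeps all samples of one subject out of training, so no subject contributes to both the training and the testing set of the same fold.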

A. DATASETS DESCRIPTION
The three datasets used for experimentation are the Sports dataset, the SYSU 3D HOI dataset and the NTU RGB+D dataset. Details of each dataset are given in the following subsections.

1) THE SPORTS DATASET
This is a static image dataset that consists of six RGB sports activities. The activities performed in this dataset are: cricket batting, cricket bowling, croquet shot, tennis forehand, tennis serve and volleyball smash. The details of the dataset and samples are given in [63]. This is a complex dataset as the poses and scenes of many interaction classes are similar to each other, e.g., volleyball smash and tennis serve. A few sample images of the Sports dataset are displayed in Fig. 11.

2) THE SYSU 3D HOI DATASET
This is an RGB-D dataset in which a Kinect sensor is used to collect RGB and depth images. This dataset consists of twelve human-object interactions performed by 40 participants. The HOI classes in this dataset are sweeping, mopping, taking from wallet, taking out wallet, moving chair, sitting chair, packing backpacks, wearing backpacks, playing phone, calling phone, pouring and drinking. There are 480 video clips with durations ranging from 1.9 s to 21 s. The details of the dataset and samples are given in [64]. A few RGB and depth sample images of the SYSU dataset are displayed in Fig. 12.

3) THE NTU RGB+D DATASET
This is an RGB-D dataset that contains RGB, depth and 3D skeletal data. This dataset contains 56,880 video samples of 60 action classes. Only the RGB data is in the form of video, while the depth data is provided in the form of image sequences. Out of the 56,880 video samples provided for 60 action classes, 2,715 video samples are used in the proposed work. The actions in this dataset fall into three categories: 40 daily actions (e.g., reading, drinking, eating), nine medical conditions (e.g., falling down, sneezing, staggering), and 11 mutual actions (e.g., hugging, punching, kicking). From the 40 daily actions, we only worked on twelve human-object interactions. The twelve interactions that we used for training and testing in the proposed system are: drink water, eat meal, brush teeth, brush hair, tear up paper, put on jacket, take off jacket, put on a hat/cap, take off a hat/cap, phone call, play with phone/tablet and taking a selfie. The objects in these interactions are: glass of water, meal, paper, jacket, hat and phone. The rest of the details and samples are given in [65]. A few sample images of the NTU RGB+D dataset are displayed in Fig. 13.

B. PERFORMANCE METRICS AND RESULTS
Two types of experiments were performed to validate the system: HOI recognition performance via KATH and pixel-level labeling performance via CRF. The results of each experiment are given in the following subsections.

1) HOI RECOGNITION PERFORMANCE
In this section, the performance of the system is validated in terms of mean accuracy, precision, sensitivity, specificity and F1 scores. A comparison of the proposed system with other SOTA methods is also given in this section. The results for each performance metric are given in the following subsections.

a: HOI CLASSIFICATION ACCURACY
This experiment was repeated three times on the testing sets of each dataset individually to evaluate the classification accuracy using the KATH classifier. The results of this experiment are given in the form of confusion matrices showing the true positives, true negatives, false positives and false negatives for each class individually. The confusion matrices for the Sports, SYSU 3D HOI and NTU RGB+D datasets are given in Tables 2, 3 and 4 respectively. It can be observed from Tables 2, 3 and 4 that the classes of all three datasets achieved high recognition rates, with mean accuracy rates of 92.88%, 93.5% and 94.16% for the Sports, SYSU and NTU datasets respectively. However, there is still some confusion between interaction classes that involve similar actions, such as the tennis forehand and the volleyball smash interactions in the Sports dataset. Similarly, the sweeping and mopping interactions of the SYSU dataset are confused with each other. It can also be observed from the results of this experiment that confusion happens among interaction classes that involve similar objects, for example, the moving chair and sitting chair interactions of the SYSU dataset and the phone call, play with phone and taking selfie interactions of the NTU dataset.
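As an illustration of how a confusion matrix and a mean accuracy of this kind can be obtained from classifier output, a minimal Python sketch is given below. It assumes that the mean accuracy is the average of the per-class recognition rates on the diagonal of the row-normalized matrix; the function names and toy labels are illustrative and are not taken from the original implementation.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with rows indexed by true class, columns by predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def mean_class_accuracy(matrix):
    """Average of per-class recognition rates (diagonal count / row total)."""
    rates = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(rates) / len(rates)


# Toy example: sweeping and mopping are confused once in each direction.
classes = ["sweeping", "mopping", "pouring"]
true_labels = ["sweeping"] * 4 + ["mopping"] * 4 + ["pouring"] * 2
predicted = (["sweeping"] * 3 + ["mopping"]
             + ["mopping"] * 3 + ["sweeping"]
             + ["pouring"] * 2)
matrix = confusion_matrix(true_labels, predicted, classes)
accuracy = mean_class_accuracy(matrix)
```

Off-diagonal entries of `matrix` show which pairs of interaction classes are confused with each other, which is how confusions such as sweeping versus mopping are read from the tables.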
The precision, sensitivity, specificity and F1 scores of the Sports, SYSU and NTU RGB+D datasets are given in Table 5, Table 6 and Table 7 respectively. It is observed from Table 5 that the positive predictive value, i.e., precision, is very high for all the classes of the Sports dataset. The lowest precision of 84% is achieved with volleyball smash due to its high false positive rate. The volleyball smash also has the lowest sensitivity and F1 score due to its confusion with cricket bowling and tennis forehand. The mean specificity of this dataset is 98%, which means that the system can accurately reject a sample if it does not belong to the class for which it is tested. In the case of the SYSU dataset, it is observed that the mean precision, sensitivity and F1 scores are as high as 93%, 93% and 94% respectively. Less precise results are obtained with the playing phone interaction due to its 12% false positive rate. This dataset has a mean specificity rate of 99%. From Table 7 it is inferred that the NTU RGB+D dataset has the highest precision, sensitivity, specificity and F1 scores, all above 90%, among the three datasets. In this dataset only the brush teeth interaction has a sensitivity lower than 90%, due to the lower visibility of the object and the resemblance of the action to interactions such as eat meal and brush hair. Overall, it is inferred from the results of this section that the proposed methodology is an accurate HOI recognition system.
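The four per-class measures discussed above can be derived from a multi-class confusion matrix by treating each class in a one-vs-rest manner. The following Python sketch uses the standard definitions of these measures; the function name and the toy matrix are illustrative assumptions, not the authors' code or data.

```python
def per_class_metrics(matrix):
    """Return precision, sensitivity, specificity and F1 for each class.

    `matrix[i][j]` counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    results = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                    # class k predicted as another class
        fp = sum(row[k] for row in matrix) - tp     # other classes predicted as class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results.append({"precision": precision, "sensitivity": sensitivity,
                        "specificity": specificity, "f1": f1})
    return results


# Toy 3-class count matrix (rows: true class, columns: predicted class).
toy_matrix = [[3, 1, 0],
              [1, 3, 0],
              [0, 0, 2]]
metrics = per_class_metrics(toy_matrix)
```

A class with many false positives, such as a class that absorbs samples from visually similar interactions, shows up here as low precision, while missed samples of that class lower its sensitivity.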

c: COMPARISON WITH OTHER SOTA METHODS
In this section the proposed method is compared with methodologies adopted by researchers for HOI recognition in recent years. The action recognition accuracies of each evaluated methodology are used for comparison with the proposed system. Table 8 gives the comparison of the proposed system with other recent SOTA systems. In [66], a spatial and probabilistic configuration of the object is used in the form of exemplars. In [26], a mutual context of both human and object is utilized to recognize different body parts and objects. In [67], the spatial relationship between human and object is learned based on geometrical properties. A contextual relationship between postured human body parts and the object is measured in [68]. A latent structural SVM is used for learning. The recognition rate of the proposed system on the Sports dataset is 92.88%, which is higher than that of the systems with which it was compared. Table 9 gives the comparison of the proposed system over the SYSU and the NTU RGB+D datasets. The results of the proposed system over the SYSU and NTU datasets are compared with joint heterogeneous features learning (JOULE), sparsified graph regression, multi-modality, Local Accumulative Frame Feature (LAFF), skeleton-based methods and pairwise feature-based models. The comparison showed a higher recognition rate for the proposed system compared to the other systems.
The comparison of the recognition accuracy of the proposed method with other methods is shown in the form of a bar graph in Fig. 14.

2) PIXEL-WISE LABELING PERFORMANCE
In this experiment the performance of the semantic segmentation that is implemented to label different human body parts is observed. This experiment is repeated three times to check the accuracy of pixel-level labeling in the interaction classes of each dataset via CRF. For this experiment each dataset is divided into 50% for training and 50% for testing. The interaction classes of each dataset are given separately to the trained CRF model. The true positives, true negatives, false positives and false negatives of each labelled body part and object are evaluated individually for each class. The mean accuracy (Acc) of each labelled body part and object is calculated from these counts. The per-class accuracy of each interaction class is calculated from the confusion matrix of each class individually. The accuracy measures for each body part and object of the Sports, SYSU HOI and NTU RGB+D datasets are given in Tables 10, 11 and 12 respectively. It is observed from the results of this section that the proposed technique of labeling using the elliptical model results in better labeling accuracy for each body part and object. It is also observed during experimentation that the pose of the human affects the recognition accuracy of some body parts. For example, in the croquet interaction of the Sports dataset, the accuracy of the neck is 79% as the person's posture is bent and the neck is not visible. Similarly, the labelling accuracy of the torso in the mopping interaction of the SYSU dataset is affected. Furthermore, the labelling accuracy of very small objects such as the toothbrush and wallet is also lower than that of larger objects. The overall accuracy rates over the three datasets are higher than 90% due to the prior phase of segmenting each body part with elliptical modeling via GMM-EM.
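The accuracy computation for a single label can be sketched as follows. The sketch assumes the standard definition Acc = (TP + TN) / (TP + TN + FP + FN), evaluated per label over all pixels; this definition, the function name and the toy label maps are assumptions made for illustration and are not taken from the original implementation.

```python
def label_accuracy(true_map, predicted_map, label):
    """Pixel-level accuracy of one label, computed one-vs-rest.

    Both maps are 2D lists of label names with identical shape.
    """
    tp = tn = fp = fn = 0
    for true_row, predicted_row in zip(true_map, predicted_map):
        for true, predicted in zip(true_row, predicted_row):
            if true == label and predicted == label:
                tp += 1          # pixel correctly given this label
            elif true != label and predicted != label:
                tn += 1          # pixel correctly given some other label
            elif predicted == label:
                fp += 1          # pixel wrongly given this label
            else:
                fn += 1          # pixel of this label that was missed
    return (tp + tn) / (tp + tn + fp + fn)


# Toy 2x3 label maps: one head pixel is mislabelled as torso.
ground_truth = [["head", "head", "torso"],
                ["torso", "torso", "object"]]
prediction = [["head", "torso", "torso"],
              ["torso", "torso", "object"]]
accuracies = {name: label_accuracy(ground_truth, prediction, name)
              for name in ("head", "torso", "object")}
```

Under this definition a single mislabelled pixel lowers the accuracy of both the label it belongs to and the label it was assigned, which is consistent with small or occluded parts such as the neck scoring lower than larger parts.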

V. DISCUSSION
The proposed approach was tested with one outdoor RGB dataset and two indoor RGB+D datasets and showed better performance than the compared methods. The system is therefore applicable to both indoor and outdoor environments. It is a complete interaction recognition system, from the preprocessing of images to the recognition of each interaction class. It should be applicable to many real-world scenarios such as human behavior monitoring systems, surveillance systems and smart homes. However, the system also has some limitations; in particular, the detection of smaller objects and body parts is challenging. For example, the mean labeling accuracy of the neck is 88.17%, 88.73% and 89.08% in the Sports, SYSU and NTU RGB+D datasets respectively. This labeling accuracy is lower than that of other body parts due to the occlusion of the neck by either the objects or other body parts. Similarly, in the brush teeth interaction of the NTU RGB+D dataset, the labeling accuracy of the toothbrush is lower than that of other objects due to its smaller size and its occlusion by the hand of the person. However, in the proposed work we still achieved 84% labeling accuracy due to the efficient SLIC segmentation algorithm and CRF-based labeling.

VI. CONCLUSION
In this paper, we proposed a novel framework for HOI recognition. The proposed system is based on semantic human body part segmentation and feature extraction. First, an efficient silhouette segmentation of the human and object is performed via the SLIC algorithm. Then human body parts are modeled via Gaussian-based elliptical modeling and labelled at the pixel level using CRF. Two semantic features, i.e., fiducial points and 3D point clouds, are extracted. These feature descriptors are then optimized via FLDA and classified with a KATH classifier. The validity of the proposed system is demonstrated via extensive experimentation. The experimental section is divided into two sections. First, the performance of HOI recognition is evaluated via accuracy, precision, sensitivity, specificity and F1 scores. The mean accuracy achieved with the Sports dataset is 92.88%, while for the SYSU dataset it is 93.5% and for the NTU RGB+D dataset it is 94.16%. Comparison with other SOTA systems showed that the proposed semantic HOI recognition system is more precise. A few instances of confusion occur between interaction classes based on similar objects. However, the high precision, sensitivity, specificity and F1 scores show that the proposed system has a high capability of assigning accurate labels to each class and of rejecting a sample if it does not belong to a specific label. In the second experiment the performance of semantic segmentation is evaluated via the accuracy of human body part and object labeling in the interaction classes of each dataset. The overall semantic segmentation accuracy of the proposed system is 91.26% with the Sports dataset, 91.23% with the SYSU dataset and 91.42% with the NTU dataset. The proposed system is therefore not limited to human interaction recognition; it is also applicable to other domains of computer vision such as human body part segmentation, labelling and human pose estimation.
It should be applicable to many computer-vision-based applications such as healthcare monitoring, security systems and assisted living.
In the future, we plan to investigate new features in order to work on multi-human and multi-object systems. Furthermore, we would like to work on more complex scenarios for human action recognition, and to increase the efficiency of labelling by applying deep learning techniques.
YAZEED YASIN GHADI received the Ph.D. degree in electrical and computer engineering from Queensland University. He is currently an Assistant Professor of software engineering with Al Ain University. He was a Postdoctoral Researcher with Queensland University, before joining Al Ain University. He has published more than 25 peer-reviewed journals and conference papers and holds three pending patents. His current research interests include developing novel electro-acoustic-optic neural interfaces for large-scale high-resolution electrophysiology and distributed optogenetic stimulation. He was a recipient of several awards. His dissertation on developing novel hybrid plasmonic photonic on-chip biochemical sensors received the Sigma Xi Best Ph.D. Thesis Award.