Multi-sensorial Human Perceptual Experience Model Identifier for Haptics Virtual Reality Services in Tactful Networking

The tactful networking paradigm is expected to play a crucial role in next generation networks. Accordingly, adaptive human-aware environments, sensitive to daily human behavior and individual traits, have to be provided in order to offer a fully immersive and customized experience to users. On the basis of data collected by actual cognitive experiments, this paper proposes a learning framework to discover the multi-sensory human perceptual experience. The paper applies the mixture density network to identify the perception model for each sense, and then performs the multi-sensory integration according to the actual neuro-cognitive model. Furthermore, a supervised learning module is used to cluster the users on the basis of the human perception identification strategy previously designed, assuming a multimodal structure for the cognitive brain activity. Finally, a practical contextualization is presented in relation to haptics virtual reality services. The results show the effectiveness of the tactful, i.e., brain-aware, approach based on the proposed framework, which is validated against the more conventional brain-agnostic scheme. In fact, the system performance, expressed in terms of reliability in guaranteeing the service exploitation before a target deadline based on the integrated perception, improves remarkably when applying the brain-aware strategy, which exploits the human perception knowledge.


I. INTRODUCTION
One of the most salient features of next generation networks is expected to be the need to manage human-aware applications, for example ultimate virtual reality services or, more in general, those involving extended reality (XR) or haptic communications. In fact, the new era of wireless networks will be characterized by architectures, communication models and technological solutions able to guarantee services based on daily human behavior, including the psychological and cognitive aspects of the human brain, as well as everyday human habits, in order to meet the users' expectations more naturally [1]. The recently emerged tactful networking paradigm has marked a divide between the user-centric applications, typical of the previous network generations, and the new era human-aware perspective, denoting a novel interdisciplinary area devoted to human behavior analysis. The user-centric approach is interested in whether and how the user participation may help to solve the challenges of the networks, in order to guarantee the satisfaction of the users' target requirements. Differently, with the advent of tactful networking, which boosts the human-in-the-loop vision, the attention has shifted to the extraction and analysis of the human traits, expressed in terms of human perception and of the capability of humans to observe, interpret, and react to the surrounding environment. Looking forward, by involving the subjective sphere of the individuals, such as the human personality, the routines, the brain cognitive limitations, and so on, within the design of new era networks and applications, the quality-of-experience (QoE) may be empowered, offering the chance to realize tactful network ecosystems, i.e., environments adaptive to the human context, sensitive to human behavior and interests, and able to support real-time and interactive virtual environments soliciting the users' five senses [1].
Within this context, the exploitation of advanced machine learning techniques to deeply investigate the cognitive aspects of the human brain and the individual behavioral traits is gaining momentum, and is expected to play a crucial role in next generation networking [1]. In reference to this, some pioneering multidisciplinary studies have been proposed, for example [2] and [3], in which the users' delay perception has been analyzed and substantial efforts have been made to design a framework able to capture the human perception dynamics in relation to a wide range of individual features, such as age, personality, etc., representing remarkable advances over the literature. What emerges in papers [2] and [3] is that there exist individual human brain limitations in perceiving quality-of-service (QoS) improvements above a certain threshold, leading to biases in how the human brain translates the QoS into the actual QoE level, perceiving a QoE lower than the expected one [2]. In fact, the recent multi-disciplinary literature has started highlighting the actual cognitive limitations of the human brain in perceiving different QoS levels in video streaming transmission, in reference to both the rate and the delay metrics [2], [4], [5], confirming that human users cannot perceive some QoS gains due to intrinsic cognitive constraints.
Nevertheless, the challenges concerning the design of models and frameworks that realistically fit the actual capability of the human behavioral perception, in order to incorporate it in the practical realization of XR environments soliciting several senses simultaneously, are still numerous. For example, due to the intrinsic complexity of the data collection and the multi-faceted nature of the topic, the literature lacks models to identify human perceptions other than that related to the visual cortex cognitive awareness.
This paper proposes the formulation of a statistical learning framework to model the human perception of the users considering different senses. More in depth, the contributions of the paper in comparison to the existing literature can be summarized as follows:
• A multidisciplinary machine learning based framework, which aims at providing a model to identify the multi-sensorial delay perceptions of the human brain, considering actual data deriving from experiments reported in cognitive science journals, such as [6]-[8], and credited neurocognitive perceptive brain models [9]-[13];
• The application of a Gaussian mixture based learning method to identify the human perception related to different sensory channels, in order to cluster users in relation to the Gaussian mode which best fits the corresponding perception dynamic;
• The design of a framework to model the multi-sensorial human perception experience considering the integration of different sensory channels stimulated simultaneously, and the potential dominance of the visual perception over the haptic cue.
The proposed framework has been contextualized to a tactful network arranged to handle haptics virtual reality (HVR) services, to show the advantages of human-perception based solutions, in reference to the system reliability in guaranteeing the service exploitation before a target deadline.
The rest of the paper is organized as follows. Section II presents the review of the prior works related to our study. Section III describes the proposed human perception identifier model. Then, Section IV presents the contextualization of the proposed model to an HVR scenario and the corresponding performance analysis. Finally, the conclusions are drawn in Section V.

II. RELATED WORKS
Several works have been proposed involving the users' personal information, such as the behavioral patterns or the social interactions, into frameworks devoted to addressing the traditional problems of modern networks. Examples are represented by papers [15], [19], and [2]. The user-centric vision was adopted in paper [15], where the authors aim at optimizing the resource allocation procedure in wireless small cell networks through matching theory, considering device-to-device communication. Moreover, the authors in [15] apply matching theory considering the social aspects as relevant feature, providing a context-aware resource allocation framework based on the social interactions. Studies about the human perception have been proposed in [19], in which the authors design a spectrum sharing strategy on the basis of the psychological behavior of the users involved in the network. Paper [2] presents a power minimization considering the limitations of the human brain in perceiving different QoS levels, applying the Lyapunov optimization.
The multi-sensory perception has been the focus of paper [20], in which a multi-sensory learning framework has been designed to convert text sentences into visual and auditory representations to support autistic students. The paper [21] conducts an experiment about the human perception, in reference to both the visual and proprioceptive stimuli. The experiment is focused on 8 subjects with trained proprioception and 8 subjects with visual training. The corresponding results highlight that the learning rate of the visual perception is remarkably higher than that related to the proprioception; although the statistical error decreases as the training increases, it remains significant for proprioception and non-significant for the visual channel. Furthermore, the error in the test phase is higher than in the training phase because, during the test, the process involved is the multi-sensory integration. The authors in paper [22] address the problem of the human-computer interaction, aiming at providing the state-of-the-art in data representation involving more than one sensory channel. The paper has considered 154 examples of multisensory data representations, in order to propose a design space along three dimensions: use of modalities, representation intent and human-data relations. In the paper [23], a control system prototype based on Arduino has been designed for heat and scent simulations within VR environments. The proposed system represents a multisensory VR simulator developed in Unity 3D to prevent hazards such as fire, smoke, and so on. The paper [24] exploits a multilayer perceptron architecture to classify unisensory stimuli, on the basis of which a generalized feature-integrating model is then proposed to analyze the multisensory combination.
Furthermore, the development of a multi-sensory augmented reality system for cultural heritage has been performed in [25], aiming at evoking several stimuli with the SensiMAR prototype. The proposed system is intended to offer a multi-sensory experience involving visual reconstructions, the soundscape of ancient times, and smells very common during the historical period considered.
This paper aims at pursuing a multi-disciplinary approach to the multi-sensory human perception, in order to provide a valuable integrated multi-sensory model to be exploited in tactful networking applications. Although many studies have been conducted in relation to the human visual-haptic combination, this paper has the goal of merging the neurocognitive aspects of the human brain with machine learning techniques, in order to provide a comprehensive framework for human perceptual experience to be exploited in tactful networking. In this sense, to the best of the authors' knowledge, there is not yet a similar work in the existing literature.

III. HUMAN PERCEPTUAL EXPERIENCE MODEL IDENTIFIER
As is evident from the literature [2], [3], [6]-[8], the human cognitive perception does not result from only one distinctive trait of the individuals; rather, it derives from a multitude of attributes, hereafter referred to as features, such as gender, age, diseases, or the distance from a reference point.
In this paper, the presence of diseases inhibiting one or more sensory channels is not taken into account, nor is the condition in which the sensory stimuli do not refer to the same scene. In order to design and support reactive XR environments soliciting multiple senses, the match between the users' feature profiles and the corresponding cognitive perception has to be performed. Focusing first on a single sensory channel $j$, let $\mathcal{F} = \{1, ..., n\}$ be the set of users, and let $\{f_1, ..., f_n\}$ be the set of the features vectors corresponding to the $n$ users, respectively. Consequently, $f_i \in \mathbb{R}^m$ represents the features vector of the $i$-th user, $i = 1, ..., n$, and $F^j$ represents the features matrix, in which entry $i$ is $f_i^T$. Then, the vector $\beta^j$ is the user uni-sensory perception vector, whose element $\beta_i(t)^j$ expresses the perception of user $i$ in reference to the $j$-th human sense. The features vectors can be collected from neurocognitive experiments focusing on each sense, such as those in [6]-[8]. In accordance with Fig. 1 and Fig. 2, the proposed learning framework consists of an unsupervised learning module implementing the mixture density network (MDN) estimation [26], and of a supervised learning solution to cluster the users on the basis of the features they exhibit. More in depth, the unsupervised learning module matches the users' features vectors $f_i$ with the most probable perception mode, supposing a user uni-sensory perceptual experience. Then, a strategy for the multi-sensory integration is proposed, on the basis of the uni-sensory perceptual values previously considered. In this way, a labeled dataset is produced, in which each features vector is classified on the basis of the multi-sensory perception value.
VOLUME 4, 2016
Therefore, the labeled dataset is used to train a supervised learning module aiming at classifying the users on the basis of their multi-sensory perceptual experience value, in accordance with the corresponding features vector.

A. UNSUPERVISED UNI-SENSORY PERCEPTUAL EXPERIENCE IDENTIFICATION MODEL
The unsupervised learning block consists of the MDN module, which adopts the Gaussian mixture model (GMM) and the Expectation Maximization (EM) algorithm to solve the unsupervised clustering problem [2], [27]. According to statistical learning theory [27], the mixture model exploits the Bayes rule to give the degree of membership of an element to any cluster, hereafter referred to as brain mode, by resorting to the conditional probabilities. More in depth, the MDN module models the conditional probability distribution $p(\omega_i^j | f_i)$, where $\omega_i^j \in \mathbb{R}^{m+1}$ represents the vector including $\beta_i(t)^j$ and the corresponding features $f_i$. From the GMM assumption it follows that this distribution is a normalized linear combination of $K$ Gaussian distributions. Therefore, we obtain that [2], [27]

$$p(\omega_i^j | f_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\omega_i^j | \mu_k^j, C_k^j), \tag{1}$$

in which $\mathcal{N}(\omega_i^j | \mu_k^j, C_k^j)$ represents the $k$-th Gaussian component density, characterized by mean vector $\mu_k^j$ and covariance matrix $C_k^j$, considering the sensory channel $j$. Furthermore, by introducing the latent binary random vector $z$, in which only one component is set to one and the others are zero, we have that $p(z_k = 1) = \pi_k$, which expresses the Gaussian component activated. More in depth, under the assumption of a multi-modal cognitive human activity, $z$ gives information about the Gaussian mode fitting the human brain. In order to cluster data on the basis of the most probable mode to which they belong, the EM algorithm aims at finding $\pi_k$, $\mu_k^j$ and $C_k^j$ that maximize the log likelihood function expressed by [27]

$$\ln p(\omega_1^j, ..., \omega_n^j | \pi, \mu^j, C^j) = \sum_{i=1}^{n} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\omega_i^j | \mu_k^j, C_k^j) \right\}. \tag{2}$$

Due to the complexity in maximizing (2), the EM acts in accordance with the following steps [27]:
1) Randomly initialize $\pi_k$, $\mu_k^j$ and $C_k^j$;
2) For each mode, compute the responsibility [27]

$$\gamma(z_{ik}) = \frac{\pi_k \, \mathcal{N}(\omega_i^j | \mu_k^j, C_k^j)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(\omega_i^j | \mu_l^j, C_l^j)}; \tag{3}$$

3) On the basis of the current responsibility values, update the parameters;
4) Repeat steps 2)-3) until convergence.
Consequently, the output of the EM, i.e., of the unsupervised learning, is the GMM corresponding to the dataset matrix $M^j$, where $M^j = [F^j || \beta^j]$. Therefore, in reference to the obtained GMM, each data sample $\omega_i^j$ is labeled with the most probable Gaussian component, computed as follows [26]

$$\chi_i^j = \arg\max_{k} \, p(z_k = 1 | \omega_i^j), \tag{4}$$

in which

$$p(z_k = 1 | \omega_i^j) = \frac{\pi_k \, \mathcal{N}(\omega_i^j | \mu_k^j, C_k^j)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(\omega_i^j | \mu_l^j, C_l^j)}. \tag{5}$$

At this stage, before resorting to the supervised learning module, we need to integrate the output vectors of the unsupervised learning, i.e., $y^j = [\chi_1^j, ..., \chi_n^j]^T$, for each sense $j$.
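The EM steps and the most-probable-mode labeling described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: for brevity it assumes scalar samples (a single perception value per user instead of the full vector including the features), and it replaces the random initialization with a deterministic quantile-based one so that the run is reproducible. The function names `fit_gmm_1d` and `most_probable_mode` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, n_modes, n_iter=100):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(samples)
    ordered = sorted(samples)
    # Step 1 (assumption): deterministic initialization at evenly spaced quantiles
    # instead of the random initialization, to keep the sketch reproducible.
    means = [ordered[int((k + 0.5) * n / n_modes)] for k in range(n_modes)]
    overall_mean = sum(samples) / n
    overall_var = sum((x - overall_mean) ** 2 for x in samples) / n
    variances = [max(overall_var, 1e-6)] * n_modes
    weights = [1.0 / n_modes] * n_modes
    for _ in range(n_iter):
        # Step 2 (E-step): responsibility of each mode for each sample.
        resp = []
        for x in samples:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            if total == 0.0:  # numerical underflow: fall back to uniform
                resp.append([1.0 / n_modes] * n_modes)
            else:
                resp.append([p / total for p in joint])
        # Step 3 (M-step): update the parameters from the responsibilities.
        for k in range(n_modes):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # empty mode: keep its previous parameters
            means[k] = sum(r[k] * x for r, x in zip(resp, samples)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, samples)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a degenerate component
            weights[k] = nk / n
    return weights, means, variances


def most_probable_mode(x, weights, means, variances):
    """Index of the component with the largest posterior probability for x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return max(range(len(joint)), key=joint.__getitem__)
```

A fixed number of iterations stands in for a convergence test on the log likelihood; a multivariate version would replace the scalar variance with a covariance matrix.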

B. MULTI-SENSORY INTEGRATION
As previously detailed, the unsupervised learning module returns, for each sense $j$, the output vector $y^j = [\chi_1^j, ..., \chi_n^j]^T$, expressing the most probable brain mode for each user. Let $K_k^j$ be the $k$-th cluster, which includes all the users having $\chi_k^j$ as most probable mode, in reference to the sensory channel $j$. Therefore, among the elements belonging to $K_k^j$, the user $i^\star$ exhibiting the minimum perception value can be identified as follows

$$i^\star = \arg\min_{i \in B_k^j} \beta_i^j, \tag{6}$$

where the set $B_k^j = \{i \in \mathcal{F} \,|\, f_i \text{ has mode } K_k^j\}$. Similarly, the user belonging to the $k$-th cluster exhibiting the maximum perception value can be defined by replacing the minimum in (6) with the maximum. Intuitively, the perception level of the users belonging to the $k$-th cluster, considering the sensory system $j$, lies within the range delimited by these two values. Since the lower the delay perceived on a sensory channel, the greater the sensitivity of the users in discerning the fluctuations occurring on that sense, it is reasonable to consider the worst case value $\beta^j_{i^\star k}$ as delay perception threshold for the considered $k$-th cluster of users when only one sensory channel is involved. In fact, the clustering method previously discussed may be useful to group together users with a common delay perception for a given sensory system. Nevertheless, whether and how the users' perception varies under the simultaneous stimulation of different sensory channels is still an open issue under debate. Some studies have highlighted that sending stimuli simultaneously on different sensory systems may originate interference on cognitive perception or even empower the overall human perception [9]-[13].
Although the advantages of the multisensory integration have emerged in many studies, other papers have highlighted that the brain, integrating information stemming from different channels, may lose the individual sensory system perception, leading to the metameric condition. Accordingly, there may be different physical stimuli that lead to exactly the same perceptual experience, indiscriminable from one another [10]. Within this context, in order to model the multi-sensory human perception under simultaneous stimulation of different sensory channels, we can define, for each user $i$, the multi-sensory vector $s_i \in \mathbb{R}^J$, where $J$ represents the number of sensory channels stimulated simultaneously. Furthermore, the entries of $s_i$ represent the clusters to which user $i$ belongs on each channel, derived from the module previously described. Then, numerous neuro-cognitive studies have highlighted that the integration of the sensory cues is interpreted by the brain as a weighted sum of the individual perceptions. Consequently, a prior estimation of the multi-sensory perception $\beta_i$ may be represented by the following expression

$$\beta_i = \sum_{j=1}^{J} w_j \, \beta_i^j, \tag{8}$$

where the weight $w_j$ has to be properly defined, for example as the normalized reciprocal variance:

$$w_j = \frac{1/\sigma_j^2}{\sum_{l=1}^{J} 1/\sigma_l^2}. \tag{9}$$

Consequently, the integrated perception $\beta_i$ is given by the sum of the uni-sensory perceptions, each weighted with the reliability of the corresponding perception estimation, since the reciprocal of the variance is a measure of the reliability of such an estimation. Nevertheless, other experimental studies, such as [11], [28], [29], have brought to light the existence of a relationship of dominance and recessiveness among the different sensory channels. In particular, the visual perception has been recognized as dominant when it is integrated with the haptic channel and the variance associated with the visual estimation is lower than that associated with the haptic estimation [11].
In this case, the weight associated with the visual system, $w_V$, is set according to (10), in which $w_H$ represents the weight given to the haptic cue, and $\sigma_H^2$ and $\sigma_V^2$ are the variances of the haptic and visual modalities, respectively.
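As a small illustration of the per-cluster threshold discussed above, the following sketch groups users by their most probable mode and returns, for each cluster, the user with the minimum perception value together with the minimum and maximum perception values. The function name `cluster_thresholds` and the sample values in the usage note are hypothetical.

```python
from collections import defaultdict


def cluster_thresholds(modes, perceptions):
    """For each mode label, return (index of min-perception user, min value, max value).

    modes[i] is the most probable mode of user i on one sensory channel,
    perceptions[i] is the uni-sensory perception value of the same user.
    """
    members = defaultdict(list)
    for user, mode in enumerate(modes):
        members[mode].append(user)
    summary = {}
    for mode, users in members.items():
        worst_user = min(users, key=lambda u: perceptions[u])
        summary[mode] = (
            worst_user,
            perceptions[worst_user],             # worst case: delay threshold
            max(perceptions[u] for u in users),  # upper end of the cluster range
        )
    return summary
```

For instance, with modes `[0, 0, 1, 1, 1]` and perceptions `[30.0, 25.0, 60.0, 55.0, 70.0]` (illustrative values in ms), cluster 0 takes its threshold from user 1 and cluster 1 from user 3.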
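The reliability-weighted sum with normalized reciprocal-variance weights, described earlier in this subsection, can be sketched as follows. The sketch deliberately does not implement the visual-dominance correction, and the function names are hypothetical.

```python
def reliability_weights(variances):
    """Normalized reciprocal-variance weights, one per sensory channel."""
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be strictly positive")
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [value / total for value in inverse]


def integrate_perception(unisensory, variances):
    """Weighted sum of the uni-sensory perception values of one user."""
    if len(unisensory) != len(variances):
        raise ValueError("one variance is required per sensory channel")
    weights = reliability_weights(variances)
    return sum(w * beta for w, beta in zip(weights, unisensory))
```

With two channels of variance 1 and 4, the weights are 0.8 and 0.2, so the channel with the more reliable estimate contributes most to the integrated value.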

C. SUPERVISED USER-MULTISENSORY PERCEPTION ASSOCIATION
Once the multi-sensory integration is provided, we have the pairs $\{f_i, \beta_i\}$, in which $\beta_i$ depends on the elements $y_i^j$, for each $j$, of the output vectors $y^j$, i.e., on $s_i$. Thus, the pairs $\{f_i, \beta_i\}$ are used to train the supervised learning module, which realizes a classifier that, for each human features vector $f_i$ as input, returns the corresponding integrated perception value $\beta_i$. Let $\beta = [\beta_1, ..., \beta_n]^T$ be the vector of the integrated perception values. Given the matrix $F^j$ and $\beta$, the supervised learning module builds a model $g$ that minimizes the loss $\zeta(\beta_i, g(f_i))$ over the training points $(f_i, \beta_i)$, where $\zeta(\cdot)$ is a 0-1 loss function [2], [27]. In addition, to prevent overfitting, the Elbow method has been applied [2], [27]. Figure 4 summarizes the whole proposed framework, emphasizing the relationships existing between its different modules. Therefore, the proposed framework, represented in Fig. 2, acts as follows:
1) For each features vector $f_i$, and each sensory channel $j$ stimulated, the GMM based on $\omega_i^j$ is applied;
2) For each user $i$ and sensory channel $j$, $\chi_i^j$ is computed;
3) For each most probable mode $\chi_i^j$, the variance $\sigma_j^2$ is calculated;
4) For each user $i$, a prior estimate of the multi-sensory perception $\beta_i$ is computed by resorting to the brain multi-sensory integration expressed by (8);
5) The model described in (8) and (9) is applied when $\sigma_V > \sigma_H$ to find $\beta_i$; otherwise, the multi-sensory perception $\beta_i$ is computed applying (8), (9), and (10);
6) At this point, a supervised learning module is applied to cluster the users according to their $\beta_i$, on the basis of the features vector $f_i$. More in depth, in reference to Fig. 2, this module aims at associating to the features vector $f_i$ of each user the corresponding effective delay perception $\beta_i$.
In fact, steps 1)-5) produce a labeled dataset which can be used for training the supervised learning module, making possible the association between the users' features vectors $f_i$ and the corresponding perception values $\beta_i$.
In the following section, a case study related to the VR haptics services is presented, exploiting the proposed multi-sensory human perceptual experience model identifier.
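The supervised association between features vectors and perception labels can be illustrated with a deliberately simple stand-in. The text does not specify which classifier is used, so the 1-nearest-neighbour rule below is an assumption chosen only because it fits in a few lines; the feature encoding in the usage note (age, gender code) and all function names are hypothetical.

```python
import math


def train_nearest_neighbour(features, labels):
    """Return a predictor g that maps a features vector to the label of the
    closest training vector (Euclidean distance)."""
    if len(features) != len(labels) or not features:
        raise ValueError("features and labels must be non-empty and aligned")
    training = list(zip(features, labels))

    def predict(query):
        nearest = min(training, key=lambda pair: math.dist(pair[0], query))
        return nearest[1]

    return predict


def zero_one_loss(predicted, actual):
    """0-1 loss: 0 for a correct label, 1 otherwise."""
    return 0 if predicted == actual else 1


def empirical_risk(predictor, features, labels):
    """Average 0-1 loss of the predictor over a labeled set."""
    losses = [zero_one_loss(predictor(f), b) for f, b in zip(features, labels)]
    return sum(losses) / len(losses)
```

A predictor trained on `[(25, 0), (30, 1), (60, 0), (65, 1)]` with labels `[40, 40, 80, 80]` returns 40 for the query `(27, 0)`. In practice the features would need to be scaled before distances are computed, and the risk should be evaluated on held-out data rather than on the training set.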

IV. BRAIN-AWARE END-TO-END DELAY ANALYSIS FOR HAPTICS VR SERVICES
The problem presented here is contextualized to the HVR services, which are among the most disruptive applications expected in next generation networks. It is important to highlight that the analysis proposed in the following has the only purpose of providing a contextualization to the framework previously designed. Therefore, the Markovian analysis presented represents a simplification of the problem in its complete form, and it is given in order to provide an insight into the practical application of the multi-sensory perception model developed. Nevertheless, the proposed contextualization can be extended to more complex and detailed analyses. Typically, HVR services solicit vision, audition, and touch. In order to provide a useful application example of the proposed framework, we refer to the scenario considered in [34] and [35], whose main parameters are reported in Table 1. In reference to Fig. 3, a set of $Q$ HVR users has been considered. Each user injects in the network a Poisson traffic flow with mean rate $\lambda_q$ packets/ms, $q = 1, ..., Q$. The communication link is provided by an SBS, expected to operate at terahertz frequencies, which represent one of the most promising options to guarantee communications characterized by simultaneously high rate, high reliability, and low latency for immersive VR experiences. After being processed by the computation subsystem, the packets are sent back to the user through the transmission subsystem. The focus here is the evaluation of the delay due to the computing and transmission subsystems, i.e., the end-to-end (e2e) delay, in reference to the target deadline $\bar{k}$. Both the computation and the transmission subsystem service times have been assumed as exponentially distributed, with mean service times $\bar{x}_1 = 5$ ms and $\bar{x}_2 = 0.75$ ms, in accordance with the reference scenario analyzed in [34]-[36].
Hereafter, the reliability of the system refers to the probability of guaranteeing the target deadline $\bar{k}$. The HVR packets arrive at the computational subsystem according to a Poisson process with mean rate $\lambda = Q\lambda_q$. The parameter of the exponential distribution modeling the service time of the computational subsystem is $\mu_1 = 1/\bar{x}_1$. Consequently, the corresponding processor load $\rho$ is given by $\rho = \bar{x}_1 \lambda_q$. The computation subsystem is therefore an M/M/1 system. By Burke's theorem [30], the arrival process at the transmission subsystem is a Poisson process with mean rate $\lambda$, independent from that at the computational subsystem. Then, the parameter of the exponential distribution modeling the service time of the transmission subsystem is $\mu_2 = 1/\bar{x}_2$.
Consequently, the overall e2e delay needed to accomplish a HVR request service, is given by the sum of two independent time contributions: the time needed to complete service at the computation subsystem (t 1 ), and the time spent at the transmission subsystem (t 2 ) to send back to the HVR user the outcome of the service requested.
Let $\psi_1(t)$ and $\psi_2(t)$ be the pdfs of the random variables $t_1$ and $t_2$, respectively. The pdf of the overall e2e delay is given by the convolution

$$\psi(t) = \psi_1(t) * \psi_2(t) = \int_0^t \psi_1(\tau) \, \psi_2(t - \tau) \, d\tau.$$

In accordance with standard queueing theory [30] we have

$$\psi_1(t) = \phi_1 e^{-\phi_1 t}, \qquad \psi_2(t) = \phi_2 e^{-\phi_2 t}, \qquad t \geq 0,$$

where $\phi_1 = \mu_1 - \lambda$, and $\phi_2 = \mu_2 - \lambda$. After some algebraic manipulations, the reliability metric $R$ for the tandem model considered is

$$R = \Pr\{t_1 + t_2 \leq \bar{k}\} = 1 - \frac{\phi_2 e^{-\phi_1 \bar{k}} - \phi_1 e^{-\phi_2 \bar{k}}}{\phi_2 - \phi_1}.$$

The uni-sensory perception model has been realized by considering the experimental data collected and merged from neuro-cognitive journal papers such as [13], [31], [32], in reference to each sense involved in the HVR services. Consequently, the uni-sensory perception histogram corresponding to each sense involved in this application is reported in Fig. 5, Fig. 6, and Fig. 7, considering 100 users. Differently, Fig. 8 represents the histogram of the integrated delay perception, considering 100 users.
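The reliability of the tandem model can be evaluated numerically with a short sketch. It assumes, as stated above, that the time spent in each subsystem is exponentially distributed with rates φ1 = µ1 − λ and φ2 = µ2 − λ, so that the e2e delay is the sum of two independent exponential variables. Returning 0 for an unstable queue and the separate branch for equal rates are choices of this sketch, and the function name is hypothetical.

```python
import math


def tandem_reliability(arrival_rate, mean_service_1, mean_service_2, deadline):
    """Probability that the e2e delay of two M/M/1 stages in tandem does not
    exceed the deadline (rates in packets/ms, times in ms)."""
    phi1 = 1.0 / mean_service_1 - arrival_rate
    phi2 = 1.0 / mean_service_2 - arrival_rate
    if phi1 <= 0.0 or phi2 <= 0.0:
        # Assumption of this sketch: an unstable stage never meets the deadline.
        return 0.0
    if deadline <= 0.0:
        return 0.0
    if math.isclose(phi1, phi2):
        # Equal rates: the sum of the two delays is Erlang-2 distributed.
        return 1.0 - math.exp(-phi1 * deadline) * (1.0 + phi1 * deadline)
    # Distinct rates: CDF of the sum of two independent exponentials.
    tail = (phi2 * math.exp(-phi1 * deadline)
            - phi1 * math.exp(-phi2 * deadline)) / (phi2 - phi1)
    return 1.0 - tail
```

Calling `tandem_reliability(0.1, 5.0, 0.75, 20.0)` evaluates the case of a single user with arrival rate 0.1 packets/ms and the brain-agnostic deadline of 20 ms; increasing the deadline argument shows how a less stringent, perception-based deadline raises the reliability.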
On the basis of the integrated perception model obtained in Fig. 8, the value of $\bar{k}$ may be set in accordance with the multi-sensory perception values estimated for the users' experience and reported in that figure. It is important to highlight that the target deadline $\bar{k}$ has been set equal to the statistical average of the perception values distribution illustrated in Fig. 8. In such a way, the target deadline is less stringent than the deadline imposed by the strict QoS requirements, which is typically set to 20 ms as in the standard literature [33]-[36]. In fact, the delay perception gives the opportunity to exploit the network resources more effectively and, simultaneously, to offer an immersive and fully satisfying XR experience to the users. Aiming at highlighting the benefits deriving from the tactful networking approach, considering the values presented in Fig. 8, Fig. 9 shows the system reliability as a function of the processor load $\rho$, considering $Q = 1$. The results are obtained by pursuing both the tactful networking (TN) scheme and the more conventional brain-agnostic (BA) scheme, which considers $\bar{k} = 20$ ms. Differently, the TN approach sets the target deadline $\bar{k}$ equal to the statistical average of the perception values distribution illustrated in Fig. 8. As is evident from Fig. 9, the TN strategy guarantees reliability values higher than those reached by applying the BA scheme. Such a trend is also confirmed by Fig. 10, in which the reliability is expressed as a function of the number of users $Q$. In this case, the arrival rate $\lambda_q$ has been set the same for all the $Q$ users and equal to 0.1 packets/ms. Also in this case, the results confirm the validity and the advantages of adopting a TN strategy, in order to offer more reliable HVR services and to handle a greater number of users. Fig. 11, considering a target reliability equal to 0.9, represents the number of users manageable by the system without lowering the reliability below the fixed target. As is evident from Fig. 11, also in this case the improvements reached by applying the TN strategy confirm the validity of the human-aware approach. In addition, Fig. 12 and Fig. 13 express the system reliability as a function of the mean service time of the first ($\bar{x}_1$) and the second ($\bar{x}_2$) subsystem, respectively. From Fig. 12 and Fig. 13 it is evident that the computation subsystem represents the bottleneck of the considered tandem network. In fact, the system reliability trend is steeper when $\bar{x}_1$ increases, in comparison to the reliability obtained by increasing the value of $\bar{x}_2$. Finally, Fig. 14 shows the system reliability behavior for different values of $\bar{k}$, considering the BA approach. As expected, the reliability grows as the deadline becomes less stringent.
FIGURE 9. Reliability as a function of ρ.
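The kind of computation behind the number of manageable users for a target reliability can be sketched as a simple search over Q. The sketch assumes identical users, an aggregate arrival rate equal to Q times the per-user rate, and the exponential per-stage delay model of this section; the per-user rate and the second deadline used in the usage note are hypothetical values chosen for illustration, not those of the reported figures.

```python
import math


def tandem_reliability(arrival_rate, mean_service_1, mean_service_2, deadline):
    """P(e2e delay <= deadline) for two M/M/1 stages in tandem."""
    phi1 = 1.0 / mean_service_1 - arrival_rate
    phi2 = 1.0 / mean_service_2 - arrival_rate
    if phi1 <= 0.0 or phi2 <= 0.0 or deadline <= 0.0:
        return 0.0  # assumption: unstable stages never meet the deadline
    if math.isclose(phi1, phi2):
        return 1.0 - math.exp(-phi1 * deadline) * (1.0 + phi1 * deadline)
    tail = (phi2 * math.exp(-phi1 * deadline)
            - phi1 * math.exp(-phi2 * deadline)) / (phi2 - phi1)
    return 1.0 - tail


def max_supported_users(per_user_rate, mean_service_1, mean_service_2,
                        deadline, target=0.9):
    """Largest Q such that the reliability with arrival rate Q * per_user_rate
    stays at or above the target."""
    if per_user_rate <= 0.0:
        raise ValueError("per_user_rate must be positive")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must lie in (0, 1]")
    users = 0
    # Reliability decreases with the load and drops to 0 once a stage becomes
    # unstable, so the loop always terminates.
    while tandem_reliability((users + 1) * per_user_rate, mean_service_1,
                             mean_service_2, deadline) >= target:
        users += 1
    return users
```

Comparing `max_supported_users(0.01, 5.0, 0.75, 20.0)` with `max_supported_users(0.01, 5.0, 0.75, 40.0)` shows how a relaxed deadline translates into additional supported users under these assumed parameters.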

V. CONCLUSION
This paper has proposed a learning based framework to model the human perception considering the five sensory systems, assuming the cognitive human brain activity as having a multimodal structure. The mixture density network has been applied to identify the conditional probability distribution of having, for each sense, a given perception level, assuming some individual traits. Furthermore, a multi-sensory perception model has been assumed, in accordance with the neurocognitive literature, to perform sensory integration. Then, a supervised learning module has been used to cluster the users on the basis of their integrated perceptive sensitivity. Finally, the framework has been contextualized to the HVR services, in order to exhibit the advantages resulting from the adoption of a brain-aware decision making scheme based on the proposed framework, instead of a more common brain-agnostic approach. The case study presented highlights the performance improvements in terms of the system reliability in guaranteeing a perceptual target deadline.
BENEDETTA PICANO (S'17) received the B.S. degree in Computer Science, as well as the M.Sc. degree in Computer Engineering, from the University of Florence, where she also received the Ph.D. degree in Information Engineering. She was a visiting researcher at the University of Houston. Her research fields include matching theory, nonlinear time series analysis, resource allocation in edge and fog computing infrastructures, and machine learning.