Long Gaps Missing IoT Sensors Time Series Data Imputation: A Bayesian Gaussian Approach

Missing sensor data is a common problem in Internet of Things (IoT) ecosystems, and it affects the accuracy of associated services such as timely medical intervention for older adults living at home. The problem has many causes, including power outages, communication failures, and sensor failures. Multiple missing data imputation methods have been developed to address this issue. However, irregular temporal gaps are challenging to handle because neither their probability of occurrence nor their temporal location is known in advance. In this paper, we propose a Bayesian Gaussian Process based imputation technique that accounts for temporal forcing to fill in missing sensor data. Our approach, Bayesian Gaussian Process (BGaP), can efficiently impute missing data at any missing rate and at any temporal location using prior knowledge gathered from past observations. We illustrate how our approach performs using real data collected from sensors deployed in the residences of 10 older adults over a two-year period. Using our approach, we were able to impute all missing data, which allowed us to observe long-term behavior changes that we would not have been able to observe otherwise.


I. INTRODUCTION
The Internet of Things (IoT) now makes it possible to deploy a large number of sensory nodes in diverse environments, for example, to monitor older adults' behavior. In retirement homes, sensory data helps in the decision-making process regarding the status of older adults.
In retirement homes, long-term behavior monitoring and change detection are important so that physical and cognitive decline can be captured early and, when properly managed, the well-being of older adults can be maintained for longer periods. In addition, long-term behavioral monitoring provides valuable information for eventual medical intervention. However, long-term behavior monitoring and change detection need to be carried out continuously to obtain the best results; a constraint that is fulfilled by IoT ecosystems.
Degradation of quality of life for older adults is a consequence of cognitive and physical decline, which, when detected early enough, can result in better intervention and adaptation of medical care [1]. However, monitoring daily changes in older adults' behavior is challenging in practice [2] because it requires medical staff (e.g., nurses) to be constantly available for every patient. Even if this is achievable within small retirement homes with a crew that cycles day and night, it is a difficult, if not impossible, task for a medium or large retirement home. One way to evaluate changes in the behavior of patients is to monitor the detailed behavior of every patient over time.
FIGURE 1. IoT model architectures: (a) five-layer architecture IoT model [6].
Typically, medical staff only record a
broad description of a patient's behavior. For example, a nurse would record that a patient goes to the bathroom at night but would not record the time, duration, or frequency of visits. Such information is valuable for assessing patients' health status. Frequent short bathroom visits could indicate urinary tract infection, whereas frequent long bathroom visits could indicate diarrhea [1]. In short, although broad descriptions to assess a patient's physical abilities are valuable, they are incomplete. In this respect, IoT technology can be used to monitor ambient environments unobtrusively and continuously using sensors and sensor nodes [3]. Hence, patient movements can be assessed using IoT sensors and translated into meaningful behavioral data.
IoT ecosystems consist of sensors and actuators that are used to harvest physical data from the environment [4], such as the ambient temperature and/or pressure. Although the three-layer model is considered the basic IoT model, in this study we assume that a five-layer architecture model is used to realize IoT ecosystems [5], [6], as presented in Figure 1. A three-layer model is composed of a perception layer (the sensors) that corresponds to the edge layer in the five-layer model, a communication (or transmission) layer that corresponds to both the fog and network layers in the five-layer model, and an application layer that corresponds to the cloud and business layers in the five-layer model. The role of the perception layer is to collect data from the environment, while that of the transmission layer is to securely transmit the raw data collected by the sensors. The application layer stores and retrieves collected pre- or post-processed data to and from databases, while also providing special services to be performed on the data, including missing data recovery, anomalous data detection, and decision-making.
In the perception layer, there are several causes that lead to missing sensor data, such as power and hardware failures. However, the loss of data is not limited to the perception layer; it can also occur because of data exchange problems with the communication layer [3]. Regardless of the layer in which the loss occurs, data can be missing for short as well as long periods. Currently, missing sensor data is one of the important challenges in IoT because incomplete data leads to insufficient information, which in turn results in inaccurate analysis and can ultimately lead to wrong interpretations and decisions with costly results both economically and socially [7]. Commonly, missing sensory data shifts the statistical parameters estimated from a model using the collected data, resulting in bias in the mean and/or an increase in variance within the collected data, which hinders the ability to efficiently and accurately monitor patients. Sensory data with missing rates above 50% are unreliable for the decision-making process [3]. Hence, handling missing sensor data, e.g., through imputation, is essential.
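To make the effect of missingness on estimated statistics concrete, the following Python sketch (synthetic data, not the study's dataset) contrasts readings lost uniformly at random with readings lost preferentially on high-activity days. The variable names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
activity = rng.normal(loc=100.0, scale=15.0, size=10_000)  # synthetic daily activity counts

# Random loss (MCAR-like): drop 40% of values uniformly at random.
# The sample mean stays unbiased, but the standard error grows with the smaller sample.
mcar_mask = rng.random(activity.size) < 0.40
mcar_sample = activity[~mcar_mask]

# Targeted loss (MNAR-like): drop the high-activity days.
# The sample mean is now biased low.
mnar_mask = activity > np.quantile(activity, 0.60)
mnar_sample = activity[~mnar_mask]

print(f"true mean   : {activity.mean():6.2f}")
print(f"random loss : {mcar_sample.mean():6.2f}  (n={mcar_sample.size})")
print(f"targeted loss: {mnar_sample.mean():6.2f}  (n={mnar_sample.size})")
```

The contrast shows why the missingness mechanism, discussed in Section II, matters as much as the missing rate.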
This study presents an approach for imputing long temporal gaps in sensor data that relies on dynamic linear modeling and focuses on the univariate case, where a single variable of interest is imputed. In this study, we assume that data are missing completely at random, i.e., that the mechanisms resulting in missing data are completely independent of the variables (observed or unobserved) that structure the data. In the context of retirement home IoT, this means that the behaviors of the patients being monitored, the sensors, and the sensor nodes are completely independent of the events causing the missing data, which is a general but fair assumption. Statistically, making such an assumption is practical because it supposes that there is no bias in the available data.
The paper is organized as follows: Section II presents essential background on imputation and the methods used in this subfield of statistics. Section III presents the technical aspects of the proposed imputation approach. Results and discussion are presented in Section IV before concluding (Section V).

II. MISSING DATA BACKGROUND
From the sensory data acquisition perspective, what is missing data? Data is missing when no value is obtained from a sensor during the observation process for a specific physical quantity, although one could have been obtained. Aside from malevolent tampering that could affect data acquisition from a sensor, which is a situation we do not account for in this study, several causes lead to missing data, including sensor power down, sensor malfunction, and transmission failures. As a result, the sample size is reduced, which can prevent some analyses from being performed because the statistical power to perform them is too low [8], [9], [10].
There are two stages at which missing data can occur: (1) at the bulk or unit stage and (2) at the data item stage. Missing data at the unit or bulk stage is the result of a malfunction, i.e., no data is collected from the sensor at all, which can produce chunks of missing data. In contrast, missing data at the item stage is sporadic. In this study, we focus on missing data at the data item stage. To better understand how to handle missing data, the problem should be studied according to the proportion of missing data, the mechanism by which missing data occur, and the pattern of missing data [9], [10], [11], [12].
Understanding the mechanisms by which missing data occur is important to properly impute them. In subsection A, we explain the different levels at which missing data occur, while subsection B briefly presents the different methods that have been proposed to impute these missing data.

A. MISSING DATA LEVELS
In this section, we present the different levels at which missing data can occur. We start with the proportion of missing data.

1) PROPORTION OF MISSING DATA
There is no predefined missing data proportion threshold that leads to valid (or invalid) statistical inference. Yet, the quality of statistical inference is directly related to the amount of data available and, in turn, to the proportion of missing data. Multiple studies have investigated how different proportions of missing data influence the quality of statistical inference. For example, Schafer [13] concluded that 5% or less of missing data does not have a major influence on the quality of a statistical inference. According to Bennett [14], statistical bias is more likely to occur when the missing data rate is above 10%. However, based on a simulation study, Madley-Dowd et al. [15] concluded that the proportion of missing data should not be used to guide the imputation strategy or to inform on its efficiency, e.g., how imputation approaches handle bias in the data. That being said, information on the proportion of missing data is important to guide the decision about which imputation to use. Table 1 presents a commonly used guideline for missing data imputation strategies, initially proposed by Hair [9], [10]. However, the mechanisms that lead to missing data and the patterns in the missing data are much more important to account for in missing data analysis than the proportion of missing data [16].

2) STATISTICAL MECHANISMS UNDERLYING MISSING DATA
Rubin [17] stated that there are three mechanisms by which missing data typically occur: 1) missing completely at random (MCAR), 2) missing at random (MAR), and 3) missing not at random (MNAR). Being able to associate the missing data structure with one of these mechanisms is highly valuable because it guides the user to the best technique to properly handle the particularities of the data [18], [19], i.e., if we are able to correlate the situation in which the missing data occur, or the structure of the time series including the missing data segments, with one of the three missing data mechanisms. To understand the different mechanisms underlying missing data, we first need to define a mathematical reference to rely on. As a starting point, let us use a vector of data Y as a reference point. This vector Y is composed of observed and missing values and can thus be partitioned in two parts: the observed values (Y observed) and the missing values (Y missing). The complete array of the IoT data, including the missing data, can be defined as
Y = (Y observed, Y missing). (1)
Using these two parts of Y, we can calculate the missingness indicator r for each value of Y, meaning that r has the same dimension as Y; r takes the value 1 where a value of Y is missing and 0 where it is observed. Depending on the statistical mechanism considered, the way the missingness relates to Y will change.
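As a minimal illustration of this partition, the following Python sketch (with hypothetical readings) builds the missingness indicator r from a series Y containing gaps:

```python
import numpy as np

# A short daily-activity series; NaN marks missing sensor readings.
Y = np.array([120.0, 98.0, np.nan, np.nan, 134.0, np.nan, 101.0])

r = np.isnan(Y).astype(int)          # missingness indicator, same length as Y
Y_observed = Y[r == 0]               # observed part of the partition
Y_missing_idx = np.flatnonzero(r)    # temporal locations still to be imputed

print(r)                 # [0 0 1 1 0 1 0]
print(Y_observed)        # [120.  98. 134. 101.]
print(Y_missing_idx)     # [2 3 5]
```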

a: MISSING COMPLETELY AT RANDOM (MCAR)
Data missing completely at random occurs when the missingness is completely unrelated to the mechanisms that structure the collected data. Although there could be variables structuring the missingness, such as punctual meteorological events like a storm causing electrical surges that prevent a sensor from working, they are unrelated to the observation (movement of a patient or behavior of a sensor). In other words, MCAR occurs when the missingness depends neither on the missing data nor on the observed data. Mathematically, MCAR means that the missingness is not conditional on any values of Y:
Pr(r | Y observed, Y missing) = Pr(r). (2)
Under this assumption, the observed data can be considered a random sample of the entire statistical population. Because of the reduced sample size, the standard error of the sample estimates is greater than it would be with complete data; however, MCAR has the advantage that the estimates remain unbiased [12].

b: MISSING AT RANDOM (MAR)
Data missing at random occurs when the probability of data being missing depends on the observed data but not on the missing data itself. In more technical terms, for MAR the missingness is conditional on the observed values only:
Pr(r | Y observed, Y missing) = Pr(r | Y observed). (3)
In MAR, it is assumed that there are variables of importance for Y that also define the missingness. As noted in [20] and [21], it is impossible to test whether the MAR assumption is valid for a dataset solely from the observed data. However, it is possible to inspect the tenability of the MAR assumption using a t-test of the difference between the means of the complete cases and those of the cases with missing values [16], [19].
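The tenability check described above can be sketched as follows. Everything here is simulated for illustration: the activity series, the auxiliary observed variable (day length), and a MAR-like mechanism in which readings go missing more often on short days. The t-test then compares the auxiliary variable between rows with and without missing activity; a significant difference argues against MCAR and is compatible with MAR.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: daily activity plus an auxiliary observed variable.
activity = rng.normal(100.0, 15.0, size=500)
day_length = rng.uniform(8.0, 16.0, size=500)  # hours of daylight

# Activity readings go missing more often on short days (MAR-like mechanism).
p_miss = np.clip((12.0 - day_length) / 8.0 + 0.3, 0.05, 0.95)
is_missing = rng.random(500) < p_miss

# Compare the auxiliary variable across the two groups.
t_stat, p_value = stats.ttest_ind(day_length[is_missing], day_length[~is_missing])
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```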

c: MISSING NOT AT RANDOM (MNAR)
Data missing not at random occurs when the missingness depends on the missing data itself (possibly in addition to the observed data), for example, when a sensor stops working once a specific voltage is reached or when a patient tampers with a sensor. Mathematically, MNAR is defined as
Pr(r | Y observed, Y missing) ≠ Pr(r | Y observed), (4)
that is, the conditioning on Y missing cannot be dropped.

3) PATTERN OF MISSING DATA
There are three patterns missing data can follow [12]: univariate, monotone [22], and arbitrary. A univariate pattern of missingness means that missing data can be attributed to a single variable. A monotone pattern of missingness occurs when missing values occur at a regular interval. In addition, a monotone pattern of missing data usually means that there is a dependence within the missing data itself. It is important to be aware that this regularity does not have to be temporal (e.g., at a specific time interval); it could also arise because a specific voltage threshold is reached, preventing the sensor from gathering data. Other than the univariate and monotone patterns, in this paper we assume that missing data patterns are arbitrary. From a computational perspective, univariate and monotone patterns of missingness are straightforward to handle compared to arbitrary missing data [12]; hence, statistical methods such as univariate regression [23] or mean substitution imputation by class [24] have been designed to approach each of these problems, respectively. A complete scheme representing the missing data problem is presented in Figure 2. In the following subsection, we briefly present common imputation methods that can be used for sensory data.

B. IMPUTATION METHODS
Multiple techniques have been proposed to deal with missing data. Generally, deletion, ignorance and imputation are the three major classes of methods used to handle missing data. The disadvantage of deletion and ignorance is that they create bias and reduce the amount of data available for analyses, which in turn also reduces the quality of the results. Conversely, the objective of imputation is to replace missing data with values that are reasonable for the problem at hand. It is important to be aware that the way data is imputed may change depending on the goal of the study. For missing sensor data, there are three classes of imputation methods [3], [25], depending on the type of information used to make the imputation. In the following lines, we give a brief explanation of each of these imputation classes.

1) SPATIAL IMPUTATION
Spatial imputation assumes a priori knowledge of the spatial correlation between sensors or sensor nodes, which can be used as a reference to make an imputation. Specifically, if two sensors are near one another, it is assumed that they capture a more similar signal than sensors that are further apart. Spatial imputation uses this information to handle missing data. In this respect, spatial correlation is calculated using the spatial coordinates of the data. Several studies have proposed different approaches to perform spatial imputation. For example, association rule mining techniques such as Window Association Rule Mining [26] and Freshness Association Rule Mining [25] have been specifically designed for imputing data on networks of sensors. Technically, these methods estimate missing data using association rules among neighboring sensors for which data have been gathered. Although association rule mining was initially designed for spatial imputation, it can also be used for temporal imputation.

2) TEMPORAL IMPUTATION
Temporal imputation requires a priori knowledge of the temporal correlation between the readings collected from a single sensor. Similarly to spatial imputation, when performing temporal imputation it is assumed that data gathered at short temporal intervals are more similar than data gathered across longer periods. In this respect, temporal correlation is calculated using the time at which each reading from a sensor was gathered. Linear interpolation [27], Last Observation Carried Forward (LOCF) [28], autoregressive models [29], and Support Vector Regression (SVR) [30] are commonly used to perform temporal imputation, although they can also be used to perform other types of imputation. However, these methods do not handle long temporal gaps efficiently and tend to increase bias.
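The long-gap weakness of these simple methods is easy to see in a small pandas sketch (toy values, not the study's data): LOCF freezes the last reading across the whole gap, while linear interpolation draws one straight segment, flattening any structure inside the gap.

```python
import numpy as np
import pandas as pd

# A toy daily series with a long gap; the level rises across the gap.
s = pd.Series([10.0, 12.0] + [np.nan] * 6 + [30.0, 31.0])

locf = s.ffill()                         # Last Observation Carried Forward
linear = s.interpolate(method="linear")  # straight line across the gap

# LOCF holds 12.0 across the whole gap, missing the upward trend entirely.
print(locf.tolist())
# Linear interpolation fills the gap with one straight segment, erasing any
# day-to-day variability inside it -- the bias the text refers to.
print(linear.tolist())
```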

3) SPATIO-TEMPORAL IMPUTATION
In this type of method, the imputation is performed based on a priori knowledge of the joint spatial and temporal correlation of sensors or sensor nodes. Among these methods are Spatial and Temporal Imputation [31], Data Estimation using Statistical Model (DESM) [32], k-nearest neighbor estimation (AKE) [33], and Bayesian Gaussian Process (BGP) [34], [35]. The latter method is the closest to the one used for imputing the missing data in this paper, where the current observation data are considered Gaussian distributed given the past observation data. In the following section, we present the proposed methodology used in this research.

III. METHODOLOGY
To efficiently impute long gaps in the data, we need a model that can account for long tendencies within the data. Dynamic Linear Models (DLMs) [36] are an appealing option because they are flexible and can account for short as well as long tendencies in the data. Mathematically, a DLM relies on two equations, the observation equation (5) and the state (or system) equation (6):
Y t = F θ t + v t (5)
θ t = G θ t−1 + w t (6)
where θ t defines the model structure (state) at time t, which usually depends on different explanatory variables, v t ∼ N m (0, V t ) and w t ∼ N p (0, W t ). In words, v t follows a multivariate normal distribution of size m with a covariance matrix V t that needs to be estimated. Similarly, w t follows a multivariate normal distribution of size p with a covariance matrix W t that also needs to be estimated. In this general definition of a DLM, G and F are known matrices quantifying the importance of the model structure θ t through time.
In the most basic case, a DLM can be simplified to a random walk model with the following observation and state equations:
Y t = θ t + v t (7)
θ t = θ t−1 + w t (8)
where, in this simpler case, v t ∼ N (0, σ 2 v ) and w t ∼ N (0, σ 2 w ), that is, both v t and w t follow a univariate normal distribution. Compared to the more general model presented in (5) and (6), all values associated with m, p, G and F are equal to 1. In more colloquial terms, with this model the time series is modeled as fluctuating around some level θ, which can change through time without any additional structuring constraints.
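Under the random walk model in (7) and (8), missing observations can be handled by a standard Kalman filter that skips the update step whenever Y t is unavailable. The following Python sketch assumes σ 2 v and σ 2 w are known; it illustrates the model and is not the authors' exact implementation.

```python
import numpy as np

def local_level_filter(y, sigma2_v, sigma2_w, m0=0.0, C0=1e6):
    """Kalman filter for the random-walk DLM Y_t = theta_t + v_t,
    theta_t = theta_{t-1} + w_t.  NaNs in y are treated as missing:
    the update step is skipped and the prediction is carried forward."""
    m, C = m0, C0
    states = np.empty(len(y))
    for t, obs in enumerate(y):
        a, R = m, C + sigma2_w          # predict: a_t = m_{t-1}, R_t = C_{t-1} + sigma2_w
        if np.isnan(obs):
            m, C = a, R                 # no observation -> keep the prediction
        else:
            Q = R + sigma2_v            # one-step forecast variance
            K = R / Q                   # Kalman gain
            m = a + K * (obs - a)       # update the level estimate
            C = (1 - K) * R
        states[t] = m
    return states

y = np.array([5.0, 5.2, np.nan, np.nan, 6.1, 6.0])
print(local_level_filter(y, sigma2_v=0.5, sigma2_w=0.1))
```

Inside a gap, the filtered level stays at the last prediction while its variance grows, which is what allows principled uncertainty statements over long gaps.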
If we assume that we only have the time series (the observations), we can construct a DLM to fit our data based on the prior information we have.
In the next lines, we briefly present the practical implementation of the DLM and the data used in our study.
For the practical implementation, ten older adults were monitored in their residences for periods ranging from one month to several years to identify their long-term behavior over the monitoring period. The monitoring system relies on motion and door sensors spread throughout the residence to follow the subjects' activity levels day and night [2]. The parameter used to measure the daily activity level of each subject is the average number of movements the subject makes per day, since the motion sensors are triggered by the subject's movements in front of them. Figure 3 presents the locations of the sensors in a typical monitored room, with pictures of the deployed sensors. Motion sensors are used to detect subjects' daily activities in the bedroom and bathroom, while door sensors detect outing and visiting activities. Motion sensors detect the subjects' movements within the sensors' line-of-sight, as described in Figure 4.
The overall activity level per day, as monitored by the motion sensors for each subject, is presented in Figure 5. The missing data are evident along the time series of the captured movements. The proportion of missing data for each subject is presented in Table 2. After inspecting the data presented in Figures 5-14, we concluded that the missing data was MCAR. That is, the missing data gap locations and sizes are randomly distributed along the monitored period for each subject, and as such there is no expected bias in the missing data. Note also that the missing pattern is univariate because only the daily activity of each subject is considered. Hence, state-space models are best suited for this type of missing data. State-space models consider a time series as the output of a dynamic system perturbed by random noise, which is assumed to follow a Gaussian distribution around the states of the time series. Model estimation and forecasting were carried out by recursively computing the conditional distribution of the daily activity given the available information. In this sense, a natural way to treat this problem is through the Bayesian framework, where missing data can be estimated based on previously collected data.

A. STEPS FOR ESTIMATING DLM PARAMETERS (PROPOSED MATHEMATICAL IMPLEMENTATION)
i. Estimate the rolling mean value for the time series to obtain a mean (expected) value for each observation in addition to the observation itself.
ii. Calculate the difference between each observation and its expected value to construct a vector of errors, which represents the error distribution around the mean (expected) value along the time series.
iii. Fit the error distribution to a Gaussian distribution and estimate its variance parameter. In more technical terms, in this step we estimate σ 2 v , which is the basis of the error in the observation equation (Equation 7). Note that the mean of the Gaussian distribution is assumed to be 0 because it is accounted for directly by the structure of the model.
iv. Run an iterative loop to calculate the states of the model (θ t ), since, for each timestamp t, we have information on the observation (Y t ) and the error value (v t ).
Note that the states of the model can be defined based on the user's preference. In our implementation, we used a polynomial of degree 6 with overall activity as the explanatory variable.
v. After estimating the states of the time series, repeat the previous steps, this time to estimate σ 2 w and θ t−1 (Equation 8).
vi. In a nutshell, the key idea is to estimate σ 2 v and σ 2 w , which are the basis of the observation error and the state error, and which can then be used to forecast missing data based on the prior knowledge of the observations and states together.
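Steps i-iii above can be sketched as follows. The window length and the synthetic series are illustrative assumptions, not values from the study; the zero-mean Gaussian fit reduces to estimating the mean squared residual.

```python
import numpy as np
import pandas as pd

def estimate_observation_variance(y, window=7):
    """Steps i-iii: rolling mean as the expected value, residuals as the
    observation errors, and a zero-mean Gaussian fit for sigma2_v."""
    y = pd.Series(y)
    expected = y.rolling(window, min_periods=1, center=True).mean()  # step i
    errors = (y - expected).dropna()                                 # step ii
    sigma2_v = float((errors ** 2).mean())  # step iii: variance of a zero-mean Gaussian
    return expected, sigma2_v

# Synthetic activity-level series for illustration only.
rng = np.random.default_rng(2)
y = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=200))
expected, sigma2_v = estimate_observation_variance(y)
print(f"estimated sigma2_v = {sigma2_v:.3f}")
```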

B. CONFIDENCE INTERVAL CALCULATIONS
For the imputation results to be less extreme, it is also possible to constrain v t . One way to do this is by calculating a confidence interval on the resulting values in v t . For example, a confidence interval (CI) at a 95% confidence level can be computed as
CI = ±1.960 · σ v / √N (9)
where ±1.960 are the lower and upper quantiles of the Gaussian distribution such that the area under the distribution sums to 95% of the entire distribution, σ v is the observation standard deviation, and N is the number of missing values estimated.
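Assuming the CI takes the usual form ±1.960 · σ v / √N described above, a minimal helper (with hypothetical numbers) is:

```python
import math

def ci_95(sigma_v, n_missing):
    """95% confidence interval bounds around the expected level,
    following CI = +/- 1.960 * sigma_v / sqrt(N)."""
    half_width = 1.960 * sigma_v / math.sqrt(n_missing)
    return -half_width, +half_width

# Hypothetical values: sigma_v = 12.5 movements/day, 30 missing days.
lo, hi = ci_95(sigma_v=12.5, n_missing=30)
print(f"[{lo:.2f}, {hi:.2f}]")
```

Sampled observation errors v t falling outside [lo, hi] can then be clipped to these bounds, which is what keeps the imputed values from being extreme.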
As an example, if we reconstruct v t for subject A while relying on the 95% confidence interval instead of the entire data, more extreme values can be seen to have a strong impact on the structure of the missing values to be estimated (Figure 15). The same result holds for the other subjects considered here.
With the Gaussian parameters estimated for each subject's time series, we propose a procedure to estimate multiple segments of missing data across the subjects' time series (Figure 16). The computational algorithm for estimating the missing sensory data based on the DLM is presented in Algorithm 1. This algorithm is repeated for each subject independently.
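Since Algorithm 1 itself is not reproduced here, the following Python sketch shows one plausible per-subject imputation loop consistent with equations (7) and (8): observed values anchor the state, and values inside a gap are forecast from the state with Gaussian noise. The function name and the toy series are illustrative assumptions.

```python
import numpy as np

def impute_series(y, sigma2_v, sigma2_w, seed=0):
    """Hedged sketch of a per-subject imputation loop: walk through the
    series; where data is observed, anchor the level on it; where it is
    missing, propagate the state with noise w_t and emit theta_t + v_t."""
    rng = np.random.default_rng(seed)
    out = np.array(y, dtype=float)
    theta = next(v for v in out if not np.isnan(v))  # initial level
    for t in range(len(out)):
        if np.isnan(out[t]):
            theta = theta + rng.normal(0.0, np.sqrt(sigma2_w))   # state equation (8)
            out[t] = theta + rng.normal(0.0, np.sqrt(sigma2_v))  # observation equation (7)
        else:
            theta = out[t]  # anchor the state on the observation
    return out

y = [50.0, 52.0, np.nan, np.nan, np.nan, 55.0]
print(impute_series(y, sigma2_v=1.0, sigma2_w=0.25))
```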

IV. RESULTS AND DISCUSSION
The Gaussian distribution parameters for each subject, with and without considering the 95% confidence interval, are presented in Table 3, along with the corresponding missing data rates and available numbers of observations. The highest missing data rate is associated with subject G, while subject B has the lowest missing data rate. The estimated missing data combined with the observed data are presented for each subject in Figures 17-26.
FIGURE 19. Estimated data segments overlapped with existing data segments for subject C, where the estimated segments are highlighted in green.
Subjects were expected to be more frequently indoors during winter and less so during the summer of the same monitoring year. This was confirmed, for example, for subject C (Figure 19). For subject A, there was a significant decrease in activity level between December and January compared to the October-November period, which was due to severe mobility impairments. Similarly, subject C (Figure 19) also experienced severe mobility impairments at the end of November 2015, resulting in very low activity levels, which we were able to detect with our imputation procedure. Note that the decrease in activity level common to the summer period was also clearly observed for subject C in May-August 2016.
There are spikes in the activity levels for all monitored subjects, which were attributed to nurses visiting the residences, because with these visits the overall activity level inside the residence increased with the arrival of another person. These nurse visits are observed for subject A in mid-October 2014 and mid-December 2014 (Figure 17). For subject B, they occur in mid-May 2015 and in mid-July and late July 2015 (Figure 18). For subject C, they happen in late September 2015 (Figure 19). For subject D, they take place from late April to early May 2015 (Figure 20). Lastly, for subject J, no nurse visits can be observed, as the subject's behavior is highly fluctuating (Figure 26).

V. CONCLUSION
In this paper, we proposed and implemented an approach to estimate missing sensor data independent of its temporal location. This approach is based on a Bayesian Gaussian Process and on knowledge gained from past observations. We applied it to impute missing sensor data obtained from an IoT overall-activity monitoring system for older adults living in residences. The imputation process supports the assumption of seasonal behavioral changes, especially between winter and summer, with higher indoor activity during winter than during summer. The proposed imputation approach also enables the identification of long-term behavior changes, such as the mobility impairment suffered by some subjects. We were also able to identify nurse visits in the residences through the very high increase in activity level during the visit day compared with the past activity levels of the same subject.
In this study, we collected data from a real-life deployment; hence, the missing parts of the data do not exist anywhere. As a result, the collected dataset has no complete, no-missing-data counterpart, and a comparison between the imputed dataset and a ground-truth complete dataset is therefore not feasible. However, we limited the extreme values that might be produced by imputation by calculating the confidence interval for the distribution estimated from the past collected data and then estimating the missing data based on it. This limits the error and increases the accuracy of the missing data estimation. Moreover, the mathematical model used in our approach treats the acquired time series as the outcome of a random process, which limits the applicability of standard evaluation metrics.
BESSAM ABDULRAZAK (Member, IEEE) received the B.Sc. degree in electronics from USTHB, Algeria, the M.Sc. degree in robotics from Paris 6, France, and the Ph.D. degree in computer science from Telecom SudParis, France. He is currently a Professor of computer science with the Université de Sherbrooke and the Director of the AMI Laboratory, Sherbrooke, QC, Canada. He is an active Researcher with the Research Center on Aging and the Interdisciplinary Institute for Technological Innovation. His research interests include the IoT, ubiquitous and pervasive computing, ambient intelligence, smart environments, assistive living technologies, context awareness, and software engineering. He has over 200 peer-reviewed publications, served as the general chair for a number of conferences and workshops, and serves on the editorial board of numerous international journals, as well as program committee of several conferences related to his research interests.
F. GUILLAUME BLANCHET received the B.Sc. and M.Sc. degrees in biological sciences from the Université de Montréal, Canada, and the Ph.D. degree in conservation biology from the University of Alberta, Canada. He is currently a Professor with the Department of Biology, Mathematics and Community Health, Université de Sherbrooke, QC, Canada, and the Director of the Quantitative Biology Laboratory, Sherbrooke. He is an active Researcher with the Research Center on Aging and the Quebec Center for Biodiversity Science. His research interests include the development and the application of statistical methods and mathematical models to approach a variety of problems in biology and health sciences. He has over 40 peer-reviewed publications and serves on the editorial board of International Journal of Ecology (Population Ecology and Ecography).
HAMDI ALOULOU is currently an Associate Professor with the University of Monastir. He is a Senior Scientist with the Digital Research Centre of Sfax, Tunisia, and an Associate Researcher with the Institut Mines-Telecom, Paris, France. As part of his research work, he was actively involved in different European projects. His research interests include knowledge management and processing for decision-making applied in the domain of ambient intelligence/the Internet of Things (AmI/IoT). The target of his work is to set up intelligent living spaces in order to improve the quality of life and promote public, individual, and collective health.