Joint Probabilistic People Detection in Overlapping Depth Images

Privacy-preserving high-quality people detection is a vital computer vision task for various indoor scenarios, e.g. people counting, customer behavior analysis, ambient assisted living or smart homes. In this work a novel approach for people detection in multiple overlapping depth images is proposed. We present a probabilistic framework utilizing a generative scene model to jointly exploit the multi-view image evidence, allowing us to detect people from arbitrary viewpoints. Our approach makes use of mean-field variational inference to not only estimate the maximum a posteriori (MAP) state but also to approximate the posterior probability distribution of people present in the scene. Evaluation shows state-of-the-art results on a novel data set for indoor people detection and tracking in depth images from the top-view with high perspective distortions. Furthermore, we demonstrate that our approach (compared to the mono-view setup) successfully exploits the multi-view image evidence and robustly converges in only a few iterations.


I. INTRODUCTION
(The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Cusano.)

By virtue of the emergence of low-cost commodity depth sensors, there is an increasing demand for privacy-preserving high-quality people detection in various indoor scenarios, e.g. people counting, customer behavior analysis, public security, ambient assisted living or smart homes. In contrast to classical pedestrian detection approaches, the depth sensors capture the scene from the top-view to minimize occlusions in crowded scenes. However, due to the top-view and the limited mounting height in many indoor scenarios, the resulting field of view of a single depth sensor is quite limited, thus the observable area is rather small. This is an issue in many real-world applications such as customer behavior analysis in a shopping mall or airport. To provide complete detections in a wide-area scenario we therefore employ a multi-view approach. Apart from the increased observable area, there are additional advantages compared to the classical single-view approach. Since a single image does not capture all the details in a 3D scene, considering additional partially
overlapping views provides more information about the true scene state. This is especially relevant in situations where people are only partially visible in one camera view due to occlusion or the limited field of view (see Fig. 2). Hence the detection performance (including the reliability of the detection confidence) in the overlapping regions can be improved by the complementary image evidence from multiple views. In particular this is relevant for demanding applications such as emergency detection in an ambient assisted living context.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

FIGURE 1. Overview of our approach. We use foreground-segmented depth observations from three sensors as input (left) to approximate the marginal probability distribution of people present in the scene (right).

The general problem of people detection in a multi-camera setup has been widely studied in computer vision literature. However, existing multi-camera people detection approaches mostly focus on outdoor pedestrian detection, capturing pedestrians from profile or frontal view and using monocular video cameras. In contrast, we focus on the task of people detection in multiple overlapping depth images. Due to the vertical top-view, position changes of pedestrians lead to drastically varying appearances, making it very challenging for off-the-shelf data-driven pedestrian detectors without a domain-specific large-scale data set. Besides, only a few methods in the literature take advantage of the full
multi-view image evidence from overlapping fields of view, in order to increase the detection performance. To overcome those shortcomings, we propose a novel approach which exploits the multi-view image evidence by (i) employing a generative scene model leading to a viewpoint independent detector without the need of a training data set; (ii) using a probabilistic framework which includes the full multi-view image evidence from all sensors to resolve occlusion as well as measurement noise; and (iii) instead of just estimating the maximum a posteriori (MAP) state, utilizing mean-field variational inference to approximate the posterior distribution of people present in the scene (see Fig. 1). In the evaluation we report state-of-the art results on a novel data set for indoor people detection in multiple top-view depth images.

II. RELATED WORK
Multi-camera people detection has been extensively studied in the context of video surveillance. The vast majority of existing approaches are based on multiple monocular video cameras observing an outdoor scene. However, the topic of indoor people detection in multiple depth images, especially in top-view, has not yet been explored in detail. Hence, we will first discuss approaches focusing on pedestrian detection with multiple monocular cameras, and their relation to our approach. In order to restrict our scope, we do not consider methods working across non-overlapping views [1]-[3] but rather focus on methods utilizing overlapping views. For an exhaustive survey of multi-camera people detection and tracking, we refer to [4]-[6]. For the rest of this section, we categorize the relevant literature into approaches utilizing multiple monocular video camera views (RGB-based approaches) and depth images (depth-based approaches).

A. RGB-BASED APPROACHES
Since people detection and tracking in single-camera views have been intensively studied [7]-[9], many methods accomplish multi-view detection by fusing local detections or local tracklets into a common world coordinate system [10]-[12]. However, since the detection is performed independently for each view, those methods do not take full advantage of the multi-view information, thus making it harder to resolve occlusion and measurement noise. Besides, the vast majority of employed pedestrian detectors are optimized to detect people in frontal or profile view but not in the top-view [13], [14]. Homography-based approaches project local image features from each sensor into a common plane to perform global detection [15]. In [16] a homographic occupancy constraint is proposed to handle occlusion and detect people on a common scene plane. Eshel and Moses [17] propose a similar approach, projecting the foreground pixels of all views into a common height plane for head detection. In [18] those approaches are extended by a multi-view Bayesian network in order to avoid false positive detections arising from occlusion artefacts.
Another class of related approaches addresses the problem of multi-camera detection by employing a generative model to jointly take advantage of the image evidence of all available views. Fleuret et al. [19] introduce the probabilistic occupancy map (POM). They use foreground-segmented binary images as input and employ a simple person model, expressed as a rectangular bounding box, to estimate probabilities of occupancy by mean-field variational inference. The method used in our approach is heavily inspired by [19]. Alahi et al. [20] re-cast the problem as a linear inverse problem. In contrast to [19], a silhouette is used as the person model. Unlike our approach, both methods utilize only 2D models and fit them to a binary foreground mask.
Baque et al. [21] introduce a state-of-the-art end-to-end multi-view people detection architecture. They combine a classical Convolutional Neural Network (CNN) with Conditional Random Fields (CRFs) to resolve ambiguities arising from occlusion. Chavdarova and Fleuret [22] present a CNN architecture to allow for end-to-end multi-view people detection. To overcome the lack of an appropriate multi-view data set, a larger existing monocular pedestrian data set [23] is used. However, due to the lack of extensive labeled data for top-view people detection in depth images, both approaches are insufficient for our use case.

B. DEPTH-BASED APPROACHES
Since people detection in multiple depth images has rarely been studied, we first discuss relevant single-view approaches. The related problem of people counting with a single depth camera from the top-view has been studied in great detail [24]-[27]. In contrast to our proposed method, those approaches focus on integrated systems counting the number of persons crossing a certain virtual line, providing people detection only implicitly and in a rather small area. Recent CNN architectures [28]-[30] have been successfully applied to single-view depth image people detection, leveraging many labeled images for training. Since in our top-view setup position changes of people lead to drastically varying appearances (compared to the classical frontal or profile view), those approaches need to be re-trained with a domain-specific large-scale data set. Both mentioned classes of methods only provide single-view detection.
In contrast to the methods mentioned above, only few existing approaches rely on multiple depth images for people detection. Tseng et al. [31] present an indoor people detection system based on multiple active sensors in top-view. Their approach is based on a fused virtual top-view depth image, obtained from the point cloud of each sensor. For the detection they employ a hemiellipsoidal head model to take advantage of the discriminative height difference around the head contour of a human. In contrast to our approach, the presented method relies on high-quality depth data. In previous work [32] we re-cast the problem of people detection and tracking with multiple depth sensors as an inverse problem, employing an approximately differentiable scene model to detect people from arbitrary viewpoints. However, as a consequence of the used optimization method, the number of people in the scene is required a priori, and a sufficiently good initialization is essential. Carraro et al. [33] propose an approach for human body pose estimation and tracking in a network of RGB-D sensors. To obtain a global 3D skeleton, CNN-based pose estimation [34] is applied to the RGB images of each single view. However, due to the single-view detection approach they do not take advantage of the full multi-view information. In contrast to our work, the former approaches [31]-[33] provide an MAP point estimate but not a probability distribution over people present in the scene.

C. SUMMARY
To summarize, our work is highly inspired by [19]. In contrast, we use depth images as evidence and therefore are able to make use of a more specific generative scene model. We also propose a different strategy to approximate the final mean-field update expectation by making use of geometric scene knowledge and a pre-trained vocabulary. Our generative scene model is similar to our previous work [32]. However, the approach introduced in this work does not hinge on scene-specific a priori knowledge and provides an approximation to the full posterior distribution. In contrast to recent data-driven CNN architectures [22], [28]- [30], [33] our method requires no training data and the detection confidence can be quantified more precisely by approximating the posterior distribution. To the best of our knowledge, variational mean-field inference in combination with a generative scene model has not yet been applied to the problem of people detection in overlapping depth images.

III. APPROACH
The problem we address in this work is the detection of people given multiple overlapping depth images from, but not limited to, the top-view on the scene. The major challenges are (i) the different appearances of people due to the change of viewpoint (see Fig. 2); (ii) occlusions in more crowded scenes and (iii) the measurement noise due to commodity low-resolution depth sensors. To overcome challenge (i), we make use of a generative scene model (see Fig. 3), formulating the people detection problem as an analysis-by-synthesis problem. Challenges (ii) and (iii) are addressed by the proposed probabilistic model (see Sect. III-A), which jointly handles the multi-view information. Furthermore, the mean-field variational inference approach deals with occlusion implicitly and turns our statistical inference problem into a tractable optimization problem, in order to get an approximation of the proposed posterior distribution (see Sect. III-B).
Due to the available depth data, marker-free extrinsic calibration can be achieved in three simple steps: (1) for each sensor S_1, ..., S_C the ground floor plane is estimated by a simple plane fit; (2) one arbitrary sensor coordinate system is defined as the common world coordinate system; (3) for each sensor S_c the rigid body transformation to the common world coordinate system is obtained by corresponding natural image features in the overlapping fields of view. For the rest of this paper we define P_c as the projection matrix for each sensor S_c, which maps a point from the common world coordinate system to the corresponding image coordinates of each sensor.
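As a concrete illustration, the mapping performed by P_c can be sketched in a few lines of Python. The matrix values below are toy assumptions for illustration, not the calibration of our sensors:

```python
import numpy as np

# Minimal sketch: map a 3D point from the common world coordinate system
# into the pixel coordinates of one sensor using a 3x4 projection matrix
# P_c = K [R | t]. The intrinsics K below are hypothetical.
def project_to_sensor(P_c, x_world):
    """Project a 3D world point to 2D image coordinates of one sensor."""
    x_h = np.append(x_world, 1.0)   # homogeneous coordinates
    u = P_c @ x_h                   # 3-vector in projective image space
    return u[:2] / u[2]             # perspective divide

# Toy setup: identity rotation, zero translation, focal length 300 px,
# principal point (188, 120) for a 376x240 image.
K = np.array([[300.0,   0.0, 188.0],
              [  0.0, 300.0, 120.0],
              [  0.0,   0.0,   1.0]])
P_c = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

# A point 0.5 m to the side and 3 m along the optical axis.
uv = project_to_sensor(P_c, np.array([0.5, 0.0, 3.0]))
```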

A. PROBABILISTIC MODEL
Since we assume that the common ground floor plane is known from the initial calibration, we describe the presence of people in the scene in ground floor world coordinates. We discretize the ground floor area into a 2D grid of n locations. Each location u_i is assigned a realization x_i of a binary random variable, where x_i = 1 indicates that a person is present at location u_i; we denote the scene configuration by \vec{x} = (x_1, ..., x_n) \in \{0,1\}^n and the depth observations of the C sensors by \vec{o} = (o_1, ..., o_C). The likelihood construction is similar to our previous work [32], although we use a discrete grid instead of continuous person locations. To make the likelihood tractable, we assume that the views are conditionally independent for a fixed scene configuration \vec{x}. Since we assume that only people are part of the foreground, and that the depth images are robust against illumination changes, this assumption can be justified. Thus, the likelihood factorizes as

p(\vec{o} \mid \vec{x}) = \prod_{c=1}^{C} p(o_c \mid \vec{x}). (1)

We define the likelihood for one observation by employing a generative forward model G_c(\vec{x}, P_c), which maps a scene configuration \vec{x} and a given projection matrix P_c to a synthetic observation (i.e. a synthetic depth image) from the perspective of sensor S_c. Therefore, we use a simple, rotationally symmetric 3D person model, consisting of a cylinder for the body and a sphere for the head (see Fig. 3). For the sake of simplicity we assume that our given observations suffer from Gaussian noise, yielding the observation likelihood

p(o_c \mid \vec{x}) \propto \exp\Big( -\frac{1}{2\sigma^2} \| o_c - G_c(\vec{x}, P_c) \|_2^2 \Big). (2)

Since our generative forward model is not only a function of \vec{x} but also of the projection matrix P_c, we incorporate the physical sensor model in a natural way into our framework, allowing us to detect people from arbitrary viewpoints and to easily integrate a new sensor modality into the network. Applying Bayes' theorem and assuming that the prior factorizes as

p(\vec{x}) = \prod_{i=1}^{n} p(x_i), (3)

we obtain the posterior distribution

p(\vec{x} \mid \vec{o}) = \frac{1}{Z} \prod_{c=1}^{C} p(o_c \mid \vec{x}) \prod_{i=1}^{n} p(x_i), (4)

where Z denotes the partition function.
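Under the conditional-independence and Gaussian-noise assumptions above, the joint observation log-likelihood is simply a sum of per-view squared-error terms. A minimal sketch, with the generative renderer G_c replaced by precomputed synthetic images (a stand-in assumption, not our actual renderer):

```python
import numpy as np

# Sketch of the factorized observation log-likelihood: the C views are
# assumed conditionally independent given the scene configuration, so the
# log-likelihood sums one Gaussian term per view (up to a constant).
def log_likelihood(observations, rendered, sigma=0.02):
    """Sum of per-view Gaussian log-likelihood terms (constant omitted)."""
    total = 0.0
    for o_c, g_c in zip(observations, rendered):
        total += -np.sum((o_c - g_c) ** 2) / (2.0 * sigma ** 2)
    return total

# Toy example with two 2x2 "depth images".
obs = [np.zeros((2, 2)), np.full((2, 2), 1.5)]
ll_exact = log_likelihood(obs, [o.copy() for o in obs])          # perfect fit
ll_off = log_likelihood(obs, [obs[0] + 0.02, obs[1]], sigma=0.02)  # one view off by sigma
```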

B. MEAN-FIELD VARIATIONAL INFERENCE
Because of the dimensionality of the latent scene configuration space \{0,1\}^n, the partition function in (4) is intractable, and we cannot directly compute the posterior distribution. Instead, we apply Kullback-Leibler variational inference [35], [36] to approximate the complex distribution p(\vec{x} \mid \vec{o}) by a simpler proxy distribution q(\vec{x}). Let \langle \cdot \rangle_{p(\vec{x})} denote the expectation with respect to a distribution p(\vec{x}); then the optimization objective can be expressed as

\hat{q} = \arg\min_{q} \mathrm{KL}\big( q(\vec{x}) \,\|\, p(\vec{x} \mid \vec{o}) \big). (5)

To make the problem computationally tractable, we assume a fully factorized distribution q(\vec{x}) = \prod_{i=1}^{n} q_i(x_i), known as the naive mean-field assumption. Let q(\vec{x} \setminus x_i) denote the mean-field distribution excluding the element x_i, namely q(\vec{x} \setminus x_i) = \prod_{j=1, j \neq i}^{n} q_j(x_j). The general mean-field equation,

q_i(x_i) \propto \exp\big( \langle \log p(\vec{x}, \vec{o}) \rangle_{q(\vec{x} \setminus x_i)} \big), (6)

updates q_i(x_i) depending on the previous mean-field state q(\vec{x} \setminus x_i). It can be proven that updating q_i(x_i) asynchronously according to (6) decreases the KL divergence in (5) (see [37, pp. 625 ff.]). Since each x_i is Bernoulli distributed, (6) (for x_i being in state 1) can be written as

q_i(x_i = 1) = \frac{1}{Z_i} \exp\big( \langle \log p(\vec{x}, \vec{o}) \rangle_{q(\vec{x} \setminus x_i)} \big|_{x_i = 1} \big) (7)

with the partition function

Z_i = \sum_{s \in \{0,1\}} \exp\big( \langle \log p(\vec{x}, \vec{o}) \rangle_{q(\vec{x} \setminus x_i)} \big|_{x_i = s} \big). (8)

Additionally, let \delta(I_1, I_2) = \frac{1}{2\sigma^2} \| I_1 - I_2 \|_2^2 be an image distance function, and \tau_i = \log \frac{1 - p(x_i = 1)}{p(x_i = 1)} a function of the prior. Inserting the joint probability distribution defined in (1)-(4) into (7), and using the relation \frac{e^x}{e^x + e^y} = \frac{1}{1 + e^{y - x}}, the final asynchronous update of the probability q_i(x_i = 1) is a sigmoid function given as

q_i(x_i = 1) = \frac{1}{1 + \exp\big( \tau_i + \sum_{c=1}^{C} E_{c,i} \big)} (9)

with the expectation

E_{c,i} = \big\langle \delta\big(o_c, G_c(\vec{x} \mid x_i = 1)\big) - \delta\big(o_c, G_c(\vec{x} \mid x_i = 0)\big) \big\rangle_{q(\vec{x} \setminus x_i)}. (10)

Notice that G_c(\vec{x} \mid x_i = 1) maps a scene configuration \vec{x} to a synthetic depth image in the perspective of sensor S_c with x_i forced to 1 (see Fig. 4). Following the argument given in [19], one can see how occlusion is handled in an implicit way: if the forward-model projection of a person located at u_i is occluded by the projection of a person with a high probability of occupancy, the value of x_i does not affect the image distance \delta(o_c, G_c(\vec{x} \mid x_i = s)). Thus, the expectation E_{c,i} in (10) converges to zero.
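The closed-form sigmoid update described above reduces to a few lines once the per-sensor expectations are available. The following sketch assumes those expectations are computed elsewhere and passed in:

```python
import math

# Sketch of the closed-form mean-field update: the new marginal is a
# sigmoid of the prior log-odds tau_i plus the summed per-sensor
# expectations (the differences of image distances).
def update_q_i(prior_p1, expectations_E_ci):
    """Update q_i(x_i = 1) given the prior and per-sensor expectations."""
    tau_i = math.log((1.0 - prior_p1) / prior_p1)  # prior log-odds
    return 1.0 / (1.0 + math.exp(tau_i + sum(expectations_E_ci)))

# With a neutral prior of 0.5 and no image evidence, the update
# leaves the marginal at the prior.
q_neutral = update_q_i(0.5, [0.0, 0.0, 0.0])
```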

C. APPROXIMATE MEAN-FIELD UPDATE
Still, (10) is intractable due to the expectation \langle \cdot \rangle_{q(\vec{x} \setminus x_i)}, which implies an iteration over all possible scene configurations. We approximate the expected value by considering only a relevant subset of scene configurations. Therefore, we exploit the fact that the difference of image distances in (10) only depends on the pixels belonging to the silhouette of the projection of the 3D model at location u_i (see Fig. 4). For a simpler and faster implementation, we do not work on the exact silhouettes but on the corresponding rectangular bounding boxes, given as I_c[u_i]. Thus, only those scene configurations for which the pixel values inside the bounding box I_c[u_i] of the generated image G_c(\vec{x}) are affected need to be evaluated for the expectation E_{c,i} in (10). We assume that only the projections of the eight direct neighbors of a grid location u_i intersect with the bounding box I_c[u_i]. For our top-view setup this is a valid assumption; however, for a frontal-view setup, a more sophisticated approximation would be preferable. Consequently, we can approximate the expectation E_{c,i} in (10) by the reduced neighborhood scene configuration \tilde{x}_i \in \{0,1\}^8. Since the local neighborhood (including x_i) allows only 2^9 = 512 possible scene configurations, we can efficiently approximate the expectation.
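Enumerating the neighbor configurations and weighting each by its fully factorized mean-field probability can be sketched as follows. Here `distance` is a hypothetical callback that scores one local configuration against the observation for a given state of x_i; the real system evaluates pre-rendered image sections instead:

```python
import itertools

# Sketch of the neighborhood approximation: the expectation is taken only
# over the 2^k joint configurations of the k grid neighbors of u_i
# (k = 8 in the paper). 'neighbor_probs' are the current marginals
# q_j(x_j = 1) of those neighbors.
def approx_expectation(neighbor_probs, distance):
    expectation = 0.0
    for config in itertools.product((0, 1), repeat=len(neighbor_probs)):
        # probability of this joint configuration under the fully
        # factorized mean-field distribution
        weight = 1.0
        for x_j, p_j in zip(config, neighbor_probs):
            weight *= p_j if x_j == 1 else 1.0 - p_j
        # difference of distances with x_i forced to 1 versus 0
        expectation += weight * (distance(config, 1) - distance(config, 0))
    return expectation
```

With two neighbors and a score that only depends on the forced state of x_i, the weights sum to one and the expectation equals that constant difference.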
Instead of the image distance \delta(\cdot, \cdot) derived from our probabilistic model, we introduce a weighted asymmetric image similarity \delta_{asym}(o, g) between a foreground-segmented observation o and a generated image g. Since there is no need to compute the derivative of the distance function, we replace the squared L2-norm by the more robust L1-norm. Let M : \mathbb{R}^{W \times H} \to \{0,1\}^{W \times H} be a threshold function which maps an image to its binary foreground mask, \bar{M}(I) = 1 - M(I) its inverse, and \odot the Hadamard product between two images. The asymmetric image similarity is given as

\delta_{asym}(o, g) = \alpha \, \| M(o) \odot (o - g) \|_1 + (2 - \alpha) \, \| \bar{M}(o) \odot (o - g) \|_1 (12)

with the design parameter \alpha \in [0, 2]. For \alpha = 1 the image similarity \delta_{asym}(o, g) is identical to the L1-norm \| o - g \|_1. For \alpha > 1, observed depth pixels which are not explained by the generative scene model are penalized more strongly. Let further

\delta_{asym}^{c,i}(o, g) = \delta_{asym}\big( o[I_c[u_i]], \, g[I_c[u_i]] \big) (13)

be the image similarity restricted to the cropped image region I_c[u_i]. Then the approximated expectation can be written as

\tilde{E}_{c,i} = \sum_{\tilde{x}_i \in \{0,1\}^8} q(\tilde{x}_i) \Big[ \delta_{asym}^{c,i}\big(o_c, G_c(\tilde{x}_i \mid x_i = 1)\big) - \delta_{asym}^{c,i}\big(o_c, G_c(\tilde{x}_i \mid x_i = 0)\big) \Big], (14)

where q(\tilde{x}_i) denotes the mean-field probability of the neighborhood configuration \tilde{x}_i. Additionally, we normalize the expectation with respect to the size of the image slice |I_c[u_i]|, to account for the viewpoint-dependent size of a bounding box. In order to efficiently compute (14), we propose to pre-build, for each u_i, a vocabulary of image sections I_c[u_i] for all 512 possible scene configurations \tilde{x}_i.
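A possible implementation of the asymmetric image similarity, assuming the foreground mask M(o) is obtained by simple thresholding at zero (a simplification; the actual segmentation in our pipeline is background subtraction):

```python
import numpy as np

# Sketch of the weighted asymmetric image similarity. The mask M(o)
# marks foreground pixels of the observation; alpha > 1 weights observed
# foreground pixels (those the generated image g must explain) more
# strongly than the remaining pixels.
def delta_asym(o, g, alpha=1.25):
    mask = (o > 0).astype(float)   # binary foreground mask M(o)
    diff = np.abs(o - g)           # elementwise L1 contributions
    return (alpha * np.sum(mask * diff)
            + (2.0 - alpha) * np.sum((1.0 - mask) * diff))

# Toy 2x2 images: two observed foreground pixels, one unexplained.
o = np.array([[1.0, 0.0], [2.0, 0.0]])
g = np.array([[0.5, 0.3], [2.0, 0.0]])
```

For `alpha = 1` the similarity collapses to the plain L1 distance, as required.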
The final mean-field updates can be executed asynchronously or synchronously. In an asynchronous mean-field update iteration, the individual q_i(x_i) are updated sequentially, whereas in a synchronous update iteration all q_i(x_i) are updated simultaneously, using the same previous mean-field state q(\vec{x}). While the asynchronous update provides theoretical convergence (see Sect. III-B), synchronous mean-field updates can be easily parallelized. For the optimization we use coordinate-ascent variational inference (CAVI) [35]. Hence, the probability of each q_i(x_i) is asynchronously updated with respect to the previous mean-field state q(\vec{x}) according to the final update equation

\hat{q}_i(x_i = 1) = \frac{1}{1 + \exp\big( \tau_i + \sum_{c=1}^{C} \tilde{E}_{c,i} \big)}. (15)
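The asynchronous CAVI sweep can be sketched as follows. Here `expectation_fn` is a hypothetical callback returning the summed per-sensor expectations for location i given the current marginals; in the full system it would evaluate the pre-built vocabulary of image sections:

```python
import math

# Sketch of asynchronous coordinate-ascent variational inference (CAVI):
# each marginal q_i is updated in turn using the current state of all
# other marginals, which guarantees a monotone decrease of the KL
# divergence for the exact expectations.
def cavi(n, prior_p1, expectation_fn, iterations=10):
    q = [prior_p1] * n                               # initial marginals q_i(x_i = 1)
    tau = math.log((1.0 - prior_p1) / prior_p1)      # prior log-odds
    for _ in range(iterations):
        for i in range(n):                           # asynchronous sweep
            q[i] = 1.0 / (1.0 + math.exp(tau + expectation_fn(i, q)))
    return q

# Toy run: strong evidence for a person at location 0 and against one
# at location 1 (hypothetical expectation values).
marginals = cavi(2, 0.5, lambda i, q: -10.0 if i == 0 else 10.0, iterations=5)
```

A synchronous variant would compute all updates from a frozen copy of `q`, which parallelizes trivially but loses the convergence guarantee.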

IV. EVALUATION

A. DATA SET
To the best of our knowledge, there is currently no publicly available data set that covers the scenario of top-view people detection using multiple depth sensors with overlapping fields of view. Therefore, we introduce a novel data set to compare our approach with state-of-the-art multi-camera people detection approaches. The data set contains footage from an indoor office scene and is recorded from three low-resolution commodity stereo-vision-based depth sensors, covering a variety of constellations (see Fig. 7). The sensors have a top-view on the scene, are mounted at a height of three meters, and have fields of view with a significant joint overlap (see Fig. 2). They cover a visible area of approximately 20 m^2 with up to six individual people present in the scene, entering and leaving the visible area multiple times. The data set consists of 2200 annotated frames, captured with a resolution of 376 x 240 pixels each, providing raw rectified stereo image pairs as well as disparity maps obtained by block matching. In total we annotated the ground floor locations of 10435 targets. Additionally, we associated each detection with a track to allow for full detection and tracking evaluation. For the reproducibility of our results the data set will be made publicly available.

B. QUANTITATIVE ANALYSIS
For the evaluation of our approach we use a ground floor grid with 15 x 12 grid points, corresponding to a horizontal and vertical distance of 33 cm between adjacent grid points. As input observations we use foreground-segmented depth images, obtained by static background subtraction. Notice that we only focus on frame-by-frame detection; however, the outcome of our approach could serve as input for tracking-by-detection post-processing. We have noticed that our approach is quite sensitive to the initial marginal probabilities q_i^{init}(x_i). If the initial occupancy probability is too small, the expectation in (10) will inordinately favor scene configurations with only one person present; thus, occlusion is not taken into account in the first iteration. We therefore initialize each mean-field node with a prior of q_i^{init}(x_i) = p(x_i) = 0.5 by default. The design parameter of the asymmetric image similarity \delta_{asym}(\cdot, \cdot) (see (12)) is set to \alpha = 1.25, to penalize unexplained observations more strongly. Fig. 6 depicts the impact of \alpha on the precision-recall performance. The standard deviation of the measurement noise \sigma is set to a default value of 2 cm.
For the quantitative evaluation, a detection is assumed to be a true positive if it is within a radius of 30 cm of the ground truth. We show the performance of our approach based on the precision-recall curves in Fig. 5, where the precision is given by TP/(TP + FP) and the recall by TP/(TP + FN); TP, FP, FN are the counts of the true positives, false positives and false negatives, respectively. The F1-Score is given as F1 = (2 x precision x recall)/(precision + recall).
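The evaluation protocol above can be sketched as follows. The greedy nearest-first matching between detections and ground-truth positions is an assumption for illustration, since the exact assignment procedure is an implementation detail not spelled out in the text:

```python
import math

# Sketch of the matching-based evaluation: a detection counts as a true
# positive if it lies within 30 cm of a not-yet-matched ground-truth
# position (greedy nearest-first matching, an assumption).
def evaluate(detections, ground_truth, radius=0.30):
    unmatched = list(ground_truth)
    tp = 0
    for d in detections:
        best = min(unmatched, key=lambda g: math.dist(d, g), default=None)
        if best is not None and math.dist(d, best) <= radius:
            unmatched.remove(best)
            tp += 1
    fp = len(detections) - tp   # detections without a ground-truth match
    fn = len(unmatched)         # ground-truth targets left undetected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy frame: one correct detection (10 cm off) and one false positive.
p, r, f1 = evaluate([(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (1.0, 1.0)])
```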
We compare our approach with state-of-the-art monocular multi-view approaches. As a baseline on the given depth observations we introduce a difference-of-Gaussian (DoG) based blob detector. The methods to be compared are:

• POM [19] works on binary input observations. For a fair comparison we use the same depth-based foreground segmentation as in our approach. The grid layout and the camera calibrations are also identical to our setup.
• Deep Occlusion [21] is the current state-of-the-art end-to-end architecture for multi-view person detection. Due to the lack of a large data set we use the available pre-trained model without any further supervision. As input we stack the given grayscale observations into a three-channel image to be compatible with the RGB architecture.
• DoG-Detector: As a baseline on the given depth data we apply difference-of-Gaussian blob detection on the foreground-segmented depth images of each sensor independently and project the resulting detections onto the common world ground plane. The final detections on the ground plane are obtained by proximity clustering.

Fig. 5a depicts the performance of the examined approaches over all frames and views. The results show that without any further supervision the given data set is very challenging for deep learning architectures such as Deep Occlusion [21]. Due to the vertical top-view, the appearances of people are drastically different compared to the classical profile view. The results of the DoG-Detector indicate that, even when considering proximity-clustered results from all three views, naive blob-based single-view detectors are not competitive with the more sophisticated multi-view approaches in our scenario. Although POM [19] achieves remarkable performance in our setting, our approach outperforms POM in terms of precision, resulting in a better area under the curve (AUC) value as well as a better F1-Score (see Table 1).
In order to show how our probabilistic model exploits the multi-view evidence given by all three sensors, we evaluated the performance of our approach for all different combinations of sensor views contributing to the solution. For a fair comparison, we take only those people into account that are visible from all three sensors (see Fig. 2 for the fields of view of the sensors). Fig. 5b depicts how using the multi-view information increases the detection performance. In the mono-view case, View 2 and View 3 by themselves do not perform well, with F1-Scores of 0.61 and 0.73, respectively. However, combining the image evidence of View 2 and View 3 leads to a drastic performance increase, as evidenced by a best F1-Score of 0.92. Even though View 1 achieves comparably good performance due to its general viewpoint, using the image evidence from all three sensors clearly outperforms all other view combinations. On a single CPU core, our non-optimized Python implementation needs approximately 800 ms per frame. Although real-time performance is not reached yet, there are plenty of optimization options, such as parallel mean-field updates, or taking advantage of GPUs.

FIGURE 8. Evolution of asynchronous and synchronous mean-field updates. In the left-hand plots of (b) and (c), every path corresponds to the probability evolution of one q_i(x_i). The probability evolution of six grid locations of interest is plotted in unique colors; the others are plotted in purple. The right-hand plots show the same process illustrated as probability maps for the first four iterations.

C. QUALITATIVE ANALYSIS

Fig. 7 shows exemplary mean-field optimization results. The given samples illustrate that our approach is able to resolve challenging scenarios, suffering from occlusion and measurement noise, by making use of the full multi-view image evidence. Fig. 7c shows a typical false negative error on the image border, which is the dominant error class occurring in the data set. Due to the stereo-vision-based sensors, the depth information is noisier at the image border, eventually leading to an insufficient fit of the 3D model. To overcome this limitation, a richer probabilistic sensor model which takes systematically varying noise into account could be employed.

In Fig. 8, the mean-field optimization is illustrated for one exemplary frame, for both the asynchronous and the synchronous update strategy. Fig. 8c depicts a general disadvantage of synchronous mean-field updates: the simultaneous optimization potentially leads to oscillating marginal probabilities of adjacent grid locations. Fig. 9 shows that asynchronous mean-field optimization converges after only a few iterations, whereas the synchronous mean-field update suffers from the oscillation effects mentioned above.

V. CONCLUSION
In this work we have presented a novel approach for probabilistic people detection in multiple overlapping depth images. Our main contribution is the use of mean-field variational inference in combination with a generative scene model to jointly exploit the multi-view information in order to approximate the marginal probability distribution of people present in the scene. Our experiments have shown state-of-the-art results on a novel data set for indoor people detection in overlapping depth images from the top-view. We have demonstrated that our approach achieves strong detection performance, outperforming state-of-the-art monocular multi-view people detection methods. We were also able to show that using multi-view image evidence increases the detection performance significantly compared to a single view.
Future work will focus on incorporating temporal information into our probabilistic model in order to provide joint probabilistic detection and tracking.