Perspective Independent Ground Plane Estimation by 2D and 3D Data Analysis

Identifying the orientation and location of a camera placed arbitrarily in a room is a challenging problem. Existing approaches impose common assumptions (e.g., the ground plane is the largest plane in the scene, the camera roll angle is zero). We present a method for estimating the ground plane and camera orientation in an unknown indoor environment given RGB-D data (colour and depth) from a camera with arbitrary orientation and location, assuming that at least one person can be seen smoothly moving within the camera field of view with their body perpendicular to the ground plane. From a set of RGB-D data trials captured using a Kinect sensor, we develop an approach to identify potential ground planes, cluster objects in the scenes and find 2D Scale-Invariant Feature Transform (SIFT) keypoints for those objects, and then build a motion sequence for each object by evaluating the intersection of each object's histogram in three dimensions across frames. After finding the reliable homography for all objects, we identify the moving human object by checking the change in the histogram intersection, the object dimensions and the trajectory vector of the homography decomposition. We then estimate the ground plane from the potential planes using the normal vector of the homography decomposition, the trajectory vector, and the spatial relationship of the planes to the other objects in the scene. Our results show that the ground plane can be successfully detected, if visible, regardless of camera orientation, ground plane size, and movement speed of the human. We evaluated our approach on our own data and on three public datasets, robustly estimating the ground plane in all indoor scenarios. Our approach substantially reduces restrictions on a priori knowledge of the ground plane, and has broad application in conditions where environments are dynamic and cluttered, as well as in fields such as automated robotics, localization and mapping.


I. INTRODUCTION
With one additional dimension, 3D data provide a more intuitive and realistic environmental perspective in computer vision applications than traditional 2D data. By combining traditional 2D RGB data with depth information, 3D data create a more comprehensive digital representation of real world environments, providing considerable value in many applications such as training and simulation [1]-[3], construction [4]-[6] and gaming [7]-[10]. The benefits of 3D data over 2D data are particularly noticeable in cluttered or dynamic environments. In these complex environments, 3D data allow enhanced visual understanding, improved precision and accuracy, easier risk/issue identification and analysis, and intuitive model manipulation [11]-[15]. For example, operating rooms typically have many objects that frequently change depending on the nature of the emergency, including multiple humans who enter and exit the room and interact with the objects and each other. Constructing an accurate 3D model of an operating room and recording videos of various processes within the room could create a helpful and interactive tool for training and simulation, or be used in real time to observe and monitor the room. For applications like gaming, the room is often modified to accommodate placement of a sensor (i.e., clearing out a space), the sensor is intentionally located in an ideal position, and users are willing to undergo a calibration process if necessary. However, the applications we consider, such as the operating room, are complex, dynamic and cluttered real-world environments, where the sensor must be located out of the way of the processes or occupants of the room, and systems using the sensor would need to auto-calibrate because occupants of the room are unlikely to be willing to perform calibrations.
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
Accordingly, in applications in these complex environments, the sensor's location and orientation in the room will generally be unknown (e.g., the sensor's field of view cannot be assumed to be parallel to the ground). In this paper, we focus on addressing the difficulties of estimating the ground plane and finding the camera orientation in an indoor environment without any prior knowledge of the sensor or room.
In order to process image information from an unknown environment, knowledge of the ground plane, and hence the position and orientation of the camera, is fundamental [16]-[20]. Indeed, most computer vision algorithms implicitly assume knowledge of the ground plane (e.g., that the ground is at the "bottom" of the scene [17], [21], [22] or is the largest plane [23], [25], [26]). However, in complex environments with unknown sensor placement, the ground plane may not be the largest visible plane (e.g., many objects on the ground) or at the "bottom" of the scene (e.g., overhead perspectives). Still, identifying the ground plane, and accordingly the camera position and orientation, is critical for most computer vision applications, especially for indoor tracking, exploring, navigation and scene analysis. For instance, in Simultaneous Localization and Mapping (SLAM) applications, RGB-D data have been used to extract plane features in indoor environments for localizing robot positions, outperforming traditional point feature-based methods in both accuracy and efficiency, even with low image quality devices [55], [56]. With the recognition of the ground plane and camera orientation, the robot performs better SLAM occlusion detection during mapping [57] and obstacle detection [58]. In addition, finding the ground plane and calculating the camera orientation also facilitates improved 3D registration and 3D reconstruction of data from multiple sensors viewing the same scene by converting a 3D problem into a 2D problem. Ultimately, our goal is to estimate the ground plane for each sensor in a multi-sensor system, such that the ground can be used as a reference for finding the positions and orientations of the sensors relative to each other, which will facilitate the reliable 3D reconstruction of a complex room.
To accomplish our goal, we aim to develop a system that estimates the ground plane, camera orientations and relative locations of multiple RGB-D sensors with unknown positions and orientations in an indoor environment. Our only assumptions are: that most of at least one person can be seen smoothly moving in the RGB-D camera field of view; the person's body is perpendicular to the ground plane while moving; and the RGB-D camera's position and orientation remain unchanged until the ground plane estimation is complete. In order to estimate the ground plane under this condition, we combine the robustness of 3D Random Sample Consensus (RANSAC) and 2D homography decomposition. While 3D RANSAC extracts useful spatial information from each 3D point cloud segment, 2D homography decomposition constructs homography planes from people walking on the ground. Our approach even accommodates scenarios where the ground plane is a small region (i.e., barely visible) or even not visible in the field of view (FOV) of the sensor by utilizing other visible planes that are parallel to the trajectory of movement and estimating the actual ground plane.

II. RELATED WORK
Existing ground plane detection can be broadly categorized into 2D or 3D approaches based on the sensor type. Within 2D approaches, the most popular approach for ground plane estimation is homography. For example, homography-based approaches have been used to first find the feature key points in the scene, followed by Kalman filtering [27] or Modified Expectation Maximization [28] to build confidence in the ground plane transformation matrix across successive frames. These two approaches assume that the sensor roll angle is zero and that the camera sees only the ground plane with objects above the plane. Homography has also been successfully used as a first step, with the homography decomposition results combined with a Bayes filter [29] or contour searching [30] to estimate the ground plane with 2D images. However, again the ground plane is assumed to be the area in front of the camera [29], or the single colour ground plane is assumed to occupy the majority of the FOV [30]. Other 2D approaches have used depth-image data or V-disparity values (the histogram of the disparity map [31]) rather than traditional RGB image data [23], [24]. Zhi Jin et al. [32] proposed a depth-map driven ground plane detection algorithm that grows a plane starting from the largest area having similar depth values in the depth map, assuming the largest plane is the ground plane. Kircali and Tek [33] estimated the ground plane by comparing the depth map of each new frame with a pre-calibrated depth map in which the ground plane was pre-defined. Assuming the majority area in the scene comprises the ground plane, the gradient of the V-disparity pixel values has also been successfully used to identify the ground plane with an arbitrary camera roll angle [23]. Furthermore, Cherian et al. [35] applied multiple texture-based filters with a Markov Random Field to reconstruct the depth map from a single RGB image and estimate the ground plane based on texture-based searching segmentation. Due to the intrinsic features of the algorithm, this approach assumes the camera is parallel to the ground plane, and that the ground plane has a unique texture. Dragon et al. [34], [36] proposed an approach where RGB frames captured from a moving sensor are iteratively split into regions until reliable homographies can be estimated from the feature points within these regions. The decomposition of the homography with the highest probability indicates the orientation and ego motion of the sensor's movement. Unfortunately, this approach is not suitable for indoor environments with a stationary sensor because moving objects will be a small proportion of the scene, making it hard to distinguish between a homography generated from mismatched key points and a homography from a moving object. Further, their solution requires the shape of moving objects to remain unchanged to ensure successful feature correspondence between frames, a condition that cannot be guaranteed in indoor environments with an arbitrary fixed perspective. More recently, a ground plane estimation approach using monocular images with a predefined region of interest [38] was developed, but it requires a known pitch angle. Although the above 2D approaches can successfully identify the ground plane, none of them work in dynamic or cluttered environments where the location and orientation of the sensor is unknown.
Ground plane estimation approaches in 3D commonly utilize the 3D Hough transform or 3D RANSAC with the raw data. For example, a 3D Hough transform with a ball-based accumulator, which collects the vote values [37], has been used to define the ground plane based on the highest vote among accumulators [41]. Due to the voting procedure, this approach can only find the ground plane if it is the largest plane in the scene. 3D RANSAC, a more direct and brute-force approach, has been used on raw 3D data to find the ground plane with the assumption that the ground plane is the closest or largest plane in the camera FOV [21]. Other 3D approaches have used an estimation of the 3D normal vector for each raw data point rather than the raw points directly (e.g., [42] and [43]), but assume that the camera roll and pitch angles are zero. More recently, an approach combining machine learning with a depth mask has been proposed, but it requires minimal orientation variations (i.e., 0° to 15°) [39]. Ground plane estimation has also been integrated into bigger applications (e.g., [21], [40], [57]), but these share the same common constraints, such as zero roll rotation or the ground plane being the largest plane. Like the promising 2D ground plane approaches, these 3D approaches will not work in cluttered or dynamic environments because of their underlying assumptions.
Together, the most robust and reliable 2D and 3D methods of finding the ground plane share common assumptions or predicates, such as the known and unchanged orientation of the camera, the ground plane being the largest plane in the field of view, the shape of moving objects in the scene remaining unchanged, the ground plane having a single color or depth value, or the ground plane only appearing at a certain location within the camera's FOV. While these assumptions reduce the complexity of the ground plane estimation problem based on the requirements of specific applications, they cannot be used in real-world scenarios where the camera location and orientation are unknown, and the environment is complex, cluttered or dynamic. To overcome the limitations of these assumptions for our application, we build on the approach of Dragon et al. [34], [36] because the assumptions of their approach are closest to our conditions. Notably, while their approach requires the sensor to be moving, we assume that the sensor is stationary and something in the scene is moving instead. In our case, we restrict our interest to a human moving in the scene, though this does not necessarily need to be the case. We present our approach in section III, followed by our experimental setup and results in section IV. We then present our discussion and future work in section V.

III. METHODOLOGY
Our ground plane estimation approach combines the robustness of 2D and 3D computer vision algorithms. The major components of our approach are: 1) Data pre-processing (section III-A) where we described the preparation of 2D and 3D data with corresponding features; 2) 2D homography decomposition (section III-B), where we decomposed the 2D homography according to 3D feature restrictions to estimate the trajectory of any moving humanoid objects in the scene; and 3) 3D ground plane estimation (section III-C) where we derived the most probable ground plane by refining 2D homography decomposition results into confidence estimates.

A. DATA PRE-PROCESSING
To obtain a more useful 3D data representation, we first generated a 3D point cloud from the RGB-D data using the intrinsic and extrinsic parameters of the sensor. We calibrated the sensor using Zhang's approach [44] to obtain the intrinsic parameter matrix and the extrinsic parameters $R$ and $T$. Finally, using the radial distortion coefficients $k_1, k_2, k_3$ and the tangential distortion coefficients $p_1, p_2$, we calculated the camera matrix $C$ by multiplying the intrinsic and extrinsic matrices, such that the depth images were undistorted based on the camera parameters and distortion coefficients [45]. From Eqs. (1), (2) and (3), the coordinates $(x, y)$ and value $z$ of each pixel in each depth image were transformed to an individual point $(x', y', z')$ in the associated 3D point cloud.
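As an illustration of the pixel-to-point transformation described above, the following is a minimal sketch that assumes a standard pinhole model applied to an already undistorted depth image. The parameter names `fx`, `fy`, `cx`, `cy` and both function names are hypothetical; this is not necessarily the exact formulation used in our implementation.

```python
def depth_pixel_to_point(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth value `depth` to a 3D point.

    fx, fy (focal lengths in pixels) and cx, cy (principal point) are
    assumed intrinsics; lens distortion is assumed to be already removed.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_image_to_cloud(depth_image, fx, fy, cx, cy):
    """Convert a row-major depth image (list of rows) to a list of 3D
    points, skipping pixels without a valid depth (value <= 0)."""
    cloud = []
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            if d > 0:
                cloud.append(depth_pixel_to_point(u, v, d, fx, fy, cx, cy))
    return cloud
```

A pixel at the principal point maps to a point on the optical axis, which is a quick sanity check for any chosen intrinsics.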
In general, the point cloud of an indoor environment is composed of planes (e.g., walls, floor), objects (e.g., drawers, chairs), and humans, though in some cases substantial portions of objects are also planes (e.g., desks). In a cluttered environment with unknown camera location and orientation, the ground plane may not be visible (e.g., if the sensor is on the ground facing up), or may be anything from a small region that is highly occluded by objects to the largest visible plane. Therefore, after down-sampling the point cloud by applying a voxel grid filter, we segmented the point cloud into planes and non-planar objects. First, using Random Sample Consensus (RANSAC) [46], we iteratively extracted, stored and removed the largest plane from the point cloud remaining after the previous iteration, until the number of remaining points was less than 20% of the total points in the original point cloud (see Algorithm 1).
Algorithm 1
    planes ← ∅
    pc ← pointCloud
    originSize ← Size of pc
    while Size of pc > 20% × originSize do
        plane ← RANSAC(pc)
        if Size of plane < Threshold then
            break
        planes ← planes ∪ {plane}
        pc ← pc − plane
    return planes, pc

After we stored and removed the planes in the scene, we segmented the remaining point cloud into non-planar objects using Euclidean clustering [47]. We first employed Euclidean clustering to find groups of points that were physically close to each other, and then we stored all clustered objects $S_o$ and extracted planes $S_p$.
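The iterative extraction loop of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names, the default thresholds, the iteration count and the fixed random seed are assumptions for the example, and a practical system would more likely rely on an optimized point cloud library.

```python
import random


def fit_plane(p1, p2, p3):
    """Return (a, b, c, d) with a unit normal for the plane through three
    points, or None if the points are (nearly) collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return (a, b, c, -(a * p1[0] + b * p1[1] + c * p1[2]))


def ransac_plane(points, dist_thresh, iterations, rng):
    """Return (plane, inlier_indices) for the sampled plane with most inliers."""
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= dist_thresh]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


def extract_planes(points, dist_thresh=0.05, min_plane_size=20,
                   stop_fraction=0.2, iterations=200, seed=0):
    """Repeatedly remove the largest plane until no more than
    `stop_fraction` of the original points remain."""
    rng = random.Random(seed)
    remaining = list(points)
    origin_size = len(remaining)
    planes = []
    while len(remaining) > stop_fraction * origin_size and len(remaining) >= 3:
        plane, inliers = ransac_plane(remaining, dist_thresh, iterations, rng)
        if plane is None or len(inliers) < min_plane_size:
            break
        inlier_set = set(inliers)
        planes.append([remaining[i] for i in inliers])
        remaining = [p for i, p in enumerate(remaining) if i not in inlier_set]
    return planes, remaining
```

The points left in `remaining` correspond to the input of the subsequent clustering step.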
To identify which clustered objects are moving in the scene in preparation for homography estimation, we needed to find corresponding objects between successive frames. We utilized SIFT [48] as the feature extractor on the RGB images to derive 2D feature points. SIFT was able to generate a sufficient number of 2D features for each object in the scenes; particularly for any humans. Additionally, SIFT accommodates a wide range of performance control through variation of the octave layer number nOct, edge-like feature filter threshold eThresh, and the sigma of Gaussian filter σ [49], allowing excellent optimization for keypoint detection. For each RGB frame, the 2D feature points were stored as an output of the data preparation phase, along with the 3D points of the clustered objects and the extracted planes.

B. HOMOGRAPHY ESTIMATION
A homography matrix [50] can be computed by matching features in two RGB images of an object captured by two cameras at different locations [27]. Since we assume the camera is static and humans move on the ground plane, we calculate the homography matrix using SIFT keypoints in two RGB frames, captured at times $t$ and $t + \Delta t$ from a single sensor, using the moving humans as motion reference points. We used the homography between moving objects across successive frames to construct a plane that is perpendicular to the ground plane. With a minimal sample set of four feature key point correspondences between the frames at time $t$ and time $t + \Delta t$, a nine-parameter homography matrix can be generated, which represents the transformation between 2D points in image coordinates and 3D points in the camera coordinate system.
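To make the four-correspondence case concrete, the sketch below estimates a homography with the direct linear method, fixing the bottom-right entry to 1 so that eight unknowns remain. This normalization is an assumption for the example (it fails when the true entry is zero), and all function names are hypothetical. In practice a robust estimator over many keypoints would be used instead of exactly four points.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate correspondences")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4_points(src, dst):
    """Estimate the 3x3 homography H (with H[2][2] fixed to 1) that maps
    each (x, y) in src to the matching (u, v) in dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), and similarly for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(h, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Because a homography is only defined up to scale, fixing one entry leaves eight degrees of freedom, which is why four point correspondences (two equations each) suffice.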
To find which objects were moving between successive frames, we implemented the Blockwise Linearity Assumption (see [34]). Instead of generating a result from each pair of consecutive frames, the Blockwise Linearity Assumption estimates an average result from an $N$-length block of frames by processing the first frame of the block, which is used as the reference frame, and the $i$-th frame in the block (where $1 < i \le N$). Assuming the human moves reasonably smoothly over the ground plane, the changes between the frame pair $(1, i)$ and the frame pair $(1, i + 1)$ within one block will grow linearly. We segmented the entire data set into blocks $B = \{F_1, F_2, \ldots, F_x\}$ of frames $F$ ranging from frame 1 to $x$. Let $S^1_o$ and $S^2_o$ denote all the object segments in the first and second point clouds representing a pair of successive frames. We calculated the 1-D histograms of the three dimensions, $Hist_x$, $Hist_y$ and $Hist_z$, for each object segment $S^1_{o_i}$ and $S^2_{o_j}$. Then, we matched a pair of object segments in $F_1$ and $F_x$ as the same object $O_i$ if the intersection ratio between the histogram areas of $S^1_{o_i}$ and $S^2_{o_j}$, computed as the Jaccard index [54], was greater than zero and decreased as $x$ increased. To ensure the histogram intersection was larger than zero between the first frame $F_1$ and frame $F_x$, we chose a small block size similar to [34], [36]. The resulting list of matched pairs of 3D objects $S^1_{o_i}$ and $S^2_{o_j}$, including any moving humans, was projected to 2D pixel clusters $C^1_{o_i}$ and $C^2_{o_j}$ using the distortion coefficients of section III-A, where $(x', y')$ denotes the x and y values of a 3D point $(x', y', z')$, $(x, y)$ denotes the corresponding distorted pixel coordinates and $r^2 = x'^2 + y'^2$. Each 2D pixel cluster $C^x_{o_i}$ was then converted to a 2D feature point cluster $R^x_{o_i}$ by using each 2D pixel $(x_i, y_i)$ as the center point and searching for the closest feature points within the radius $\tau$, as shown in Figure 1(a).
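The histogram-based matching of object segments described above can be sketched as follows. The bin count, the shared value range and the function names are assumptions made for the example; the sketch only shows how a per-axis Jaccard index between two segments might be computed.

```python
def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def jaccard_index(hist_a, hist_b):
    """Intersection area over union area of two histograms with equal bins."""
    inter = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    union = sum(max(a, b) for a, b in zip(hist_a, hist_b))
    return inter / union if union else 0.0


def segment_overlap(seg_a, seg_b, lo, hi, bins=10):
    """Per-axis Jaccard indices (x, y, z) between two 3D point segments."""
    return tuple(
        jaccard_index(histogram([p[axis] for p in seg_a], lo, hi, bins),
                      histogram([p[axis] for p in seg_b], lo, hi, bins))
        for axis in range(3))
```

Under this sketch, a stationary object yields indices close to 1 across a block, while a moving object yields indices that stay above zero but shrink as the frame gap grows.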
We removed any feature keypoints that were outside of the regions, and applied Motion-Split-And-Merge (MSAM) [36] to each pair of corresponding regions $R^1_{o_i}$ and $R^x_{o_j}$ in $F_1$ and $F_x$ respectively to find the most reliable keypoint clusters $C^x_{k_i}$ for generating homography matrices $H^x_{o_i}$ (Figure 1(b)). A homography matrix generated directly from the human feature points $R^x_{o_i}$ can be unreliable because of the different movement patterns of human heads, chests, arms and legs. MSAM accounts for these differing movement patterns by finding the most reliable keypoints (most likely keypoints within the head or body region) out of the set $R^x_{o_i}$, allowing a reliable homography matrix to be generated that represents the human's stable movement through a block $B$. We then decomposed each homography matrix (e.g., [60], [61]) and filtered out the invalid solutions to construct the most reliable decomposition solution. Here, a homography solution was characterized as invalid if there exists a 2D key point $(x_i, y_i)$ and a 3D point cloud point $(x'_i, y'_i, z'_i)$ within region $R_{o_j}$ that yields $z'_i < 0$, where $(x'_i, y'_i, z'_i)^T = H (x_i, y_i, 1)^T$ and $n^T_{o_i} (x'_i, y'_i, z'_i)^T = 1$ [34]. Finally, we built the set of all the moving objects in the scene $S_{mo_i}$ by extracting the object regions that had large and successively decreasing differences in the intersection coefficient $intersection^x_{o_i}$ among all objects $O$ in a block. Based on the decomposition result and the assumption that the person's body is perpendicular to the ground plane while moving, we used three conditions to determine the moving humanoid object among all moving objects [62]: the longest edge $E_l$ of the moving object's bounding box is larger than a length threshold $Thresh_l$; the ratios between the longest edge $E_l$ and the other two edges are larger than a ratio threshold $Thresh_r$; and the trajectory vector $t^i_{o_i}$ is perpendicular to the longest edge $E_l$ of the object's bounding box.
The homography decomposition results of the moving objects in a block were the output of this phase, allowing us to estimate the ground plane out of the candidate planes extracted in section III-A.

C. GROUND PLANE ESTIMATION
According to the assumption that a person is moving on the ground, the ground plane is the plane that best satisfies the following criteria: c1: its normal is parallel to the plane that is defined by the block homography decomposition's normal vector and trajectory vector for the moving object; c2: it is parallel to the trajectory vector of any moving object; c3: it does not dissect any objects in the scene; c4: it is close to the object segments $S_o$ in the scene, and in particular to moving objects. Based on these criteria, we built a confidence estimate cascaded filter to score, within the complex and noisy 3D environment, the likelihood that an extracted plane is the ground plane, ranging from 0 (very unlikely) to 10 (very likely). Conceptually, we found all horizontal planes (those parallel to the homography's normal and trajectory vector) from all known planes. We then increased or decreased our confidence in horizontal planes based on their proximity to the boundary of the 3D scene. Finally, we adjusted our confidence estimates based on each plane's relationship to objects in the scene, prioritizing their spatial relationship to moving objects. To distinguish between low-confidence valid planes and invalid planes, whose confidence estimates are reduced by our cascaded filter, we assigned a small initial confidence $conf_I = 1$ to each of the extracted planes $S_p$ that were found in section III-A. We then evaluated the fit of each plane to our criteria to complete our confidence estimates. The overall confidence of each potential ground plane combines $conf_I$ with $conf_{HD}$, $conf_{RP}$ and $conf_{OD}$ as given in Eq. (7), where $conf_{HD}$, $conf_{RP}$ and $conf_{OD}$ represent the Homography Decomposition Checking confidence, Relative Position Checking confidence, and Object Distance Checking confidence respectively.

1) HOMOGRAPHY DECOMPOSITION CHECKING FILTER
We scored each ground plane according to criteria c1 and c2: how parallel each potential ground plane is to both the trajectory and the block homography decomposition of each human moving object. To identify the moving objects that were likely humans, we employed a heuristic. Since the camera orientation was arbitrary, we used the normal vector $n_{pp_i}$ of each $S_{p_i}$ as the camera's reference orientation. The complementary angle of the angle $\theta_{nx}$ between $n_{pp_i}$ and the x-axis indicates the roll angle of the camera, while the complementary angle of the angle $\theta_{nz}$ between $n_{pp_i}$ and the z-axis indicates the pitch angle of the camera. Hence, the roll rotation matrix $R_{roll}$ and pitch rotation matrix $R_{pitch}$ were generated from $C_{\theta_{nx}}$ and $C_{\theta_{nz}}$ based on the right-hand rule, where $C_{\theta_{nx}}$ and $C_{\theta_{nz}}$ denote the complementary angles for roll and pitch respectively. After we transformed each moving object $S_{mo_i}$ with its corresponding rotation matrices $R_{roll}$ and $R_{pitch}$ to ensure the bounding box of $S_{mo_i}$ aligned with the x-, y- and z-axes, we determined whether the moving object was humanoid based on three conditions: 1) the longest dimension was at least 1.5 times larger than the other two dimensions [59]; 2) the longest edge of the bounding box was longer than a learned threshold; and 3) the trajectory vector $t_{mo_i}$ was perpendicular to the longest edge of the bounding box. We first represented each human moving object $S_{mo_i}$ as the 3D plane $P^{homo}_{o_i}$, constructed from the normal vector $n_{mo_i}$ and the trajectory vector $t_{mo_i}$. The contribution of the Homography Decomposition Checking confidence $conf_{HD}$ to the overall confidence estimate cascaded filter was computed from $\cos\theta$, where $\theta$ is the angle between the normal of $P^{homo}_{o_i}$ and the normal of $S_{p_i}$. The cosine of the angle is used to ensure that a small penalty is applied to planes that are nearly parallel (likely due to sensor noise), but a large penalty is applied to planes that are not parallel.
The constant scaling factor of two is the associated weight of this component relative to the other components of the confidence estimate cascaded filter. Since this confidence only represents the likelihood that a plane is horizontal, the associated weight factor is comparatively small, while still ensuring that the confidence score of planes that have a large angle $\theta$ is reduced to zero. Additionally, for each moving object, we generated the set $S_{pp_i}$ of planes that were parallel to the movement of $P^{homo}_{o_i}$.
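The three humanoid conditions can be illustrated with the short sketch below, which operates on a bounding box that has already been aligned with the axes. The function name, the `min_length` threshold and the angular tolerance `angle_tol_deg` are hypothetical values chosen for the example; only the ratio of 1.5 comes from the text above.

```python
import math


def is_humanoid(box_dims, trajectory, min_length, ratio=1.5, angle_tol_deg=15.0):
    """Apply the three bounding-box conditions to an axis-aligned object.

    box_dims: (dx, dy, dz) edge lengths of the aligned bounding box.
    trajectory: (tx, ty, tz) motion vector in the same aligned frame.
    min_length and angle_tol_deg are assumed thresholds for illustration.
    """
    longest_axis = max(range(3), key=lambda i: box_dims[i])
    longest = box_dims[longest_axis]
    others = [d for i, d in enumerate(box_dims) if i != longest_axis]
    # Condition 1: longest edge is at least `ratio` times both other edges.
    if any(longest < ratio * d for d in others):
        return False
    # Condition 2: longest edge exceeds the length threshold.
    if longest <= min_length:
        return False
    # Condition 3: trajectory is (nearly) perpendicular to the longest edge.
    norm = math.sqrt(sum(t * t for t in trajectory))
    if norm == 0.0:
        return False
    cos_angle = abs(trajectory[longest_axis]) / norm
    return cos_angle <= math.sin(math.radians(angle_tol_deg))
```

For example, a box of 0.5 by 1.7 by 0.3 moving mostly along the x-axis passes all three checks, whereas the same box moving along its longest edge does not.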

2) RELATIVE POSITION CHECKING FILTER
We scored each plane in $S_{pp_i}$ according to criterion c3: how likely it is that potential ground planes do not dissect objects in the scene. In most cases, the ground plane will not have objects on both sides of it, while other planes (e.g., tabletops) can have objects on both sides. In the exceptional scenario where the floor contains planes with multiple height values (e.g., stairs or theater stages) and the person walks on the plane that has the higher height value, our confidence estimate directly relates to the size of each plane and the difference between the sensor and the two planes. We will discuss this rare scenario in Section V. Furthermore, this filter was essential for remediating the effects of noise and sensor depth error in the data. We represented each plane $S_{pp_i}$ by its plane equation: $\rho = A x' + B y' + C z' + D$ (9). The value of $\rho$ will be positive, zero, or negative, indicating which side of the plane the point is on, or whether the point is on the plane. We applied the 3D coordinates $(x', y', z')$ of each point to Eq. (9) for each $S_{pp_i}$, recording the number of positive results $\rho^+$ and negative results $\rho^-$, the maximum distance $d^{max+}_i$ from the points above the plane to plane $S_{pp_i}$, and the maximum distance $d^{max-}_i$ from the points below the plane to plane $S_{pp_i}$. The contribution of the Relative Position Checking score to the overall confidence estimate function was computed from the cosine of the proportion of points on one side of the plane. Again, the cosine was used to apply a smaller penalty for objects that are on one side of the plane and a larger penalty for objects that are on both sides of the plane. Additionally, similar to the Homography Decomposition Checking Filter, the constant scale factor of two is the relative weight of this component in the overall confidence estimate.
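The side-of-plane bookkeeping described above can be sketched as follows, assuming the signed value of a point is obtained by evaluating the plane expression with coefficients (A, B, C, D) at that point. The function name and the tolerance `eps`, used to treat points as lying on the plane, are assumptions for the example.

```python
def side_statistics(points, plane, eps=1e-6):
    """Count the points on each side of plane (A, B, C, D) and record the
    largest point-to-plane distance on each side."""
    a, b, c, d = plane
    norm = (a * a + b * b + c * c) ** 0.5
    n_pos = n_neg = 0
    d_max_pos = d_max_neg = 0.0
    for x, y, z in points:
        rho = a * x + b * y + c * z + d  # signed value for this point
        dist = abs(rho) / norm           # Euclidean distance to the plane
        if rho > eps:
            n_pos += 1
            d_max_pos = max(d_max_pos, dist)
        elif rho < -eps:
            n_neg += 1
            d_max_neg = max(d_max_neg, dist)
    return n_pos, n_neg, d_max_pos, d_max_neg
```

A plane with points on only one side (one of the two counts equal to zero) is consistent with a floor or ceiling, whereas a tabletop would typically have non-zero counts on both sides.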

3) OBJECT DISTANCES CHECKING FILTER
Finally, we scored each plane in $S_{pp}$ according to criterion c4: how close all objects in the scene $S_o$ are to the potential ground planes. Here, we utilized the knowledge that far more objects will be on the ground than on any other plane, and in particular that people walk on the ground plane. Since some objects, such as decorations or lights, can be on potential ground planes like the ceiling or walls, we assigned higher weights to moving objects. In order to calculate the object-to-plane distances, we aligned all the 3D object segments $S_o$ and planes $S_{pp_i}$ to the axes by applying the roll and pitch rotation matrices $R_{roll}$ and $R_{pitch}$ found in section III-B. Since the ground plane is likely to be the highest or lowest plane in a 3D point cloud, our confidence estimate increased or decreased proportionally when the object-to-plane distance was smaller or larger than a learned value of one-fourth of the point cloud height. The contribution of the Object Distance Checking score to the overall confidence estimate function was represented by Eq. (12), with $k_{mo} = W_{mo} k_s$ (13), where $k_s$ and $k_{mo}$ are scaling factors for stationary and moving objects, $D_{S^s_{o_i}}$ denotes the absolute distance between a stationary object and plane $S_{pp_i}$, $D_{S^m_{o_i}}$ denotes the absolute distance between a moving human object and plane $S_{pp_i}$, $N_s$ and $N_{mo}$ denote the number of stationary object segments and the number of moving human objects, $W_{mo}$ denotes the weight of the moving human object, and $h$ denotes the height of the point cloud, as the confidence representation of each plane $S_{pp_i}$. Additionally, the constant scale factor of five in Eq. (12) is the relative weight of this component, maximizing the confidence of the real ground plane, and providing sufficient penalty to reduce the confidence of planes in the middle of the room, such as a table, to zero.

4) GROUND PLANE CONFIDENCE
A potential ground plane with a confidence estimate found with Eq. (7) that exceeded a learned confidence threshold $\zeta$ was then highly likely to be the true ground plane, suggesting that no further processing was required. However, we could only find the ground plane if the ground plane belonged to a plane in the set $S_{pp}$, and as such there may not be any planes such that $conf_{S_{pp}} > \zeta$. In many practical cases, the actual ground plane could be a small plane in the camera FOV, which would cause the ground plane to be segmented as an object or part of another object in $S_o$. Additionally, any surfaces that are close to the true ground plane but have a larger area than the ground plane can lead to an incorrect identification of the true ground plane. Finally, the ground plane may not actually be visible in the scene. In these situations, we initiated a secondary ground plane estimation.

5) SECONDARY GROUND PLANE ESTIMATION
In the case where no plane from $S_{pp}$ satisfied the condition $conf_{S_{pp}} > \zeta$, we applied RANSAC to each 3D object segment $S_{o_i}$, retrieving the largest plane within each object to generate the set $S^{sd}_{pp}$ and iterating through the steps of sections III-A to III-C. If any plane in $S^{sd}_{pp}$ had a confidence $conf_{S^{sd}_{pp}} > \zeta$, the plane with the highest confidence was selected as the actual ground plane. However, if no plane had a confidence above $\zeta$ after the secondary estimation, the ground plane did not exist in the camera FOV. In this scenario, the plane from $S_{pp}$ that had the highest confidence was used to predict the ground plane. Using the distances $d^{max+}_i$ and $d^{max-}_i$ from section III-C.2, the ground plane formula was estimated from the plane coefficients $(A, B, C, D)$ and one of these two distances, selected based on which Object Distance Checking confidence was higher.

IV. EXPERIMENTS
We evaluated our algorithm on our own dataset of generated video sequences, as well as on all relevant video sequences from three public datasets. In this way, we ensured our algorithm was generalizable, repeatable, and insensitive to artifacts that may be present in our own data collection. Specifically, we focused on representative scenarios with: a high variety of camera orientations; camera locations; ground plane size, shape and visibility; and room and occupant complexity.

A. GENERATED VIDEO SEQUENCE DATA
We collected video sequence data using the Kinect v1, which provides an RGB image and a depth image at an average rate of 27 frames per second (FPS); we combined these images to form an RGB-D image. The data were captured using a MacBook Pro (Retina, 13-inch, Mid 2014) with a dual-core i5 CPU and 8 GB of memory. We recorded video sequences by placing the camera in 24 unique scenarios, which included various combinations of different camera orientations and locations, multiple planes, multiple people, diverse moving speeds, and various body appearance ratios.
Our captured video sequences contained 40-140 data frames from the time the first person entered the camera's field of view or started moving to the time the last person left the camera FOV or stopped moving. Similar to the work of [36], we chose an MSAM block size of five frames. From experimentation, we determined that planes with a confidence score ζ > 8.5 are highly likely to be the actual ground plane, while planes with a confidence score of 6.0 < ζ < 8.5 are planes that are parallel to the ground plane, and may be the ground plane. Based on our experimental results, the weight $W_{mo}$ for the Object Distance Checking step in Section III-C is optimally set to 8.0 to ensure that the moving person becomes the decisive factor among all other stationary objects and noise. Figures 2 and 3 show some representative data sets and examples of our ground plane estimation results.
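As a small illustration of how the two experimentally determined confidence bands might be applied, the sketch below maps a confidence score to an interpretation. The function name, the label strings, and the handling of scores at or below 6.0 and of a score of exactly 8.5 are assumptions, since the text only specifies the two open ranges.

```python
def interpret_confidence(conf, high=8.5, low=6.0):
    """Map a plane's confidence score to the interpretation bands reported above."""
    if conf > high:
        return "highly likely ground plane"
    if conf > low:
        # Includes conf == high, which the text leaves unspecified (assumption).
        return "parallel to ground plane, possibly the ground plane"
    # Scores at or below the lower bound are not characterized in the text (assumption).
    return "not a confident candidate"

for score in (8.99, 7.5, 4.56):
    print(score, "->", interpret_confidence(score))
```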
In the data preparation step, SIFT generated an average of approximately 4,000 keypoints in each full 2D image with 10 layers in each octave, a contrast threshold of 0.02, a threshold of 20 for filtering out edge-like features, and a sigma of 1.0. The voxel grid down-sampling filter size we selected for the point cloud frames was 2 cm. The RANSAC distance threshold and the cluster tolerance of Euclidean clustering were 2.5 and 2 times the voxel grid filter size (5 cm and 4 cm), respectively. Based on these parameters, we extracted anywhere from 4 to 10 planes from each scene, varying with the indoor environment complexity and camera perspective. In the homography estimation step (section III-B), we set the block size to five to ensure that we achieved sufficient histogram intersection between the reference frame $F_1$ and frame $F_x$. The number of SIFT feature keypoints on the human ranged from 150 to 380 out of the approximately 4,000 keypoints. In each scene where the ground plane was visible, the confidence of the result exceeded 7.5, even if the ground plane occupied only a small fraction of the FOV. Table 1 shows the confidence of the three planes that have the highest confidence, the human object's moving speed, the number of frames the algorithm took to estimate the ground plane, and the total number of ground plane candidates we had before the ground plane estimation checking steps. Notably, in Fig. 2(f), the ground plane is not visible, so we do not identify any actual ground plane; rather we estimate the ground plane

B. PUBLIC DATASETS
We evaluated our algorithm's accuracy on three public datasets: the RGB-D People Dataset [63], [64], the SBM-RGBD Dataset [66] and the TVPR Dataset [65]. The large ground plane that is directly visible in the RGB-D People and TVPR Datasets allows our algorithm to produce high confidence estimates for the ground plane, even higher than those generated from our more challenging data trials, with results shown in Table 2. Figure 4 shows some sample results obtained from the trials in these public datasets.

V. DISCUSSION AND CONCLUSIONS
In this paper, we proposed a novel ground plane estimation method using the combination of 2D and 3D data analyses. Existing ground plane detection approaches require that significant assumptions are met (e.g., that the ground plane is the largest plane in the scene, that the ground plane is at the bottom of the sensor field of view, or that the ground plane is constant in colour or texture). These assumptions are not practical in dynamic or cluttered environments, or in situations where the sensor orientation or location is unknown, requiring more expensive and specialized equipment (e.g., to detect sensor orientation). Our approach robustly finds the indoor ground plane with unrestrictive assumptions: the sensor is an RGB-D camera; at least one person smoothly walks in the scene with most parts of the body visible within the camera field of view; and the human body is perpendicular to the ground plane while walking.
We first segment the point cloud that is generated from a pair of RGB and depth images into planes and object segments, while finding the SIFT 2D keypoints in the RGB image. This fundamental step requires that the large planes and object segments correspond to real-world objects and that a sufficient number of 2D feature keypoints are found. In general scenarios, our algorithm successfully segments all the planes and objects in the scene and provides a sufficient number of SIFT feature points with the parameters we used in the experiments. Our algorithm can fail for one MSAM block if the majority of the human is not segmented as a single object segment, if no planes can be found in the FOV (i.e., RANSAC generates unreasonable planes), or if the 2D feature keypoints are too sparse to generate reliable homographies. However, these issues were resolved for all our trials by processing through the entire trial data set.
In the second step, we project 3D object segments to the 2D RGB image to find the regions that only contain the keypoints belonging to these objects, and apply MSAM to each region to find the decomposition of reliable homographies. MSAM splits the keypoints within each region in a tree structure, taking 30-60 seconds to process with parallel threads, which makes real-time ground plane estimation infeasible. Building object segment sequences within one block and identifying the human is achieved by calculating the histogram intersection ratio between two object segments. This approach is sensitive to movement in any direction; it provided 90% accuracy when matching corresponding object segments within a block, and only fails when the Kinect sensor generates significant depth error. In addition, because of the depth error of our hardware sensor [52], [53], estimating the ground plane with only one block is not guaranteed for a video sequence because object translation could appear to occur in both directions for short sequences.
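The histogram intersection ratio used to match object segments across frames can be illustrated with a minimal sketch. It assumes that each segment's histogram counts its 3D points per cubic spatial bin and that the ratio is normalized by the smaller of the two point counts; the bin size, the normalization, the function names, and the sample points are assumptions rather than the parameters used in our experiments.

```python
from collections import Counter

def histogram_3d(points, bin_size):
    """Count the points of a segment per cubic bin with edge length bin_size."""
    return Counter(
        (int(x // bin_size), int(y // bin_size), int(z // bin_size))
        for x, y, z in points
    )

def intersection_ratio(hist_a, hist_b):
    """Sum of per-bin minima divided by the smaller total count (0.0 if either is empty)."""
    total = min(sum(hist_a.values()), sum(hist_b.values()))
    if total == 0:
        return 0.0
    # Counter returns 0 for missing bins, so non-overlapping bins contribute nothing.
    overlap = sum(min(count, hist_b[key]) for key, count in hist_a.items())
    return overlap / total

# Hypothetical segments: the same object in two frames after a small shift, and a distant object.
frame1 = [(0.1, 0.1, 1.2), (0.2, 0.3, 1.2), (0.1, 0.7, 1.2), (0.1, 1.2, 1.2)]
frame2 = [(0.2, 0.1, 1.2), (0.3, 0.3, 1.2), (0.2, 0.7, 1.2), (0.2, 1.2, 1.2)]
other = [(3.1, 0.1, 1.2), (3.2, 0.3, 1.2)]

h1, h2, h3 = (histogram_3d(seg, 0.5) for seg in (frame1, frame2, other))
print(intersection_ratio(h1, h2))  # high: likely the same object
print(intersection_ratio(h1, h3))  # low: different object
```

A segment whose ratio against its previous-frame counterpart stays high but changes steadily across a block is a candidate moving object, whereas stationary objects keep a ratio close to one.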
The final step builds the ground plane estimation confidence based on homography decomposition vectors, plane relative positions, and the distances between the planes and other objects. With only one iteration of the confidence estimation, our algorithm successfully estimated any ground plane that was large in the FOV. Only one additional iteration was required to retrieve the ground plane if it was smaller in the FOV. Our approach to distinguishing humans from all other objects is naive, mainly depending on the gross shape of the moving object segment and the correlation between the homography trajectory vector and the moving object's bounding box. In some situations, such as when only the torso of the human (which has a similar dimension along both the x- and y-axes) is segmented as a moving object, our algorithm will ignore this potentially valid segment. Similarly, sequences such as those in Figures 2(e) and 2(f) take significantly more frames to estimate the ground plane because the movement of the human's arms and legs changed the dimensions of the human's bounding box. The current solution is to process the full trial data set until the algorithm identifies the human body; this issue could potentially be solved by synchronizing with another sensor in the system viewing the same scene from a different perspective. Furthermore, due to the limitations of our camera's depth sensor (specifically lens distortion), any wall characterized by the x- and y-axes often consisted of multiple layers of points. With a probability of approximately one third, the RANSAC algorithm in the data preparation step yielded one slice of the wall as an object segment, which had an almost equal distance to both the ceiling and the ground plane. Conditions like this led us to increase the confidence weight of the moving objects relative to non-moving objects, enlarging the difference between the ceiling's confidence and the ground plane's confidence.
Specific to the distortion issues associated with segmenting the wall, we also increased the RANSAC distance threshold between models to reduce the number of slices generated from one wall, an issue that could easily be rectified by using a sensor with a higher depth resolution and accuracy. Accordingly, correct ground plane estimation relied heavily on correctly identifying the human (moving object). We noted that increasing the RANSAC distance threshold between models also had drawbacks: multi-plane surfaces, such as stairs (Fig. 3(b)) and stages (Fig. 3(a)), are merged into one plane. Since the resulting single plane representing the stairs has a large angle relative to the ground plane (Fig. 5; approximately 45°), it is found by our algorithm as a potential plane, but is ultimately given a low confidence as the actual ground plane (Fig. 3(b)). Similarly, the plane corresponding to the stage floor barely exceeds our ζ > 6.0 threshold (Fig. 3(a)) because this plane (comprised of points from the stage plane and the lower ground plane) is not parallel to the ground plane, leading to the low confidence provided by the Object Distance Checking step.
We evaluated our approach on our own dataset, which included 24 unique scenarios (e.g., sensor perspectives and orientations, number of persons walking in the scene), as well as on three public datasets (see [63], [64], [66] and [65]), where we included 26 additional scenarios. Our approach robustly estimated the ground plane directly (when the plane was visible) or indirectly (when the plane was not visible) with a large variety of sensor orientations, different ground plane area sizes, room complexities, and multiple persons in the scene in 50 of 50 scenarios (100%). Our experimental results show that our algorithm is insensitive to the movement speed of walking humans and is tolerant to partial occlusion of the human body. In cases where the ground plane is not visible in the scene, we successfully estimated the ground plane formula by translating the plane with the highest confidence in the scene, suggesting that other sensors that can see the ground plane can help to accurately find the ground plane. This is exemplified through two scenes (Figures 2(b) and 2(f)), where we successfully identify the ground plane directly in one case ($\mathrm{conf}_{HP} = 2.99$, $\mathrm{conf}_{RP} = 2.0$, $\mathrm{conf}_{OD} = 3.99$, $\mathrm{conf}_{S} = 8.99$), and indirectly in the other ($\mathrm{conf}_{HP} = 2.56$, $\mathrm{conf}_{RP} = 2.0$, $\mathrm{conf}_{OD} = 0.0$, $\mathrm{conf}_{S} = 4.56$). In all cases, we were able to find the ground plane or a plane parallel to the ground plane using RGB-D sensor data without any pre-calibration or a priori knowledge of the sensor location or orientation.
In the future, we will focus on improving the performance of the algorithm; switching to a better RGB-D sensor that provides higher quality data; enhancing the robustness and accuracy of the human object detection algorithm; and achieving potential human recognition or identification within an RGB-D camera system. In addition, we will also test our algorithm on video sequences that have higher indoor complexity and more people visible in the scene.