Development of a Smart Hallway for Marker-Less Human Foot Tracking and Stride Analysis

Objective: In this research, a marker-less ‘smart hallway’ is proposed in which stride parameters are computed as a person walks through an institutional hallway. Stride analysis is a viable tool for identifying mobility changes, classifying abnormal gait, estimating fall risk, monitoring progression of rehabilitation programs, and indicating progression of nervous system disorders. Methods: The smart hallway was built using multiple Intel RealSense D415 depth cameras. A novel algorithm was developed to track a human foot using combined point cloud data obtained from the smart hallway. A method was implemented to separate the left and right leg point cloud data and then find the average foot dimensions. Foot tracking was achieved by fitting a box with average foot dimensions to the foot, with the box’s base on the foot’s bottom plane. The smart hallway with this novel foot tracking algorithm was tested with 22 able-bodied volunteers by comparing marker-less system stride parameters with Vicon motion analysis output. Results: With the smart hallway frame rate at approximately 60 fps, temporal stride parameter absolute mean differences were less than 30 ms. Random noise around the foot’s point cloud was observed, especially during foot strike phases. This caused errors in medial-lateral axis dependent parameters such as step width and foot angle. Anterior-posterior dependent parameter (stride length, step length) absolute mean differences were less than 25 mm. Conclusion: This novel marker-less smart hallway approach delivered promising results for stride analysis, with small errors for temporal and anterior-posterior stride parameters and reasonable errors for medial-lateral spatial parameters.


I. INTRODUCTION
HUMAN stride analysis in clinical settings is often performed with optical marker tracking systems such as Vicon, Optitrack, and Qualisys, requiring expensive setup, specialized human resources, and dedicated laboratory space. Passive or active markers, such as light-emitting diodes, are placed on the body to track limbs for human gait acquisition and characterization [1]. Inertial measurement units (IMU) can also be attached to body parts to record inertial motion-based kinematic gait data [2]. Affixing external sensors to the human body may cause discomfort to patients and substantially change their natural gait [3]. These systems also require technical expertise for attaching markers and conducting experiments.
Low-cost Kinect depth sensors for gaming showed potential for human gait-related health care applications such as fall risk [4]-[6], Parkinson's disease movement assessment [7], fall detection of people with multiple sclerosis [8], autism disorder identification [9], abnormal gait classification [10], [11], virtual gait training [12], and diagnosis, monitoring, and rehabilitation [13]. Depth sensors capture both depth and color images. Depth data contain the distance, at each pixel, between the depth sensor and objects in the captured scene. With this depth information, real 3D coordinates at each pixel are recorded, with the depth sensor as the origin (i.e., a ''point cloud''). Most research on depth sensors for human movement analysis involved the Microsoft Kinect. Kinect V2 systems can identify and track the majority of human joints by defining joint locations that constitute a human skeleton model or by building a human model from the scene's point cloud. However, gaps and limitations exist with frame rate [14] and lower body tracking, especially at the ankle and foot [14], [15]. Multiple Kinect V2 sensors can capture longer volumes, with promising results, but ankle tracking farther from the sensor was more inconsistent [16]. Kinect's machine learning-based skeleton tracking was not reliable, with tracking points sometimes moving outside the body and tracking varying with viewing angle [17]. Given these limitations, approaches that use the whole point cloud could provide better foot tracking results.
Patients, staff, and visitors typically move through similar hallways in hospitals, rehabilitation centers, or long-term care facilities [18]. This space could be utilized in an intelligent way by building a system to perform marker-less stride analysis as patients or residents walk through the hallway. Measuring patients every time they walk through the hallway, without intervention, could help identify changes in their movement status. To capture multiple strides and avoid occlusions while a person is walking, multiple depth sensors would be required. Unlike the time-of-flight based Kinect V2 sensor, stereoscopic infrared-based Intel RealSense D415 depth sensors were shown to be suitable for this application since they did not experience interference when multiple sensors were used simultaneously [19]. Furthermore, these sensors capture data at 60 fps, which can provide sufficient data for analyzing temporal stride parameters.
This research explored depth sensing technology for stride parameter analysis within an institutional hallway environment. The main contributions were developing, prototyping, and validating a novel point-cloud-based marker-less system with an innovative foot tracking algorithm, and assessing system performance by comparing stride parameter output with industry-standard marker-based motion analysis. Successful implementation of this smart hallway concept would introduce unobtrusive movement status assessment that can guide clinical decision-making, without introducing unsustainable human resource requirements. This would also become the basis for future data analytics applications for predicting changes in dementia, fall risk, or other aging-related conditions.

II. SMART HALLWAY SYSTEM
The six sensors were temporally synchronized (client-server approach) and spatially synchronized (stereo chessboard method).

FIGURE 2. Key points on the chessboard for the room coordinate system (red: x-axis, green: y-axis, blue: z-axis).
The new smart hallway system captures 848 × 480 pixel depth images and color images at approximately 60 fps (the frame rate varies slightly because of buffer time for transmitting data through Ethernet after capturing each frame), with a timestamp for each frame, from all six sensors. To increase accuracy and reduce computation time, a non-zero median filter over every 2 × 2 pixel block was applied to remove ''spikes'' in the depth data and down-sample to half resolution (424 × 240) [21].
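For illustration, a minimal Python/NumPy sketch of this 2 × 2 non-zero median down-sampling is shown below. The function name and the assumption that depth is a uint16 image in millimetres with zeros marking invalid pixels are ours; this is not the system's actual implementation.

```python
import numpy as np

def downsample_depth_nonzero_median(depth):
    """Half-resolution downsample of a depth image (e.g. 848x480 -> 424x240)
    by taking the median of the non-zero values in each 2x2 block.
    Blocks with no valid (non-zero) depth yield 0."""
    h, w = depth.shape
    blocks = depth[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4).astype(np.float32)
    out = np.zeros((h // 2, w // 2), dtype=depth.dtype)
    for r in range(h // 2):
        for c in range(w // 2):
            valid = blocks[r, c][blocks[r, c] > 0]
            if valid.size:
                out[r, c] = np.median(valid)
    return out
```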

A. SENSOR PARAMETERS
Intrinsic parameters such as focal length (f_x, f_y) and principal point (c_x, c_y) for the Intel RealSense D415 depth and color cameras were obtained from the manufacturer. Two coordinate systems (depth, color) and extrinsic parameters (rotation and translation, to transform data between the depth and color coordinate systems) were also obtained. This marker-less six-sensor setup was spatially synchronized in the color coordinate system, such that the combined output point cloud was in the first sensor's color coordinate system.

B. ROOM COORDINATE SYSTEM
A reference coordinate system on the floor plane (Fig. 1) was defined using a chessboard (Fig. 2), with the x-axis in the medial-lateral (ML) direction, the y-axis parallel to the walking pathway (anterior-posterior; AP), and the z-axis perpendicular to the floor (vertical; V). This reference coordinate system was labelled the Room Coordinate System (RCS). Methods from our previous study [20] were used to transform point cloud data from all sensors into the first sensor's color coordinate system, which was then transformed into the RCS by determining a transformation matrix T_R←1 (1).
An 8 × 6 chessboard was placed on the floor with the horizontal edge parallel to the RCS x-axis and the vertical edge parallel to the y-axis. The board's depth and color images were captured with the first sensor at 1280 × 720 resolution. The depth image was down-sampled to half resolution using the median filter. 3D points were calculated from the depth image using the depth intrinsic parameters and then transformed into the color coordinate system using the extrinsic parameters.
A new color image was constructed using the 3D points and their corresponding projection pixels in the captured color image. For a 3D point (x, y, z) in the color coordinate system corresponding to row r_d and column c_d in the depth image, the projected pixel location (r_c, c_c) in the captured color image was found using equations (2) and (3) (color camera's intrinsic parameters). The red, green, and blue channel values at row r_d and column c_d in the constructed color image (Fig. 2) were the values at row r_c and column c_c in the captured color image, and the values of pixels corresponding to invalid 3D data (0, 0, 0) were set to zero.
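The exact forms of equations (2) and (3) are not reproduced here; the sketch below assumes the standard pinhole projection with no lens distortion and illustrates how the registered color image could be constructed. All function and variable names are illustrative.

```python
import numpy as np

def project_to_color_pixel(point, fx, fy, cx, cy):
    """Project a 3D point (x, y, z) in the color coordinate system onto the
    color image plane (pinhole model). Returns (row, col); this is the usual
    form of equations (2) and (3), assuming no lens distortion."""
    x, y, z = point
    col = fx * x / z + cx   # c_c
    row = fy * y / z + cy   # r_c
    return int(round(row)), int(round(col))

def build_registered_color(points_3d, color_img, fx, fy, cx, cy):
    """Construct the new color image: each depth pixel (r_d, c_d) with a valid
    3D point copies RGB from its projected location; invalid points stay zero."""
    h, w, _ = points_3d.shape
    out = np.zeros((h, w, 3), dtype=color_img.dtype)
    for rd in range(h):
        for cd in range(w):
            x, y, z = points_3d[rd, cd]
            if z <= 0:          # invalid 3D data (0, 0, 0)
                continue
            rc, cc = project_to_color_pixel((x, y, z), fx, fy, cx, cy)
            if 0 <= rc < color_img.shape[0] and 0 <= cc < color_img.shape[1]:
                out[rd, cd] = color_img[rc, cc]
    return out
```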
Three key points k_1, k_2, and k_3 were identified in 100 frames. For each key point, 100 instances of its 3D location were obtained in the color coordinate system. Each dimension value (x, y, z) of these 100 3D points was sorted separately and the middle 50 values were averaged.
The RCS origin was at point k_2 (x_0, y_0, z_0). The unit vectors were x̂_R (x_xR, y_xR, z_xR) from k_2 to k_3, ŷ_R (x_yR, y_yR, z_yR) from k_2 to k_1, and ẑ_R (x_zR, y_zR, z_zR) the cross product of x̂_R and ŷ_R, all expressed with respect to the first sensor's color coordinate system, whose origin was at the point (0, 0, 0) with corresponding unit vectors x̂ = (1, 0, 0), ŷ = (0, 1, 0), and ẑ = (0, 0, 1). The transformation matrix from the first sensor to the RCS was obtained using (1).
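Equation (1) is not reproduced here; the following sketch shows one standard way to assemble a homogeneous transform T_R←1 from the three averaged key points, under the assumption that the rotation rows are the RCS unit vectors and the translation maps k_2 to the RCS origin.

```python
import numpy as np

def rcs_transform(k1, k2, k3):
    """Build the 4x4 homogeneous transform T_R<-1 mapping points from the
    first sensor's color coordinate system into the Room Coordinate System.
    k1, k2, k3 are the averaged chessboard key points (sensor-1 coordinates);
    k2 is the RCS origin, k2->k3 gives x_R and k2->k1 gives y_R, as in the text.
    This is a standard reconstruction of equation (1), not the paper's exact form."""
    k1, k2, k3 = map(np.asarray, (k1, k2, k3))
    x_r = (k3 - k2) / np.linalg.norm(k3 - k2)
    y_r = (k1 - k2) / np.linalg.norm(k1 - k2)
    z_r = np.cross(x_r, y_r)
    z_r /= np.linalg.norm(z_r)
    # Rows of the rotation are the RCS unit vectors expressed in sensor-1 coordinates.
    R = np.vstack([x_r, y_r, z_r])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = -R @ k2          # translate so that k2 maps to the RCS origin
    return T

# Usage: p_rcs_hom = rcs_transform(k1, k2, k3) @ np.append(p_sensor1, 1.0)
```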

III. POINT CLOUD
A point cloud was generated from the six depth sensors. The process involved generating a background depth image from a static scene, subtracting the background information from the depth images, constructing walking human point cloud data for each sensor from the background subtracted depth images, and merging and transforming point clouds from the six sensors to RCS. The combined point cloud was filtered and smoothed to reduce noise.

A. BACKGROUND FRAME
From each sensor, 1000 depth frames of background data (without any objects) were captured. The pixel value at row y, column x of these background frames was represented as bf^(j,i)_yx for the i-th frame of the j-th sensor. The system was designed to work in the range of 200 mm to 5000 mm. All background frame pixels for the j-th sensor, BF^(j), were initialized to 5000 (4), then the pixel value at row y, column x (BF^(j)_yx) was updated with the minimum of BF^(j)_yx and bf^(j,i)_yx, iterating through the 1000 frames (i = 1 to 1000) using eq. (5).
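A minimal sketch of this background-frame computation (eqs. (4) and (5)) is given below; ignoring zero-valued (invalid) depth pixels during the minimum update is our assumption, not stated in the text.

```python
import numpy as np

def build_background_frame(frames, max_range=5000):
    """Per-pixel minimum over ~1000 static background depth frames.
    frames: iterable of HxW depth images in mm; zeros (no data) are skipped
    so they do not overwrite valid background depth."""
    bf = None
    for f in frames:
        f = f.astype(np.float32)
        if bf is None:
            bf = np.full_like(f, max_range)          # initialize BF with 5000 mm
        valid = f > 0
        bf[valid] = np.minimum(bf[valid], f[valid])  # keep closest observed depth
    return bf
```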

B. BACKGROUND SUBTRACTION
A background subtracted depth image for the j-th sensor (BS^(j)) was obtained by pixel-wise comparison with the corresponding sensor's background frame (BF^(j)) [22]. For a depth frame from the j-th sensor (DF^(j)), pixel values less than the background frame's pixel value and greater than the minimum value (200 mm) were kept unchanged in the BS^(j) frame. Other pixel values were assigned the maximum value (5000 mm), as presented in eq. (6), for a pixel in the y-th row and x-th column. For further processing, BS^(j) was linearly scaled down from [0, 5000] to [0, 255].
From the scaled-down image (SBS^(j)), a Binary Background Subtracted image (BBS^(j)) was constructed based on eq. (7). A connected component filter [23] with a 1000 pixel connected-area cut-off was applied to the BBS^(j) image, with the output being a Binary Filtered Background Subtracted image (BFBS^(j)). The BS^(j) image was modified based on BFBS^(j): pixel locations with zero value in BFBS^(j) were assigned zero in the BS^(j) frame (8). Sample BFBS frames from all sensors are shown in Fig. 3. White pixels in the BFBS frames were foreground and black pixels were background. Depth data were not captured in the small gaps among foreground pixels.
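The sketch below approximates eqs. (6) to (8) with NumPy and OpenCV's connected-component analysis. The exact threshold used in eq. (7) is not reproduced in the text, so foreground is assumed here to be any pixel below the 5000 mm sentinel value; function names are ours.

```python
import cv2
import numpy as np

MIN_RANGE, MAX_RANGE = 200, 5000   # mm

def background_subtract(depth, background, min_area=1000):
    """Keep pixels closer than the background frame and beyond the minimum
    range, threshold to a binary mask, drop connected components smaller than
    ~1000 px, and zero the rejected pixels (rough analogue of eqs. (6)-(8))."""
    bs = np.where((depth < background) & (depth > MIN_RANGE),
                  depth, MAX_RANGE).astype(np.float32)
    sbs = (bs * 255.0 / MAX_RANGE).astype(np.uint8)      # scale [0, 5000] -> [0, 255]
    bbs = (sbs < 255).astype(np.uint8)                   # assumed foreground criterion
    n, labels, stats, _ = cv2.connectedComponentsWithStats(bbs, connectivity=8)
    bfbs = np.zeros_like(bbs)
    for lbl in range(1, n):                              # label 0 is background
        if stats[lbl, cv2.CC_STAT_AREA] >= min_area:
            bfbs[labels == lbl] = 1
    bs[bfbs == 0] = 0                                    # eq. (8)
    return bs, bfbs
```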

C. POINT CLOUD CONSTRUCTION
3D point cloud points were constructed from each sensor's background-subtracted depth images and then transformed into the first sensor's coordinate system [20]. This ''combined point cloud'' was transformed into the RCS by multiplying with the transformation matrix T_R←1, obtained from (1).
The combined point cloud was filtered using a statistical outlier filter [24], smoothed with a moving least-squares technique [25], and then down-sampled with a voxel grid filter [26]. OpenCV libraries [27] were used for 2D image processing and PCL [28] for 3D point cloud processing.
For every 3D point in a point cloud, 100 neighbor points were analyzed to find outliers. Mean and standard deviation of distances of the closest 100 points from each point of interest were found. Points farther than one standard deviation from the point of interest were considered outliers and removed.
Point cloud points were smoothed by fitting a second-order polynomial equation to points within 30 mm of each point of interest in the point cloud. The point cloud was divided into 5mm × 5mm × 5mm voxels (3D boxes) and then downsampled by replacing points in a voxel with the centroid of these points. This method of down-sampling retained the point cloud surface and reduced computation time for point cloud processing.
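The system used PCL for these steps; as a rough functional equivalent, the sketch below applies Open3D's statistical outlier removal and voxel down-sampling with the same parameter values. The moving least-squares smoothing step has no direct Open3D counterpart and is omitted here.

```python
import numpy as np
import open3d as o3d

def clean_point_cloud(points_rcs):
    """Approximate the paper's PCL pipeline with Open3D equivalents:
    statistical outlier removal (100 neighbours, 1 standard deviation) and
    5 mm voxel down-sampling. Input and output are Nx3 arrays in mm."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(points_rcs, dtype=np.float64))
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=100, std_ratio=1.0)
    pcd = pcd.voxel_down_sample(voxel_size=5.0)   # 5 mm voxels
    return np.asarray(pcd.points)
```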

IV. LEG SEGMENTATION
Since this application tracks a walking person's foot, point cloud points less than 70 cm from the floor were selected, since the foot and shank are always present in this region. The free parameters presented in the following sections were tuned to fit the foot tracking algorithm to an adult's (between 5 feet and 6 feet in height) leg dimensions and to the point cloud density obtained from the six Intel RealSense D415 sensors. These parameters could be fine-tuned based on the person's physical dimensions and point cloud density. This lower leg point cloud was divided into left and right leg point clouds. To segment a current point cloud frame, Euclidean clustering, average leg dimensions (calculated from the point cloud data), and past point cloud frames were used.

A. EUCLIDEAN CLUSTERING
Point cloud points were divided into clusters based on the Euclidean distances between points [29]. The clustering tolerance was 50 mm, which implies that points within a 50 mm radial distance from a point of interest were clustered together.
Each cluster was verified using the number of points and the cluster volume (i.e., volume of the bounding box around the cluster). Point clouds with two clusters, each with a minimum of 1000 points and a cluster volume greater than 75 percent of the average leg volume, were considered to contain data from two legs, and each cluster was considered an individual leg (Fig. 4). A point cloud with a single cluster of more than 1000 points and a volume between 0.75 and 1.25 times the average volume was considered single leg data.
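A simple KD-tree region-growing sketch of this Euclidean clustering (mirroring PCL's EuclideanClusterExtraction with the 50 mm tolerance and the 1000-point minimum described above) might look as follows; function and parameter names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points, tolerance=50.0, min_points=1000):
    """Euclidean clustering by region growing over a KD-tree.
    Returns a list of index arrays, largest clusters first."""
    tree = cKDTree(points)
    unvisited = np.ones(len(points), dtype=bool)
    clusters = []
    for seed in range(len(points)):
        if not unvisited[seed]:
            continue
        queue, members = [seed], []
        unvisited[seed] = False
        while queue:
            idx = queue.pop()
            members.append(idx)
            for nb in tree.query_ball_point(points[idx], r=tolerance):
                if unvisited[nb]:
                    unvisited[nb] = False
                    queue.append(nb)
        if len(members) >= min_points:
            clusters.append(np.array(members))
    return sorted(clusters, key=len, reverse=True)
```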
During mid-swing, when both legs are close together, noise between the legs caused the points to group into a single cluster. In these cases, the two legs' data were identified as a single cluster (Fig. 5), with a cluster volume greater than two times the average leg volume. Therefore, a different approach (''moving points segmentation'') was used to segment the legs (Section IV-C).
Point cloud frames not in one of these three categories (two valid legs, single leg, two legs as a single cluster) were ignored.

B. LEG DIMENSIONS
Leg dimensions were calculated from the closest-fitting oriented bounding box (OBB; Table 1) around the leg point cloud. Frames with two separate Euclidean clusters (two legs), each leg with more than 1000 points, were considered for calculating average leg dimensions from 40 valid leg point clouds (Table 2). Dimensions (l, w, h) were calculated using Algorithm I (Table 1) and then sorted before averaging the middle 20 elements for each dimension.
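Algorithm I itself is given in Table 1; as a generic stand-in, the sketch below computes OBB extents from a PCA of the cluster and averages the middle 20 of 40 sorted values per dimension, as described above. The PCA-based OBB and the function names are our assumptions.

```python
import numpy as np

def obb_dimensions(points):
    """Oriented bounding box extents via PCA (a generic stand-in for
    Algorithm I). Returns the extents sorted largest to smallest."""
    centered = points - points.mean(axis=0)
    # Eigenvectors of the covariance matrix give the box orientation.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    projected = centered @ vecs
    extents = projected.max(axis=0) - projected.min(axis=0)
    return np.sort(extents)[::-1]

def average_leg_dimensions(frames_of_leg_points):
    """Average each dimension over 40 valid frames using the middle 20
    sorted values, as described in the text."""
    dims = np.array([obb_dimensions(p) for p in frames_of_leg_points])  # (40, 3)
    dims.sort(axis=0)          # sort each dimension independently
    return dims[10:30].mean(axis=0)
```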

C. MOVING POINTS SEGMENTATION
For a current point cloud frame (PC_t) with two legs identified as a single cluster, a reference point cloud (PC_ref) from past frames was found using Algorithm III (Table 3).
For every point in the reference point cloud (PC_ref), points within 20 mm in PC_t were categorized as the non-moving leg point cloud PC_1 (repeated points were ignored).
All other points in PC_t were moving points and were categorized as the other leg's point cloud PC_2. Each point cloud had at least 30 percent of the total points in PC_t, and the statistical outlier removal filter was applied (Section III-D). Euclidean clustering (Section IV-A) was applied to PC_1 and PC_2, and the biggest cluster from each point cloud was retained (Fig. 6).
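A minimal sketch of this moving-points split, using a KD-tree nearest-neighbour query with the 20 mm radius, is shown below; the subsequent outlier filtering and re-clustering are not repeated here.

```python
import numpy as np
from scipy.spatial import cKDTree

def split_moving_points(pc_t, pc_ref, radius=20.0):
    """Separate a single-cluster frame PC_t into the non-moving leg (points
    within 20 mm of the reference cloud PC_ref) and the moving leg (all other
    points), following Section IV-C. Inputs are Nx3 arrays in mm."""
    tree = cKDTree(pc_ref)
    dists, _ = tree.query(pc_t, k=1)
    non_moving = pc_t[dists <= radius]   # PC_1
    moving = pc_t[dists > radius]        # PC_2
    return non_moving, moving
```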

V. FOOT TRACKING
Foot tracking was achieved from point cloud data by fitting a box with average foot dimensions around each foot, in each frame. The foot's bottom plane was calculated and used to define bounding box rotation and position. The foot's heel and toe points were based on the walking direction.

A. FOOT DIMENSIONS
Using Algorithm II (Table 2), for a valid frame, the volumes of points in 12 slices were calculated. Each slice's volume was median filtered with both adjacent slice volumes, using filter size = 3 and filter stride = 1 (the first and last elements were left untouched).
The cut-off slice (i.e., the slice defining the top of the foot) was defined by identifying the slice with the maximum volume (V_max_slice) and then scanning upwards to find the first slice with volume less than 60 percent of V_max_slice. The points below this cut-off slice defined the foot. An OBB was calculated around these points (Table 1, Algorithm I) and the OBB dimensions were foot length (f_l), foot width (f_w), and foot height (f_h). These dimensions were found for 40 frames; the values of each dimension were sorted and the middle 20 elements were averaged.
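A small sketch of the cut-off slice search is given below, assuming slice index 0 is at the floor and indices increase upward (this ordering is our assumption; function and parameter names are illustrative).

```python
import numpy as np

def foot_cutoff_slice(slice_volumes, ratio=0.6):
    """Find the slice marking the top of the foot: median-filter the 12 slice
    volumes (size 3, stride 1, end elements untouched), locate the
    maximum-volume slice, then scan upward for the first slice below
    60 percent of that maximum."""
    v = np.asarray(slice_volumes, dtype=float)
    filtered = v.copy()
    for i in range(1, len(v) - 1):
        filtered[i] = np.median(v[i - 1:i + 2])
    i_max = int(np.argmax(filtered))
    for i in range(i_max + 1, len(filtered)):
        if filtered[i] < ratio * filtered[i_max]:
            return i          # points below this slice belong to the foot
    return len(filtered) - 1  # fallback: no slice drops below the threshold
```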

B. FOOT ORIENTED BOUNDING BOX
The foot's bottom plane was found using Algorithm IV (Table 4). Points above this plane within the distance f_h were considered to belong to the foot point cloud (PC_foot), and points between 0.1 times f_h and 0.9 times f_h were segmented as the center foot point cloud (PC_center-foot). PC_center-foot points were projected onto the foot's bottom plane and then Algorithm I was partially applied (until step 6). OBB rotation around the foot (R

C. HEEL AND TOE SEGMENTATION
The point cloud data was transformed using 5000 mm translations in the x and y axes such that the walking pathway was always in the positive xy-plane. This reduced the complexity of further processing and understanding.
Left and right leg segmentation was based on the walking direction, calculated using OBB centroid trajectory. When walking towards the origin along a pathway parallel to the y-axis, the leg closer to the y-axis was the right leg and the other leg was labelled as left. Opposite leg classification was applied when walking away from the origin.
For each foot OBB, the center point (p_OBB^toe) of the front two bottom corners and the center point (p_OBB^heel) of the back two bottom corners were calculated using Algorithm V (Table 5). These points were considered the toe and heel, respectively (Fig. 9).
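Algorithm V is given in Table 5; a simplified stand-in that derives the heel and toe as mid-points of the rear and front bottom-corner pairs, given the travel direction along the y-axis, could look as follows.

```python
import numpy as np

def heel_toe_from_obb(bottom_corners, travel_sign):
    """Given the four bottom corners of the foot OBB (in RCS) and the sign of
    travel along the y-axis (+1 walking away from the origin, -1 walking
    towards it), return (heel, toe) as the mid-points of the back and front
    corner pairs. A simplified stand-in for Algorithm V."""
    corners = np.asarray(bottom_corners)
    order = np.argsort(travel_sign * corners[:, 1])  # sort along travel direction
    heel = corners[order[:2]].mean(axis=0)   # two rearmost corners
    toe = corners[order[2:]].mean(axis=0)    # two foremost corners
    return heel, toe
```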

VI. VALIDATION
The foot tracking algorithm was validated by comparing gold standard Vicon system output with the marker-less smart hallway system. Volunteer walking trials were captured simultaneously with both systems and post-processing filters were applied. This section describes the data collection protocol and post-processing steps.

A. PROTOCOL
Twenty-two able-bodied volunteers were recruited from students and staff at the University of Ottawa. After informed consent, reflective markers were attached to the participant's lower body (Fig. 10) (foot markers were used in this application) and then the participant walked 12 times, with their natural gait and comfortable speed, along a walkway with a 1.5 m capture zone. Data were captured simultaneously with a 13-camera Vicon system at 100 Hz [31] and the new marker-less system at approximately 60 Hz. Since the Vicon system captured more than the 1.5 m walkway, only data within the capture zone of both systems were used for calculating the stride parameters. This protocol was approved by the Research Ethics Board of the University of Ottawa (File number: H-08-18-860, Approval date: 29-10-2018) [32].
In this study, Vicon and the new marker-less system were not synchronized in time. Even though both systems captured data simultaneously, each system was independent. Stride parameters were calculated individually, then synchronized based on spatial foot events information.
B. POST-PROCESSING
3D positions of the left toe, left heel, right toe, and right heel markers were reconstructed using Vicon Nexus software [33]. Gaps in the trial data were filled using cubic spline interpolation and the data were then filtered using a 4th-order dual-pass Butterworth low-pass filter with a 20 Hz cut-off frequency.
Marker-less point cloud data were constructed from the depth images, then 3D locations of toes and heels were tracked. Left toe, left heel, right toe, and right heel were processed independently. Data outliers were statistically filtered, with values two standard deviations or more from the mean removed. Based on time stamp information, trajectory gaps were filled using cubic spline interpolation.
Since the capture time between frames was inconsistent, cubic spline interpolation was used to re-sample the data to 60 Hz. This re-sampled data was low-pass filtered using a 4th-order dual-pass Butterworth filter with a cut-off frequency of 12 Hz. Using cubic spline interpolation, the low-pass filtered data was then re-sampled again to the originally captured timestamps.
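A possible SciPy implementation of this resampling and filtering chain is sketched below; the 2nd-order design passed through filtfilt is assumed to correspond to the 4th-order dual-pass description above, and the function name is ours.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import butter, filtfilt

def smooth_markerless_track(timestamps, values, fs=60.0, cutoff=12.0):
    """Resample an irregularly-timed coordinate track to 60 Hz with a cubic
    spline, dual-pass low-pass filter it at 12 Hz, then resample back to the
    original timestamps."""
    t = np.asarray(timestamps, dtype=float)
    uniform_t = np.arange(t[0], t[-1], 1.0 / fs)
    uniform_v = CubicSpline(t, values)(uniform_t)
    # 2nd-order design + filtfilt -> 4th-order dual-pass response (assumed convention).
    b, a = butter(2, cutoff, btype="low", fs=fs)
    filtered = filtfilt(b, a, uniform_v)
    return CubicSpline(uniform_t, filtered)(t)   # back to original timestamps
```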

VII. STRIDE PARAMETERS
This section describes the stride parameters used with both the Vicon and marker-less systems. The stride parameters were calculated by finding the foot events from the segmented heel and toe points obtained from the foot-tracking algorithm.
Foot events such as foot strike (FS) and foot-off (FO) were identified to calculate stride parameters. Vertical foot coordinates (z-axis) were used to identify FS and FO frames [34].

A. FOOT EVENTS
1) VICON
Peak vertical values in the swing phase were detected for heel (Fig. 11) and toe (Fig. 12) markers. These peaks were based on the zero crossover from positive to negative in the vertical velocity, with the added condition that the peak value be greater than 75 percent of the maximum vertical value.
Between two peaks, FS and FO should only occur once. Zero crossovers from negative to positive in the vertical velocity were concave shaped dips in the vertical displacement graph. These concave dips within the bottom 20 percent  of the vertical range were identified. The FS frame was the minimum dip between the two peaks in heel data (Fig. 11) and FO was the minimum dip in toe data (Fig. 12).
Additional conditions were applied to the minimum concave dips before the first peak and after the last peak. For heel data, the minimum concave dip before the first peak was ignored if its distance (in frames) from the first peak was less than 50 percent of the frame length between the first two heel peaks; for toe data, it was ignored if this distance was greater than 50 percent. Similarly, the number of frames between the last peak and the minimum concave dip after the last peak had to be less than 50 percent of the frame length between the last two peaks for heel data, and greater than 50 percent for toe data.

2) MARKER-LESS
Vertical direction data from the marker-less system were not as smooth as the Vicon data. The foot event frames were initially estimated using AP (y-axis) data and then finalized based on the vertical data.
The frame where the foot reached a stationary state in the AP direction was considered the initial FS frame (Fig. 13). Vertical movements may occur after AP movement has halted, so the closest concave dip within the next five frames of the vertical data was considered the final FS (Fig. 13). For cases with no concave dip, the initially estimated frame was considered the final FS. The final FO frame was determined from an initially estimated FO frame, where AP displacement began (Fig. 14), and the five frames before the initially estimated FO frame in the vertical direction (Fig. 14). This method is detailed in Algorithm VI (Table 6).
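Algorithm VI is given in Table 6; the sketch below only illustrates the FS part of the logic, with an assumed AP-velocity threshold (5 mm/frame) standing in for the stationarity criterion, which is not specified in the text.

```python
import numpy as np

def estimate_foot_strikes(ap, vertical, vel_thresh=5.0, window=5):
    """Initial FS estimate: frame where |AP velocity| first drops below a small
    threshold; final FS: the nearest concave dip in the vertical data within
    the next five frames, if any (otherwise keep the initial estimate)."""
    ap_vel = np.diff(ap, prepend=ap[0])
    stationary = np.abs(ap_vel) < vel_thresh
    # Initial FS candidates: transitions from moving to stationary.
    candidates = np.where(stationary & ~np.roll(stationary, 1))[0]
    final_fs = []
    for f in candidates:
        dip = None
        for i in range(max(f, 1), min(f + window, len(vertical) - 1)):
            if vertical[i] <= vertical[i - 1] and vertical[i] <= vertical[i + 1]:
                dip = i          # concave dip (local minimum)
                break
        final_fs.append(dip if dip is not None else f)
    return final_fs
```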
Left Foot Strike (LFS), Left Foot-Off (LFO), Right Foot Strike (RFS), and Right Foot-Off (RFO) events were validated according to the normal gait cycle event sequence (i.e., after LFO, the expected next event is LFS and then RFO). If multiple LFS events were identified before RFO, the one closest to the RFO event was kept. If no LFS event was identified between LFO and RFO, then the events were ignored.

B. RESULTS
The stride parameters in this research were from one gait cycle (Table 7 ). Primary parameters were directly obtained from the tracking data and the derived parameters were calculated from the primary parameters.
Foot events from Vicon and marker-less systems were synced based on foot position of the first common foot event in the marker-less system.
Stride parameters from the two systems were compared and analyzed. For n samples, with the i-th sample represented as x_i, the mean (µ) and standard deviation (σ) were calculated using (9) and (10), respectively. For a parameter with value v from the Vicon system and value m from the marker-less system, the sample error (e) was calculated using (11). For each primary stride parameter, µ and σ of the error values were calculated. Most values farther than two σ from µ were due to false detection of foot events caused by insufficient capture volume and noisy data. Since these erroneous data were not caused by improper foot tracking, they were categorized as outliers and removed from the analysis. Primary stride parameter inliers were used to calculate the derived stride parameters. For a stride parameter with N samples (inliers), i-th Vicon sample v_i, and i-th marker-less system sample m_i, the mean error (e_µ), error standard deviation (e_σ), absolute mean error (e_µ^abs), absolute error standard deviation (e_σ^abs), minimum error (e_min), maximum error (e_max), Pearson coefficient (r), and percentage of inliers (I_%) were calculated (Table 8).
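For reference, the error statistics reported in Table 8 can be computed for one parameter with a few lines of NumPy; the sign convention e = v - m is an assumption, as equations (9) to (11) are not reproduced here.

```python
import numpy as np

def compare_parameter(vicon, markerless):
    """Error statistics for one stride parameter, given paired inlier samples
    from the two systems (error defined here as e_i = v_i - m_i)."""
    v = np.asarray(vicon, dtype=float)
    m = np.asarray(markerless, dtype=float)
    e = v - m
    return {
        "mean_error": e.mean(),
        "error_std": e.std(ddof=1),
        "abs_mean_error": np.abs(e).mean(),
        "abs_error_std": np.abs(e).std(ddof=1),
        "min_error": e.min(),
        "max_error": e.max(),
        "pearson_r": np.corrcoef(v, m)[0, 1],
    }
```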
The mean and absolute mean errors for step length and step time were within the minimum detectable change (MDC) for older people (age: mean = 78.09 years, standard deviation = 6.2) (step length MDC_95 = 47 mm, step time MDC_95 = 42 ms). However, step width errors were slightly greater than the MDC (left step width mean error = 24.15 mm, left step width absolute mean error = 29.86 mm, right step width mean error = 27.42 mm, right step width absolute mean error = 32.27 mm, step width MDC_95 for older people = 20 mm) [35]. The best Pearson correlation coefficient was for stride speed (r = 0.98) and the lowest values were obtained for left foot angle (r = 0.08) and left foot clearance (r = 0.18).

C. DISCUSSION
The novel smart hallway system successfully tracked the foot and provided viable stride parameter output that could be used for decision-making in most cases. The marker-less system had small mean absolute errors for the majority of stride parameters compared with the Vicon system. For all parameters, greater than 90% of samples were inliers. Most outliers were due to limitations of the capture zone and noise from the sensors.
To the best of our knowledge, this research is the first to report foot clearance with marker-less depth sensors. A maximum absolute mean error of 1.25 cm was observed for right foot clearance, which was too large for clinical assessment purposes. With a marker-less system frame rate at approximately 60fps, all temporal stride parameters were accurate within 10 ms mean error and 30 ms absolute mean error. Errors in spatial stride parameters were due to ''floor-plane to foot plantar surface'' noise generated in the depth images during foot contact phases.
In comparison with Kinect V2 based studies [16], [17], mean errors for walking speed, stride length, and step length were in a similar range, step width mean errors were higher, and temporal parameter mean errors (step time and stride time) showed better accuracy.
The new foot tracking algorithm, based on the fixed size OBB and foot bottom plane to define foot orientation, counteracted the AP noise to some extent. Average errors were higher in ML dependent stride parameters such as step width and foot angle.
Based on the errors and the comparison with the MDC for older people [35], this novel marker-less system has the potential to perform stride analysis on a large population of older people in institutional hallways. This novel foot tracking algorithm could obtain more accurate stride parameters with better (less noisy) point cloud data.

VIII. CONCLUSION
In this research, we proposed a smart hallway using depth sensors for foot tracking and stride parameter analysis. With six temporally and spatially synchronized Intel RealSense D415 depth sensors, depth data were successfully background-subtracted and merged to form a walking human's point cloud time series. The point cloud was then effectively segmented into left and right leg point clouds. A bounding box was fitted around the foot in each leg's point cloud data. The bounding box around the foot in each frame enabled foot tracking and stride parameter calculation. Most stride parameters obtained from this newly developed marker-less system compared favorably with gold standard Vicon system output. While the marker-less system had promising results, with accurate temporal stride parameters and small errors in spatial stride parameters, step width accuracy needs to improve, and poor foot angle accuracy was observed due to noise around the foot as it approached the floor plane. Since the foot clearance error was greater than 1 cm, while foot clearance itself varied between 2 and 3.2 cm, this error would need to be reduced to provide usable results for clinical decision-making.
Unlike the machine learning based skeleton tracking systems, foot landmarks from our proposed system never move outside the foot and data are captured at approximately 60 fps. This system could monitor a large number of people for long hours with no preparation time (no sensors attached to the body), without any discomfort, and without expert intervention.