Deep Learning for Monitoring of Human Gait: A Review

The essential human gait parameters are briefly reviewed, followed by a detailed review of the state of the art in deep learning for human gait analysis. The modalities for capturing the gait data are grouped according to the sensing technology: video sequences, wearable sensors, and floor sensors, as well as the publicly available datasets. The established artificial neural network architectures for deep learning are reviewed for each group, and their performance is compared with particular emphasis on the spatiotemporal character of gait data and the motivation for multi-sensor, multi-modality fusion. It is shown that by most of the essential metrics, deep learning convolutional neural networks typically outperform shallow learning models. In the light of the discussed character of gait data, this is attributed to the possibility to extract the gait features automatically in deep learning as opposed to the shallow learning from the handcrafted gait features.

Abdullah S. Alharthi, Syed U. Yunas, and Krikor B. Ozanyan, Senior Member, IEEE
Index Terms-Deep learning, floor sensor, gait, neural network, sensor fusion, video sequence, wearable sensor.

I. INTRODUCTION
GAIT refers to the displacement of the center of gravity during locomotion. In humans, it is achieved through the synchronized movement of the lower limbs and the trunk, resulting in a move from one position to the other [1]. It is a unique behavioral trait for every human being, influenced by mutually independent factors, such as weight, gender and age.
The rich history of gait analysis is a record of a steady progression from descriptive studies to more sophisticated methods. Aristotle (350 BC) was the first to take note of animal and human gait [2]. However, useful descriptions of how humans walk were first achieved in the works of Newton, Galileo and Leonardo da Vinci. Borelli, a student of Galileo and the father of biomechanics [3], gave a considerable impetus to scientific approaches to gait analysis by measuring the center of gravity of the human body and how humans keep balance while walking [4]. In 1836, the Weber brothers described gait as a periodic movement and defined the gait cycle on the basis of the pendulum-like forward leg motion [5]. In 1878, Muybridge used 12 cameras to capture racehorse gait to prove that all four horse hooves were off the ground while trotting. He also used a similar approach to capture a series of photographs of human movement [6]. The first substantial quantitative use of gait analysis was in 1895 [4], when Braune and Fischer used a photographic technique to determine a human body's velocity, acceleration, and dimensional trajectory to estimate the forces involved during the gait cycle. In the 1930s, Bernstein studied the dynamic locomotion of 150 subjects to determine the center of gravity of each limb segment of the subjects using a photographic technique [7].
Ground Reaction Force (GRF) was introduced in human gait understanding in 1924 when Cavanagh and Lafortune [8] designed a force plate to measure the magnitude and the direction of GRF. The platform was improved by Elftman in 1938 using a high-speed cinematic camera to capture a pointer movement resulting from the force applied to the platform [9]. A substantial amount of knowledge was contributed to the human locomotion analysis in the 1950s, with the motivation to treat World War II veterans [10].
In the past two decades, the rapid rise in the capabilities of sensor systems involving analytical computing technologies has allowed the extraction of richer information from an increasing number of sensing modalities. In this context, developments in new gait-sensing instrumentation have underpinned the progress in the evaluation of different human locomotion parameters based on an ever increasing volume and quality of data. Understandably, this has also raised awareness of challenges brought forward by the necessity to achieve multi-source, multi-sensor fusion from big data with diverse characteristics. Furthermore, it is unclear whether the complex character of gait maps adequately onto simple and widely used measurands, typically delivered by systems for fast and reliable diagnostics, recognition and classification. However, progress in machine learning technology has resulted in deep learning models that can be applied with minimal pre-processing on complex data and are capable of faster, more accurate results from databases that are constantly growing in volume and range. This presents new opportunities for detection, fusion and classification from different multi-source, multi-sensor data. Among these, gait spatiotemporal parameters are currently attracting attention due to the possibility of using such information in a variety of applications, e.g., healthcare [11], [12], sport [13], [14], and identification of individuals for security [15], [16].
Gait analysis is still on its way to maturity, and there is no gold standard sensing or data processing method. Further in this Review, we organize the modalities mostly used to study human gait into three groups, based on the sensing principle as well as the amount and character of the generated sensor data: video sequence (VS), wearable sensors (WS), and floor sensors (FS). We show that the sensing principle used for this grouping also shapes the choice of deep learning processing methodology: the VS solutions are based on action recognition using spatiotemporal information; WS systems typically comprise inertial sensors to acquire human body velocity, acceleration and orientation during physical human activity; FS characteristically monitor the GRF induced by floor contact during the gait cycle. The data captured from these modalities is analyzed and classified using sophisticated supervised learning methods, based on appropriate assumptions.
This review is underpinned by an extensive literature search but only the most recent works, combining gait recognition with deep learning algorithms, are presented in more detail.

II. BACKGROUND
To outline the contribution of deep learning in human gait analysis, it is necessary to understand how humans walk and the applications of gait giving rise to the set of methods utilized for analysis.

A. Gait Parameters
Gait can be perceived as a transformation of a brain activity to muscle contraction patterns resulting in a walking sequence. It is a chain of commands generated in the brain and transmitted through the spinal cord to activate the lower neural center, which will consequently result in muscle contraction patterns assisted by sensory feedback from joints, muscles and other receptors to control the movements. This will result in the feet recurrently contacting the ground surface to move the trunk and lower limbs in a coordinated way, delivering a change in the body center-of-mass position.
Gait is a sequence of periodic events characterized as repetitive cycles for each foot [4]. Each cycle is divided into two phases (see figure 1):
a) Stance Phase (approximately 60% of the gait cycle, with the foot in contact with the ground). This phase is subdivided into four intervals (A, B, C, D).
b) Swing Phase (approximately 40% of the gait cycle, with the foot swinging and not in contact with the ground). This phase is subdivided into three intervals (E, F, G).
A-Heel strike or Initial contact: It starts the moment the foot touches the ground, and it is the initial double-limb support interval. In the case of the right foot leading, the double support starts with the left foot being on the ground when the right foot heel makes initial contact and finishes when the left foot leaves the ground with the left toe-off prepared to swing. At the end of this interval, the body weight is completely shifted onto the stance (leading) limb.
B-Loading response or Foot flat: This is a single support interval following the double support interval. The trunk is at its lowest position, the knee is flexed, and a plantarflexion occurs at the ankle.
C-Mid-stance: This is a single support interval between opposite toe-off and heel-off. The trunk is at its highest point.
D-Terminal stance or Heel-off: The heel rises in preparation for opposite swing. The trunk is sinking from its highest point, the knee has an extension peak near the time of heel rise and the ankle has dorsiflexion after heel rise.
E-Pre-swing: This is the second double-limb support interval. The opposite initial contact occurs, the hip is beginning to flex, the knee is flexing, and the ankle is at plantarflexion. The toe is in last contact before the swing, finishing the push-off started in interval D.
F-Initial swing and Mid-swing: This interval begins with the toe-off into single support and the start of the swing. The body weight is shifted to the opposite forefoot. At this instant, the knee joint reaches maximum flexion. The hip is flexing and the limb advances in preparation for a stride.
G-Terminal swing: This is the last interval of the gait cycle and the end of the swing phase. The interval begins at maximum knee flexion and ends with maximum extension of the swinging limb forward. The hip continues flexion and the knee extends in regard to gravity; the ankle continues dorsiflexion to end neutral, ready for the heel strike.
With regard to the above gait events, the following parameters of human gait are usually analyzed in clinical settings [17] for healthcare tasks, using various sensing and data processing methods:
• Cadence or rhythm (number of steps per unit time)
• Body posture
It is worth mentioning at this point that, while the listed parameters have clear observational value, it is difficult to claim that any of these, or their combination, would represent the maximum variability of the raw data due to a health condition. This difficulty has a direct impact on the ability to detect, with the lowest threshold affordable by the raw data quality, meaningful deviations from the norm.
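As an illustration of how temporal parameters such as cadence and the stance/swing split follow from gait events, the sketch below derives them from heel-strike and toe-off times of one foot. The function name, the input format and the event times are assumptions made for this example; they do not come from any of the reviewed systems, and the cadence estimate assumes a symmetric gait.

```python
def gait_temporal_parameters(heel_strikes, toe_offs):
    """Estimate basic temporal gait parameters for one foot.

    heel_strikes: ascending times (s) of successive heel strikes of the same foot.
    toe_offs: time (s) of the toe-off following each heel strike, one per
              gait cycle, so len(toe_offs) == len(heel_strikes) - 1.
    """
    if len(heel_strikes) < 2 or len(toe_offs) != len(heel_strikes) - 1:
        raise ValueError("need N heel strikes and N-1 toe-offs")

    stride_times, stance_fractions = [], []
    for i, toe_off in enumerate(toe_offs):
        start, end = heel_strikes[i], heel_strikes[i + 1]
        stride = end - start          # one full gait cycle of this foot
        stance = toe_off - start      # foot in contact with the ground
        stride_times.append(stride)
        stance_fractions.append(stance / stride)

    mean_stride = sum(stride_times) / len(stride_times)
    mean_stance = sum(stance_fractions) / len(stance_fractions)
    return {
        "stride_time_s": mean_stride,
        # Two steps (one per foot) per stride, assuming a symmetric gait.
        "cadence_steps_per_min": 2 * 60.0 / mean_stride,
        "stance_percent": 100.0 * mean_stance,
        "swing_percent": 100.0 * (1.0 - mean_stance),
    }


# Hypothetical event times: 1 s strides with toe-off 0.6 s after heel strike.
params = gait_temporal_parameters(
    heel_strikes=[0.0, 1.0, 2.0, 3.0],
    toe_offs=[0.6, 1.6, 2.6],
)
```

With these made-up times the stance phase occupies about 60% of the cycle, matching the approximate split quoted above.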

B. Applications of Gait Analysis
The field of research in human gait is broad, with many specific applications. In medical applications, as gait abnormality affects a high percentage of the population, gait is studied to diagnose neurodegenerative diseases such as Parkinson's disease (PD), myelopathies, spinal amyotrophy, multiple sclerosis, cerebellar ataxia, brain tumors, cranioencephalic trauma, certain types of dementia, neuromuscular diseases etc. [17]. In fact, the ground reaction force of individuals during the gait cycle has been used to detect PD in [18]. The study shows that stance time, swing time, stride time and foot strike profiles can be used to distinguish PD patients from healthy controls. In addition, the spatiotemporal parameters of gait have been studied [19] to assess lower limb prosthesis users.
In security applications, gait analysis as a biometric has proven successful in distinguishing and identifying people, with minimum cooperation required from the subject. The aim is to identify individuals from a distance based on their walking habit. Typically, an individual's gait is captured by CCTV cameras as reported in [20], [21]. In [22], [23], the ground reaction force has been found to be significant in identifying subjects based on their footstep signals and stepping behavior.
Injuries commonly occur during sports activity and some methods to evaluate athletes' recovery are based on gait, e.g. by analyzing forces exerted on each muscle through electromyography in [24]. The kinematic parameters of gait are used to analyze various indoor and outdoor activities, such as sports training and clinical rehabilitation of patients using wearable sensors [13]. Different gait characteristics assessment methods are even used to assess athletes' ability to return to sport after surgery due to a tear in the anterior cruciate ligament, which causes knee instability [25]. Further, the gait dual-task paradigm for comprehensive athlete evaluation following a sports-related concussion is reviewed in [14].
It is interesting to note that gait analysis is utilized to classify a person's gender based on their gait [26]. Furthermore, attempts to identify a person's emotional state, such as pride, happiness, fear and anger, have been based on gait [27].

C. Deep Learning for Gait Analysis
Supervised machine learning is a branch of artificial intelligence (AI) and a specific kind of machine learning. Algorithms or mathematical models are built and trained with a given set of inputs and desired outputs. A learning algorithm trains the model based on one of two learning styles, shallow learning or deep learning, to produce a trained "machine" that carries out the desired task. The models are tested by exploring the data structure based on the learned mapping function to assign a hypothesis class, which is controlled by the user to evaluate the model performance [28]. Shallow learning depends on handcrafted features and a predefined form of relationship between the inputs and the output; examples are linear regression, logistic regression, decision tree, Support Vector Machine (SVM), random forest, naïve Bayes, and k-nearest neighbor.
Deep structured learning or hierarchical learning is inspired by the structure and function of biological neural networks. It is based initially on the concept of the multi-layer Artificial Neural Network (ANN) with the aim to learn data representations automatically; thus, deep learning becomes the method of choice where the classification features, if known at all, are complex, with no straightforward quantitative relation to the raw data. Typically, the term 'deep' refers to the number of layers in the variety of possible network structures: Deep Belief Networks (DBN), Feedforward Deep Networks (FDN), Boltzmann Machine (BM), Generative Adversarial Networks (GAN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM), a special kind of RNN. A comprehensive presentation of the theory of ANNs and deep learning is not within the scope of this Review, and the reader is referred to established sources [29]. Further, we focus on models with practical significance for gait applications such as CNN and LSTM [30].
The CNN model is suitable for processing 1D, 2D or 3D data that has a known grid-like topology [31]. The network has the ability to learn a high level of abstraction and features from large datasets by applying a convolution operation to the input data. Commonly, the network consists of convolution layers, pooling layers and normalization layers, with a set of filters and weights shared among these layers.
The convolutional layers output a feature map harvested automatically from the raw input data. The pooling layers are utilized to reduce the size of the representation and make the convolution layer output more robust [29], [30]. The CNN model commonly uses two types of pooling layers: max pooling and average pooling. Layer outputs are passed through activation functions (e.g., Sigmoid, Tanh, ReLU, Leaky ReLU), which are applied to the weighted sum of a neuron's inputs plus a bias and decide whether the neuron fires or not [32].
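The chain of convolution, activation and pooling described above can be sketched on a 1D signal in a few lines. This is a minimal illustration in plain Python with a hand-picked kernel and input (both assumptions for this example); in a trained CNN the kernel weights and bias are learned from data.

```python
def conv1d(signal, kernel, bias=0.0):
    """'Valid' 1D convolution (cross-correlation, as used in CNN layers)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """ReLU activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical 1D sensor trace and a simple "rising edge" kernel.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0]
kernel = [-1.0, 0.0, 1.0]

feature_map = relu(conv1d(signal, kernel))   # [3.0, 1.0, 0.0, 0.0, 1.0]
pooled = max_pool1d(feature_map, size=2)     # [3.0, 0.0]
```

The pooled output is shorter than the feature map and keeps only the strongest response in each window, which is the size reduction and robustness referred to in the text.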
LSTM networks are favorable for processing time series data, where the order is of importance, such as gait data sequences. In essence, they exploit recurrence, by using information from a previous forward pass over the network.
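To make the recurrence concrete, the following sketch implements one step of a scalar LSTM cell in its commonly used form (input, forget and output gates plus a candidate update) and runs it over a short sequence. The weights are hypothetical and untrained, chosen only for illustration; the point is that the final hidden state depends on the order of the inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params[gate] = (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate cell update
    c = f * c_prev + i * g            # new cell state (the "memory")
    h = o * math.tanh(c)              # new hidden state (the output)
    return h, c


def run_lstm(sequence, params):
    """Feed a sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h


# Hypothetical, untrained weights used only to demonstrate the mechanics.
params = {gate: (0.5, 0.5, 0.0)
          for gate in ("input", "forget", "output", "candidate")}

h_forward = run_lstm([1.0, 0.0, -1.0], params)
h_reversed = run_lstm([-1.0, 0.0, 1.0], params)
```

The two runs see the same values in opposite order and end in different hidden states, which is the property that makes recurrent models suited to ordered gait sequences.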
The computational complexities of deep learning are not specific to gait applications. The goal of using ANNs in gait analysis is to develop a model to extract gait features and perform well on unseen real-world gait data with high prediction accuracy. Commonly, for appropriate training and testing, the model is trained and validated on 70% of the data and tested on the remaining 30%. In supervised training, the procedure is launched by initializing the weights randomly, processing the inputs and comparing the resultant output against the desired output. During training, the weights and biases are adjusted in every iteration, until the error is minimized, and validation is used to estimate the model performance during training. Lastly, the model is tested with unseen data, which allows over-training to be identified.
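The training procedure above can be sketched with the simplest possible "network", a single logistic neuron trained by gradient descent. The data are synthetic and the feature is hypothetical; the 70/30 split, the random initialization and the iterative weight updates follow the description in the text, while the separate validation step is omitted for brevity.

```python
import math
import random

random.seed(0)

# Synthetic, hypothetical 1D "gait feature": class 0 centred at -2, class 1 at +2.
data = [(random.gauss(-2.0, 0.5), 0) for _ in range(50)]
data += [(random.gauss(2.0, 0.5), 1) for _ in range(50)]
random.shuffle(data)

split = int(0.7 * len(data))          # 70% for training, 30% held out for testing
train, test = data[:split], data[split:]

# Random initialization of the weight and bias.
w = random.uniform(-0.1, 0.1)
b = random.uniform(-0.1, 0.1)
learning_rate = 0.1


def predict(x):
    """Probability of class 1 from a single logistic neuron."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


for epoch in range(200):
    grad_w = grad_b = 0.0
    for x, y in train:
        error = predict(x) - y        # gradient of the cross-entropy w.r.t. the logit
        grad_w += error * x
        grad_b += error
    # Adjust weight and bias against the mean gradient in every iteration.
    w -= learning_rate * grad_w / len(train)
    b -= learning_rate * grad_b / len(train)


def accuracy(samples):
    hits = sum((predict(x) >= 0.5) == bool(y) for x, y in samples)
    return hits / len(samples)


train_acc, test_acc = accuracy(train), accuracy(test)
```

A test accuracy far below the training accuracy would be the sign of over-training mentioned above; on this easily separable toy data the two are expected to be close.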
A widely used accuracy measure for ANN gait analysis is the confusion matrix [33]. It is a table to visualize the number of predictions classified correctly and wrongly for each class. The table consists of true positive, true negative, false positive, and false negative classification occurrences. One of the advantages of the confusion matrix display is that it is straightforward to identify the decision confusions, thus possibly concluding on the quality of the model and data involved.
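A confusion matrix is simple to build from lists of true and predicted labels, as in the sketch below. The labels and predictions are invented for illustration, with "PD" treated as the positive class; rows correspond to true classes and columns to predicted classes.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return matrix[i][j] = number of samples of true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical ground truth and model predictions for six walking trials.
y_true = ["healthy", "healthy", "PD", "PD", "PD", "healthy"]
y_pred = ["healthy", "PD", "PD", "PD", "healthy", "healthy"]

cm = confusion_matrix(y_true, y_pred, labels=["healthy", "PD"])
tn, fp = cm[0]    # true "healthy": correctly rejected / wrongly flagged as PD
fn, tp = cm[1]    # true "PD": missed / correctly detected
accuracy = (tp + tn) / len(y_true)
```

Here `cm` is `[[2, 1], [1, 2]]`: the off-diagonal entries show exactly which classes are confused with each other, information that a single accuracy figure hides.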

III. GAIT MODALITIES
The evolution of research in gait analysis suggests that, in order to capture the distinctiveness of gait, the various sensing modalities attempt to access biomechanical measures pertaining to the body's physical dimensions, body part masses, or the time-varying muscle-generated forces applied during the gait cycle. In the past decades, a number of modalities have proven their ability to capture gait characteristics and anomalies; however, the historically established methods used to analyze gait rely heavily on handcrafted features. With such an approach, salient features of the problem may be lost in the process of feature engineering, and the classification result can be data dependent. This can be mitigated by utilizing deep learning for its capability of automatic feature extraction, delivering high statistical confidence by learning rich features of gait patterns from sensor data. Sensing modalities for gait data capture used in conjunction with deep learning can be divided into three main groups: video sequence (VS), wearable sensors (WS), and floor sensors (FS); further, each of these is described in more detail. In addition, the different types of algorithms typically applied to analyze gait data are presented and their ability to adjust to the characteristics of a modality and/or scenario is elucidated.

A. Video Sequence
Gait recognition based on VS has been driven by the advances in general machine learning and image processing methods. The most common aim is to distinguish the identity of a person from a distance. A typical VS system consists of several cameras with optics suitable for capturing the gait cycle. Common VS data sources are suitably positioned CCTV cameras. The information gathered in the form of sequential video frames is subjected to image processing techniques, such as threshold filtering, edge detection, pixel count, background segmentation, counting of light and dark pixels, and converting images to black and white [17]. Gait recognition based on VS in the literature is sub-divided into skeleton model-based and skeleton model-free categories. (The above sub-division refers to skeleton models, not machine learning models.) The model-based approach is in essence fitting video sequences of gait to multi-segment skeleton models, as proposed in [34], [35]. This method is computationally expensive, because of fitting skeletal segment models on sensor data, as well as the need to use the model-derived parameters to extract features. The extracted features are classified using shallow learning methods.
Fig. 2. From right to left: video sequence, silhouette images and GEI image [20].
The model-free approach is based on extracting gait from VS using feature engineering, as proposed in [36], [37]. Here, deep learning is utilized to automatically extract gait features from VS, which maximizes the use of data variability and eliminates the dependence on handcrafting. Most of the available model-free processed data is represented by Gait Energy Image (GEI), maps of optical flow and silhouettes [38] or Chrono-Gait Images (CGI) [39], [40]. These representations, extracted from VS, can capture both spatial and temporal information. As an example representation, GEI is defined mathematically as:
$$G(x, y) = \frac{1}{s} \sum_{t=1}^{s} F_t(x, y)$$
where $s$ is the total number of frames representing one gait cycle, and $F_t(x, y)$ is the binary silhouette of the subject at time $t$. Figure 2 shows schematically the extraction of GEI from the video sequence.
1) Video Sequence Databases: Once the VS representation algorithm is implemented, the machine learning model must be trained, validated and tested to assess its performance. The widely used benchmark is to train and test the algorithm with the following datasets (in chronological order of availability): CMU Motion of Body (MoBo) [41], USF Gait Based Human ID Challenge [42], CASIA [43], OU-ISIR treadmill [44], OU-ISIR [45] and TUM-GAID [46].
The Carnegie Mellon University Robotics Institute Motion of Body (MoBo) dataset [41] encompasses 25 subjects performing four different walking patterns on a treadmill, namely slow walk, fast walk, incline walk and walking with a ball. The subjects' gait is captured by six high-resolution cameras, distributed around the treadmill.
The University of South Florida Gait Based Human ID Challenge dataset [42] captures 122 subjects walking outside with shoes and clothes variations, as well as under different carrying load conditions. Gait is captured from a single viewing angle.
The Chinese Academy of Sciences Institute of Automation Gait Database CASIA [43] is divided into A, B, C, and D datasets. The CASIA A dataset contains 20 people; for each person, it contains 12 image sequences, four sequences for each of 3 angles (0, 45 and 90 degrees) to the image plane. The CASIA B dataset consists of 124 subjects' gait sequences captured from 11 views. The subjects performed normal walking, wearing a coat while walking, and carrying a bag while walking. The CASIA C dataset was captured by an infrared (thermal) camera from 153 subjects performing normal walking, slow walking, fast walking, and normal walking with a bag. The video sequence was taken from one angle at night time. The CASIA D dataset contains the video sequence and footprint image scans of 88 subjects with a wide age distribution. The video sequence is captured from a single angle and with no variations in clothing and carrying conditions.
The Osaka University Institute of Scientific and Industrial Research treadmill dataset, OU-ISIR treadmill [44], contains 200 subjects' gait captured on a treadmill by 25 cameras from different angles, 34 subjects walking at different speeds and 68 subjects with 32 clothing variations. The dataset is distributed in the form of silhouette sequences of subjects while walking on a treadmill. The same group's database on normal surface walking (not involving a treadmill), the OU-ISIR [45] dataset, contains 4,007 subjects (2,135 males and 1,872 females) with ages from 1 to 94 years. The dataset consists of silhouette sequences of the subjects' gait captured by two cameras.
The Technical University of Munich Gait from Audio, Image and Depth database, TUM-GAID [46], contains 305 subjects' gait captured by video recording cameras at a single angle, while subjects walk indoors in both directions. Six walking conditions are captured for each subject from the side view, namely four normal walks, one with coating shoes and one without (left and right), and two normal walks carrying a backpack (left and right). 32 subjects of the cohort are recorded in two sessions (January and April), adding clothes variation.
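Several of the datasets above are distributed as silhouette sequences, from which a GEI can be computed directly as the per-pixel average of the binary silhouettes $F_t(x, y)$ over the $s$ frames of one gait cycle. The sketch below does this for tiny made-up frames stored as nested lists; real silhouettes would be much larger images, and the function name is an assumption for this example.

```python
def gait_energy_image(silhouettes):
    """Average s binary silhouette frames F_t(x, y) into one GEI.

    silhouettes: list of s frames, each a list of rows of 0/1 pixel values.
    Returns a frame of the same size with values in [0, 1].
    """
    s = len(silhouettes)
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    return [
        [sum(frame[y][x] for frame in silhouettes) / s for x in range(width)]
        for y in range(height)
    ]


# Four hypothetical 2x3 binary silhouettes covering one gait cycle.
frames = [
    [[0, 1, 0], [0, 1, 0]],
    [[0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0]],
    [[0, 1, 0], [0, 1, 0]],
]
gei = gait_energy_image(frames)   # [[0.0, 1.0, 0.25], [0.25, 1.0, 0.0]]
```

Pixels occupied in every frame (the static body region) come out as 1.0, while pixels occupied only in some frames (the moving limbs) take intermediate values, which is how a single image retains temporal information.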
2) CNN Architectures: Table I summarizes the results yielded by gait recognition VS models, comparing deep convolutional ANNs with automatic feature extraction to shallow learning algorithms, where features are handcrafted. Deep learning models can be split into two groups: a single deep ANN, and multiple deep ANNs joined in the last layer. The network inputs are a single processed silhouette sequence or a pair of them. The latter case is mostly used for verifying an individual's identity, with a view of 'probe and gallery' gait features. The 'probe' is the subject to be identified or verified, and the 'gallery' consists of templates as a browsing data set, where the probe is searched and matched to the closest instance in the gallery. These are examined in more detail below, for gait identification or verification.

a) Single deep ANNs:
The single ANN input is a video sequence of images, on which the top softmax layer will perform classification based on the desired output for the given input. Ideally, the softmax output is 1 for the true-match subject and 0 for false-match subjects. During validation, the loss is computed using cross-entropy between the softmax outputs and the corresponding desired output (the ground truth). Single CNN with a single input architecture has been investigated by a number of groups, with some examples outlined below. The authors of [47] used a CNN model trained on a single GEI input. For testing, the softmax classifier in the last layer based on Euclidean distance is replaced by a Support Vector Machine (SVM) classifier to compute one-vs-all (probe vs gallery). The model, evaluated on the OU-ISIR Treadmill dataset, yielded competitive performance in clothing-invariant identification of people.
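The softmax classification and cross-entropy loss mentioned above can be written out in a few lines. The logits below are hypothetical scores for three enrolled subjects, not outputs of any of the reviewed networks.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss against a one-hot ground truth: -log of the true-class probability."""
    return -math.log(probs[true_index])


# Hypothetical last-layer scores for three subjects; subject 0 is the true match.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
predicted = probs.index(max(probs))
loss_if_correct = cross_entropy(probs, 0)
loss_if_wrong = cross_entropy(probs, 2)
```

The loss is small when the network assigns a high probability to the true subject and grows as that probability shrinks, which is what drives the softmax output towards the ideal 1/0 pattern during training.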
Yan et al. [20] proposed a CNN model with a Multilayer Perceptron (MLP) classifier. The input is a single GEI for automatic extraction of gait features. The CASIA-B dataset is used for evaluating the methods. The model is trained using multitask learning to predict multiple human attributes. An accuracy of 95.88% is achieved for each task; however, it was noted that generalization to changes of scene or view could be improved by training on more data.
Shiraga et al. [48] designed GEINet, which is a CNN with two sequential groups. The network input is a single GEI image (from the OU-ISIR database) in the training stage. In the testing stage, the dissimilarity between a probe GEI and gallery GEI pair is computed using the distance between them at the fully connected layer. The model performs well on cross-view for gait verification and identification.
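The probe-versus-gallery matching used at test time can be illustrated with a nearest-neighbour search on feature vectors. This is a generic sketch, not the GEINet implementation: the three-element vectors stand in for features taken from a network's fully connected layer, and the subject names and threshold are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(probe, gallery):
    """Identification: return the gallery subject closest to the probe."""
    return min(gallery, key=lambda subject: euclidean(probe, gallery[subject]))


def verify(probe, template, threshold):
    """Verification: accept the claimed identity if the distance is small enough."""
    return euclidean(probe, template) <= threshold


# Hypothetical feature vectors (e.g. taken from a fully connected layer).
gallery = {
    "subject_a": [0.9, 0.1, 0.3],
    "subject_b": [0.2, 0.8, 0.5],
    "subject_c": [0.4, 0.4, 0.9],
}
probe = [0.85, 0.15, 0.35]

best_match = identify(probe, gallery)                        # "subject_a"
accepted = verify(probe, gallery["subject_a"], threshold=0.2)
rejected = verify(probe, gallery["subject_b"], threshold=0.2)
```

Identification only needs the relative ranking of distances, whereas verification depends on the absolute distance and the chosen threshold.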
Wolf et al. [49] proposed a 3D CNN with a 3D spatiotemporal tensor as input, consisting of a grey-scale image for the first channel and optical flow for the second and third channels. The model is trained and tested using the CASIA-B dataset, MoBo database and USF database. The approach was evaluated on variations in walking speed, clothing and the view angle. Based on this architecture, Castro et al. [50] used a spatiotemporal 3D tensor of the optical flow as the input of the CNN. The network was trained and tested using the TUM-GAID database with gait scenarios, clothing and carrying variations for each subject. Although the network accuracy was significantly improved using the optical flow rather than silhouette-based input, it is difficult to conclude which feature extraction method outperformed the other, since [49] and [50] are evaluated on different datasets. Nevertheless, it is clear that the optical flow feature can present robust gait spatiotemporal information for use in a CNN architecture.

b) Dual deep ANNs:
The input into a dual network consists of two different images, as probe and gallery under similar conditions; however, different gait scenarios, viewing angles, as well as clothes and carrying conditions, may be involved. This architecture is effective in gait verification since the networks have the same weights and structure, which allows the extraction of gait features automatically in the same manner. The outputs are matched using contrastive loss to find the Euclidean distance. The latter can be compared to a threshold to identify matching pairs or to label an imposter if a match cannot be found. Below is an outline of architectures applied for CNNs with two inputs. Figure 3 [54] presents some dual architectures used for verification and identification.
Zhang et al. [55] designed a shared-parameter 'Siamese twin' CNN, each twin comprising a convolutional layer, a max-pooling layer and three fully connected layers to extract gait features automatically. The two twin outputs are connected to the contrastive loss layer. A pair of similar or dissimilar GEI images from the OU-ISIR database is used as an input to the Siamese network. In the training stage, the weights are shared simultaneously to optimize the network, and the model is fine-tuned by back-propagating with a contrastive loss. The gallery member with the nearest training sample is identified by testing to allow the feature metric computation of a discriminative loss function. The latter drives the similarity metric [56] to be small for pairs representing the same subject, and large for different subjects. Considering the changes of cross-view in real-world human identification scenarios, the model performs well in gait verification.
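A commonly used form of the contrastive loss, which may differ in detail from the one used in [55], penalizes the squared distance for same-subject pairs and penalizes different-subject pairs only when they are closer than a margin. The sketch below assumes this form; the feature vectors and the margin value are invented for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(f1, f2, same_subject, margin=1.0):
    """Contrastive loss on the two twin-network outputs f1 and f2.

    Same-subject pairs are pulled together (loss grows with distance);
    different-subject pairs are pushed apart until they are at least
    `margin` away, after which they contribute no loss.
    """
    d = euclidean(f1, f2)
    if same_subject:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical twin outputs for GEI pairs.
anchor = [0.2, 0.4]
same_near = contrastive_loss(anchor, [0.2, 0.4], same_subject=True)    # identical features
same_far = contrastive_loss(anchor, [1.2, 0.4], same_subject=True)     # same subject, far apart
diff_near = contrastive_loss(anchor, [0.3, 0.4], same_subject=False)   # imposter too close
diff_far = contrastive_loss(anchor, [2.2, 0.4], same_subject=False)    # imposter beyond margin
```

Minimizing this loss drives the distance to be small for pairs of the same subject and large for different subjects, so that a simple distance threshold can separate genuine pairs from imposters at test time.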
Wu et al. [57] proposed a CNN to extract gait features directly from the raw silhouette sequence for cross-view gait recognition. Gait sequences from the CASIA-B dataset are used to train and test the network. In the testing stage, the Euclidean distance is measured for similarity using the probe and gallery method, achieving an accuracy of 94.1%. Furthermore, in [58] several CNNs that take two inputs as probe and gallery have been shown to outperform other approaches, including twin CNNs [55], [57]. Two GEI images are used for gait verification based on cross-view gait recognition. The datasets used to train and test the proposed networks are the CASIA-B dataset, OU-ISIR database and USF database. The proposed methods outperformed the previous state-of-the-art methods by a significant margin on the three datasets.
For cross-view gait recognition, Takemura et al. [54] considered different architectures for verification and identification. This is based on the assumption that the absolute similarity scores are important for the verification task, while the relative similarity scores between a probe and the galleries are important for the identification task. For verification, a Siamese CNN with shared parameters is proposed (see figure 3a) to discriminate whether two inputs originate from the same subject or not, based on the contrastive loss value. For identification, three parallel CNNs are deployed as a triplet network (see figure 3b). The triplet input is three GEIs: a query (the probe subject), a positive (from the same subject) and a negative (from a different gallery member). A triplet ranking loss is defined as the difference between two feature vector distances: the distance between positive and query and the distance between negative and query. The parameters of the triplet CNN are trained so that the dissimilarity between a probe and the same subject is relatively lower than that between a probe and different subjects. To accommodate possible substantial differences in the GEIs by viewing angle, low-level difference structures are introduced, as they are more directly affected by appearance differences due to taking the difference between a matching pair closer to the input level (see figures 3c and 3d). Cross-view gait recognition is demonstrated on OU-ISIR and OU-ISIR Multi-View Large Population datasets, with 10,307 subjects' video sequences captured from 14 angles (see figure 4); however, the existing methods are difficult to evaluate on this dataset, and OU-ISIR LP is utilized to confirm the hypothesis regarding the network architecture.
Fig. 4. Gait GEI images at 14 viewing angles [54].
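The triplet ranking loss described above for [54] can be sketched in its common hinge form: the query-positive distance should be smaller than the query-negative distance by at least a margin. The margin and the feature vectors below are assumptions for illustration, and the exact loss used in [54] may differ.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_ranking_loss(query, positive, negative, margin=0.5):
    """Hinge-form triplet loss on three feature vectors.

    Zero when the negative is farther from the query than the positive
    by at least `margin`; positive otherwise.
    """
    d_pos = euclidean(query, positive)   # query vs same subject
    d_neg = euclidean(query, negative)   # query vs different subject
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical features from the three parallel CNNs.
query = [0.0, 0.0]
positive = [0.1, 0.0]

well_ranked = triplet_ranking_loss(query, positive, negative=[1.0, 0.0])   # 0.0
badly_ranked = triplet_ranking_loss(query, positive, negative=[0.2, 0.0])  # > 0
```

Only the relative ordering of the two distances matters here, which matches the observation that relative similarity scores are what the identification task needs.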
3) Transfer Learning: Transfer learning is a comparatively new concept in ANNs and is the next strongest driver, after supervised learning, of the commercial success of machine learning [59]. Essentially, it applies the knowledge gained in solving one problem to a multiplicity of related problems. 'Pre-trained' models are beneficial as a starting point for specific ANN solutions, given the vast computing and time resources required to develop detailed physical models for these problems. Compared to starting from scratch, transfer learning allows a substantial jump in the starting point for the delivery of a related ANN model [60].
Li et al. [61] used supervised pre-training of a VGG-D CNN (Visual Geometry Group) model and evaluated the efficacy of the learned features on gait recognition tasks. The network consists of 16 convolutional layers and 3 fully connected layers with a nearest neighbor classifier. The silhouette images from the OU-ISIR dataset are used to train and test the network without fine tuning to capture gait spatiotemporal aspects. The probe and gallery method is used to identify people in a cross-view setting, significantly outperforming prior state-of-the-art methods for both verification and identification.
Alotaibi and Mahmood [15] determined empirically the appropriate CNN architecture for automatic gait feature extraction from GEI images using the CASIA-B dataset. They applied two transfer learning methods to the network pre-trained with 24 subjects. 'Fine-tuned CNN' involved adding one more subject (new total of 25 subjects) and dropping the weights of the softmax layer, followed by re-training of the entire model; 're-learn softmax only' involved 'freezing' the weights of the convolutional layers and re-learning the weights of the softmax layer. While the computational time for pre-training was 124.82 s, adding a single subject took 42.41 s by fine-tuned CNN and only 22.12 s by softmax re-learning.
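The 're-learn softmax only' strategy can be illustrated with a toy sketch in which a fixed linear map with ReLU stands in for the frozen convolutional layers and only a fresh softmax layer is trained. All sizes, the synthetic data, the learning rate and the function names are assumptions for illustration and do not reproduce the network in [15].

```python
import math
import random

def features(x, frozen_w):
    """Frozen feature extractor: fixed linear map plus ReLU, with a bias term appended."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in frozen_w] + [1.0]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mean_loss(head, feats, labels):
    """Mean cross-entropy of the softmax layer on pre-computed features."""
    total = 0.0
    for f, y in zip(feats, labels):
        p = softmax([sum(w * v for w, v in zip(row, f)) for row in head])
        total -= math.log(p[y])
    return total / len(feats)

def relearn_softmax(feats, labels, n_classes, lr=0.1, steps=200):
    """Train a fresh softmax layer by full-batch gradient descent; features stay fixed."""
    dim = len(feats[0])
    head = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n_classes)]
        for f, y in zip(feats, labels):
            p = softmax([sum(w * v for w, v in zip(row, f)) for row in head])
            for c in range(n_classes):
                err = p[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    grad[c][j] += err * f[j] / len(feats)
        for c in range(n_classes):
            for j in range(dim):
                head[c][j] -= lr * grad[c][j]
    return head

rng = random.Random(0)
frozen_w = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]  # never updated
inputs = [[rng.random() for _ in range(3)] for _ in range(30)]  # synthetic samples
labels = [x.index(max(x)) for x in inputs]                      # synthetic class labels
n_classes = 3  # the enlarged subject set after adding one class
feats = [features(x, frozen_w) for x in inputs]  # computed once, since the extractor is frozen
head = relearn_softmax(feats, labels, n_classes)
print(mean_loss(head, feats, labels))
```

Because the extractor is frozen, its outputs are computed once and only the small softmax layer is updated, which is consistent with the shorter re-learning time reported above; full fine-tuning would also update `frozen_w` at every step.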

B. Wearable Sensors
WS are an obvious means to acquire human gait due to their convenience, efficiency and lower price. Unlike other gait capturing systems, WS require the user to cooperate by wearing the device, albeit in a non-invasive way, to provide gait signals. The advances in electronic devices and signal processing techniques have extended the applications of WS to the measurement of human body orientation, position and specific force in space and time. The inertial measurement unit (IMU) is a type of WS system that has been extensively used due to its small size, low cost, light weight, and good precision characteristics. A typical IMU provides the most widely used combination of sensing modalities to capture human activities, including gait. It comprises an accelerometer, a gyroscope and often a magnetometer, which gives the heading direction. Additional components such as batteries, microprocessors and communication modules are arranged to jointly operate an IMU system.
Gyroscope sensors measure the angular velocity as the rate of change of the sensor's orientation, while accelerometer sensors measure the acceleration of the body resulting from the acting forces in the opposite direction. A combination of these sensors can create a comprehensive report on the human body orientation, gravitational forces, velocity and acceleration [5].
Furthermore, it has been found convenient to use the gyroscope and accelerometer, usually integrated in a smartphone, benefiting from predictable availability and positioning, as well as eliminating the need for additional hardware. Mobile users' authentication is an acceptable approach when other gait authentication is not deployable. In the healthcare domain, IMU-equipped smartphones allow inexpensive prediction of falls due to neurological disorders or freezing of gait in patients [62]. The computing power on board a smartphone can be used as a standalone system to perform all tasks required for decision making and communicating with healthcare providers in any life-threatening situation.
The analysis of WS signals is a challenging task considering the large number of observations recorded per unit time. This is due to the spatiotemporal nature of the gait cycle and the difficulty of relating WS signals to a known gait characteristic in a straightforward manner. Manual feature extraction is the classical approach to gait analysis using WS; it is time-consuming and depends on knowledge of the context in which the signals are acquired. Since performance is key in real-world applications, deep learning has emerged as a promising data processing method by extracting gait features automatically.
The sensor position on the body and the number of sensors comprising the system are essential factors for the quality of the harvested data. In a systematic review analysis, Panebianco et al. [63] assessed the accuracy and repeatability of 17 algorithms in monitoring temporal parameters of human gait from 5 IMUs: one on the back, two on the shanks and two on the feet. For estimates of stance time, algorithms based on the acceleration of the shank and foot perform better than those based on the lower back; however, the sensor position did not affect the step estimation. For toe-off and heel strike detection, algorithms estimating angular velocity performed better overall, with notable dependence on the sensor positioning. Analysis has been concerned mostly with the distinction between normal and abnormal gait, as discussed below.
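Once heel strike and toe-off instants have been detected, temporal parameters such as stride and stance time follow by simple differencing. The sketch below assumes the usual definitions (a stride runs between consecutive heel strikes of the same foot, and stance ends at the toe-off within that stride); the event times and the function name are hypothetical and do not correspond to any of the 17 algorithms assessed in [63].

```python
def temporal_parameters(heel_strikes, toe_offs):
    """Per-stride stride, stance and swing times (in seconds) for one foot.

    heel_strikes and toe_offs are ascending lists of event times for the same foot.
    """
    strides = []
    for start, end in zip(heel_strikes, heel_strikes[1:]):
        inside = [t for t in toe_offs if start < t < end]
        if not inside:
            continue  # skip strides where the toe-off detection was missed
        stride = end - start
        stance = inside[0] - start
        strides.append({"stride": stride, "stance": stance, "swing": stride - stance})
    return strides

# Hypothetical event times (s) from a shank- or foot-mounted IMU.
params = temporal_parameters([0.0, 1.0, 2.0], [0.625, 1.625])
print(params)
```

Errors in the detected event times propagate directly into these parameters, which is why the dependence of event detection on sensor position matters.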
1) Normal Gait Analysis: Analysis of normal gait parameters using WS has attracted considerable interest from researchers and clinicians. The following are different methods and techniques that have been proposed and implemented for various applications.
Zebin et al. [64] proposed a system comprising 5 IMU sensors, worn on the lower back, thighs and shanks, for activity recognition including gait. A CNN-based model is used to extract the features automatically from raw time-series data and achieves higher accuracy compared to handcrafted features with shallow learning. In another work, 7 IMUs positioned on the chest, arms and legs, along with 12 accelerometers close to the limb joints, were used by Ordóñez and Roggen [65]. A DeepConvLSTM model is trained in a fully-supervised manner on human activities including gait. The DeepConvLSTM model outperforms previous results on the same dataset. However, increasing the number of sensors made the extraction of gait features more difficult compared to the use of WS attached to the pelvis and lower limbs only [64].
For gait authentication, Gadaleta et al. [66] used a CNN model (see figure 5) to extract gait features from a single WS placed on the shank of each subject. Gait data from 15 subjects is used in the training stage and from 9 subjects in the testing stage. In the latter, the network weights are frozen and the CNN model is used to extract features, which are then fed to an SVM for classification. Increasing the training dataset was suggested for improving the model performance. In a later work by Gadaleta and Rossi [67], the proposed CNN model is used to extract a gait feature vector from a single subject automatically, and the gait features are used to train a single-class SVM. The system can distinguish between an impostor and the user whose gait is used for training. The IMU signals acquired from smartphones are tested on a user against 14 impostors, yielding false positive and false negative rates of less than 0.15%.
Zhao and Zhou [68] proposed a CNN model for gait labeling and authentication. The input to the network for automatic gait feature extraction is an Angle-Embedded Gait Dynamic Image (AE-GDI), which is a transformation of a WS data series. This allowed comparison with the state-of-the-art performance on VS (OU-ISIR) and WS (MCGILL [69]) datasets.
Similar to [64], Dehzangi et al. [70] placed 5 WS at various body locations. WS signals obtained from the sensors at the chest, right wrist, knee and ankle, as well as the lower back of the subject, allow the study of CNN performance on time-frequency image transformations of the raw signals. Gait data from a total of 10 subjects was used to train and test the network; accounting for the multi-sensor character of the data, early and late fusion methods were applied, achieving state-of-the-art performance in subject identification. The deep learning approach to sensor fusion is addressed in more detail in Section IV.
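The distinction between early and late fusion of multi-sensor data can be sketched generically as follows. Concatenation of per-sensor features and averaging of per-sensor class scores are common choices assumed here for illustration; they are not claimed to be the specific fusion rules of [70], and the numbers are hypothetical.

```python
def early_fusion(sensor_features):
    """Concatenate per-sensor feature vectors into one input for a single classifier."""
    fused = []
    for feats in sensor_features:
        fused.extend(feats)
    return fused

def late_fusion(sensor_scores):
    """Average per-sensor class-score vectors; return (fused_scores, predicted_class)."""
    n = len(sensor_scores)
    fused = [sum(col) / n for col in zip(*sensor_scores)]
    return fused, max(range(len(fused)), key=fused.__getitem__)

# Early fusion: one classifier sees all sensors at once.
print(early_fusion([[1, 2], [3], [4, 5]]))
# Late fusion: each sensor has its own classifier; only their scores are combined.
print(late_fusion([[0.5, 0.5], [0.25, 0.75]]))
```

Early fusion lets the network learn cross-sensor features, while late fusion keeps the per-sensor models independent and combines only their decisions.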
2) Abnormal Gait Recognition: Deviations from normal gait are extensively studied by WS, the main targets being to classify neurodegenerative conditions, or to prevent falls in older adults. While the assumptions underlying various algorithms differ, in practical applications it often appears more convenient to use a single WS for capturing a discriminative gait feature. The sensor system embodiments used for abnormal gait analysis can be grouped into dedicated IMU systems and smartphones. Lorenzi et al. [71] used a single IMU positioned on the head to collect gait patterns during the gait cycle, aiming to distinguish normal gait from the freezing of gait and irregular steps in Parkinson's disease (PD), using dynamic time warping to select the input features to the ANN.
Deep learning has recommended itself as an improved approach to recognizing abnormality in human gait, in terms of classification accuracy and computational requirements. Camps et al. [72] used a waist-positioned IMU and an 8-layer CNN to detect freezing of gait (FOG) in PD patients with an accuracy of 90.6%. The optimal architecture was implemented with two convolution layers and 20 convolution filters. The gait of 32 patients was recorded by a smartphone accelerometer and gyroscope casually placed in the subject's trouser pocket. The CNN detected the FOG events in Fourier space with 91.8% accuracy, which is slightly higher than that of the CNN methods proposed in [72].
In a recent study, Xia et al. [74] proposed a CNN to extract gait features from three accelerometers positioned above the hip, knee, and ankle. With the aim of distinguishing FOG events from normal gait, evaluation on the Daphnet FOG dataset [75] from 10 subjects yielded an accuracy of 90.60%. Several other deep ANNs [76], [77] have been trained and tested for human activity recognition from raw spatiotemporal datasets, including the FOG dataset [75]. Rad et al. [76] and Hammerla et al. [78] used a CNN that performed well in human activity recognition; however, the performance on the FOG dataset was weaker. Murad and Pyun [79] improved the FOG recognition accuracy to 94.1% with their proposed deep RNN trained on the Daphnet FOG dataset. Ravì et al. [77] argued that deep learning models do not perform well when only a small number of activities are available and proposed feature fusion, where shallow features are fused with features derived by deep learning in the fully connected and the softmax layers. With Daphnet FOG data, this method yielded, for 'freeze' and 'no freeze', precision of 67.89% and 97.40%, as well as recall of 59.52% and 98.15%, respectively.
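Per-class precision and recall, as quoted above for 'freeze' and 'no freeze', are computed from the counts of true positives, false positives and false negatives. The following sketch uses a small hypothetical label sequence, not data from the Daphnet FOG dataset, to show how the minority 'freeze' class can score poorly while overall accuracy remains high.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall of the class `positive` from paired label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical window labels: freezing events are rare compared to normal gait.
y_true = ["freeze", "freeze", "no", "no", "no", "no", "no", "no"]
y_pred = ["freeze", "no", "freeze", "no", "no", "no", "no", "no"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision_recall(y_true, y_pred, "freeze"))
```

For imbalanced data of this kind, accuracy alone is therefore a weak indicator, and the per-class figures are the more informative comparison between models.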
As an alternative use of deep learning, stride length estimates are derived in clinical settings to indicate an early or further progression stage of neurological disorders. In the work reported by Hannink et al. [80], stride length is estimated automatically using WS and deep CNNs. The WS set consists of a 3D-accelerometer and a 3D-gyroscope attached below each ankle joint. The aim of this approach is to extract spatiotemporal gait parameters to aid the physician in scoring gait impairment objectively. The CNN performance was evaluated on the eGAIT dataset [81], using 10-fold cross validation on three different stride types. It was observed that the performance was dependent on the stride definition and that better results were achieved for mid-stance to mid-stance intervals. Importantly, the CNN analysis of WS data was not affected by the use of a four-wheeled walking aid, whereas the data processing became problematic with the GAITRite walkway sensor system (see Section III.C.).
Gait analysis using WS has been extensively studied for the detection of falls in older adults. Most of the reported work is based on handcrafted features; deep learning has appeared as an improved approach in terms of increased classification accuracy and reduced computational load. Aicha et al. [82] reported work on CNN, LSTM, and ConvLSTM models used to extract gait features from raw signals of an accelerometer positioned on the lower back. The models were trained and tested on the gait of 296 participants to predict fall risk as the main task and user identity as an auxiliary task. The models' performance with features extracted using deep learning was observed to be marginally better compared to handcrafted features.
Hu et al. [83] attempted to capture the higher risk of falling while walking on uneven surfaces as compared to walking on flat surfaces. Essential here is the ability of subjects, as a

C. Floor Sensors
One of the key points in monitoring human gait is to capture the forces placed on the ground by the foot during the gait cycle. The interaction of the human body with the walking surface is the point of contact with the environment, which cannot be avoided or modified at will. This interaction is typically described in terms of the GRF. Figure 1 emphasizes that the details of the GRF dynamics follow the gait cycle, as the 7 intervals are defined by the contact of one or both feet with the walking surface. This interaction is highly individual: in the short term it can vary as a result of a temporary psychological or physiological condition, and longer term changes can take place as a result of ageing or a long-term healthcare condition. Gait monitoring with floor sensors requires minimal, if any, cooperation or attention by the user and is amenable to embodiments for long-period, continuous data capture. This motivates the advances in sensor technology for footstep capturing systems and processing of GRF data to extract distinctive information on gait events, evolution of walking habits and reaction to physical and psychological interventions. Typical applications of FS are in the fields of biometrics, healthcare, sports, safety and security.
GRF data obtained from force plates has been successfully used for biometrics in [22], [84], [85]. Vera-Rodriguez et al. [84] have assembled the largest footstep database to date, SFootBD [86], containing about 9900 single strides from 127 volunteers. In the healthcare context, GRF sensor data has been used for flat foot diagnosis in children [87], for fall detection in a smart home environment [88] and for monitoring performance on dual cognitive tasks [89], [90]. Discrete switches [89], [91], [92], a row-column contact wire mesh [90] and pressure sensors [87], [88], [90] have been most commonly used as floor GRF sensors to derive stride length, width and duration; stride variability; cadence; velocity and other spatial characteristics of gait [89]-[92], as well as the time-on-heel to time-on-toe ratio [93]. While these features are in common use in healthcare practice, they are not straightforward to extract and interpret from substantial volumes of raw data. Consequently, data-mining methods and shallow machine learning have been introduced in the past couple of decades to process data from FS. Table III summarizes the results yielded by gait recognition models based on floor sensors using deep ANNs compared to shallow learning.
The recently demonstrated success of deep learning in processing VS and WS data has stimulated interest in applying CNNs to data from FS. Singh et al. [94] proposed a pre-trained 17-layer CNN and gated recurrent units to extract gait features automatically from images of footstep GRF. The images were obtained on a 1 cm pixel grid covering an area of 80 cm × 80 cm, as point measurements of resistance between the upper and lower surfaces of a conductive polymer fiber sheet. The raw sensor data is used in image format as an input to the Inception-v3 model. The model is tested on identifying 13 people and yielded an accuracy of 87.66%. The limited volume of the training dataset was identified as the main hurdle towards better performance of the proposed method.
Cantoral-Ceballos et al. [95] used a fundamentally different approach to floor sensing: instead of point measurements, they used a distributed Plastic Optical Fiber (POF) sensor layer sandwiched unobtrusively between the top pile layer of a commercial carpet and a deformable underlay, implementing Guided-Path Tomography [96] (iMAGiMAT, see figure 6). With frame rates of 256 Hz and spatial sampling adequate for inverting the data into footstep image frames, it was possible to capture in substantial detail the dynamics of an uninterrupted sequence of at least 4 footfalls at a time. Costilla-Reyes et al. [97] demonstrated that, in the classification of 10 manners of walking from temporal data subsets, deep learning models (a Deep Feed Forward ANN with 10 hidden layers and an RNN) outperformed shallow learning, with some exceptions attributed to the shortage of training data. This was partially mitigated in a further work [98], where the UoM-Gait-13 dataset was introduced as a full set of spatiotemporal raw signals (1400 frames at 256 Hz from each of the 116 sensors) from 10 manners of walking and 3 dual tasks. The raw signals were down-sampled, reshaped, and normalized to form a spatiotemporal input sequence for a CNN. The latter consisted of two convolutional layers, followed by one average pooling layer and one max pooling layer. The network, trained and tested on the UoM-Gait-13 dataset using a spatiotemporal Raw Sensor Matrix (RSM) representation, achieved a classification accuracy of 97.88 ± 1.70%. For comparison, tomography images were reconstructed from the raw data, and classifications by shallow and deep learning models were obtained for the three input options: raw spatiotemporal sequences, RSMs and reconstructed images. The deep learning approach with RSM input outperformed all others by a margin, on all measures: accuracy, precision, recall and F-score.
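The preparation of a time-by-sensor input matrix from raw floor sensor signals can be sketched as below. The recording size (1400 frames from 116 sensors) follows the text; the down-sampling factor of 4, the min-max normalization, the synthetic values and the name `build_rsm` are assumptions for illustration, since the exact pre-processing of [98] is not detailed here.

```python
def build_rsm(frames, step=4):
    """Down-sample a frames x sensors recording and min-max normalize it to [0, 1]."""
    kept = frames[::step]                      # keep every `step`-th time frame
    lo = min(min(row) for row in kept)
    hi = max(max(row) for row in kept)
    span = (hi - lo) or 1.0                    # guard against a constant recording
    return [[(v - lo) / span for v in row] for row in kept]

# Synthetic stand-in for one recording: 1400 frames x 116 sensors (arbitrary values).
frames = [[float((t * s) % 7) for s in range(116)] for t in range(1400)]
rsm = build_rsm(frames, step=4)
print(len(rsm), len(rsm[0]))
```

Each row of the resulting matrix is one time frame and each column one sensor, so that a 2D convolution acts jointly on the temporal and spatial dimensions of the data.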
A deep residual ANN based on the ResNet architecture was proposed by Costilla-Reyes et al. [99] for footstep biometrics. The pressure magnitude exerted by footsteps is sampled by two floor mats, with 88 piezoelectric sensors each, arranged to capture most of the full gait cycle. Different representations are adopted for the raw spatial and temporal components of the data. For the spatial component, each footstep frame is reshaped into a 2D matrix with the sensors of the two mats concatenated and pixels re-calculated as accumulated pressure. The temporal component representation optimizes the data variability against training time by selecting frames corresponding to the heel strike, flat foot and heel-off intervals. Correspondingly, the network architecture consists of spatial and temporal streams; each stream has convolution, batch normalization, max pooling and fully connected layers. Features learnt by the model are classified in the final softmax layer using a one-vs-one linear SVM. Class-score level fusion, applied to the outputs from the classifiers of the spatial and temporal streams, was shown to perform better than lower level feature fusion. Biometric verification was demonstrated on the SFootBD database [86] for three benchmarks, driven by common security scenarios: airport concourse (40 stride footsteps for 40 users and 763 impostors), office area (200 stride footsteps for 15 users and 2697 impostors) and private dwelling (500 stride footsteps for 5 users and 5603 impostors). In all three benchmarks, the deep residual ANN outperformed shallow CNNs and FNNs, as well as the handcrafted feature approach in [86]. In the private dwelling scenario alone, the models improved the Equal Error Rate (EER) by a factor of more than 3 in validation and more than 2 in evaluation, over the state-of-the-art. The superior performance was attributed to combining the ResNet and SVM models, as well as to the distinct representations for the spatial and temporal components.
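The EER used in the verification benchmarks above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of its estimation from match scores is given below; the threshold sweep over observed scores, the convention that a higher score means a more likely genuine user, and the score values are assumptions for illustration and are unrelated to the SFootBD results.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER and its threshold by sweeping over all observed scores."""
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(1 for s in impostor if s >= thr) / len(impostor)  # false acceptance
        frr = sum(1 for s in genuine if s < thr) / len(genuine)     # false rejection
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]

# Hypothetical verification scores for genuine users and impostors.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.5]
print(equal_error_rate(genuine, impostor))
```

A lower EER indicates better separation of genuine and impostor score distributions, which is why the improvement is expressed as a reduction factor.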

IV. MULTI-MODALITY GAIT SENSOR FUSION
In its narrow sense, multi-sensor data fusion is the combination of data captured from multiple information sources, where the resulting information pool produces a new representation, distinct from those captured by individual sensors [100]. Gait feature fusion has been extensively used to study human gait features and anomalies associated with forces generated during the gait cycle. Deep learning is called for to combine multi-sensor data from all three modalities reviewed in Section III. Several WS data are fused in the ANN layers to deliver body orientation, position and specific force in space and time. FS are based on sensor fusion, since the proposed methods use a set of switch sensors, pressure sensors or POF sensors to log the forces associated with foot-ground contact. Further, we focus on the fusion of gait spatiotemporal sequences captured from at least two modalities, e.g. lower limb joint angle trajectories captured by VS or WS, and forces generated by the foot contact captured by FS or sensors under the foot. Table IV summarizes the results yielded by gait recognition models based on sensor fusion using deep ANNs.
In the healthcare context, deep learning has been used for data fusion to address gait-phase detection. Ding et al. [101] performed real-time gait-phase detection using one IMU sensor mounted on the shank to measure the absolute heading and angular velocity, as well as three foot-switches to label gait activities using deep learning. An LSTM-based gait-phase recognition algorithm is trained on the labeled data. Results showed 96.1% accuracy as compared to 89.1% and 91.8% for shallow learning techniques such as SVM and MLP, respectively. The reported results show a strong correlation between the gait phase and the kinematics of the shank.
Deep learning was used to fuse sensor data by Vu et al. [102] for gait-phase detection, to assist users of transtibial prostheses in taking full control of their gait. This will result in sufficient control of active prosthetic devices in real-world applications. The proposed algorithms detect gait-cycle percentages and predict future gait percentages in the case of a delay in the system. An Exponential Delay Fully connected ANN (ED-FNN) is developed for this purpose. It is based on short and long delays to predict fast changes in gait cycle progression on flat and 15-degree inclined surfaces. The model was trained and tested to detect the gait phase from the raw signals of an IMU positioned on the lower shank. Two force-sensitive resistors (FSR) were placed under the foot for accurate heel strike and toe-off detection. Although, strictly speaking, FSR data was not used in the ED-FNN processing, it contributed to a better quality data input to the network. The model performs well in an offline setting as compared to other methods based on handcrafted features. Furthermore, this methodology uses less computational power, which is an essential factor for deployment on autonomous systems.

Fig. 7. 3D CNN+LSTM architectural detail of the proposed multimodal human gait recognition using VS and WS fusion [105].

TABLE IV RESULTS FOR GAIT RECOGNITION FROM MULTI-MODALITY SENSOR FUSION
Multi-channel redundant fusion for generating bipedal gait was proposed by Mazumder et al. [103] with the aim of obtaining a robust stride time and gait phase using a Radial Basis ANN. The stride time is calculated and fused to derive robust fail-safe timing information, on which joint trajectory mappings are based. The proposed methodology estimates the user's intention to start, stop or change a particular gait pattern. A set of sensors is used for test data, namely an IMU sensor, foot pressure sensors and a myoelectric sensor for electromyography (EMG). The four EMG signal channels are fused with the pressure and IMU sensor signals to estimate stride time using the ANN. The proposed method is tested on five subjects walking on a treadmill, yielding classification accuracy with minimum square error < 0.05.
In a similar approach, Mun et al. [104] used a deep ANN to estimate and quantify spatiotemporal gait parameters from foot characteristics. This was achieved with a footstep feature measurement system that scans the foot while a subject performs various motion tasks, and a set of IMU sensors integrated in a commercial motion-capture system (Xsens MVN, Enschede, The Netherlands), to detect heel strike and toe-off events during the gait cycle. The sensor data is fused in the deep layers of the ANN to estimate the gait features during the gait cycle, namely: stride length, step length, velocity, stride time, step time, single-limb support time, double-limb support time, as well as swing time and stance time. The proposed methodology yielded an accuracy of 95% in tests with 42 patients, with predicted output of fast, normal and slow walk.
Gait recognition from two modalities has been proposed in studies of the lower limb trajectory by fusing VS and WS features extracted by deep learning network layers. Kumar et al. [105] proposed an evolutionary 3DCNN+LSTM to extract features captured by VS and by WS with IMU and pressure sensor signals. The system used to capture gait consists of a video camera, 17 precision IMU nodes and two pressure insoles included in a Shadow Motion wireless body suit. The proposed methods were tested on 19 males and 4 females performing four different walking styles, namely normal walk, fast walk, walking while listening to music and walking while watching video on a mobile device. The CNN is utilized to extract gait spatiotemporal features from VS. Two LSTM models were used: one to process the CNN output and the other to extract gait spatiotemporal features from the WS. In the final stage, a Grey Wolf Optimizer [106] is used to fuse the LSTM outputs. The model achieved an average accuracy of 91.3% on gait labeling.
It is also worth noting that Vera-Rodriguez et al. [107] have suggested that their fusion of FS and VS modalities, implemented on handcrafted features as an input to shallow learning methods, is amenable to deep learning methods for automatic extraction of fused features to improve the accuracy.

V. DISCUSSION
Gait analysis does not benefit from the advantage of deploying traditional and proven methods, such as harmonic analysis, where functions are represented in a complete and orthogonal basis, e.g. as a superposition of sine and cosine functions. Unfortunately, a complete and orthogonal function basis for gait is difficult to define, or in other words, the gait primitives are largely undefined. The gait features commonly used in practice (see Section II.A.) are neither fully independent nor do they exhaust all the possible members of the set. Against this backdrop, the automatic extraction of those gait features which allow a certain target case to be distinguished with the best accuracy, e.g. in healthcare or biometrics, appears to be a winning strategy. However, the nature of human gait requires distinct approaches depending on the character and volume of the recorded sensor data (sensing modality, availability of datasets, computational cost, etc.). This causes variations in the optimal choices made: the modalities for complementary fusion, the data representations, the optimal deep learning models, as well as the manner of their overall deployment.

A. Spatiotemporal Character of Gait Data
Because of the ambulatory nature of human gait, events defining the gait cycle are recurrent in space and time. Common patterns are manifested in a sequence of spatial regions and time periods, lending themselves to methods to suppress noise, e.g. by applying statistics over long data sequences. As the gait cycle duration is on the order of 1 s, acquired time sequences are usually abundantly sampled and require down-sampling to optimize computational resources and improve the signal-to-noise ratio. The approach to spatial sampling, however, is much less standard and is strongly modality-dependent (see Section V.B.). Estimating correctly the spatial resolution limit determined by the data is crucial for the choice of suitable data representations and for interpreting the calculated accuracy.
A direct comparison shows that classification with features automatically extracted from fused spatiotemporal FS data (see Section III.C.) yielded better accuracy compared to features from spatially-integrated temporal data or time-integrated footprints reconstructed from the same dataset. Thus, the benefits of spatiotemporal data fusion by deep learning CNNs appear to be beyond doubt. However, that has been achieved either at the data representation level or at the classification score level and not in the deep layers of the CNN. Furthermore, when trained on reconstructed footprint images only, deep and shallow models exhibited comparable performance. Arguably, this is because the spatial reconstructions involve solving an ill-posed and ill-conditioned inverse problem. This generated a much larger feature vector of pixel values, compared to the substantially smaller number of values in a single measurement frame, used for spatiotemporal fusion where deep CNNs outperform shallow models.
An interesting consequence of the spatiotemporal character of gait is the drive towards accurate time-stamping. This has resulted in the combination of modalities, such as IMUs plus shoe-sole switches, to detect heel-strike and toe-off for correct labeling of gait phases from WS data.

B. Multi-Sensor Incentives for Deep Learning From Gait Data
Manual fusion may take place at the data pre-processing stage when constructing the data representation, e.g. complementary fusion in the case of RSMs, or collaborative fusion to generate a sinogram image of the Radon transformed FS data [98], which can then be used for data inversion into 'center-of-mass' coordinates or footstep images (see figure 6). In contrast, automatic extraction of fused features may take place in the convolutional layers of the deep ANNs. However, the preference has been to fuse the spatial and temporal components at the classification score level (see Section III.C.).
Deep learning from gait data is most mature for VS, due to the ease of borrowing methodology from well populated research areas, such as face recognition. Translation of approaches such as object segmentation and detection, as well as 'probe and gallery' methods in image recognition, are a few examples. Another reason for the notable progress made with VS is the relative abundance of reliable databases with variations in the viewing angle and scenes, all facilitated by the variety of ubiquitous sensing technology and fueled by security and surveillance applications [108]. This may also explain why the drive to fuse VS with other gait modalities is comparatively weak; in fact, VS arguably capture already fused data and gait monitoring uses only some of that data, as it is not concerned with the part used for face recognition. Since VS for gait is less demanding in terms of pixel resolution and close-up, it can be speculated that accurate identification from gait could result in dis-applying facial recognition, wherever practical. Nevertheless, the fusion of VS with other modalities has been successful, mainly in the context of laboratory work (see Section IV).
WS offer the widest range of sensor and data types, as well as varying degrees of consumer market penetration and levels of collaboration by the user. Sensor fusion within the WS subset of modalities is systematically researched to deliver the next generation of health monitoring systems, and to satisfy the growing interest in activity recognition for commercial ecosystems comprising smartphones, smartwatches, activity bands and other health/fitness monitoring devices. Progress will depend on the technology available, but it is very likely that such devices will make extensive use of deep learning for sensor data fusion in order to personalize their owner's experience.
FS need to cover substantial areas where floor contact may be effected; thus they typically employ collaborative fusion from a large number of identical discrete or distributed sensors and measurands. Deep learning from FS data is comparatively new and reports of complementary fusion with other gait modalities have been rare, which should be judged in view of the scarcity of appropriate datasets. In any case, the information resulting from sensor fusion with FS would be undoubtedly richer than that from VS and/or WS alone, thus delivering even more accurate classifications. Fusing FS with VS for security applications can help to overcome the challenges of gait recognition in different scenes and limited angular views in VS; reciprocally, VS data can be invaluable in resolving the challenge of simultaneous users captured by FS systems. There is no doubt that fusion of all three gait modalities reviewed here will be more advantageous; however, this has to be measured against the practicality of designing, building, deploying and maintaining such complex systems.

VI. CONCLUSIONS
The character of gait data poses the problem of identifying features suitable for gait classifications, desirable in a number of application areas. The three gait-sensing modalities covered in this Review have produced data which is most amenable to the use of deep learning, to address the automatic extraction of such features. Deep learning CNNs typically outperform shallow learning models in the most essential metrics. Furthermore, multi-sensor and multi-modality fusion results in better accuracy and robustness. This is achieved by employing the available flexibility in data representations, ANN architectures and the choice of model hyper-parameters. Gait analysis benefits from methods introduced and tested in other applications of deep learning. However, it requires particular attention due to its spatiotemporal character, the options for ubiquitous gait sensing and the privacy concerns they raise, as well as the cost of achieving research, development and commercialization objectives. Deep learning from multi-sensor, multi-modality gait data offers new options in the strong drive towards personalized healthcare, as well as towards more robust and non-intrusive biometrics for safety and security. These are some of the challenges of the day, but the state of the art indicates a promising step reaching further into the future, rather than just the current horizon.

Fig. 1. Important gait events and intervals in a normal gait cycle.
and slowing its forward speed. The body center-of-mass is aligned with the forefoot (ball of the foot).
D - Terminal stance or Heel-off: The heel rises in preparation for the opposite swing. The trunk is sinking from its highest point, the knee has an extension peak near the time of heel rise and the ankle has dorsiflexion after heel rise.
E - Pre-swing: This is the second double-limb support interval. The opposite initial contact occurs, the hip is beginning to flex, the knee is flexing, and the ankle is at plantarflexion. The toe is in last contact before the swing, finishing the push-off started in interval D.
F - Initial swing and Mid-swing: This interval begins with the toe-off into single support and the start of the swing. The body weight is shifted to the opposite forefoot. At this instant, the knee joint reaches maximum flexion. The hip is flexing and the limb advances in preparation for a stride.
G - Terminal swing: This is the last interval of the gait cycle and the end of the swing phase. The interval begins at maximum knee flexion and ends with maximum forward extension of the swinging limb. The hip continues flexion and the knee extends in regard to gravity; the ankle continues dorsiflexion to end neutral, ready for the heel strike.
With regard to the above gait events, the following parameters of human gait are usually analyzed in clinical settings [17] for healthcare tasks, using various sensing and data processing methods:

Fig. 3. [54]: High-level difference architectures for small view-angle differences: a) Siamese CNN with probe and gallery input; b) Triplet CNN with positive probe, negative probe and gallery. Low-level difference variants of a) and b) for substantial view-angle differences: c) single CNN with probe and gallery; d) Siamese CNN.

Fig. 5. Convolutional neural network to extract and classify gait features from a wearable inertial measurement unit with accelerometer and gyroscope sensors [66].

Fig. 6. iMAGiMAT footstep imaging system. a) geometry number of the POF sensor elements (after [97]). b) tomography image reconstruction (right panel) as top view of deformation by a person standing (seen in the image in the left panel) with weight on the ball of the right foot and the left heel [95]. The orientation of a) is at 90 degrees with respect to the two panels in b).

TABLE I
RESULTS FOR GAIT RECOGNITION FROM VS

TABLE II
RESULTS FOR GAIT RECOGNITION FROM WS
automatically reliable discriminative features of human gait, outperforming approaches based on handcrafted
Table II summarizes the results yielded by gait recognition models based on WS using various deep ANN models.

TABLE III
RESULTS FOR GAIT RECOGNITION FROM FLOOR SENSORS
function of age, to produce the stability required to avoid a fall. A single IMU positioned on the trunk delivered raw signals from 35 users: 17 older adults (age: 71.5 ± 4.2 years) and 18 young adults (age: 27.0 ± 4.7 years), used as an input to the LSTM network. Automatically extracted spatiotemporal gait parameters are used to classify age-related differences in walking on flat or uneven surfaces.