Transportation Mode Recognition Fusing Wearable Motion, Sound, and Vision Sensors

We present the first work that investigates the potential of improving the performance of transportation mode recognition through fusing multimodal data from wearable sensors: motion, sound and vision. We first train three independent deep neural network (DNN) classifiers, which work with the three types of sensors, respectively. We then propose two schemes that fuse the classification results from the three mono-modal classifiers. The first scheme makes an ensemble decision with fixed rules including Sum, Product, Majority Voting, and Borda Count. The second scheme is an adaptive fuser built as another classifier (including Naive Bayes, Decision Tree, Random Forest and Neural Network) that learns enhanced predictions by combining the outputs from the three mono-modal classifiers. We verify the advantage of the proposed method with the state-of-the-art Sussex-Huawei Locomotion and Transportation (SHL) dataset recognizing the eight transportation activities: Still, Walk, Run, Bike, Bus, Car, Train and Subway. We achieve F1 scores of 79.4%, 82.1% and 72.8% with the mono-modal motion, sound and vision classifiers, respectively. The F1 score is remarkably improved to 94.5% and 95.5% by the two data fusion schemes, respectively. The recognition performance can be further improved with a post-processing scheme that exploits the temporal continuity of transportation. When assessing generalization of the model to unseen data, we show that while performance is reduced - as expected - for each individual classifier, the benefits of fusion are retained with performance improved by 15 percentage points. Besides the actual performance increase, this work, most importantly, opens up the possibility for dynamically fusing modalities to achieve distinct power-performance trade-off at run time.

Manuscript received February 21, 2020; accepted April 4, 2020. Date of publication April 13, 2020; date of current version July 17, 2020. This work was supported by HUAWEI Technologies through the project "Activity Sensing Technologies for Mobile Users." The work of Lin Wang was partly supported by the Institute of Coding, which is supported by the Office for Students (OfS) and the Higher Education Funding Council for Wales (HEFCW). The associate editor coordinating the review of this article and approving it for publication was Dr. You Li. (Sebastien Richoz and Lin Wang contributed equally to this work.) (Corresponding author: Lin Wang.)
The mode of transportation or locomotion is an important piece of contextual information about users during travel, and includes walking, running, cycling, taking a bus, driving a car, etc. [1], [2]. Knowledge of the transportation mode assists context-aware applications, such as activity and health monitoring, individual environmental impact monitoring, and intelligent service adaptation [3]-[9]. Nowadays, users carry a growing variety of wearable devices during travel. Besides ubiquitous smartphones, it is ever more common to wear a smartwatch (some of which have integrated cameras), smart earbuds with microphones, body-worn cameras such as life-loggers, or even eye-wear computers (e.g. Google Glass, Spectacles by Snap). These devices are embedded with multimodal sensors including motion sensors, GPS (global positioning system), microphones and cameras. There have been many studies on analyzing the mode of transportation from the data captured by the sensors of these wearable devices with machine learning techniques [10]. Motion and GPS sensors are widely used for transportation mode detection. Motion sensors retrieve the orientation and vibration information of the mobile device, while GPS sensors capture the speed and trajectory of the user [11]-[16].
In comparison to continuous GPS, motion sensors are more desirable as they are much less energy demanding. The state of the art in motion-based transportation recognition performance was established in the SHL recognition challenge [17], [18]. The outcomes reveal that approaches based on motion sensors struggle to distinguish between transportation modes of a similar kind: for example between train and subway (rail transport) or between bus and car (road transport).
Sound and vision are two important modalities that are available in wearable devices and can also be used to infer the user's context, although their application to recognizing the mode of transportation has rarely been reported. For instance, a recent challenge on detection and classification of acoustic scenes and events (DCASE) aims to classify various sound events in domestic and wild environments [19], [20]. There has been an increasing number of works using wearable cameras for life-logging, i.e. recording the surrounding environments and the daily life activities of people [21], [22]. The performance of visual object detection and acoustic event classification has progressed significantly since the introduction of deep-learning techniques. In addition, sound, vision and motion are complementary to each other as they each focus on different aspects of user context, providing a high diversity of knowledge.
Many machine learning approaches have been proposed to fuse multimodal information for classification tasks [31], [33], [39], [40]. These approaches can be categorized as early integration (data-layer fusion) and late integration (decision-layer fusion). The early integration method usually concatenates the data of all modalities into a single input vector for classification, and thus only needs a single classification model. The late integration method trains a separate classifier for each modality independently, and draws a final decision by combining the outputs of the classifiers. The early integration method considers the cross-modal correlations from the initial stages, and thus potentially outperforms the late integration method, which does not share representations across different modalities and ignores the correlated characteristics among the modalities. However, the synchronization of multiple modalities and the handling of different data sizes and sampling rates remain open problems for early integration methods. Furthermore, early integration methods do not easily allow for dynamically changing combinations of sensors: a classifier would need to be trained for each combination of sensors, which limits the scalability of the approach. Late integration methods are inherently modular: each separate classifier is optimized for the corresponding modality, which brings additional benefits of flexibility and scalability.
The motivation behind this paper is twofold. First, we are chiefly interested in investigating how the combination of the three sensor modalities may produce better recognition performance than using a single modality. Second, we are interested in modular approaches, i.e. approaches that make it possible to seamlessly combine one or more modalities. Such approaches are important as they enable a system to combine modalities dynamically at runtime, as a way to achieve potentially changing power and performance trade-offs [38]. So far, the combination of motion, sound and vision captured from on-body sensors has not been systematically explored for the recognition of modes of transportation. Due to privacy issues, few transportation and locomotion datasets are publicly available with sound and vision modalities. Only a few works have been reported on transportation mode recognition with vision [23] or sound [24]-[26], and to our knowledge no work has addressed the combination of vision or sound with each other and with motion.
Fig. 1. The equipment for SHL data collection has 4 smartphones and 1 camera. We use the data collected from the hand phone and the body-worn camera.
In this paper we conduct the first work that combines the motion, sound and vision modalities for transportation mode recognition. The state-of-the-art Sussex-Huawei Locomotion-Transportation dataset [10], [32] contains rich sensor modalities (including the three above mentioned sensors), which enable us to carry out this research, in order to recognize eight modes of transportation: being still, walking, running, cycling, being in a car, being on a bus, train or subway (Sec. II). Since the three modalities are captured with different sampling rates, the sensor data are not precisely synchronized, and we are interested in modular fusion which can be used in the future for dynamic power/performance management, we focus here on the late integration method. We first train three mono-modal deep neural network (DNN) classifiers, using the motion, sound and vision data, independently (Sec. III). We then evaluate two sets of modular data fusion schemes (ensemble decision and adaptive fusion) that fuse the classification results from mono-modal classifiers (Sec. IV). We compare the performance of combining different modalities with the SHL dataset at the task of recognizing the eight transportation modes (Sec. V). Experimental results demonstrate clear advantages of multimodal fusion. We further assess the generality of the proposed method to unseen data, particularly data comprising user variations (Sec. VI). After discussion in Sec. VII, we draw conclusions in Sec. VIII.

II. DATASET
The Sussex-Huawei Locomotion-Transportation (SHL) dataset is one of the biggest multimodal datasets for transportation and locomotion mode recognition from mobile devices [10], [32]. The dataset was recorded over 7 months by 3 users engaging in 8 different transportation modes: Still, Walk, Run, Bike, Car, Bus, Train and Subway. The duration of the dataset is 2812 hours, corresponding to a travel distance of 17,562 km in the south-east of the UK. The data was recorded using 4 smartphones placed at different locations on the body (hip pocket, hand, backpack, torso) and one body-worn unstabilized camera mounted on the chest and facing forwards (see Fig. 1). The dataset contains 16 sensor modalities including motion, sound and vision. The dataset was used in the recent SHL challenge 2018, a competition on motion sensor-based transportation activity recognition [17], [18].
The motion, sound and vision data contained in the SHL dataset enable us to investigate the potential of data fusion for transportation mode recognition. For ease of comparison, we use exactly the same training and testing data partitioning scheme as in the SHL challenge 2018 [17]. Specifically, we use the multimodal sensor data recorded by the first participant with the hand smartphone during 82 days (5-8 hours per day), which is partitioned into 62 days (271 hours) for training and 20 days (95 hours) for testing. Fig. 2 depicts the duration of each class activity in the training and testing datasets.
The motion sensors include accelerometer, gyroscope and magnetometer, which are all sampled at 100 Hz. The sound sensor (microphone) originally records sound at a sampling rate of 16 kHz, which is downsampled to 8 kHz before processing. The vision sensor (camera) takes one picture every 30 seconds (i.e. sampling rate 1/30 Hz).
Fig. 3(a) depicts the magnitude of the data provided by the accelerometer, gyroscope and magnetometer, respectively. As a combination of the X/Y/Z-axes, the magnitude is robust to device orientation and rotation. The accelerometer shows higher energy for Walk, Run and Bike than for the other five activities. The accelerometer also shows evident cyclic behaviour for the Walk and Run activities. Similar observations can be made in the gyroscope data. The magnetometer does not seem to show visually distinctive patterns for different activities. This visual inspection shows that while some sensors provide clearly distinct signatures for some activities, distinguishing all 8 classes appears challenging, which motivates the use of machine learning methods capable of representation learning, such as deep learning, further on in this article.
Fig. 3(b) compares the short-time Fourier transform (STFT) spectrogram of the sound recorded during the 8 transportation activities. One big challenge for sound recognition is the influence of environmental noise. The first row shows a clean sound captured during transportation (without additional noise from the environment). The sound segment tends to show different spectrogram patterns for each activity. For instance, the activities Still, Car, Bus, Train and Subway tend to present different energy distributions in the low and high frequencies, while the activities Walk, Run and Bike tend to present different cyclic behaviour.
In practice, the clean sound of each transportation activity usually overlaps with additional noise from the environment, such as wind, friction, human speech, and other sound events nearby, as shown in the second row of Fig. 3(b). These environmental noises are typically much stronger than the clean transportation sound, which makes recognizing transportation activities significantly more challenging.
Fig. 3(c) compares images of the 8 transportation modes taken by the front-facing camera. The resolution of the photos is 1024 × 576 pixels. In this example, most of the images are easy to recognize due to the relevant information provided by the environment. For example, for Bike we clearly see the handlebar with the hand on it; for Car we can see the roof and part of the dashboard of the car, as well as the road. The seats, bars, window frames and door shapes that appear in Bus, Train and Subway provide good-quality information to recognize them, although distinguishing between these three transportation modes is already more difficult. In Still, the user is inside and might be sitting on a couch looking at the room. Walk and Run are more challenging to distinguish as the movements of the arms and the places the user goes could be very similar. Note that not all the pictures in the dataset are so nicely represented: some photos are tilted, blurred, rotated, upside-down, bright, dark or occluded due to the position, orientation, time of the day or movements of the user, which makes recognizing the transportation modes more challenging.
Fig. 4 illustrates the general processing pipeline of multimodal fusion. We first train three independent mono-modal classifiers with the motion, sound and vision data, respectively, and then fuse their results for better recognition performance.

III. MONO-MODAL CLASSIFIERS
All three mono-modal classifiers are based on convolutional neural networks (CNN). Each classifier predicts the probability (in the range [0, 1]) of each transportation activity, which is fed to the subsequent data fusion stage. The motion and sound classifiers process the sensor data in 5-second frames (one decision every 5 seconds) while the vision classifier processes every image (one decision every 30 seconds).

A. Motion Classifier
We have three motion sensors, i.e. accelerometer, gyroscope and magnetometer, each containing three channels of measurement along the X-, Y- and Z-axes of the device. Since the pose and orientation of the smartphone are unknown, we combine the three channels by computing the magnitude, i.e.
$$ s(i) = \sqrt{s_x(i)^2 + s_y(i)^2 + s_z(i)^2}, $$
where $i$ denotes the time index and $s_x$, $s_y$, $s_z$ denote the three channels of one sensor. We convert the time-domain raw data to the frequency domain, and then cascade the data from the three sensors into a vector as
$$ S_F = [S_{acc}^T, S_{gyr}^T, S_{mag}^T]^T, $$
where $S_{acc}$/$S_{gyr}$/$S_{mag}$ denotes the magnitude of the Fourier transform of $s_{acc}$/$s_{gyr}$/$s_{mag}$ in one frame (retaining only the one-sided half of the spectrum). Given the frame length 500, the size of $S_{acc}$/$S_{gyr}$/$S_{mag}$ is 251 × 1, and therefore the size of $S_F$ is 753 × 1. The data $S_F$ in each frame is normalized into the range [0, 1] before classification, using
$$ \bar{S}_F(k) = \frac{S_F(k) - Q_5(k)}{Q_{95}(k) - Q_5(k)}, $$
where $S_F(k)$ denotes the $k$-th frequency bin in $S_F$, and $Q_{95}(k)$ and $Q_5(k)$ denote the 95% and 5% quantiles of that bin, respectively, across all the frames in the training data.
Fig. 5(a) illustrates the deep architecture of the motion-based classifier ($T_{motion}$), which we initially developed in [18]. The architecture consists of an input layer, multiple CNN and fully-connected neural network (FCNN) blocks and a decision block. The input layer receives and stores the frequency-domain motion sensor data $S_F$. Each CNN block sequentially consists of a convolutional layer, a batch normalization (Norm) layer, and a nonlinear (ReLU) layer. Each FCNN block sequentially consists of a fully-connected (FC) layer, a batch normalization (Norm) layer, a nonlinear (ReLU) layer and a dropout layer. The decision block consists of a fully-connected layer, a nonlinear (Softmax) layer and a classification layer which infers the transportation mode. Table I gives the detailed configuration of the neural network.
For both the training and testing datasets, we slide through the magnitude sensor data with a window of length 5 seconds and skip size of 5 seconds, generating frames that each contain 500 samples. This generates 195,688 frames of training data and 68,382 frames of testing data. The classification is conducted per individual frame. We use the Matlab Deep Learning Toolbox to implement the CNN classifier, using the stochastic gradient descent with momentum (SGDM) optimizer with default learning parameters.
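The motion preprocessing described in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the Matlab implementation used in the paper: the function names are hypothetical, and the final clipping to [0, 1] is an assumption, since quantile scaling alone does not bound values outside the 5%-95% range.

```python
import numpy as np

FRAME_LEN = 500  # 5-second frame at the 100 Hz motion sampling rate


def magnitude(xyz):
    """Orientation-independent magnitude of an (N, 3) array of X/Y/Z samples."""
    return np.sqrt(np.sum(np.asarray(xyz, dtype=float) ** 2, axis=1))


def frame_signal(signal, frame_len=FRAME_LEN, hop=FRAME_LEN):
    """Split a 1-D signal into frames of frame_len samples, advancing by hop."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])


def frame_spectrum(frames):
    """One-sided FFT magnitude per frame: 500 samples give 251 bins."""
    return np.abs(np.fft.rfft(frames, axis=1))


def motion_features(acc_xyz, gyr_xyz, mag_xyz):
    """Concatenate the three per-sensor spectra into one 753-dim vector per frame."""
    spectra = [frame_spectrum(frame_signal(magnitude(s)))
               for s in (acc_xyz, gyr_xyz, mag_xyz)]
    return np.concatenate(spectra, axis=1)


def fit_quantiles(train_features):
    """Per-bin 5% and 95% quantiles computed over all training frames."""
    return (np.percentile(train_features, 5, axis=0),
            np.percentile(train_features, 95, axis=0))


def normalize(features, q5, q95, eps=1e-12):
    """Quantile scaling; the clipping to [0, 1] is an assumption of this sketch."""
    return np.clip((features - q5) / (q95 - q5 + eps), 0.0, 1.0)
```

The quantiles are fitted on training frames only and then reused unchanged for test frames, which mirrors the statement that they are computed "across all the frames in the training data".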

B. Sound Classifier
For sound data, we compute the STFT spectrogram in each 5-second frame and then feed it to the classifier. The STFT spectrogram is computed with a sliding window of length 500 and half overlap. Therefore, the size of the spectrogram of the 5-second frame is 251 × 161. Let us represent the STFT in a frame as $S(k, l)$, where $k$ and $l$ denote the frequency and the STFT subframe indices, respectively.
To reduce the dynamic range of the data, we compute the log spectrogram as
$$ A(k, l) = \log |S(k, l)|, $$
where $|\cdot|$ denotes the absolute value. We then normalize the data to the range [0, 1] as
$$ I(k, l) = \frac{A(k, l) - A_{min}}{A_{max} - A_{min}}, $$
where $A_{max}$ and $A_{min}$ denote the maximum and the minimum values in the log spectrogram $A(:, :)$ throughout the training dataset.
Fig. 5(b) illustrates the deep architecture of the sound-based classifier ($T_{sound}$), which we initially developed in [24]. The convolutional neural network consists of an input layer, two CNN and two FCNN blocks, and an output decision block. The input layer receives and stores the normalized spectrogram $I$. Each CNN block sequentially consists of a convolutional layer, a batch normalization (Norm) layer, a nonlinear (ReLU) layer and a pooling layer. Each FCNN block sequentially consists of a fully-connected (FC) layer, a batch normalization (Norm) layer, a nonlinear (ReLU) layer and a dropout layer. The decision block consists of an FC layer and a nonlinear (Softmax) layer which outputs the classification result. Table II gives the detailed configuration of the neural network.
We slide through the training dataset with a window of length 5 seconds and skip size of 20 seconds, generating data frames each containing 40,000 samples. This generates 65,240 frames of training data. For testing, we use a sliding window of length 5 seconds and skip size of 5 seconds. This results in 68,382 frames of testing data, which is the same size as the motion testing data. We use the Matlab Deep Learning Toolbox to implement the CNN classifier, using the stochastic gradient descent with momentum (SGDM) optimizer with default learning parameters.
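The sound preprocessing of this subsection (STFT, log compression, min-max scaling) can be sketched as follows. Several details are assumptions that the text does not specify: the Hann window, the reflective padding of half a window on each side (chosen here because it reproduces the stated 251 × 161 size for a 40,000-sample frame), the natural logarithm, and the small `eps` that guards against log(0). The function names are hypothetical.

```python
import numpy as np


def stft_magnitude(frame, win_len=500, hop=250):
    """Magnitude STFT; returns an array of shape (win_len // 2 + 1, n_subframes)."""
    frame = np.asarray(frame, dtype=float)
    # Assumption: pad half a window on each side so that 40,000 samples
    # yield 161 subframes, matching the spectrogram size stated in the text.
    padded = np.pad(frame, win_len // 2, mode="reflect")
    window = np.hanning(win_len)  # assumption: the window type is not specified
    starts = range(0, len(padded) - win_len + 1, hop)
    columns = [np.abs(np.fft.rfft(padded[s:s + win_len] * window)) for s in starts]
    return np.stack(columns, axis=1)


def log_spectrogram(frame, eps=1e-10):
    """Log-compressed magnitude spectrogram of one 5-second sound frame."""
    return np.log(stft_magnitude(frame) + eps)


def min_max_normalize(log_spec, a_min, a_max):
    """Scale with the global minimum and maximum found on the training set."""
    return (log_spec - a_min) / (a_max - a_min)
```

In use, `a_min` and `a_max` would be computed once over all training spectrograms and then applied to every frame, so test frames may fall slightly outside [0, 1].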

C. Vision Classifier
For image data, we employ a preprocessing procedure which resizes each image from 1024×576 to 224×224 before feeding it to the classifier. The vision classifier is based on the architecture of [23], which is an adaptation of DenseNet169 [28]. DenseNet169 is a CNN model pre-trained on the ImageNet dataset for image recognition [29]. DenseNet169 consists of several dense blocks, transition layers and finally a classification layer. The dense blocks, each containing multiple densely connected convolution layers, are connected via transition layers, each of which consists of a convolution and a pooling layer. The dense blocks extract features from the input image, which are fed to the decision block for classification.
DenseNet169 was originally trained for image classification, and cannot be used for transportation mode recognition directly. We employ a transfer learning scheme that adapts the DenseNet169 model to our classification problem. Specifically, we freeze the architecture and parameters of the DenseNet169 model except the 4th (and last) dense block. We replace the last FC layer and decision layer by an FC layer with 512 neurons followed by 8 Softmax-activated neurons, which predict the probability of each transportation activity. The parameters of the decision block are fine-tuned using the training data. Table III gives the detailed configuration of the neural network.
Following the same training/testing data split scheme, we have 31,287 images in the training set and 10,781 images in the testing set. The vision classifier ($T_{vision}$) is implemented with the Python Keras library using TensorFlow as the backend computing library.

IV. MULTIMODAL SENSOR FUSION
The motion, sound and vision sensors work independently with different sampling rates. It is necessary to synchronize the data before fusing the classifier results. We first introduce the data synchronization method and then present the two data fusion schemes: ensemble decision and adaptive fusion. Finally, we employ a post-processing scheme that can further improve the recognition accuracy.

A. Synchronization
When recording sensor data, the smartphone and the camera both log the absolute world time (Unix Epoch time). Following the procedure described in the document "Data organisation and file formats", we can retrieve the absolute world time for each frame of sound and motion data and for each image, as illustrated in Fig. 6.
The motion and sound classifiers make a decision per 5 seconds, while the vision classifier makes a decision per 30 seconds. We thus need to interpolate the vision classifier output to make it consistent with the output from the other two classifiers. We use the zero-order hold rule for interpolation [30]. Specifically, as exemplified in Fig. 6, the decision of the current image is retained for 30 seconds until the next image. The decisions of the motion, sound, vision classifiers are then fused at the same world clock time.
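The zero-order hold interpolation can be illustrated with a short sketch. The function name and the use of plain lists are hypothetical; timestamps are assumed to be in seconds and `image_times` is assumed to be sorted. Frames that precede the first image have no vision output to hold, so the sketch returns `None` for them, which is an assumption about a case the text does not discuss.

```python
from bisect import bisect_right


def zero_order_hold(image_times, image_outputs, frame_times):
    """For each frame time, reuse the output of the latest image taken at or before it."""
    held = []
    for t in frame_times:
        idx = bisect_right(image_times, t) - 1  # index of the most recent image
        held.append(image_outputs[idx] if idx >= 0 else None)
    return held


# Images every 30 s, motion/sound decisions every 5 s (only a few shown).
vision = zero_order_hold([0, 30, 60], ["A", "B", "C"], [-5, 0, 5, 25, 30, 59, 60, 85])
print(vision)  # [None, 'A', 'A', 'A', 'B', 'B', 'C', 'C']
```

Note that this sketch keeps holding the last image output indefinitely, whereas in the described setup each output is held only until the next image arrives 30 seconds later.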

B. Ensemble Decision
The ensemble decision scheme is based on simple mathematical rules to select the transportation mode (class) based on the outputs from multiple classifiers. Following the suggestions in [31], we consider the following four fixed rules: Majority Voting, Borda Count, Sum Rule and Product Rule. The first two are based on the output labels while the latter two are based on the class probability.
Let $p_{c,t}$ denote the predicted probability of class $c$ by classifier $t$. Suppose we have $C$ classes and $T$ classifiers, i.e. $c \in [1, C]$ and $t \in [1, T]$. The decision of classifier $t$ is
$$ d_t = \arg\max_{c \in [1, C]} p_{c,t}. $$
Majority Voting (MV) selects the class that is chosen most often among the multiple classifiers. Let $n_c$ denote the number of classifiers whose decision $d_t$ equals class $c$. The majority voting rule is expressed as
$$ d_{MV} = \arg\max_{c \in [1, C]} n_c. $$
In Borda Count (BC), all the classes are ranked based on their predicted probability and are given weights based on the rank. For instance, the class with the highest probability is given a weight $C - 1$, the second one is given a weight $C - 2$, and so forth, until the last one is given a weight 0. We sum up the weights from all the classifiers and choose the class with the highest total weight. In this way, the decision is less dependent on the probability value. Suppose the weight of class $c$ by classifier $t$ is $w_{c,t}$; the decision is given by
$$ d_{BC} = \arg\max_{c \in [1, C]} \sum_{t=1}^{T} w_{c,t}. $$
The Sum Rule adds up the probabilities of each class across all classifiers and selects the class with the highest score as the transportation mode. This is expressed as
$$ d_{Sum} = \arg\max_{c \in [1, C]} \sum_{t=1}^{T} p_{c,t}. $$
The Product Rule is the same as the Sum Rule except that it multiplies the probabilities of each class across all classifiers instead of adding them up. This is expressed as
$$ d_{Prod} = \arg\max_{c \in [1, C]} \prod_{t=1}^{T} p_{c,t}. $$
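The four fixed rules can be written compactly with NumPy. The following is a minimal sketch rather than the authors' implementation: the function name is hypothetical, classes are indexed from 0, and ties are broken in favour of the lowest class index, which is an assumption since the text does not state a tie-breaking policy.

```python
import numpy as np


def ensemble_decisions(probs):
    """Apply the four fixed rules to a (T, C) array of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[1]

    # Majority Voting: count how often each class is a classifier's top choice.
    votes = np.bincount(probs.argmax(axis=1), minlength=n_classes)

    # Borda Count: per classifier, weight 0 for the least likely class up to
    # C - 1 for the most likely one, then sum the weights over classifiers.
    order = probs.argsort(axis=1, kind="stable")
    weights = order.argsort(axis=1, kind="stable")
    borda = weights.sum(axis=0)

    return {
        "mv": int(votes.argmax()),
        "bc": int(borda.argmax()),
        "sum": int(probs.sum(axis=0).argmax()),
        "product": int(probs.prod(axis=0).argmax()),
    }


# Three classifiers, three classes (a toy example, not data from the paper).
example = ensemble_decisions([
    [0.90, 0.10, 0.00],
    [0.01, 0.50, 0.49],
    [0.30, 0.45, 0.25],
])
print(example)  # {'mv': 1, 'bc': 1, 'sum': 0, 'product': 1}
```

The toy example shows how the rules can disagree: one very confident classifier dominates the Sum Rule, while the Product Rule penalizes class 0 because another classifier assigns it a probability close to zero.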

C. Adaptive Fusion
In the adaptive fusion scheme, we try to learn the relationship between the outputs from multiple mono-modal classifiers and the joint decision with a classical machine learning classifier (adaptive fuser), such as naive Bayes (NB), decision tree (DT), random forest (RF) and multi-layer perceptron neural network (NN). Fig. 7 depicts the processing pipeline of the adaptive fusion scheme. The input to the adaptive fuser is the set of class probabilities $\{p_{c,t}\}$ with $c = 1, \ldots, C$ and $t = 1, \ldots, T$, and the output of the fuser is the joint decision $d_A \in [1, C]$.
The parameters of the fusing classifier model are obtained by feeding it the mono-modal classifier outputs for the training data $\{p_{c,t}\}_{train}$ and the ground-truth labels $\{Label\}_{train}$. The outputs $\{p_{c,t}\}_{train}$ are obtained via leave-one-fold-out cross-validation. Specifically, we divide the training data into $K$ folds. For each fold, we train the mono-modal classifier with the other $K-1$ folds and test it on this fold. Cascading the testing results for all the $K$ folds, we obtain the mono-modal classifier outputs for the whole training set, i.e. $\{p_{c,t}\}_{train}$. The mono-modal classifier output for the testing set $\{p_{c,t}\}_{test}$ is obtained by feeding the testing data to the classifier trained with the whole training set.
The adaptive fuser is implemented with the Python Scikit-learn library, using default parameters during training.
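The way the fuser's training inputs are assembled can be sketched as below. This is an illustration under stated assumptions, not the paper's code: `fit_predict` is a hypothetical callable standing in for training and applying one mono-modal classifier, folds are taken as contiguous blocks (the text does not say how folds were formed), and both function names are invented for this example.

```python
import numpy as np


def out_of_fold_outputs(n_samples, n_folds, fit_predict):
    """Score every training sample with a model that was not trained on it.

    fit_predict(train_idx, test_idx) is a hypothetical callable that trains a
    mono-modal classifier on train_idx and returns class probabilities of
    shape (len(test_idx), C) for the samples in test_idx.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)  # contiguous folds
    outputs = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        outputs.append(fit_predict(train_idx, test_idx))
    return np.concatenate(outputs, axis=0)  # same order as the training set


def fuser_input(per_modality_outputs):
    """Stack T arrays of shape (N, C) into the (N, T * C) input of the fuser."""
    return np.concatenate(per_modality_outputs, axis=1)
```

With three modalities and eight classes, each fuser input row has 24 values. A classifier such as a random forest could then be fitted on these rows and the ground-truth labels.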

D. Post-Processing
The classification system makes a decision every frame (5 seconds). Since the transportation mode of a user typically continues for a certain period and there is a strong correlation between neighbouring frames [18], we reasonably assume that the transportation mode will remain unchanged for a certain time, e.g. in a window consisting of F frames. We employ a majority voting scheme to further improve the recognition performance at individual frames.
Suppose the prediction result in the $f$-th frame is $d(f)$ and the results in the previous $F - 1$ frames are $d(f - F + 1), \ldots, d(f - 1)$. The number of occurrences of each activity in these $F$ consecutive frames is counted as $n_f(1), \ldots, n_f(C)$. The transportation mode of the current frame is determined as
$$ \hat{d}(f) = \arg\max_{c \in [1, C]} n_f(c). $$
In the experiment in Sec. V-D, we try different window lengths varying from 5 seconds (1 frame) to 180 seconds (36 frames) to see the impact of the window length on the recognition performance.
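The temporal smoothing step amounts to a causal majority vote over the last F decisions. A minimal sketch follows; the function name is hypothetical, the first frames simply vote over however many decisions are available, and ties are resolved in favour of the most recent tied decision, both of which are assumptions not stated in the text.

```python
from collections import Counter, deque


def smooth_decisions(decisions, window_frames):
    """Majority vote over the current frame and the previous window_frames - 1 frames."""
    recent = deque(maxlen=window_frames)  # oldest decisions drop out automatically
    smoothed = []
    for d in decisions:
        recent.append(d)
        counts = Counter(recent)
        best = max(counts.values())
        # Tie-break (assumption): prefer the most recent decision among tied classes.
        winner = next(x for x in reversed(recent) if counts[x] == best)
        smoothed.append(winner)
    return smoothed


raw = ["Bus", "Bus", "Car", "Bus", "Bus", "Walk", "Walk", "Walk"]
print(smooth_decisions(raw, 3))
# ['Bus', 'Bus', 'Bus', 'Bus', 'Bus', 'Bus', 'Walk', 'Walk']
```

The example shows both effects of smoothing: the isolated "Car" error is removed, while the true change from Bus to Walk is detected one frame late. Longer windows suppress more errors at the cost of a longer delay.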

A. Evaluation Measure
We use F1-score over all the activities to evaluate the recognition performance using the testing dataset.
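Let $M_{ij}$ be the $(i, j)$-th element of the confusion matrix. It represents the number of samples originally belonging to class $i$ which are recognized as class $j$. Let $C = 8$ be the number of classes. The F1-score is defined as
$$ recall_i = \frac{M_{ii}}{\sum_{j=1}^{C} M_{ij}}, \quad precision_i = \frac{M_{ii}}{\sum_{j=1}^{C} M_{ji}}, $$
$$ F1 = \frac{1}{C} \sum_{i=1}^{C} \frac{2 \cdot recall_i \cdot precision_i}{recall_i + precision_i}. $$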
We compare the F1-score achieved using one, two and three modalities, respectively. For two modalities, we consider the three pairwise combinations {motion, sound}, {motion, vision} and {sound, vision}.
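As an illustration of the evaluation measure, the sketch below computes a class-averaged F1-score from a confusion matrix whose rows are true classes and whose columns are predicted classes. It assumes the common macro-averaged definition (per-class F1 from recall and precision, then the mean over classes); the function name and the handling of empty classes are choices made for this example.

```python
def macro_f1(confusion):
    """Mean of per-class F1 scores; confusion[i][j] counts class i recognized as class j."""
    n_classes = len(confusion)
    scores = []
    for i in range(n_classes):
        true_positive = confusion[i][i]
        actual = sum(confusion[i])                     # row sum: samples of class i
        predicted = sum(row[i] for row in confusion)   # column sum: predictions of class i
        recall = true_positive / actual if actual else 0.0
        precision = true_positive / predicted if predicted else 0.0
        denom = recall + precision
        scores.append(2 * recall * precision / denom if denom else 0.0)
    return sum(scores) / n_classes


# Toy two-class example (not data from the paper).
toy_matrix = [[8, 2],
              [1, 9]]
print(round(macro_f1(toy_matrix), 4))  # 0.8496
```

Because every class contributes equally to the mean, rare classes such as Run weigh as much as frequent ones, which matters for an imbalanced dataset.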

B. Single Modality
The F1-scores of each mono-modal classifier are given in Table IV as a baseline performance. Sound achieves the highest recognition performance (82.2%), followed by motion (79.4%), and vision achieves the lowest performance (72.8%). Fig. 8(a) shows the confusion matrices for these 3 modalities.
Sound is better at classifying the vehicle activities (Car, Bus, Train and Subway) than motion sensors. This is because each vehicle typically emits a unique sound that distinguishes it from other activities, whereas the vehicles present similar motion patterns. Motion sensors are better at classifying pedestrian and biking activities (Still, Walk, Run, Bike) than sound. This is because these activities require strong user engagement, but emit sound which is much weaker than environmental noise. This implies that the combination of the two modalities potentially leads to a better recognition result. Vision performs poorly at distinguishing between Still, Walk and Run, possibly because the operating environments of the three activities are similar. Vision performs relatively better at distinguishing the remaining five activities. Vision performs better at distinguishing vehicle activities than motion, but worse than sound. However, vision performs the best when identifying the Subway activity. Some objects, such as people and seats, can be used to effectively infer the external environments. Overall, the recognition results using motion and using sound are truly complementary. Additionally using vision could further improve the discriminability between vehicle activities. Table V compares the data fusion results applied to all the possible combinations of the three sensor modalities.

C. Multimodality
For ensemble decision, the two probability-based approaches (Sum and Product) significantly outperform the two label-based approaches (MV and BC). When combining the three modalities, the highest F1-scores achieved by the probability-based and the label-based approaches are 94.5% (Product) and 89.9% (MV), respectively. Of the two probability-based approaches, the Product Rule (94.5%) performs slightly better than the Sum Rule (93.0%). When combining two modalities, the label-based fusion approaches do not show evident advantages over using a single modality, while the probability-based approaches achieve higher F1-scores. This is possibly due to the limited number of classifiers available (maximum three) for data fusion. The probability-based approaches perform more robustly when only a few classifiers are available. Furthermore, the label-based fusion approaches lose information by operating on a crisp decision, whereas probability-based approaches retain more information which can be exploited during fusion.
For adaptive fusion, random forest performs the best among the four fusers. When combining the three modalities, the four fusers achieve F1-scores of 92.3% (NB), 90.7% (DT), 95.5% (RF) and 94.6% (NN), respectively. In comparison, the ensemble decision method achieves the highest F1-score of 94.5% (Product), which is about 1 percentage point lower than the RF. When combining two modalities, the Product Rule and the RF achieve F1-scores of 91.5% vs 92.5% for {motion, sound}, 91.0% vs 90.9% for {sound, vision}, and 89.2% vs 90.0% for {motion, vision}, respectively. Overall, the adaptive fusion method performs slightly better than the ensemble decision method. The increase in performance by adaptive fusion can be explained by the capacity of machine learning classifiers to identify specific relationships between the classifier outputs and the joint decision. However, the downside is that in adaptive fusion the classifier might over-fit on the training data and thus may not generalize well to unseen data.
Fig. 8 shows the confusion matrices obtained by the different data fusion strategies. We consider Product and RF for ensemble decision and adaptive fusion, respectively. As suggested in Sec. V-B and also confirmed in Table V, data fusion can improve the recognition performance significantly by exploiting the complementarity between motion, sound and vision. For instance, {sound, motion} improves the recognition
For ease of comparison, we extract the diagonal elements of each confusion matrix and depict them in Fig. 9. The diagonal element indicates the ability of the classifier to identify the corresponding class. We only consider the Product Rule for data fusion. For single modality, motion performs the best at identifying Still, Walk, Run and Bike; sound performs the best at identifying Bus and Car; vision performs the best at identifying Train and Subway.
For dual modality, {motion, sound} performs the best at identifying Still, Walk and Run, performs as well as the other dual modalities at identifying Bus and Car, and performs worse at identifying Train and Subway. {Motion, vision} performs the best at identifying Walk, Run and Bike, and the worst at identifying Still, Car, Bus and Train. {Sound, vision} performs the best at identifying vehicles, including Train, Subway, Bus and Car, and performs the worst at identifying Still, Walk, Run and Bike. Finally, for triple modality, the performance at identifying each class activity improves over using dual modality.

D. Post-Processing Results
We apply post-processing to the multimodal classifier (Product Rule) with various combinations of sensors: {motion, sound}, {motion, vision}, {sound, vision} and {motion, sound, vision}. Fig. 10 depicts the post-processing results achieved with a smoothing window size ranging from 5 seconds (1 frame) to 180 seconds (36 frames) in steps of 5 seconds.
For each multimodal classifier, the post-processing performance shows a similar trend with increasing window size. The F1 score improves remarkably (e.g. from 91.5% to 94.4% for {motion, sound}) when the smoothing window size grows from 5 seconds to 15 seconds. The performance then improves quickly (e.g. from 94.4% to 96.0% for {motion, sound}) for window sizes between 15 and 40 seconds, and then slowly (e.g. from 96.0% to 96.8%) for window sizes between 40 and 80 seconds. The improvement becomes marginal when the smoothing window is larger than 80 seconds, and becomes unstable when the window is larger than 150 seconds.
This is explained by Fig. 11, which shows the duration of each continuous activity in the testing dataset. In Fig. 11, the y-axis denotes the cumulative distribution, i.e. the ratio (in percent) between the number of continuous activity periods with duration less than a certain value and the total number of continuous activity periods. It can be observed that only 2% of activities last less than 60 seconds, and 90% of activities last more than 200 seconds. A similar distribution can be observed in the training dataset (not shown here). This verifies the feasibility of temporal smoothing, and also indicates that a smoothing window of around 40-60 seconds is a reasonable choice.
While the recognition performance improves with the smoothing window size, in a mobile computing scenario the choice of post-processing window size needs to be decided based on the needs of the application. For applications requiring real-time response, a shorter post-processing window is desirable, while for applications computing longitudinal statistics, where high accuracy is preferable, a longer post-processing window should be employed.
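The effect of the smoothing window can be illustrated with a short sketch. The code below assumes that post-processing is a majority vote over a trailing (causal) window of frame-level decisions, with one decision every 5 seconds; this is an assumption made for illustration, and the function name and tie-breaking choice are hypothetical rather than taken from the evaluated implementation.

```python
from collections import Counter

FRAME_SECONDS = 5  # one classifier decision per 5-second frame


def smooth_labels(labels, window_seconds):
    """Replace each frame label by the most frequent label in a trailing
    window that ends at the current frame."""
    width = max(1, window_seconds // FRAME_SECONDS)
    smoothed = []
    for t in range(len(labels)):
        window = labels[max(0, t - width + 1): t + 1]
        counts = Counter(window)
        best = max(counts.values())
        # On a tie, keep the most recent label among the tied ones.
        for label in reversed(window):
            if counts[label] == best:
                smoothed.append(label)
                break
    return smoothed


# A bus ride with a single frame misclassified as Car.
raw = ["Bus"] * 4 + ["Car"] + ["Bus"] * 4
print(smooth_labels(raw, 5))   # 1-frame window: output equals the input
print(smooth_labels(raw, 15))  # 3-frame window: the isolated Car frame is removed
```

A trailing window only uses past frames, so it suits applications that need a timely response; the price is a delay of up to about half the window length when the transportation mode really changes, which is why short activities limit how long the window can usefully be.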

VI. GENERALIZATION TO UNSEEN DATA
In Sec. V, the mono-modal classifiers and the adaptive fusers are trained and tested using different folds of data from the same user (User 1, U1). In this section, we further investigate the generality of the proposed method with a new dataset (U23), which contains the data from User 2 and User 3 in the SHL dataset. The data collection protocol of U23 was the same as for U1: a smartphone at the hand position and a body-worn camera. The total duration of the data in U23 is 356 hours, with the duration of each class activity shown in Fig. 12.
We perform mono-modal classification with the three modalities, and perform multimodal fusion with the Product-based ensemble decision and the RF-based adaptive fusion. We apply the same mono-modal classifiers and adaptive fusers that were trained with U1 in Sec. V to U23. This means that training and testing are conducted with data from different users, which is a more challenging task because the three users in the SHL dataset tend to have different behaviours, habits and device wearing styles. This allows us to evaluate the performance of the algorithms on an 'unseen' dataset. Fig. 13 depicts the recognition and fusion results in terms of F1 score and confusion matrix. Comparing Fig. 8 (for U1) and Fig. 13 (for U23), it can be observed that the recognition performance (F1 score) of the three mono-modal classifiers drops significantly when training and testing with different users. Specifically, the performance of the motion classifier drops 16.1pp (percentage points) from 79.4% to 63.3%; the sound classifier drops 12.6pp from 82.1% to 69.5%; and the vision classifier drops 26.8pp from 72.8% to 46.0%. Such behaviour is expected, as we evaluate a user-specific model on unseen users. Among the three modalities, sound is the most robust to user variation, as the microphone captures the sound of the surrounding environment, which is not affected by user behaviour. The motion modality is less robust than sound, as motion behaviour varies between users. The vision modality is the least robust to user variation, possibly because of the different styles in which users carried the body-worn camera.
The recognition performance (F1 score) of the multimodal fusion also drops. For instance, the Product-based tri-modal classifier {motion, sound, vision} drops 9.6pp from 94.5% to 84.9%, and the RF-based tri-modal classifier drops 12.3pp from 95.5% to 83.2%. Nevertheless, the benefit of fusion and the complementarity of the three modalities can still be observed from the confusion matrices in Fig. 13. For instance, motion is good at distinguishing between pedestrian activities and poor at vehicle activities; sound is good at distinguishing between vehicle activities and poor at pedestrian activities. It can also be observed that, in this experiment, motion performs poorly at identifying the Bike activity while sound and vision both perform better. Taking advantage of this complementarity, multimodal fusion always improves the performance over the mono-modal classifiers. For instance, the Product-based tri-modal classifier improves on the sound classifier (the best performing mono-modal classifier) by 15.4pp, from 69.5% (sound) to 84.9%. More importantly, this improvement (15.4pp for U23) is even higher than what we achieved for U1 (12.4pp, from 82.1% to 94.5%, in Fig. 8). This implies that multimodal fusion can improve the robustness to user variation.
Both the RF-based and the Product-based fusion schemes are effective at improving the performance over the mono-modal classifiers. However, the RF-based scheme is less robust to user variation than the Product-based scheme. For instance, the RF-based tri-modal fusion (83.2%) performs 1.7pp lower than the Product-based fusion (84.9%) for U23 (in Fig. 13). This is in contrast to the result reported for U1 (in Fig. 8), where the RF-based scheme is 1pp higher than the Product-based scheme.
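The difference between a fixed rule and a learned fuser can be illustrated with the simplest of the four adaptive fusers, Naive Bayes. The sketch below is not the fuser evaluated in this paper: it assumes, for brevity, that the fuser receives only the hard label predicted by each mono-modal classifier (the evaluated fusers may instead use the full class-probability vectors), and the class name, toy data and smoothing constant are all hypothetical.

```python
from math import log


class NaiveBayesFuser:
    """Categorical Naive Bayes over the labels predicted by each modality."""

    def __init__(self, n_classes, alpha=1.0):
        self.n_classes = n_classes
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, predictions, truths):
        """predictions: list of tuples, one predicted class index per modality.
        truths: list of true class indices."""
        n_mod = len(predictions[0])
        self.class_counts = [0] * self.n_classes
        # counts[m][y][p]: how often modality m outputs p when the truth is y
        self.counts = [[[0] * self.n_classes for _ in range(self.n_classes)]
                       for _ in range(n_mod)]
        for preds, y in zip(predictions, truths):
            self.class_counts[y] += 1
            for m, p in enumerate(preds):
                self.counts[m][y][p] += 1
        return self

    def predict(self, preds):
        """Return the class with the highest smoothed log-posterior."""
        total = sum(self.class_counts)
        best, best_score = None, float("-inf")
        for y in range(self.n_classes):
            score = log((self.class_counts[y] + self.alpha)
                        / (total + self.alpha * self.n_classes))
            for m, p in enumerate(preds):
                score += log((self.counts[m][y][p] + self.alpha)
                             / (self.class_counts[y] + self.alpha * self.n_classes))
            if score > best_score:
                best, best_score = y, score
        return best


# Toy training set with 3 classes and 2 modalities: modality 0 outputs class 1
# for both class 1 and class 2, while modality 1 separates them.
train_preds = [(0, 0)] * 4 + [(1, 1)] * 4 + [(1, 2)] * 4
train_truth = [0] * 4 + [1] * 4 + [2] * 4
fuser = NaiveBayesFuser(n_classes=3).fit(train_preds, train_truth)
print(fuser.predict((1, 2)))  # resolves the disagreement in favour of class 2
```

The fuser learns, from the training user, how each modality tends to err for each true class. This explains both the gain on data from the same user and the sensitivity to user variation: the learned error patterns need not hold for a different user, whereas a fixed rule such as Product has nothing to over-fit.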
If we apply post-processing, the recognition performance can be further improved. For instance, while the details are not reported in the paper, we observe that the F1 score of the Product-based tri-modal classifier is improved by 4.8pp, from 84.9% to 89.7%, with a smoothing window length of 45 seconds.
In short, the experimental results above verify the generality of the proposed multimodal fusion methods. While the recognition performance of the mono-modal classifiers drops due to user variation, the complementarity of the three modalities can still be observed, and the fusion of any two or three modalities always improves the recognition performance over the mono-modal classifiers and also improves the robustness to user variation. Both the Product-based and the RF-based fusion schemes work well for multimodal fusion, although the RF-based scheme shows a slight performance drop due to user variation.
We would like to highlight that the experiment in this section mainly aims to assess the generality of the proposed method, rather than to develop a user-independent recognition system, which in practice should be trained on multiple users to capture the user variation. However, this is the best evaluation of generalization to new data that we can perform using the SHL dataset (with only 3 users), taking into account that no other suitable public multimodal transportation datasets are available for a similar analysis.

VII. DISCUSSION
The mono-modal classifiers employed in this paper are adapted directly from our previous work [18], [23], [24], and are comparable to the state of the art. The sound and vision classifiers are among the first to be applied to transportation mode recognition. The motion classifier was used to benchmark the SHL Challenge 2018 [17] and performed slightly worse than the winner of that challenge [32]. Since this paper mainly focuses on multimodal fusion, we did not aim to maximize the performance of each mono-modal classifier. However, we believe that, with the late integration strategy, the performance of the multimodal classifier would be further improved if each mono-modal classifier were optimized independently. There are two interesting directions that could be investigated.
• First, all three mono-modal classifiers employ a convolutional neural network. A recurrent neural network (e.g. LSTM [33]) could be employed to exploit the temporal correlation and to improve the recognition performance for time-series signals such as motion and sound. In principle, a recurrent network could also be applied to video streams; however, the images in the SHL dataset come from a time-lapse camera which took a picture every 30 seconds.
• Second, over-fitting is a crucial issue in activity recognition: for example, a classifier trained on one user with a specific sensor placement tends to show degraded performance for other users and sensor placements [10]. In this paper, the mono-modal classifiers simply employ a generic technique, e.g. dropout [34], to tackle the over-fitting problem. While the classifiers show promising results on the same user, the performance drops significantly in case of user variation. In future, more techniques could be employed to tackle the over-fitting problem, including some techniques reported in the SHL challenge, such as augmented learning, transfer learning, and designing hand-crafted features [32], [35], [36].
In essence, multimodal fusion improves the recognition performance at the expense of more sensor channels and higher computational complexity. For instance, we use a computer equipped with an Intel i7-4770 4-core CPU @ 3.40 GHz with 32 GB memory, and a GeForce GTX 1080 Ti GPU with 3584 CUDA cores @ 1.58 GHz and 11 GB memory; the computation time of applying each mono-modal classifier to the testing dataset of User 1 is 7.5 seconds, 70.3 seconds, and 196.9 seconds for motion, sound and vision, respectively. The vision classifier has the largest computational complexity, followed by sound and motion. The sound classifier has higher computational complexity than the motion classifier, as the sampling rate of sound (8 kHz) is much higher than that of motion (100 Hz).
Current and upcoming smartphones offer increasingly powerful hardware acceleration for inference, which makes even seemingly complex models suitable for embedded execution (e.g. [37]). While a mobile phone implementation would show different numbers, the relative complexity of the different modalities is likely to be similar.
In this paper we focused on a modular "late integration" fusion approach, which allows classifiers to be combined flexibly. Future work may explore the dynamic selection of classifiers to fuse at any given time to meet particular application requirements, such as maximizing performance overall or for a particular set of activities, or minimizing power consumption. While a tri-modal system performs the best, among dual-modality systems {sound, motion} achieves robust performance in most cases. For battery consumption optimization and low-powered devices, we recommend using the motion sensors in the first instance, then combining them with the sound sensor and finally with the vision sensor. Also, for the specific task of recognizing the transportation mode, an efficient solution would be to combine all three modalities but to prioritize motion by limiting the use of vision and sound. For instance, let us assume that motion could classify Still, Walk, Run, Bike and Vehicle. If the class is Vehicle, then sound and vision can be used to further classify Car, Bus, Train and Subway. We can also activate modalities in specific situations. For instance, if the user is in a Bus, we only activate walk detection based on motion.
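The cascaded strategy suggested above (motion first, sound and vision only when a vehicle is detected) can be sketched as follows. This is a hypothetical illustration of the control flow, not something evaluated in this paper: the function names, the frame dictionary layout and the two stub classifiers are assumptions introduced only for the example.

```python
def cascaded_recognition(frame, motion_clf, vehicle_clf):
    """Run the cheap motion classifier first and call the more expensive
    sound and vision classifier only when motion reports 'Vehicle'.
    Returns the decision and the list of modalities that were used."""
    coarse = motion_clf(frame["motion"])
    if coarse != "Vehicle":
        return coarse, ["motion"]
    fine = vehicle_clf(frame["sound"], frame["vision"])
    return fine, ["motion", "sound", "vision"]


# Stub classifiers standing in for trained models (hypothetical): the
# "features" in this toy example are simply the labels to be returned.
def stub_motion(motion_features):
    return motion_features


def stub_vehicle(sound_features, vision_features):
    return sound_features


frames = [
    {"motion": "Walk", "sound": None, "vision": None},
    {"motion": "Vehicle", "sound": "Bus", "vision": "Bus"},
]
for frame in frames:
    print(cascaded_recognition(frame, stub_motion, stub_vehicle))
```

With such a cascade, the microphone and camera stay off during pedestrian activities, so the energy cost of the two most expensive modalities is only paid while the user is in a vehicle.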

VIII. CONCLUSION
We applied data fusion methods to combine the output of three expert classifiers, dealing respectively with motion, sound and vision data, in order to improve the recognition of eight different transportation modes. Two sets of fusion techniques, ensemble decision (Majority Voting, Borda Count, Sum Rule and Product Rule) and adaptive fusion (Naive Bayes, Decision Tree, Random Forest, Neural Network), are considered. Experimental results demonstrate that, by fusing any two modalities or all three modalities, better recognition performance can always be achieved than with a single modality. The proposed multimodal fusion methods show good generality and can improve the robustness to user variation.
If we only look at the recognition result on the testing dataset of User 1, the best single-modality performance is achieved by sound (F1 82.1%). When fusing three modalities, the best performance is achieved by the Product Rule (F1 94.5%) for ensemble decision, and by Random Forest (F1 95.5%) for adaptive fusion. This improves the recognition performance by 12.4pp (percentage points) and 13.4pp over the best mono-modal classifier (sound), respectively. The adaptive fuser improves performance by 1pp in the best case {motion, sound, vision} compared to ensemble decision, and at worst leads to only a minor decline in performance (-0.1pp with {sound, vision}). This indicates that, as an overall recommendation, an adaptive fuser should be favoured in the majority of cases. In addition, post-processing, which recognizes the transportation mode over a longer period of time, already improves the F1 score by 2pp with a 15-second window and 4pp with a 45-second window. By comparing multiple combinations of the modalities, i.e. {motion, sound}, {sound, vision}, {motion, vision} and {motion, sound, vision}, we deduced that dual-modality systems should prioritize motion and sound for better robustness and power efficiency.
Although the adaptive fuser performs better than ensemble decision for multimodal fusion, future development should consider the generalization of our methods to external datasets, as over-fitting might have occurred because the sound and vision data are closely related to the environment of the country where the dataset was collected (United Kingdom). However, to our knowledge, no other dataset exists to date for such an analysis.
More important than demonstrating a particular numerical value of the performance increase through fusion, this work opens up a space for the design of activity-aware systems on smartphones which are able to dynamically balance power and performance requirements according to the needs of an application. Thanks to the modular "late integration" fusion approach which we follow here, the number of modalities and classifiers which are combined can easily be modulated. Exploring such dynamic fusion remains the object of future work. Also, we might consider comparing the results of this research with a more complex classifier that takes all three modalities directly as input. Although such a more complex approach would be desirable, the presented fusion methods already outperform mono-modality based classifiers.
Sebastien Richoz received the bachelor's degree in computer science majoring in software engineering and the master's degree in engineering majoring in information and communication technologies from the University of Applied Sciences HES-SO, Switzerland, in 2017 and 2019, respectively. He joined the University of Sussex to work on human activity recognition and cancer research using machine learning techniques, image processing, and object detection methods. He is a Research Assistant with the School of Engineering and Informatics, University of Sussex. His research interests include computer vision and artificial intelligence.
His research focuses on computational behavior analytics such as the use of machine learning techniques, miniature intelligent sensor systems, and other data sources to recognize, qualify, quantify, and eventually understand human behaviors and the wider context in which they occur. He has established a number of recognized datasets for human activity recognition from wearable sensors, in particular the OPPORTUNITY dataset.
He has put forward human activity recognition pipelines capable of adaptation and exploiting opportunistic sensing, and more recently capable of lifelong learning for open-ended activity recognition.