Enabling Reproducible Research in Sensor-Based Transportation Mode Recognition With the Sussex-Huawei Dataset

Transportation and locomotion mode recognition from multimodal smartphone sensors is useful for providing just-in-time context-aware assistance. However, the field is currently held back by the lack of standardized datasets, recognition tasks, and evaluation criteria. Recognition methods are often tested on ad hoc datasets acquired for one-off recognition problems and with different choices of sensors. This prevents a systematic comparative evaluation of methods within and across research groups. Our goal is to address these issues by: 1) introducing a publicly available, large-scale dataset for transportation and locomotion mode recognition from multimodal smartphone sensors; 2) suggesting 12 reference recognition scenarios, which are a superset of the tasks we identified in the related work; 3) suggesting relevant combinations of sensors to use based on energy considerations among accelerometer, gyroscope, magnetometer, and global positioning system sensors; and 4) defining precise evaluation criteria, including training and testing sets, evaluation measures, and user-independent and sensor-placement independent evaluations. Based on this, we report a systematic study of the relevance of statistical and frequency features based on information-theoretic criteria to inform recognition systems. We then systematically report the reference performance obtained on all the identified recognition scenarios using a machine-learning recognition pipeline. The extent of this analysis and the clear definition of the recognition tasks enable future researchers to evaluate their own methods in a comparable manner, thus contributing to further advances in the field. The dataset and the code are available online at http://www.shl-dataset.org/.


I. INTRODUCTION
Today's mobile phones come equipped with a rich set of sensors, including accelerometer, gyroscope, magnetometer, global positioning system (GPS) and others, which can be used to discover user activities and context [1], [2]. Transportation and locomotion modes are an important element of the user's context that denotes how users move about, such as by walking, running, cycling, driving a car, or taking a bus or subway (Fig. 1) [3], [4]. Transportation and locomotion mode recognition is useful for a variety of applications, such as human-centered activity monitoring [5], [6], individual environmental impact monitoring [7], [8], just-in-time distributed intelligent service adaptation [9], [10], and implicit human-computer interaction [11]-[13].
In recent years, numerous studies have shown how to recognize transportation modes from multimodal smartphone sensor data with machine learning techniques [3], [4], [14]. However, there is still no widely recognized dataset that the research community can use for performance evaluation. To date, most research groups assess the performance of their algorithms using their own collected data, which cover different numbers of transportation activities and sensor modalities. Due to the complexity of the data collection procedure and the need to protect participant privacy, these ad-hoc datasets often have a short duration and remain private. This prevents the comparison of different approaches in a replicable and fair manner within and across research groups and impedes progress in this research area.
Considering this, we believe that there is a need for advancing reproducible research in sensor-based transportation and locomotion mode recognition. This requires publicly available datasets, common recognition tasks (i.e. number and type of transportation and locomotion classes to recognize), common combinations of sensors to use, and identical evaluation procedures. Ideally, these datasets should contain sufficient transportation activities, sensor modalities, and recording duration in order to verify the versatility of the developed algorithms. The recognition tasks and evaluation measures should cover the most common application needs currently identified by the research community and should be forward looking to accommodate upcoming application needs. The objective of this paper is to support reproducible and comparable research within and across research groups in the field of transportation mode recognition.
Other research communities have acknowledged the need to establish reference recognition tasks to support scientific advances in their field. This is the case, for example, in computer vision with the PASCAL Visual Object Classes challenge [15] or the ImageNet Large Scale Visual Recognition Challenge [16] and in speech recognition with the CHiME corpus and recognition challenge [17].
We have previously introduced the large-scale Sussex-Huawei Locomotion (SHL) dataset, which was recorded over a period of seven months by three participants engaging in eight transportation activities in real-life settings: Still, Walk, Run, Bike, Car, Bus, Train and Subway [18], [19]. The dataset contains multimodal data from 16 smartphone sensors, which are precisely annotated and amount to about 2800 hours. We use this dataset as a baseline to establish a standardized evaluation framework and to promote reproducible research in the field. The contributions of the paper are summarized below.
1) Survey of the state of the art. We conducted a comprehensive review of 34 academic articles published in recent years on the problem of transportation mode recognition. We analyze the state of the art in terms of dataset availability, including sensor modalities and number of classes, and in terms of recognition pipeline characteristics, including processing window size, features, classifiers, and post-processing techniques. To our knowledge, this is one of the most comprehensive literature reviews in the field of transportation mode recognition from mobile devices, and it gives readers a clear understanding of the state of the art. Through this analysis, we found that the lack of a standard dataset, unified recognition tasks and evaluation criteria prevents a fair comparison between different research groups, and thus holds back the progress of research in the field. This paper aims to address these challenges with the SHL dataset, which is one of the biggest publicly available datasets in the field.
2) Standardized evaluation framework with baseline implementation. To enable reproducible research, we precisely define a standardized evaluation process. This contribution will enable researchers to compare methods like for like: they will be able to use the exact same tasks to compare methods, which helps to clearly identify the benefits of novel methods. The framework consists of 12 evaluation scenarios, 6 groups of sensor modalities, and 3 types of cross-validation schemes, leading to 792 recognition tasks in total. These tasks are defined considering both the sensor modalities of the SHL dataset and the various recognition tasks we identified from our related work review. We implemented a basic recognition pipeline to report baseline performance for all these tasks and will make the source code publicly available. Researchers in this field will have several options to develop new methods based on our evaluation framework. They will be able i) to evaluate their newly developed algorithms with this dataset and the evaluation tasks; ii) to apply the baseline recognition system to their own dataset; iii) to create the recognition tasks recommended in this paper with their own dataset and algorithms, and to compare with the baseline results reported in the paper. We believe this will significantly advance research in this field.
3) Feature analysis and feature selection based on the SHL dataset. The large amount of data in the dataset allows us to conduct a thorough analysis of the ability of a large set of features to distinguish between any two transportation activities. We propose a large set of features (2724 in total), which includes all the features considered in the literature plus additional features computed from the time-domain quantile values and frequency-domain subband energies. We propose a feature analysis method based on mutual information. The method visualizes the ability of each feature and sensor modality to distinguish any two transportation activities. We further propose a feature selection method based on pair-wise maximum-relevance-minimum-redundancy (MRMR), which selects a small set of features suitable for recognizing the 8 activity classes. The large set of features, the feature analysis and visualization, and the feature selection method are new in this research field. This will give readers a better understanding of the dataset, and will help them to identify better features and develop new recognition methodologies in their work. Thanks to this, our work is the first to show clearly which frequency bands contain the most valuable information to distinguish transportation modes, and the first to clearly identify that magnetic field sensors provide additional critical information to distinguish between modes of transport, contrary to a commonly held assumption.
The organization of the paper is as follows. After reviewing the state of the art in Sec. II, we introduce the SHL dataset in Sec. III and recommend a list of standard transportation mode evaluation tasks in Sec. IV. We perform feature analysis in Sec. V and establish the baseline performance in Sec. VI. After a discussion in Sec. VII, we draw conclusions in Sec. VIII.

II. STATE OF THE ART
A. APPROACHES TO TRANSPORTATION MODE RECOGNITION
Fig. 1 depicts a basic processing pipeline for predicting the transportation mode using the multimodal sensors embedded in the smartphone carried by the user. The multimodal sensor data (such as inertial and GPS) are first segmented into frames with a sliding window. The data in each frame is used to compute a vector of features. These feature vectors are processed by a classifier which aims to recognize the transportation mode of the user.
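The three stages of this pipeline (framing, feature computation, classification) can be illustrated with a minimal sketch. It is not the baseline implementation of this paper: the function names, the toy signal, the two-element feature vector and the nearest-centroid classifier with hand-picked centroids are all assumptions chosen only to keep the example short and dependency-free.

```python
from statistics import mean, pstdev

def sliding_frames(samples, frame_len, hop):
    """Split a sequence into frames of frame_len samples, advancing by hop samples."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def frame_features(frame):
    """Toy feature vector (mean, standard deviation) for one frame."""
    return (mean(frame), pstdev(frame))

def nearest_centroid(feature, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((f - c) ** 2 for f, c in zip(feature, centroids[label]))
    return min(centroids, key=dist)

# Synthetic 1-D signal: a quiet segment followed by a strongly varying one.
signal = [0.0, 0.1] * 8 + [0.0, 2.0] * 8
frames = sliding_frames(signal, frame_len=8, hop=4)   # 50% overlap
features = [frame_features(f) for f in frames]

# Hypothetical class centroids in (mean, std) space, chosen by hand.
centroids = {"Still": (0.05, 0.05), "Walk": (1.0, 1.0)}
labels = [nearest_centroid(f, centroids) for f in features]
print(labels)
```

In a real system the classifier would be trained on labelled frames rather than configured by hand, and the feature vector would contain many more elements.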
Table 1 gives a comprehensive summary of the literature on transportation and locomotion mode recognition, which can be categorized into three families: inertial based, location based, and hybrid. Inertial based approaches employ inertial sensors to detect the acceleration (accelerometer), rotation (gyroscope) and ambient magnetic field (magnetometer) of the mobile device, and predict the transportation mode of the user based on the motion pattern of the mobile device itself [20]-[35], [54]. Location based approaches employ the GPS receiver to detect the location of the mobile device, and predict the transportation mode based on the motion pattern of the user, such as GPS speed, GPS acceleration, and the trajectory of the trip [38]-[47]. Geographic information systems (GIS) can be used to further improve the recognition accuracy by exploiting information such as the closeness to train stations, bus stops, rail lines, and roads [43], [44], [46]. Hybrid approaches combine inertial and GPS sensors to predict the transportation mode and thus usually perform better than using one modality alone [48]-[53]. We analyze the state of the art from four aspects: dataset and sensor modality, number of classes, decision window, and type of classifier.
Dataset and modality. Due to the cost and time required to collect and annotate datasets, most research groups working with inertial sensors used datasets with limited duration (dozens of hours). Due to the earlier availability of accelerometers on mobile phones, the majority of datasets to date include the accelerometer as the sole modality. Some exceptions include three datasets with multiple modalities (accelerometer, gyroscope, magnetometer) but a limited duration of 12 hours [24], 25 hours [23], and 13 hours [55], respectively; two datasets with a single modality (accelerometer) but a long duration of 100 hours [25] and 890 hours [29], respectively; and a large dataset with multiple inertial modalities and a long duration of 8311 hours [20], [21]. A common problem is that none of the datasets mentioned above is publicly available, except [55] with 13 hours of data. Most research groups working with GPS sensors alone used large datasets containing hundreds to thousands of trips. Geolife, a large dataset with GPS information from 9043 trips, is publicly available [56].
Only a few research groups work on hybrid approaches, including [49], [50], [52], who used datasets with a duration between 100 and 350 hours. Currently all the datasets reported with hybrid approaches contain only two modalities, i.e. GPS and accelerometer. All these datasets have far fewer modalities than the SHL dataset. The richest dataset [50] contains 2 modalities and 355 hours of data, which is significantly less than SHL with 16 modalities and 2800 hours of data.
Number of classes. Most papers reviewed report a different classification task, ranging from recognizing three transportation classes (e.g. Walk, Car and Train [34]) to ten (e.g. Still, Walk, Run, Bike, Motorcycle, Car, Bus, Subway, Train, and High speed rail [20]). Among the various transportation activities, the most frequently considered ones are Still, Walk, Run, Bike, Car, Bus, Train and Subway. This variety of recognition tasks hinders reproducible research.
Decision window size. The sensor data are divided into frames with a sliding window and processed per frame. There is a trade-off when choosing the size of the sliding window, which affects the classification accuracy, response time (latency), and memory size [22], [26]. The preferred choice of window size varies in the papers we reviewed. Generally, inertial based approaches use a short window varying from 1 second to 18 seconds, aiming at real-time decision. The most widely used choice is around 5 seconds. An exception was reported in [37], which used a barometer sensor alone to predict the mode of transportation within a window size of 200 seconds. Location based approaches usually employ a long window varying from several minutes to tens of minutes or even the entire trip. In the latter case, the decisions are made offline with applications in travel surveys. Hybrid approaches target real-time decision by combining inertial and GPS sensors, and thus prefer a short window with sizes similar to the ones used in inertial based approaches.
Classifier. Various classifiers have been employed for the recognition task. Decision tree (DT), K-nearest neighbour (KNN), support vector machine (SVM) and naive Bayes (NB) are the most frequently used classifiers. Several schemes were proposed to improve the classification performance, such as ensemble classifiers, multi-layer classifiers, and post-processing. AdaBoost [22], [40] and random forest (RF) [24], [29], [32], [35], [40], [43], [50] combine a set of simple classifiers into an ensemble decision. Multi-layer classifiers typically perform a coarse-grained distinction between pedestrian and motorized transportation in the first tier, and then perform a fine-grained classification in the subsequent tiers [22], [25]-[27], [53]. Post-processing can reduce the classification error effectively by using a voting scheme which exploits the temporal correlation between consecutive frames [22], [28] or by using a hidden Markov model (HMM) to capture the transition probability between different classes [48]-[50], [52]. Long-term features were computed using the information from the whole trip to improve the classification accuracy in short segments [25]. Deep learning, which attracts significant interest in the machine learning community, was recently applied to the transportation mode recognition task [21], [41].

B. FEATURES FOR TRANSPORTATION MODE RECOGNITION
Feature computation is key to transportation mode recognition. Most publications report a different scheme to compute features from the multimodal sensor data. To help understand the state of the art, we first summarize the data channels that are used to compute features from various modalities (Table 2), and then summarize the specific features that are computed in each data channel (Tables 3 and 4). Table 2 lists the data channels that are used to compute features from inertial and GPS sensors. The accelerometer, which measures the acceleration along three device axes, is the most favoured modality among inertial sensors. Since the pose and orientation of the mobile device are typically unknown, several approaches have been proposed to extract orientation-independent information, e.g. by computing the magnitude which combines acceleration from three axes [20], [22], [23], [27], [28], [30]-[36], [49]-[54], by decomposing the magnitude along a vertical and horizontal earth coordinate system [23], [25], [33], or by projecting the raw acceleration of the three device axes into a 3D earth coordinate system [26], [29], [48]. The magnitudes of the data from other modalities, including gyroscope [20], [22], [32], [35], magnetometer [20], [22], [32], [35] and barometer [26], [37], have also been used for feature computation.
Table 3 lists the specific features that can be computed in each inertial sensor data channel (Table 2), in the time domain and in the frequency domain. The time-domain features are computed from a frame of samples, while the frequency-domain features are computed from the fast Fourier transform (FFT) of a frame of samples. Mean, standard deviation, mean crossing rate, and energy are among the most popular time-domain features. Quantile values and quantile ranges are widely used to represent the minimum, maximum, median value and interquartile range of the samples in a frame. However, the choice of quantiles appears to be rather ad-hoc in the literature. Statistical measures such as auto-correlation, kurtosis, and skewness are less frequently reported. The most used frequency-domain feature is the frequency with the highest energy peak. The energy in different frequency bands is also a widely used feature. However, the choice of specific subbands appears to be rather ad-hoc in the literature. For instance, [25], [49] considered the energy specifically at 1 Hz, 2 Hz, ..., 10 Hz, while [51] considered the energy between 0 and 1 Hz, 1 and 3 Hz, 3 and 5 Hz, and 5 and 16 Hz. Some statistical features such as the ratio between the first and the second FFT peaks, and the mean and standard deviation of the FFT coefficients, have also been suggested.
TABLE 4 (partial). GPS features and references: standard deviation (variance) [38]-[40]; sinuosity [40]; range, interquartile range [40]; max [47]; quantiles 25 and 75 [40], [46]; quantile 95 [42], [46]; three maximum values [38]-[40]; three minimum values [40]; autocorrelation, kurtosis, skewness [40]; heading change rate [40], [43]; velocity change rate, stop rate [40].
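The orientation-independent magnitude and a few of the popular time-domain features named above can be sketched as follows. The function names are illustrative, and the exact definitions of mean crossing rate and energy vary between publications; the forms used here (fraction of consecutive sample pairs crossing the frame mean, and mean of squared samples) are assumptions rather than the definitions of any specific reference.

```python
import math
from statistics import mean, pstdev

def magnitude(x, y, z):
    """Magnitude of tri-axial samples; unchanged when the axes are permuted or rotated."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]

def mean_crossing_rate(frame):
    """Fraction of consecutive sample pairs lying on opposite sides of the frame mean."""
    m = mean(frame)
    above = [v > m for v in frame]
    crossings = sum(1 for a, b in zip(above, above[1:]) if a != b)
    return crossings / (len(frame) - 1)

def time_domain_features(frame):
    """Small dictionary of time-domain features for one frame (assumed definitions)."""
    return {
        "mean": mean(frame),
        "std": pstdev(frame),
        "mean_crossing_rate": mean_crossing_rate(frame),
        "energy": sum(v * v for v in frame) / len(frame),
    }

# Toy tri-axial frame (arbitrary units).
x = [3.0, 0.0, 3.0, 0.0]
y = [4.0, 0.0, 4.0, 0.0]
z = [0.0, 0.0, 0.0, 0.0]
mag = magnitude(x, y, z)
feats = time_domain_features(mag)
print(mag, feats)
```

Swapping the axes, as happens when the phone is carried in a different orientation, leaves `mag` and therefore all derived features unchanged, which is the motivation for using the magnitude channel.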
Table 2 also lists the data channels that can be derived from the GPS sensors, including speed, acceleration, turning angle and trajectory. These data channels are inferred from the change of GPS location over time. Table 4 lists the specific features that can be computed in these data channels. GPS features are usually computed in the time domain only. Mean and standard deviation are the two most popular features computed from speed, acceleration and turning angle. Different choices of quantiles and quantile ranges (e.g. max, quartile, and interquartile range) and statistics (e.g. kurtosis and skewness) are also widely computed from speed, acceleration and turning angle. Several advanced features, including heading change rate, stop rate, and velocity change rate, have also been proposed and are computed over a single trip. For hybrid approaches, which compute GPS features in a short window, only the mean and standard deviation of speed or acceleration are used [48]-[53].
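As an illustration of how a speed channel can be inferred from the change of GPS location over time, the sketch below uses the haversine great-circle distance between consecutive fixes. The haversine formula, the mean Earth radius constant and the `(t, lat, lon)` tuple layout are assumptions made for this example; the reviewed papers may derive speed differently (for instance from the speed value reported by the receiver).

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def speeds_mps(fixes):
    """Speed in m/s between consecutive fixes given as (t_seconds, lat_deg, lon_deg)."""
    out = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        dt = t1 - t0
        if dt > 0:  # skip duplicated or out-of-order timestamps
            out.append(haversine_m(la0, lo0, la1, lo1) / dt)
    return out

# Hypothetical fixes: moving north by 0.0001 degrees every 10 seconds.
fixes = [(0, 0.0, 0.0), (10, 0.0001, 0.0), (20, 0.0002, 0.0)]
speeds = speeds_mps(fixes)
mean_speed = sum(speeds) / len(speeds)
print(speeds, mean_speed)
```

Acceleration can be obtained in the same way by differencing consecutive speed values, and the mean and standard deviation of each channel then serve as short-window features.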
To summarize, while transportation mode recognition has been investigated intensively and great advances have been reported in recent years, the work of the various research groups was conducted in a rather isolated way, with few connections across the research community. Each work appears to define its own transportation mode classification problem (e.g. the number of activities considered), proposes a solution with different parameters (e.g. window size, sensor modality, classifier), and is often verified with ad-hoc datasets which are not publicly available. A fair comparison of results between different groups is therefore very difficult. As the number of publications increases, this holds back research advances in this area, as it prevents systematic comparative evaluation of novel methods or sensors.
The research community has proposed a large number of features for transportation mode recognition. While effective, these features appear to be defined in a rather ad-hoc manner and they are computed from different modalities. In particular, there is little consensus in the literature on which time-domain quantiles and subband energies to employ as features.

III. SHL DATASET
The University of Sussex-Huawei Locomotion (SHL) dataset is a major outcome of our large-scale longitudinal data collection campaign, which collected 2812 hours of labeled data over a period of 7 months, corresponding to 17,562 km travelled in the south-east of the UK including London [18], [19]. The SHL dataset was recorded by three participants engaging in eight transportation and locomotion activities in real-life settings: Still, Walk, Run, Bike, Car, Bus, Train and Subway. Each participant carried four Huawei Mate 9 smartphones at four body positions simultaneously: in the hand, at the torso (located in a shirt or jacket pocket or a torso strap), at the hip, and in a backpack or handbag (Fig. 2). Each smartphone logged the data of the 16 sensors available in the smartphone, including inertial sensors, GPS, ambient pressure sensor, ambient humidity, etc. The data from the four smartphones leads to a total duration of 4 × 703 = 2812 hours. In addition to the smartphones, each participant wore a front-facing camera to record images of the environment during the journey, which were used to precisely annotate the activities of the user. Table 5 indicates the characteristics of the dataset. Fig. 3 depicts (a) the duration of each transportation activity performed by the three participants and (b) the duration of the transportation activities where the GPS data is available. The GPS information might not always be available during the journey, e.g. when the user is taking a subway or is staying inside a building. In the dataset, we regard a segment as 'GPS off' if it has no GPS information available for more than 10 seconds. We refer to case (a) as Dataset-E, i.e. the entire dataset is used, and to case (b) as Dataset-IG, i.e. the subset of Dataset-E where data from the GPS sensor is available. The total amounts of data are 2812 and 2036 hours, respectively.
The SHL dataset is well suited to enable systematic comparative evaluations of recognition methods. It contains all the modalities used in the 34 related works and all of the activity classes in 25 of them. The duration of the dataset is much longer than that of any dataset reported in the literature with both inertial and GPS data. The dataset contains data recorded at multiple body positions and by multiple users. Therefore, this dataset allows the majority (25 out of 34) of the experiments reviewed in the related work to be replicated.
For clarity, we introduce the following naming scheme for the transportation and locomotion modes: S1-Still, W2-Walk, R3-Run, B4-Bike, C5-Car, B6-Bus, T7-Train, S8-Subway. W2, R3 and B4 are pedestrian activities of the user, among which W2 and R3 can be categorized as foot activities. C5, B6, T7 and S8 belong to the family of vehicular transportation, where C5 and B6 can be categorized as road transportation and T7 and S8 as rail transportation.

IV. RECOMMENDED TRANSPORTATION MODE RECOGNITION TASKS
In order to enable reproducible research in transportation mode recognition, it is important that the recognition scenarios are well defined. However, it is also important that they suit existing and foreseeable demands from different applications. In this section we propose a list of generalized transportation mode recognition tasks that aim to cover most application scenarios considered in the literature. As shown in Table 6, these tasks consist of 12 subgroups (scenarios) based on the eight classes in the SHL dataset: S1-Still, W2-Walk, R3-Run, B4-Bike, C5-Car, B6-Bus, T7-Train, S8-Subway.
This subgrouping scheme merges one or more activities together into a new class based on application interests. For instance, Pedestrian (Walk, Run, Bike), Vehicle (Bus, Car, Train, Subway), Foot (Walk, Run), Road vehicle (Bus, Car), and Rail vehicle (Train, Subway) are new classes merging existing activities. The 12 scenarios are listed in Table 6, which links each scenario to the related literature in its first column. These 12 scenarios cover most transportation mode recognition tasks considered in the literature (25 out of 34 related works) and link closely to the remaining ones, which contain more activities than the SHL dataset, e.g. Motorcycle [20]-[22], [36], [48], [50], [52], E-bike [42], Boat and Plane [32]. Some of these scenarios can be used to encourage a more ecologically friendly or physically active lifestyle, or to provide appropriate contextual information.
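The merging of base activities into scenario classes can be expressed as a simple label mapping. The sketch below uses the merged classes named in the text; the helper `make_mapping` and the three-class example scenario are hypothetical and are not claimed to be one of the 12 scenarios of Table 6.

```python
BASE_CLASSES = ["Still", "Walk", "Run", "Bike", "Car", "Bus", "Train", "Subway"]

# Merged classes as named in the text.
GROUPS = {
    "Pedestrian": {"Walk", "Run", "Bike"},
    "Vehicle": {"Bus", "Car", "Train", "Subway"},
    "Foot": {"Walk", "Run"},
    "Road vehicle": {"Bus", "Car"},
    "Rail vehicle": {"Train", "Subway"},
}

def make_mapping(scenario_classes):
    """Map each covered base class to its scenario class.

    A name found in GROUPS expands to its member activities; any other
    name must itself be a base class. Overlapping classes are rejected.
    """
    mapping = {}
    for name in scenario_classes:
        for member in sorted(GROUPS.get(name, {name})):
            if member not in BASE_CLASSES:
                raise ValueError(f"unknown class: {member}")
            if member in mapping:
                raise ValueError(f"{member} is covered twice")
            mapping[member] = name
    return mapping

# Hypothetical 3-class scenario used only for illustration.
scenario = ["Still", "Pedestrian", "Vehicle"]
mapping = make_mapping(scenario)
relabelled = [mapping[label] for label in ["Walk", "Bus", "Still", "Subway"]]
print(relabelled)
```

Because the relabelling is applied to the ground-truth annotations only, the same recordings can serve every scenario without re-annotation.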
When developing a system to automatically recognize transportation modes, it is important to evaluate it according to its final usage patterns. We thus propose to evaluate the recognition performance of the 12 scenarios from three perspectives: user-independent, position-independent, and time-invariant evaluation (Table 7).
Generally, a recognition system should work regardless of who is using it. However, human motion dynamics vary between users due to physical characteristics and habits. For instance, different users may have different gait styles and preferred walking or jogging speeds, or may engage in different activities when they are in public transport (e.g. reading a book, tapping to music, etc.). User-independent activity recognition aims to design recognition systems that will generalize well to new users [57]. We divide the dataset based on the three users and evaluate the performance with a leave-one-user-out cross-validation, e.g. training with the data from User 2 and User 3 and testing with the data from User 1.
A recognition system based on smartphones should ideally operate regardless of where the users carry their phone, i.e. it should be position-independent. We divide the dataset based on the four positions and evaluate the performance with a leave-one-position-out cross-validation, e.g. training with the data from Torso, Hip and Bag and testing with the data from Hand.
Finally, a system should keep operating over time, despite possible changes in behaviour (e.g. due to injury, different preferences or habits), i.e. it should be time-invariant. With data collected over the course of 7 months, we can assess this in the SHL dataset through a leave-one-period-out cross-validation, where a period is composed of the data of consecutive days of recordings. Specifically, we divide the dataset into four periods based on the recording dates of the three users, and perform training with three periods and testing with the remaining period. Table 8 presents the number of recording days in each period, and the duration of each transportation activity within each period.
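The three evaluation schemes share the same leave-one-group-out structure and differ only in the grouping key. A minimal sketch follows; the record layout (dictionaries with `user`, `position` and `period` keys) and the function name are assumptions made for illustration.

```python
def leave_one_group_out(records, key):
    """Yield (held_out_value, train, test) for every distinct value of record[key]."""
    for held_out in sorted({r[key] for r in records}):
        train = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        yield held_out, train, test

# Hypothetical index of recordings: one entry per user, position and period.
records = [
    {"user": u, "position": p, "period": d}
    for u in ("User1", "User2", "User3")
    for p in ("Hand", "Torso", "Hip", "Bag")
    for d in (1, 2, 3, 4)
]

folds = {k: list(leave_one_group_out(records, k)) for k in ("user", "position", "period")}
for key, key_folds in folds.items():
    print(key, [held_out for held_out, _, _ in key_folds])
```

With three users, four positions and four periods this produces 3, 4 and 4 folds respectively, and in every fold the held-out group never appears in the training data.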
Early smartphones only comprised an accelerometer as a motion sensor and thus a large amount of work focused on transportation mode recognition using this sensor only. Over time, multimodal motion sensors (accelerometer, gyroscope and magnetometer) were integrated into a single smartphone chip. In recent years an increasing number of works perform transportation mode recognition using multimodal sensors.
Because not all works use the same sensor configuration, we need to evaluate the recognition performance using combinations of sensors which form a superset of the related work. To this end, we propose the following six groups of modalities as combinations of accelerometer, gyroscope, magnetometer and GPS: A (Acc), AG (Acc + Gyr), AGM (Acc + Gyr + Mag), P (GPS), AP (Acc + GPS), AGMP (Acc + Gyr + Mag + GPS). First, accelerometer and GPS are the most common sensors in smartphones and we are interested in investigating the recognition performance with these two modalities alone (A and P) and in combination (AP). Second, the energy usage of a gyroscope is significantly higher (by an order of magnitude) than that of an accelerometer, so the accelerometer essentially comes for free if the gyroscope is turned on. The magnetometer uses about twice the energy of the gyroscope. When the magnetometer is enabled, the gyroscope and accelerometer can be enabled with little extra energy usage. We thus propose to use the combinations AG and AGM. GPS uses an order of magnitude more energy than the magnetometer. If we turn on GPS, the other motion sensors can be enabled as well without significant energetic impact. We thus propose to evaluate the combination AGMP. Table 7 lists 792 recognition tasks, each a combination of one of the 12 suggested recognition scenarios, a leave-one-out scheme to assess user, position or temporal independence, and a group of sensor modalities.
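One reading that reproduces the figure of 792 tasks is to count every held-out fold of every scheme separately: 12 scenarios × 6 modality groups × (3 users + 4 positions + 4 periods). This interpretation is an assumption; Table 7 is not reproduced here, so the sketch below should be taken as a consistency check of the arithmetic rather than as the authoritative task list.

```python
from itertools import product

SCENARIOS = range(1, 13)                          # the 12 recognition scenarios
MODALITY_GROUPS = ["A", "AG", "AGM", "P", "AP", "AGMP"]
FOLDS_PER_SCHEME = {"user": 3, "position": 4, "period": 4}  # assumed fold counts

# Each task: (scenario, modality group, scheme, index of the held-out fold).
tasks = [
    (scenario, group, scheme, fold)
    for scenario, group in product(SCENARIOS, MODALITY_GROUPS)
    for scheme, n_folds in FOLDS_PER_SCHEME.items()
    for fold in range(1, n_folds + 1)
]
print(len(tasks))
```

Counting only the scheme rather than its individual folds would give 12 × 6 × 3 = 216 combinations, so the fold-level count is the one that matches the number stated in the text.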
GPS is not always available in the entire dataset (see Fig. 3 and Table 8). When evaluating modalities A, AG and AGM, we use the entire dataset, i.e. Dataset-E. When evaluating the modalities P, AP and AGMP, we use the subset where GPS is available, i.e. Dataset-IG (see Fig. 3 and Table 8). For ease of comparison between the six groups of modalities, we also use Dataset-IG to evaluate A, AG and AGM.
For performance evaluation, we opt for two measures, recognition accuracy and F1 score, which are widely used in the literature. While recognition accuracy gives an intuitive indication of the performance, the F1 score can better handle datasets that are imbalanced between classes. The two measures can be computed from the confusion matrix between the output labels and the ground-truth labels. Let $M_{ij}$ be the (i, j)-th element of the confusion matrix, i.e. the number of samples originally belonging to class i which are recognized as class j. Let C be the number of classes. The accuracy (R) is the fraction of correctly recognized samples, i.e. the sum of the diagonal elements of M divided by the sum of all its elements, and the F1-score (F) combines the per-class precision and recall derived from M.
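Both measures can be computed directly from the confusion matrix. The sketch below assumes the standard forms: accuracy as the trace of M divided by the sum of all elements, and a macro-averaged F1 in which recall uses the row sums of M, precision uses the column sums, and the per-class scores are averaged over the C classes. The macro averaging is an assumption, as are the function names and the example matrix.

```python
def accuracy(M):
    """Accuracy R: correctly recognized samples (diagonal) over all samples."""
    total = sum(sum(row) for row in M)
    return sum(M[i][i] for i in range(len(M))) / total

def macro_f1(M):
    """Macro-averaged F1: mean over classes of the per-class F1 score."""
    C = len(M)
    scores = []
    for i in range(C):
        true_pos = M[i][i]
        n_actual = sum(M[i])                          # row sum: samples of class i
        n_predicted = sum(M[j][i] for j in range(C))  # column sum: outputs of class i
        recall = true_pos / n_actual if n_actual else 0.0
        precision = true_pos / n_predicted if n_predicted else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / C

# Hypothetical 2-class confusion matrix: rows are true classes, columns are outputs.
M = [[8, 2],
     [1, 9]]
R = accuracy(M)
F = macro_f1(M)
print(R, F)
```

For this matrix the accuracy is 17/20 = 0.85 and the macro F1 is (16/19 + 6/7)/2, approximately 0.8496.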

V. FEATURE ANALYSIS
The large amount of data in the SHL dataset allows us to conduct a thorough analysis of the ability of a large set of features to distinguish between any two transportation activities. To this end, we first define a set of features that can be computed from the various modalities, and then perform a discriminability analysis based on the mutual information between these features and the transportation modes. Finally, we employ a filter-based feature selection algorithm using a maximum-relevance-minimum-redundancy (MRMR) criterion [58] to preselect a subset of features, which are subsequently used to establish the baseline performance measures for the tasks identified in the previous section.
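To make the two ingredients of this analysis concrete, the sketch below estimates the mutual information between discretized features and class labels, and runs a greedy MRMR selection in its generic difference form (relevance minus mean redundancy with the already selected features). This is not the pair-wise variant used in this paper: the discretization of features into symbols, the scoring rule and the toy data are all assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def mrmr(features, labels, k):
    """Greedy MRMR (difference form) over a dict of name -> discretized values."""
    relevance = {name: mutual_information(v, labels) for name, v in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data: four classes, two samples each; features are already binned.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "f1":      [0, 0, 0, 0, 1, 1, 1, 1],  # separates classes {0,1} from {2,3}
    "f1_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # fully redundant with f1
    "f2":      [0, 0, 1, 1, 0, 0, 1, 1],  # complementary to f1
    "noise":   [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
}
selected = mrmr(features, labels, k=2)
print(selected)
```

The duplicate feature is as relevant as the original but is penalized for redundancy, so the complementary feature is chosen second.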

A. FEATURE EXTRACTION
We compute the features within a short-time window of 5.12 seconds, which is the most common duration we identified in Table 1. As shown in the state-of-the-art analysis in Sec. II-B and Table 4, most GPS features are computed over long temporal intervals, except the mean speed and mean acceleration. As we are interested in just-in-time context recognition and thus work with short frames, we only compute these two features for the GPS data. For this reason, the analysis of GPS features is not considered in this section, which is limited to the data coming from the three inertial sensors: accelerometer, gyroscope and magnetometer. For each modality we use the magnitude of the data channel for feature computation. The magnitude has been widely used in the literature and is robust to variations of device orientation (Table 2). Through the related work analysis, we noticed that while a variety of features have been proposed for transportation mode recognition, the choices of these features appear to be rather ad-hoc, especially regarding the subband energy and the quantile range. It is therefore of interest to find out which features provide the most discriminative power for the recognition task. To perform an exhaustive evaluation, we compute all the features that are listed in the literature (Table 3) and we additionally compute a set of quantile and subband features. Table 9 lists the features to be computed, which can be categorized into three families: subband energy (E), time-domain quantile (Q), and the remaining time-domain and frequency-domain (T + F) features.
A subband is usually defined with two parameters: centre frequency $\omega_c$ and bandwidth $\omega_b$. The frequencies in a subband are thus given by $\omega \in [\omega_c - \frac{\omega_b}{2}, \omega_c + \frac{\omega_b}{2}]$. Instead of evaluating the ad hoc subband features defined in the literature, we propose to systematically compute a set of subband features over a grid of parameters $\omega_c$ and $\omega_b$. The highest frequency of the data is 50 Hz as the sampling rate is 100 Hz. We consider the following bandwidths: $\omega_b \in \{1, 2, 3, 4, 5, 10, 15, 20, 25\}$ Hz. For each bandwidth $\omega_b$, we vary the centre frequency from $\frac{\omega_b}{2}$ to $50 - \frac{\omega_b}{2}$ with a step of 1 Hz. For the bandwidth $\omega_b = 1$ Hz the centre frequency is increased with a step of 0.5 Hz. For each subband, we consider two types of features: the absolute energy and the energy ratio. Let $\{S_1, \cdots, S_K\}$ represent the K = 257 FFT coefficients of a frame of data and let $k_L$ and $k_H$ denote the indices of the lower and upper frequencies of a subband; the two features are defined as
$$E_{abs} = \sum_{k=k_L}^{k_H} |S_k|^2, \qquad E_{ratio} = \frac{\sum_{k=k_L}^{k_H} |S_k|^2}{\sum_{k=1}^{K} |S_k|^2}.$$
Finally we obtain 846 features in the set E as shown in Table 9.
A quantile range $[q_L, q_H]$ is defined as $s(q_H) - s(q_L)$, the difference between two percentile values $s(q_L)$ and $s(q_H)$ of a frame of samples s. Instead of evaluating the ad hoc quantile and quantile-range features defined in the literature, we propose to systematically compute a set of quantile features with a list of possible parameters $q_L$ and $q_H$. We consider the following 9 quantile values: $q_L, q_H \in \{0, 5, 10, 25, 50, 75, 90, 95, 100\}$ with $q_L \le q_H$. This results in 9 quantiles with $q_L = q_H$ and 36 quantile ranges with $q_L < q_H$. Finally we obtain 45 features in the set Q as shown in Table 9.
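The subband grid described above can be checked with a short sketch. The code below is an illustration in Python with NumPy, not the authors' implementation; the function names are hypothetical, and the rule used to map band edges to FFT bins (all bins whose frequency falls inside the closed band) is an assumption, since the text does not specify it.

```python
import numpy as np

FS = 100.0   # sampling rate in Hz, as stated in the text
N = 512      # samples in a 5.12 s frame at 100 Hz
BANDWIDTHS = [1, 2, 3, 4, 5, 10, 15, 20, 25]  # Hz


def enumerate_subbands(f_max=FS / 2):
    """Return (centre, bandwidth) pairs for every subband in the grid."""
    bands = []
    for wb in BANDWIDTHS:
        step = 0.5 if wb == 1 else 1.0            # finer step for the 1 Hz bandwidth
        n_steps = int(round((f_max - wb) / step))
        for i in range(n_steps + 1):
            bands.append((wb / 2 + i * step, float(wb)))
    return bands


def subband_features(frame, centre, bandwidth, fs=FS):
    """Absolute energy and energy ratio of one subband of one frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2        # N/2 + 1 = 257 coefficients
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Assumption: a bin belongs to the band if its frequency lies inside it.
    in_band = (freqs >= centre - bandwidth / 2) & (freqs <= centre + bandwidth / 2)
    absolute = power[in_band].sum()
    return absolute, absolute / power.sum()
```

Enumerating the grid this way gives 423 subbands, and two features per subband gives the 846 features of family E quoted in the text.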
We include all the time-domain (T) and frequency-domain (F) features listed in Table 3, excluding the quantile and subband features, which yields 17 features: 8 in the time domain and 9 in the frequency domain.
With the proposed scheme, we compute 17 + 45 + 846 = 908 features for each modality and thus 3 × 908 = 2724 features per frame of inertial sensor data in total. The frames are obtained by sliding a window of 5.12 seconds with 2.56 seconds overlap over the entire dataset. This yields 3.95 million frames, each containing 2724 features.
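As an illustration of the framing and of the quantile family Q, the following Python sketch segments a magnitude signal into 5.12-second frames with 2.56-second overlap (512 samples with a hop of 256 samples at 100 Hz) and computes the 45 quantile features of one frame. The function names are hypothetical, the use of `np.percentile` with its default interpolation is an assumption, and incomplete trailing frames are simply dropped.

```python
import numpy as np

QUANTILES = [0, 5, 10, 25, 50, 75, 90, 95, 100]
WINDOW = 512   # 5.12 s at 100 Hz
HOP = 256      # a 2.56 s overlap means the window advances by 2.56 s


def magnitude(xyz):
    """Orientation-independent magnitude of an (n_samples, 3) signal."""
    return np.linalg.norm(xyz, axis=1)


def sliding_frames(signal, window=WINDOW, hop=HOP):
    """Split a 1-D signal (at least one window long) into overlapping frames."""
    n_frames = 1 + (len(signal) - window) // hop
    return np.stack([signal[i * hop:i * hop + window] for i in range(n_frames)])


def quantile_features(frame):
    """9 quantile values (q_L = q_H) followed by 36 quantile ranges (q_L < q_H)."""
    s = np.percentile(frame, QUANTILES)
    values = list(s)
    ranges = [s[j] - s[i] for i in range(len(s)) for j in range(i + 1, len(s))]
    return np.array(values + ranges)
```

With 9 quantile values there are 9 × 8 / 2 = 36 ranges, so each frame yields 9 + 36 = 45 features per modality, matching the count given for family Q.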

B. FEATURE ANALYSIS BASED ON MUTUAL INFORMATION
Given so many features computed in each data frame, we are interested in finding the answers to three questions: which modality, which quantile range, and which subband is most informative to distinguish between transportation modes.
Mutual information (MI) is widely used to measure the relevance between features and target classes, and also the dependency between features [58]-[60]. Given two variables x and y, the probability density functions (pdf) p(x) and p(y), and the joint pdf p(x, y), the mutual information is defined as
$$I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy. \qquad (6)$$
The MI I(x; y) lies in the range [0, 1], with a value close to 1 indicating a strong dependency between two variables and a value of 0 indicating independence between them. For a specific recognition problem with a feature f and a set of classes C, a higher MI value I(f; C) indicates a stronger ability of the feature to distinguish between these classes [59]. We employ mutual information as a measure to investigate the discriminability of each feature on any two transportation activities. Given the eight activity classes in the SHL dataset, there are 28 pair-wise combinations of any two. We compute the mutual information between each feature and each class pair. When computing mutual information, the pdf of the feature variable in Eq. (6) is approximated with the histogram over all (3.95 million) instances. Specifically, p(x) or p(y) is approximated with a one-dimensional histogram with a fixed number of 500 bins; p(x, y) is approximated with a two-dimensional histogram with 200 bins in each dimension.
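A histogram-based estimate of the MI between one feature and one class pair could be sketched as follows. This is a simplified illustration, not the procedure used in the paper: it treats the class as a discrete variable, derives both marginals from a single joint histogram (rather than from separate 500-bin histograms), and uses a base-2 logarithm, which is an assumption. With a base-2 logarithm and a two-class target, the estimate is bounded by the class entropy, which is at most 1 bit.

```python
import numpy as np
from itertools import combinations

CLASSES = ["S1", "W2", "R3", "B4", "C5", "B6", "T7", "S8"]
CLASS_PAIRS = list(combinations(CLASSES, 2))   # the 28 pair-wise combinations


def mutual_information(feature, labels, bins=200):
    """Histogram estimate of I(f; C) in bits for a discrete label vector."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    edges = np.histogram_bin_edges(feature, bins=bins)
    # Joint histogram: one row per class, one column per feature bin.
    joint = np.stack([np.histogram(feature[labels == c], bins=edges)[0]
                      for c in classes]).astype(float)
    joint /= joint.sum()
    p_c = joint.sum(axis=1, keepdims=True)     # class marginal
    p_f = joint.sum(axis=0, keepdims=True)     # feature marginal
    nz = joint > 0                             # empty cells contribute nothing
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_c @ p_f)[nz])))
```

Note that histogram estimates of MI are biased upwards for finite samples, so small values for independent variables are expected rather than exact zeros.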

1) Modality
For a specific recognition problem, a feature with a higher MI value usually indicates a stronger ability to separate the target classes. We thus use the number of features with a high value of MI (above a threshold $I_T$) contained in one single modality (accelerometer, gyroscope, or magnetometer) to indicate the significance of this modality to the recognition task. If we do not consider the redundancy of the features in the same modality, the more high-MI features a modality contains, the more important it is. For each modality (with 908 features) and each pair of classes (the 28 pair-wise combinations of the eight classes), we compute the number of features with MI above a threshold $I_T$. We plot in Fig. 4 how this number varies as a function of the threshold $I_T$. The following observations can be made regarding the significance of each modality.
All three modalities present very few features with high MI values for two pairs, (C5 vs B6) and (T7 vs S8). The likely explanation is that the motion patterns of Car and Bus are very similar, especially in short time frames, as are those of Train and Subway. For (T7 vs S8), no feature from the three modalities presents an MI value higher than 0.05, which implies that the two classes are almost indistinguishable with a single feature. For (C5 vs B6), all the features from gyroscope and magnetometer show an MI value below 0.05, while accelerometer has fewer than 10 features with MI between 0.05 and 0.1. This implies that accelerometer provides more discriminative power than the other two modalities for separating C5 and B6, although making this distinction appears to be comparatively more difficult.
For each of the remaining 26 pairs, either one or several of the three modalities can provide features with a high MI value. Accelerometer and gyroscope show similar significance curves across many class pairs, such as (S1 vs W2/R3/B4) and (W2/R3 vs C5/B6/T7/S8). These two modalities provide a similar number of features with high MI values (e.g. > 0.7) when distinguishing between Still (S1) and pedestrian (W2/R3/B4), and between foot (W2/R3) and vehicles (C5/B6/T7/S8). Accelerometer provides more high-MI features than gyroscope for most of the remaining pairs, e.g. when distinguishing between Still (S1) and the four vehicles (C5/B6/T7/S8), and also between the three pedestrian activities (W2 vs R3 vs B4). Gyroscope provides more high-MI features than accelerometer when distinguishing Bike (B4) from the four vehicles. This is possibly because the Bike activity introduces more rotational motion than vehicles. Both accelerometer and gyroscope provide very few high-MI features when distinguishing between the four vehicles (i.e. C5 vs B6 vs T7 vs S8).
Magnetometer usually provides far fewer high-MI features than accelerometer and gyroscope for most class pairs, because the ambient magnetic field is not closely related to human activity in open spaces, where there is little magnetic disturbance from surrounding metals. However, the magnetometer provides significantly more high-MI features than the other two modalities when distinguishing between Still (S1) and rail transportation (T7/S8), and between driving (C5/B6) and rail (T7/S8). This is an interesting observation that has not been reported in the previous literature. One possible explanation could be the influence of the metal casing of the train and subway.
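The significance curves discussed above (the number of features of one modality whose MI exceeds a threshold, plotted against that threshold) reduce to a simple count. A minimal Python sketch, with a hypothetical function name and made-up MI values used purely for illustration:

```python
import numpy as np


def significance_curve(mi_values, thresholds):
    """Number of features whose MI exceeds each threshold I_T."""
    mi_values = np.asarray(mi_values, dtype=float)
    return np.array([int((mi_values > t).sum()) for t in thresholds])


# Illustrative values only, not taken from the paper.
example_mi = [0.02, 0.06, 0.5, 0.8]
example_curve = significance_curve(example_mi, thresholds=[0.05, 0.1, 0.7])
```

In the setting of the text, `mi_values` would hold the 908 MI values of one modality for one class pair, giving one curve per modality and per pair.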
2) Subband energy
Fig. 5 visualizes the MI values between subband features (family E) from the three modalities and the 28 class pairs. Each subfigure contains 28 panels corresponding to the 28 class pairs. Each panel consists of two parts: the upper block shows the MI between the energy-ratio features and the class pairs; the lower block shows the MI between the absolute-energy features and the class pairs. The x-axis denotes the center frequency of the subband, which varies from 0 to 50 Hz, while the y-axis denotes the bandwidth, which varies from 1 to 25 Hz. Based on the MI values we can easily find out which subband provides a higher discriminability between the target classes.
For accelerometer and gyroscope in Fig. 5(a) and (b), the lower block (absolute energy) provides more high-MI features than the upper block (energy ratio). For accelerometer, most high MI values are observed at low frequencies, especially between 0 and 10 Hz. For gyroscope, most high MI values are observed at low frequencies, especially between 5 and 10 Hz, and some class pairs, e.g. (B4 vs C5/B6/T7/S8), present high MI values in the frequency band between 0 and 5 Hz. For accelerometer, a larger bandwidth does not show evident advantages over a smaller bandwidth. For gyroscope, a larger bandwidth shows evident advantages over a smaller bandwidth. For instance, the subbands with 1 Hz bandwidth usually present low MI values. For magnetometer in Fig. 5(c), the upper block (energy ratio) provides more high-MI features than the lower block (absolute energy). This is in contrast to the other two modalities. For most class pairs, high MI values are observed in the frequency bands 0-15 Hz and 25-35 Hz. Bandwidths around 10 Hz seem to present higher MI values than other bandwidths. This is consistent with the observation made in Fig. 4 that magnetometer provides more discriminability between (S1/C5/B6) and (T7/S8).
3) Quantile
Fig. 6 visualizes the MI values of various quantile features (family Q) from the three modalities. The MI is computed between each feature and each of the 28 class pairs. Each subfigure contains 28 panels corresponding to the 28 class pairs. The x- and y-axes denote the upper and lower bounds of a quantile range. Thus each cell with coordinates $(q_x, q_y)$ represents the quantile range $[q_y, q_x]$. The 9 specific quantile values, from 0 to 100, are listed in Table 9. A cell with identical coordinates, i.e. $q_x = q_y$, represents the quantile value $q_x$. The image in each panel resembles a lower-triangular area. Based on the MI values we can easily find out which quantile range has a higher discriminability between the target classes.
For accelerometer in Fig. 6(a), the middle part of the triangular area in each panel tends to present higher MI values for most class pairs, e.g. the quantile range 25-75. For gyroscope in Fig. 6(b), the left part of the triangular area in each panel tends to present higher MI values for most class pairs, e.g. the quantile range 10-50. For magnetometer in Fig. 6(c), the right part of the triangular area in each panel tends to present higher MI values for most class pairs, e.g. the quantile range 0-100.
The indices 1-9 in the second column denote the frequency-domain features: DC, highest FFT value and frequency, ratio between the first and second peak, mean, standard deviation, kurtosis, skewness, and energy. It appears that all these features are important for one or more class pairs.

C. FEATURE ANALYSIS BASED ON MRMR
The importance analysis in Sec. V-B relies only on the relevance of individual features to the target classes and does not consider the redundancy between the features. Since activity recognition usually uses multiple features, it is important to see which features will be selected after removing inter-feature redundancy.
MRMR is a well-known feature selection method that selects a set of features with maximum relevance to the target class and minimum redundancy among each other [58]. We thus employ MRMR to identify important features with the least redundancy. Given the target classes C and an initial set F with n features, MRMR aims to find a subset $S \subset F$ with k features that maximizes the mutual information between the features and the class, $I(C; S)$, and minimizes the mutual information between the features in the subset, $I(f_i; f_j)$. An incremental search scheme is used which in each step selects a new feature that maximizes the objective function $J(f_i)$:
$$J(f_i) = I(C; f_i) - \frac{1}{|S|} \sum_{f_s \in S} \frac{I(f_i; f_s)}{\min\{H(f_i), H(f_s)\}}, \qquad (7)$$
where $H(f_i)$ denotes the entropy of the feature $f_i$, and $f_s$ denotes a feature in the subset S. The normalization in the second term of (7) aims to limit the MI within the range [0, 1] in order to prevent over-weighting non-redundant features [60].
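The incremental search can be illustrated with a small greedy selector. The Python sketch below is an assumption-laden illustration, not the authors' code: it works on features that have already been discretized into integer bins, measures relevance as $I(C; f_i)$, and measures redundancy as the mean MI to the already selected features, normalized by the smaller of the two entropies (one common form of normalized MRMR; the exact objective used in the paper may differ). All function names are hypothetical.

```python
import numpy as np


def entropy(x):
    """Entropy in bits of a discrete (already binned) variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def mutual_info(x, y):
    """I(x; y) in bits for two discrete variables, via H(x) + H(y) - H(x, y)."""
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return entropy(x) + entropy(y) + float(np.sum(p * np.log2(p)))


def mrmr_select(features, target, k):
    """Greedily pick k column indices of a discretized (n_samples, n_features) matrix."""
    n = features.shape[1]
    k = min(k, n)
    relevance = [mutual_info(features[:, i], target) for i in range(n)]
    selected = [int(np.argmax(relevance))]       # start with the most relevant feature
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            h_i = entropy(features[:, i])
            # Normalized redundancy with respect to the features chosen so far.
            redundancy = np.mean([
                mutual_info(features[:, i], features[:, s])
                / max(min(h_i, entropy(features[:, s])), 1e-12)
                for s in selected
            ])
            score = relevance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

In the example exercised by the tests, a duplicated feature is skipped in favour of a feature that carries new information about the target, which is the behaviour the redundancy term is meant to produce.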
As shown in Sec. V-B, each feature presents different MI values for different class pairs, and consequently each class pair leads to a different optimal set of features according to the MRMR criterion. To avoid removing features that are potentially useful, we perform feature selection per class pair and per modality by applying MRMR independently to each of the three feature families: E, Q, and T+F. Fig. 8 depicts the block diagram of the pair-wise MRMR feature selection method.
For each modality, we select 10 features from E for each class pair and then combine the selected features from the 28 class pairs. This procedure is repeated for Q (5 features per class pair) and T+F (5 features per class pair). Fig. 9 illustrates the selected features from the families E, Q, and T+F for the three modalities. It may happen that several class pairs lead to the selection of the same feature. We thus use color to indicate how often a feature is selected, which can range from 'never' up to 28 times, i.e. once for each of the 28 class pairs. The more frequently a feature is selected, the more important it is. A summary of the selection result is given below.
For accelerometer, the MRMR algorithm produces 147 features including 104 subband features (E), 29 quantile features (Q) and 14 time-frequency features (T+F). The most selected subband features (Fig. 9(a)) tend to have a center frequency between 0 and 5 Hz and a bandwidth between 1 and 5 Hz. These features appear in both the upper block (energy ratio) and the lower block (absolute energy) of Fig. 9(a). The most selected quantile features (Fig. 9(d)) appear on the left side of the triangular area, with a narrow interval between lower and upper quantiles. For the T+F features in Fig. 9(g), most features are selected except two time-domain features (energy and kurtosis) and one frequency-domain feature (energy).
For gyroscope, the MRMR algorithm produces 150 features including 108 subband features (E), 28 quantile features (Q) and 14 time-frequency features (T+F). The most selected subband features (Fig. 9(b)) tend to be distributed sparsely over subbands with a center frequency between 0 and 30 Hz and a bandwidth between 1 and 25 Hz. These features appear in both the upper block (energy ratio) and the lower block (absolute energy) in Fig. 9(b). The most selected quantile features (Fig. 9(e)) tend to appear on the left side of the triangular area, with a narrow interval between lower and upper quantiles. For the T+F features in Fig. 9(h), most features are selected except one time-domain feature (energy) and two frequency-domain features (kurtosis and energy).
For magnetometer, the MRMR algorithm produces 148 features including 104 subband features (E), 30 quantile features (Q) and 14 time-frequency features (T+F). The most selected subband features (Fig. 9(c)) appear in the upper block (energy ratio) and very few appear in the lower block (absolute energy). These features tend to be distributed densely over subbands with a center frequency between 0 and 15 Hz and a bandwidth between 1 and 10 Hz, and also over subbands with a center frequency between 20 and 30 Hz and a bandwidth between 20 and 25 Hz. The most selected quantile features (Fig. 9(f)) tend to appear on the left side of the triangular area, with a narrow interval between lower and upper quantiles. However, a feature covering the full range between quantile 0 and quantile 100 is also selected multiple times. For the T+F features in Fig. 9(i), most features are selected except one time-domain feature (energy) and two frequency-domain features (highest FFT value and energy).
Finally, Table 10 lists the five most frequently selected features in E, Q, and T+F, respectively, for each modality.
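Combining the pair-wise selections and counting how often each feature is selected (the quantity shown by color in Fig. 9 and summarized in Table 10) could be sketched as follows. The function name, the dictionary layout and the example indices are hypothetical.

```python
from collections import Counter


def combine_pairwise_selections(selections):
    """Count, for each feature index, how many class pairs selected it.

    `selections` maps a class pair, e.g. ("S1", "W2"), to the list of
    feature indices that MRMR selected for that pair.
    """
    counts = Counter()
    for selected in selections.values():
        counts.update(set(selected))   # a pair counts at most once per feature
    return counts


# Illustrative input only: two class pairs, three selected features each.
example = {("S1", "W2"): [1, 2, 3], ("S1", "R3"): [2, 3, 4]}
example_counts = combine_pairwise_selections(example)
```

The keys of the resulting counter form the combined feature set of a modality, and `most_common(5)` would give the five most frequently selected features.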
Note that while the proposed MRMR-based feature analysis procedure is computationally expensive, this computation only occurs when the system is developed, i.e. in the training stage. At run-time, in a deployed system, only the selected features need to be computed (i.e. MRMR need not be run in a production system, only during development). This reduces the computation significantly in the deployed system, as fewer features are computed and used for the classification.
To summarize, the significance analysis in Sec. V-B and Sec. V-C gives us an idea of which features provide crucial information for a specific recognition task. We can use the features selected in this section as a starting point to establish the baseline performance of the defined recognition tasks.

VI. BASELINE PERFORMANCE
A. PROCESSING PIPELINE
Fig. 10 illustrates the processing pipeline for establishing baseline performance for the recommended recognition tasks using the SHL dataset.
We compute the recognition performance for each recognition task, which is defined as a combination of a leave-one-out scheme, an evaluation scenario, and a group of modalities in Table 7. We first divide the entire dataset into training and testing folds according to the leave-one-out evaluation strategy indicated in Table 7. For the training dataset, we use a sliding window with a length of 5.12 seconds and 2.56-second overlap to segment the sensor data into frames, and in each frame we extract the set of features $\{f_e\}$ identified in Sec. V-C, including 147 accelerometer features, 150 gyroscope features and 148 magnetometer features (Fig. 9). For each of the 12 evaluation scenarios, we apply MRMR to select 50 features independently for each of the three modalities (accelerometer, gyroscope and magnetometer), and compute two features for the GPS modality: mean speed and mean acceleration. The speed and acceleration are estimated from the change of GPS coordinates (latitude and longitude) over time with the Matlab Mapping Toolbox.
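For readers without access to the Matlab Mapping Toolbox, the two GPS features could be approximated along the following lines. This Python sketch is an assumption, not the paper's procedure: it uses the haversine great-circle distance on a spherical Earth, finite differences between consecutive fixes, and the signed mean of the acceleration; all names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0   # mean Earth radius, an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_speed_and_acceleration(times, lats, lons):
    """Mean speed (m/s) and mean acceleration (m/s^2) over one frame of GPS fixes."""
    speeds, mids = [], []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        speeds.append(haversine_m(lats[i - 1], lons[i - 1], lats[i], lons[i]) / dt)
        mids.append((times[i] + times[i - 1]) / 2)   # time stamp of each speed value
    accels = [(speeds[i] - speeds[i - 1]) / (mids[i] - mids[i - 1])
              for i in range(1, len(speeds))]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(speeds), mean(accels)
```

A frame with fewer than two fixes yields a speed of zero and a frame with fewer than three fixes yields an acceleration of zero in this sketch; how such frames are handled in practice is a design choice.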
For each group of modalities, we combine all the features computed on each constituent modality into a single feature vector $\{f_s\}$. For instance, the feature vector of the modality group AGM consists of 150 elements. The resulting feature vector and the associated class label of each frame of data in the training set are used to train the classifier model. The testing dataset comprises all the data frames in the left-out fold of the cross-validation. Based on the indices of the features selected in the training stage, we compute the same set of features $\{f_s\}$ and feed them to the trained classifier model to recognize the transportation mode in each frame.
We employ a decision tree as the baseline classifier due to its popularity in transportation mode recognition (e.g. 18 out of the 34 related works). We implemented the recognition system using Matlab's built-in function 'fitctree'. We use the default parameters for this function, except that we set the parameter 'MinParentSize' (the minimum number of observations per branch node in the tree) to $\frac{10000}{C}$, where C is the number of classes for a specific recognition task, and the parameter 'MinLeafSize' (the minimum number of observations per leaf node in the tree) to $\frac{\text{MinParentSize}}{5}$. We use large values for these two parameters to prevent overfitting in the training stage.
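As a worked example of these two settings, the sketch below computes them for a given number of classes, assuming MinParentSize = 10000/C and MinLeafSize = MinParentSize/5. Rounding to the nearest integer is an assumption, and the function name is hypothetical. In scikit-learn's `DecisionTreeClassifier`, the closest counterparts would be the `min_samples_split` and `min_samples_leaf` parameters, although that classifier is not guaranteed to reproduce the behaviour of 'fitctree'.

```python
def tree_size_limits(n_classes, base=10000, leaf_divisor=5):
    """Return (MinParentSize, MinLeafSize) for a task with n_classes classes."""
    min_parent = max(2, round(base / n_classes))          # observations per branch node
    min_leaf = max(1, round(min_parent / leaf_divisor))   # observations per leaf node
    return min_parent, min_leaf


# Eight-class task (Scenario 12) and a two-class task.
limits_8 = tree_size_limits(8)   # (1250, 250)
limits_2 = tree_size_limits(2)   # (5000, 1000)
```

Under this reading, tasks with more classes allow smaller nodes and hence deeper trees.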
As already discussed in Sec. IV, the evaluation of the groups of modalities A, AG and AGM will be made on both Dataset-E and Dataset-IG, and the evaluation of P, AP and AGMP will be made on Dataset-IG.

B. RESULTS
Table 11 reports in detail the baseline performance, in terms of recognition accuracy and F1 score, of the 396 recognition tasks, consisting of 12 evaluation scenarios, 11 leave-one-out cross-validations (three users, four positions and four periods), and three groups of modalities (A, AG and AGM), obtained using Dataset-E. Table 12 reports the baseline performance of the 792 recognition tasks with six groups of modalities (A, AG, AGM, P, AP and AGMP) obtained using Dataset-IG. To investigate the influence of different modalities on the recognition performance, we average the F1 scores over all 11 cross-validation cases for each recognition scenario and each group of modalities. Fig. 11(a) depicts the mean F1 score for the 12 recognition scenarios and three groups of modalities (A, AG and AGM), obtained using Dataset-E and Dataset-IG, respectively. Fig. 11(b) depicts the mean F1 score for the 12 scenarios and six modality groups, obtained using Dataset-IG. Despite their different amounts of data, Dataset-E and Dataset-IG achieve very similar F1 scores for all groups of modalities and recognition scenarios. Meanwhile, the F1 score obtained with Dataset-E is slightly higher than that obtained with Dataset-IG, possibly because the former has more balanced data between classes. For each recognition scenario, using more modalities appears to always increase the recognition performance. Specifically, the following observations can be made.
• The combination of accelerometer and gyroscope (AG) tends to improve the recognition performance slightly over that obtained with the accelerometer alone (A).
• Including the magnetometer (AGM) tends to improve the recognition performance much more significantly. The pronounced improvement from combining accelerometer and magnetometer is due to the complementarity between the two, i.e. one is based on the motion of the device while the other is based on the ambient magnetic field around the device. As shown in Fig. 4, the magnetometer tends to provide more features with high MI values for class pairs where accelerometer and gyroscope provide very few features with high MI values.
• The GPS modality alone, with only two features, does not provide sufficient discriminability between the target classes. However, combining GPS and accelerometer (AP) tends to improve the recognition performance significantly over using either modality alone (A or P). The combination of GPS and accelerometer (AP) outperforms the combination of the three inertial sensors (AGM). The combination of all four modalities (AGMP) only improves the recognition performance slightly over AP.
We use the standard deviation of the F1 score to investigate the influence of user, position and temporal variation on the recognition performance. For user variation, we compute the standard deviation of the F1 score across the three users per recognition scenario and per group of modalities, and then average the standard deviation values across the 12 recognition scenarios for each group of modalities. We repeat the same procedure for position variation (with four positions) and temporal variation (with four periods). All the results are obtained using Dataset-IG. Fig. 11(c) depicts the standard deviation for the three variations (user, position, and period) and six groups of modalities, where a smaller standard deviation implies that the recognition system is more robust to the variation. The following observations can be made.
• When using inertial sensors (A, AG, AGM), the position variation tends to introduce the largest standard deviation among the three, because users engage with the recording device differently depending on the wearing position. It can be observed in both Table 11 and Table 12 that the recognition performance at the four positions can be ranked as Hand > Torso > Hips > Bag.
• When using both inertial and GPS sensors, the standard deviation of the position variation is reduced significantly. This demonstrates that GPS can increase the robustness of the recognition system to position variation, because GPS information does not vary much with the wearing position. When using inertial sensors only, the user variation has the second largest standard deviation, because each user behaves differently during travel.
• When using GPS alone, the user variation appears to have the largest standard deviation of the recognition performance. This is possibly because each user has a different speed when performing walking, running, biking and driving activities. The temporal variation tends to have the smallest standard deviation of the recognition performance across five of the six groups of modalities (all except P, i.e. GPS alone).
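The robustness measure described above (standard deviation across folds, averaged over scenarios) could be computed as in the following Python sketch. The function name and the example scores are illustrative only, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not state which estimator is used.

```python
import numpy as np


def variation_robustness(f1_scores):
    """Mean over scenarios of the F1 standard deviation across folds.

    `f1_scores` has shape (n_scenarios, n_folds), e.g. (12, 3) for user
    variation or (12, 4) for position or temporal variation.
    """
    f1_scores = np.asarray(f1_scores, dtype=float)
    return float(np.std(f1_scores, axis=1).mean())


# Made-up scores for two scenarios and three folds, for illustration only.
example_std = variation_robustness([[0.8, 0.8, 0.8], [0.5, 0.7, 0.9]])
```

One such value is obtained per variation type and per group of modalities, with a smaller value indicating greater robustness.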
Fig. 12 lists the confusion matrices for Scenario 12 (the most difficult scenario, with eight classes) evaluated on Period 3 (leave-one-period-out cross-validation). The first row shows the results for the three groups of modalities (A, AG and AGM) obtained with Dataset-E. The second and third rows show the results for the six groups of modalities (A, AG, AGM, P, AP and AGMP) obtained with Dataset-IG. From the confusion matrices, we can draw conclusions similar to those drawn from Fig. 11. As shown in the first and second rows of Fig. 12, Dataset-E and Dataset-IG achieve a similar recognition accuracy for A, AG and AGM, whereas Dataset-E achieves a slightly higher F1 score than Dataset-IG. This is because Dataset-E has more balanced data between the eight classes, as supported by the recognition result for the class S8 (Subway), where Dataset-E achieves a much higher recognition accuracy (e.g. 53.8% vs 30.3% for AGM in the confusion matrix).
From the confusion matrices in the second and third rows we can clearly see how the recognition performance is improved by using more modalities. Specifically, the following observations can be made.
Train/Subway is also reduced significantly.
• When using GPS alone, the classifier presents a very low recognition accuracy for Run (7%) and tends to misclassify it as Bike and Walk. This is because the running speed of some of the subjects may not have been significantly higher than their walking speed, or may have been in a similar range to leisurely cycling. The classifier also presents a very low recognition accuracy for Subway (0.7%) and tends to misclassify it as Car and Bus. This is linked to the speed of the vehicles: a subway travels at 40-60 km per hour, which is similar to a bus, and often to a car in cities.
• When combining GPS and accelerometer (AP), the recognition accuracy for each class is improved remarkably in comparison to using the accelerometer alone (A). In particular, the recognition accuracy of Car and of Bus has each improved from 45% to around 70%. Train is also better recognized, with the accuracy improving from 51% to 67%.
• Comparing AGM and AGMP, the latter improves the recognition accuracy of Bus, Car and Train remarkably, each by more than 10%, but achieves a lower recognition rate for Subway. This is possibly because Subway does not have sufficient GPS data available, thus leading to a biased classification result. Interestingly, the availability of GPS is itself a strong indicator of Still (inside) or Subway. This fact could be further exploited to improve the recognition performance.

VII. DISCUSSION
We recommend 792 recognition tasks, formed as a combination of 12 recognition scenarios, six groups of modalities, and three leave-one-out cross-validation evaluation criteria (user, position and period, with 11 folds in total), to be used by the research community for a standardized comparison. These recognition tasks are defined based on the SHL dataset and constitute a superset covering the majority of the recognition tasks considered in the literature, except some transportation activities not included in the SHL dataset. We suggest using the naming scheme "Task-Scenario-Crossvalidation-Modality" when performing a specific evaluation task using the SHL dataset. Here 'Scenario' can be 'O1-O12'; 'Crossvalidation' can be 'UX', 'PX', and 'TX', denoting user-independent, position-independent, and time-invariant evaluation with fold 'X' left out; 'Modality' can be 'A', 'AG', 'AGM', 'P', 'AP' and 'AGMP' (see Table 7). For instance, "Task-O12-U1-A" denotes the leave-User1-out evaluation on Scenario 12 using the accelerometer modality, while "Task-O2-P2-AP" denotes the leave-Torso-out evaluation on Scenario 2 using the accelerometer and GPS modalities.
With this naming scheme we can easily associate a specific recognition task in the related work with one defined in this paper. For instance, related work [49] addressed Scenario 4, with a 'user-independent', 'position-independent' and 'time-invariant' evaluation using the group of modalities 'AP'. The authors of [49] would be able to apply their algorithms to the SHL dataset and compare with the baseline results reported in this paper (e.g. Table 12). When the average performance of a cross-validation is reported, we recommend using a name such as 'Task-O12-P-AP', which represents the average position-independent cross-validation performance for Scenario 12 using the accelerometer and GPS modalities.
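The naming scheme lends itself to programmatic use. The following Python sketch generates all task names and parses a name back into its components. It assumes that folds are numbered 1-3 for users and 1-4 for positions and periods, and that a missing fold number denotes the average over folds; the function names and the regular expression are illustrative rather than part of the released tools.

```python
import re
from itertools import product

SCENARIOS = [f"O{i}" for i in range(1, 13)]
FOLDS = ([f"U{i}" for i in range(1, 4)]      # three users
         + [f"P{i}" for i in range(1, 5)]    # four positions
         + [f"T{i}" for i in range(1, 5)])   # four periods
MODALITIES = ["A", "AG", "AGM", "P", "AP", "AGMP"]

TASK_PATTERN = re.compile(r"^Task-O(\d+)-([UPT])(\d*)-(AGMP|AGM|AG|AP|A|P)$")


def all_task_names():
    """Every combination of scenario, left-out fold and modality group."""
    return [f"Task-{s}-{f}-{m}" for s, f, m in product(SCENARIOS, FOLDS, MODALITIES)]


def parse_task(name):
    """Split a task name into scenario, cross-validation type, fold and modality."""
    match = TASK_PATTERN.match(name)
    if match is None or not 1 <= int(match.group(1)) <= 12:
        raise ValueError(f"not a valid task name: {name!r}")
    scenario, kind, fold, modality = match.groups()
    return {
        "scenario": int(scenario),
        "cross_validation": {"U": "user", "P": "position", "T": "time"}[kind],
        "fold": int(fold) if fold else None,   # None denotes the average over folds
        "modality": modality,
    }
```

Enumerating the names gives 12 × 11 × 6 = 792 tasks, in agreement with the number of recommended recognition tasks.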
In this paper we mainly aim to establish a standard performance evaluation framework rather than to pursue the maximum recognition performance. The recognition pipeline presented in this paper is a baseline implementation, which aims to provide reference results to enable reproducible comparison. For this reason, we employ a well understood classifier, the decision tree, in our pipeline. In fact, the recognition performance is affected by several aspects including the features, the classifiers and the recognition tasks. All the observations and conclusions made in this paper are confined to the baseline implementation. However, all the feature analysis results presented in Sec. V are classifier-agnostic. In particular, our identification of relevant frequency bands as well as the importance of magnetic field sensors are novel findings standing on their own irrespective of the classifier used, as they are the result of an information theoretical analysis.
TABLE 11. F1 score (F) and recognition accuracy (R) for each recognition task obtained using Dataset-E (the entire dataset).
There are a variety of ways to improve the recognition performance. Instead of a decision tree (DT), we could use more advanced classifiers, such as SVM and random forest. Post-filtering techniques, such as HMM and voting schemes, could be further employed to correct the predictions at individual frames. New features could be extracted from the sensor data, e.g. using deep learning, to further improve the recognition performance. In short, the improvement brought by any proposed method can be identified easily by comparing with the baseline performance on the standard recognition tasks.
We perform feature computation and activity recognition with a sliding window of 5.12 seconds. This window length is widely used in the related work and appears to be a good balance between decision time and accuracy. Ideally, the scientific community should standardize on a common window length, because the recognition performance varies significantly with the window length. However, if it is not possible to use a 5.12-second window, other window lengths reported in the related work should ideally be used to enable comparison of methods. Researchers using this dataset can always, based on their preference, establish their own baseline performance by targeting the recognition tasks defined in the paper. For instance, we think 60 seconds is also a good choice of window size, which is short enough for contextual awareness yet allows more complex GPS features.
In the SHL dataset the GPS information is not always available. Therefore, we evaluated different groups of modalities with two types of datasets: Dataset-E (the entire dataset) and Dataset-IG (the subset of Dataset-E where the GPS is available). In practice it may happen that the GPS is available at some times and unavailable at others. In this case it would be desirable to have two classifiers, one for when GPS is unavailable and one for when GPS is available, and to switch between them dynamically depending on the situation. We would encourage users to implement such a dynamic classifier and compare with the baseline results obtained with both Dataset-E and Dataset-IG.
The limited number of users might be a weak point of the SHL dataset. However, the variability in the sensor signal during transportation stems primarily from the motion of the vehicle, as the movements of users within a vehicle are constrained (e.g. the movement of the bag containing the smartphone would be quite similar for two distinct users travelling in a bus). Therefore, when designing the data collection protocol, we emphasized long travel distances and long-duration recordings (over 7 months) at the expense of having fewer users. We compensated for this deficiency with rich sensor modalities (15 sensor modalities), multiple recording locations on the body (4 locations), and high-quality annotations (28 context labels in total) [18], [19]. Meanwhile, we also recognize the importance of having sufficient users and a large geographical diversity in the dataset, so that the generality of the developed transportation mode recognition approaches can be verified with different people from different areas. Due to limited time and funding, the data collection is confined mainly to the south of the UK. Despite this, the SHL dataset is already one of the biggest datasets (in terms of duration, sensor modality and public availability) in the research community. We will continue improving the quality and size of the dataset in the future. With this dataset and the tools to collect data released, the scientific community can also contribute to expanding it.

VIII. CONCLUSIONS
In this paper we aim to advance the state of the art in transportation mode recognition by proposing a standardized dataset, recognition tasks and evaluation criteria. We introduced a publicly available, large-scale dataset (the Sussex-Huawei Locomotion dataset) for transportation mode recognition from multimodal smartphone sensors. The dataset was recorded by three users, each wearing four smartphones, who performed eight different transportation activities over seven months, resulting in 2800 hours of recordings with 16 sensor modalities. The long duration, the rich sensor modalities, the multiple users with various sensor placements, and the variety of transportation activities make the dataset well suited for establishing standard evaluation tasks. We recommended 12 reference scenarios, which cover most recognition tasks identified in related work, and defined three types of cross-validation: user-independent, sensor-placement-independent and time-invariant evaluation. We suggested six relevant combinations of sensors to use, based on energy considerations, among accelerometer, gyroscope, magnetometer and GPS sensors. Taking advantage of the large amount of data, we computed a large number of statistical and frequency features in order to perform a systematic significance analysis based on information-theoretic criteria. We reported the reference performance on all the identified recognition scenarios with a machine-learning baseline. We provided guidelines on using the dataset, the defined recognition scenarios and the evaluation criteria to generate reproducible and comparable results. We recommend that researchers using the dataset adhere to the tasks defined in this paper and refer to them with the name 'Task-Scenario-Crossvalidation-Modality'.
Through feature analysis we identified the following. For the accelerometer, the important subband features mainly come from the frequency band between 0 and 10 Hz and include both absolute energy and energy ratio; the important quantile features usually have a narrow interval between the lower and upper quantiles; and time-domain energy, time-domain kurtosis and frequency-domain energy are irrelevant features. For the gyroscope, the important subband features mainly come from the frequency band between 0 and 30 Hz and include both absolute energy and energy ratio; the important quantile features usually have a narrow interval between the lower and upper quantiles; and time-domain energy, frequency-domain energy and frequency-domain kurtosis are irrelevant features. For the magnetometer, the important subband features mainly come from the frequency band between 0 and 30 Hz and include the energy ratio only; the important quantile features usually have a narrow interval between the lower and upper quantiles; and time-domain energy, frequency-domain energy and the highest FFT value are irrelevant features.
The reference performance reported on the identified recognition scenarios demonstrates the advantages of using multiple modalities for transportation mode recognition. In particular, the magnetometer is complementary to the accelerometer and gyroscope, and combining the three improves the recognition performance significantly over using the accelerometer and gyroscope only. Similarly, combining GPS and accelerometer improves the recognition performance significantly over using the accelerometer alone, and also over the combination of the three inertial sensors.
We make the dataset and the baseline implementation publicly available to encourage reproducible and fair comparison by the research community [19]. Future work will aim to improve the recognition performance and to verify the generality of the SHL dataset by applying classifiers trained on the SHL dataset to other existing transportation mode recognition datasets.

FIGURE 1. Transportation mode recognition from mobile phone sensor data is generally addressed using streaming machine learning techniques. The data from multimodal sensors (a) are segmented into short frames of sensor signals (b) on which features are computed, yielding a feature vector (c). A classifier (d) then maps the feature vector to one of the transportation classes (e).

FIGURE 2. A participant wearing the four smartphones and a camera during data collection.

FIGURE 3. (a) Amount of data (Dataset-E) collected for each of the eight transportation activities by the three users. (b) Amount of data (Dataset-IG) where GPS is available.

• Scenarios 9 and 10 categorize the four vehicle classes into Private road vehicle (Car), Public road vehicle (Bus), and Rail vehicle (Subway and Train). Scenario 9 additionally merges Walk and Run into Foot.
• Scenario 11 only merges Walk and Run into Foot. Scenario 12 does not have any subgrouping, i.e. with the original eight classes contained in the SHL dataset.

This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2890793, IEEE Access. L. Wang et al.: Enabling Reproducible Research in Sensor-Based Transportation Mode Recognition with the Sussex-Huawei Dataset

TABLE 6. Subgrouping based on the eight classes in the SHL dataset: S1-Still, W2-Walk, R3-Run, B4-Bike, C5-Car, B6-Bus, T7-Train, S8-Subway.

TABLE 9. Feature analysis: subband (E) and quantile (Q) features, and the remaining time-frequency domain (T+F) features.

Type | Features | Dimension
E | Energy and energy ratio with scan width 1 Hz and skip 0.5 Hz | 198
E | Energy and energy ratio with scan width 2 Hz and skip 1 Hz | 98
E | Energy and energy ratio with scan width 3 Hz and skip 1 Hz | 96
E | Energy and energy ratio with scan width 4 Hz and skip 1 Hz | 94
E | Energy and energy ratio with scan width 5 Hz and skip 1 Hz | 92
E | Energy and energy ratio with scan width 10 Hz and skip 1 Hz | 82
E | Energy and energy ratio with scan width 15 Hz and skip 1 Hz | 72
E | Energy and energy ratio with scan width 20 Hz and skip 1 Hz | 62
E | Energy and energy ratio with scan width 25 Hz and skip 1 [...]
[...]
T+F | DC component of FFT | 1
T+F | Highest FFT value and frequency | 2
T+F | Ratio between the highest and the second FFT peaks | 1
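The subband rows of Table 9 can be read as a sliding band of a given scan width that is shifted by a given skip, with an energy and an energy ratio computed per band. The sketch below is one possible reading, not the released implementation. It assumes that the bands span 0 to 50 Hz (`f_max`), an assumption that happens to reproduce the listed dimensions (for example 99 bands, hence 198 features, for a 1 Hz width and 0.5 Hz skip). The function names are hypothetical, and the power spectrum is taken as given rather than computed.

```python
from typing import List, Sequence, Tuple


def subband_edges(width: float, skip: float, f_max: float = 50.0) -> List[Tuple[float, float]]:
    """Enumerate (low, high) bands of a given width, shifted by `skip`, within [0, f_max]."""
    n_bands = int(round((f_max - width) / skip)) + 1
    return [(i * skip, i * skip + width) for i in range(n_bands)]


def subband_features(
    power: Sequence[float],
    freqs: Sequence[float],
    width: float,
    skip: float,
    f_max: float = 50.0,
) -> List[float]:
    """Return [energy, ratio, energy, ratio, ...] for every band.

    `power[k]` is the spectral power at frequency `freqs[k]` (in Hz).
    The ratio is the band energy divided by the total energy in [0, f_max].
    """
    total = sum(p for p, f in zip(power, freqs) if 0.0 <= f <= f_max)
    feats: List[float] = []
    for low, high in subband_edges(width, skip, f_max):
        energy = sum(p for p, f in zip(power, freqs) if low <= f < high)
        ratio = energy / total if total > 0 else 0.0
        feats.extend([energy, ratio])
    return feats


# Synthetic flat spectrum sampled every 0.5 Hz from 0 to 50 Hz (illustration only).
freqs = [0.5 * k for k in range(101)]
power = [1.0] * len(freqs)
feats = subband_features(power, freqs, width=1.0, skip=0.5)
```

With these assumptions, `len(feats)` equals the dimension listed in the first row of the table, which gives a quick consistency check for any alternative implementation.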

FIGURE 4. For each modality (accelerometer, gyroscope and magnetometer) we extract 908 features and compute the MI between each feature and the 28 pair-wise combinations of the eight classes: S1-Still, W2-Walk, R3-Run, B4-Bike, C5-Car, B6-Bus, T7-Train, S8-Subway. The figure shows the number of features from each modality that present an MI value above a specified threshold.
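To make the pair-wise MI analysis concrete, the following sketch computes the mutual information between one feature and the class label for each of the 28 class pairs. The estimator is an assumption: a simple equal-width histogram estimate in bits with a hypothetical `n_bins` parameter, which may differ from the estimator used for the reported results. The function names are also hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple

CLASSES = ["Still", "Walk", "Run", "Bike", "Car", "Bus", "Train", "Subway"]
CLASS_PAIRS = list(combinations(CLASSES, 2))  # the 28 pair-wise combinations


def mutual_information(values: Sequence[float], labels: Sequence[str], n_bins: int = 10) -> float:
    """Histogram estimate of I(feature; label) in bits."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    bins = [min(int((v - lo) / span * n_bins), n_bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(bins, labels))
    count_x = Counter(bins)
    count_y = Counter(labels)
    mi = 0.0
    for (b, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[b] * count_y[y]))
    return mi


def pairwise_mi(values: Sequence[float], labels: Sequence[str], n_bins: int = 10) -> Dict[Tuple[str, str], float]:
    """MI of one feature for every class pair, using only frames from that pair."""
    result: Dict[Tuple[str, str], float] = {}
    for a, b in CLASS_PAIRS:
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        if not idx:
            continue  # no frames from either class
        result[(a, b)] = mutual_information(
            [values[i] for i in idx], [labels[i] for i in idx], n_bins
        )
    return result


# Toy data: the feature separates Still from Walk perfectly (illustration only).
toy_values = [0.0, 0.0, 1.0, 1.0]
toy_labels = ["Still", "Still", "Walk", "Walk"]
toy_mi = pairwise_mi(toy_values, toy_labels)
```

Counting, per modality, the features whose MI exceeds a threshold for a given pair then yields the kind of summary shown in the figure.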

Fig. 7 visualizes the MI values of the time-frequency features from family T+F. The MI is computed between each feature and each of the 28 class pairs. Each subfigure contains 28 panels corresponding to the 28 class pairs. In each panel the indices 1-8 in the first column denote the time-domain features: mean, standard deviation, energy, mean crossing rate, kurtosis, skewness, auto-correlation value and offset.

FIGURE 9. Merging the selected features from the 28 class pairs for each modality. The first row shows the selected subband features; the second row shows the selected quantile features; the third row shows the selected TF features. The color denotes the number of occurrences of each feature in the 28 class pairs.

FIGURE 10. The processing pipeline using the SHL dataset, which is divided into the training and testing datasets according to the leave-one-out strategy. The training dataset is used for feature selection and classifier model training (top block). The testing dataset is used for performance evaluation (bottom block).
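The leave-one-out division in this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: each frame carries a group identifier (a user, a sensor position or a recording period), the identifiers `U1` to `U3` are hypothetical, and the function name is not taken from the released code.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_group_out(groups: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# One group label per frame; three users as in the dataset (labels are hypothetical).
frame_users = ["U1", "U1", "U2", "U3", "U3"]
folds = list(leave_one_group_out(frame_users))

for held_out, train_idx, test_idx in folds:
    # Feature selection and classifier training would use train_idx only;
    # the held-out frames in test_idx are reserved for performance evaluation.
    assert not set(train_idx) & set(test_idx)
```

Restricting feature selection to the training indices, as the caption describes, keeps the held-out group from influencing which features the classifier sees.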


FIGURE 11. Visualization of the F1 score results. (a) The mean F1 score for each scenario obtained by the modalities A, AG and AGM, and with Dataset-E and Dataset-IG. (b) The mean F1 score for each scenario obtained by the modalities A, AG, AGM, P, AP and AGMP, and with Dataset-IG. (c) The standard deviation of the F1 score across users, positions and periods for each group of modalities (Dataset-IG).

TABLE 2. Data channels derived from the smartphone sensors.

TABLE 4. Features computed on the data channel derived from the GPS sensors.
• Scenario 1 is based on the physical activity of the user and categorizes the eight activities into Physically Active (Walk, Run and Bike) and Inactive (Still and Vehicle).
• Scenario 2 is based on the power source (human-powered or machine-powered) and categorizes the eight activities into Still, Pedestrian (Walk, Run and Bike) and Vehicle (Car, Bus, Train and Subway).
• Scenarios 3 and 4 merge the four vehicle activities into a new group Vehicle. Scenario 3 additionally merges Walk and Run into Foot.
• Scenarios 5 and 6 categorize the four vehicle activities into Road vehicle (Car and Bus) and Rail vehicle (Train and Subway). Scenario 5 additionally merges Walk and Run into Foot.
• Scenarios 7 and 8 categorize the four vehicle activities into Private vehicle (Car) and Public vehicle (Bus, Train and Subway). Scenario 7 additionally merges Walk and Run into Foot.
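The subgroupings can be written as label mappings from the eight original classes. The sketch below encodes one reading of the twelve scenario descriptions in this paper; the helper names and the exact group label strings are illustrative choices, not identifiers from the released code.

```python
from typing import Dict, List

CLASSES = ["Still", "Walk", "Run", "Bike", "Car", "Bus", "Train", "Subway"]
VEHICLES = ["Car", "Bus", "Train", "Subway"]
FOOT = {"Foot": ["Walk", "Run"]}


def merge(base: Dict[str, str], groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a copy of `base` in which every member class maps to its group label."""
    out = dict(base)
    for label, members in groups.items():
        for member in members:
            out[member] = label
    return out


def build_scenarios() -> Dict[int, Dict[str, str]]:
    identity = {c: c for c in CLASSES}
    s: Dict[int, Dict[str, str]] = {}
    s[1] = merge(identity, {"Physically Active": ["Walk", "Run", "Bike"],
                            "Inactive": ["Still"] + VEHICLES})
    s[2] = merge(identity, {"Pedestrian": ["Walk", "Run", "Bike"],
                            "Vehicle": VEHICLES})
    # Even-numbered scenarios 4-12 keep Walk and Run as separate classes.
    s[4] = merge(identity, {"Vehicle": VEHICLES})
    s[6] = merge(identity, {"Road vehicle": ["Car", "Bus"],
                            "Rail vehicle": ["Train", "Subway"]})
    s[8] = merge(identity, {"Private vehicle": ["Car"],
                            "Public vehicle": ["Bus", "Train", "Subway"]})
    s[10] = merge(identity, {"Private road vehicle": ["Car"],
                             "Public road vehicle": ["Bus"],
                             "Rail vehicle": ["Train", "Subway"]})
    s[12] = dict(identity)
    # Each odd-numbered scenario 3-11 additionally merges Walk and Run into Foot.
    for odd, even in ((3, 4), (5, 6), (7, 8), (9, 10), (11, 12)):
        s[odd] = merge(s[even], FOOT)
    return s


SCENARIOS = build_scenarios()
N_CLASSES = {k: len(set(v.values())) for k, v in SCENARIOS.items()}
```

Relabelling a sequence of original labels for a given scenario is then a dictionary lookup, for example `[SCENARIOS[5][y] for y in labels]`.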

TABLE 7. Recommended transportation mode recognition tasks using the SHL dataset.

TABLE 8. Division of the SHL dataset based on the recording days.

TABLE 10. Five most frequently reoccurring subband, quantile and TF features in the 28 class pairs for each modality. Key: ω_c: center frequency; ω_b: bandwidth.

• When using the accelerometer (A) alone, the classifier recognizes Still, Walk and Run robustly, but shows significant ambiguity between Car and Bus and between Train and Subway, some ambiguity between Still and Train/Subway, and a relatively low recognition rate for Bike. Car and Bus may have similar vibration intensity, which leads to greater confusion between them; the same holds for Train and Subway. Bike may be misrecognized as Walk, Bus or Car, each with a probability of around 7%.

TABLE 12. F1 score (F) and recognition accuracy (A) for each recognition task obtained using Dataset-IG (GPS available).