Evaluation of Railway Passenger Comfort With Machine Learning

Railway passenger comfort is an increasingly important factor in attracting passengers from other modes of public transport such as air travel. To allow passengers and train companies to estimate the onboard comfort level, we propose a phone-based hybrid machine learning (ML) model that combines a pre-trained convolutional neural network (CNN) as a feature extractor with a support vector regressor (SVR) as a predictor. To demonstrate the capacity of the proposed model, its two sub-models and the same hybrid model with a non-pre-trained feature extractor are adopted as benchmarks. The raw field data are acquired with phones on a corridor between the University of Birmingham station and Birmingham International station and are subsequently converted to the corresponding comfort level according to UIC 513. The four models are trained on the dataset in two domains, the time domain and the frequency domain, then optimized by random search and validated by 10-fold cross-validation. The proposed method yields the best performance, with an $R^{2}$ of 0.988 ± 0.004, a root-mean-square error (RMSE) of 0.028 ± 0.015, and a mean absolute error (MAE) of 0.02 ± 0.005. The results suggest that railway passengers could quantify the level of comfort themselves and that the proposed system could provide real-time assistance for train drivers to adjust their driving style.


I. INTRODUCTION
Trains have gained popularity in many countries owing to their high speed and the shorter time needed before and after boarding compared with air travel, especially for short- and medium-distance journeys [1]. As passengers become more demanding, they expect a more accurate timetable, higher speed, and more comfortable service. Passenger comfort is an increasingly notable area that attracts attention from train companies and academia, since improving it makes railway transportation more competitive. Román et al. analyzed the potential competition between the high-speed train and air transport in the case of Madrid-Barcelona. They found that passengers are willing to pay more to avoid a drop in the level of comfort [2]. Broadly, passenger comfort is an overall evaluation of the train journey that depends on train vibration, temperature, acoustics, lighting, interior design, windows, etc. There is no universal standard for assessing passenger comfort. Section II summarizes three traditional approaches to quantifying railway passenger comfort. However, we consider that these three methods are not easy for everyone to implement.
The associate editor coordinating the review of this manuscript and approving it for publication was Jesus Felez.
Our research proposes a smartphone-based hybrid ML model in which a CNN extracts informative features from the raw data and an SVR predicts passenger comfort from the extracted features. The proposed model promises to let train passengers quantify the comfort level easily and offers a channel for feedback to the train company. The train company could also use this cost-efficient method to monitor abnormalities in passenger comfort.

A. CONTRIBUTIONS
Our primary contribution is a phone-based ML model that provides a handy way to measure passenger comfort. The length of the evaluated section is adjustable to fit specific train lines. UIC 513 specifies that passenger comfort is calculated every five minutes; however, the travel time between stations can be less than five minutes. For instance, the train takes about 2 minutes from University station to Five Ways station in the UK. More importantly, this framework can be implemented as online assistance for monitoring the driver's driving style, for which a period shorter than five minutes allows finer-grained monitoring.

II. LITERATURE REVIEW
In this section, we review the traditional methodologies for evaluating railway passenger comfort to date, the critical role of vibration in railway passenger comfort, and the gaps in the use of machine learning for railway passenger comfort evaluation.
Method 1 - Oborne and Clarke used a questionnaire to find out how comfortable passengers felt about the journey [3]. They acknowledged that many pitfalls can emerge in the interpretation of the data; however, valuable information can still be extracted if the questionnaire is carefully designed and the data are carefully interpreted. A questionnaire has also been conducted on the Shinkansen to determine the impact of physical attributes such as vibration, acoustics, humidity, lighting, air freshness, air pressure, and seat size on passenger comfort. It was concluded that passenger comfort has a 70% correlation with vibration, acoustics, seat size, interior design, and air freshness.
Method 2 - The use of vibration to calculate passenger comfort is well established in standards and methods from different regions, such as ISO 2631, Sperling's method, and UIC 513. ISO 2631 provides an RMS-based approach to evaluating whole-body vibration [4] in various conditions, such as in a building or a car, but it has not been adjusted for train passengers. Sperling's method is more suitable for comparative analysis between vehicles [5], [6]. UIC 513 is dedicated to evaluating train passenger comfort using train vibration.
Method 3 - Some studies considered features beyond vibration, such as ventilation and lighting [7], to assess train passenger comfort. Going beyond mechanical attributes, [8] introduced biological parameters such as heart rate variation to determine the passenger's comfort. More physical features (noise, air pressure, temperature, humidity, and illumination) and physiological factors (body pressure distribution, electroencephalography, and electromyography) were taken into account in [9] to estimate passenger comfort on a high-speed train. Considerable evidence supports the use of physiological factors to reflect a person's state. However, using these features to estimate passengers' comfort is impractical and expensive because additional sensors are needed, and collecting physiological data from passengers may violate their privacy. Moreover, there is no universally applicable evaluation standard for the physiological factors.
The three existing methods cannot evaluate passenger comfort continuously. More importantly, the measurement procedures are not user-friendly, meaning that the measurement has to be carried out by a specialist. For instance, method 1 needs an appropriately designed survey, and passengers cannot be expected to answer tedious questions about every train journey; method 2 demands a proper set-up of accelerometers to collect vibration; and method 3 requires experts to commission the vibration measurement.
Most existing research recognizes the critical role of vibration in passenger comfort. [10] argued that vibration-related ride comfort for passenger trains and vibration-related safety concerns for freight trains should be anticipated at the very start of the design stage. Other studies, such as [11], strove to enhance the bogie suspension and hence increase passenger comfort; [12] used a numerical model validated with field tests and concluded that short-wavelength rail irregularities affect passenger comfort; subsequently, [13] used a magneto-rheological (MR) damper to reduce the discomfort produced by large-amplitude vibration caused by rail irregularities. A full-scale examination with the MR damper was carried out at speeds from 80 to 350 km/h, concluding that excellent comfort was achieved according to UIC 513. [14] considered that there is a strong relationship between the number and spacing of train/bus stops and the comfort level. The common motivation in [11]-[14] is the importance of vibration: a better bogie suspension isolates the car body from more vibration; irregularities trigger stronger train dynamics that affect train vibration; and more numerous, closer stops lead to more acceleration and deceleration, which in turn influence passenger comfort. It is therefore reasonable to adopt vibration to calculate passenger comfort, since the literature has well established the applicability and feasibility of doing so.
ML techniques have thrived in processing vibration data in various fields, such as structural damage detection [15]-[17], human activity recognition [18]-[20], and the prediction of road surface roughness [21]-[23]. These studies agree on the success of ML techniques in vibration analysis. However, few studies have evaluated ride comfort in railway vehicles using ML. Azzoug and Kaewunruen proposed an artificial neural network (ANN) to predict ride comfort [24]. A limitation was that the vehicle only ran at 5-17 mph; however, the main interest was testing the feasibility of a smartphone as a vibration collection tool through a comparative analysis of data from a sophisticated accelerometer and two smartphones. It was concluded that the phone-based accelerometer is trustworthy. A fully connected ANN has a large number of parameters, making the model time-consuming to train, especially for vibration datasets in which each sample is generally large. We consider an ANN computationally costly for the kind of data we use. This issue can become more pronounced if the trained ML model is deployed on a smartphone. Therefore, a more efficient approach is required.
The officially released standard, UIC 513, is a statistical method that integrates the relationship between vibration and human comfort and the impact of vibration on ergonomics. UIC 513 is reliable as it focuses on the vibration characteristics that predominantly affect passenger comfort. A comparative analysis of ISO 2631 and UIC 513 was conducted in [25] using field tests on two corridors, Beijing to Chengdu and Shenzhen to Guangzhou, showing that the two standards largely overlap, with UIC 513 producing a slightly larger uncomfortable area than ISO 2631. This implies that UIC 513 is more sensitive in the uncomfortable zone.

III. PASSENGER COMFORT MEASUREMENT
Fig. 1 provides a flowchart showing how the data flow from a smartphone to the comfort level using UIC 513. UIC 513 specifies that passenger comfort is evaluated over five-minute sections. Each five-minute section is further split into 60 five-second subsections, and each subsection is transformed to the frequency domain before being weighted. The RMS value of each subsection is then calculated. Finally, a confidence probability of 95% is applied to calculate the passenger comfort using (1):
$$N_{MV} = 6\sqrt{\left(a_{XP95}^{W_d}\right)^{2} + \left(a_{YP95}^{W_d}\right)^{2} + \left(a_{ZP95}^{W_b}\right)^{2}} \qquad (1)$$
where:
$N_{MV}$ - the passenger comfort of the simplified measurement.
$a$ - acceleration.
$X$, $Y$, and $Z$ - the directions of the acceleration (see Fig. 1).
$P$ - the vehicle floor, where the acceleration is collected.
$95$ - the confidence probability of 95%.
$W_d$, $W_b$ - frequency weightings, where $b$ is for the $Z$ direction and $d$ is for the $X$ and $Y$ directions.
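As an illustration, the final step of Fig. 1 can be sketched in Python. The sketch assumes that (1) takes the commonly cited simplified form, six times the root-sum-square of the three 95th-percentile weighted RMS accelerations, and that the frequency weighting has already been applied upstream. The function names, the nearest-rank percentile convention, and the example values are illustrative choices, not taken from UIC 513.

```python
import math

def percentile_95(values):
    """Nearest-rank 95th percentile (an assumed convention; others exist)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def comfort_index(rms_x, rms_y, rms_z):
    """N_MV from per-subsection weighted RMS accelerations in m/s^2.

    rms_x and rms_y are assumed to be W_d-weighted, rms_z W_b-weighted.
    """
    a_x, a_y, a_z = (percentile_95(v) for v in (rms_x, rms_y, rms_z))
    return 6.0 * math.sqrt(a_x ** 2 + a_y ** 2 + a_z ** 2)

# Hypothetical input: 60 five-second subsections of one five-minute section.
example = comfort_index([0.1] * 60, [0.1] * 60, [0.1] * 60)
```

With a constant weighted RMS of 0.1 m/s^2 on every axis, the index is 6 times the square root of 0.03, which is slightly above 1.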
The ratings of comfort level are tabulated in Table 1. There are five levels, from level one to level five. UIC 513 recommends that the upper limits of the comfort level for rural trains, traditional trains, and premium trains be 4, 3, and 2, respectively.
The correlation between vibration frequency and human sensitivity is complex. To relate the physical values of vibration to passengers' perception, the frequency weighting curves in the vertical, longitudinal, and horizontal directions are given in Fig. 2. The weighting factors approach one at certain frequencies, implying that people are more sensitive to those ranges, such as 0.5-5 Hz in the longitudinal and horizontal directions and 6-10 Hz in the vertical direction.
In our work, we collected data with a smartphone on a return trip from University station to Birmingham International station. To make sure the acceleration sensed by the phone comes only from the train, the phone is kept still on the floor, with no movement relative to the running train. The mobile app "Phyphox" is installed on the phone to access the embedded accelerometer and to visualize and save the vibrations. Fig. 3 provides an example of the raw data, showing that the vibrations in the horizontal and longitudinal directions are larger than those in the vertical direction.
As mentioned above, UIC 513 evaluates passenger comfort every five minutes. However, we deviate slightly from UIC 513 because we aim to let passengers quantify the comfort level with their own phones, which would require them to keep their phones motionless for five minutes. It is not practical for passengers to leave their phones on the floor, unused, for five minutes. More importantly, to estimate passenger comfort over sections shorter than five minutes, we evaluate the comfort level on a finer scale of five seconds, the so-called instance comfort level.

IV. HYBRID MODEL DEVELOPMENT
In this section, we describe how we configure the hybrid model, the hyperparameter optimization, and the 10-fold cross-validation.

A. DATA PRE-PROCESSING
There are two datasets: time-domain and frequency-domain. Both are processed in the same way. To predict passenger comfort over five seconds, the raw dataset is sliced into samples, and the first sample $X_0$ is given by
$$X_0 = \begin{bmatrix} a_{x1} & a_{x2} & \cdots & a_{x500} \\ a_{y1} & a_{y2} & \cdots & a_{y500} \\ a_{z1} & a_{z2} & \cdots & a_{z500} \end{bmatrix} \qquad (2)$$
where:
$X_0$: the first sample.
$a$: acceleration.
$x$, $y$, and $z$: the three directions shown in Fig. 1.
The subscripts 1-500: time steps (a 100 Hz sample rate is set, producing 500 time steps in five seconds).
After the raw data are sliced into five-second windows, the dataset is split into ten folds of equal size. As 10-fold cross-validation is used, nine folds are used to train the model and the remaining fold is used to validate it. More details on 10-fold cross-validation are discussed in Section IV-D. Note that the validation set should be held out and left untouched until the model is selected. Therefore, when Min-Max normalization is subsequently applied, only information from the training set is used to normalize the validation set.
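The windowing and leakage-free normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: windows are taken as non-overlapping, the Min-Max statistics are computed per axis, and all function names are hypothetical. The original pipeline may differ in these details.

```python
def slice_windows(signal, window=500):
    """Split a list of (ax, ay, az) readings into non-overlapping windows.

    500 readings correspond to five seconds at a 100 Hz sample rate.
    """
    n_windows = len(signal) // window
    return [signal[i * window:(i + 1) * window] for i in range(n_windows)]

def fit_min_max(train_windows):
    """Per-axis minimum and maximum computed from the training windows only."""
    lo = [min(r[k] for w in train_windows for r in w) for k in range(3)]
    hi = [max(r[k] for w in train_windows for r in w) for k in range(3)]
    return lo, hi

def apply_min_max(windows, lo, hi):
    """Scale readings with the training statistics.

    Validation values may fall outside [0, 1]; each axis is assumed to
    vary in the training set, so hi[k] differs from lo[k].
    """
    return [
        [tuple((r[k] - lo[k]) / (hi[k] - lo[k]) for k in range(3)) for r in w]
        for w in windows
    ]

# Hypothetical ten-second recording: the first window trains, the second validates.
signal = [(float(i), 2.0 * i, -float(i)) for i in range(1000)]
windows = slice_windows(signal)
lo, hi = fit_min_max(windows[:1])
train_scaled = apply_min_max(windows[:1], lo, hi)
val_scaled = apply_min_max(windows[1:], lo, hi)
```

Because the validation window is scaled with training statistics, some of its values land slightly outside the unit interval, which is the expected price of keeping the validation fold untouched.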

B. HYBRID MODEL
CNNs were initially designed for classifying images. Nevertheless, we use a CNN in our work for two reasons: (1) a CNN is more efficient than an ANN, which saves computational cost when the proposed model is deployed on phones; (2) a CNN can act as an automatic feature extractor, avoiding handcrafted features, which might introduce human bias if the dataset is not thoroughly understood.
Extensive research has shown that CNNs give promising results in a variety of fields, such as sentence modelling [26], heartbeat classification [27], and image classification on phone devices with significantly reduced computational cost [28]. A substantial advantage of a CNN is its reduced number of parameters compared with an ANN, which allows a larger and deeper model for a more complex task. To receive a five-second sample, a single ANN node requires 1,500 weights (500 time steps in each of three directions). One node might not be capable of the required task, so the number of nodes may need to be increased, which considerably scales up the number of weights. This issue becomes more pronounced if passenger comfort over a longer period is estimated using an ANN, as the size of each sample grows. Another advantage of a CNN is that it allows the model to obtain abstract features automatically. For instance, in image tasks, a CNN detects edges and simple shapes in the first layers, and more detailed features are extracted in the following layers. Handcrafted features could also be introduced in our project if they can be predefined by human experts. However, it is hard to cover all the features essential to the task and challenging to avoid bias. UIC 513 infers the passenger comfort level from the intensity of the vibration, which suggests that the vibration peaks play an important role. However, the intensity at different frequencies should also be considered, as passengers can perceive vibrations of the same intensity differently at different frequencies.
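The parameter argument can be made concrete with a short count. The sketch below compares one fully connected layer with one 1-D convolutional layer on a five-second sample; the layer width of 32 and the kernel size of 5 are hypothetical values chosen only for illustration and are not the settings of the proposed model.

```python
def dense_params(n_inputs, n_nodes):
    """Weights plus one bias per node in a fully connected layer."""
    return (n_inputs + 1) * n_nodes

def conv1d_params(kernel_size, in_channels, n_filters):
    """Weights plus one bias per filter in a 1-D convolutional layer."""
    return (kernel_size * in_channels + 1) * n_filters

n_inputs = 500 * 3                      # 500 time steps in three directions
dense_total = dense_params(n_inputs, 32)
conv_total = conv1d_params(5, 3, 32)
```

Under these assumptions the dense layer needs 48,032 parameters and the convolutional layer 512, and only the dense count grows when the sample window is lengthened.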
In summary, it is sensible to use a CNN as a feature extractor when handcrafted features are hard to define. Fig. 4 presents how the data flow in the hybrid model with its two sub-models, CNN and SVR. As shown in Fig. 4(a), the CNN passes the intermediate features before the prediction layer to the SVR predictor. This differs from studies that replaced the prediction layer of the CNN with an SVR, as shown in Fig. 4(b), such as [29] to predict guide RNA activity, [30] to predict a wastewater index, and [31] to predict short-term traffic flow. These works concluded that the hybrid model performs better than either the CNN or the SVR alone. However, the scenario in Fig. 4(b) does not give a satisfactory result in our work. The outcomes comparing the model with and without the CNN prediction layer are shown in Section V.

C. HYPER-PARAMETERS TUNING
Zhang and Wallace carried out an extensive experiment on the effect of each CNN hyperparameter for sentence classification [32]. Although their goal was different from ours, we can still use it as a guideline, as [32] covers almost all CNN hyperparameters. The tuned parameters are listed in Table 2, which gives the bounds of each hyperparameter used in the proposed model. Random search, a strategy for hyperparameter optimization, is used to search for the hyperparameters with the best scores [33]. Dropout [34] and regularization [35] are used to prevent overfitting. Learning rate decay by epochs, given by (3), is adopted to speed up the training process.
$$\text{decay rate} = \frac{\text{learning rate}}{\text{epochs}} \qquad (3)$$
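A minimal sketch of random search is given below for illustration. The search space, the sampling distributions, and the toy scoring function are all hypothetical; the actual bounds are those of Table 2, and in practice the score would be the cross-validated error of a trained model. The last function reads (3) as the learning rate divided by the number of epochs.

```python
import math
import random

# Hypothetical search space for illustration; the real bounds are in Table 2.
SPACE = {
    "learning_rate": (1e-4, 1e-2),  # sampled log-uniformly
    "dropout": (0.0, 0.5),          # sampled uniformly
    "filters": [16, 32, 64],        # sampled from a discrete set
}

def sample_config(rng):
    """Draw one random configuration from SPACE."""
    lo, hi = SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "dropout": rng.uniform(*SPACE["dropout"]),
        "filters": rng.choice(SPACE["filters"]),
    }

def random_search(score_fn, n_trials=20, seed=0):
    """Return the sampled configuration with the lowest validation error."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=score_fn)

def decay_rate(learning_rate, epochs):
    """Decay rate read from (3): learning rate divided by number of epochs."""
    return learning_rate / epochs

# Toy stand-in for a validation error: prefers a dropout close to 0.2.
best = random_search(lambda cfg: abs(cfg["dropout"] - 0.2), n_trials=50)
```

Unlike grid search, each trial samples every hyperparameter independently, so the number of trials can be fixed regardless of how many hyperparameters are tuned.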

D. K-FOLD CROSS-VALIDATION AND EVALUATION MEASUREMENT
As early as 1930, it was recognized that evaluating a model on the same dataset used to train it easily yields an overoptimistic outcome [36]. Cross-validation addresses this issue by estimating the model's performance on new data [37].
To evaluate a model rigorously and minimize the bias related to random sampling, the way the data are split is of major interest because the amount of data is limited in practice. Our study uses k-fold cross-validation to perform model selection and compare different learning algorithms. K-fold cross-validation splits the dataset into k segments of equal or approximately equal size; k-1 segments are used to train the model, and the remaining segment is used to test the model's performance [38]. We set k to ten, as [39] concluded that k = 10 is sensible when the aim is to measure the model's error.
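The splitting rule can be sketched as a small generator. This is an illustrative implementation, not the code used in the study: it assigns contiguous index ranges to folds without shuffling, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Hypothetical dataset of 103 five-second samples.
folds = list(k_fold_indices(103, k=10))
```

Each sample appears in exactly one validation fold, so every sample contributes once to the reported validation metrics.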
We adopt three indicators: the coefficient of determination $R^{2}$, the root-mean-square error (RMSE), and the mean absolute error (MAE). By definition, $R^{2}$ estimates how well a model predicts the dependent variable from the independent variables [40]. Both RMSE and MAE are widely used to measure the average performance of an ML model, although two widely cited papers disagree on which is preferable [41], [42]. This dispute is outside our scope, so we use both measures to benefit from their respective advantages. The three metrics are given by:
$$R^{2} = 1 - \frac{\sum_{i=1}^{m}\left(y_i - p_i\right)^{2}}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^{2}} \qquad (4)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(p_i - y_i\right)^{2}} \qquad (5)$$
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|p_i - y_i\right| \qquad (6)$$
where $p_i$ and $y_i$ are the predictions and the actual values; $\bar{y}$ is the mean of the label values in the validation set; and $m$ is the number of samples in the validation set.
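For reference, the three metrics can be computed directly from their standard definitions, as in the following sketch. The function names and the sample values are illustrative.

```python
import math

def r2_score(y, p):
    """Coefficient of determination for actual values y and predictions p."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def rmse(y, p):
    """Root-mean-square error."""
    return math.sqrt(sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y))

def mae(y, p):
    """Mean absolute error."""
    return sum(abs(pi - yi) for yi, pi in zip(y, p)) / len(y)

# Hypothetical validation fold with three samples.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
scores = (r2_score(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

Note that $R^{2}$ is undefined when all actual values in a fold are identical, because the denominator becomes zero.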

V. RESULTS AND DISCUSSION
In this section, we present the results produced by the four models (CNN, SVR, pre-trained CNN + SVR, and non-pre-trained CNN + SVR) on the two datasets (time domain and frequency domain). The section begins with Table 3, which presents the optimal hyperparameters found by random search. The results follow the order CNN, SVR, pre-trained CNN + SVR, and non-pre-trained CNN + SVR, first with time-domain data and then with frequency-domain data. Fig. 5(a-c) plots the three measures, $R^{2}$, MAE, and RMSE, using time-domain data to compare the performance across folds. It is apparent from Fig. 5(a-c) that the pre-trained CNN + SVR model achieves the highest $R^{2}$ and the lowest MAE and RMSE. Its $R^{2}$ fluctuates only mildly, with a standard deviation of 0.04, as shown in Table 4. The most interesting aspect of Fig. 5(b) and Fig. 5(c) is that the non-pre-trained CNN + SVR gives a smaller MAE but a larger RMSE than SVR in the fifth fold. This is because RMSE squares the errors, giving larger errors a bigger weight, which implies that the non-pre-trained CNN + SVR produces some large errors in the fifth fold. Although it has been widely debated whether MAE or RMSE should be used [41], [42], it is sensible to use both, as either can be misleading in certain conditions, such as the one just mentioned.
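A small numerical example shows how one model can have a smaller MAE but a larger RMSE than another. The error values below are invented purely for illustration and are not taken from the fifth fold.

```python
import math

def mean_abs(errors):
    """MAE computed from a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_sq(errors):
    """RMSE computed from a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Model A: nine perfect predictions and one large miss (hypothetical).
errors_a = [0.0] * 9 + [2.0]
# Model B: ten moderate misses of equal size (hypothetical).
errors_b = [0.3] * 10

mae_a, rmse_a = mean_abs(errors_a), root_mean_sq(errors_a)
mae_b, rmse_b = mean_abs(errors_b), root_mean_sq(errors_b)
```

Model A has the lower MAE (0.2 against 0.3) but the higher RMSE (about 0.63 against about 0.3), because its single large error dominates once squared.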
Further analysis of the four models' performance using frequency-domain data is provided in Fig. 5(d-f). No results are reported for SVR because it cannot handle complex numbers. In the three subfigures of Fig. 5(d-f), the pre-trained CNN + SVR model gives the highest $R^{2}$ and the lowest MAE and RMSE in every fold. We conclude that the pre-trained CNN + SVR using the frequency-domain dataset outperforms all other scenarios, which is further supported by the overall performance in Table 4.
The results, reported as the mean and standard deviation over the ten folds in the form µ ± σ in Table 4 [43], give the overall performance comparison. The proposed model shows a markedly larger average $R^{2}$ (0.988) than the others and the smallest standard deviation (0.004), giving a coefficient of variation of 0.4%. The proposed model is therefore the best in terms of both performance and stability. Its average MAE and RMSE are 0.02 and 0.028, respectively, both indicating that the prediction error is minor. More details at the sample level can be seen in Fig. 6 in the appendix. Most of the actual values lie between 0 and 1, which indicates that the tested line section is at an excellent comfort level according to Table 1. Note that our work uses a regression model that predicts the exact value of the comfort level, not the comfort level interval presented in Table 1. Turning to Fig. 6(i), fold 9, the predictions at the two peaks show an apparent gap from the two actual values; however, the true values and the predictions remain in the same comfort interval. This suggests that the model's performance could be enhanced if the labels of the dataset were transformed to comfort level intervals.
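The µ ± σ summary and the coefficient of variation can be reproduced with a few lines. In this sketch the use of the population standard deviation is an assumption, since the text does not state which estimator was used, and the per-fold scores are hypothetical; only 0.988 and 0.004 come from the reported results.

```python
import statistics

def summarise(fold_scores):
    """Return mean, standard deviation, and coefficient of variation in percent."""
    mean = statistics.mean(fold_scores)
    std = statistics.pstdev(fold_scores)  # population form; an assumed choice
    return mean, std, 100.0 * std / mean

# Coefficient of variation from the reported mean and standard deviation.
cv_reported = 100.0 * 0.004 / 0.988

# Hypothetical per-fold scores, for illustration only.
mean, std, cv = summarise([0.98, 0.99, 1.00])
```

The reported values give a coefficient of variation of about 0.4%, matching the figure quoted above.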
The other three models, the non-pre-trained CNN + SVR, SVR, and CNN, are tuned by the same random search method. The difference between the CNN and the proposed method is the prediction layer. The best $R^{2}$ that the CNN achieves is 0.706 ± 0.129. A possible reason is that the prediction layer of the proposed model is an SVR, which is more powerful than the dense layer used by the CNN. The excellent generalization capability and accuracy of SVR have been demonstrated [44].
The effectiveness of the CNN as a feature extractor can be seen by comparing the non-pre-trained CNN + SVR with SVR: $R^{2}$ increases by around 10% from SVR to the non-pre-trained CNN + SVR. A further leap in $R^{2}$ is observed from the non-pre-trained CNN + SVR to the pre-trained CNN + SVR. A possible explanation is that the feature extractor shown in Fig. 4(a) has been trained to minimize the loss between the labels and the predictions during the CNN data flow. The features extracted by the pre-trained feature
As mentioned in the literature review, the questionnaire approach needs massive effort from train crews and passengers, which may hinder long-term implementation. Besides, it is difficult to avoid misinterpreting the information collected. In contrast, our system demands far less effort and time. Moreover, the guideline we use, UIC 513, relies solely on vibration, avoiding the subjective judgments inherent in questionnaires. Our implementation also contrasts with method 2 discussed in the literature review, which requires an accelerometer to sense the train vibration and a computer to visualize and save the vibrations; smartphones provide a compact way to obtain train vibrations. The results of this work also demonstrate the feasibility of the proposed system at higher speeds than in [24], where the field test was conducted at 5-17 mph.

VI. CONCLUSION
This study set out to measure the train passenger comfort level using a phone-based machine learning model. We adopt a hybrid model combining a CNN and an SVR: the pre-trained CNN extracts informative features, and the SVR predicts passenger comfort from those features. With random search as the hyperparameter optimization method, optimal results are obtained for the four models. The hybrid model performs better than either of its sub-models. The most critical finding is that a pre-trained feature extractor outperforms a non-pre-trained one.
The findings will be of interest to train companies keen to improve the comfort level. The proposed solution can be integrated easily into an onboard driver-assistance system that provides real-time feedback to the driver on how the current driving style affects passenger comfort. The insights gained from this study may give passengers a simple way to quantify the level of comfort and engineers a straightforward way to calculate the comfort index, as the traditional way is more complicated. A limitation of this study is that most of the actual values in the dataset fall in the interval from 0 to 1, as can be seen in Fig. 6; that is, the dataset lacks samples in the uncomfortable intervals. Despite this limitation, the proposed model gives satisfactory results for the limited number of uncomfortable samples. The main measure we took to make the results more convincing is 10-fold cross-validation. More field data covering various comfort intervals should therefore be collected to further assess the proposed model's effectiveness.
A. UIC 513 GUIDELINE
In 1988, European Rail Research Institute (ERRI) B153 rolled out a UIC standard to assess the passenger comfort level based on ISO 2631 after a decade of research. In 1994, the standard was rolled out officially, namely UIC 513.
VOLUME 10, 2022

FIGURE 1. Flowchart to calculate the passenger comfort level.

FIGURE 3. An example of raw data.

FIGURE 4. Two scenarios for implementing the feature extractor: (a) pre-trained feature extractor and (b) non-pre-trained feature extractor.

FIGURE 5. $R^{2}$, MAE, and RMSE for the four models using time-domain data (a-c) and frequency-domain data (d-f).

FIGURE 6. The actual values and predictions for the samples in the ten folds. (a)-(j) are fold one to fold ten, respectively.

TABLE 2. Hyper-parameters for the hybrid model.