Improving the Prediction of Passenger Numbers in Public Transit Networks by Combining Short-Term Forecasts With Real-Time Occupancy Data

Passengers of public transportation nowadays expect reliable and accurate travel information. The need for occupancy information is becoming more prevalent in intelligent public transport systems as people started avoiding overcrowded vehicles during the COVID-19 pandemic. Furthermore, public transportation companies require accurate occupancy forecasts to improve service quality. We present a novel approach to improve the prediction of passenger numbers that enhances a day-ahead prediction with real-time data. We first train a baseline predictor on historical automatic passenger counting data. Next, we train a real-time model on the deviations between baseline prediction and observed values, thus capturing events not addressed by the baseline. For the forecast, we attempt to detect emerging patterns in real time and adjust the baseline prediction with deviations from the patterns. Our experiments with data from Germany show that the proposed model improves the forecast of the baseline model and is only outperformed by artificial neural networks in some instances. If the training sets only cover a limited period of up to four months, our approach outperforms competing methods. For larger training sets, there are mixed results in the sense that for some test cases, certain types of neural networks yield slightly better results, but our method still performs well with less training effort, is explainable along the whole prediction process and can be applied to existing prediction methods.


I. INTRODUCTION
THE GLOBAL challenges of climate change and urbanization necessitate the evolution of the transportation sector to be more environmentally sustainable while also handling an increasing number of passengers in metropolitan areas. This transformation requires travelers to switch from their current means of mobility to a more sustainable mode of transportation. In urban areas, public transportation systems have an excellent environmental footprint while also ensuring a high throughput of passengers, making them a good candidate for acting as the backbone of future transportation systems [1]. Together with delays [2], crowded vehicles are one of the main negative experiences of travelers [3], hindering them from switching to public transport. In particular, since the COVID-19 pandemic, people have avoided overly crowded vehicles. Real-time information may influence the traveler's decision to travel (travel choice), which mode of transportation to choose for the journey (mode choice), on which path to travel (route choice), where to start and end the trip (boarding and alighting choice) and at which time to travel (departure time choice) [4]. As such, while not directly influencing the supply of public transportation, real-time crowding information (RTCI) may help distribute peak demand more evenly by informing passengers about the public transportation system's current conditions, thereby improving the overall quality of the public transportation system for travelers.
The review of this article was arranged by Associate Editor Erik Jenelius.
Due to the ongoing digitalization, most public transportation companies nowadays employ an intelligent transport system (ITS) for monitoring and managing their fleet of vehicles. Real-time vehicle delays and sometimes real-time passenger occupancy information are already processed and shown to travelers. The next step for public transport providers is to improve the real-time monitoring of the fleet and its operational data by forecasting future conditions, such as expected passenger numbers or vehicle travel time, by training machine learning algorithms on the stored operational data. This step is also motivated by the circumstance that the prediction of passenger numbers is the prerequisite for effective early proactive interventions by vehicle dispatchers [5]. In this paper, we propose a novel methodology for forecasting the number of passengers, which incorporates a prediction based on historical automated passenger counting (APC) data and adjusts this forecast with real-time APC observations. Our algorithm aims to help dispatchers forecast future fleet conditions and helps travelers gain information about their trip. Our fundamental goal is the possible deployment under real-world conditions in an ITS and sketching a migration path from already employed predictions to our hybrid forecasting algorithm.
In the literature, many positive aspects of accurate RTCI are listed. Evidence that overcrowded stations and vehicles increase the dwell time of vehicles at their stations and decrease the level of service is well-established in the literature [6], [7]. On the other side, there is evidence that RTCI may even help increase the perceived service level of public transportation as travelers feel well-informed [8]. RTCI also positively affects the distribution of the load in the transportation network and increases the passengers' satisfaction; it is especially well-suited for reducing crowding during peak-hour traffic. In a simulation, Drabicki et al. [9] measured a decrease of up to 30 percent of passengers for the most occupied vehicles during peak-hour traffic. While crowded vehicles are also often delayed and may already be avoided, the literature shows that travelers avoid overcrowded vehicles, even if they are on time [10]. For a related problem, the prediction of occupancy for wagons of a train, passengers usually take the closest car if no RTCI is available [11]; with RTCI, the passengers more evenly distribute themselves among different wagons [12]. Zhang et al. [12] found that the number of passengers in the most crowded vehicle decreased by around 4 percent and increased in the least crowded vehicle by the same amount. Overall, even if RTCI does not influence any travel decision, passengers rate the availability of the information positively.
In this paper, we work on the problem of predicting the number of passengers in so-called demand segments. We can then later adjust this forecast to passenger load or occupancy in specific network segments or specific vehicles, for example, to display the passenger occupancy in vehicles for travelers. If we talk about passenger occupancy in this paper, we always refer to the number of passengers in a vehicle. In contrast, passenger load refers to the number of passengers on a transportation network segment for a specific time. Passenger occupancy forecasting for public transportation vehicles plays a pivotal role in traveler satisfaction and has already been widely researched in the literature [13]. Even though past research addressed vital issues regarding passenger occupancy and load prediction, our solution focuses on real-time forecasts targeting an easy integration in existing intelligent transport systems. The literature often does not consider the scalability of the approaches to complete transportation networks and evaluates them only on a single line; as such, our case study highlights the viability of training the proposed model on a complete transportation network. Recent related work often experimented with very accurate but computationally complex forecasting algorithms, such as deep neural networks. By doing so, these works favor accuracy over practical applicability, which is defined by the data and time requirements during training, the algorithm's scalability to a larger transportation network, and the model's interpretability by ITS operators. The primary objective of our proposed method is to demonstrate practical applicability while still being more accurate than the currently deployed algorithms in ITS. Apart from research projects, passenger forecast algorithms for RTCI systems are either very simple or nonexistent in Germany.
If crowding information is available at all, it often only contains real-time crowding information, or simple historic average forecasts are used. Therefore, we aim to develop a short-term prediction algorithm with practical applicability in current ITS that balances accuracy and efficiency.
We evaluated multiple time series prediction algorithms as a baseline for our hybrid algorithm, such as seasonal autoregressive integrated moving average (SARIMA) [14]. We extended the day-ahead forecasting from a baseline prediction algorithm by including real-time APC data through learning characteristic passenger occupancy patterns that we can apply to the baseline prediction, similar to the idea of the Profile Similarity Model employed for travel time prediction [15], [16]. Our prediction target is neither the passenger load nor the passenger occupancy itself but a derived metric, the number of passengers in a demand segment, which describes the segment between two stops of a specific line in a particular time interval. We use this derived metric to aggregate the occupancy of multiple vehicles belonging to a line in the same demand segment. To create a prediction for the passenger numbers in demand segments, we designed a two-step approach: First, we create a baseline prediction, and second, we apply certain deviations to the forecast at the time of a prediction request. As such, the algorithm first computes a prediction with a time series algorithm that was trained on a historical data set. This paper found that SARIMA is the best-performing simple time series algorithm for the baseline prediction that also has low complexity. Still, any day-ahead forecasting algorithm, such as typical machine learning or deep learning models, can be used for the baseline prediction. Next, we calculate the deviations between the passenger numbers predicted by the baseline forecasting algorithm and the observed passenger numbers in the historical data set. The algorithm then clusters these passenger number deviations to create clusters of similar deviation patterns. This means we obtain Characteristic Profiles for which our currently trained baseline algorithm often produces mispredictions.
In the algorithm's second step, once the algorithm needs to compute the prediction, it selects a suitable Characteristic Profile, i.e., the most similar profile, applies its deviations to the prediction of the baseline prediction, and then translates the number of passengers in the demand segment into passenger occupancy in the individual vehicles. With this two-stage approach, we ensure that already implemented RTCI in ITS with day-ahead forecasting can still be used but add the possibility also to use available real-time occupancy data.
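The two-step approach described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the use of k-means, the Euclidean distance as similarity measure, the fixed baseline shared by all days, and all function names and numbers are assumptions made for illustration only. The final translation from demand segments to individual vehicles is omitted.

```python
import math

def deviations(observed, baseline):
    """Deviation between observed and baseline passenger numbers per interval."""
    return [o - b for o, b in zip(observed, baseline)]

def distance(a, b):
    """Euclidean distance (assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(vectors, k, iterations=10):
    """Tiny k-means (assumed clustering method) with deterministic start."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: distance(v, centroids[i]))
            groups[nearest].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def adjust_forecast(baseline, observed_so_far, profiles):
    """Pick the most similar profile and apply it to the remaining intervals."""
    n = len(observed_so_far)
    current = deviations(observed_so_far, baseline[:n])
    best = min(profiles, key=lambda p: distance(current, p[:n]))
    return [b + d for b, d in zip(baseline[n:], best[n:])]

# Hypothetical baseline forecast for one demand segment over four intervals.
baseline = [20, 40, 30, 10]
# Hypothetical historical observations for four days.
observed_days = [[20, 41, 30, 10], [30, 55, 42, 18],
                 [21, 40, 29, 10], [32, 53, 40, 20]]

# Step 1 (training): cluster the deviation patterns into profiles.
profiles = k_means([deviations(day, baseline) for day in observed_days], k=2)

# Step 2 (prediction request): two intervals of today have been observed.
adjusted = adjust_forecast(baseline, [31, 54], profiles)
print(profiles)
print(adjusted)
```

In this toy example, the observed deviations of the current day match the profile of the busier days, so the remaining baseline values are raised accordingly.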
The remainder of this paper is structured as follows: Section II first introduces relevant approaches from the literature and then discusses the advantages and disadvantages of the different methods. Next, Section III defines the tackled problem in more detail, introduces our data model, and then highlights the details of our short-term passenger prediction model. Afterward, Section IV presents the evaluation methodology, the data set, and the evaluation of the proposed model itself on an extensive real-world data set spanning a year from a mid-sized German city with around 200 000 inhabitants in the metropolitan area. Lastly, Section V concludes this paper and suggests areas for future work.

II. LITERATURE REVIEW
The analysis of passenger occupancy in vehicles and passenger load on network segments has always been a focus of public transportation operators to improve their service and adapt it to passenger needs. Before operators could automatically collect large quantities of occupancy data, surveyors were tasked with counting passengers at critical locations of the transportation network to assess the passenger numbers and derive the transportation demand [7]. With the emergence of ITS around 2000, data collection has been simplified [17]: Data sources such as the data of automated fare collection (AFC) systems [18], or APC systems with cameras, light barriers, LiDAR [19], weight sensors [20], Wi-Fi data [21], [22], or crowdsourcing [23], [24] became available. Some of these data sources continually transmit their data to an ITS; thus, real-time passenger information systems became possible. The data set in this paper uses data obtained from an APC system utilizing motion sensors installed over the doors of the vehicles and an automated vehicle location (AVL) system for the vehicles' location. Communicating the current passenger occupancy to customers is already beneficial for imminent trips; however, predicting the occupancy for future trips is even more valuable to travelers so that they have enough time to adjust their travel plan to the information. This literature review introduces different prediction problems related to passenger occupancy, passenger load, passenger demand, and passenger flow prediction. We discuss their applications in public transportation systems and highlight current machine-learning approaches and models.
Our primary goal in this paper is the efficient short-term prediction of passenger numbers for travel information systems and the predictive disposition of buses in periods of high occupancy. Hence, we mainly focus on the literature on the short-term prediction of in-vehicle occupancy with real-time information in RTCI systems for passengers and dispatchers. Nevertheless, we also introduce relevant research in related areas, such as passenger flow prediction, i.e., how many passengers are expected to travel which way from one location to another, or passenger demand prediction, i.e., how many passengers are expected to travel from one place to another. As practical applicability in ITS is the focus of this work, we are looking for models with low data and time requirements, high scalability, flexibility, and interpretability. In more detail, we analyze the models' prediction accuracy, data requirement for training, time requirement for training, scalability regarding whole transportation networks, flexibility to react to changes in the input data, e.g., a timetable change, and finally, the interpretability of the model's prediction. Most related work solely focused on the accuracy of the result, but we are also willing to accept a mediocre accuracy if the model's other parameters are satisfying.

A. FORECASTING MODELS FOR PASSENGER LOAD PREDICTION
Fundamentally different approaches have been proposed in the literature for forecasting passenger load, passenger occupancy, passenger flow, and passenger demand [13].
Simple models are still widely prevalent in ITS because of their low complexity. Historical average (HA) models forecast the number of passengers based on the average passenger occupancy in the past. The model usually performs poorly but is simple to understand and quick to implement. The last observation carried forward (LOCF) model forecasts passenger occupancy by returning the last measured value. The model usually performs well for imminent events but rapidly loses accuracy for events further in the future. The literature sometimes uses the HA model as a benchmark for historical data models and the LOCF model as a benchmark for real-time models [25], [26]. Still, their accuracy is not appropriate anymore when creating new algorithms.
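The two simple models can be sketched in a few lines. The passenger counts below are hypothetical and only serve to show which data each model relies on.

```python
from statistics import mean

def historical_average(history):
    """HA: forecast the mean of past counts for the same demand segment."""
    return mean(history)

def last_observation_carried_forward(realtime_counts):
    """LOCF: forecast the most recently measured count."""
    return realtime_counts[-1]

# Hypothetical counts for one demand segment on previous comparable days.
past_days = [32, 40, 36, 44]
# Hypothetical counts measured earlier today on the same line.
today_so_far = [18, 25, 51]

ha_forecast = historical_average(past_days)
locf_forecast = last_observation_carried_forward(today_so_far)
print(ha_forecast, locf_forecast)
```

The HA forecast ignores what is happening today, whereas the LOCF forecast ignores everything except the latest measurement, which matches their respective roles as benchmarks for historical and real-time models.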
Statistical models define a relationship between independent variables and dependent target variables. Wood et al. [27] implemented a linear regression model to predict the passenger occupancy of vehicles by regarding the last available passenger count, the current time, the deviation between scheduled arrival times and actual arrival times, the temperature, the precipitation, and the snow depth as input variables. Linear regression models are often easy to implement and understand [28]. Because of their simplicity, they are often used as an accuracy benchmark, similar to simple models [27], [29]. For passenger flow prediction, the relationship is often non-linear, limiting the accuracy of these models; therefore, the literature also proposes non-linear regression models, such as models with a Poisson distribution [30]. Fitting multiple Poisson models increases the model's accuracy for different passenger arrival patterns, such as morning or afternoon peaks. Poisson regression outperforms the historic average model but is still inferior to more complex models [31]. Jenelius [32] predicts passenger occupancy with lasso regularization to avoid overfitting by simultaneously selecting features and estimating parameters. The developed univariate regression model does not consider correlations between passenger occupancy and alighting events at different stations. For this, Jenelius [32] also implemented a partial least squares (PLS) regression model. The results indicate that lasso and PLS regression work similarly well if only operating on historical data, but lasso regression outperforms PLS if real-time passenger counts are available. Overall, linear and straightforward non-linear regression models improve upon the simple methods but still lack accuracy.
Time series analysis is a subdomain of general statistical analysis that examines data sampled over time. The analysis of passenger occupancy over time is a typical problem for time series analysis, and different algorithms have been proposed, such as the autoregressive moving average (ARMA) model. The autoregressive integrated moving average (ARIMA) model is often used if the time series is not stationary, as regular ARMA models do not work under these conditions [14], [33]. In contrast to other time series algorithms, SARIMA can handle the seasonality of weekdays versus weekends [34], [35]. Interactive Multiple Model (IMM) algorithms are hybrid models with different baseline predictors that can be used. For IMM models, mostly different ARIMA algorithms are trained on different time horizons, such as weekly, daily, hourly, or 15-minute [36], [37]. The different models should capture the stationarity, the periodicity, and the volatility of real-time passenger demand with different resolutions [37]. Time series algorithms have also been tested with real-time smartcard data from an AFC system [36]. The class of ARMA models has moderate accuracy, especially regarding the seasonality of the data, e.g., peak commuter occupancy versus nighttime traffic or vacation times. However, the main disadvantage of ARMA models is their inability to account for external features, such as the influence of the weather or irregular events that may influence the target variable. The data and time requirements are more extensive than for simple regression models, but they retain high flexibility and interpretability.
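To illustrate how a seasonal model can remove a repeating pattern, the following minimal sketch applies seasonal differencing, the operation behind the seasonal integration term of SARIMA. The series values and the season length of three intervals are hypothetical; a full SARIMA model would additionally fit autoregressive and moving average terms, which is omitted here.

```python
def seasonal_difference(series, season_length):
    """Subtract from each value the value one season earlier."""
    return [series[t] - series[t - season_length]
            for t in range(season_length, len(series))]

# Hypothetical counts with a repeating pattern every three intervals
# (low, peak, medium) and a slow upward trend.
counts = [10, 30, 20, 12, 33, 21, 14, 36, 22]

differenced = seasonal_difference(counts, season_length=3)
print(differenced)
```

After differencing, the strong low/peak/medium pattern is gone and only the small season-to-season changes remain, which are easier to model with ARMA terms.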
Kalman Filters (KFs) are often combined with other models, especially for combining historical models with a real-time component [38], [39], [40]. The primary advantage of KF models is their noise robustness and little dependence on large training data sets. For example, Vidya et al. [39] estimated the passenger flow with a Geometric Brownian motion as an internal model and updated its prediction with a KF by applying currently observed passenger numbers, highlighting the KF model's ability to function with very few data points. On its own, a KF often lacks accuracy, but with other models, a KF can improve other models' accuracy by updating a prediction with observed values.
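As a minimal sketch of how a KF updates a prediction with an observed value, consider the scalar measurement update below. It does not reproduce any of the cited models: the variances, the function name, and the numbers are assumptions for illustration, and the internal process model that propagates the state between updates is left out.

```python
def kalman_update(prior, prior_var, observation, obs_var):
    """One scalar Kalman measurement update.

    prior / prior_var: forecast of a historical model and its variance.
    observation / obs_var: real-time count and its measurement variance.
    """
    gain = prior_var / (prior_var + obs_var)          # Kalman gain
    posterior = prior + gain * (observation - prior)  # corrected estimate
    posterior_var = (1.0 - gain) * prior_var          # reduced uncertainty
    return posterior, posterior_var, gain

# Hypothetical values: the historical model expects 40 passengers,
# the APC system currently reports 52, both with equal variance.
posterior, posterior_var, gain = kalman_update(
    prior=40.0, prior_var=16.0, observation=52.0, obs_var=16.0)
print(posterior, posterior_var, gain)
```

With equal variances, the gain is 0.5 and the corrected estimate lies halfway between forecast and observation; a noisier sensor (larger obs_var) would pull the estimate closer to the historical forecast.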
Machine learning models learn parameters of mathematical functions to predict a value based on a set of feature variables, similar to other simpler regression models. Support vector regression (SVR) models are well-suited for passenger occupancy and flow prediction as they allow the definition of an acceptable error margin [14], [41], [42]. Low errors in the prediction might be acceptable depending on how the RTCI displays the information to the users; hence, setting the acceptable error margin could be sensible in practice. For example, suppose the prediction is aggregated to categorical values such as not crowded, lightly crowded, or crowded. In that case, low divergences in the numerical occupancy prediction are less impactful than fewer, more significant deviations. Furthermore, SVR models are well-suited because they can handle non-linear relations with the kernel trick and can thus also detect more complex patterns in the data. Usually, SVR models perform quite well as a univariate model but have problems scaling the predictions to complete transportation networks as a multivariate model.
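The acceptable error margin of SVR corresponds to the epsilon-insensitive loss, which ignores errors inside a tolerance band. The following sketch shows only this loss function, not a full SVR training procedure; the margin of five passengers and the counts are hypothetical.

```python
def epsilon_insensitive_loss(predicted, observed, epsilon):
    """Errors within +/- epsilon cost nothing; larger errors cost the excess."""
    return max(0.0, abs(predicted - observed) - epsilon)

# Hypothetical margin: deviations of up to 5 passengers are acceptable.
small_error = epsilon_insensitive_loss(predicted=42, observed=45, epsilon=5)
large_error = epsilon_insensitive_loss(predicted=30, observed=45, epsilon=5)
print(small_error, large_error)
```

A deviation of three passengers is not penalized at all, whereas a deviation of 15 passengers is penalized with the 10 passengers that exceed the margin.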
Ensemble Learning combines many smaller machine learning models with low accuracy into a larger model with better accuracy. Random forest (RF) [43], gradient boosting regression tree [43], LightGBM [44], and XGBoost [18] algorithms have been proposed for passenger load and flow prediction. Boosted models often have an advantage over non-boosted models, but for noisy data, they tend to overfit. For passenger occupancy prediction, boosted models usually have a good performance due to their ability to handle colinear features or omit features with low importance while only requiring a medium amount of data [44].
Clustering approaches are not widely researched for passenger occupancy prediction in public transportation; even for general traffic prediction in a broader sense, few clustering approaches have been investigated in detail [45]. Noursalehi et al. [46] developed a Dynamic Linear Model for detecting abnormal passenger occupancy patterns. By identifying large deviations between expected passenger occupancy based on a historical forecast and measured occupancy, it is possible to locate stochastic external influence. In contrast, similar occupancy patterns are expected to occur due to the same underlying process. As such, the clusters implicitly contain data about events, such as larger gatherings, weather influences, and typical seasonal influences, such as rush-hour or nighttime traffic. The clustering helps to identify patterns that strongly deviate from the typical day that may occur due to events and identify changes in passenger numbers based on the same phenomenon [46].
Neural networks are a subclass of machine learning algorithms that learn the weights of edges in a graph of neurons. Multiple neural network approaches have been successfully applied to passenger load and occupancy prediction.
Artificial neural networks (ANNs) have been hypothesized very early to be good predictors for passenger load predictions due to their ability to handle non-linear relations and stochastic processes well [47]. In some studies, ANN models displayed a good generalization capacity and prediction quality; however, the data and time requirements during training are usually higher than for classical machine learning regression models [48]. Other studies show that ANN approaches do not necessarily outperform machine learning approaches such as random forests [49]. Given the results with regular ANN, Li et al. [50] propose an ANN with a radial basis function (RBF), as it has a simple structure while maintaining good global approximation ability. They summarize that, on the one hand, the RBF model reaches a good prediction quality and gives essential insights into the data and the passenger flow between different stations. On the other hand, it has a complex internal structure and requires more training time than other machine-learning approaches. Regular ANN models are often neither flexible regarding structural changes in the public transportation network nor are their results interpretable.
Long short-term memory neural network (LSTM) models are deep learning approaches based on recurrent neural network (RNN) models that are especially well-suited for time series prediction [51]. LSTM models are usually quite accurate [52]; however, their main disadvantage is their complicated training procedure and lack of interpretability. For the lack of interpretability, Monje et al. [53] constructed a surrogate tree to obtain rules to explain the LSTM model's black box results. Furthermore, LSTM models can handle time series with temporal variability [26], which is particularly helpful when dealing with a time series consisting of arrival and departure events of public transit vehicles. LSTM models, in contrast to classic time series algorithms such as ARIMA, are also able to predict passenger load accurately across multiple steps in the time series [26]. Furthermore, they are also often part of hybrid models that attempt to combine the strength of multiple predictors. Lin and Tian [54] first applied an RF model to filter the features and then used the LSTM model for the final prediction. Other hybrid models attempt to differentiate between standard passenger flow and passenger flow during abnormal times [55]. The authors argue that predictors often do not learn abnormal events as they tend to be non-reoccurring. Fontes et al. [56] apply a multilayer perceptron model as a regression model to analyze the impact of the weather on passenger demand.
They argue that the model's performance increases when weather data is included as a feature, highlighting the importance of the weather on transit ridership. Nevertheless, a highly complex model is required to obtain good results. Bapaume et al. [57] proposed an inpainting image-oriented approach by encoding the transportation network information in images. The model is based on a convolutional neural network (CNN) called U-net and demonstrated high accuracy for predicting typical passenger occupancy. It also excelled in atypical usage situations, such as train station closures or strike events. Huang et al. [58] combined passenger flow prediction with arrival time prediction in a hybrid LSTM model to provide a holistic information basis for dispatchers. Zhang et al. [59] applied a convolutional long short-term memory neural network (ConvLSTM) model. One main advantage of convolutional models is their remarkable ability to work with many stochastic features, which other regression or time series models often lack. ConvLSTM models are especially well-suited for regression analysis of temporal-spatial data, as they also incorporate the spatial attributes of vehicle occupancy, such as the station's location. Lastly, a class of neural networks called graph neural networks (GNN) is particularly suitable for passenger occupancy prediction because of the similarity between a public transportation network and a graph [60]. These models learn on a graph-like structure, such as public transportation networks. Zhang et al. [61] propose a short-term multi-graph GNN model to capture temporal patterns in different time horizons and compare it to classical machine learning and other neural network approaches. Here, the GNN model outperforms all other tested models for most test cases, even powerful prediction algorithms such as ConvLSTM models.
Table 1 compares the approaches on a coarse-grained level according to their accuracy, data and time requirements, scalability, flexibility, and interpretability. The table is a qualitative summary of the previous section and gives a quick overview of different prediction algorithms from the cited literature. Unfortunately, no public data sets for passenger occupancy prediction are available to our knowledge, so a direct quantitative comparison between approaches implemented by different researchers is difficult. At the same time, researchers often do not discuss their data and the time required for the model training, making it impossible to evaluate specific implementations without a reimplementation. Therefore, we qualitatively compare the results in the table according to the observations in the previously introduced literature and general knowledge of specific models in domains other than passenger occupancy prediction to give a concise and simplified summary of which model excels in which category. Recall that our goal is not necessarily to create the most accurate model but to create a model with reasonably high accuracy while maintaining good practical applicability. For accuracy, some results indicate that models with a higher time and data requirement during training outperform simpler models: RF models and neural networks often outperform linear regression models and time series models such as ARIMA [14], [49]. Nevertheless, some linear regression models perform slightly worse than ensemble techniques, such as RF models [27]. However, more complicated models are not always better; a more in-depth consideration is necessary [62]. Zhang et al. [62] remark that the most complex models are not necessarily the most accurate ones, but the most appropriate ones are. As such, there is also evidence that LSTM models do not necessarily predict with much higher accuracy than RF models if they are not appropriate for the problem [18], [43], [54].
Nar and Arslankaya [29] also noted that decision trees outperform simple ANN and SVR models. One of the more accurate models for passenger flow prediction is the convolutional GNN model, outperforming all other models [61]. Li et al. [63] show that ARIMA models fail to capture non-linear patterns as they miss stochastic relationships in the data. More complex ConvLSTM models significantly outperform plain ARIMA models in terms of accuracy [59]. Nevertheless, if ARIMA is combined in a hybrid model that considers further patterns, its accuracy strongly rises [63]. Other hybrid models utilizing SARIMA and real-time data also offer promising results regarding their high accuracy and high efficiency [34]. As such, there is evidence that if simpler models that are not inherently accurate are combined with another predictor, they can better detect relevant patterns in the data.

B. DISCUSSION
Assessing the time and data requirements of the models from the literature is difficult as often neither the training times of the models nor the model's accuracy compared to the number of data points used for the training are given. Research in other areas suggests that more complex models such as CNN, GNN, or ConvLSTM often require more training data and more computational resources than classical machine-learning approaches. For passenger flow prediction, ANN and LSTM models train significantly longer than XGBoost [18]. As such, we expect more complex networks to always require much data and training time before reaching their high accuracy. When comparing time and data requirements in the context of passenger occupancy prediction, a model's data requirement is more crucial. Long training times can be compensated by faster hardware, whereas data requirement for training remains critical, especially for lines with few data points. As such, we only regard models with a low to medium data requirement, as we aim to create a model with practical relevance that is also usable without many data points available.
Scalability describes how well an approach scales to predict passengers on multiple lines in a transportation network. Unfortunately, the literature does not examine the scalability of all models, as sometimes the model's performance is only evaluated on a single line. In general, simpler models are often also scalable by training a univariate model for each line. Training one model per line is no longer feasible for models with extensive data or time requirements; hence, such models are often trained on the complete network as so-called multivariate models. For example, Toque et al. [43] compared a multivariate LSTM with a univariate RF model. While the univariate RF model outperforms the multivariate LSTM model, the performance of the LSTM model is still noteworthy as a single model predicts all passenger flows simultaneously. Both models are scalable in this scenario, as RF models are fast to train, and the LSTM model spans the whole transportation network. Nevertheless, a univariate model, as it is only trained on a single line, might miss the correlation between different lines. Multivariate models require much fine-tuning before being able to produce high-quality predictions. Usually, more complex models such as ConvLSTM or GNN scale through being multivariate, and simpler models such as ARIMA or RF scale by training multiple models.
The flexibility of a model describes how well a model can adapt to changes in the underlying transportation network, e.g., adjustments in the timetable or slight changes in a vehicle's route due to construction work, without retraining on new data. Multivariate models such as a ConvLSTM model can usually compensate for such effects very well; univariate models that train on a single line may be confronted with unknown occupancy patterns in the data due to changes in the route. As such, univariate models are usually less flexible, as their training set is restricted to patterns occurring on a single line. Often, models utilizing real-time data are more flexible, as they can adjust their prediction based on currently observed patterns and do not only predict based on patterns observed in the past. In summary, simple models that do not require much time and data to train are flexible because public transit operators can frequently retrain them. Multivariate and complex models such as GNN are flexible, as they are trained on multiple routes and should be able to detect the relevant patterns across numerous lines. Finally, models with a real-time component are flexible, as they should be able to extract the relevant patterns not from historical data but from the APC data feed.
A model's interpretability is crucial for dispatchers who operate the ITS to understand why an algorithm computed a particular prediction. Models that build a complex internal model, such as SVR, ANN, or LSTM approaches, are usually difficult to interpret. Simpler models, such as HA, LOCF, ARMA, or classification, are usually easier to interpret. Clustering algorithms, for example, usually extract clusters that correspond to graspable situations, such as large volumes of passengers in the morning due to commuting. Nevertheless, there are attempts to apply additional methods to black box approaches, such as LSTM models, that make their decisions understandable for humans [53]. Hence, a model can be interpretable because of its simplicity or because explainable AI approaches enable humans to understand a model's black box result.
The previous discussion and Table 1 showed that approaches that use a complex model such as ConvLSTM or GNN have the highest accuracy and can detect relevant patterns even in complex scenarios solely by training on a historical time series. Unfortunately, this high accuracy comes at the cost of requiring an extensive data set and a large amount of processing power during training. Furthermore, these models are not interpretable on their own. They achieve scalability and flexibility by training a multivariate model on all lines, which requires more fine-tuning than a univariate model.
In this work, we present a predictor with practical applicability; therefore, we aim to create an algorithm that balances its accuracy and the other introduced properties, i.e., its efficiency, to remain usable in the current ITS. Models with average accuracy and efficiency include KF, ARMA, regression, ensemble learning, and classification models. For an accurate short-term prediction of passenger flow and load, many fixed and stochastic features must be regarded [64], [65]. Nevertheless, we chose baseline algorithms that probably do not capture these features independently and instead decided to explore other means of capturing these missed stochastic features in a second step. Noursalehi et al. [46] demonstrated the viability of compensating for missed stochastic features by manually clustering external events and incorporating this information with AFC data. Furthermore, models can identify general patterns in APC data by directly clustering it [66]. For AVL data, it has also been shown that Characteristic Profiles can capture relevant patterns for a vehicle's travel time [16]. As such, we explore automating the search for irregular passenger numbers that cannot be explained by internal features by clustering deviations and adapting the baseline prediction with insights from live APC data. With such a hybrid approach, we aim to create a model with low computational requirements, adequate prediction accuracy, and high practical applicability.

III. APPROACH
With a more accurate prediction, public transport providers can assign vehicles more efficiently, and passengers can improve their travel planning. Algorithms training solely on historical APC data often already capture the cyclic nature of passenger numbers. Still, some effects caused by external factors, such as irregular events or weather patterns, can, at best, be predicted with computationally complex models [18] or with significant manual research [46]. As such, we present an online algorithm that detects irregular events as their effects unfold by analyzing live APC data and then applies their effects to a previously computed baseline prediction. Figure 1 shows the overview of the forecasting model and its steps: preprocessing, baseline prediction, reduction, and prediction.

A. ALGORITHM'S WORKFLOW
In the preprocessing step, the raw APC data is transformed into an appropriate format for the later stages. Passenger occupancy, in our case, is measured by door sensors that recognize boarding and alighting passengers. The raw APC data is aggregated by the ITS and contains one record for each bus after it leaves a bus stop. The records contain the date, planned departure time, actual departure time, bus line, last bus stop, GPS coordinates, vehicle identifier, number of passengers that boarded and alighted the bus, passenger occupancy, data on the APC quality, temperature, and precipitation. From this data set, we are interested in the date, planned departure time, bus line, last stop, and the number of passengers on the bus. The next stop, which is relevant for the prediction, is retrieved by identifying the next record for the current bus.
The data is then binned into large vectors called demand segments, a combination of a time interval and a network segment defined by the bus line, last stop, and next stop. Formally, the demand segments are indices that identify vectors. For simplicity, we use the term demand segments for both the vectors and the indices. If multiple buses are recorded on the same demand segment, the processed data shows the sum of their occupancies. This is also a valid assumption in practice because the aggregated occupancy reflects that if a public transport provider assigns an additional vehicle to a line, travelers would also distribute themselves more evenly among the vehicles.
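The binning described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation; the record field names (`date`, `line`, `last_stop`, `next_stop`, `planned_departure`, `occupancy`) and the helper names are assumptions made for this example, and the 15-minute interval length is taken from the formalization later in the text.

```python
from collections import defaultdict


def to_interval(hhmm, minutes=15):
    """Map a planned departure time 'HH:MM' to the start of its interval."""
    hour, minute = map(int, hhmm.split(":"))
    minute -= minute % minutes
    return f"{hour:02d}:{minute:02d}"


def bin_records(records):
    """Sum occupancies per (date, line, last stop, next stop, interval)."""
    segments = defaultdict(int)
    for rec in records:
        key = (rec["date"], rec["line"], rec["last_stop"], rec["next_stop"],
               to_interval(rec["planned_departure"]))
        segments[key] += rec["occupancy"]  # multiple buses are summed up
    return dict(segments)


# Hypothetical APC records: two buses fall into the 07:30 interval.
records = [
    {"date": "2019-03-04", "line": "10", "last_stop": "A", "next_stop": "B",
     "planned_departure": "07:32", "occupancy": 41},
    {"date": "2019-03-04", "line": "10", "last_stop": "A", "next_stop": "B",
     "planned_departure": "07:44", "occupancy": 17},
    {"date": "2019-03-04", "line": "10", "last_stop": "A", "next_stop": "B",
     "planned_departure": "07:47", "occupancy": 23},
]
binned = bin_records(records)
```

Using the planned rather than the actual departure time mirrors the statement that late buses are recorded in the demand segment they were planned for.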
Not all of these discovered demand segments have values present for sufficiently many days. This can have a multitude of reasons: 1) Some demand segments may only be serviced on certain days (e.g., bus lines for students on school days or night buses on weekends), 2) there may be changes in the schedule during the data collection period, 3) while late buses are recorded in their original demand segment, additional buses sent by dispatch do not have a planned demand segment and are recorded in the demand segment they were observed in, 4) not all buses are equipped with APC devices. We removed all demand segments with more than 40 percent missing measurements to ensure that the data used for training and testing had enough data points. Note that demand segments only served on weekdays have $2/7 \approx 28.6$ percent missing measurements. With the possibility of more measurements missing due to the above-noted issues, a smaller cutoff for missing data is not sensible.
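The cutoff can be illustrated as follows. This is a sketch under the assumption that the observations are stored as a mapping from a demand segment to the days on which a value was recorded; the function and variable names are hypothetical.

```python
def keep_segments(observations, n_days, max_missing=0.40):
    """Keep demand segments with at most `max_missing` missing days.

    observations: dict mapping a segment key to {day_index: passengers}.
    """
    kept = {}
    for segment, per_day in observations.items():
        missing = 1 - len(per_day) / n_days
        if missing <= max_missing:
            kept[segment] = per_day
    return kept


n_days = 14  # two full weeks, day index 0 is a Monday
weekday_only = {d: 10 for d in range(n_days) if d % 7 < 5}   # 10 of 14 days
weekend_only = {d: 10 for d in range(n_days) if d % 7 >= 5}  # 4 of 14 days
kept = keep_segments(
    {"weekday_segment": weekday_only, "weekend_segment": weekend_only}, n_days
)
```

In this toy example, the weekday-only segment has 2/7 of its days missing and survives the 40 percent cutoff, whereas the weekend-only segment has 5/7 missing and is removed.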
Each day is represented by one vector whose components consist of all recorded passenger numbers summed up into the corresponding demand segments. In other words, each component of the vector demand segment consists of the number of passengers during a specific time interval, bus line, and segment between two stops. This means that the demand segments show how many people traveled during which time intervals on specific network segments. Note that passengers usually travel along multiple stops, in which case they are counted multiple times and therefore appear in multiple demand segments. This also means that the target for our prediction is the demand segments, i.e., our algorithm predicts how many passengers are likely traveling at a specific time on a particular network segment. Figure 2a shows the passenger numbers on a fictional network segment and bus line throughout the day.
To obtain the baseline prediction, we apply the different time series algorithms to the preprocessed demand segments to predict general trends. These baseline predictions are already valid forecasts, although they most likely miss patterns that only occur irregularly in the data. We then improve upon the baseline prediction in the next step. Due to our preprocessing, we train separate univariate models for each demand segment, which requires fast training times. We tested algorithms such as SARIMA, lasso least angle regression (LassoLars), k-nearest neighbors (K-NN), orthogonal matching pursuit (OMP), passive-aggressive regression (PAR), and contextual historical average (CHA) for the baseline prediction. During the training of the different algorithms, good hyperparameters of all models are determined with a repeated holdout validation approach.
For the baseline, we follow a one-day ahead forecasting strategy for each demand segment; thus, the time series regards the passenger numbers on the demand segment from the previous day. While most time series algorithms can predict an arbitrary number of steps in a time series, many achieve this by predicting the next value with the previously predicted value. This feedback significantly amplifies prediction errors, especially when predicting a value in the evening on a time series with 15-minute intervals. Even those that do not rely on this have an increasing error or significantly increasing complexity as the number of relevant historical data points grows. On the other hand, if the prediction is made separately for each demand segment, each series has one value per day, leading to a more accurate prediction by the baseline prediction as the error cannot accumulate over an entire day. At the same time, each time series prediction is simple enough, keeping the overall computational effort manageable. In the evaluation, we also evaluate the time series algorithm in real-time, rather than as day-ahead forecasting, and will compare the results.
The task of the reduction step is to aggregate the data obtained in the previous step. For this, the algorithm computes Characteristic Profiles from the deviations, i.e., the differences between the predicted and the observed passenger numbers on each demand segment. Figure 2a illustrates how deviations are calculated. After computing the deviations, the algorithm applies a clustering algorithm to them to reduce multiple observations of similar patterns to a single cluster. Each cluster is represented by one Characteristic Profile, i.e., the cluster's center. Depending on the clustering algorithm, the cluster center is the mean of all data points or the most central data point. This means that each cluster contains deviations following the same trend. Figure 2b shows a potential Characteristic Profile that may have been extracted from a cluster of similar deviation patterns. For the evaluation, we have implemented k-means and k-medoids clustering for the reduction step. After reducing the deviations to the Characteristic Profiles, the algorithm is fully trained and can forecast passenger numbers on demand segments.
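The reduction step can be sketched with a small, dependency-free k-means. This is an illustration only: the deterministic farthest-point initialization is a simplification chosen for reproducibility and is not taken from the paper, the deviations are assumed to be computed as observed minus predicted so that they can later be added to the baseline, and all names and numbers are hypothetical.

```python
def deviations(observed, predicted):
    """Signed deviations (observed minus predicted), one row per day."""
    return [[o - p for o, p in zip(obs_day, pred_day)]
            for obs_day, pred_day in zip(observed, predicted)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Farthest-point initialization (assumption, chosen for determinism)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns the k cluster centers."""
    centers = init_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda l: sq_dist(p, centers[l]))
            clusters[nearest].append(p)
        updated = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centers[l]
            for l, members in enumerate(clusters)
        ]
        if updated == centers:  # assignments are stable
            break
        centers = updated
    return centers


# Four days, three demand segments; two days show a strong surplus.
observed = [[10, 11, 9], [11, 9, 10], [20, 22, 18], [19, 21, 20]]
baseline = [[10, 10, 10]] * 4
dev = deviations(observed, baseline)
profiles = kmeans(dev, k=2)  # each center is one Characteristic Profile
```

Here one profile stays close to zero (the baseline was adequate) and the other captures a surplus of roughly ten passengers per demand segment. A k-medoids variant would return the most central observed deviation vector instead of the mean.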
We drafted multiple hypotheses about the longevity of relevant real-time detectable patterns and their influence on passenger numbers and implemented multiple high-level reduction approaches. Since the Characteristic Profiles represent these patterns, different high-level approaches for the reduction apply the clustering to differently structured data. We call these reduction methods high-level approaches for the reduction, as different low-level clustering algorithms such as k-medoids or k-means can be employed. As such, depending on the chosen high-level approach and the specific clustering algorithm, we create different sets of deviations and Characteristic Profiles.
For the final real-time prediction, the algorithm combines the results of the baseline prediction with real-time events identified by the Characteristic Profile. For each time interval, the deviations between recorded passenger numbers and the corresponding passenger numbers from the baseline prediction are matched to the extracted Characteristic Profiles. The real-time prediction is calculated by applying the Characteristic Profile to all future values from the baseline prediction. This is done by adding the signed deviations to the predicted passenger numbers for the upcoming time intervals. In Figure 2b, the vertical red line marks the point of the prediction. All later observed values and deviations are unknown at that point, matching the situation in a real-world deployment of the algorithm in ITS. The deviations observed so far are matched to the pictured Characteristic Profile, which is then added to or subtracted from the baseline prediction in black to form the real-time prediction in cyan. Thus, we obtain a prediction of passenger numbers for a specific demand segment. To obtain the final occupancy prediction for a single vehicle traveling on the related network segment, we regard all buses traveling at that time on the network segment. If only a single vehicle is observed on the network segment in the given time interval, we assign all passengers to that vehicle as forecasted passenger occupancy. If multiple vehicles are observed, we assign the calculated passenger numbers proportionally to the currently measured vehicle occupancy. This completes the description of the algorithm's workflow; as a next step, we formalize these steps and introduce them in more detail.

B. FORMALIZATION OF THE DATA
To illustrate the different approaches, we introduce a formal notation of the data. We separate one day into time intervals of 15 minutes and denote the set of their starting times by $T$. For this, we first identify the network segments as $j = (\text{line}, \text{from}, \text{to})$, containing the public transport line and the direction of travel by regarding the from and to parameters. Including the travel direction in the segment usually results in higher accuracy [49], [65]. Next, the algorithm assigns to each network segment the number of passengers, turning them into demand segments for each time interval and each operating day. We model a demand segment as follows: $D_{(\text{line},\text{from},\text{to}),t}(i)$ contains the number of passengers on day $i = 1, \dots, d_0$ within the 15-minute time interval beginning at $t \in T$ at $j = (\text{line}, \text{from}, \text{to})$. The latter means a restriction to a particular line and to the network segment between the two bus stops from and to. Thus, the index $j$ refers to the network and points to all relevant combinations of network segments. This set is defined as $\mathcal{J}$. The index $i$ represents the operating day, where $d_0$ denotes the total number of days.
The data set or matrix of observed passenger numbers can then be described as
$$Y = \big(Y_{i,(j,t)}\big)_{i = 1, \dots, d_0;\ (j,t) \in \mathcal{J} \times T}, \qquad Y_{i,(j,t)} := D_{j,t}(i).$$
This notation implies that a row $Y_{i,-}$ of the observation matrix $Y$ (i.e., the variables to be predicted) collects all possible observed demand segments on day $i$, whereas a column $Y_{-,(j,t)}$ contains the values for a fixed demand segment, belonging to a network segment $j$ and time $t$, but for all days. The latter can be interpreted as the time series on which the baseline prediction is based.
If we denote by $J$ the total number of demand segments, which is the same for each day, $Y$ takes values in $\mathbb{R}^{d_0 \times J}$. Analogously to the observed data, we collect the values predicted by the baseline prediction in the matrix $\hat{Y} = (\hat{Y}_{i,(j,t)})$. Lastly, we define the matrix of deviations, which is the basis of the real-time prediction, as
$$\Delta := Y - \hat{Y},$$
which contains all deviations or errors of the baseline prediction with respect to the observed values. Note that the order of the columns of the above matrices is not relevant here. Without loss of generality, we can assume that for a fixed day $i$, the values corresponding to the demand segments $D_{j,t}(i)$ are first sorted by ascending time intervals $t$, to simplify the definition of time windows, and then alphabetically by network segment $j = (\text{line}, \text{from}, \text{to})$.

C. HIGH-LEVEL APPROACHES FOR THE REDUCTION
With the formalization of the input data, we can now introduce the different approaches to cluster the deviations for the real-time prediction. We define the following approaches of Complete Day and Separate Moving Window as high-level approaches, as they influence which data points are regarded together for the clustering and thus strongly influence the detection of patterns.
The Complete Day (CD) approach focuses on patterns that span the entire day. The approach assumes that patterns early in the day also influence patterns found later in the day. For CD, the clustering is performed on the entire deviation vectors $\Delta_{i,-} \in \mathbb{R}^{J}$. Recall that each row of the deviation matrix $\Delta$ contains the deviations of every demand segment on day $i$. This means that the $d_0$ data points $\Delta_{i,-}$ live in a $J$-dimensional space.
The Separate Moving Window (SMW) approach is more suited to patterns that span a shorter time, as it only considers windows of fixed size. These overlapping windows are obtained by selecting the subset of indices corresponding to the window from the deviation vectors before clustering each window separately. The real-time prediction for any time beyond this window is the same as the baseline prediction. Formally, we look at moving windows of size $\Delta t := t_2 - t_1$ with $t_1 < t_2$, $t_1, t_2 \in T$, at initial time $t_0$. The corresponding submatrices are defined as
$$\Delta(t_0, \Delta t) := \big(\Delta_{i,(j,t)}\big)_{i = 1, \dots, d_0;\ j \in \mathcal{J},\ t_0 \le t < t_0 + \Delta t}.$$
Hence, for every day $i$, a vector $\Delta_{i,-}(t_0, \Delta t)$ describes all deviations of all demand segments in the time window from $t_0$ to $t_0 + \Delta t$. The idea behind the moving windows is to identify more patterns in the deviations by choosing small enough time windows. The CD approach considers the whole day, while moving windows detect patterns within a few hours in practice. The overlapping (assuming $\Delta t >$ 00:15) of the vectors $\Delta_{i,-}(t_0, \Delta t)$ guarantees that patterns located between otherwise disjoint windows are also considered. For the clustering in the SMW approach, the vectors $\Delta_{i,-}(t_0, \Delta t)$, $i = 1, \dots, d_0$, serve as input for each initial time $t_0 \in T$. In other words, there is a matrix $\{\Delta_{i,-}(t_0, \Delta t) \mid i = 1, \dots, d_0\}$ for every $t_0$ and hence many different clusterings. In contrast to CD, there is more than one single matrix. The vectors $\Delta_{i,-}(t_0, \Delta t)$ live in a space of dimension strictly smaller than $J$, depending on the window size $\Delta t$, since only demand segments $D_{j,t}(i)$ within the time frame $t_0 \le t < t_0 + \Delta t$ are contained. If we denote by $J_{t_0, \Delta t}$ the number of demand segments in that time interval, we have $\Delta(t_0, \Delta t) \in \mathbb{R}^{d_0 \times J_{t_0, \Delta t}}$.
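The window selection can be sketched as plain column slicing of the deviation matrix. The snippet assumes that the columns are labeled by pairs of a network segment and a start time in minutes since midnight, sorted by time first; these labels, the helper names, and the numbers are hypothetical.

```python
def window_columns(columns, t0, width):
    """Indices of demand segments (j, t) with t0 <= t < t0 + width."""
    return [c for c, (_, t) in enumerate(columns) if t0 <= t < t0 + width]


def submatrix(delta, cols):
    """Restrict every day's deviation vector to the selected columns."""
    return [[row[c] for c in cols] for row in delta]


# Column labels: (network segment, start of the 15-minute interval in minutes).
columns = [("seg_a", 420), ("seg_b", 420), ("seg_a", 435),
           ("seg_b", 435), ("seg_a", 450)]
delta = [[1, -2, 3, 0, 4],    # deviations on day 1
         [0, 1, -1, 2, -3]]   # deviations on day 2

first = window_columns(columns, t0=420, width=30)   # 07:00 to 07:30
second = window_columns(columns, t0=435, width=30)  # 07:15 to 07:45
window_matrix = submatrix(delta, second)
```

The two windows share the columns of the 07:15 interval, which is the overlap that lets patterns between otherwise disjoint windows be detected. Each window matrix would then be clustered separately.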

D. CONSTRUCTION OF THE REAL-TIME PREDICTION
The real-time prediction is constructed by correcting the baseline prediction at an initial time $s \in T$ up to an upcoming time $t^+ \in T$, $t^+ > s$. The observed deviations on the current day up to the time $s$ are matched to a cluster center derived from one of the above approaches. This cluster center represents a Characteristic Profile of deviations, which is added to the baseline prediction for the next $\Delta t^+ := t^+ - s$ hours. Let $k \in \mathbb{N}$ be a fixed number of clusters. As a clustering we define a mapping
$$C: \mathbb{R}^{m \times n} \to (\mathbb{R}^{n})^{k}, \qquad X \mapsto C(X) = (C_1, \dots, C_k),$$
where $C_l \in \mathbb{R}^n$ are the centers of the clusters, $n, m \in \mathbb{N}$. We consider two different metrics that determine the forming of the clusters: the Manhattan distance $L_1$ and the Euclidean distance $L_2$, which are defined by
$$L_p(x, y) := \Big(\sum_{q=1}^{n} |x_q - y_q|^p\Big)^{1/p}, \qquad p \in \{1, 2\},\ x, y \in \mathbb{R}^n.$$
Let us first consider the CD approach. For a fixed current day $i_0 \in \{1, \dots, d_0\}$, we are looking for a correction of the values $\hat{Y}_{i_0,(j,t)}$ for all network segments $j \in \mathcal{J}$ and $t = s, \dots, t^+$. As explained for the CD approach, the cluster centers are $C_l = C(\Delta)_l \in \mathbb{R}^J$ for $l = 1, \dots, k$. The matching to a cluster center $C_l$ is in fact a projection mapping with respect to $L_p$. Let $r = r(s)$ be the number of demand segments on a fixed day up to time $s$, i.e., $r = |\mathcal{J} \times \{t \in T \mid t \le s\}|$. Further, we define by
$$\pi_C: \mathbb{R}^r \to \{(C_{1,l}, \dots, C_{r,l})\}_{l = 1, \dots, k}$$
the closest projection of an $r$-dimensional vector to the set of cluster centers restricted to the subspace $\mathbb{R}^r$. The property
$$\|\pi_C(v) - v\|_p = \min_{l = 1, \dots, k} \|(C_{1,l}, \dots, C_{r,l}) - v\|_p$$
is fulfilled for $v \in \mathbb{R}^r$. It means that $\pi_C(v)$ matches a vector to the closest cluster center with respect to $L_p$. We call the cluster index satisfying this property $l_0 = l_0(v)$ and hence the corresponding cluster center $C_{l_0} = \pi_C(v)$. Back to our application, the updated matrix $\hat{Y}^{\mathrm{rt}}$ of real-time predicted passenger numbers on a demand segment is constructed as
$$\hat{Y}^{\mathrm{rt}}_{i_0,(j,t)} := \hat{Y}_{i_0,(j,t)} + C_{(j,t),\,l_0}, \qquad j \in \mathcal{J},\ t = s, \dots, t^+, \qquad \text{with } l_0 = l_0\big((\Delta_{i_0,(j,t)})_{j \in \mathcal{J},\, t \le s}\big).$$
This expression matches the deviations on the current day $i_0$ up to time $s$ to a specific cluster center derived from the passenger numbers on the other days. The cluster center, which corresponds to one pattern of deviation for the whole day $i_0$, is then added to the baseline prediction up to the time $t^+$ in order to improve the prediction.
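The matching and correction for the CD approach can be sketched as follows. This is a simplified illustration for a single day in which the vectors are ordered by time, the first entries are already observed, and only the remaining entries are corrected; the function names and numbers are hypothetical. The $1/p$ root of $L_p$ is omitted because it does not change which profile is closest.

```python
def match_profile(observed_dev, profiles, p=2):
    """Index l0 of the profile closest to the deviations seen so far."""
    r = len(observed_dev)

    def dist(profile):
        # L_p distance without the final root, restricted to the first r entries.
        return sum(abs(c - v) ** p for c, v in zip(profile[:r], observed_dev))

    return min(range(len(profiles)), key=lambda l: dist(profiles[l]))


def realtime_prediction(baseline, observed, profiles, p=2):
    """Add the matched Characteristic Profile to the remaining baseline values."""
    r = len(observed)
    dev = [o - b for o, b in zip(observed, baseline[:r])]
    l0 = match_profile(dev, profiles, p)
    return [b + c for b, c in zip(baseline[r:], profiles[l0][r:])]


baseline = [20, 30, 40, 30, 20]      # day-ahead prediction for five intervals
profiles = [[0, 0, 0, 0, 0],         # "nothing unusual"
            [5, 8, 10, 6, 2]]        # "busier than usual"
observed = [26, 37]                  # first two intervals seen so far
corrected = realtime_prediction(baseline, observed, profiles)
```

The observed deviations (6, 7) are closest to the second profile, so its remaining entries are added to the baseline for the three upcoming intervals.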
The clustering works similarly with the SMW approach. Recall that the moving windows $\Delta(t_0, \Delta t)$ separate the data, resulting in many different clusterings $C(\Delta(t_0, \Delta t)) = \{C_{t_0,l}\}_{l = 1, \dots, k}$ with $t_0 \in T$ denoting the beginning of the moving window. From the point in time $s$ there is a prediction range up to $t^+ > s$ as before and a detection range starting at $t_0 < s$. A Characteristic Profile of deviations is then detected between $t_0$ and $s$ on that day among the clusters, and we set $\Delta t = t^+ - t_0$ as a boundary for the moving window. The procedure is almost the same as in CD, but we replace $r$ with $r(s, t_0) = |\mathcal{J} \times \{t \in T \mid t_0 \le t \le s\}|$. Therefore, we obtain the real-time prediction by
$$\hat{Y}^{\mathrm{rt}}_{i_0,(j,t)} := \hat{Y}_{i_0,(j,t)} + \big(C_{t_0,\,l_0}\big)_{(j,t)}, \qquad j \in \mathcal{J},\ t = s, \dots, t^+, \qquad \text{with } l_0 = l_0\big((\Delta_{i_0,(j,t)})_{j \in \mathcal{J},\, t_0 \le t \le s}\big),$$
where $t_0 = t^+ - \Delta t$ with fixed parameters $\Delta t$ and $t^+ > s$.
With this, both high-level approaches, Complete Day and Separate Moving Window, and their corresponding real-time adjustments for predicting the number of passengers in a demand segment are introduced.

E. LIMITATIONS AND ASSUMPTIONS
The operation of a vehicle fleet suffers from disturbances and delays. Hence, data from different days are different due to changes in schedule, late buses, and extra buses added by dispatchers. This may cause the underlying demand to actualize as passenger numbers in later demand segments or demand segments belonging to different lines. To approximate the actual demand, we record late buses in the demand segment they were planned for, as the passengers often wait for the bus or have already boarded it before it became delayed. Extra buses are recorded in the demand segment they serve, as no planned demand segment exists in the data. Small changes in the order of buses caused by delays at bus stops with multiple lines may change which bus a passenger chooses. Modeling this phenomenon is outside the scope of this article.
Next, not all demand segments have enough data for the training, as they serve some edge cases, e.g., the last vehicle trip on a day often follows a slightly different route. We dropped demand segments with insufficient training data for our approach and fall back to providing real-time occupancy data for them. Lastly, we do not directly predict the passenger occupancy for a certain vehicle but rather for a demand segment. Suppose multiple vehicles from the same line travel on a demand segment in a certain interval in the same direction. In that case, it is unclear how the predicted passenger load should be assigned to the vehicles. Here, we decided to distribute the predicted number of passengers proportionally, according to the currently measured passenger occupancy. Ideally, the public transport operator could choose the time interval so that precisely one vehicle travels in each demand segment. In this case, the prediction for that demand segment is the predicted occupancy for this vehicle. However, this is usually not feasible due to schedule changes, additionally dispatched buses, or the risk of creating more demand segments with too few observed values, leading to more discarded data. Still, in our experiments, we could directly translate many demand segments to vehicle occupancy on most days.
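The proportional assignment can be sketched as follows. The function name is hypothetical, and the equal split used when all vehicles are currently empty is an assumption of this example, since the text does not specify that case.

```python
def distribute(predicted_total, current_occupancies):
    """Split a demand-segment prediction across vehicles proportionally."""
    total = sum(current_occupancies)
    n = len(current_occupancies)
    if total == 0:
        # Assumption: fall back to an equal split if all vehicles are empty.
        return [predicted_total / n] * n
    return [predicted_total * occ / total for occ in current_occupancies]


two_buses = distribute(60, [30, 10])  # currently 30 and 10 passengers on board
one_bus = distribute(42, [17])        # a single vehicle gets everything
```

With a single vehicle on the network segment, the demand-segment prediction is directly the predicted vehicle occupancy.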

IV. EVALUATION
We introduce the evaluation methodology, data set, and obtained results in the following. Afterward, we discuss the results of the different steps and their implications.

A. METHODOLOGY
In this evaluation, we evaluate different baseline predictions, the influence of the high-level reduction approaches, and varying clustering algorithms in different real-time settings. First, we compare the different baseline predictors. For that, we trained multiple time series prediction algorithms with the Python libraries scikit-learn [67], statsmodels [68], and PyCaret [69] and selected the best-performing algorithms with their best-found parameters. After choosing a single baseline predictor for our hybrid algorithm, we analyze the influence of the hyperparameters for our novel real-time model. For this, we look closely at the accuracy of the high-level reduction approaches CD and SMW and their performance with different parameters. As our approach requires real-time data, we simulate the time property of our real-time APC data stream by using parts of our historical data as simulated real-time data. Therefore, we split our available data set so that the last interval of the data is not used for the training or cross-validation but acts as APC observations made available to the predictor as time in the simulation progresses. Both approaches allow for varying the same parameters: the clustering algorithm itself, the number of Characteristic Profiles to be extracted, and the distance metric used by the clustering algorithm. After analyzing the model's accuracy in-depth, we also regard our other performance metrics, such as training time and data requirements during the training. Then, we compare our hybrid model to other time series regression algorithms operating in real-time rather than one day ahead and discuss our findings.
The evaluation data set is from a German city and its surroundings, with a population of approximately 200 000 in the metropolitan area. Unfortunately, we can neither name the city nor directly show the transportation network due to privacy concerns with the data. The data was recorded from January 1st, 2019, to February 29th, 2020, thus exclusively featuring data before the outbreak of the COVID-19 pandemic. The approaches are evaluated by comparing the mean squared error (MSE), which is defined as
$$\mathrm{MSE}(x, \hat{x}) := \frac{1}{n} \sum_{q=1}^{n} (x_q - \hat{x}_q)^2,$$
where $n$ is the number of demand segments considered. Symbolically, $x$ is a placeholder for the observed passenger number on a demand segment and $\hat{x}$ for the predicted passenger number on a demand segment (baseline or real-time). Thus, it is possible to analyze the prediction errors for a fixed day $i_0$. We chose MSE because it is a well-known measurement and approximates the real-world cost of the prediction error: If the prediction is off by a few passengers, the transport company will, in most cases, still use the same number of buses. The more the prediction is off, the more passengers are negatively affected by worse overcrowding, and the transport company will increasingly distribute buses non-optimally. For example, the real-world cost of five demand segments being mispredicted by five passengers each is smaller than that of one demand segment being mispredicted by 25 passengers. In contrast to the mean absolute error, MSE correctly measures this.
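The example with five demand segments can be checked numerically. The following sketch uses hypothetical passenger numbers and compares the MSE with the mean absolute error (MAE) for both situations.

```python
def mse(observed, predicted):
    """Mean squared error over paired observations and predictions."""
    return sum((x - y) ** 2 for x, y in zip(observed, predicted)) / len(observed)


def mae(observed, predicted):
    """Mean absolute error over paired observations and predictions."""
    return sum(abs(x - y) for x, y in zip(observed, predicted)) / len(observed)


observed = [50, 50, 50, 50, 50]
spread_error = [45, 45, 45, 45, 45]   # five segments off by five passengers
single_error = [25, 50, 50, 50, 50]   # one segment off by 25 passengers

mse_spread, mae_spread = mse(observed, spread_error), mae(observed, spread_error)
mse_single, mae_single = mse(observed, single_error), mae(observed, single_error)
```

Both situations have an MAE of 5 passengers, whereas the MSE is 25 for the evenly spread error and 125 for the single large error, so only the MSE separates the two cases.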
All figures showing only one network segment or multiple demand segments from one network segment use the same network segment to ensure comparability. As we cannot show the transportation network, we will describe the layout of the transportation network: The network segment is close to the city center, facing inwards. At the target stop, lines from multiple directions merge, and most take a common path toward the central train station. In the given period, the demand segment with the highest recorded passenger number belongs to this network segment. The evaluation mimics the deployment of the algorithm under real-world constraints; as such, the algorithm only uses data available in common ITS software.
After data cleaning, we have 2782 network and 27 994 demand segments in the complete data set. On most weekdays, between 23 000 and 26 000 demand segments are present. The number of demand segments on weekends drops to about 18 000 on Saturdays and about 7500 on Sundays. Within these bounds, noticeably fewer demand segments are registered on weekdays from the beginning of 2020, likely due to changes in the bus schedule.
The data set was divided into three subsets for training and testing: The first data set $T_b$, spanning from 2019-01-01 to 2019-05-11, was used for training the baseline prediction. The second data set $T_{rt}$, spanning from 2019-05-12 to 2019-09-27, was used to calculate deviations from the baseline and train the real-time prediction. The third data set $T_{test}$, spanning from 2019-09-28 to 2020-02-29, was used to evaluate all approaches (both baseline and real-time). We have chosen this validation approach as it best matches the deployment of the algorithm in real-time.

B. RESULTS
For the evaluation, we first want to ensure that all models perform as well as possible. Hence, we first select optimal parameters for the different baseline predictions and choose the best baseline prediction to be combined with our real-time prediction algorithm. Next, we pick the best hyperparameters for the real-time component of our hybrid algorithm. We then examine the performance of our hybrid model in detail. Finally, we compare the results of our model with other time series algorithms and discuss the results.

1) TRAINING OF THE BASELINE PREDICTION
As we train a model for each demand segment in $T_b$, it is infeasible to optimize the hyperparameters for each demand segment separately. Instead, we chose the demand segment $(j, t)$ with the highest recorded passenger number on a single day for the evaluation. We train the models on the time series $(Y_{i,(j,t)})_{i \in T_b}$ and tune the hyperparameters of all baseline prediction approaches on the time series of the values recorded on this demand segment during all days.
In the survey of previous approaches, we found various suitable approaches for the baseline prediction and wanted to examine their accuracy. Here, it is not essential to choose the most accurate predictor, as we improve this baseline prediction later on, but rather to choose an algorithm that works reasonably well. Among other algorithms, SARIMA is computationally efficient and has shown adequate accuracy in the related work; thus, we selected it as our baseline to train for the multi-step prediction. We determined all hyperparameters of the approaches using a grid search, applying the repeated holdout method to split up the training and test sets. Initially, the performance of PyCaret's grid search optimization of SARIMA was worse than what we expected from published research articles. After determining the reason for this, a wrongly assumed non-stationary behavior of the time series, we manually adjusted the hyperparameter tuning process of SARIMA: To find optimal parameters, we first conducted the augmented Dickey-Fuller test [70] and the Canova-Hansen test [71], determining that the time series is stationary in both seasonal and non-seasonal behavior. This sets $d = 0$ and $D = 0$ in SARIMA. Since we have a time series of daily data, we set the seasonal period to $s = 7$, i.e., we expect patterns to repeat weekly. For $p$, $q$, $P$, and $Q$, we performed a full grid search on three sampled train-test set combinations we retrieved from $T_b \cup T_{rt} \cup T_{test}$ using the repeated holdout method. The repeated holdout method is similar to cross-validation in that it trains and evaluates predictors multiple times on different subsets of the available data. However, it differs in the selection process: The test set always follows the training set chronologically, making it necessary to disregard parts of the data in each training or evaluation run. Cerqueira et al. [72] found repeated holdout to produce the best results on real-world time series.
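The repeated holdout method can be sketched as follows. The split fractions, the random seed, and the function name are assumptions of this example and are not values reported in the text; only the number of three repetitions is taken from it.

```python
import random


def repeated_holdout(n, n_reps=3, train_frac=0.6, test_frac=0.1, seed=0):
    """Sample chronological train/test index ranges from a series of length n.

    In every repetition, the test range directly follows the training range,
    and the observations outside both ranges are disregarded.
    """
    rng = random.Random(seed)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    splits = []
    for _ in range(n_reps):
        cut = rng.randint(n_train, n - n_test)  # index of the first test day
        train = range(cut - n_train, cut)
        test = range(cut, cut + n_test)
        splits.append((train, test))
    return splits


splits = repeated_holdout(100)  # e.g., a daily series with 100 observations
```

A grid search would fit each candidate parameter combination on every training range, evaluate it on the following test range, and aggregate the scores over the repetitions.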
We selected the combination of hyperparameters with the lowest corrected Akaike information criterion (AICc) [73], aggregated over all test sets. The result is p = 2, q = 0, P = 1, Q = 1. For the other algorithms, we kept the hyperparameters computed by PyCaret, which also performs a grid search with the repeated holdout method in the same way as we manually applied it to SARIMA. Here, the obtained results were consistent with what we expected and with what the results in the related work suggest. In particular, the issue of the seemingly non-stationary time series does not impact the other algorithms; therefore, we think all hyperparameters are well optimized after our training and validation.
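The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the split fractions, and the choice of summing the AICc values across splits are assumptions, since the text only states that the criterion is aggregated over all test sets. The AICc formula itself is the standard small-sample correction of the AIC.

```python
import random

def repeated_holdout_splits(n_obs, n_splits=3, train_frac=0.6, test_frac=0.1, seed=0):
    """Return (train, test) index ranges; each test window directly follows its training window."""
    rng = random.Random(seed)
    train_len = int(n_obs * train_frac)
    test_len = int(n_obs * test_frac)
    # Latest admissible start so that training and test windows still fit into the series.
    max_start = n_obs - train_len - test_len
    splits = []
    for _ in range(n_splits):
        start = rng.randint(0, max_start)
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        splits.append((train, test))
    return splits

def aicc(log_likelihood, n_params, n_obs):
    """Corrected Akaike information criterion: AIC plus a small-sample penalty."""
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

def select_best(aicc_per_config):
    """Pick the configuration with the lowest AICc summed over all splits (assumed aggregation)."""
    return min(aicc_per_config, key=lambda cfg: sum(aicc_per_config[cfg]))
```

In such a setup, each candidate tuple (p, q, P, Q) would be fitted once per split, its AICc recorded, and `select_best` applied to the resulting dictionary. Data outside the sampled windows is simply not used in that run, which corresponds to the disregarded parts of the data mentioned above.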
After training the algorithms and selecting the best parameters, we compared the results of the five baseline predictors in a day-ahead forecasting mode. Figure 3 shows the prediction accuracy of k-nearest neighbors (K-NN), lasso least angle regression (LassoLars), orthogonal matching pursuit (OMP), passive-aggressive regression (PAR), contextual historical average (CHA), and SARIMA over the time of the day of all demand segments. We included the CHA for comparison, with the day of the week as its context. Interestingly, all baseline predictors show a similar pattern over the day, as their prediction error spikes in the morning and just before 14:00. Our hypothesis for the significant increase in the prediction error is that none of the algorithms can accurately detect the start and end of vacations. This results in a considerable prediction error for times when pupils travel to school. Note that in these spikes, most baseline predictions outperform the CHA. We will examine this pattern further when we analyze the real-time prediction algorithms. Figure 3 shows that PAR performs worst, as it cannot accurately detect weekly effects such as weekday vs. weekend and also cannot efficiently utilize its online learning approach. The other approaches all perform significantly better, but overall, SARIMA performs best. Due to the high performance of SARIMA and the very similar patterns of all evaluated baseline predictors, we select SARIMA as the baseline for our hybrid algorithm for the remainder of the evaluation.
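For readers unfamiliar with the contextual historical average, the following sketch shows one plausible reading of a CHA that uses the day of the week as its context: the forecast for a demand segment is the mean of all past observations recorded on the same weekday. The function names and the input layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

def fit_cha(observations):
    """observations: iterable of (date, passenger_count) pairs for one demand segment."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in observations:
        sums[day.weekday()] += value
        counts[day.weekday()] += 1
    # One average per weekday context (0 = Monday, ..., 6 = Sunday).
    return {weekday: sums[weekday] / counts[weekday] for weekday in sums}

def predict_cha(model, day):
    """Day-ahead forecast: the historical average of the matching weekday."""
    return model[day.weekday()]

history = [
    (date(2024, 1, 1), 10),  # Monday
    (date(2024, 1, 2), 20),  # Tuesday
    (date(2024, 1, 8), 14),  # Monday
]
cha = fit_cha(history)
monday_forecast = predict_cha(cha, date(2024, 1, 15))
```

Because the only context is the weekday, such a predictor has no way of anticipating the start or end of school vacations, which is consistent with the error spikes discussed above.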

2) TRAINING OF THE REAL-TIME PREDICTION
Using the previously determined hyperparameters, we trained all baseline predictions on the data set T b and predicted the values for the times in both T rt and T test . We then calculated the deviations by subtracting the observed passenger numbers for the times in T rt from the predicted values. To find optimal parameters for the reduction in our novel approach, we sampled three train-test set combinations from deviations in T rt ∪ T test using the repeated holdout method from Cerqueira et al. [72]. In the evaluation, when the average prediction error over all three test sets is shown, it is marked by the keyword rep-holdout in the caption. We evaluated several hyperparameter combinations in Table 2 in a grid search. For the final evaluation, we trained the real-time prediction on the deviations in T rt . Finally, we validated our real-time approach with set T test .

(Table caption, beginning lost in extraction) … = 00:15 (rep-holdout). Each configuration includes the distance metric, the reduction approach and the cluster count. The colors indicate the improvement over SARIMA as the best baseline prediction.

Each high-level reduction approach can be configured with the following parameters:
• the cluster count k
• the clustering algorithm: k-means or k-medoids
• with or without principal component analysis (PCA)
• the distance metric for the clustering: L 1 or L 2

We found that most configurations improve the results over the baseline prediction from SARIMA. Apart from the clustering algorithm, the parameter choices only have a small impact on the prediction accuracy. In the following, we discuss the influence of each parameter on its own in detail. We illustrate this with SMW, but the qualitative patterns also hold for the CD approach.
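The training and adjustment steps above can be made concrete with a small, self-contained sketch. It is loosely modelled on the Complete Day idea and is not the authors' code: it uses a plain Lloyd-style k-means with the squared L 2 distance, a naive initialization with the first k days, toy vectors of four intervals instead of the full vector over all demand segments, and hypothetical function names. The sign convention (predicted minus observed) follows the text, so the matched profile is subtracted from the baseline.

```python
def deviations(predicted, observed):
    """Deviation per interval: predicted minus observed, as in the text."""
    return [p - o for p, o in zip(predicted, observed)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(days, k, n_iter=20):
    """Plain Lloyd iterations over daily deviation vectors; returns k cluster centers."""
    centers = [list(day) for day in days[:k]]  # assumption: naive initialization
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for day in days:
            nearest = min(range(k), key=lambda c: sq_dist(day, centers[c]))
            clusters[nearest].append(day)
        centers = [mean_vector(members) if members else centers[i]
                   for i, members in enumerate(clusters)]
    return centers

def match_profile(profiles, observed_prefix):
    """Pick the profile closest to the deviations observed so far today."""
    m = len(observed_prefix)
    return min(profiles, key=lambda p: sq_dist(p[:m], observed_prefix))

def adjust(baseline, profile):
    """Deviation = predicted - observed, so the correction subtracts the profile."""
    return [b - d for b, d in zip(baseline, profile)]

# Toy history: two ordinary days and two days on which the baseline overpredicts.
history = [[0, 1, 0, 0], [20, 30, 1, 10], [1, 0, -1, 0], [22, 28, 0, 12]]
profiles = kmeans(history, k=2)
today_prefix = deviations(predicted=[50, 60], observed=[31, 33])
profile = match_profile(profiles, today_prefix)
adjusted = adjust([50, 60, 30, 40], profile)
```

In this toy run, the two large morning deviations observed so far select the second profile, and the remaining intervals of the day are corrected accordingly. In practice, intervals that have already been observed would not need a forecast; they are adjusted here only to keep the sketch short.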
In many figures, the prediction accuracy of the baseline prediction fluctuates at different times. This is an artifact of the evaluation process: Real-time prediction approaches only provide a prediction for a demand segment once some real-time data for the day comes in. That means that for larger t + , intervals from the early morning are not evaluated. SARIMA was evaluated the same way to provide a meaningful comparison, leading to a varying MSE for SARIMA. Figure 4 shows that k = 3 had the overall best performance, but k = 2 and k = 5 produced similar results. In our experiment, only values for k that are considerably larger than the optimum resulted in a substantially larger prediction error. However, it should be noted that a smaller k improved on SARIMA for higher t + . We hypothesize that this effect comes mainly from the early morning, when little data is available to match the correct profile. With more Characteristic Profiles, it is more likely that noise influences the matching of the Characteristic Profile, leading to a larger prediction error for this time frame. The influence of the distance metric is even less significant, especially for smaller k, as Figure 5 shows. For larger k, L 1 results in a somewhat increased accuracy, especially for CD (cf. Table 2). Overall, the algorithm is robust to sub-optimal parameter values. This robustness is especially important for k, as it is a quantitative parameter and much easier to optimize if the optimal value does not need to be found precisely for a good result. Regarding the qualitative parameters, L 1 and L 2 each performed better by small margins in different configurations. We did not find any particular reason for this behavior.
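The evaluation artifact described at the start of this paragraph can be illustrated with a short sketch. The helper names and the index-based representation of intervals are assumptions made for illustration: a forecast with a horizon of `horizon` intervals can only exist for intervals that lie at least that far after the first real-time observation of the day, and the baseline is scored on the same subset.

```python
def evaluated_intervals(n_intervals, first_observed, horizon):
    """Intervals for which a real-time forecast with the given horizon can exist."""
    return range(first_observed + horizon, n_intervals)

def mse(predicted, observed, intervals):
    errors = [(predicted[i] - observed[i]) ** 2 for i in intervals]
    return sum(errors) / len(errors)

# Toy day with six intervals; the baseline is only wrong in the second interval.
observed = [10, 10, 10, 10, 10, 10]
baseline = [10, 16, 10, 10, 10, 10]

mse_short = mse(baseline, observed, evaluated_intervals(6, first_observed=0, horizon=1))
mse_long = mse(baseline, observed, evaluated_intervals(6, first_observed=0, horizon=2))
```

Although the baseline forecast itself does not depend on the horizon, its reported error does, because the early interval with the large error drops out of the evaluated set for the longer horizon.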
Among the clustering algorithms, k-means produced the most accurate results, as shown in Figure 6. Compared to k-means, k-medoids leads to worse prediction accuracy, but it still improves the baseline prediction. The in-sample prediction error, measured as MSE, of k-means (28.5) and k-medoids (30.1) with k = 3 shows that k-medoids produces slightly worse centroids. We hypothesize that this effect occurs because of our preprocessing: The dimensionality of the data is much larger than the number of available data points, as each demand segment increases the dimensionality of our data. K-medoids chooses the most central cluster member as its centroid, which then becomes the Characteristic Profile. Since the number of demand segments is significantly higher than the number of days clustered, it is almost inevitable that the medoid Characteristic Profile deviates from sensible values in some dimensions and, therefore, in some demand segments. K-means, in contrast, results in the arithmetic mean for each component, eliminating such values that occur because k-medoids is forced to select one of the measurements as a cluster center. Therefore, even if the Characteristic Profile is matched correctly, k-medoids results in larger deviations between the cluster center and its members, reducing the prediction accuracy.
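The argument about cluster centers can be checked on a toy cluster with more dimensions than members. The sketch below is illustrative only: it uses the squared L 2 distance for both centers, whereas the evaluation also considers L 1, and the data is made up so that each day carries an outlier in a different dimension.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_center(members):
    """k-means style center: component-wise arithmetic mean."""
    return [sum(column) / len(members) for column in zip(*members)]

def medoid_center(members):
    """k-medoids style center: the member with the smallest total distance to all others."""
    return min(members, key=lambda m: sum(sq_dist(m, other) for other in members))

def within_cluster_mse(center, members):
    return sum(sq_dist(center, m) for m in members) / len(members)

# Three days, six dimensions; each day has an outlier in a different dimension.
cluster = [
    [4, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 0],
]
mse_mean = within_cluster_mse(mean_center(cluster), cluster)
mse_medoid = within_cluster_mse(medoid_center(cluster), cluster)
```

The medoid necessarily inherits the outlier of the member it selects, while the mean spreads it across the cluster. Under the squared L 2 distance, the mean always minimizes the within-cluster error, so the medoid can at best match it.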
This hypothesis is further supported by the fact that k-medoids with PCA leads to a prediction accuracy much closer to k-means. By using PCA, the dimensionality is reduced, and therefore choosing a cluster member as the centroid leads to smaller deviations: To retrieve a Characteristic Profile, the centroids are reversely transformed by PCA, only retaining the principal components, which excludes small deviations. We hypothesized that reducing the dimensionality with PCA improves the accuracy, as our vector embedding of the data with demand segments results in a high-dimensional vector. While PCA was also beneficial for k-means, it is unclear whether this effect primarily came from the complexity reduction for clustering or from the reduction in smaller deviations resulting from the reverse transformation of cluster centers. This concludes the examination of the hyperparameters for the real-time approach, and we continue analyzing the model in more detail in different scenarios.

3) DETAILED ANALYSIS OF THE ALGORITHM'S PERFORMANCE
We use only the best configurations in this section according to the previous comparison. This is k-means with L 1 and k = 15 for CD and L 2 and k = 3 for SMW. Some differences will not be inherent to the approach but result from different values for k.
In many cases, SARIMA already produces a reasonably accurate prediction, as shown in Figure 7 for an example day and demand segment. In the figure, we compare the normalized number of observed passengers against the normalized number forecasted by SARIMA. Figure 8 compares the reduction approaches, showing that they perform relatively similarly for short-term predictions, with both reducing the prediction error.
A more in-depth study of the prediction error on each day reveals that school vacations reduce the prediction accuracy  of the baseline prediction, as shown in Figure 9. On vacation days, the real-time predictions are significantly more accurate than the baseline prediction, while on non-vacation days, there are smaller differences in overall prediction error. This behavior can be explained by the way SARIMA works: As there is no information about vacations and pupils suddenly change their behavior, i.e., they stop going to school all at once, SARIMA cannot possibly predict this. Since the changes in school days are not as drastic, we hypothesize that the vacation pattern is the most significant pattern accurately discovered by the real-time approaches. This behavior highlights the major advantage of our approach: Without explicit knowledge of vacation times, the real-time component adjusts the baseline prediction in such a way that the effects of vacation times are regarded. Figure 10 supports this hypothesis: The early morning and afternoon are when pupils typically go to school or leave, respectively. The fact that in-between, the baseline prediction has a relatively small prediction error fits the hypothesis because pupils are usually at school during this time. Conversely, if the spikes were mainly the result of a misprediction of adults going to and from work, we would expect the second spike during the late afternoon, not around 13:00. In conclusion, the vacation pattern is likely the most significant relevant pattern in the data leading to high prediction errors for SARIMA that are then reduced by the real-time adjustments. This likely also holds for the other baseline algorithms from Figure 3, as they exhibit a very similar error pattern.
We take a closer look at the prediction error on average over all vacation days in Figure 11. Here, the real-time predictions show a significantly better performance. Under the assumption that the data set contains no other significant patterns not modeled by SARIMA, our approach included all essential patterns. This suggests that the real-time prediction approach could produce predictions that are an even larger improvement over SARIMA on data sets with more patterns. While the MSE over all network segments gives a good high-level overview of the prediction accuracy, it does not support conclusions about the nature of the prediction error. A given MSE could be caused by a few segments with large prediction errors while other segments have almost no error, or it could be caused by a more evenly distributed smaller prediction error. The evaluation suggests that for our data set, large fractions of the MSE are caused by a few demand segments: In the city the data set was taken from, many bus lines end in comparatively rural areas with few inhabitants and then enter the city. Naturally, there are relatively few passengers in the outskirts, as most passengers, who live in the city or its periphery, only use the parts of the line passing through the city. With few passengers, however, the absolute fluctuations and, thus, the prediction error are much smaller than in the city itself. Figure 10 shows that the prediction error is also concentrated in a small proportion of time intervals.
To complement the high-level analysis with the MSE, Figure 12 shows the development of predicted and recorded passenger numbers during a school vacation day. We chose the network segment because the highest recorded number of passengers in the data set was on a demand segment belonging to this network segment. It is oriented inwards toward the city center, and at the end stop, multiple bus lines coalesce. In Figure 12, the fluctuations from one interval to the next stem from a mismatch between the bus frequency of 10 minutes and the interval size of 15 minutes, meaning every second demand segment records two buses, not just one. Apart from that, the plot shows that our novel real-time prediction approach with both reduction approaches correctly predicted that SARIMA would make a large prediction error during the morning spike, as both lines are significantly closer to the observed green line. At other times of day, the baseline prediction is already quite accurate; hence, the real-time adjustments are relatively small. In practice, only the demand spike during the morning would warrant a change in the assignment of buses, and the novel prediction approach corrected that spike.
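The alternating pattern caused by the mismatch between headway and interval size can be reproduced with a few lines of arithmetic. The helper below is a hypothetical illustration that assumes the first bus departs at minute 0 and that each bus is assigned to the interval containing its departure time.

```python
def buses_per_interval(headway_min, interval_min, horizon_min):
    """Count departures per aggregation interval, with the first departure at minute 0."""
    counts = [0] * (horizon_min // interval_min)
    for departure in range(0, horizon_min, headway_min):
        counts[departure // interval_min] += 1
    return counts

# Ten-minute headway aggregated into 15-minute intervals over one hour.
hourly_pattern = buses_per_interval(headway_min=10, interval_min=15, horizon_min=60)
```

With these assumptions, the intervals alternate between two buses and one bus, which is the source of the sawtooth shape in the aggregated passenger numbers even when the load per bus is constant.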
Next, we want to look at the Characteristic Profiles detected by the two high-level approaches. Recall that both of the high-level approaches were derived from a hypothesis about the nature of patterns in the data set: 1) CD postulates that significant patterns span the entire day, and 2) SMW assumes that patterns during different times of day exist separately.
Our analysis of the Characteristic Profiles shows which patterns were detected, which will either support or contradict the respective hypotheses. Again, the network segment with the highest passenger numbers is used. Figure 13 shows the profiles for Complete Day, which mostly contains the larger error in the early morning in both directions. Note that, like in Figure 12, the spike in the early afternoon is not visible: Since the network segments contain the direction of travel, passengers that travel into the city in the morning are recorded here, while the other direction in the afternoon is recorded in a different demand segment. When introducing CD, we hypothesized that it would be able to predict later effects of patterns that span the entire day early. Figure 14 shows that the prediction for 13:00 is already improved with data from the early morning, supporting both this hypothesis and the implicit hypothesis that these patterns exist in real-world data.
In the early morning, Separate Moving Window detects profiles similar to the same timespan of profiles in CD, as seen in Figure 15. During other times of day, the overall accuracy, as observed in Figure 10, is very similar to SARIMA, which is mirrored in Characteristic Profiles that only have small absolute deviations. While this means that the experiments we conducted do not support the hypothesis that SMW is more suited for smaller patterns, this can also be explained by the data set: Firstly, the overall prediction error for SARIMA is small during the late afternoon. Secondly, SMW can predict patterns in intervals where SARIMA had a large prediction error. Lastly, in the city where the data was recorded, public transport usage is relatively low compared to larger cities. In conclusion, SMW does not have an advantage in accuracy over CD on this data set. We suspect that the number of relevant patterns in the data set is small enough to be detected by CD, and SMW requires more patterns to show its advantage. This is supported by the fact that SMW has been shown to provide a similar prediction accuracy as CD while not detecting significant patterns in most moving windows, suggesting that it would be able to detect more patterns in a different data set, at least if more moving windows contain such patterns.
Finally, Table 3 shows the time taken for training and predicting a single instance for the baseline and each high-level approach. Regarding the time demand during training, the CD and SMW approaches performed similarly when we consider that they both rely on SARIMA or the CHA, i.e., we also have to take the training time of the baseline into account. With 70 minutes of training time, it is feasible to retrain the SARIMA baseline and recluster the deviations daily, allowing the approach to react quickly to changes in the transportation network. If a faster combination of algorithms for the baseline and the real-time approach is required, one could also combine the CHA and the CD approach to reach an overall training time of 12 seconds at the cost of reduced performance, as we will see in the next section. Nevertheless, this also highlights the versatility and the practical applicability of our approach, as baseline and real-time models can be combined in such a way that they fit well to the respective use cases.

4) COMPARISON WITH OTHER PREDICTION APPROACHES
Next, we compare the accuracy of our model to univariate and multivariate models from the literature. ANN-based predictions have shown excellent accuracy as multivariate models; as such, we implemented a CNN-based and an LSTM-based prediction. We transformed the data into a multivariate time series over all network segments. The new time series has 2782 components and an entry for every 15-minute interval on every day. The LSTM is configured with one dense layer with 3000 neurons, one LSTM layer with 1000 units, a second dense layer with 1500 neurons, and an output layer. For each network segment, the CNN takes three time series with data from a two-hour window as its input: the most recent, one day before the prediction target, and one week before the prediction target. After two convolutional layers with 32 and 8 filters, the CNN consists of one dense layer with 1000 neurons and the output layer. As Figure 16 shows, both CNN and LSTM outperform our approach for small t + . However, for larger t + , our approach is more accurate.
For practical applicability, an approach should be easy to configure. To achieve good prediction accuracy for CNN and LSTM, we started with the configurations used for a related problem by Zhang et al. [61]. However, since these produced very inaccurate results on our dataset, we had to experiment with adding, removing, replacing, and changing the size of multiple layers in the neural network to find a configuration that worked well. In contrast, our two-step approach produces a similar accuracy for many different configurations, as shown in Table 2. While finding the optimal configuration may still be challenging, it is easy to find a good one.
Additionally, we inspected how the available training data impacts prediction accuracy. We selected training data sets with 6, 14, 42, and 120 days. Since we already know that school vacations form a very relevant pattern, we ensured that each data set contains both school and vacation days. To train our approach with 42 or fewer days, we replaced SARIMA with the CHA as a baseline and set k = 3 because SARIMA cannot be trained sensibly with so few data points. For 120 and the full data set of 270 days, we used SARIMA and k = 15 like before. Figure 17 shows that for 120 days, our approach has a slight advantage, which increases for even smaller data sets. While for very small data sets, the accuracy is highly dependent on the days in the data set, our hybrid approach is still more accurate than the CHA, showing that it can be trained on little data if necessary. In summary, if enough training data is available, neural network approaches also work well for predicting the multivariate time series.

Next, we compare the implemented hybrid approach to other univariate time series prediction algorithms from the literature. The hyperparameters were determined with a grid search using PyCaret with cross-validation and fixed for all network segments. We trained univariate models for each network segment, using only data from the network segment for the prediction. For this, we included the algorithms K-NN, Decision Trees, Random Forests, basic Gradient Boosting, and LightGBM. We chose these algorithms, as the literature review showed that decision trees and related algorithms are particularly well-suited for forecasting passenger occupancy [43]. To reduce the training and prediction effort, we trained the other real-time approaches on a reduced data set, which only includes the two most traveled network segments, (32, from, to) and (32, to, from).
We added the latter to ensure both spikes in prediction error are represented, as the former does not include the afternoon spike. Reducing the number of network segments is common practice in the literature, as often the algorithms do not scale to whole transportation networks. Figure 18 compares the accuracy of all models we evaluated. The real-time and the day-ahead approaches have a significant error on this data set as it is the network segment with the largest number of passengers and fluctuations, making it more difficult to predict the correct number of passengers compared to the overall data set. We see that all real-time time series approaches have large prediction errors for medium t + , but substantially improve later. As noted earlier, we suspect this stems from the increased complexity of forecasting all data points for a network segment with a single model. This model has to predict daily fluctuations, weekly fluctuations, and the alternating pattern caused by the different number of buses in adjacent intervals due to our demand segment aggregation. This is consistent with the fact that even SARIMA outperformed the real-time approaches as it is trained separately per demand segment, reducing the number of fluctuations it has to predict. Our evaluation showed that Random Forests and its boosted variants, basic Gradient Boosting and LightGBM, have the highest accuracy for models used in real-time.
Regarding the error distribution over the day, all real-time approaches show a similar pattern: There is a huge spike of the MSE in the morning, similar to our approaches, but significantly larger. The error is then reduced over the day until rising again at around 13:00 and 16:00. As we performed a grid search and cross-validated all approaches, we think it is unlikely that configurations with a much better performance exist. Given the closeness of SARIMA and the real-time approaches for t + = 00:15, an improved configuration may outperform SARIMA. Nevertheless, this becomes more unlikely for greater t + , meaning our approach still has an advantage. In summary, our analysis showed that univariate time series models have to predict too many fluctuations at once and, therefore, have a very large error for t + .

C. DISCUSSION
As can be seen in Figures 9 and 10, SARIMA provides an accurate prediction for most days. Figure 7 shows an example prediction from a busy network segment, demonstrating SARIMA's ability to make accurate predictions. Since our baseline provides a good prediction, the real-time prediction approaches can concentrate on patterns that the baseline cannot predict; hence the comparison with SARIMA as a baseline provides meaningful insights into these patterns and the efficacy of the real-time approaches. SARIMA as a baseline allowed our real-time approach to detect patterns in the data apart from the daily and weekly seasonal effects. Especially in comparison with univariate time series algorithms employed in real-time, there is a significant gap in prediction accuracy. Our evaluation of the multivariate ANNs shows that under optimal conditions, they outperform our hybrid approach. However, our approach has advantages, apart from accuracy, over ANN, which we will discuss in this section.
CD achieved good prediction accuracy on the given data set and could predict later events with a significant lead time. SMW achieved a similar accuracy with the inherent drawback of shorter lead times. However, the data set likely does not contain significant uncorrelated patterns, where SMW would have an advantage over CD. In contrast to related work, the data set does not appear to contain many detectable patterns. Noursalehi et al. [46] examined the London metropolitan area and found several distinct patterns. Compared to our data set originating from a smaller German city, there are only comparatively few public transport trips per capita. This shows that SMW has few advantages in this data set. In particular, the concentration of patterns in a few intervals in the morning and afternoon prevents an analysis of the strengths hypothesized for SMW. Furthermore, both CD and SMW are very robust to slightly sub-optimal parameter values. Given the similarity in accuracy for different configurations and that accuracy declined for a further increase in the number of Characteristic Profiles, the hyperparameters for both approaches are likely near the optimum. Overall, the accuracy of our real-time prediction is good, and the real-time adjustments with CD and SMW are sensible additions to classical machine learning algorithms purely using historical data.
Regarding the data demand, we observed that our approach can outperform the CHA with just a few days, although the precise selection of the data set has a large influence on the prediction accuracy. With 6 weeks of data, it stabilizes its accuracy with only small improvements provided by further data. Since SARIMA is not yet accurate with 6 weeks of training data, we used the CHA as a baseline in these cases. In contrast, the ANN-based methods started to outperform the CHA for 120 days. They were only more accurate than our hybrid approach when trained on the complete data set.
This shows that the clustering of the deviations in the second step only requires enough data points of each pattern to allow a cluster to be formed. In a real-world application, this would be especially useful for less frequent patterns like vacations: If the initial data set does not contain vacations, frequent retraining allows our approach to pick up the vacation pattern quickly when the first vacations are recorded. As we combine historical and real-time forecasts, the prediction for imminent events will be much better than with algorithms that purely predict based on historical data, even if not much historical data was available in the first place. We argue that our algorithm strikes a good balance between prediction accuracy, on the one hand, and training complexity with a low data demand and low computational complexity, on the other hand. Furthermore, as the baseline prediction can be replaced, as we showed, it can be easily adapted to improve upon an already existing baseline prediction in practice. This exchange of the baseline also allows the creation of a two-step prediction algorithm that is tailored to the use case; for example, Table 3 showed that our approach with CHA as baseline and CD as the real-time component would only require 12 seconds of training time while sacrificing a bit of accuracy, as seen in Figure 16.
Scalability describes how well the algorithm is suited to producing forecasts for a complete transportation network. Often, other approaches are trained as univariate models on single lines and are unable to detect patterns that span multiple lines, as they lack scalability. Our baseline algorithm is trained individually on each demand segment, whereas the other algorithms are trained on a time series over each network segment. Among the further algorithms we tested, the ANNs have shown the highest scalability, as they can be trained on a multivariate time series, detecting patterns spanning multiple lines. As Table 3 shows, our algorithm's training is fast enough to be applied daily, and even on significantly larger data sets, it is likely to still work sufficiently fast. Compared to the other time series algorithms, our reduction step ensures that real-time patterns are detected across demand segments. Hence, our algorithm and ANNs are highly scalable, as we can apply both without difficulties to the whole transportation network.
Next, flexibility describes how well the algorithm can adapt to slight timetable or route adjustments by the public transportation provider or to changed customer behavior. Either the algorithm can directly compensate for these changes, or the data and time demand during training is so low that frequent retraining is possible. Our approach does not require much time or data during training; hence, daily or weekly retraining is feasible, even if the transportation schedule has been significantly altered. It is an open question how a neural network model reacts to significant changes to the transportation system that have not been observed so far, such as an overhaul of the transportation schedule. Furthermore, the real-time component of our algorithm ensures that even without frequent retraining, the model can compensate for changes better than the other time series algorithms that only train on historical data.

Lastly, we regard the interpretability of the models' results. For our algorithm, the analysis of the deviations of each distinct cluster in the evaluation showed that the results of our algorithm are highly interpretable by domain experts. Experts could label each cluster of deviations with, for example, vacation or school traffic deviation. Such labels could then help the dispatchers of public transportation companies. Other machine learning approaches often lack this interpretability of the results, as they work as black boxes or require additional steps to enable an understanding of a model's output.

Even though our algorithm is less accurate than state-of-the-art neural network approaches, its accuracy remains reasonable in comparison and is combined with low data and time demand, high scalability, high flexibility, and high interpretability. Thus, in our opinion, we have reached our initial goal of filling the gap of a forecasting algorithm that balances accuracy and complexity well enough to be implemented in RTCI systems.

V. CONCLUSION AND FUTURE WORK
In this paper, we motivated the need for an algorithm for the short-term real-time prediction of passenger numbers on demand segments with low computational complexity. We highlighted the gap in the literature regarding efficient and accurate passenger occupancy algorithms that can be integrated well into current RTCI systems. During the design of the proposed algorithm, we ensured that the algorithm works with common APC data and only used data usually available in ITS. As such, we have designed a two-step algorithm that makes the best use of the available data. First, the algorithm generates a baseline prediction with a day-ahead time series model based on past observations in the demand segments and generates Characteristic Profiles. Secondly, the algorithm adjusts this prediction with real-time information on the demand segment at the forecast time. This two-step approach facilitates the implementation of the proposed algorithms directly in ITS.
Our evaluation highlights the proposed algorithm's accuracy and efficiency and shows its practical applicability in ITS. In our evaluation, SARIMA was the most accurate time series algorithm that is also efficient, as it regarded the seasonality of public transportation network load. Compared to the baseline prediction with SARIMA, real-time data reduced the error by around 8.3 percent while only marginally increasing the training time. These results are comparable to results from other researchers that combined multiple time series models with real-time data [36], [41]. Especially for effects such as the start of school vacations, the algorithm significantly increased the accuracy over the baseline prediction by up to 24.8 percent. The detection of significant patterns emphasizes the algorithm's ability to learn deviations in average passenger numbers without external data such as weather, vacations, or other special events. The algorithm is also relatively insensitive to fine-tuning, simplifying the practical application of the algorithm in future products. Furthermore, the algorithm's training and prediction time are negligible, allowing for daily retraining and ensuring high flexibility as accurate data is always available. The low training and prediction time also ensures low hardware requirements so that the algorithm can easily be deployed in existing systems and the high scalability of the algorithm as it easily scales to the whole transportation network. We also showed that the Characteristic Profiles can be analyzed to explain the real-time prediction. In this case, we found that the approach mainly detected vacations. If transport companies label their data, they can even show this explanation to dispatchers as additional information, either increasing the confidence in the prediction if the label is deemed correct or enabling the dispatcher to revise the prediction if the respective event did not or will not occur on the current day. 
Overall, the evaluation shows, using SARIMA as an example, that it is feasible to improve a baseline prediction by applying deviations to the regularly expected passenger numbers in the demand segment.
While ANNs outperform our approach when trained on the full data set, our approach is still preferable in many cases. First, its predictions made more than 30 minutes ahead are more accurate. Second, when trained on less data, our approach also outperforms ANNs, especially if only a few weeks of training data are available. Lastly, our approach is much easier to explain, which is an additional benefit for practical application.
For future work, we are primarily interested in evaluating the algorithm with different data sets and more complex baseline predictors. Although the data set used in this evaluation spans more than one year, it has several limitations. Many vehicles are seldom completely occupied, so the effects might differ in more densely built cities. Passenger numbers are strongly influenced by students commuting to school, with an overall relatively low number of passengers per capita. Finally, not all demand segments had valid data points, as not all vehicles were equipped with an APC system. Testing the algorithm in larger metropolitan areas, such as large capitals, will likely lead to different Characteristic Profiles caused by other demographics. For example, Noursalehi et al. [46] manually identified patterns in the transportation data; it would be interesting to perform our analysis on the same data set to see which patterns can be extracted automatically as Characteristic Profiles. Regarding data sets, future work should publish passenger occupancy data sets so that different algorithms can be tested on the same data, significantly increasing the comparability of approaches.
We set out to design a practical, implementable algorithm by combining algorithms of average accuracy but high efficiency with a live APC data feed. The resulting algorithm uses data readily available in ITS and shows high accuracy, stability, flexibility, and explainability, emphasizing its practical applicability.