NAPC: A Neural Algorithm for Automated Passenger Counting in Public Transport on a Privacy-Friendly Dataset

Real-time load information in public transport is of high importance for both passengers and service providers. Neural algorithms have shown a high performance on various object counting tasks and play a continually growing methodological role in developing automated passenger counting systems. However, the publication of public-space video footage often conflicts with legal and ethical obligations to protect the passengers' privacy. This work proposes an end-to-end Long Short-Term Memory network with a problem-adapted cost function that learned to count boarding and alighting passengers on a publicly available, comprehensive dataset of approx. 13,000 manually annotated low-resolution 3D LiDAR video recordings (depth information only) from the doorways of a regional train. These depth recordings do not allow the identification of single individuals. For each door opening phase, the trained models predict the correct passenger count (ranging from 0 to 67) in approx. 96% of cases for boarding and alighting, respectively. Repeated training with different training and validation sets confirms the independence of this result from a specific test set.


I. INTRODUCTION
The day-to-day operational management of transport systems relies on large networks of sensors, actuators, and software to provide passengers with safe, reliable, and affordable means of transportation. Improving on these goals is not only of scholarly concern: the provision of accessible and sustainable public transport systems by 2030 is one of the targets within the United Nations' Sustainable Development Goals framework [1]. Indeed, the efficiency and quality of service in public transportation are related to the quality of the living conditions of many citizens.
An essential task in the management of transport systems is the monitoring of per-vehicle load information. It provides a valuable service for passengers to select a comfortable itinerary, service providers to schedule their vehicle fleet or share revenues within a transportation network operated by several independent companies, and policymakers to identify structurally relevant transportation routes.
The history of automated passenger counting (APC) goes back to the mid-1970s [2] and links to other disciplines such as people detection, crowd density estimation, and people flow counting. The fast development of hardware and machine learning techniques made it possible for neural networks to be successfully applied in various tasks related to passenger counting. Stewart et al. [3] introduced an end-to-end people detection algorithm for images of crowded scenes using a recurrent neural network with Long Short-Term Memory (LSTM) units [4]. In this algorithm, a convolutional neural network (CNN) first encodes each image as a 15 × 20 × 1024-dimensional high-level descriptor matrix, which is subsequently decoded by an LSTM network into a variable-length sequence of bounding boxes and corresponding confidence values that a previously undetected person is present within the respective image region. A confidence threshold is used to terminate the detection process [3].

Manuscript received August 30, 2021; revised November 12, 2021; accepted December 20, 2021. This work was funded by a grant from the European Regional Development Fund (grant number 10167006) to K. Obermayer. We acknowledge support by the German Research Foundation and the Open Access Publication Fund of TU Berlin. Asterisk indicates that the first two authors contributed equally to this work. The dagger symbol indicates the corresponding author (thomas.goerttler@tuberlin.de). All authors are with the Technische Universität Berlin, Institute of Software Engineering and Theoretical Computer Science, Neural Information Processing Group, Straße des 17. Juni 135, 10623 Berlin, Germany.

The authors wish to thank Michael Siebert and Jan Sablatnig (Interautomation Deutschland GmbH, Berlin) for providing domain expertise, tools, and the 3D LiDAR (ToF) recordings. The authors wish to thank Dimitra Zarafeta and the many labeling assistants for their help viewing and annotating the video material.
For RGB-D based human detection, the multi-glimpse LSTM in [5] and an asymmetric adaptive fusion two-stream network (AAFTS-net, [6]) were proposed. For real-time people detection in top-view depth images from video surveillance systems, WatchNet and its extension WatchNet++ were presented in [7] and [8]. WatchNet consists of a feature extraction module and a series of prediction stages that sequentially refine the prediction maps for human body landmarks (head and shoulders) and is trained with artificial and real depth data for people detection. The neural object detection method YOLO (You Only Look Once, [9]) was used, among other applications, for person detection from an overhead view [10], for pedestrian detection [11], and for person detection in thermal images [12]. The neural network based object detection method SSD (Single Shot multi-box Detector, [13]) is deployed for top-view people detection and counting [14]. Wang et al. introduced an end-to-end deep CNN regression model for counting people in extremely dense crowds [15]. To reduce false positives, they intentionally included potentially confusing negative samples such as lush trees, buildings, and some natural scene images in the training data. Following the work of [15], CNNs have been widely applied in human detection or feature extraction for people counting research [16]. In the context of detecting humans, CNNs were deployed for head-shoulder detection in crowd images [17], for predicting density maps on a given crowd image [18], for human detection in nighttime images obtained by a visible light camera [19], and for passenger recognition in passenger flow monitoring systems [20]. CNNs were also used for feature extraction. Gao et al. combined CNN-based feature extraction with Adaboost in an algorithm for counting people in crowded surveillance environments based on head detection [21].
In contrast, [22] used CNN-autoencoder feature extraction in an algorithm estimating passenger occupancy in crowds of passengers on a bus. A region-based CNN (R-CNN [23]) was deployed in a variety of human detection tasks, e.g. in a crowd [24], in complex scenes [25], and in drone imagery [26].
Wilie et al. introduced an end-to-end people counting algorithm for 2D crowd images by taking a pre-trained Xception (Extreme Inception, [28]) network and adding a fully connected network on top of it [27]. A class of end-to-end architectures called long-term recurrent convolutional neural network (LRCN) was proposed by Donahue et al. for visual recognition and description in video data [29]. LRCN is constructed by combining a CNN and an LSTM network, using variable-length inputs and generating variable-length representations. Massa et al. proposed a regression model called LRCN-RetailNet for counting people in videos captured by low-cost surveillance cameras in retail stores [30]. The model is trained on sequences of a fixed number of temporally contiguous images, where the number of images is a hyperparameter (in their experiments, the three values 5, 9, and 12 are used). Each sequence is annotated with the number of people in the last image of the sequence. LRCN-RetailNet predicts the number of people at the time of the last frame of the video [30].
Recently, neural networks were also applied to counting boarding and alighting passengers on video data in public buses. Liu et al. used a CNN for passenger detection and the spatio-temporal context model for passenger tracking and achieved a passenger counting accuracy of 93%; this APC counted 108 of 116 passengers in the bus transportation scene [31]. In [32], a two-class SSD and a Kalman filter are used for detecting passengers and tracking their movements. The authors conducted experiments on a bus monitoring video with seven segments of differing characteristics, such as dark or strong outside light, crowding while boarding or alighting, and passengers carrying babies or children. Their APC counted all 28 boarding passengers and 79 of 81 alighting passengers, resulting in an accuracy of 98%. Furthermore, they conducted several controlled pedestrian experiments in the laboratory, including single walking, two people walking together, five to six people walking, cross walking, and squatting. For these experiments, their APC counted 64 of 68 "in" passengers and 54 of 61 "out" passengers, resulting in an accuracy of 91%.
Sun et al. introduced a method for generating depth video streams from RGB-D videos obtained by a camera mounted on top of the door area of three different buses. They propose a boarding and alighting passenger counting method that combines a two-step head detection (generating and refining head proposals) with a tracking algorithm for the generated depth video samples [33]. In contrast to [31] and [32], the APC in [33] was tested on a large dataset of 2,000 videos with four sub-categories, which were defined based on the noise level (strong/mild sunlight) and the crowdedness (crowded/uncrowded) of the scene. The performance for the different sub-categories ranged from 72.3% to 85.4% for boarding and from 91.3% to 93.7% for alighting passengers.
The three approaches [31]- [33] for counting boarding and alighting passengers from image sequences perform person detection and tracking with two different modules and subsequently combine both results for the counting. In our approach, we consider the architecturally more straightforward approach where a single neural network solves the detection, tracking, and counting problems simultaneously in an end-to-end learning fashion.
The development of information technology to collect data through cameras and sensors gives rise to significant privacy risks [34], [35]. Privacy issues become increasingly critical as vast amounts of images and video sequences are collected and analyzed. Various privacy protection solutions for visual recognition were proposed, e.g., ad hoc de-identification of face images [36], different methods to hide distinguishing facial information [37], region-of-interest (ROI) based transform-domain or codestream-domain scrambling [38], face morphing [39], distortion-based visual privacy filters [40], a degradation transform for the original video inputs [41], and a video face anonymizer [42].
To assure a high level of passenger anonymity when counting people in sequences of color images, Skrabanek et al. used a camera set-up for capturing images from a top-down (orthogonal) view [20]. Low-resolution depth images obtained by RGB-D sensors are used for privacy-preserving human pose estimation in [43] and for head detection in the task of counting boarding and alighting passengers [33]. Top-view depth images from a video surveillance system are employed for detecting people [7], as well as people committing attacks and intrusions [8].

A. Contributions
Our work introduces an end-to-end algorithm for a real-time automated boarding and alighting passenger counting system called "Neural Automated Passenger Counter" (NAPC), based on an LSTM recurrent neural network. We present an LSTM network with a tailored cost function, which is trained on a large dataset of 3D LiDAR video recordings of individual door openings (hereafter called 'sequences', see Fig. 1) to automate passenger counting (Section III). It achieves high performance in a series of counting experiments (Section IV): On average, the algorithm obtains an exact count in 96 out of 100 sequences for both boarding and alighting passengers, and the magnitude of the error made in the miscounted sequences is small. The algorithm miscounts only 1.46% of boarding and only 1.11% of alighting passengers (see Section IV). This performance is superior to the performance reported for three recently proposed methods for counting boarding and alighting passengers [31]-[33] (Section V).
Counting experiments were conducted on a large set of manually labeled videos recorded with 3D LiDAR cameras that are installed over the doorways of an eight-door German regional train in a top-down/high-angle perspective (Section II). At least three independent labeling assistants manually annotated the events (that is, boarding and alighting passengers) in each sequence using additional grayscale video recordings (320 × 240 pixels at 10 frames per second). If no consensus upon the correct annotation (that is, upon the total number of events and upon their timestamps, up to an intra-labeler standard deviation of 2 seconds) was reached, the number of viewers was increased up to seven before an administrator's decision upon the correct annotation of that sequence was made. This approach allowed many challenging sequences, whose annotation was difficult even for human viewers, to be retained rather than rejected.
The NAPC system accepts sequences with hundreds of frames, whose lengths are variable and only determined by the duration of a realistic door opening phase and the recording frame rate. It reliably predicts passenger counts from low-resolution depth information with just 20 × 25 pixels (see Section IV), making our NAPC a privacy-aware passenger counting system. Learning is accomplished in an end-to-end manner: no background modeling, head detection, or trajectory tracking is required, and counting is realized through a single neural network architecture. A similar approach to our method is LRCN-RetailNet [30] for counting people in a retail store. Both approaches, NAPC and LRCN-RetailNet, treat the people counting task as a regression problem. While LRCN-RetailNet takes as input a fixed number (5, 9, or 12) of frames from RGBP video sequences, obtained by combining color information and extracted foreground (people) information, NAPC takes as input variable-length video sequences recorded with 3D LiDAR cameras. Whereas LRCN-RetailNet predicts the occupancy of the store, our method predicts the number of boarding and alighting passengers during a door opening phase. To the best of our knowledge, our method is the first end-to-end recurrent neural learning algorithm for a boarding and alighting passenger counting system based on 3D LiDAR video recordings.
A tailored cost function and data augmentation strategies (such as mirroring or backward-playing of videos, see Section III) are used to maximize the information extracted from a given training set such that the number of required training videos can be minimized. Approx. 2,000 sequences are already sufficient to train an NAPC network from random initialization to high accuracy (see Section IV-G). In our setting, this amounts to six days of data collection, which underpins the practical significance of our approach.

II. THE Berlin-APC DATASET
For the evaluation of our system, we employ a large-scale dataset of APC-relevant image sequences (Berlin-APC Dataset, [44]). It consists of 12,956 sequences with a shape of t × 20 × 25, where t denotes each sequence's variable number of frames. Note that only 3D LiDAR (but no RGB) information is captured, resulting in one channel per pixel. (The tradeoff between this approach and other sensor types was not a subject of this research.) This mode of recording does not allow the identification of individual passengers (cf. Fig. 1) but preserves enough information to give an accurate algorithmic passenger count (see Section IV). The video sequences were recorded in 2017 by 3D LiDAR cameras mounted above the doors of a regional train under regular operation in the Berlin metropolitan area. Every sequence is annotated with the number of boarding and alighting passengers (excluding children) as a label. The recordings were made at 40 frames per second but were later reduced to 10 frames per second. Each pixel takes floating-point values between 0 and 1, where 0 is closest to the sensor and 1 is 4 m away from the sensor. After reducing the framerate, the number of frames per sequence ranges from 56 to 3275 (avg. 190, see Table I). Each sequence shows one entire door opening phase in a top-down perspective, including the physical opening and closing of the pictured sliding door (see Fig. 1 for sample frames).
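The depth encoding described above can be illustrated with a short numpy sketch. This is an illustration only, not part of the dataset tooling; the function name and the linear mapping from normalized pixel values to meters are our own assumptions based on the stated 0-to-4 m range.

```python
import numpy as np

SENSOR_RANGE_M = 4.0  # a pixel value of 1.0 corresponds to 4 m from the sensor

def depth_to_meters(frame: np.ndarray) -> np.ndarray:
    """Map normalized depth values in [0, 1] to distances in meters,
    assuming the linear encoding described for the Berlin-APC data."""
    return frame * SENSOR_RANGE_M

# A toy sequence in the dataset's t x 20 x 25 layout (t varies per sequence).
sequence = np.random.default_rng(0).uniform(0.0, 1.0, size=(56, 20, 25))
meters = depth_to_meters(sequence)
assert meters.shape == sequence.shape
assert 0.0 <= meters.min() and meters.max() <= SENSOR_RANGE_M
```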
Each sequence was initially given to three human labelers who independently annotated the timestamp and direction (boarding or alighting) of every adult passenger they discerned in the sequence using specialized labeling software. An additional grayscale video recording (320 × 240 pixels at 10 frames per second) of every scene was made available to the labeling assistants to enhance clarity and comprehensibility. Rare events such as dense crowds or large objects could be annotated in a free-text field. After annotation, the labelers were asked to mark every sequence as decidable or undecidable. Sequences marked undecidable by at least two labelers (primarily due to sensor errors) were rejected.
It was then checked whether the three initial labelers agreed upon the total counts per category and direction. If there was no simple majority (1,270 of 12,956 sequences, 9.80%), the sequence was re-examined by up to four additional human annotators. If still no majority upon a correct label was reached (396 of 12,956 sequences, 3.06%), an administrator's decision finally determined the label. The per-sequence totals of boarding and alighting adults were then stored. This iterative approach ensured that dense crowds and sequences with many passengers obtain consistent labels after careful manual inspection. Though comparatively rare, such sequences are a cornerstone for training a neural network and evaluating its predictive performance. This procedure yielded a dataset of 12,956 sequences with 26,243 boarding and 26,164 alighting events; only 382 sequences were rejected as unusable. The detailed statistics are summarized in Table I.
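The simple-majority check over the labelers' per-sequence totals can be sketched as follows. This is a minimal illustration of the decision rule, not the labeling software itself; the function name and the `(boarding, alighting)` tuple representation are our own assumptions.

```python
from collections import Counter

def majority_label(votes):
    """Return the (boarding, alighting) pair backed by a simple majority
    of labelers, or None if no such majority exists (escalation needed)."""
    (label, n), total = Counter(votes).most_common(1)[0], len(votes)
    return label if n > total / 2 else None

# Two of three labelers agree -> the label is accepted.
assert majority_label([(2, 1), (2, 1), (2, 0)]) == (2, 1)
# Three-way disagreement -> re-examination by additional annotators.
assert majority_label([(2, 1), (3, 1), (2, 0)]) is None
```

With up to seven annotators, the same rule applies; only if no majority emerges does an administrator decide.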

III. THE NEURAL AUTOMATED PASSENGER COUNTER (NAPC)

A. The Network Architecture
There are two main approaches in deep learning which are capable of modeling temporal dependencies. Autoregressive neural networks like WaveNet [45] condition each new prediction on previous ones. Plain recurrent neural networks (RNNs) [46], gated recurrent units (GRUs) [47], and LSTMs maintain past information in hidden states throughout the sequence. Plain RNNs cannot maintain information throughout long sequences, and GRUs are less capable of solving counting problems than LSTMs [48]. Thus, LSTMs are chosen.
The input data is the above-mentioned Berlin-APC Dataset (see Section II). Every frame is represented as a 500-dimensional vector by concatenating its pixel rows. A fully connected input layer reduces the input vector to a smaller representation, which is then propagated through the LSTM layers. A second fully connected layer maps the LSTM output to the two final output classes, namely the counted boarding and alighting passengers.
We propose the network architecture shown in Fig. 2. The two major hyperparameters of the network's structure are the depth (number of LSTM layers) and the height (number of LSTM cells per layer) of the LSTM core. Those were optimized using a standard hyperparameter selection procedure (for details, see Section IV-F).
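A plain-numpy sketch of this pipeline (input FC layer, stacked LSTM layers, output FC layer with a non-negative activation) is shown below. It is an illustration of the data flow only, not the authors' implementation: the gate ordering, random initialization, and the choice of ReLU as the non-negative output activation are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMLayer:
    """A single (batchless) LSTM layer with `size` units."""
    def __init__(self, in_dim, size):
        # One weight block per gate: input, forget, cell candidate, output.
        self.W = rng.normal(0, 0.1, (4 * size, in_dim + size))
        self.b = np.zeros(4 * size)
        self.size = size

    def forward(self, xs):
        h, c, out = np.zeros(self.size), np.zeros(self.size), []
        for x in xs:  # one step per frame
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            out.append(h)
        return np.stack(out)

class NAPCSketch:
    """Input FC -> stacked LSTMs -> output FC with non-negative activation."""
    def __init__(self, in_dim=500, hidden=50, depth=5):
        self.W_in = rng.normal(0, 0.1, (hidden, in_dim))
        self.layers = [LSTMLayer(hidden, hidden) for _ in range(depth)]
        self.W_out = rng.normal(0, 0.1, (2, hidden))

    def forward(self, frames):               # frames: (t, 500)
        h = frames @ self.W_in.T             # reduce each frame to `hidden` dims
        for layer in self.layers:
            h = layer.forward(h)
        return np.maximum(h @ self.W_out.T, 0.0)  # (t, 2), counts >= 0

seq = rng.uniform(0, 1, (30, 20, 25)).reshape(30, -1)  # flatten 20x25 frames
pred = NAPCSketch().forward(seq)
assert pred.shape == (30, 2) and (pred >= 0).all()
```

The last row of `pred` would correspond to the network's final boarding/alighting estimate for the sequence.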

B. Data Augmentation
We apply the following data augmentation to each sequence independently at every epoch: The neural network decides whether a passenger is passing through the door, which is centered at the top of each frame (see Fig. 1). This property must not be changed. Thus, the sequence labels remain valid when performing a left-right mirroring of each frame, which keeps the door position fixed at the top center of the view. (Mirroring all frames upside-down would invalidate the sequence, as the door would flip to the bottom and thus change the region of interest.) Due to the low resolution of the data, reversing the sequence potentially distorts the view of the objects but does not change the position of the door. When reversing the sequence, previously boarding passengers are now leaving the vehicle and vice versa, i.e., swapping the boarding and alighting labels yields a valid label for the reversed sequence. Both augmentations (left-right mirroring and reversing) are applied independently with a probability of 0.5.
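The two augmentations can be sketched in a few lines of numpy. This is an illustration under our own assumptions (function name, and that the last array axis is the horizontal image axis), not the authors' code.

```python
import numpy as np

def augment(seq, label, rng):
    """seq: (t, 20, 25) depth frames; label: (boarding, alighting).
    Left-right mirroring keeps the door at the top center, so the label is
    unchanged; time reversal swaps the boarding and alighting labels."""
    if rng.random() < 0.5:            # mirror each frame left-right
        seq = seq[:, :, ::-1]         # assumes the last axis is horizontal
    if rng.random() < 0.5:            # play the sequence backwards
        seq = seq[::-1]
        label = (label[1], label[0])  # boarding <-> alighting
    return seq, label

rng = np.random.default_rng(0)
aug_seq, aug_label = augment(np.zeros((4, 20, 25)), (1, 2), rng)
assert aug_seq.shape == (4, 20, 25)
```

Applied independently per epoch, each sequence is seen in up to four variants over the course of training.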

C. Simple Loss
Each sequence is paired with only one label per class (the two accumulated counts of the boarding and alighting passengers). In the simplest setting, the network processes the whole sequence at once, and the prediction of the last frame is compared to the label; the same approach was used in [30] and also works for the NAPC. However, exerting more control over the network's loss, without acquiring more information about the sequences themselves, leads to event-precise predictions. Therefore, the error is calculated for each frame. This may also improve the gradient flow through time.
Valid predictions are bounded by zero from below and the respective labels from above, as shown in Fig. 3. We refer to them as the lower and upper bounds.
The error is calculated as follows: Network predictions of boarding or alighting passengers are updated at every frame of the sequence. Let k be the index of a sequence X_k ∈ R^(t_k × 500) with t_k frames, and let Y_k ∈ N^2 be the corresponding labels, one per output class. Then, the upper bound U_k ∈ N^(t_k × 2) for that sequence is given by

U_k[i, j] = Y_k[j]   for all i = 1, ..., t_k and j = 1, 2,   (1)

and the lower bound L_k ∈ N^(t_k × 2) is given by

L_k[i, j] = 0 for i < t_k,   L_k[t_k, j] = Y_k[j].   (2)

Intuitively speaking, we require the network to count at most as many events as the label prescribes, and to count at least zero events, except for the last frame, for which we require the network to predict the label of that sequence exactly. Let Ŷ_k ∈ R^(t_k × 2) denote the NAPC's prediction for that sequence, which is always greater than or equal to zero due to the final activation function (see Fig. 2). Then, the error E_k ∈ R^(t_k × 2) measures how much the bounds are violated:

E_k = max(Ŷ_k − U_k, 0) − min(Ŷ_k − L_k, 0),   (3)

where the minimum and maximum operate elementwise.
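A numpy sketch of these per-frame bounds and the bound-violation error follows. It reconstructs the rule described in the text (stay between zero and the label at every frame, hit the label exactly at the last frame); the helper names are our own.

```python
import numpy as np

def bounds(label, t):
    """Per-frame bounds for one sequence: the upper bound equals the label
    at every frame; the lower bound is zero except at the last frame."""
    U = np.tile(label, (t, 1)).astype(float)  # (t, 2)
    L = np.zeros((t, 2))
    L[-1] = label
    return U, L

def simple_error(pred, label):
    """Elementwise amount by which the prediction violates the bounds."""
    U, L = bounds(label, len(pred))
    return np.maximum(pred - U, 0.0) - np.minimum(pred - L, 0.0)

label = np.array([2, 0])  # 2 boarding, 0 alighting
pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # counts up to 2
assert simple_error(pred, label).sum() == 0.0          # within all bounds
pred_flat = np.zeros((3, 2))                           # always predicts zero
assert simple_error(pred_flat, label).sum() == 2.0     # penalized only at the end
```

The last assertion already hints at the degenerate solution discussed next: an all-zero prediction is penalized only at the final frame.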
Minimizing the simple loss results in networks that only predict zeros. This is due to two reasons: the first is that a significant fraction of sequences does not contain any events in either class (see Table I). The second and more influential one, however, is the loss function in combination with the labels. To clarify the problem, assume for simplicity no boarding and alighting events for a sequence X_k. The corresponding labels are both zero. Inserting them into Eq. 1 and Eq. 2 results in U_k[i, j] and L_k[i, j] being zero for every i ∈ {1, . . . , t_k} and j ∈ {1, 2}. Substituting U_k and L_k in Eq. 3 with zero, the resulting error E_k is

E_k = max(Ŷ_k, 0) − min(Ŷ_k, 0) = Ŷ_k,

which holds because the final activation function of the network guarantees Ŷ_k ≥ 0. Thus, every prediction that is not exactly zero produces an error. When this error is minimized over the large number of sequences without events, the only leftover error is produced at the last frame of sequences with labels greater than zero, which is negligible in sequences with up to several thousands of frames.

D. Refined Loss
To overcome the aforementioned problem, we concatenate sequences to create longer sequences with intermediate counting ground truth; the counts of the concatenated sequences are accumulated. The bounding boxes of the concatenated sequences (see Fig. 4) now stack on top of each other, shifted along the time axis. The new loss function is defined as follows: Let k and l be the indices of the sequences X_k ∈ R^(t_k × 500) and X_l ∈ R^(t_l × 500). The concatenation of these two sequences contains two successive door opening phases, and the upper and lower bounds U*, L* ∈ R^((t_k + t_l) × 2) of the concatenated sequence are given by stacking the two individual bounds and adding the accumulated counts, i.e.

U*[i, :] = U_k[i, :] for i ≤ t_k,   U*[t_k + i, :] = Y_k + U_l[i, :] for 1 ≤ i ≤ t_l,   (4)

and analogously

L*[i, :] = L_k[i, :] for i ≤ t_k,   L*[t_k + i, :] = Y_k + L_l[i, :] for 1 ≤ i ≤ t_l.   (5)
Given a neural prediction Ŷ* ∈ R^((t_k + t_l) × 2), the loss function now reads

E* = max(Ŷ* − U*, 0) − min(Ŷ* − L*, 0).   (6)

The number of concatenated sequences is a hyperparameter of the learning procedure. It was fixed to five for all experiments.
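The construction of the stacked, count-accumulating bounds can be sketched as follows. This is an illustrative reconstruction of the scheme described above (generalized to any number of concatenated sequences); the helper name is our own.

```python
import numpy as np

def concat_bounds(labels, lengths):
    """Stack per-sequence bounds along time; counts accumulate so that each
    concatenated door opening contributes an intermediate ground-truth step."""
    U, L, offset = [], [], np.zeros(2)
    for label, t in zip(labels, lengths):
        label = np.asarray(label, dtype=float)
        U.append(offset + np.tile(label, (t, 1)))  # upper: running total
        Lk = np.tile(offset, (t, 1))               # lower: previous total...
        Lk[-1] += label                            # ...until the last frame
        L.append(Lk)
        offset = offset + label
    return np.concatenate(U), np.concatenate(L)

U, L = concat_bounds([(2, 1), (1, 0)], [3, 2])
assert U[-1].tolist() == [3.0, 1.0]  # accumulated upper bound
assert L[2].tolist() == [2.0, 1.0]   # intermediate ground truth at frame 3
assert L[3].tolist() == [2.0, 1.0]   # floor stays at the previous total
```

The prediction is then penalized with the same bound-violation error as before, but against these stacked bounds, so a constant-zero output is no longer near-optimal.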

E. Learning Procedure
We use the Adam optimizer [50] with a fixed learning rate of 0.001 when training an NAPC model. Training runs for a fixed number of 5,000 epochs, and the prediction accuracy is determined on the validation set after every 10th epoch. The model with the highest validation accuracy is selected for testing (see Section IV and Fig. 5). Unless noted otherwise, all experiments use 5 LSTM layers with 50 LSTM cells each. The sequences are concatenated as described previously, and 32 concatenated sequences are batched together. Thus, every batch consists of 160 (= 5 × 32) randomly drawn sequences, until the training data is exhausted and a new epoch begins.
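The epoch-wise batching (32 concatenations of 5 sequences, i.e., 160 sequences per batch) can be sketched as below. This is an illustration under our own assumptions; in particular, dropping the leftover tail of an epoch is assumed, not stated in the text.

```python
import random

def batches(n_sequences, concat=5, batch_size=32, seed=0):
    """Yield batches of sequence indices: each batch holds `batch_size`
    concatenations of `concat` randomly drawn sequences (5 x 32 = 160)."""
    order = list(range(n_sequences))
    random.Random(seed).shuffle(order)           # new random order per epoch
    step = concat * batch_size
    for start in range(0, n_sequences - step + 1, step):
        chunk = order[start:start + step]
        yield [chunk[j * concat:(j + 1) * concat] for j in range(batch_size)]

all_batches = list(batches(1000))
assert len(all_batches) == 1000 // 160           # incomplete tail is dropped
assert all(len(b) == 32 and len(b[0]) == 5 for b in all_batches)
```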

IV. RESULTS FOR THE Berlin-APC DATASET
We next present a series of experiments that demonstrate the high passenger counting accuracy of NAPC on the previously introduced dataset. To that end, we first formulate the passenger counting task in terms of a classification and a regression problem and compute the respective performance indices. We then explore how prediction performance depends on the hyperparameter selection and on the amount of available training samples. To increase the robustness of our experiments, the splitting of our dataset into a training, validation, and test subset is done along the lines of the original 2017 recording days. That is, if sequences are randomly sampled, entire recording days are always drawn rather than individual sequences (a similar approach is chosen in [33]). To that end, the sequences recorded at 6 out of the 31 recording days (approx. 20%) were chosen at random (with random seed S) to build a test set H_S of size |H_S| ∈ N, denoted by H_S := {X_{S,k} ∈ R^(t_k × 500) : k = 1, ..., |H_S|}, where t_k denotes the number of frames of the k-th sequence. The remaining 25 recording days are split into a training subset T_S and a validation subset V_S in a 3:1 ratio. To simplify the notation, we drop the random seed index S in the following. After a randomly initialized NAPC instance was trained using T and model selection was done using V, we computed for each sequence X_k ∈ H the prediction tensor Ŷ_k ∈ R^(t_k × 2) (k = 1, ..., |H|). The prediction of the last frame, Ŷ_k[t_k, :] ∈ R^2, is the final algorithmic count, where [·, ·] denotes standard array indexing. We denote by

round(Ŷ_k[t_k, 1]) and round(Ŷ_k[t_k, 2])

the final boarding and alighting count, respectively, where round denotes standard integer rounding. Intuitively speaking, Ŷ_k[:, 1] contains the prediction of the entire boarding time series (that is, the neural prediction for the given input sequence X_k), and Ŷ_k[:, 2] that of the alighting time series; Ŷ_k[t_k, :] is the count accumulated until the last frame.
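Extracting the final integer counts from a prediction tensor amounts to rounding the last-frame values; a minimal sketch (with a hypothetical helper name):

```python
import numpy as np

def final_counts(pred):
    """Integer boarding/alighting counts from the last-frame prediction
    of a (t, 2) time series, using standard integer rounding."""
    boarding = int(round(float(pred[-1, 0])))
    alighting = int(round(float(pred[-1, 1])))
    return boarding, alighting

pred = np.array([[0.1, 0.0], [0.9, 0.2], [1.9, 1.1]])  # (t, 2) time series
assert final_counts(pred) == (2, 1)
```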
The results presented were obtained by the following scheme: First, we used eight different random seeds S to sample independent random partitionings of the entire dataset into the training, validation, and holdout sets T , V, and H as described above to estimate the performance of our approach independent of the specific choice of the training and holdout sets. For each of these different partitionings, we then trained four randomly initialized NAPC networks each to test the robustness of the performance results against different influences of model initializations and against different data samples used for training and testing. That is, 32 models were trained in total.

A. Classification Performance
We first looked at the passenger counting problem in terms of a classification problem, that is, every sequence has a discrete class label in N × N. We denote the boarding accuracy of an NAPC network by

ACC_b := |{k : round(Ŷ_k[t_k, 1]) = Y_k[1]}| / |H|,

that is, the share of correctly classified sequences relative to the total number of sequences in the test set H. The alighting accuracy ACC_a is defined analogously. Our approach achieved a boarding accuracy of 0.9615 (average over all models, min: 0.9522, median: 0.9613, max: 0.9711), i.e., approx. 96% of all sequences in the test set were classified correctly. The accuracy varied by only 2 percentage points from training to training. That is, the accuracy was largely independent of the choice of the test set, and this held for both the boarding and the alighting direction, see Table II.
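The accuracy metric is an exact-match rate over the test set; as a sketch (the function name is our own):

```python
def accuracy(pred_counts, labels):
    """Share of sequences whose predicted count matches the label exactly."""
    hits = sum(p == y for p, y in zip(pred_counts, labels))
    return hits / len(labels)

# 96 of 100 sequences counted exactly -> ACC = 0.96
preds = [2] * 96 + [3, 1, 0, 5]
labels = [2] * 100
assert accuracy(preds, labels) == 0.96
```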

B. Regression Performance
The passenger counting problem can also be understood as a regression task, which allows for the quantification of counting errors. To that end, the mean absolute error and the mean absolute percentage error in boarding direction (MAE_b and MAPE_b) are given by

MAE_b := (1/|H|) Σ_k |round(Ŷ_k[t_k, 1]) − Y_k[1]|

and

MAPE_b := (1/|H⁺|) Σ_{k: Y_k[1] > 0} |round(Ŷ_k[t_k, 1]) − Y_k[1]| / Y_k[1],

where H⁺ denotes the set of test sequences with at least one boarding passenger. The MAE and MAPE in the alighting direction (MAE_a and MAPE_a) are defined similarly. Note that the MAPE formula excludes sequences with zero passengers to avoid division by zero. Since the models summarized here classify a sequence with zero boarding passengers correctly with a probability >99%, this exclusion does not strongly affect the MAPE's significance. In line with [33], we also report the mean absolute percentage error in the boarding direction relative to the average number of boarding passengers (including sequences with zero passengers),

MAPE'_b := MAE_b / Ȳ^(b),

where Ȳ^(b) denotes the average number of boarding passengers per test sequence. In the alighting direction, MAPE'_a is defined similarly with Ȳ^(a) = 2.02.
Our models achieved a boarding MAE_b of 0.0595 (average over all models, min: 0.0337, median: 0.0535, max: 0.0944), see Table II, i.e., a typical trained NAPC network over- or undercounted approx. 6 boarding passengers per 100 door opening phases. The relative error indicated by the boarding MAPE_b was on average 0.96% (min: 0.76%, median: 0.94%, max: 1.21%), i.e., a trained NAPC network over- or undercounted approx. 1% of boarding passengers in the median of all door opening phases. The relative error with respect to the average number of boarding passengers was on average 2.76% (min: 1.80%, median: 2.78%, max: 3.91%), hinting at larger errors in sequences with more passengers. The results for the alighting direction were similar, see Table II.
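The three regression metrics (MAE over all sequences, MAPE over sequences with at least one passenger, and the MAE relative to the mean label count) can be sketched as below. The function name is our own, and the exact formulas are our reconstruction of the prose definitions.

```python
import numpy as np

def regression_metrics(preds, labels):
    """MAE over all sequences; MAPE over sequences with at least one
    passenger; MAE relative to the mean label (the second MAPE variant)."""
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    abs_err = np.abs(preds - labels)
    mae = abs_err.mean()
    nonzero = labels > 0                     # exclude zero-passenger sequences
    mape = (abs_err[nonzero] / labels[nonzero]).mean()
    mape_rel_mean = mae / labels.mean()      # relative to the average count
    return mae, mape, mape_rel_mean

preds = [0, 2, 5, 10]
labels = [0, 2, 4, 10]
mae, mape, rel = regression_metrics(preds, labels)
assert mae == 0.25                           # one error of 1 over four sequences
assert round(mape, 4) == round(1 / 12, 4)    # |5-4|/4, averaged over 3 nonzero
assert rel == 0.25 / 4.0                     # MAE over the mean label count
```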

C. Counting Metrics
We further propose to employ the following two metrics tailored to the passenger counting task. First, the global relative bias in boarding direction, ∆_b^(global), is defined by

∆_b^(global) := (Σ_k round(Ŷ_k[t_k, 1]) − Σ_k Y_k[1]) / Σ_k Y_k[1].

The global relative bias in alighting direction, ∆_a^(global), is defined analogously.
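As a sketch (with a hypothetical function name), the global relative bias compares the total predicted count to the total true count over the whole test set:

```python
def global_relative_bias(preds, labels):
    """Signed difference between total predicted and total true counts,
    relative to the total true count (negative -> systematic undercount)."""
    total_pred, total_true = sum(preds), sum(labels)
    return (total_pred - total_true) / total_true

# 9 passengers counted where 10 boarded -> a relative bias of -10%.
assert global_relative_bias([2, 3, 4], [2, 4, 4]) == -0.1
```

Unlike MAE and MAPE, this metric is signed, so per-sequence over- and undercounts can cancel; it isolates systematic drift from random error.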
Second, we computed the 95% confidence interval (CI) of a t-test-induced equivalence test [51] with the hypotheses

H_0: |µ| ≥ m (i.e., a systematic boarding count error),
H_1: |µ| < m (i.e., no systematic boarding count error),

where m > 0 is the required equivalence margin and µ is the (unknown theoretical) global relative bias in boarding direction for a test set containing an infinite number of sequences (i.e., |H| → ∞; the alighting direction is treated analogously). If the confidence interval is fully contained in the interval [−m, m], H_0 is rejected in favor of H_1 at the 5% significance level.
In our experiments, we achieved a lower bound of the confidence interval, CI^(lower), for the boarding direction of −2.24% (average over all models, min: −4.01%, median: −2.14%, max: −0.57%) and an upper bound, CI^(upper), of −0.68% (average over all models, min: −1.57%, median: −0.65%, max: 0.28%). Note that zero (that is, a theoretically unbiased system) was not contained within the median confidence interval; instead, the equivalence test suggested an undercounting bias. The results for the alighting direction were similar. A typical fully trained NAPC network would not have passed the equivalence test for the boarding direction with an equivalence margin of m = 0.01 as suggested in [52], since the 95% CI is not fully contained in the interval [−m, m].

TABLE II: The randomly initialized NAPC networks were trained and evaluated on eight different partitionings of the available dataset into a training, validation, and test set. The reported values are the test-set results.

D. Challenging Sequences and Rare Events
Next, we posed the question of how well a trained NAPC network handles sequences that contain challenging events such as large objects, crowds, bicycles, or lingerers. To answer this question, we further examined a trained NAPC network from the previous section. First, we plotted its training and validation accuracy together with the average (taken over all 32 previously trained NAPC networks) of the accuracy on the training set and the accuracy on the validation set along with ±2 times their standard deviation in Fig. 5. We further marked the epoch with the highest attained validation accuracy after the training was stopped. Our results show that all training repetitions had a steep increase in validation accuracy up to approx. 500 epochs, and only minor improvements after approx. 4,000 epochs.
We further plotted the error distribution, that is, the distribution of the differences between the manual and the automated count, as histograms in Fig. 6. Over the entire test set, we observed that the average differences are close to zero with a tendency to undercount. Notably, the fully trained NAPC instance never missed more than six passengers, even for challenging sequences. (Similar results were obtained for the alighting direction and when analyzing the other models summarized in Table II.) If a door opening phase featured a dense passenger sequence, a passenger jam, or more than 20 boarding passengers, the accuracy decreased to merely 55% due to undercounted (that is, overlooked) passengers. This is also observed in the confusion matrix (not reported here). The distribution for lingerers is right-tailed, which indicates that lingerers were often double-counted. Bicycles or other large objects had a slight adversarial effect on counting performance, decreasing the accuracy by approx. 7 percentage points.

[Fig. 5 caption: The thick lines visualize the validation (dark orange) and training (dark blue) accuracy of a randomly selected single NAPC instance. The vertical line indicates the epoch with the highest attained validation accuracy after the training was stopped; the corresponding model was selected for testing. Note that the training accuracy is determined using concatenations of five sequences (see Section III) and is, therefore, lower than the validation accuracy.]
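An error histogram of the kind shown in Fig. 6 can be computed directly from paired manual and automated counts; the following one-liner is an illustrative sketch (the function name and input layout are assumptions, not from the released code):

```python
from collections import Counter

def error_distribution(manual_counts, automated_counts):
    """Histogram of per-door-opening count differences (automated minus
    manual); negative keys indicate undercounting, positive keys
    overcounting. Illustrative helper, not the paper's implementation."""
    return Counter(a - m for m, a in zip(manual_counts, automated_counts))
```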

E. Resolution Tradeoff
To determine the tradeoff between the spatial resolution of the depth video and the performance of our counting system, we artificially decreased the resolution of the input data by downsampling (bi-linear spline interpolation with antialiasing), simulating different sensor resolutions (15 × 18, 10 × 12, 5 × 6, and 2 × 3 pixels). The data was then upsampled to the original resolution of 20 × 25 pixels (using the same method) in order to keep the network architecture constant over the entire experiment. For each of the four downsampled datasets, a randomly initialized NAPC instance was trained. This experiment was repeated five times with different random seeds S, i.e., a total of 20 additional NAPC instances were trained. The results are summarized in Table III. For convenience, we also state the relevant metrics for the full-resolution experiment (without downsampling) from Sections IV-A to IV-C as a reference.
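The down-then-upsampling procedure can be sketched as follows; note that this simplified version uses plain bilinear interpolation without the antialiasing mentioned above, and all function names are ours:

```python
import numpy as np

def resize_bilinear(frame, out_shape):
    """Bilinear resize of a 2D depth frame (simplified sketch; the paper
    uses bi-linear spline interpolation with antialiasing)."""
    in_h, in_w = frame.shape
    out_h, out_w = out_shape
    # Output sample positions expressed in input coordinates
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    # Interpolate along rows first, then along columns
    tmp = np.stack([np.interp(xs, np.arange(in_w), row) for row in frame])
    return np.stack([np.interp(ys, np.arange(in_h), col) for col in tmp.T],
                    axis=1)

def simulate_sensor(frames, sim_shape, full_shape=(20, 25)):
    """Down- then upsample each frame to mimic a lower-resolution sensor
    while keeping the network input size fixed at full_shape."""
    return np.stack([resize_bilinear(resize_bilinear(f, sim_shape), full_shape)
                     for f in frames])
```

For example, `simulate_sensor(video, (5, 6))` produces 20 × 25 input frames that carry only the information of a 5 × 6 sensor.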
Counting performance remains stable down to a resolution of 5 × 6 pixels, at which the performance slightly decreased in all metrics, accompanied by an increase in the standard deviation (i.e., an increased dependence of the performance on the chosen holdout set). Still, more than 95% of all sequences in the holdout set are counted correctly. Only when the resolution is reduced to 2 × 3 pixels does counting performance drop significantly, with correct counts achieved for only 45% of all sequences. While we consider a resolution of 20 × 25 pixels to be low enough to prevent the identification of single individuals, even lower resolutions down to 10 × 12 pixels can be employed without sacrificing counting performance.

F. Hyperparameter Validation
Next, we validated the choice of the LSTM network's depth and height. To that end, we fixed a training, validation and test partitioning of the entire dataset and trained five NAPC instances for each combination of the hyperparameters depth ∈ {1, 2, 5, 10} and height ∈ {10, 20, 50, 100}, i.e., a total of 80 NAPC instances. Thereby, we obtained for each tuple (depth, height) a total of ten values for the performance of a trained model in terms of its boarding and alighting accuracies. We jointly visualized the distribution of these ten accuracy values for each hyperparameter configuration in Fig. 7. The choice of 5 LSTM layers with 50 cells each reached the highest accuracy and exhibited the smallest variation (in terms of interquartile range). These findings are confirmed by the analogous plot of the hyperparameter validation in terms of the global relative bias ∆(global) (not reported here).
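The experiment enumerates one training job per (depth, height, seed) triple; a minimal sketch of that grid, with a job layout that is ours rather than the released implementation's, reproduces the count of 80 instances:

```python
from itertools import product

DEPTHS = (1, 2, 5, 10)       # number of stacked LSTM layers
HEIGHTS = (10, 20, 50, 100)  # LSTM cells per layer
SEEDS = range(5)             # five random initializations per configuration

# One training job per (depth, height, seed) combination
jobs = [{"depth": d, "height": h, "seed": s}
        for d, h, s in product(DEPTHS, HEIGHTS, SEEDS)]
```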

G. Minimal Required Train Set Size
Finally, we decreased the number of recording days provided to the training procedure while keeping all other parameters constant. More precisely, we drew a fixed validation and test set and varied the number of recording days used for training. Unused sequences were discarded. This procedure was validated with five different seeds S, that is, with five different training, validation and holdout sets. Similar to Subsection IV-F, we jointly visualized the boarding and alighting accuracy as a function of the train set size, see Fig. 8. Our results show that training can reach an accuracy above 90% in both directions with as few as approx. 1,000 sequences, but is not robust as it depends on the choice of the test set. Training the model with 2,000 sequences already had a high chance of reaching accuracy values of 90% and above, and only failed to do so once. When using around 5,000 sequences, none of the trained models had an accuracy below 90%. In the Berlin-APC dataset, this amounts to around two weeks of in-vehicle video material collection.
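Subsampling the training days while keeping validation and test fixed can be sketched as follows; this helper is an assumption for illustration, not taken from the released code:

```python
import random

def subsample_training_days(all_days, n_days, seed):
    """Draw n_days recording days for training; training sequences from
    the remaining days are discarded. The validation and test sets are
    assumed to be fixed elsewhere. Illustrative helper only."""
    rng = random.Random(seed)  # seeded for reproducible draws
    return sorted(rng.sample(sorted(all_days), n_days))
```

Running this for increasing `n_days` with several seeds yields the accuracy-versus-train-set-size curves of Fig. 8.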

V. CONCLUSION
We introduced a real-time Neural Automated Passenger Counting system (NAPC), which is based on an end-to-end LSTM recurrent neural network. NAPC counts the number of boarding and alighting passengers from 3D LiDAR videos obtained by a top-down sensor during door opening phases.
A direct quantitative comparison with other counting algorithms is not possible because they were tested on datasets different from ours in the context of different real-world scenarios, distinguished, e.g., by location (urban/rural), type of vehicle (bus/train/tram), sensor (RGB/RGB-D/depth-only, with various resolutions), number of test samples, or viewpoint (top-down/oblique-view). However, it is possible to test whether our approach yields a performance comparable to that of other counting algorithms, whereby each algorithm is tested in the scenario it was designed for. NAPC counted on average approx. 99% of the boarding and alighting passengers in our test data. This performance is superior to counting 108 of 116 passengers (i.e., approx. 93%, jointly for both directions) reported in the scenario of [31], and to counting 107 of 109 passengers (i.e., approx. 98%, jointly for both directions) reported in the scenario of [32]. Note that the authors of [32] report an accuracy of 100% for the boarding direction; however, their test data consists of only 28 boarding passengers (ours: 4,555 to 5,841 boarding passengers, i.e., a roughly 200 times larger test set). Looking at the performance measure of [33], NAPC obtained on average an absolute relative error of approx. 3% for both boarding and alighting passenger counting, which is superior to all four sub-categories reported in the scenario of [33] (15% absolute relative error for boarding and 6% for alighting passengers in the best-performing sub-categories). As we have pointed out, these results must be interpreted with caution, as better performance indices could also be explained by other, scenario-dependent factors. However, it can be concluded that the use of deep learning on low-resolution depth data leads to results competitive with other APC algorithms.
Besides looking at performance metrics, we stress the following advantage of NAPC compared to other deep learning and/or classical methods: unlike three recently proposed boarding and alighting passenger counting methods [31]-[33], NAPC is based on an end-to-end architecture. No separate background modeling, head detection, or trajectory tracking is required. Our experiments were conducted on a large-scale dataset of approx. 13,000 depth-only recordings with a resolution of 20 × 25 (or fewer) pixels. This demonstrates the possibility of building a privacy-friendly APC system with competitive counting performance based on a deep-learning approach, avoiding unnecessary data collection in the public sphere.

SOURCE CODE / DATASET AVAILABILITY
The source code of the TensorFlow implementation of NAPC is available at https://github.com/nicojahn/open-neural-apc.