Helmet Use Detection of Tracked Motorcycles Using CNN-Based Multi-Task Learning

Automated detection of motorcycle helmet use through video surveillance can facilitate efficient education and enforcement campaigns that increase road safety. However, existing detection approaches have a number of shortcomings, such as the inability to track individual motorcycles through multiple frames, or to distinguish drivers from passengers in helmet use. Furthermore, datasets used to develop approaches are limited in terms of traffic environments and traffic density variations. In this paper, we propose a CNN-based multi-task learning (MTL) method for identifying and tracking individual motorcycles, and for registering rider-specific helmet use. We further release the HELMET dataset, which includes 91,000 annotated frames of 10,006 individual motorcycles from 12 observation sites in Myanmar. Along with the dataset, we introduce an evaluation metric for helmet use and rider detection accuracy, which can be used as a benchmark for evaluating future detection approaches. We show that the use of MTL for concurrent visual similarity learning and helmet use classification improves the efficiency of our approach compared to earlier studies, allowing a processing speed of more than 8 FPS on consumer hardware, and a weighted average F-measure of 67.3% for detecting the number of riders and helmet use of tracked motorcycles. Our work demonstrates the capability of deep learning as a highly accurate and resource-efficient approach to collecting critical road-safety-related data.


I. INTRODUCTION
Nowadays, drivers' adherence to traffic laws is mainly monitored and enforced by traffic police officers through direct observation. Yet implementations of road surveillance infrastructure are increasingly being used to automatically identify safety-related behaviors through traffic video analysis. Approaches have been developed to register relatively simple variables, such as traffic flow and density [1], [2], speed [3]-[5], traffic light violations [6], or collisions [7]. More recently, computer vision has been used to register more complex road user behaviors, such as driver mobile phone use [8] and unauthorized use of car-pooling lanes [9]. Since for many developing countries the main form of motorized transport consists of motorcycles, the detection of motorcycle helmet use of riders through machine learning has also been explored [10], [11]. The availability of exact and concurrent data about motorcycle helmet use on the street is crucial to injury prevention, as it can be used for targeted enforcement and effective education campaigns.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
The registration of helmet use through human observers naturally consists of four basic elements that any automated detection method must also possess to produce comparably detailed helmet use estimates. (1) Detection: Initially, active motorcycles need to be detected. (2) Tracking: Individual motorcycles need to be tracked through the road environment, to ensure that each motorcycle is only registered once, regardless of how long it is observed. (3) Rider differentiation: For an accurate calculation of motorcycle helmet use and to produce position-specific helmet use data, rider numbers and positions (i.e. distinguishing the driver and passenger(s)) per motorcycle need to be registered. (4) Site-diversity: Helmet use numbers need to be accurately registered, independent of the road environment at an observation site. Hence, automated approaches need to show accuracy for more than one road environment. While these four basic elements of motorcycle helmet use observation come naturally to human observers, existing automated detection approaches either do not include all four elements or have low performance on some of them (see Section II).
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In particular, the lack of rider differentiation is a crucial obstacle to the application of automated helmet use detection in the field. Researchers repeatedly find evidence of an influence of rider position and rider number on helmet use on individual motorcycles [12]-[15]. Hence, the differentiation of helmet use between drivers and passengers is a crucial metric that should not be omitted in automated detection approaches. The lack of broad applicability and robustness prevents the substitution of human observers by automated approaches in helmet use observation. Hence, we present a deep-learning-based automatic detection approach that contains all four basic elements of human-observer helmet use registration, i.e. detection, tracking, rider differentiation, and site-diversity. The proposed work builds on and extends a previous approach for frame-based helmet use detection [10], which did not include tracking of motorcycles and whose dataset was not made public. To encourage the development of diverse detection approaches, we make this dataset available with the publication of this article. In addition, we propose a benchmark metric for the assessment of automated detection approaches.
In summary, our main contributions are twofold:
• We propose a comprehensive CNN-based approach for helmet use detection of tracked motorcycles, containing all basic elements utilized by human observers. A multi-task learning (MTL) framework is developed for both visual similarity learning and patch-based helmet use classification, which increases computational efficiency as well as detection accuracy. The source code and pre-trained model are available in [16].
• We publish a diverse, large-scale, annotated dataset for motorcycle detection, called HELMET. It contains 10,006 annotated motorcycles in 910 video clips, recorded throughout the country of Myanmar at 12 observation sites across 7 cities. To the best of our knowledge, it is the largest and most diverse motorcycle helmet use detection dataset. Based on the dataset, we propose a metric to evaluate the performance of helmet use detection algorithms, which takes account of both spatial and temporal detection. The dataset, together with the source code for performance evaluation, is available in [17].

II. RELATED WORK
To date, a number of approaches for the automated detection of motorcycle helmet use in recorded video data have been proposed [10], [11], [18]-[24], details of which can be found in Table 1. For the initial step of active motorcycle detection, approaches can be broadly categorized into conventional methods [18]-[21] and deep-learning-based methods [10], [11], [22]-[24]. For the detection of active motorcycles, most conventional methods follow similar procedures. First, a background subtraction method is used to extract moving objects/vehicles from the video data. After this, a binary classifier (e.g. a support vector machine (SVM)) is used to detect motorcycles. In another step, the head region of the motorcyclists is localized, and an additional classifier is used to distinguish helmet use from non-helmet use. To improve the performance of the binary classifier, hand-crafted features are used; a common choice is the histogram of oriented gradients (HOG) [25], extracted from the detected head regions of riders. Such methods, however, do not work well when there are many motorcycles and/or there is more than one rider on a motorcycle. Instead of designing hand-crafted features, deep-learning-based methods strive to automatically learn representations from raw image data that are most suitable for the helmet use detection task. In [24], helmet use is classified in the detected head regions of riders using a convolutional neural network (CNN). In [22] and [11], two independent CNNs are trained: one is used to distinguish motorcycles from other vehicles, the other to classify helmet and non-helmet in the head region of riders. Since it is time-consuming to detect motorcycles and helmet use through two separate CNNs, [10] and [23] use a single CNN to detect motorcycles and helmet use simultaneously.
The tracking of individual motorcycles across the frames of a recorded video is only included in half of the existing approaches presented in Table 1. While video data recorded with traffic surveillance infrastructure is inherently frame-based, helmet use data produced through automatic detection must be projected onto individual motorcycles to allow a valid appraisal of helmet use. Hence, frame-based detection results for motorcycle and rider counts, as well as helmet use, must be remapped to individual motorcycles which appear in multiple frames. This can only be achieved by approaches that link frame-based detection to cross-frame tracking. This tracking is missing in some approaches (e.g. [11]). To compensate for this lack of tracking, it is necessary either to use single-frame detection at a fixed point/line in the frame to prevent the repeated detection of the same motorcycle (e.g. [20]), or to collect helmet use data in every video frame without tracking, leading to the loss of information on the number of motorcycles registered at an observation site (e.g. [10]). Both of these shortcuts lead to a decrease in helmet use data quality and, in addition, prevent the use of multiple frames of an individual motorcycle for helmet use and rider detection.
For rider number and position detection, only one of the approaches listed in Table 1 generates detailed information [10]. While other approaches (e.g. [20]) use head counts on the motorcycle as a substitute for rider numbers, this information is not mapped onto rider positions (i.e. driver vs. passenger). As the specific position and number of riders on a motorcycle directly relate to their helmet use [12]-[15], the lack of this critical information presents a clear barrier for the application of automated helmet use detection approaches in the field.
On the element of site-diversity, the existing datasets used to develop automated motorcycle helmet use detection approaches (Table 1) show a critical lack of diverse observation sites and a general lack of detailed information on road environments used. Five of the datasets [19]- [21], [23], [24] only contain data from one recording site, prohibiting robust evaluation of the developed solutions in diverse traffic environments. The two datasets which contain more recording sites do not distinguish helmet use between motorcycle drivers and passengers [11], [22]. This lack of data diversity and level of annotation detail in existing datasets hinders the development of widely applicable detection solutions.

III. THE PROPOSED APPROACH
Our proposed approach for helmet use detection of tracked motorcycles consists of three steps, which are visualized in Fig. 1. In the first step, we use a fine-tuned pre-trained RetinaNet [26] for the detection of active motorcycles, i.e. motorcycles with at least one rider on them, on a single-frame level. In the second step, each detected active motorcycle is tracked through adjacent frames, using both the motion state of the motorcycle and the visual similarity between detected active motorcycles. In the last step, when a track terminates, i.e. an individual motorcycle leaves the view of the video camera, the helmet use class of the tracked motorcycle is predicted, i.e. the number of riders, their positions, and their helmet use are identified. All three steps are described in detail in the following sections.

A. MOTORCYCLE DETECTION
Detecting a motorcycle in a single frame is a classic object detection task. To this end, we trained a state-of-the-art object detection algorithm to detect motorcycles in the dataset. Today's prevalent algorithms for object detection can be subdivided into two broad approaches: one-stage and two-stage. While the two-stage algorithms have overall higher accuracies in object detection, they are comparably slower, as frames are processed twice, once for identifying potential object locations in a frame, and once more for detecting the actual objects. Single-stage methods combine the steps of localizing potential objects and object detection into a single processing stage, which results in a small decrease in accuracy, but a large decrease in processing time. A relatively new single-stage method is RetinaNet [26], which uses a multi-scale feature pyramid combined with focal loss to successfully overcome detection accuracy limitations. RetinaNet achieves faster detection than two-stage methods, while having a higher detection accuracy than comparable single-stage methods such as YOLO [27]. We therefore applied a RetinaNet model for detecting motorcycles.
Since motorcycle detection is very similar to other object detection tasks, instead of training from scratch, we fine-tuned a RetinaNet model with pre-trained weights obtained on the COCO dataset [28].

B. MULTIPLE MOTORCYCLE TRACKING
To clarify the procedure of motorcycle tracking, let $V = \{v^{(1)}, \ldots, v^{(k)}\}$ be the set of existing tracks at time $t$. Using the notations in Table 2, a track is denoted by [...]. For a measurement $m_t^{(j)}$ at time $t$, $b_t^{(j)}$ is the predicted bounding box, and $x_t^{(j)}$ is the cropped image patch from the predicted bounding box. Each image patch is re-scaled to 192 × 192. Furthermore, we normalize the bounding box by the frame width and height so that all the numbers fall between 0 and 1. Given a bounding box $(l, u, w, h)$, its centroid $z$ is computed as $z = (l + w/2, u + h/2)$.
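The normalization and centroid computation described above can be illustrated with a minimal sketch. The function names and the example frame size are hypothetical choices made here for illustration; only the formula $z = (l + w/2, u + h/2)$ and the scaling to $[0, 1]$ come from the text.

```python
def normalize_box(box, frame_width, frame_height):
    """Scale a pixel box (l, u, w, h) to [0, 1] using the frame size."""
    l, u, w, h = box
    return (l / frame_width, u / frame_height,
            w / frame_width, h / frame_height)


def centroid(box):
    """Centroid z = (l + w/2, u + h/2) of a box (l, u, w, h)."""
    l, u, w, h = box
    return (l + w / 2, u + h / 2)


# Hypothetical example: a 192 x 108 pixel box in a 1920 x 1080 frame.
box = normalize_box((960, 540, 192, 108), 1920, 1080)
z = centroid(box)
```

Working with normalized coordinates keeps the tracker's motion model independent of the recording resolution.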
For an existing track $v^{(i)}$, we first predict its new state $\hat{s}$ with the Kalman filter. Next, we compute the distance between all tracks and measurements, yielding a distance matrix $D$, where $D_{ij}$ denotes the distance between track $v^{(i)}$ and measurement $m_t^{(j)}$. With matrix $D$, the measurement-to-track association is solved by the Munkres assignment algorithm [30].
To measure the distance $D_{ij}$, the conventional way is to compute the motion distance, namely, the squared Mahalanobis distance [31] between the predicted Kalman state of track $v^{(i)}$ and the centroid of the predicted bounding box $z_t^{(j)}$. While the motion distance is a suitable association metric when moving objects are sparse, the density of motorcycles in our dataset is very high, which results in a poor measurement-to-track association when multiple motorcycles are very close to each other.
To address this limitation, in addition to the motion distance, we compute a visual dissimilarity between the image patch of a measurement and the image patches of a track, where $x^{(n)}$, $n = 1, \ldots, N$, denotes the $N$ cropped image patches that are assigned to track $v^{(i)}$, and $\phi(\cdot; \theta)$ corresponds to the feature vector learned by an InceptionV3 deep neural network model, to be defined later in Section III-D. Hence, we have a combined distance $D_{ij}$, defined in Eq. (4) as the product of the motion distance and the visual dissimilarity. To sum up, Eq. (4) indicates that a track and a measurement are similar only if they have both similar visual appearances and similar motions.
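A minimal sketch of this association step follows. Several details are assumptions made here, not taken from the text: the visual dissimilarity is aggregated as the mean Euclidean distance to the track's stored features, the motion state is reduced to a 2-D mean and covariance in centroid space, and an exhaustive search over permutations stands in for the Munkres algorithm (it returns the same minimum-cost assignment, but is only practical for a handful of tracks). All names and the dictionary layout are hypothetical.

```python
import math
from itertools import permutations


def squared_mahalanobis(z, mean, cov):
    """Squared Mahalanobis distance of a 2-D point z from (mean, cov)."""
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of the 2 x 2 covariance matrix.
    i00, i01, i10, i11 = d / det, -b / det, -c / det, a / det
    return dx * (i00 * dx + i01 * dy) + dy * (i10 * dx + i11 * dy)


def visual_dissimilarity(track_feats, feat):
    """ASSUMPTION: mean Euclidean distance to the track's stored features."""
    return sum(math.dist(f, feat) for f in track_feats) / len(track_feats)


def combined_distance_matrix(tracks, measurements):
    """D[i][j] = motion distance * visual dissimilarity."""
    return [[squared_mahalanobis(m["z"], t["mean"], t["cov"])
             * visual_dissimilarity(t["feats"], m["feat"])
             for m in measurements] for t in tracks]


def assign(D):
    """Exhaustive minimum-cost assignment (stand-in for Munkres).

    Assumes there are at least as many measurements as tracks.
    """
    n_tracks, n_meas = len(D), len(D[0])
    assert n_tracks <= n_meas
    best = min(permutations(range(n_meas), n_tracks),
               key=lambda p: sum(D[i][j] for i, j in enumerate(p)))
    return list(enumerate(best))


# Hypothetical toy data: two tracks, two measurements in swapped order.
cov = ((0.01, 0.0), (0.0, 0.01))
tracks = [
    {"mean": (0.2, 0.2), "cov": cov, "feats": [(0.0, 1.0)]},
    {"mean": (0.8, 0.8), "cov": cov, "feats": [(1.0, 0.0)]},
]
measurements = [
    {"z": (0.79, 0.81), "feat": (0.9, 0.1)},
    {"z": (0.21, 0.20), "feat": (0.1, 0.9)},
]
matches = assign(combined_distance_matrix(tracks, measurements))
```

In a real system, a polynomial-time solver for the assignment problem would replace the exhaustive search.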
Applying the Munkres assignment algorithm to the distance matrix $D$, any new measurement can either be assigned to an existing track or initiate a new track. If a measurement $m_t^{(j)}$ is assigned to an existing track $v^{(i)}$, the track is updated with the Kalman update step, where $K$ is the Kalman gain; otherwise, the measurement initiates a new track $v^{(i+1)}$, whose track information is initialized from the measurement. If no measurement is assigned to an existing track, the track is updated without a measurement, which allows for temporary occlusion or missed detections.
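For reference, the textbook Kalman filter recursions that this kind of tracker relies on can be written as below. This is a sketch of the standard form, not necessarily the exact equations of the approach: $A$, $H$, $Q$, $R$, and the gain $K$ follow the names used in the text, while the state covariance $P$ and the innovation covariance $S_t$ are symbols assumed here.

```latex
\begin{align}
  \hat{s}_t &= A\, s_{t-1}, &
  \hat{P}_t &= A P_{t-1} A^{\top} + Q, \\
  S_t &= H \hat{P}_t H^{\top} + R, &
  d_{\mathrm{motion}} &= (z_t - H \hat{s}_t)^{\top} S_t^{-1} (z_t - H \hat{s}_t), \\
  K &= \hat{P}_t H^{\top} S_t^{-1}, &
  s_t &= \hat{s}_t + K (z_t - H \hat{s}_t), \\
  P_t &= (I - K H)\, \hat{P}_t .
\end{align}
```

The first line is the prediction step, the second gives the squared Mahalanobis distance used for association, and the last two lines are the update step. When no measurement is available, a common choice (again an assumption here) is to carry the prediction forward, i.e. $s_t = \hat{s}_t$ and $P_t = \hat{P}_t$.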
We close an existing track if no new measurement is assigned to it for more than 8 consecutive frames. We only keep closed tracks with a duration greater than 5 frames and a proportion of visible frames in a track greater than 60%. While the lack of new information after 8 consecutive frames reliably closes tracks of motorcycles that drive out of the camera's view, it also helps to close tracks that were incorrectly started by a false positive detection. The rule for a 5 frame minimum to keep a track then reliably leads to the deletion of these false positive tracks.
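The closing and filtering rules above can be sketched as follows. The thresholds are taken from the text; the function names are hypothetical, and it is an assumption of this sketch that the track duration counts every frame between the first and last assigned measurement.

```python
MAX_MISSED = 8      # close after MORE than 8 consecutive unmatched frames
MIN_DURATION = 5    # keep only tracks lasting MORE than 5 frames
MIN_VISIBLE = 0.60  # keep only tracks with MORE than 60% visible frames


def should_close(consecutive_missed):
    """True once a track has gone unmatched for more than MAX_MISSED frames."""
    return consecutive_missed > MAX_MISSED


def should_keep(visible_flags):
    """Decide whether a closed track is kept.

    visible_flags holds one bool per frame of the track: True if a
    measurement was assigned in that frame, False otherwise.
    """
    duration = len(visible_flags)
    if duration <= MIN_DURATION:
        return False
    return sum(visible_flags) / duration > MIN_VISIBLE
```

For example, a track that was matched in 7 of its 10 frames is kept, while a 4-frame track started by a false positive detection is discarded.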

C. HELMET USE CLASSIFICATION
For a closed track of sufficient length, its helmet use is estimated by pooling the helmet use predictions of the cropped image patches within the track. More specifically, let $(x^{(n)})_{n=1}^{N}$ be the cropped image patches that are assigned to a tracked motorcycle; then the track's helmet use class $\hat{y}$ is estimated by pooling the patch-level predictions $g(x^{(n)}; W)$ over the track (Eq. (8)), where $g(\cdot; W)$ is a deep convolutional neural network (CNN), parameterized by $W$.
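One way to pool patch-level predictions into a track-level class is sketched below. The pooling operator is an assumption of this sketch (mean of the per-patch class probabilities followed by an argmax); the function name and the three-class toy probabilities are hypothetical.

```python
def pool_track_prediction(patch_probs):
    """Return the index of the track-level helmet use class.

    patch_probs: list of per-patch class probability lists, one per
    cropped image patch in the track.
    ASSUMPTION: mean pooling over patches, then argmax over classes.
    """
    n_patches = len(patch_probs)
    n_classes = len(patch_probs[0])
    mean = [sum(p[k] for p in patch_probs) / n_patches
            for k in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Toy example: the first patch favors class 0, the other two favor class 1.
track_class = pool_track_prediction([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
])
```

Pooling over a track lets frames with a clear view of the riders outweigh frames in which they are partially occluded.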

D. MTL FOR PATCH-BASED HELMET USE CLASSIFICATION AND VISUAL SIMILARITY LEARNING
In our approach, apart from fine-tuning RetinaNet for motorcycle detection, we need to train CNNs for two purposes. One is visual similarity learning. That is, we want the distance between two image patches to be small if they are in the same track, but large if they belong to different tracks. The other purpose is patch-based helmet use classification, i.e. we want to predict the helmet use class (rider number, position, and helmet use) given a cropped image patch. In our approach, instead of training two separate CNNs, which is time-consuming, we apply multi-task learning (MTL) to learn both tasks simultaneously. The architecture of the proposed deep learning model is illustrated in Fig. 2. A given pair of image patches $x^{(a)}$ and $x^{(b)}$ with 192 × 192 resolution is fed into a Siamese network [32] that uses an InceptionV3 [33] CNN body with shared weights $\theta$. The network body is truncated, such that the global average pooling (GAP) layer and the final fully-connected (FC) layer are removed. Each image patch $x$ is transformed into a 2048-dimensional feature vector $\phi(x; \theta)$ after passing the output of the InceptionV3 CNN body to a GAP layer.
With these two 2048-dimensional feature vectors, the model has three tasks to learn:
1) Given the feature vector $\phi(x^{(a)}; \theta)$, predict the helmet use class $p^{(a)} = f(\phi(x^{(a)}; \theta); w^{(a)})$, where $f(\cdot)$ is a softmax regression model, parameterized by weight $w^{(a)}$.
2) Compute the Euclidean distance between the transformed feature vectors of image patch $x^{(a)}$ and image patch $x^{(b)}$: $d(x^{(a)}, x^{(b)}) = \lVert \phi(x^{(a)}; \theta) - \phi(x^{(b)}; \theta) \rVert_2$.
3) Given $\phi(x^{(b)}; \theta)$, predict the helmet use class $p^{(b)} = f(\phi(x^{(b)}; \theta); w^{(b)})$, with the softmax regression model $f(\cdot)$ parameterized by weight $w^{(b)}$.
Using the MTL model, the helmet use classification model in Eq. (8) can be rewritten as $g(\cdot; W) = f(\phi(\cdot; \theta); w)$, where the visual similarity learning task and the helmet use classification task share the weights $\theta$ in the training and prediction process. This not only significantly decreases the computational cost, but also improves generalization by using the domain information contained in the related tasks [34].
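For the first and third tasks, we use the cross-entropy loss for optimization: $\mathcal{L}_{\mathrm{CE}}(y, p) = -\sum_{k=1}^{K} y_k \log p_k$, where $y$ is a one-hot vector that encodes the ground truth helmet use class, $p$ is the predicted class probability vector, and $K$ is the number of annotated helmet use classes. In the HELMET dataset, $K = 36$.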
For the second task, we consider the contrastive loss [35], which is defined over $S$, an index set consisting of image pairs that come from the same track, and $D$, an index set consisting of image pairs that come from different tracks. By minimizing the contrastive loss function, we expect the distance $d(x^{(a)}, x^{(b)})$ of an image pair in the same track to be less than a threshold $\tau_1$, and that of an image pair in different tracks to be larger than a threshold $\tau_2$. In this work, $\tau_1$ and $\tau_2$ are empirically set to 1 and 5, respectively.
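The loss function of the MTL framework is defined as the weighted sum of the three task losses, $\mathcal{L} = \sum_{k=1}^{3} \gamma_k \mathcal{L}_k$, where $\mathcal{L}_k$ is the loss of task $k$ and $\gamma_k$ corresponds to the weight of task $k$. In our work, we assume each task has equal weight, namely $\gamma_k = 1/3$.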
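The three task losses and their combination can be sketched in plain Python as follows. The cross-entropy and the equal task weights follow the text. The exact form of the contrastive loss is an assumption of this sketch: a double-margin squared hinge, averaged over pairs, that is zero for same-track pairs closer than $\tau_1$ and for different-track pairs farther apart than $\tau_2$. All function names are hypothetical.

```python
import math

TAU_1, TAU_2 = 1.0, 5.0  # thresholds stated in the text


def cross_entropy(y_onehot, p):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(y * math.log(q) for y, q in zip(y_onehot, p) if y > 0)


def contrastive_loss(pairs):
    """Mean double-margin contrastive loss over a list of pairs.

    pairs: list of (feat_a, feat_b, same_track) tuples.
    ASSUMPTION: squared hinge penalties on both margins.
    """
    total = 0.0
    for feat_a, feat_b, same_track in pairs:
        d = math.dist(feat_a, feat_b)
        if same_track:
            total += max(0.0, d - TAU_1) ** 2   # pull together below tau_1
        else:
            total += max(0.0, TAU_2 - d) ** 2   # push apart beyond tau_2
    return total / len(pairs)


def mtl_loss(task_losses, gammas=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the task losses; equal weights by default."""
    return sum(g * l for g, l in zip(gammas, task_losses))
```

As a usage example, `mtl_loss([ce_a, contrastive, ce_b])` combines the two classification losses and the similarity loss computed for one batch of image pairs.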

IV. HELMET DATASET
The HELMET dataset is an extension of our previous work [10], [12]. Here we briefly describe how we created and annotated the dataset, as well as how to evaluate the performance of helmet use detection approaches based on the dataset.

A. DATASET CREATION AND ANNOTATION
The source data for the HELMET dataset consists of 385 hours of traffic video, recorded in 2016 over a two-month period in the country of Myanmar. Video data collection was planned and conducted in close consultation with the Myanmar Traffic Police Force, to ensure adherence to local laws and regulations. Using two video cameras built from a Raspberry Pi 3 mini-computer and a Raspberry Pi camera module, 13 observation sites around seven cities in Myanmar were recorded at a rate of 10 frames per second, with a resolution of 1920 × 1080 pixels. The recorded data include diverse road environments, various traffic densities, and different weather conditions. Before selecting video data for the HELMET dataset, the underlying source data was cleaned up, and video sections were removed when they contained motion blur due to cloudy weather or rain, which would have prohibited their detailed annotation. After this pre-processing, video clips of 242 hours length taken at 12 observation sites remained.
To most efficiently utilize the available annotation resources, it was decided to preferentially annotate video sections that contain a high number of motorcycles. After splitting up the video data from each observation site into 10 second video clips (100 frames per clip), the pre-trained YOLO9000 [27] object detection algorithm was applied to identify the number of motorcycles in each clip. Broadly maintaining the share of individual observation sites in the source data, 910 video clips with the highest number of motorcycles (identified through YOLO9000) were chosen for annotation. The resulting distribution of the 910 sampled video clips (91,000 frames) is presented in Table 3.
Data in the HELMET dataset was annotated by drawing a rectangular bounding box around motorcycles, and adding information on the number of riders, their positions, and rider-specific helmet use. The structure for rider position annotation is shown in Figure 3, distinguishing between the driver (D), multiple passengers (P1-P3), and a child passenger (P0) standing on the floorboard of the motorcycle in front of the driver. Bounding boxes of individual motorcycles are linked over subsequent frames in the annotation process, i.e. the identification of bounding boxes belonging to individual motorcycles is possible in the HELMET dataset, forming the basis of the developed tracking approach. This track of motorcycle bounding boxes belonging to a single motorcycle, imprinted with information on rider number, rider position, and helmet use, is defined as a continuous helmet use event (CHUE). The number of CHUEs is identical to the number of individual motorcycles observed (Table 3).
All annotation was conducted using the program BeaverDam [36], and each annotation was verified by a second annotator. The 91,000 annotated frames from 12 observation sites form the HELMET dataset, which can be accessed by researchers free of charge with the publication of this article [17].

B. EVALUATION METRIC
To facilitate a consistent evaluation of the performance of approaches for helmet use detection of tracked motorcycles, we adapt a metric for continuous visual event recognition proposed in [37] for use with CHUE.
Let a CHUE be a tuple $E = (L, T)$, where $L$ is the helmet use class and $T$ is the motorcycle track. The motorcycle track $T$ is defined as $T = \{(f_i, b_i)\}_{i=1}^{N}$, where $f_i$ is the frame number, $b_i = (l_i, u_i, w_i, h_i)$ is the bounding box defined by location information within the frame: left ($l_i$), upper ($u_i$), width ($w_i$), and height ($h_i$), and $N$ is the duration of the track.
A detected helmet use event $E_{\mathrm{Detect}}$ is regarded as a correct detection w.r.t. a ground truth event $E_{\mathrm{GT}}$ only if it satisfies the following conditions:
• Given a bounding box pair from $E_{\mathrm{Detect}}$ and $E_{\mathrm{GT}}$ in an individual frame $f_i$, a correct frame detection is defined as an intersection over union (IoU) of above 50% between $E_{\mathrm{Detect}}$ and $E_{\mathrm{GT}}$ in $f_i$.
• Given a number of correct frame detections, a correct track detection is registered if the ratio of correct individual detections in $E_{\mathrm{Detect}}$ in relation to the track duration $N$ of $E_{\mathrm{GT}}$ is above 50%.
• Given a predicted helmet use class $L_{\mathrm{Detect}}$, a correct detection is registered if $L_{\mathrm{Detect}}$ is identical to the labeled class $L_{\mathrm{GT}}$.
Following these criteria, we are able to measure the performance of an approach by the following metrics: precision, recall, and weighted aggregate F-measure.
• Precision is the ratio of the number of correct $E_{\mathrm{Detect}}$ to the total number of $E_{\mathrm{Detect}}$ (correct and incorrect) in the i-th class.
• Recall is the ratio of the number of correct $E_{\mathrm{Detect}}$ to the number of correct $E_{\mathrm{Detect}}$ combined with missed $E_{\mathrm{GT}}$ in the i-th class.
• For the i-th class, the F-measure is the harmonic mean of precision and recall: $F_i = 2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i / (\mathrm{Precision}_i + \mathrm{Recall}_i)$.
• Since samples across all helmet use classes $L_{\mathrm{GT}}$ are imbalanced, we use a weighted aggregate F-measure in the dataset, defined as $F_w = \sum_{i=1}^{C} w_i F_i$, where $C$ is the number of helmet use classes, and the weight on the i-th class $w_i$ is proportional to the number of samples in the i-th class; the $w_i$ sum to one.
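The matching criteria and metrics above can be sketched as follows. This is an illustrative reading of the definitions, not the released evaluation code: the representation of a track as a dictionary from frame number to box, the function names, and the class label `"DHelmet"` are all hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (l, u, w, h)."""
    la, ua, wa, ha = box_a
    lb, ub, wb, hb = box_b
    iw = max(0.0, min(la + wa, lb + wb) - max(la, lb))
    ih = max(0.0, min(ua + ha, ub + hb) - max(ua, ub))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0


def is_correct_event(detected, ground_truth):
    """Check a detected event against a ground truth event.

    Each event is (label, track); a track maps frame number -> box.
    """
    (label_d, track_d), (label_gt, track_gt) = detected, ground_truth
    # Frames with IoU above 50% count as correct frame detections.
    hits = sum(1 for f, b in track_gt.items()
               if f in track_d and iou(track_d[f], b) > 0.5)
    # More than half of the ground truth duration must be covered,
    # and the predicted class must equal the labeled class.
    return hits / len(track_gt) > 0.5 and label_d == label_gt


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def weighted_f_measure(per_class):
    """per_class maps label -> (precision, recall, n_samples)."""
    n_total = sum(n for _, _, n in per_class.values())
    return sum(n / n_total * f_measure(p, r)
               for p, r, n in per_class.values())


# Toy example: the detection covers 3 of 4 ground truth frames.
gt = ("DHelmet", {f: (f, 0, 10, 10) for f in (1, 2, 3, 4)})
det = ("DHelmet", {f: (f, 0, 10, 10) for f in (1, 2, 3)})
correct = is_correct_event(det, gt)
```

Weighting each class by its sample count keeps rare rider configurations from dominating the aggregate score.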

V. EXPERIMENTS
A. TRAINING SETUP
To evaluate our proposed method, the 910 annotated video clips were randomly divided into three non-overlapping subsets: a training set (70%), a validation set (10%), and a test set (20%) according to each individual site, as shown in Table 4. We used the training set to train our proposed method, and used the validation set to find the best generalizing model. Given the best generalizing model, we report the final model performance on the test set. For multiple motorcycle tracking, the parameters $A$, $H$, $Q$, and $R$ of the Kalman filter are predefined.
To train and evaluate the MTL model, we first generated all pairs of image patches in each video clip. Next, we randomly sampled 2,000,000, 100,000, and 200,000 pairs from the training, validation, and test sets, respectively. In each subset, 50% of the image pairs come from the same tracks and 50% come from different tracks. In our work, the CNN body was initialized with weights pre-trained on ImageNet [38], and all FC layers were initialized with random weights.
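The balanced pair sampling described above can be sketched as follows. The data layout (a list of `(patch_id, track_id)` tuples for one video clip), the function name, and the fixed seed are assumptions of this sketch.

```python
import random
from itertools import combinations


def sample_balanced_pairs(patches, n_pairs, seed=0):
    """Sample image pairs, half from the same track, half from different tracks.

    patches: list of (patch_id, track_id) tuples from one video clip.
    Returns a list of (patch_id_a, patch_id_b, same_track) tuples.
    """
    rng = random.Random(seed)
    same, different = [], []
    # Enumerate all pairs in the clip and split them by track membership.
    for (a, track_a), (b, track_b) in combinations(patches, 2):
        is_same = track_a == track_b
        (same if is_same else different).append((a, b, is_same))
    half = n_pairs // 2
    return rng.sample(same, half) + rng.sample(different, n_pairs - half)


# Toy clip: two hypothetical tracks with three patches each.
clip = [(0, "t1"), (1, "t1"), (2, "t1"), (3, "t2"), (4, "t2"), (5, "t2")]
pairs = sample_balanced_pairs(clip, n_pairs=4)
```

Balancing the two pair types prevents the far more numerous different-track pairs from dominating the similarity loss.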
For both deep learning models, i.e. the motorcycle detection model and the MTL model, the Adam optimizer [39] was used with the default parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a custom learning rate $\alpha$. In our experiments, we tried $\alpha = 10^{-2}, 10^{-3}, \ldots, 10^{-5}$ and chose the value that gives the best validation result. Considering the large size of the data, we only trained for 10 epochs, with a batch size of 2 for the motorcycle detection model and 128 for the MTL model. In the training process, we saved the best model that gave the minimum loss on the validation set and reported its final performance on the test set.

B. RESULTS AND ANALYSIS FOR MOTORCYCLE DETECTION
The first step in our approach is the detection of active motorcycles (Step 1 in Fig. 1). As described in Section III-A, a fine-tuned RetinaNet was used for this task. To evaluate the performance of our model, we use the average precision (AP) metric [40]. The AP on the test set is very high, achieving 95.3% for the detection of motorcycles. Fig. 4 visualizes the motorcycle detection results for four sampled frames. It can be observed that the motorcycle detection through RetinaNet is very close to the human annotation, despite the occlusions occurring to some motorcycles and riders.

C. RESULTS AND ANALYSIS FOR MULTI-TASK LEARNING
As helmet use detection and visual similarity are two tasks in the MTL procedure, their results are presented separately.
For helmet use classification, as there are two such tasks in MTL, we selected the output with a lower loss on the validation set. Using this model resulted in an 80.6% accuracy for the detection of motorcycle helmet use classes on all 54,529 annotated bounding boxes in the test set. In other words, in 80.6% of all detected active motorcycles in the test set, our approach correctly classified the number of riders, their position, and their helmet use.
For visual similarity learning, two types of errors can occur: different motorcycles can be falsely classified as belonging to the same track, or the same motorcycle can be falsely classified as belonging to different tracks. To evaluate

D. ABLATION STUDY
To investigate the effectiveness of the key components of our approach, we conduct six ablation experiments, in which different components of our approach are removed or replaced to learn more about their contribution to detection accuracy and computational efficiency. As shown in Table 5, for the first two ablation experiments, we use frame-based motorcycle and helmet use detection from a single CNN (YOLOv2 or RetinaNet) as the basis for detection, adding motorcycle tracking through motion distance from our approach. The class of a motorcycle is determined by the majority of predicted helmet use classes of its track. For the additional four ablation experiments, we use the approach presented in this paper, but with different networks, i.e., YOLOv2 or RetinaNet, for motorcycle detection and [...]
For computational speed, it can be observed that the first two approaches achieve the highest number of processed frames per second (25.12 and 13.42 FPS), as they only use one network to simultaneously predict motorcycle and helmet use class. However, this high speed is achieved at the expense of detection accuracy, as the two approaches are prone to produce missing detections due to the imbalanced helmet use classes. This in turn decreases tracking performance. Comparing RetinaNet and YOLOv2, the advantage of the multi-scale feature pyramid with focal loss is apparent, as RetinaNet has a higher accuracy than YOLOv2, at the expense of processing speed. This difference is present whether motorcycles and helmet use are detected simultaneously or separately. Finally, combining motion similarity with visual similarity during tracking (our hybrid tracking approach) improves the F-measure value with little additional computational cost.

E. RESULTS AND ANALYSIS OF HELMET USE DETECTION
Detailed results of our proposed approach for helmet use detection of tracked motorcycles in each individual class are presented in Table 6. We achieved a 67.3% weighted F-measure on the test set of the HELMET dataset. Our approach works well on common classes of up to two riders per motorcycle. Considering only these common classes, the weighted F-measure improves to 70.6%. Fig. 7 shows detection results on some sampled frames. For more detailed results, we have attached video samples of our approach as supplementary files.
In addition, we present the location-wise performance in Fig. 8, which shows the weighted F-measure for each observation site in the HELMET dataset. Video clips from the test set for all sites can be found in the supplementary files of this article. It can be observed that the approach works well for most locations, with nine of the observation sites showing an F-measure of around 70% and above. However, comparatively low accuracy can be observed for Bago_urban and NyaungU_urban, with F-measures slightly below 60%, and Yangon_II, with an F-measure of only 39.1%. Looking at the video clips in the test dataset (see supplementary files), it becomes apparent that the three sites with the lowest F-measures have properties that can be linked to the low accuracy of our approach in these environments. Bago_urban and Yangon_II contain a crossroad, i.e. motorcycles appear in the observation camera's view at unusual angles compared to the other sites. The observation site NyaungU_urban contains a large number of parked motorcycles on which riders rest. Our approach detects and registers these motorcycles, which counts as inaccurate detection, since parked motorcycles were not annotated in the annotation process, as they are not actively used.

F. COMPUTATIONAL COST
Our approach was implemented using the Python Keras library with TensorFlow as a backend and ran on two NVIDIA Titan Xp GPUs. In our implementation, instead of keeping every cropped image patch in a track, we retain only its visual feature and helmet use prediction output, which reduces both memory use and computation time.
The overall processing speed of our method is 8.32 FPS. More specifically, the computational time for motorcycle detection is 0.059 seconds per frame; the computational time for visual feature extraction and patch-based helmet use classification is 0.058 seconds per frame; and the computational time for tracking is negligible, merely 0.003 seconds per frame.
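The relation between the per-stage times and the overall speed can be checked with a short calculation. The sketch below assumes the three stages run sequentially for each frame; the variable and function names are illustrative. The 10 FPS video frame rate used for comparison is the one referred to in the conclusion.

```python
# Per-frame times in seconds, as reported above.
STAGE_SECONDS = {
    "motorcycle_detection": 0.059,
    "feature_extraction_and_classification": 0.058,
    "tracking": 0.003,
}


def frames_per_second(stage_seconds):
    """Throughput when all stages run one after another on each frame."""
    return 1.0 / sum(stage_seconds.values())


def realtime_ratio(processing_fps, video_fps):
    """Fraction of real-time speed reached for a given video frame rate."""
    return processing_fps / video_fps


fps_from_stages = frames_per_second(STAGE_SECONDS)
ratio_at_10_fps = realtime_ratio(8.32, 10.0)
```

The stage times sum to 0.120 seconds per frame, which corresponds to roughly 8.3 FPS. The small difference from the measured 8.32 FPS is presumably due to rounding of the per-stage figures.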

VI. CONCLUSION
In this paper, we have proposed a deep learning based method to automatically perform four elements of human observer motorcycle helmet use registration, i.e. detection and tracking of active motorcycles, as well as identification of rider number per motorcycle, rider position, and rider specific helmet use. In addition, we have applied our approach to video data from diverse road environments, which included adverse factors such as occlusion, differences in camera angle, an imbalanced number of coded classes, as well as differing rider numbers per motorcycle and varying traffic densities. All of these elements make our approach more comprehensive than earlier approaches for the automated detection of motorcycle helmet use (see Table 1). Our results show a generally high accuracy of our approach. For the element of frame-based detection of motorcycles, we achieve an average precision of 95.3%. The visual similarity element of motorcycle tracking achieves an AUC of 0.967 in this first application of CNN-based tracking of active motorcycles. For the element of detection of helmet use class, i.e. the registration of rider number, position, and rider specific helmet use, we achieve an accuracy of 80.6% on a frame-based level. The class imbalance in the HELMET dataset in particular contributes to misclassifications. For the comprehensive application of our approach, all its elements are combined, i.e. motorcycle detection, tracking, and helmet use class prediction are jointly applied. Our results show a weighted F-measure of 67.3% for the helmet use detection of tracked motorcycles, showing that our approach can be used to generate reliable motorcycle, rider number, and position specific helmet use estimates. The results of our ablation study show that our approach achieves comparatively high accuracy relative to the ablated variants.
While this high accuracy comes at the expense of computational efficiency, our approach can process more than 8 FPS on consumer hardware, which is close to real-time speed for 10 FPS video data. Overall, our work shows that all four basic elements of helmet use registration through human observers can be implemented in a CNN-based approach that is computationally efficient on consumer hardware. Furthermore, the inclusion of detailed rider differentiation is an enhancement of existing approaches. In addition to presenting our helmet use detection approach, we publish the HELMET dataset with this paper, which includes diverse traffic video data that can be used to train and evaluate similar approaches. Since existing datasets have a number of shortcomings and are not readily available to researchers, we hope that the publication of the HELMET dataset will advance the development and evaluation of detection approaches similar to the one in this paper.
There are some limitations to our work. The detection accuracy can be considerably compromised in uncommon traffic environments and in street scenes with parked motorcycles. Hence, the current approach is partly constrained in its real-world applicability, as site-specific elements could decrease detection accuracy. While the HELMET dataset is a first step towards using more diverse datasets for the development of automated helmet detection approaches, further data needs to be collected to make approaches universally applicable.
For future research, we intend to enhance the HELMET dataset by incorporating scenes with more diverse traffic infrastructure, e.g. crossroads, to ensure more robust application of the approach. We will also acquire more training data that contains parked motorcycles, so that these objects are no longer detected as false positives.