Neural-Network-Optimized Vehicle Classification Using Clustered Image and Fiber-Sensor Datasets

The Internet of Things (IoT) has become indispensable for the transport and automotive industries to advance on-road traffic monitoring. Indeed, smart management tools and machine learning concepts are inevitable in vehicle categorization systems. However, to date, existing systems for vehicle classification are based exclusively on singular technological platforms. This not only limits their long-term use and future scaling, but also restricts the classification accuracies obtainable with modern machine learning tools fed with diversified, high-volume data. In this work, we design a novel convolutional neural network (CNN) that substantially improves on-road vehicle classification. In particular, we experimentally harness, to the best of our knowledge for the first time, two different datasets from separate technological platforms based on closed-circuit television (CCTV) and fiber Bragg grating (FBG) sensors, respectively. The hybrid CNN classification system, with individual CCTV and FBG datasets, substantially improves detection levels, reaching in-class accuracies of 90%-97%. Moreover, this classification concept includes an intrinsic back-up verification with respect to each platform, compensating the shortcomings of the individual technologies. Our demonstration can make key advances towards near-unity accuracy in vehicle classification for IoT systems, capitalizing on cost-effective and well-established platforms.


I. INTRODUCTION
The Internet of Things (IoT) is seen as a universal solution to merge diverse technologies into a conceptual network. IoT allows different devices to mutually sense their surroundings and then communicate with an instant response [1], [2], [3]. The IoT is particularly appealing for data-intensive applications [4]. ''Big data'' applications, which range from military, medicine, and smart buildings to intelligent transport and the automotive industry, are good examples. In recent years, the IoT has become indispensable for the transport and automotive industries to advance activities such as vehicle classification, instant on-road control, or traffic management, among others [1], [2], [3], [4], [5], [6]. (The associate editor coordinating the review of this manuscript and approving it for publication was Muguang Wang.)
Fiber-optic technologies have proved compelling for traffic surveillance as they afford a virtually unlimited capacity compared to electrical wires, while fiber sensors constitute a close-to-ideal choice for optical sensing. Coexistence between a high-speed network and an optical sensing platform is highly desired to keep both cost and complexity at low levels [36]. Fiber Bragg gratings (FBGs) emerge as an attractive optical platform to form single- and multi-point sensor arrays [26], [27], [35], [36], [37], [38], [39]. In FBGs, the reflected light adds constructively in the backward direction and creates a distinct narrowband drop in the transmission, while the rest of the light passes through the grating. External perturbations induce a spectral shift compared to the initial position, and this change can be clearly determined. This way, the FBG platform unleashes promises for effective vehicle categorization, especially in dense traffic situations. However, FBG systems face a range of obstacles, including instrumentation, installation, and calibration issues, while requiring improved data collection, vehicle categorization, and overall detection accuracy, i.e., the IoT system parts that are mostly driven by modern processing techniques.
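The spectral behavior described above follows the standard Bragg condition; as a textbook relation (not given explicitly here), the reflected wavelength and its first-order strain response can be written as:

```latex
\lambda_B = 2\, n_{\mathrm{eff}}\, \Lambda,
\qquad
\frac{\Delta \lambda_B}{\lambda_B} = (1 - p_e)\,\varepsilon,
```

where \(n_{\mathrm{eff}}\) is the effective refractive index of the fiber core, \(\Lambda\) the grating period, \(p_e\) the photo-elastic coefficient, and \(\varepsilon\) the strain induced, for example, by a passing vehicle.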
In this work, we propose and demonstrate a novel CNN architecture for enhanced vehicle classification. Here, we harness a clustered dataset originating from a hybrid technological platform. The hybrid system combines data inputs from a CCTV system and FBG sensors, respectively, a vital coexistence that has remained unexplored to date. Clustered CNN-based classification improves vehicle detection accuracy and provides an intrinsic back-up verification with respect to each separate technology.

II. TESTING PLATFORMS AND DATA COLLECTION
In this work, the technological base for vehicle classification comprises two experimental platforms: an optical sensor network and a visual-based CCTV system. Indeed, each of them can be operated separately, but this brings a range of limitations. Measurements were performed over three weeks. The total number of vehicles passing through the platform was up to 2000 per day.

A. FBG SENSOR NETWORK
The sensor network comprises an FBG array. This in-house testing platform, schematically shown in Fig. 1, is situated at the campus of the University of Zilina, Slovakia. The FBG sensor array is connected to two active interrogators, which collect the output data from the sensor array. The sensing array includes active induction loops to help detect passing vehicles. For this work, the inductive loops were disconnected, and the FBG sensor array was used only passively, i.e., without additional electricity, which reduces power consumption. From an installation point of view, the FBG sensors were mounted invasively into the ground, i.e., directly into the existing road in a two-layer asphalt arrangement.
In the optical array, FBGs are positioned both vertically (parallel to the vehicle wheels) and horizontally (perpendicular to the vehicle wheels), which enables different experimental topologies to be set up. For our investigation, we primarily focus on the spectral response of the FBG sensor array, for now without the possibility of operating the platform dynamically to detect vehicle speed or exploit weigh-in-motion functions. The horizontally oriented (HO) fiber sensors are used at the beginning only to verify platform operation and working conditions. The vertical FBG array is divided into two 1.75 m long parts, located on both sides of the road, respecting the road center. The sensor array on each side comprises 36 vertically oriented (VO) FBGs. The reference spacing between the individual optical sensors is 100 mm. The FBG sensors measure the deformation caused by pressure once a vehicle passes through the platform. The double-sided sensor topology provides a back-up solution for situations when the optical fiber is damaged or fiber connections are interrupted. Output data from the on-road sensor cluster are transmitted through single-mode optical fibers (SMF-28) to two four-channel interrogators. The interrogators cover a spectral range from 1485 nm to 1610 nm and are situated in the local data center, 700 m away from the experimental platform. On the interrogator side, there are standard opto-electrical input/output (I/O) interfaces: four optical ports, two USB ports, one Ethernet port, one HDMI port, and an RS485 serial interface. The interrogators use a 12 V DC industrial adapter as input power. An RJ-45 Ethernet port allows connection to a wired 10/100/1000 Mbps local area network. The optical fiber ports connect the interrogator to the sensor array; in particular, each I/O fiber port connects 19 FBGs. The output spectrum from the sensing array at one of the interrogator channels is shown in Fig. 2.
The interrogators process the data from the sensor network offline, with a recording frequency of 500 samples per second. Examples of spectral outputs from the FBG sensors for vertically positioned configurations are shown in Fig. 3. Here, we can observe a detail of a vehicle passage with an optimal trajectory over the sensor array, having an axle wheelbase of 2570 mm. The measured values correspond to the wavelength shift of the reflected optical signal. This way, we create a two-dimensional pressure map of the vehicle as a function of time and FBG sensor position. Sensor data were collected from 2000 tests using 36 vertical FBG sensors. The dataset comprises 500 samples per second × a 4 s time slot per single vehicle.
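The assembly of the two-dimensional pressure map and its normalization to toned-image values can be sketched as follows; the random array merely stands in for the measured wavelength shifts, and the exact preprocessing pipeline is an assumption on our part:

```python
import numpy as np

# One vehicle passage: 500 samples/s x 4 s time slot x 36 vertical FBGs.
# Each value represents the wavelength shift of the reflected signal;
# random data is a placeholder for real interrogator output.
rng = np.random.default_rng(0)
pressure_map = rng.random((500 * 4, 36))

def normalize_to_image(shifts):
    """Min-max normalize wavelength shifts to the [0, 1] toned-image
    range (shaded gray colors) used for the TIFF training data."""
    lo, hi = shifts.min(), shifts.max()
    if hi == lo:
        return np.zeros_like(shifts)
    return (shifts - lo) / (hi - lo)

img = normalize_to_image(pressure_map)
```

The normalized array can then be scaled to 8-bit gray levels and stored as a TIFF image for CNN training.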
The time period is sufficiently long to cover the majority of vehicles that passed in and out of the platform, with a restricted-area speed of 30 km/h. Moreover, correct detection of the front and back axles is always obtained at different positions. In turn, this helps to reduce the FBG dataset down to 600 correct measurements × 36 vertical FBGs per passing vehicle. Furthermore, from the dataset analysis, we found that a maximum of 5 sensors from the whole array are primarily responsible for the spectral outputs, which lowers the final dataset to 600 tests × 5 vertical sensors. The spectral shifts initiated by the remaining sensors (compared to the 5 dominant ones) are close to zero, and thus their contribution can be neglected. Last but not least, the measured data are normalized to toned image values (shaded gray colors), ranging from 0 to 1. Output data are stored in the Tagged Image File Format (TIFF). The output TIFF data, shown in Fig. 4, are then used to train and test CNNs and to categorize particular vehicles. The system for car identification encompasses the body type of the vehicle, the number of axles, the wheelbase, as well as the number of vehicles driving in and out of the platform. Table 1 sums up the car categories and corresponding datasets. In total, there are 5 specific car categories (see Tab. 1). The corresponding parameters (specific car characteristics) are listed in Tab. 2. Specific characteristics for the 5 vehicle categories are also depicted in Fig. 5 and explained below [37], [38]:
• A: The tire width is marked by the first three numbers (from side to side) in millimeters.
• B: The next two values determine the aspect ratio (the tire height as a percentage of the width). In our case, the aspect ratio is 55. This means that the sidewall height is 55 percent of the tire width.
• C: The construction type is represented by a single letter which describes the type of the internal construction of the tire (R is for radial tires and D is for tires built with diagonal plies).
• D: The last number describes the diameter of the wheel in inches.
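The A-D marking scheme above can be decoded mechanically; a minimal sketch follows, where the regular expression and the example marking "205/55 R16" are our own illustration (the aspect ratio 55 is taken from the text, the other values are hypothetical):

```python
import re

def parse_tire_code(code):
    """Parse a tire marking such as '205/55 R16' into the A-D fields
    described above."""
    m = re.fullmatch(r"(\d{3})/(\d{2})\s*([RD])(\d{2})", code.strip())
    if not m:
        raise ValueError(f"unrecognized tire code: {code!r}")
    width, aspect, construction, diameter = m.groups()
    return {
        "width_mm": int(width),           # A: tire width in millimeters
        "aspect_ratio_pct": int(aspect),  # B: sidewall height as % of width
        "construction": construction,     # C: R = radial, D = diagonal plies
        "rim_diameter_in": int(diameter), # D: wheel diameter in inches
    }

spec = parse_tire_code("205/55 R16")
```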

B. CCTV SYSTEM
The CCTV system represents a conventional image-based recognition technique. Here, visual recognition was carried out with an industrial camera situated in close proximity (about 100 m) to the experimental FBG platform. Data were retrieved according to the local GDPR rules. The CCTV system operates with digitalized data from a recorded video stream, which is then transformed into static images. Once the vehicles pass through the platform, the image characteristics change. The camera has a resolution of 1920 × 1080 pixels at 30 frames per second. The camera was supplied via a Power over Ethernet (PoE) interface, and the video stream was recorded on a computer using Open Broadcaster Software Studio. For vehicle classification, only images of passing cars were selected from the entire video stream, and then only the points of interest, i.e., the passing vehicle itself, were captured from those images. For this, images are shrunk down to a square resolution of 800 × 800 pixels with a 1:1 aspect ratio. Images with shadows, blurs, or images affected by weather conditions were excluded from further processing. By repeating this process, we retrieved final images of vehicles that are assigned to the specific car categories. Different vehicles from the CCTV system are shown in Fig. 6. In turn, this creates the final image dataset for neural network training and testing. CNNs are conventionally leveraged to recognize two-dimensional (2-D) image patterns from the pixels of the input data with minimal pre-processing. In our case, the CNN classifies an input image into one of the five vehicle categories defined in Tab. 1.
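The reduction of a 1920 × 1080 frame to a square 800 × 800 image can be sketched as below; the center-crop plus nearest-neighbor resize is our own minimal illustration, not necessarily the authors' exact pipeline:

```python
import numpy as np

def to_square(img, size=800):
    """Center-crop a frame to a 1:1 aspect ratio, then resize to
    size x size using nearest-neighbor index selection."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = img[top:top + s, left:left + s]
    idx = np.arange(size) * s // size  # nearest-neighbor sample grid
    return crop[idx][:, idx]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # placeholder CCTV frame
square = to_square(frame)
```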
The main operations performed by a CNN rely on: (i) a convolutional layer to extract various features from the input image; for a reliable process, these layers (and thus the operations they perform) are repeated several times; (ii) a non-linearity (ReLU) layer that serves as an activation function for data processing; (iii) a MaxPooling or sub-sampling layer that searches for the largest element in the feature map and forms a bridge (inter-layer) between the initial convolutional layers and the ending fully connected (FC) layer; and (iv) a fully connected layer (classification). This way, the CNN can learn more and more complex features. In the first layers, the network encodes low-level image features (such as edge detectors and simple color transitions). In the subsequent layers, features for shapes (such as semicircles or multicolor gradients) are described. The last CNN layers comprise features responsible for individual image objects or complex image shapes.
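Operations (i)-(iii) above can be expressed compactly in NumPy; this is a didactic sketch of the individual layer operations, not the framework code used in the study:

```python
import numpy as np

def relu(x):
    """(ii) Non-linearity layer: element-wise rectifier."""
    return np.maximum(x, 0.0)

def conv2d(fmap, kernel):
    """(i) 'Valid' 2-D cross-correlation, the core convolution step."""
    kh, kw = kernel.shape
    h, w = fmap.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(fmap[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, k=2):
    """(iii) MaxPooling: keep the largest element in each k x k window."""
    h, w = fmap.shape
    h2, w2 = h // k, w // k
    return fmap[:h2 * k, :w2 * k].reshape(h2, k, w2, k).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input "image"
feat = relu(conv2d(x, np.ones((2, 2))))        # 3x3 feature map
pooled = max_pool2d(x)                         # 2x2 down-scaled map
```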

III. RESULTS AND DISCUSSION
A. CONVENTIONAL CNN CLASSIFICATION
For the given datasets, training and testing are realized with re-sized image inputs to match the input sizes of open-access CNNs (AlexNet, GoogleNet, ResNet-50, and ResNet-101), as defined in Tab. 3. These images pass through the CNN convolutional layers, whose filters extract local image features.
The CNN activation function determines the output values of individual neurons based on their internal potential. The internal potential is the dot product between the weights and the input. Outputs from individual windows are combined to create a new down-scaled feature map. Once this process is applied to all new maps, an additional set of feature maps is created, which then forms the input to the next convolutional layer, where the whole process is repeated. The output of the convolutional layers is flattened and passed to a fully connected layer. Finally, a Softmax output layer is used for image classification. This way, we determine and link a specific input image with the appropriate vehicle class. The confusion matrix, schematically shown in Fig. 8a, summarizes the result of the classifier [39], [40]. Rows of the confusion matrix are indexed by the actual classes (corresponding to reality), while columns are sorted according to the classes predicted by the model (estimation/prediction). Thus, each column within the confusion matrix represents a predicted class, and each row represents an actual (correctly assigned) class. Success parameters, specifically P, R, and the F1-score, were calculated from the obtained values. Precision (P) is the ratio between true positives (TP) and the sum of positive predictions (true positives (TP) and false positives (FP)). Recall (R) is the ratio between true positives (TP) and the sum of data from the actual class (true positives (TP) and false negatives (FN)). The F1-score combines precision and recall, i.e., it is the weighted average of precision and recall. The resulting accuracy is obtained as the ratio between the sum of correctly predicted samples and the sum of all samples.
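The evaluation criteria defined above can be computed directly from the confusion matrix; the 2-class matrix used in the demo is a toy example, not data from the paper:

```python
import numpy as np

def classification_scores(cm):
    """Per-class precision, recall, F1 and overall accuracy from a
    confusion matrix whose rows index the actual classes and whose
    columns index the predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = cm.diagonal()
    fp = cm.sum(axis=0) - tp   # predicted as the class, actually another
    fn = cm.sum(axis=1) - tp   # actually the class, predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Toy 2-class example: 8 and 9 correct, 2 and 1 misassigned
p, r, f1, acc = classification_scores([[8, 2], [1, 9]])
```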
In particular, P = TP/(TP + FP), R = TP/(TP + FN), and F1 = 2 × P × R/(P + R). To monitor the generalization performance and select the optimal study model, the dataset was divided into a training set with 70% of the data and a validation set with the remaining 30%. The performance of the proposed CNN on the training set continually improved, while the CNN performance on the validation set reached a saturation point, after which the network starts to overfit the training data; at that point the network learning algorithm was terminated. The implementation of the CNN models relies on the TensorFlow framework. Moreover, from the results summarized in Tab. 4, we can observe that GoogleNet and ResNet-50 performed better than AlexNet. The obtained precision levels are 34%-62% for AlexNet, 51%-77% for GoogleNet, 59%-78% for ResNet-50, and 61%-86% for ResNet-101. From a precision point of view (classification results), by comparing the same vehicle class for different types of standard CNNs, we can clearly see this trend. These results arise from the fact that the GoogleNet network uses combinations of inception modules, each of which encompasses MaxPooling, convolutions at different scales, and concatenation operations.
In addition, ResNet-50 and ResNet-101 are CNNs with depths of 50 and 101 layers, respectively. The ResNet-50 and ResNet-101 models replace each two-layer residual block with a three-layer bottleneck block. This block uses 1 × 1 convolutions to reduce and subsequently restore the channel depth, allowing for a reduced computational load when calculating the 3 × 3 convolution. These specific features of both CNNs are particularly advantageous, facilitating more accurate categorization. On the other hand, AlexNet is formed by 5 convolutional layers, followed by 3 fully connected layers, and finally a 1000-way Softmax, which corresponds to the probabilities of 1000 categories (in our case, we used five vehicle classes as the output). Typically, this network includes a repetition of a few convolutional layers, each followed by MaxPooling, and a few dense layers. However, there was no standard for the filter sizes to be used, which is a significant shortcoming of this CNN.
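The computational saving of the bottleneck block can be illustrated with a simple weight count; the channel sizes (256 reduced to 64) are taken from the standard ResNet design and are not stated in the text:

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases and batch
    normalization ignored for simplicity)."""
    return k * k * c_in * c_out

# Two-layer residual block: two 3x3 convolutions at 256 channels
basic = 2 * conv_params(3, 256, 256)

# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 at 64, 1x1 restores 64 -> 256
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))
```

The bottleneck variant needs roughly 17 times fewer weights, which is why the deeper ResNets remain computationally tractable.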

B. HYBRID IMAGE-AND SENSOR-BASED NEURAL NETWORK
The proposed CNN, schematically shown in Fig. 9 and described in Tab. 6, comprises two parts: sensor- and image-based branches, respectively. This is a hybrid CNN structure with two independent inputs (sensor- and image-based data). The first network (sensor-based) is fed with 600-d inputs, while the second one (image-based) accepts 32-d inputs. Both networks operate separately and are then merged into a single CNN that performs the final classification.
The individual layers of the hybrid CNN are described below.

1) SENSOR DATA BRANCH
• The sensor data was used as input. These input data were resized.
• The second to fourth blocks of the sensor part were 2D CNN layers, which have 32 feature maps with 3 × 3 kernel dimension. The Rectifier linear unit (ReLU) was used as an activation function.
• In the next block, the MaxPooling layer with kernel dimension 2 × 2 was used. The Dropout layer with a probability set to 0.25 was used.
• In the sixth block, the 2D CNN was used with the same parameters as in the second to fourth blocks. However, the number of feature maps was doubled to 64.
• The MaxPooling and Dropout layers have the same parameters as in the fifth block.
2) IMAGE DATA BRANCH
• The image data was used as input. These input data were resized.
• The second block of the image part was the 2D CNN layer, which has 16 feature maps with 3 × 3 kernel dimension. The Rectifier linear unit (ReLU) was used as an activation function.
• In the next block, the MaxPooling layer was used (kernel dimension 2 × 2). The Dropout layer with a probability set to 0.25 was used.
• In the fourth block, the 2D CNN was used with the same parameters as in the second block. However, the number of feature maps was doubled to 32.
• The MaxPooling layer and Dropout have the same parameters as the third block.

3) HYBRID SENSOR AND IMAGE DATASETS
• In our case, we tack on a fully connected layer with five neurons. The final model was defined using the inputs of both branches (sensor and image data).
• The connection of the final layers in the hybrid network is based on the outputs of both the sensor and image branches. The final output of the hybrid network is classified into 5 different classes of cars. Features are extracted from the individual sources of information by building appropriate network models, preferably models that are most suitable for the given data types. Feature extraction from one source is independent of the other. Once all the features essential for prediction are extracted from both datasets, they are combined into a single shared representation. In the next step, the information from the two modalities is merged to perform a final prediction. The information coming from different modalities is characterized by varying predictive power and noise topology. In our case, we take a weighted combination of the sub-networks so that each input modality can have a learned contribution towards the final output, i.e., towards the resulting prediction. The training parameters of the hybrid CNN model are described below.
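The weighted combination of the two branch outputs can be sketched as a late-fusion step; the fixed weights and the 5-class logits below are illustrative placeholders, since in the actual model the branch contributions are learned during training:

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(sensor_logits, image_logits, w_sensor, w_image):
    """Weighted combination of the sensor- and image-branch outputs,
    followed by the final 5-class Softmax."""
    z = (w_sensor * np.asarray(sensor_logits, dtype=float)
         + w_image * np.asarray(image_logits, dtype=float))
    return softmax(z)

# Hypothetical 5-class logits from each branch for one vehicle
probs = fuse([2.0, 0.1, -1.0, 0.5, 0.0],
             [1.5, 0.3, -0.5, 0.2, -0.2],
             w_sensor=0.6, w_image=0.4)
```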
• Number of epochs: This can be set as an integer value ranging from one to infinity. Increasing the number of epochs is advantageous if the dataset is large.
• Early stopping: The training process stops when the validation performance deteriorates for 10 consecutive epochs. This helps to avoid poor performance of the neural network on non-training data, while still learning well on the training data.
• Training function: This is the overarching algorithm used to train the neural network and to associate certain inputs with specific outputs.
• Minimum gradient: This refers to the minimum magnitude of the gradient required for the training of the neural network to terminate.
• Activation function: This introduces non-linearity into the model. In our case, the Rectified Linear Unit (ReLU) was employed as the activation function.
• The pooling layer moves the filter through the output feature map of the previous convolutional layer. The filter size was 2 × 2 and max-pooling approach was used.
• The number of epochs is the number of times the entire training dataset is presented to the neural network.
• The batch size is the number of samples submitted to the neural network in a single training step.
• The learning rate describes the step size the neural network model takes towards minimizing the loss function.
• The dropout size describes a technique that prevents the neural network from over-fitting.
• The kernel size parameter describes the size of the filter, determining the size of the 2D convolution window.

Table 8 and Figure 10 show the results of the individual CNNs, separately for the sensor branch (sensor data only) and for the image branch (image data only), while Tab. 9 and Fig. 11 summarize the overall performance of the hybrid CNN. The proposed two-input CNN model shows a significant improvement in performance (precision accuracy) compared to the conventional CNNs (AlexNet, GoogleNet, ResNet-50, and ResNet-101; see results in Tab. 4) and to our single-input CNN based on sensor data. First, from the retrieved results, it becomes apparent that the hybrid CNN with combined datasets yields much better results than the CNN with sensor or image data only.
As a result, the precision of the vehicle classification reached extraordinarily high values, ranging between 90% and 97%, depending on the particular vehicle class. Moreover, comparing the precision results obtained by this novel hybrid (combined image and sensor) CNN with those previously obtained via conventional CNNs (GoogleNet, ResNet-50, ResNet-101, and AlexNet; shown in Tab. 4) and with sensor data only, we can observe a considerable enhancement in the final classification results. Nominally, these improvements are between 35%-58% with respect to AlexNet, 20%-39% with respect to GoogleNet, and 19%-33% with respect to ResNet-50. The obtained results (confusion matrix and evaluation criteria) are graphically represented in Fig. 11. Numbers in each cell of the graphical representation of the confusion matrix refer to the percentage of correctly and incorrectly assigned objects. To show the percentage values of individual objects of the actual class, we color each cell. Here, the dark-blue color indicates 0% incorrectly assigned vehicles, and the color then changes continually to light-yellow, indicating 100% correctly assigned vehicles from all tested data.
The main advantage of the proposed hybrid CNN over conventional approaches is the combination of image and sensor datasets. This solution has the potential to afford a comprehensive representation of the vehicle under test, with an in-built back-up verification. As a result, the prediction accuracy is higher, as demonstrated in this work. By prioritizing the individual branches, we can manage the classification accuracy and improve the robustness of the system. The overall classification accuracy can be increased because both data sources are originally independent and, through the hybrid solution, they can trade off the mutual shortcomings of the individual techniques. On the other hand, there are a few challenges to be addressed with such a hybrid CNN solution. The effective integration of both sources can be problematic in some situations. This is due to the fact that, in the general case, the image and sensor datasets can have contradictory characteristics, scales, and requirements for complex processing in one CNN system. Moreover, the categorization results can be affected by the image quality and by the number of images used in each class and each branch. Comparing the dataset examples from Fig. 4 (sensor dataset) and Fig. 6 (image dataset), the image quality differs, which may impact the overall results as well. Nevertheless, improved datasets for both branches can facilitate further enhancement of the final classification product.
From an application point of view, the proposed solution for vehicle categorization can potentially be employed for road transport monitoring within smart cities, performing automatic counting or separation. This also includes dynamic management to optimize different vehicle classes on the road or road resources. Moreover, another potential application field can be found in vehicle speed monitoring or weigh-in-motion functions, where heavy vehicles can be diverted from the main road towards designated parking lots, or vehicle overload can be evaluated on-site.

IV. CONCLUSION
In summary, we demonstrated a novel neural network architecture to improve vehicle classification. More specifically, we used clustered datasets from a hybrid technological platform based on a conventional CCTV system and an FBG sensor array. The novel clustered CNN-based classification system (with both sensor- and image-based datasets) improved vehicle detection accuracy, obtaining precision levels between 90% and 97%, and includes a back-up verification with respect to each technology we used. Collecting data from two technologically independent platforms can also trade off the intrinsic shortcomings of both conventionally decoupled solutions. The hybrid CNN concept substantially enhances the detection accuracy for correct vehicle classification and opens up a way for effective vehicle classification by leveraging available and inexpensive technologies.

DATA AVAILABILITY
The data that support the findings of this work are available from the authors upon reasonable request. Data contact person: Dr. Patrik Kamencay (patrik.kamencay@uniza.sk).