Intelligent Traffic-Monitoring System Based on YOLO and Convolutional Fuzzy Neural Networks

With the rapid pace of urbanization, the number of vehicles traveling between cities has increased significantly. Consequently, many traffic-related problems have emerged, such as traffic jams and excessive numbers and types of vehicles. To solve traffic problems, road data collection is important. Therefore, in this paper, we develop an intelligent traffic-monitoring system based on you only look once (YOLO) and a convolutional fuzzy neural network (CFNN), which records traffic volume and vehicle type information from the road. In this system, YOLO is first used to detect vehicles and is combined with a vehicle-counting method to calculate traffic flow. Then, two effective models (CFNN and Vector-CFNN) and a network mapping fusion method are proposed for vehicle classification. In our experiments, the proposed method achieved an accuracy of 90.45% on the Beijing Institute of Technology public dataset. On the GRAM-RTM dataset, the mean average precision (mAP) and F-measure (F1) of the proposed YOLO-CFNN and YOLO-VCFNN vehicle classification methods are 99%, superior to those of other methods. On actual roads in Taiwan, the proposed YOLO-CFNN and YOLO-VCFNN methods not only achieve a high F1 score for vehicle classification but also have outstanding accuracy in vehicle counting. In addition, the proposed system maintains a detection speed of more than 30 frames per second on the AGX embedded platform. Therefore, the proposed intelligent traffic-monitoring system is suitable for real-time vehicle classification and counting in actual environments.


I. INTRODUCTION
Road traffic monitoring is an important research topic. By analyzing the types of vehicles and traffic flow on the road, current traffic conditions can be understood, and actionable information can be provided to traffic management agencies. This information can help these agencies to make decisions that improve people's quality of life. For example, on holidays, information regarding the road traffic volume can be used to suggest alternate routes to drivers to divert traffic from congested areas. In addition, if large trucks often use a certain road, roadside warnings can be installed to alert drivers and reduce traffic accidents. Moreover, the type and color of a specific vehicle can be used to identify and track the vehicles of criminals. The abovementioned applications all rely on information collected by a road monitoring system for analysis. Therefore, to obtain information on passing vehicles, many researchers have used different methods to achieve vehicle detection and classification.
Traditional vehicle-detection methods are mainly divided into two types: (1) static-based methods [1]-[7] that use sliding windows or shape feature comparison to generate vehicle prediction frames and verify them based on the information in the prediction frames and (2) methods that use the dynamic features of a moving object [8]-[12] to separate it from the image and obtain the contour of the object. Regarding static-based methods, Mohamed et al. [1] proposed a vehicle-detection system that uses Haar-like features to extract vehicle shape features and inputs the extracted features into an artificial neural network to realize vehicle classification. Wen et al. [2] also used Haar-like features to extract the edge and structural features of vehicles and input them into AdaBoost to filter important features. Then, the filtered features were input into a support vector machine (SVM) for classification to improve recognition accuracy. Sun et al. [3] and David and Athira [4] used Gabor filters to obtain vehicle characteristics and then input them into an SVM to determine whether a vehicle is present in an image. Wei et al. [5] designed a two-step vehicle-detection method. First, they used Haar-like features and AdaBoost to obtain the region of interest containing vehicles and subsequently used the histogram of oriented gradients (HOG) [6] and an SVM to reverify the region. According to their experimental results, their method exhibited improved vehicle-detection capability. Yan et al. [7] designed a vehicle-detection system that used vehicle shadows to select the boundaries of vehicles and the HOG to extract features. These features were then input into an AdaBoost classifier and an SVM classifier for verification. In this method, when vehicles block each other, they are regarded as one vehicle because their shadows are connected, which weakens the detection effect.
In terms of dynamics, Seenouvong et al. [8] proposed a vehicle-detection and counting system based on dynamic features. Background subtraction was used to obtain a difference map from a given current image to achieve segmentation of the corresponding foreground image. In addition, various morphological operations were used to obtain the outline and bounding box of a moving object, detect moving vehicles, and count the vehicles passing through a designated area. A few researchers have used Gaussian mixture models (GMMs) [9], [10] or adaptive background models [11]-[13] to model the background, with the aim of solving the poor foreground segmentation that background subtraction produces when the brightness changes gradually. The aforementioned static and dynamic methods have many limitations. For example, traditional feature extraction methods must be manually designed by experts on the basis of their experience, meaning that the process is complicated. Moreover, the extracted features are mostly pieces of shallow vertical and horizontal information, which cannot effectively describe the changes in vehicle features and cannot be widely used. The dynamic feature method increases the complexity of subsequent image processing operations in cases of extensive background changes, in addition to yielding poor detection results. With recent advancements in deep learning, these conventional methods have gradually been replaced by deep learning techniques.

II. LITERATURE REVIEW
In recent years, deep learning has been widely used in many fields, and good prediction results have been obtained with this method. Compared with traditional methods that require artificial feature determination, the convolutional neural network (CNN) method greatly improves the accuracy of image recognition. Initially, LeCun et al. [14] proposed the LeNet model to solve the problem of recognizing handwritten digits in the banking industry. Krizhevsky et al. [15] proposed AlexNet, which improves the traditional CNN by deepening the model architecture and using the ReLU activation function and the dropout layer to increase the effectiveness of the network during learning and prevent overfitting. Szegedy et al. [16] proposed GoogLeNet, which uses multiple filters of different sizes to extract features that enrich feature information. Simonyan and Zisserman [17] proposed two models, namely VGG-16 and VGG-19. They replaced large convolution kernels by successively applying multiple small convolution kernels and proved that increasing the depth of a model can improve its accuracy. He et al. [18] proposed the ResNet model, which uses residual blocks to solve the problems of gradient disappearance and failure to converge caused by excessive network depth. Howard et al. [19] proposed MobileNet, which uses depthwise separable convolutions to extract fewer and more useful features and reduces the number of redundant parameters in a CNN model.
The aforementioned studies have focused on improving the feature description capabilities of a CNN to extend the application of CNNs to more complex problems, such as object detection. Several researchers [20]-[24] have used region-based CNN (R-CNN) series models to solve the vehicle-detection problem. R-CNN uses the region proposal network (RPN) [25] to extract the position of an object and then classifies it by using a traditional CNN. RetinaNet [26] is a more recent detection network. The R-CNN framework comprises a two-stage mechanism and uses a multilayer neural network for classification [27], [28]. This architecture substantially increases the number of parameters used and decreases the execution speed; thus, it is unsuitable for real-time detection. To solve this problem, one-stage methods have been proposed for vehicle detection, such as the you-only-look-once (YOLO) framework [29]-[31] and the single-shot multibox detector (SSD) [32] framework. One-stage methods are fast and can detect objects in real time, but their classification accuracy is lower than that of R-CNN methods [33], [34].
The aforementioned object-detection methods have the following problems: 1) Two-stage object-detection methods have high classification accuracy, but their large number of network parameters decreases the detection speed. 2) One-stage object-detection methods have a high real-time detection speed but lower accuracy than two-stage object-detection methods.
3) To increase the number of object categories, the entire network must be retrained, which is time-consuming and reduces the scalability of the method.
Recently, fuzzy neural networks (FNNs) [35]-[39], which have a human-like fuzzy inference mechanism and the powerful learning functions of neural networks, have been widely used in various fields, such as classification, control, and forecasting. Asim et al. [35] applied an adaptive network-based fuzzy inference system to classification problems. Compared with traditional neural networks, this method yielded higher classification accuracy. Lin et al. [36] used an interval type-2 FNN and tool chips to predict flank wear, and their method yielded superior prediction results. A few researchers have used a locally recurrent functional link fuzzy neural network [37] and Takagi-Sugeno-Kang-type FNNs [38], [39] to solve system identification and prediction problems, and both methods have yielded good results. In this study, an FNN was embedded into a deep learning network to reduce the number of parameters used in the network and obtain superior classification results. Conventional CNNs use pooling, global pooling [40], and channel pooling [41] methods for feature fusion. Global pooling methods sum the spatial information and perform operations on each feature map to achieve feature fusion; they can be divided into global average pooling (GAP) [42] and global max pooling (GMP) [43]. Global pooling methods are thus more robust to spatial translations of the input and help prevent overfitting. Channel pooling methods include channel average pooling (CAP) [44] and channel max pooling (CMP) [45], which perform feature fusion by computing average or maximum pixel values, respectively, at the same positions in each channel of the feature maps. However, these pooling methods only compress features and contain no learnable weights, which can lead to poor classification results. In this study, a new feature fusion method named network mapping is proposed to enhance the utility of feature fusion, and the effectiveness of different feature fusion methods is explored.
To design an intelligent traffic-monitoring system with fast execution speed, high classification accuracy, and high category extensibility, a two-stage object-detection method was adopted in this study. The proposed intelligent traffic-monitoring system based on YOLO and a convolutional FNN (CFNN) collects real-time information on traffic volume and vehicle type on the road. In this system, a novel modified YOLOv4-tiny (mYOLOv4-tiny) is first used to detect vehicles and is then combined with a vehicle counting method to calculate the traffic flow. Furthermore, two effective models (CFNN and Vector-CFNN) and a network mapping fusion method that improve the computational efficiency, classification accuracy, and category extensibility are proposed for vehicle classification. The proposed model architecture has fewer network parameters than other models; therefore, the system can achieve real-time, high-accuracy vehicle classification with limited hardware resources and flexible extensibility for different categories.
The contributions of this study can be summarized as follows:
• An intelligent traffic-monitoring system was developed to record real-time information about traffic volume and vehicle types.
• An mYOLOv4-tiny model was proposed to achieve real-time object detection and improve detection efficiency.
• Two effective models (CFNN and Vector-CFNN) that adopt a new network mapping fusion method were implemented to increase the classification accuracy and greatly reduce the number of model parameters.
• Category extensions (e.g., vehicle type) only require training of the classification model (CFNN) without retraining of the object detection model (YOLO). This not only saves substantial training time but also improves the flexibility of category extension.
• The proposed intelligent traffic-monitoring system was implemented on the NVIDIA AGX Xavier embedded platform and applied to Provincial Highway 1 (T362) in Kaohsiung, Taiwan, for real-time vehicle tracking, counting, and classification.

The remainder of this paper is organized as follows: Section III introduces the proposed YOLO-CFNN method for intelligent traffic monitoring. The experimental results of the proposed method are described in Section IV. Section V presents our conclusions and an outline of future work.

III. PROPOSED YOLO-CFNN FOR INTELLIGENT TRAFFIC MONITORING
In this section, the proposed intelligent traffic-monitoring system is introduced. The system has three functions, namely (1) vehicle detection, (2) vehicle counting, and (3) vehicle classification. The system architecture is illustrated in Fig. 1, and a flowchart of the proposed intelligent traffic-monitoring system is presented in Fig. 2. First, real-time road images are obtained from traffic cameras. Then, the proposed mYOLOv4-tiny model is used to detect the position of each vehicle. To prevent the same car from being repeatedly recorded as different vehicles in different frames, a counting algorithm is introduced to track the vehicle; in other words, a vehicle is assigned the same identity (ID) across different frames. Before the counting algorithm is executed, target vehicles are screened using the virtual detection area (Fig. 3) to reduce the computational burden. Finally, the vehicles passing through the virtual detection area are counted and classified, and the resulting information is collected and stored for subsequent analysis.
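As a rough sketch of the per-frame counting logic described above, the following snippet applies the count-each-ID-once rule inside the virtual detection area. The tracker output format, zone test, and classifier callback are hypothetical stand-ins for illustration, not the system's actual API:

```python
def update_counts(tracks, zone, counted, counts, classify):
    """Count each tracked vehicle exactly once as it enters the detection zone.

    tracks:  list of (track_id, box) pairs from the tracker,
             with box = (x1, y1, x2, y2)
    zone:    virtual detection area as (x1, y1, x2, y2)
    counted: set of track IDs already counted (persists across frames)
    counts:  dict mapping vehicle type -> running total
    classify: callback mapping a box to a vehicle-type string (e.g., a CFNN)
    """
    def center_in_zone(box):
        # A vehicle is "in the zone" when its box center falls inside it.
        cx = (box[0] + box[2]) / 2.0
        cy = (box[1] + box[3]) / 2.0
        return zone[0] <= cx <= zone[2] and zone[1] <= cy <= zone[3]

    for track_id, box in tracks:
        if track_id not in counted and center_in_zone(box):
            counted.add(track_id)            # never count the same ID twice
            vtype = classify(box)
            counts[vtype] = counts.get(vtype, 0) + 1
    return counts
```

Because `counted` persists across frames, a vehicle that stays in the zone for many consecutive frames still contributes one count.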

A. VEHICLE DETECTION USING MODIFIED YOLOv4-tiny
The conventional YOLOv4-tiny is a lightweight network simplified from YOLOv4. It uses convolutional layers and max pooling layers to extract object features. In addition, YOLOv4-tiny uses UpSampling and Concat layers to merge features and expand feature information to further improve detection. Compared with other YOLO and SSD methods, YOLOv4-tiny has a faster detection speed. However, its detection accuracy is worse than that of the full YOLO and SSD methods because of its greatly simplified network architecture. Conventional YOLOv4-tiny uses two output scales for object detection. To improve the detection accuracy, an mYOLOv4-tiny with three outputs was designed for vehicle detection. The network architecture of mYOLOv4-tiny is depicted in Fig. 4. In total, 24 convolutional layers and three max pooling layers were used. Finally, three scales (25 × 15, 50 × 30, and 100 × 60) were used for prediction. In this system, the mYOLOv4-tiny model is used only to detect vehicle objects.

B. VEHICLE COUNTING METHOD
The above-described YOLO object detection method can identify a vehicle and its location in a single picture. However, in actual traffic applications, a continuous stream of image frames is provided as the input, and the vehicles detected in different frames are independent of each other. Consequently, the same vehicle may be counted multiple times, making the collected vehicle information incorrect. To solve this problem, each detected vehicle must be assigned an ID to prevent double counting. In the proposed system, an object counting method is added to correlate and match the vehicles detected in different image frames and to determine whether a detected vehicle is newly added. In this study, the multiobject counting method [46] is adopted, which uses the vehicle position information from the previous frame, obtained by the detection method, to predict the position of a vehicle in the current frame by applying a Kalman filter. Then, the actual vehicle position detected in the current frame and the vehicle position estimated using the Kalman filter are used to calculate the intersection over union (IoU) as the distance cost. Finally, the Hungarian algorithm is applied to match vehicles and achieve vehicle tracking.
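The matching step can be sketched as follows. To keep the example self-contained, a brute-force optimal assignment stands in for the Hungarian algorithm (a real system would use an optimized solver such as SciPy's `linear_sum_assignment`), and the Kalman prediction step is omitted; the box format and function names are illustrative:

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def match_tracks(predicted, detected, iou_threshold=0.3):
    """Pair predicted track boxes with current detections so that the total
    IoU is maximized, then keep only pairs above the IoU threshold.

    Brute force over permutations replaces the Hungarian algorithm here,
    which is exact but only practical for a handful of boxes. For
    simplicity, extra tracks beyond min(len(predicted), len(detected))
    are ignored.
    """
    if not predicted or not detected:
        return []
    m = min(len(predicted), len(detected))
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(detected)), m):
        pairs = list(zip(range(m), perm))
        score = sum(iou(predicted[i], detected[j]) for i, j in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return [(i, j) for i, j in best_pairs
            if iou(predicted[i], detected[j]) >= iou_threshold]
```

Using 1 − IoU as the distance cost and minimizing it, as the paper describes, is equivalent to maximizing the total IoU as done here.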
C. VEHICLE CLASSIFICATION USING CFNN
By using the methods described in Sections III-A and III-B, the vehicle position can be determined from the complete image and the vehicle can be segmented. Next, to collect more detailed information, such as vehicle type, the segmented vehicle image is analyzed and the results are obtained after classification. If information items must be added, the YOLO model need not be retrained; thus, the system has superior extensibility and reduced training time after category expansion.
To identify relevant vehicle information, two CFNNs, called CFNN and Vector-CFNN, are proposed, as illustrated in Fig. 5. In the CFNN model in Fig. 5(a), convolutional layers are first used to extract features from the image, and max pooling layers are then used to compress these features to reduce the amount of calculation. The interactive stacking method is used to increase the model depth to form various shape feature combinations, and a feature fusion layer is added to reduce the feature dimensionality and integrate information. Finally, the fused feature information is sent to the FNN for classification to obtain the vehicle type. To solve the problem of the many redundant parameters in a traditional CNN model, this study proposes the Vector-CFNN model (Fig. 5(b)). The architecture of this model is similar to that of CFNN, but the traditional convolutional layer is replaced with a two-layer vector kernel convolutional layer [47] to further reduce the number of parameters and the computational complexity of the model. Next, the feature fusion layer and FNN classifier used in the proposed models are explained.

1) FEATURE FUSION LAYER
In the feature fusion layer, different fusion methods can be used to integrate different types of feature information to obtain more useful features. Given a large number of input features, a suitable fusion method is selected to compress the features and reduce the dimensionality of the information between them. For method selection, the features are fused using either pooling operations or network mapping. Based on the different operation rules between features, different fusion results can be obtained, as summarized in Table 1.
In this study, a network mapping fusion method is proposed. This method assigns a weight to each element of the extracted features and then integrates the weighted elements to obtain new features. The calculation method is shown in Fig. 6, and the calculation formula is as follows:

f_z = \sum_{i=1}^{n} w_{zi} x_i

where f_z is the output of the zth fusion, n is the total number of input features, x_i is the ith input feature element, and w_zi is the ith input weight used in the zth fusion result.
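To make the contrast with pooling concrete, the following NumPy sketch compares GAP, CMP, and the network mapping rule f_z = Σ_i w_zi x_i described above. Shapes and variable names are illustrative, and the weight matrix would be learned during training rather than fixed as here:

```python
import numpy as np

def global_average_pooling(fmaps):
    """GAP: one scalar per channel; fmaps has shape (channels, H, W)."""
    return fmaps.mean(axis=(1, 2))

def channel_max_pooling(fmaps):
    """CMP: maximum over channels at each spatial position."""
    return fmaps.max(axis=0)

def network_mapping(features, weights):
    """f_z = sum_i w_zi * x_i: each fused output is a learnable
    weighted combination of every input feature element."""
    return weights @ features

fmaps = np.arange(8, dtype=float).reshape(2, 2, 2)   # 2 channels of 2x2
gap = global_average_pooling(fmaps)                  # -> array([1.5, 5.5])
cmp_ = channel_max_pooling(fmaps)                    # elementwise max of channels
w = np.array([[1.0, 0.0, 0.0],                       # 2 fused outputs
              [0.5, 0.5, 0.0]])                      # from 3 input features
fused = network_mapping(np.array([1.0, 2.0, 3.0]), w)  # -> array([1. , 1.5])
```

Unlike GAP and CMP, which apply a fixed reduction, `network_mapping` carries trainable weights, which is the source of the accuracy gains reported later.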

2) FNNS
FNNs mimic human logical thinking and learning abilities. In terms of network design, an FNN can be divided into an input layer, fuzzification layer, rule layer, and defuzzification layer. The fuzzy set is contained in the fuzzification layer, and its members can have different degrees of membership on the interval [0, 1], as defined by a membership function. The fuzzy membership function converts input data to a value in [0, 1] based on the degree of membership in a specified set, providing a measure of the similarity of an element to a fuzzy set. Common fuzzy membership functions include triangular, trapezoidal, bell-shaped, and Gaussian functions; among these, the Gaussian membership function has the highest accuracy [48]. Therefore, the Gaussian function is adopted as the membership function in the proposed CFNN. The feature vectors extracted by the convolution operations are classified by an FNN, and If-then rules are used to represent the fuzzy rules for fuzzy inference (Fig. 7).
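To make the fuzzification step concrete, the sketch below computes Gaussian membership degrees and the firing strength of one If-then rule as the product of the memberships of its antecedents. This is a minimal illustration of the mechanism, not the paper's exact layer implementation:

```python
import math

def gaussian_membership(x, mean, sigma):
    """Degree of membership of x in a fuzzy set with a Gaussian
    membership function; the result always lies in (0, 1]."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

def fuzzy_rule_strength(inputs, means, sigmas):
    """Firing strength of one If-then rule: the product of the
    memberships of each input in the rule's antecedent fuzzy sets."""
    strength = 1.0
    for x, m, s in zip(inputs, means, sigmas):
        strength *= gaussian_membership(x, m, s)
    return strength
```

In a full FNN, each rule's strength would then weight its consequent, and the defuzzification layer would combine the weighted consequents into the final class scores.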

IV. EXPERIMENTAL RESULTS
To verify the effectiveness of the proposed intelligent traffic-monitoring system, three experiments were performed in this study. Section IV-A describes the evaluation indicators and the parameter settings of the CFNN model. Section IV-B compares the classification efficiency of various CNN models and feature fusion methods by using the Beijing Institute of Technology vehicle dataset (BIT-Vehicle Dataset). Section IV-C applies the public GRAM road-traffic monitoring (GRAM-RTM) dataset to compare the evaluation indicators between the proposed YOLO-CFNN and state-of-the-art object detection methods. Section IV-D presents the proposed intelligent traffic-monitoring system implemented on the AGX Xavier embedded platform and applied to Provincial Highway 1 (T362) in Kaohsiung, Taiwan, for vehicle classification, tracking, and counting.

A. EXPERIMENTAL DESIGN
To evaluate the output results of the model, this study used the category with the highest model output value (top-1) as the classification result and accuracy as the evaluation indicator. The calculation formula is as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. The mean average precision (mAP), precision, recall, F-measure (F1), and detection speed (FPS) were also adopted to verify the effectiveness of the various object detection models. The evaluation indicators can be calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
mAP = (1/n) \sum_{k=1}^{n} AP_k

Here, n indicates the number of classes, and AP_k denotes the average precision (AP) of class k. The experiments were implemented in TensorFlow, and the parameter settings of the CFNN and Vector-CFNN models are listed in Tables 2 and 3, respectively. In the CFNN model, the input image size is set to 224 × 224 × 3, and four sets of convolutional layers and pooling layers are used for feature extraction. Each convolutional layer uses a 3 × 3 (see Table 2), 3 × 1, or 1 × 3 (see Table 3) convolution kernel to extract features. Each feature map is compressed through a max pooling layer of size 2 × 2 to reduce the computational load. In the convolutional layers, 32, 64, and 128 convolution kernels are used in the first three layers to extract various shape feature combinations. Then, the number of convolution kernels in the last layer is set to 64, and the feature fusion layer is added to reduce the dimensionality of the features. Here, by using the proposed network mapping method, a total of 128 features are fused and input into the FNN for classification. The output size represents the number of vehicle categories.
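As a concrete check of these indicators, the definitions above can be computed directly from the raw counts. This is a minimal, framework-independent sketch:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from raw TP/FP/TN/FN counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class average precisions AP_k."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, with TP = 8, FP = 2, TN = 85, and FN = 5, the accuracy is 0.93, the precision is 0.80, and the recall is 8/13.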

B. CLASSIFICATION RESULTS OF BIT-VEHICLE DATASET
The BIT-Vehicle Dataset [49] is a public dataset for vehicle classification collated by the Beijing Institute of Technology. The dataset contains a total of 9,850 vehicle images, all of which were captured using two cameras at different times and locations on highways. These images differ in brightness, proportions, and surface color. The dataset includes six vehicle types: buses, minibuses, minivans, sedans, sport utility vehicles (SUVs), and trucks. Each image contains one or two vehicles. The vehicle positions marked in advance in the dataset are segmented (Fig. 8), and the vehicle types and numbers after segmentation are listed in Table 4.
In the training and testing of the model, according to the processing method described in [49], 200 vehicles were randomly selected from each category to be the training and testing data. In total, 2400 images each were used as the training and test datasets for the experiment. Ten experiments were performed using these data sets and the average of the values obtained in these experiments was used for evaluation. This study used different fusion methods to evaluate the performance of the proposed CFNN and Vector-CFNN models. The experimental results are listed in Table 5. The accuracies of the CFNN and Vector-CFNN models reached 90.20% and 90.45%, respectively, with the network mapping fusion method. Compared with the global pooling and channel pooling methods, the proposed network mapping fusion method has higher accuracy.
Moreover, the two proposed models were compared with other common models, namely AlexNet, GoogLeNet, VGG-16, VGG-19, ResNet50, Sparse Laplacian CNN [49], and PCN-Net [50]. The experimental comparison results are summarized in Table 6. According to the table, the accuracy of the two CFNN models is higher than that of the other deep learning classification methods. The accuracies of the CFNN and Vector-CFNN models are 0.89% and 1.93% higher, respectively, than that of PCN-Net, and the CFNN and Vector-CFNN models use 51.7% and 57.1% fewer parameters, respectively, than PCN-Net.

C. VEHICLE CLASSIFICATION RESULTS ON THE GRAM-RTM DATA SET
The GRAM-RTM (M-30) dataset [51] was used to compare the performance of the proposed YOLO-CFNN with state-of-the-art object detection methods, including RetinaNet, SSD, YOLOv4, and YOLOv4-tiny. The M-30 sequence contains 7520 frames with a resolution of 800 × 480 at 30 fps, recorded using a Nikon Coolpix L20 camera. Vehicle types include large truck, truck, car, van, and motorcycle. The ratio of training data to test data was 8:2; that is, 80% of the data (6016 frames) were used for training and 20% (1504 frames) for testing. The numbers of vehicles in the training and testing phases are presented in Table 7. First, the vehicle object detection model (mYOLOv4-tiny) was trained, followed by the classification models (CFNN and Vector-CFNN). The vehicle classification results are listed in Table 8, which reveals that the proposed YOLO-CFNN, YOLO-VCFNN, and YOLOv4 yield an mAP as high as 99%. The proposed YOLO-CFNN and YOLO-VCFNN methods had a higher F1 score than the other models.
The traditional YOLOv4-tiny model had a higher detection speed (300 FPS) but lower F1 and mAP than other models. However, the proposed two-stage vehicle classification models (YOLO-CFNN and YOLO-VCFNN) achieved a high score for the evaluation indicators and a detection speed of over 70 FPS. Thus, the proposed YOLO-CFNN and YOLO-VCFNN models can be employed for real-time vehicle classification applications.

D. APPLICATION TO PROVINCIAL HIGHWAY 1 (T362) IN KAOHSIUNG
To verify the effectiveness of the system in an actual environment, the proposed intelligent traffic-monitoring system was applied to roads in Taiwan; its architecture is depicted in Fig. 9. In this architecture, each monitored road section houses a camera and an AGX Xavier embedded computing platform. The specifications of the AGX Xavier platform are listed in Table 9. Real-time images are processed using the AGX Xavier platform to obtain detailed traffic data, and the data are sent to the road monitoring center for analysis through the wireless network. Three functions are achieved on the monitored road section used for testing, namely vehicle detection, vehicle classification, and traffic flow calculation. Moreover, a T362 vehicle dataset, which contains vehicle type information, was established for Kaohsiung, Taiwan, to verify the performance of the proposed model.
The T362 vehicle dataset contains 1,815 vehicle images. The images were captured from different lanes and at different times; therefore, the captured images were illuminated by different light sources, and some vehicles were partially blocked, as illustrated in Fig. 10. The T362 vehicle dataset includes the following vehicle types: buses, trucks, cars, motorcycles, and trailers. The numbers of each vehicle type are listed in Table 10. In terms of model training and testing, this study used 80% of the collected vehicle type dataset as the training data and 20% as the testing data. The input image size of the classification model was uniformly adjusted to 224 × 224 × 3, and 10 experimental runs were performed to ensure the stability of the experiment.

1) CLASSIFICATION RESULTS FOR VEHICLE TYPE
In the evaluation conducted using the vehicle type dataset, the experimental results obtained using CFNN and Vector-CFNN with different fusion methods are summarized in Table 11. The CFNN and Vector-CFNN models with the proposed network mapping fusion method exhibited the best accuracy values of 94.68% and 95.28%, respectively. The proposed network mapping method is superior to the other fusion methods in vehicle classification.
The CFNN and Vector-CFNN models proposed in this study were compared with a few common deep learning methods, and the experimental results are listed in Table 12. The accuracy of the proposed models with the network mapping fusion method was superior to that of the other classification methods. The accuracy of the proposed models was 1.83%, 3.59%, 8.6%, and 11.29% higher than that of AlexNet, VGG16, LeNet, and MobileNet, respectively. In terms of the number of parameters, the proposed CFNN and Vector-CFNN models had the fewest parameters, approximately 0.5 M. In addition, compared with the lightweight MobileNet, LeNet, and AlexNet, the two proposed CFNN models reduced the number of parameters by up to 86.8%, 89.4%, and 98.8%, respectively. Thus, the proposed models achieve favorable classification and offer a competitive advantage when few parameters are used.

2) COUNTING RESULTS OF ACTUAL ROAD TRAFFIC FLOW
Finally, the vehicle counting method used in this study was evaluated and verified. In this experiment, the same traffic scene was used for verification. Three actual road traffic videos were used to evaluate the proposed vehicle counting method. Each video was 5 min long; two of the videos were recorded at 07:00 and 17:00, and the remaining video was recorded in rainy conditions. Still images from the three videos are displayed in Fig. 11.
In the evaluation, the proposed vehicle flow counting result was divided by the manual counting result to determine the accuracy of vehicle counting. In addition, different occlusion conditions were included in the real road scene, as presented in Fig. 12. As shown in Fig. 12, a larger bus blocks a car, resulting in a missed count. The visual vehicle detection and counting results are shown in Fig. 13. The text in the first half of the green label in Fig. 13 represents the type of vehicle, and the text in the second half represents the count. When a vehicle enters the virtual detection zone, the proposed intelligent traffic-monitoring system immediately performs vehicle classification and counting.
The traffic flow counting results of each video are summarized in Tables 13-15. The precision versus recall curves of the proposed YOLO-CFNN and YOLO-VCFNN models are shown in Figs. 14-16. As shown in Table 13, the mAP of RetinaNet and SSD was 94%, but their F1 scores were only 76.06% and 86.28%, respectively. The mAP and F1 score of YOLOv4 were 88.82% and 85.55%, respectively; however, its mAP for trailers was only 64.44%. Although YOLOv4-tiny has a detection speed of 145 FPS, its motorcycle detection performance was poor (65.49%). The proposed YOLO-CFNN and YOLO-VCFNN are superior to the other methods in terms of F1 score (99%). After introducing the counting method into CFNN and VCFNN, the FPS remained above 30, achieving real-time detection. The two proposed methods also had an accuracy of 97.05% in traffic flow vehicle counting.
For the afternoon road traffic video (Table 14), the mAP and F1 of YOLO-CFNN and YOLO-VCFNN were higher than those of the other methods, and the accuracy of flow counting was 98.5%. For the rainy video (Table 15), the mAP of motorcycle detection was lower for all methods except SSD because images captured in rainy conditions are blurry, which affects the detection results. However, the mAP and F1 of the two proposed methods were higher than 90%, and the counting accuracy was 100%. These scenarios reveal that the proposed intelligent traffic-monitoring system is suitable for real-time vehicle counting in actual environments and has a high counting accuracy.

V. CONCLUSION
In this study, an intelligent traffic-monitoring system was proposed to calculate traffic flows and classify vehicle types. The major contributions of this study are as follows:
• A novel intelligent traffic-monitoring system combining a YOLOv4-tiny model and a counting method was proposed for traffic volume statistics and vehicle type classification.
• The proposed CFNN and Vector-CFNN were designed by introducing the fusion method and FNN, which can not only effectively reduce the number of network parameters, but also enhance the classification accuracy.
• The proposed network mapping fusion method was superior to the commonly used pooling method, and it could effectively integrate image features and improve the classification accuracy.
• Compared with current state-of-the-art object detection methods (RetinaNet, SSD, YOLOv4, and YOLOv4-tiny), the proposed YOLO-CFNN and YOLO-VCFNN have a high mAP, accurate counting, and real-time vehicle counting and classification ability (over 30 FPS).

The experimental results indicated that the performance of the proposed CFNN and Vector-CFNN models was superior to that of common deep learning models. On the BIT dataset, compared with the pooling methods, the proposed network mapping fusion method improved the recognition accuracy by 3.59%-5.92%. In addition, compared with the PCN-Net model, the proposed Vector-CFNN model improved the accuracy by 1.93% and reduced the number of parameters by 57.1%. On the GRAM-RTM dataset, the mAP and F1 of the two proposed vehicle classification methods were 99%, higher than those of other methods. In addition, in terms of FPS, the proposed method was 1.65 times faster than traditional YOLOv4. On the T362 vehicle type dataset, the accuracy of the proposed network mapping fusion method was 2.3%-5.36% higher than that of general pooling methods. In addition, compared with the AlexNet model, the accuracy of the proposed CFNN and Vector-CFNN models was 1.19% and 1.83% higher, respectively, and the number of parameters decreased by 98.8%. In three actual road traffic scenarios, the proposed YOLO-CFNN and YOLO-VCFNN methods yielded a high F1 score for vehicle classification and high accuracy for vehicle counting. In summary, the CFNN and Vector-CFNN models proposed in this study not only have favorable vehicle classification effects but also have fewer parameters than other models. Therefore, the proposed models are suitable for information analysis in environments with limited hardware performance.
In terms of the extensibility of the proposed models, many factors that affect the machining accuracy of machine tools in intelligent manufacturing have been identified, such as temperature and tool wear. Therefore, developing an accurate model of the effects of these factors is crucial. In future studies, the proposed CFNN and Vector-CFNN models and the network mapping fusion method will be applied for modeling in intelligent manufacturing.