Designing Electronic Traffic Information Acquisition System Using Deep Learning and Internet of Things

In implementing the Intelligent Traffic Monitoring System (ITMS), timely and effective access to road traffic information is an essential link. It requires an effective traffic Information Acquisition System (IAS) to collect real-time data and transmit the collected information to the backend for processing. Therefore, this paper studies on-road vehicle information recognition based on Deep Learning (DL). Firstly, a framework of the traffic IAS is proposed. Then, an improved Multi-Task (MT)-GooGleNet model based on the Convolutional Neural Network (CNN) is proposed to locate and recognize vehicles in traffic images. Finally, the performance of the model is analyzed by simulation. The experimental results of vehicle position recognition show that the classification accuracy of MT-GooGleNet after fine-tuning is 99.5%. Compared with other models, the proposed MT-GooGleNet model performs best in vehicle position recognition, with very high positioning accuracy. The results of vehicle identification show that, after data enhancement and pre-training, the testing set accuracy of the MT-GooGleNet model is 79.96%, demonstrating that the model's accuracy is dramatically improved by this processing. The research provides a reference for establishing IASs in the future.


I. INTRODUCTION
The last several decades have seen rapid socio-economic development and a much-improved quality of life in China, during which household car ownership and commercial vehicle volume have risen substantially. Yet, such civil convenience has come at a cost, such as stubborn traffic congestion and rising accident rates. Therefore, there is an imminent demand for alleviating traffic congestion and distributing transportation resources using modern technological means. Under such circumstances, the ubiquitous Internet of Things (IoT) and maturing Artificial Intelligence (AI) technologies might help devise an intelligent urban traffic monitoring and information management system for the well-being of residents [1,2].
Arguably, for a robust and viable Intelligent Traffic Monitoring System (ITMS), timely and efficient data acquisition should be the utmost concern. Overall, transportation data can cover vehicle speed, traffic flow, road occupancy rate, and vehicle type, and every link matters for timely and effective data collection, transmission, processing, and forecasting [3-5]. The first link is traffic data collection, for which multiple models and systems are available. In particular, real-time traffic information is a prominent and comprehensive indicator of the traffic situation; it is relatively easy to collect, and many ready-made video detection tools have been developed for it. Compared with other traditional detection methods, the video detection approach offers intuitive traffic scenes and a wide, corner-covering detection range. Thus, vehicle-oriented Image Recognition (IR) and Video Analysis (VA) technology can effectively realize real-time vehicle tracking and management [6]. On the other hand, the Convolutional Neural Network (CNN) is a common Deep Learning (DL)-based feedforward Neural Network (NN) structure for image processing, with an outstanding local connection and weight sharing mechanism [7,8]. In a CNN, neurons are organized to respond to overlapping visual fields. As more hidden layers are introduced, the depth of the CNN increases, and it can be applied to more advanced scenarios, such as Machine Translation (MT), Speech Recognition (SR), Image Segmentation (IS), and Natural Language Processing (NLP). Its unique convolution-and-pooling operation is highly efficient in image processing and gives it unparalleled advantages in 2D graphics tasks such as IR and positioning [9]. The Residual Neural Network (ResNet) is an innovative form of CNN that makes the convolutional layer learn a residual rather than learning the target directly [10]. For example, Zhao et al. 
(2018) [11] analyzed the characteristics of CNN and ResNet to build a ResNet model. They collected image data using an Unmanned Aerial Vehicle (UAV) remote sensing system, on which the IR performance of the proposed ResNet model was tested. Chen et al. (2018) [12] introduced a deep network architecture, the prototype parts network, which parses an image by finding prototype parts and classifies it with the evidence of the prototypes. The network uses only image-level labels for training, without any annotations of image parts. Thus, CNN is often seen in IR and vehicle location recognition.
Background complexity and target diversity increase the difficulty of IR, and factors like illumination or shooting angle cause deviations when positioning and identifying vehicles from images. Therefore, based on DL, this paper proposes an improved Multi-Task (MT) learning GooGleNet model built on CNN to locate and recognize vehicles in images and studies the model's performance. The main innovation is the application of CNN-based DL to IR, which improves IR accuracy and constitutes the main difference and novelty of the present work relative to previous research methods. At the same time, the research has a certain reference value for the digitization and intellectualization of urban transportation infrastructure.

II. RECENT RELATED WORK
A. Current status of research on the IoT-based ITMS

Compared with rapid urbanization, especially surging car ownership, the development of traffic management has lagged behind, for instance in road planning and construction. These are the leading causes of urban traffic congestion, a modern urban disease demanding an immediate solution. In particular, ITMS is gradually becoming the research direction for solving traffic congestion problems globally. The authors of [13] established a vehicle motion model using OPNET Modeler. They simulated the multi-hop scenario of the vehicle Self-Organizing Network (SON) in an IoT-based ITMS, finding that the vehicle SON presented stable overall performance at low speed. Brincat et al. (2020) [14] identified several IoT-related cases in ITMSs, such as connected and autonomous vehicles, cooperative traffic networks, and smart roads, to improve data dissemination, create heterogeneous connections in high-capacity environments, and support low-latency applications. The authors of [15] integrated Data Mining (DM) technology and distributed parallel Hadoop technology to realize an ITMS architecture, analyzed its running state, and applied the IoT to obtain data from the perception layer. The simulation results showed that the system had good prediction performance. Lv et al. (2020) [16] used the DL method to study and simulate ITMS and improve its safety, and statistically analyzed the system's data transmission performance, accuracy prediction performance, and path change strategy. The simulated data transmission showed a 100% success rate. Kaffash et al. (2021) [17] applied big data algorithms to further improve the intelligence of applications in the transportation field by reviewing the literature on big data algorithms and their applications in ITMS. The research outcome identified the research gaps and future directions. 
To sum up, scholars have conducted extensive research and used different methods to study the IoT-based ITMS, as encapsulated in Table 1. The authors of [18] proposed an automatic data acquisition and annotation strategy based on target detection to obtain various fine-grained models and quickly label vehicle images. They proposed a dense attention network because the traditional DL network suffers serious attenuation of feature information transmission between layers, lacks mutual dependence between features, and ignores important regions and features. Kwan et al. (2020) [19] constructed a DL tool, You Only Look Once (YOLO), for target detection and classification based on real-time DL, pixel-coded exposure measurement, and vehicle detection and classification. They conducted many experiments and proved the feasibility of the real-time framework, which could recognize objects and their positions in the image and quickly achieve good detection and classification performance.
Considering the time dependence of sensor data, Li et al. (2020) [20] established a stacked Long Short-Term Memory (LSTM) model to detect vehicle maneuvers. The system's performance was compared with previous studies and various Machine Learning (ML) methods, finding that the system accurately detected different vehicle maneuvers. Yu et al. (2020) [21] established a DL-based traffic safety solution for automatic and manual hybrid vehicles in the 5th Generation Mobile Communication Technology (5G) ITMS. In this scheme, the driving trajectory dataset and the natural driving dataset were used as the network input of the 5G-enabled LSTM ITMS. The softmax function calculated the probability matrix of each intention, which effectively improved the accuracy and real-time performance of intention recognition and helped solve the lane change problem in the mixed traffic environment.
Through the above analysis, the application of DL in intelligent traffic is expanding. Vehicle-oriented IR and VA technology can effectively realize real-time vehicle tracking and management. However, few studies address vehicle positioning and type recognition using DL. Therefore, DL-based CNN is of great value for identifying vehicle type and location.

B. Functional modules of the traffic IAS
The primary module in the traffic IAS is the video acquisition and transmission module. It collects video images from the traffic scene through the acquisition device and transmits them to the video processor for processing. Acquisition range and image quality are two main concerns in traffic video acquisition modules. Obviously, a small detection range, blurred or occluded video images, and images with poor contrast will increase image processing difficulty. Thus, there is a need to adjust system parameters during acquisition, such as adjusting the camera's aperture and shutter.
The video processing module is at the core of the entire Traffic IAS. It directly or indirectly collects traffic information through vehicle detection and tracking, so the accuracy of vehicle positioning and recognition will have a direct impact on the acquisition of traffic information signals. Therefore, the present work focuses on vehicle positioning and recognition.

III. METHODOLOGY

A. Convolutional Neural Networks
CNN is a feedforward Deep Neural Network (DNN). Its essence is the combination of a classifier and a feature extractor. After multi-feature learning from the input image, a set of feature vectors is extracted; these feature vectors are close representations of the original image. Finally, the feature vectors are input into the classifier to classify and recognize the data [22,23]. As in Fig. 3, a CNN is generally used for IR and comprises input, convolution, down-sampling, fully connected, and output layers.
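As an illustration, the layer types just described can be sketched as a toy numpy forward pass: convolution, activation, down-sampling, and a fully connected classifier. This is a minimal didactic sketch with assumed sizes, not the paper's actual network.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Valid 2D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling (the down-sampling layer)."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def relu(x):
    return np.maximum(x, 0.0)

# Toy forward pass: conv -> ReLU -> pool -> fully connected class scores.
rng = np.random.default_rng(0)
image = rng.random((8, 8))                        # assumed 8x8 input
kernel = rng.standard_normal((3, 3))              # one shared 3x3 kernel
features = max_pool(relu(conv2d(image, kernel)))  # 6x6 map pooled to 3x3
w_fc = rng.standard_normal((2, features.size))    # 2-class classifier weights
scores = w_fc @ features.ravel()                  # one score per class
```

The pooled feature map plays the role of the "feature vector" that the fully connected classifier consumes.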
Eq. (2) demonstrates the feature calculation of the convolution layer. The down-sampling layer contains the same number of feature maps as the preceding convolution layer. It divides the input into many blocks, obtains the value of each block by applying the chosen sampling method over its pixels, adds a bias, and finally outputs through an activation function [24,25]. Down-sampling makes the features more robust against deformation: if the local features do not change after deformation, the uniformly sampled features remain consistent. Meanwhile, the size of the feature map is reduced, which greatly reduces the amount of data required for subsequent processing and improves training efficiency [26]. The pooling layer comes after the convolution layer and is also unique to CNN; the difference between the pooling and convolution layers is that the former applies no activation function [27]. The convolution and pooling layers can be repeatedly combined within the hidden layers as actual applications require [28]. The fully connected layer follows the convolution and pooling layers; in general, it consists of a sigmoid function or a Radial Basis Function (RBF) at the last stage of the network, and Eq. (4) calculates the sigmoid output. Down-sampling layers greatly reduce the number of training parameters in the CNN, and the CNN is trained by the backpropagation algorithm. Generally, feature extraction consumes substantial time and computation; down-sampling reduces the computational complexity, so the CNN is constructed with a supervised learning algorithm [29-31]. Then, the Gradient Descent Method (GDM) is used to correct the error; each gradient step requires a forward and a backward pass, so training takes considerable time. Fig. 4 shows the training process of the CNN. CNN is a good DL method in image processing, which helps extract image features well. 
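The equations referenced above do not survive in this text. For reference, the standard textbook forms such CNN derivations usually take (assumed here, not reproduced from the original) are:

```latex
% Convolution-layer feature map (the form usually given as Eq. (2)):
x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)

% Down-sampling (pooling) layer with multiplicative and additive bias:
x_j^{l} = f\big(\beta_j^{l}\,\mathrm{down}(x_j^{l-1}) + b_j^{l}\big)

% Sigmoid activation of the fully connected output (cf. Eq. (4)):
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Here $x_j^{l}$ is the $j$-th feature map of layer $l$, $M_j$ the set of input maps, $k_{ij}^{l}$ the shared convolution kernel, $b_j^{l}$ the bias, and $\mathrm{down}(\cdot)$ the block sampling operation.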
Its core method is to use convolution kernels to filter the image, extracting features from every region and orientation of the image so that many features can be obtained [32,33]. Many features can be quickly extracted from the image by training only a few convolution parameters. Remarkably, weight sharing optimizes the CNN by iteratively training the parameters in the convolution kernel with the back-propagation method.
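Weight sharing is why "only a few convolution parameters" suffice: every position of the image reuses the same kernel. The arithmetic below illustrates the saving on an assumed 32x32 single-channel input (the sizes are illustrative, not from the paper).

```python
# Parameter-count comparison on an assumed 32x32 single-channel input.
input_pixels = 32 * 32

# Fully connected layer: every hidden unit connects to every pixel.
hidden_units = 100
fc_params = input_pixels * hidden_units + hidden_units  # weights + biases

# Convolutional layer: 100 feature maps, each sharing one 5x5 kernel
# across all spatial positions, plus one bias per map.
feature_maps = 100
conv_params = feature_maps * (5 * 5 + 1)

print(fc_params)    # 102500
print(conv_params)  # 2600
```

The convolutional layer needs roughly 40 times fewer parameters here, which is the efficiency gain the text attributes to weight sharing.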

B. Vehicle's location and bi-classification research based on Multi-Task (MT)-CNN
When the scene is complex, the vehicle is not easily identifiable from an image. To this end, it is necessary to combine image classification with the vehicle location task to recognize the vehicle. This section uses a multi-task DL model for IR and classification. MT-CNN obtains stronger learning capability through related tasks: the method can distinguish the differences between tasks while sharing relevant features. Fig. 5A shows a single-task DL diagram, and Fig. 5B a multi-task one. In the process of vehicle positioning and feature extraction, GooGleNet is adopted because it uses many 1*1 convolution kernels to enhance nonlinear fitting and reduce dimensionality. Generally, GooGleNet has three outputs, of which the last layer's output is usually optimal. Therefore, the output of the last layer of the 22-layer network structure is used, with the position output added. Fig. 6 demonstrates the MT-GooGleNet learning algorithm model. Two training methods are compared: the first fully trains MT-GooGleNet, and the second fine-tunes it, using its initial parameters to initialize all the fully connected layers and the last two convolution layers of MT-GooGleNet before the binary classification training. The binary classification experiment is carried out with both methods, and a comparative analysis is then conducted.
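The paper does not state its loss function, but a multi-task model with a classification output and a position output is typically trained on a joint objective. The sketch below shows one common form, cross-entropy for the vehicle/background decision plus a squared-error term for the position output; the function names, weighting `alpha`, and toy values are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def multitask_loss(cls_logits, cls_label, box_pred, box_target, alpha=1.0):
    """Joint objective sketch: cross-entropy for the binary
    vehicle/background classification plus a squared-error term
    for the position output, weighted by alpha."""
    p = softmax(cls_logits)
    ce = -np.log(p[cls_label])                    # classification term
    reg = np.mean((box_pred - box_target) ** 2)   # position term
    return ce + alpha * reg

# Assumed example: logits favouring "vehicle", a near-correct box (x, y, w, h).
loss = multitask_loss(
    cls_logits=np.array([2.0, 0.1]),   # [vehicle, background]
    cls_label=0,
    box_pred=np.array([0.50, 0.50, 0.30, 0.20]),
    box_target=np.array([0.48, 0.52, 0.30, 0.20]),
)
```

Sharing the backbone while summing the two task losses is what lets the related tasks reinforce each other, as described above.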

C. Identification of vehicle's type based on MT-CNN
Then, to collect comprehensive information on the driving process, the present work uses the MT-GooGleNet model to locate the vehicle in the image and applies the vehicle area cropped from the image as the input of the final classification model. In essence, MT-GooGleNet locates the vehicles first, and the positioning results then serve as the input of a second CNN that classifies the vehicle types. Two improvements are made on GooGleNet. The first is data enhancement, that is, mirroring the training set before using GooGleNet to classify the original images. The second is pre-training, that is, initializing GooGleNet with the multi-task DL model parameters before classification on the original images; the models are then compared and analyzed. The vehicle type recognition data are chosen from the Stanford University public database and the Google road recognition open dataset, divided into 196 categories with 16,185 images in total. After data enhancement and model pre-training, the vehicle appearance signal is recognized and processed. The vehicle image is collected by road monitoring equipment, and the vehicle type and position coordinates are returned. The classified vehicle images are input into the model for unified processing. Here, "vehicle identification" refers to the identification and processing of vehicle appearance and data signals, while "vehicle position identification" refers to the road signal identification of vehicles; the two constitute the data acquisition and processing part of the proposed model. CNN is usually integrated with LSTM to maintain a certain time dependence and correlation. Further, to use the CNN alone for image feature extraction, it is necessary to fine-tune the application and output layers of the CNN model and add the method of the LSTM model's TimeDistributed layer to the CNN model to improve recognition accuracy. In this case, the model is applied repeatedly to the data of multiple input time steps, and a series of image interpretations or image features is provided to the CNN model in turn.
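The two-stage pipeline above (locate, crop the vehicle region, then classify, with mirroring as data enhancement) can be sketched as follows. The box format `(x, y, w, h)`, the toy frame, and both helper names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def crop_vehicle(image, box):
    """Cut out the region returned by the localization stage.
    box = (x, y, w, h) in pixel coordinates."""
    x, y, w, h = box
    return image[y:y+h, x:x+w]

def mirror_augment(images):
    """Data enhancement by horizontal mirroring: doubles the set."""
    return images + [img[:, ::-1] for img in images]

# Assumed toy frame: stage 1 yields a bounding box, stage 2 would
# classify the cropped (and optionally mirrored) region.
frame = np.arange(100).reshape(10, 10)
crop = crop_vehicle(frame, (2, 3, 4, 5))   # 5 rows x 4 columns
train_set = mirror_augment([crop])          # original + mirrored copy
```

Feeding only the cropped region to the second classifier is what removes most of the background complexity the text identifies as the main difficulty.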

V. RESULTS AND DISCUSSION
A. Analysis of the accuracy of vehicles' positioning

As in Fig. 7A, the accuracies of the three models decline as the overlapping rate increases, while MT-GooGleNet has the highest positive sample accuracy among them. As in Fig. 7B, when the overlapping rate is 0.7, the negative sample error rates of MT-GooGleNet, MT-VGGNet, and MT-CaffeNet are 0.03%, 0.04%, and 0.04%; when the overlapping rate is 0.8, they are 0.06%, 0.12%, and 0.13%; when the overlapping rate is 0.9, they are 0.1%, 0.38%, and 0.4%. Thus, as the overlapping rate increases, the error rates of all three models rise, while MT-GooGleNet keeps the lowest negative sample error rate. To sum up, MT-GooGleNet performs best because the 1*1 filter enhances the model's nonlinear fitting, thus promoting its prediction accuracy.
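The overlapping rate used to grade detections is conventionally the Intersection-over-Union between a predicted box and the ground truth; a minimal sketch, with the corner-coordinate box format and example values assumed for illustration:

```python
def overlap_rate(box_a, box_b):
    """Intersection-over-Union between two (x1, y1, x2, y2) boxes,
    i.e. the overlapping rate used to grade positive/negative samples."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted by 1 px against a 10x10 ground-truth box:
iou = overlap_rate((0, 0, 10, 10), (1, 1, 11, 11))
```

Raising the overlapping-rate threshold from 0.7 to 0.9 demands ever tighter localization, which is why all three models' error rates climb in Fig. 7B.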

C. Recognition Results of vehicles' type
The fine-tuned GooGleNet, CaffeNet, and VGGNet are directly used for classification on the original images. Fig. 9 reveals that the total number of iterations is 200,000 and the initial learning rate is 0.005. Obviously, the network converges at the 10,000th iteration, and data enhancement and pre-training accelerate the convergence. Meanwhile, vehicle identification reaches the highest accuracy of 87.2% after data enhancement and pre-training; more data help to train a network with higher vehicle identification accuracy. The MT-GooGleNet model has a testing set accuracy of 70.14% with only pre-training and of 79.96% with both data enhancement and pre-training. Therefore, the accuracy of the MT-GooGleNet model is greatly improved after processing. Fig. 11B suggests that, on the same dataset, the test time on CPU and GPU differs across network structures; thanks to the GPU, the test speed of the model improves significantly.

VI. CONCLUSION
Vehicle information acquisition from complex scenes is extremely difficult. Accordingly, the present work aims to collect comprehensive on-road vehicle driving information, and an MT-GooGleNet model is proposed to locate and recognize vehicles in images. Firstly, the experimental results of vehicle location and recognition show that, compared with other methods in the literature, the proposed MT-GooGleNet model achieves the best effect and high positioning accuracy. Secondly, the experimental results of the fine-tuned MT-GooGleNet model are significantly better than those of the fully trained model, indicating that parameter initialization has a significant impact on the training results. Thirdly, the results of vehicle identification show that, after data enhancement and pre-training, the testing set accuracy of the MT-GooGleNet model is 79.96%; the accuracy of the MT-GooGleNet model is therefore greatly improved after processing. The research findings provide a reference for establishing IASs in the future. The guiding principles of the research are comprehensive data collection and accurate model construction. However, the research has some deficiencies, mainly that the experimental dataset used has a single source and does not contain enough image types, which is the main limitation imposed by the experimental environment. Therefore, future research will further optimize the system structure, optimize the data processing flow, and improve the recognition accuracy and the signal acquisition efficiency of the proposed traffic IAS by selecting larger datasets and deeper CNNs.