PG-YOLO: A Novel Lightweight Object Detection Method for Edge Devices in Industrial Internet of Things

With the rapid development of Industrial Internet of Things (IIoT) technology, video surveillance devices and video data in IIoT environments are growing massively and becoming increasingly important. Deploying rapidly evolving deep learning-based object detection algorithms in IIoT can improve the efficiency of video data utilization and increase the automation and intelligence of IIoT systems. To avoid the transmission latency of massive video data, these algorithms are best deployed on edge devices. However, the large size and high computing power requirements of existing object detection algorithms make them difficult to deploy on edge devices with limited performance. To solve this problem, we propose a smaller and less computationally intensive object detection algorithm, PG-YOLO. We replace all the convolutional modules of YOLOv5 with low-cost Ghost modules and streamline the backbone network to improve network performance. We then propose an improved pruning algorithm to compress the network, and finally recover accuracy by distillation to obtain PG-YOLO. PG-YOLO is smaller and faster and can be deployed on edge devices. In addition, we conducted experimental validation on the SHWD dataset. The experimental results show that PG-YOLO compresses the model volume by 9 times compared with YOLOv5s, while the detection accuracy after compression reaches 0.934, a loss of only 0.1%. The inference time is reduced by 32.7%, and the inference speed is improved by 10 FPS. PG-YOLO also compares favorably with other object detection algorithms.


I. INTRODUCTION
As a typical application of the Internet of Things (IoT), the Industrial Internet of Things (IIoT) is the implementation of IoT in various manufacturing systems [1], introducing a large number of IoT devices and computing nodes into production lines and manufacturing processes to enable the monitoring of industrial systems [2]. In the process, surveillance devices and the high-quality video data generated in IIoT environments have grown massively and continuously. Computer vision and deep learning-based algorithms have shown great potential for video and image processing. In particular, deep learning-based object detection algorithms have made significant progress in the accuracy and speed of video data processing [3]. Such algorithms are also well suited to many industrial applications such as quality monitoring, classification counting, and security management. Therefore, integrating deep learning-based object detection into IIoT systems can greatly enhance their automation and intelligence [4].
Since the deep learning-based object detection algorithm requires high computing power to analyze the data, it is usually deployed in a cloud server with high computing power.
After the video data generated by surveillance devices is transferred to the cloud server, the object detection algorithm then processes it. With the massive increase in the volume of high-quality surveillance video data brought about by the explosive growth of surveillance devices in IIoT environments, the traditional cloud computing model is facing increasing challenges. On the one hand, there is an increasing gap between the data processing capacity of cloud computing and the speed of video data generation. On the other hand, the transfer of large volumes of video data to servers can lead to network congestion and affect IIoT systems that are usually very latency sensitive [5]. While deploying object detection algorithms to the edge is a natural solution, it introduces new challenges due to the limited computational power of edge nodes [6].
In order to obtain higher accuracy, the general trend in object recognition is to build deeper and more complex networks. In particular, self-attention object detection algorithms containing the Transformer structure have further improved recognition accuracy [7], [8], [9], [10], [11], [12]. However, the larger networks that come with increased accuracy do not improve recognition speed. Due to the limited computational power and storage space of edge nodes, deploying object detection algorithms at the edge requires the algorithm itself to be sufficiently lightweight. Existing deep learning-based object detection algorithms have difficulty meeting this requirement, with most networks being oversized in parameter count, computation, and model size. Even some lightweight algorithms are still too large in terms of the number of parameters and the size of the weight files, while some overly small models sacrifice accuracy to the point where they are difficult to deploy in practice. To solve this problem, there is an urgent need for lightweight deep learning-based object detection models. Therefore, we propose PG-YOLO, which is substantially compressed in size with little loss of accuracy and can meet the practical requirements for deployment at the edge.
There are currently three approaches to lightening the neural networks used for object detection: (1) choosing an object detection framework that is advantageous in terms of volume, such as a one-stage algorithm; (2) using a more compact model; (3) compressing the neural network model. This paper combines the above three methods and proposes a more lightweight object detection network model. We have optimized and compressed the network based on the excellent object detection algorithm YOLOv5s. First, we replaced all the convolutional modules in the original network with the Ghost module, a module with much smaller computation and volume, and optimized the redundant parts of the network structure. Then we obtained the extremely light model PG-YOLO by sparse training and pruning. Finally, we used knowledge distillation to improve the final accuracy of PG-YOLO, so that PG-YOLO combines lightness with network accuracy. Due to the great success of deep learning-based image classification, object detection techniques using deep learning have also seen research progress in recent years and have been applied in different fields [13], [14]. The major deep learning-based object detection frameworks can be divided into two categories. One is two-stage detectors, such as region-based CNN (R-CNN) [15] and its variants [9], [16]. The other is one-stage detectors, such as YOLO (You Only Look Once) [7] and its variants [8], [10], [17]. The two-stage detector first uses a region proposal network (RPN) to extract candidate object information, and then the detection network completes the prediction of the location and class of the candidate objects, which usually achieves higher localization and detection accuracy [3].
The one-stage detector does not need an RPN but generates the object location and category information directly through the network, making it an end-to-end object detection algorithm. Therefore, one-stage detection algorithms are usually faster [7], [18]. Two-stage detection algorithms are more accurate but have slower detection speed. In contrast, one-stage detection, although slightly less accurate, has advantages in speed and model size that meet the needs of this paper. Therefore, a one-stage detector will be used as the basis for enhancement in this paper. Among one-stage detectors, the YOLO series is representative [3]. The YOLO series algorithms have reached the current mainstream level in speed, size, and accuracy, with YOLOv4 and YOLOv5 the most widely used. YOLOv4 made many improvements over previous one-stage algorithms, absorbing many excellent ideas such as data augmentation and the CSP structure, and related algorithms have many applications in industry [19], [20]. YOLOv4-tiny, introduced after YOLOv4 and optimized for edge-side devices, greatly improved the real-time performance of the model, but there is still room for optimization. YOLOv5 was released later and is very similar to YOLOv4 in architecture. While YOLOv4 slightly outperforms YOLOv5 in accuracy, it is less flexible and slower than YOLOv5, and YOLOv5 has since been further optimized for accuracy. This makes YOLOv5s, the smallest of the four versions of YOLOv5, a good balance of speed and accuracy. Therefore, this paper will be based on YOLOv5s for improvement.
In addition, the recently proposed Detection Transformer (DETR) [21] is the first fully end-to-end target detector. By utilizing Transformers [22], originally developed for language tasks, the final set of predictions can be output directly without further post-processing. Although DETR performs relatively well in terms of accuracy, it has disadvantages that cannot be ignored. On the one hand, its structure gives it relatively low performance in detecting small objects. On the other hand, and its main drawback, DETR requires a long training time to converge. Therefore a large amount of research work has focused on how to alleviate this problem [23], [24]. There are also some studies that attempt to introduce Transformer structures and ideas in the original target detection framework [25]. Although these works have made some progress, Transformer-based object detection algorithms still have some distance to go in practical applications.
Due to the need to deploy neural networks on embedded devices, a series of compact models have been proposed in recent years. MobileNet [26] is a mobile neural network model proposed by Google in 2017. It mainly uses depthwise separable convolution to reduce computation. MobileNetV3 [27] is the third version of the MobileNet series; in addition to absorbing the inverted-residual bottleneck structure of MobileNetV2 [28], it makes further optimizations: (1) searching the entire network architecture using NAS, (2) searching the appropriate network width using NetAdapt, (3) introducing an attention mechanism in the SE module, and (4) introducing the h-swish activation function. ShuffleNet introduces a channel shuffle operation to improve information exchange between channel groups, and ShuffleNetV2 [29] further considers the actual speed on the target hardware in its compact design. In addition, to reduce the computational consumption of neural networks, [30] proposes the Ghost module to build an efficient convolutional module. This module divides the original convolutional layer into two parts: first, fewer convolutional kernels are used to generate a small number of intrinsic feature maps, and then a simple linear transformation efficiently generates the ghost feature maps. Experimentally, GhostNet achieves the best detection accuracy and compression compared with other models, even surpassing MobileNetV3 [30].
The main model compression methods currently used include knowledge distillation, low-rank decomposition, parameter quantization, and parameter pruning. Knowledge distillation [31] is relatively complex to train, and most knowledge distillation methods target image classification tasks rather than object detection. Quantization, low-rank decomposition, and pruning all transform a pre-trained large model into a lightweight small model, reducing the network parameters and computational effort. Among them, pruning ranks the importance of units at different granularities in the network and prunes the unimportant units, maintaining network performance with more flexibility [32].

II. RELATED THEORY
Here we briefly introduce the theory related to this paper.

A. YOLOv5 NETWORK
YOLOv5 is an object detection network proposed by the company Ultralytics. It takes both detection efficiency and accuracy into account and is a real-time, efficient one-stage object detection algorithm. Based on the original YOLO framework, the algorithm optimizes the backbone for feature extraction, the neck network for feature fusion, and the prediction head output for classification and regression, integrating excellent deep convolutional neural network algorithms and models of recent years. YOLOv5 is available in 4 sizes with different volumes, and YOLOv5s has the smallest network volume.
The YOLOv5s network mainly uses the BottleneckCSP structure, which is divided into two parts. The first part performs the Bottleneck operation, a classical residual structure that applies a 1 × 1 and a 3 × 3 convolution and adds the convolution result to the input. The other part reduces the dimension via a 1 × 1 convolution, halving the number of channels. In the YOLOv5 series, the network structure size is controlled by two parameters, depth and width, so there are 4 models of YOLOv5 that differ in network width and depth. Fig. 1 shows the network structure of YOLOv5.
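As a rough illustration of the structure above, the sketch below counts the weights of the Bottleneck residual (a 1 × 1 convolution halving the channels, then a 3 × 3 convolution back) and shows how a width multiple scales a layer's channel count. The function names and the round-to-a-multiple-of-8 rule are assumptions based on common YOLOv5 configurations, not code from this paper.

```python
def conv_params(c_in, c_out, k):
    """Weights of a plain k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def bottleneck_params(c):
    """1x1 conv down to c/2 channels, then 3x3 conv back up to c."""
    hidden = c // 2
    return conv_params(c, hidden, 1) + conv_params(hidden, c, 3)

def scaled_channels(c, width_multiple):
    """Round a scaled width to a multiple of 8, as common YOLOv5 configs do."""
    return max(8, int(round(c * width_multiple / 8)) * 8)

print(bottleneck_params(64))      # weights of one Bottleneck at c = 64
print(scaled_channels(64, 0.50))  # width multiple used by the smallest variant
```

The depth parameter scales the number of stacked Bottlenecks in the same linear way, which is why the four YOLOv5 variants differ so sharply in size.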

B. GHOST MODULE
[30] observes that well-trained networks usually have rich or even redundant feature map information to ensure understanding of the input, and that this redundant feature map information can be obtained from other feature maps using an inexpensive linear transformation, which reduces the overall computational consumption. The specific structure of the Ghost module is shown in Fig. 2.
In ordinary convolution, the transformation from the input X of size h · w · c to the output of size h′ · w′ · n requires n · h′ · w′ · c · k · k FLOPs, where k is the kernel size. This number is usually large because the number of input channels c and the number of output feature maps n are usually very large (e.g., 256 or 512).
FIGURE. The process of pruning. Sparse training adds a scaling factor to each channel, which measures the importance of the channel. Channels with small scaling factor values (green) will be pruned to obtain compact models.

The output feature maps of a convolutional layer usually contain many redundancies, some of which may be similar to each other. Therefore, a cheap conversion can be made
from the original feature maps to these similar redundancies. The cheap conversion is expressed here as Φ. It has been experimentally verified in [30] that a convolution with a kernel size of 3 · 3 is the most suitable linear transformation. The theoretical speedup ratio for replacing the ordinary convolution with the Ghost module is

r_s = (n · h′ · w′ · c · k · k) / ((n/s) · h′ · w′ · c · k · k + (s − 1) · (n/s) · h′ · w′ · d · d) ≈ (s · c) / (s + c − 1),

where d · d has a magnitude similar to that of k · k, and s = n/m is the ratio between the total number of feature maps n and the number of intrinsic feature maps m. Since s ≪ c, the equation can be approximated as r_s ≈ s, and the final compression ratio is likewise approximately s, which means the smaller the value of m, the more the computation is compressed. However, excessive compression cannot guarantee that the ghost feature maps contain sufficiently rich features. Therefore, [30] set s in the range {2, 3, 4, 5} and conducted a comparison experiment. As s increases, the FLOPs of the Ghost method decrease significantly, but the accuracy also decreases gradually. When s = 2, the Ghost method performs optimally and even slightly better than the original model. Based on this result, this paper also sets s to 2.
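The speed-up analysis above can be checked numerically. The sketch below (names and shapes are illustrative) compares the FLOPs of an ordinary convolution with those of a Ghost module for several values of s, with d = k = 3:

```python
def conv_flops(n, h, w, c, k):
    # ordinary convolution: n output maps, each h*w, from c inputs, k x k kernels
    return n * h * w * c * k * k

def ghost_flops(n, h, w, c, k, s, d):
    m = n // s                            # intrinsic feature maps
    primary = m * h * w * c * k * k       # ordinary conv producing m maps
    cheap = (s - 1) * m * h * w * d * d   # cheap d x d linear ops per intrinsic map
    return primary + cheap

n, h, w, c, k, d = 256, 40, 40, 256, 3, 3
for s in (2, 3, 4, 5):
    ratio = conv_flops(n, h, w, c, k) / ghost_flops(n, h, w, c, k, s, d)
    print(s, round(ratio, 2))   # the ratio approaches s as c grows
```

With c = 256, the computed ratios sit just below each s, matching the approximation r_s ≈ s · c / (s + c − 1).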

C. MODEL COMPRESSION
At present, the main methods of model compression include knowledge distillation, low-rank decomposition, parameter quantization, parameter pruning [33], channel pruning [34], and so on. Different methods are applied to different situations [35], [36]. In this paper, channel pruning and knowledge distillation are selected to compress the model.
Channel pruning [37], [38] starts from a trained model and performs sparse training on the BN layers in the network. The channels are then sorted according to the scaling weights learned during sparse training. Finally, an appropriate threshold or pruning rate is set, and channels whose weights fall below the set range are pruned.
The BN layer, i.e., batch normalization, has been adopted as a standard component by most convolutional neural networks to achieve fast convergence and better generalization. Moreover, the way BN normalizes activations makes it possible to obtain the scaling factor of each channel in a simple and efficient way. In sparse training, the γ parameter in the BN layer is used as the scaling factor needed to sparsify the network, and all convolutional layers are trained sparsely. With sparse training, we can filter out the channels with low performance contribution and remove them. Although removing these channels causes a loss of precision, the loss can be recovered by fine-tuning, and the model is greatly compressed.
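As a minimal sketch of this selection step (not the authors' code; the γ values below are made up), the channels of all layers can be ranked by their BN scaling factors and cut at a global percentile threshold:

```python
def global_threshold(gammas_per_layer, prune_ratio):
    """Global cut point: the prune_ratio percentile of all gamma values."""
    all_gammas = sorted(g for layer in gammas_per_layer for g in layer)
    idx = int(len(all_gammas) * prune_ratio)
    return all_gammas[min(idx, len(all_gammas) - 1)]

def keep_masks(gammas_per_layer, threshold):
    """Per-layer boolean masks: True = channel survives pruning."""
    return [[g >= threshold for g in layer] for layer in gammas_per_layer]

gammas = [[0.9, 0.01, 0.5, 0.02], [0.03, 0.7, 0.0, 0.6]]
th = global_threshold(gammas, 0.5)   # prune the lowest 50% of channels
print(keep_masks(gammas, th))
```

In a real network the surviving mask would then be used to slice the convolution weights feeding into and out of each BN layer.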
Knowledge distillation is another common approach in model compression and was first applied to classification tasks [31]. Unlike pruning and quantization in model compression, knowledge distillation is done by constructing a small lightweight model and using supervised information from a larger model with better performance to train this small model in order to achieve better performance and accuracy. This large model we call the teacher model, and the small model we call the student model. The teacher model outputs supervised information, and the student model learns to migrate supervised information from the teacher model, a process known as knowledge distillation. The training accuracy of the teacher model is higher than that of the student model, and the larger the difference between the two, the more obvious the distillation effect is. By distillation, we can get a model with a small volume and high accuracy.

III. NETWORK OPTIMIZATION AND COMPRESSION
In order to solve the problem of insufficient computing power to deploy large networks on mobile and embedded devices, we propose the PG-YOLO network by combining the relevant theories and considering weight size, computation, detection speed, and accuracy. First, we use Ghost bottlenecks to replace all the convolutional modules in YOLOv5s to obtain an initial lightweight network model. We then modify the backbone network by selectively removing inefficient structures, and use the R-pruning method to further compress the network. After compression, we further improve the accuracy of the network with distillation, so that the accuracy remains high after several-fold compression.

A. NETWORK OPTIMIZATION
Due to its excellent performance, YOLOv5s is chosen as the basis of algorithm improvement to ensure that the final algorithm obtains good accuracy and speed. The Ghost module can complete information extraction at a lower computational and storage cost through cheap operations and identity mappings. Therefore, replacing the normal convolution modules in YOLOv5s with the better-performing Ghost convolution modules can significantly reduce the size and computation of the network and complete its lightweighting.
The Ghost bottleneck is a stack of Ghost modules; its specific structure is shown in Fig. 4. Built on the Ghost module, the Ghost bottleneck consists mainly of two stacked Ghost modules. The first layer acts as an expansion layer to increase the number of channels. The second Ghost module reduces the number of channels to match the shortcut path.
Then, the inputs and outputs of these two Ghost modules are connected by a shortcut. Taking advantage of MobileNetV2, no ReLU is used after the second Ghost module, while batch normalization (BN) and ReLU nonlinear activation are applied after each of the other layers. For the case of stride = 2, the shortcut path is implemented by a downsampling layer and a depthwise convolution with stride = 2, as shown in Fig. 4.
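A simple channel-bookkeeping sketch of the Ghost bottleneck described above (the layer names and channel counts are illustrative assumptions, not from the paper):

```python
def ghost_bottleneck_channels(c_in, c_mid, c_out, stride):
    """Trace the channel count through the main path of a Ghost bottleneck."""
    trace = [("ghost_expand", c_mid)]          # first Ghost module widens
    if stride == 2:
        trace.append(("dw_conv_s2", c_mid))    # depthwise conv downsamples
    trace.append(("ghost_project", c_out))     # second Ghost module matches shortcut
    return trace

print(ghost_bottleneck_channels(16, 48, 16, stride=1))
print(ghost_bottleneck_channels(16, 72, 24, stride=2))
```

The shortcut branch carries c_in channels, so c_out must equal the shortcut's output width for the element-wise addition to be valid.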
The specific structure of the C3 module in the original YOLOv5 is three normal convolutions and a CSP bottleneck block. We replace the CSP block with a Ghost bottleneck; the replaced C3 is denoted C3Ghost. C3Ghost can still perform the C3 module's function of learning residual features, but it is faster and smaller. In addition, the C3 module in YOLOv5 is an optimization of CSPNet, which splits the input and convolves only a portion of the features before the Concat. The second and third C3 modules of the backbone contain nine bottlenecks to enhance the feature extraction capability. However, we found that the number of these bottlenecks was too large and not cost-effective for model accuracy. Reducing the number of bottlenecks did not cause any loss of accuracy and reduced the number of parameters and the size of the network to some extent. So we reduced the number of Ghost bottlenecks in the second and third C3Ghost modules of the backbone and obtained LightC3.
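Since the weight cost of such a block grows roughly linearly with the number of stacked bottlenecks, the LightC3 trim can be illustrated with back-of-the-envelope figures (all numbers below are hypothetical, not measurements from the paper):

```python
def stacked_cost(per_bottleneck_params, n_bottlenecks, fixed_params):
    """Total weights of a C3-style block: fixed convs plus n stacked bottlenecks."""
    return fixed_params + n_bottlenecks * per_bottleneck_params

c3_with_3 = stacked_cost(20480, 3, 8192)   # hypothetical per-bottleneck / fixed costs
c3_with_2 = stacked_cost(20480, 2, 8192)
print(c3_with_3 - c3_with_2)               # weights saved by removing one bottleneck
```

Each removed bottleneck saves its full weight cost, which is why trimming the stack shrinks the model without touching the fixed convolutions.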
YOLO-Ghost is obtained by replacing the convolution modules and C3 modules of the original YOLOv5 with the Ghost module and the LightC3 module. The YOLO-Ghost network fully combines the accuracy advantage of YOLOv5 with the lightweight advantage of the Ghost module and achieves an initial compression compared with the original YOLOv5s. The complete network structure is shown in Fig. 5.

B. NETWORK COMPRESSION
The pruning algorithm is one of the model compression methods, and it can reduce the size of the network significantly. Pruning starts with sparse training [34], in which the BN layers [37] are trained with L1 regularization so that their weights converge to 0 as much as possible and the sparse weights are redistributed to the other effective layers of the network. The pruning algorithm then removes unnecessary channels based on the weights and thereby achieves model compression. Pruning can effectively compress the size of the network while maintaining accuracy as much as possible. Therefore, based on YOLO-Ghost, we use pruning for further network compression, taking YOLO-Ghost as the baseline for sparse training, pruning, and fine-tuning.
Sparse training directly affects whether the pruned model meets the required performance. The loss function of sparse training is as follows:

L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ) (3)

In (3), the first sum is the original loss function, the second sum is the added sparsity loss, and g(γ) is the penalty on the scaling factor that evaluates whether the scaling is reasonable; a common choice is g(γ) = |γ|, i.e., L1 regularization. λ is the sparsity coefficient, which adjusts the strength of sparsity training.
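A minimal numeric sketch of the loss in (3), assuming g(γ) = |γ| (L1) as in channel-level sparsity training; the base loss value and λ are placeholders:

```python
def sparse_loss(base_loss, gammas, lam):
    """Eq. (3) with g(gamma) = |gamma|: original loss plus L1 penalty on BN scales."""
    return base_loss + lam * sum(abs(g) for g in gammas)

gammas = [0.8, -0.05, 0.3, 0.0]
print(sparse_loss(1.25, gammas, lam=0.004))
```

During training the L1 term pushes unimportant γ values toward zero, which is what makes the later threshold-based pruning safe.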
By sparse regularization training, we make most of the γ factors of the BN layers in the model (both original and newly added) close to zero, which indicates that the contribution of the channels corresponding to these γ factors is reduced. We then prune the network by removing the low-contribution channels, together with their incoming and outgoing connections and the corresponding weight parameters. The threshold here is global across all layers and is defined as a percentile of all γ values. By doing this, we obtain a narrower network with fewer computational operations.
In general, the number of channels in each layer of a network is designed to be 2^N to fit the computational characteristics of the GPU, so that the network can efficiently utilize the GPU's computing power during inference. The ordinary pruning operation prunes the channels of each layer according to each channel's γ factor from small to large, and channels with a γ factor less than the threshold are pruned. However, this operation destroys the designed channel numbers of the network, leading to inefficient use of GPU computing power, so a smaller network is sometimes slower than a larger one at inference. Therefore we propose R-pruning (Recovery Pruning). After ordinary pruning, for each layer whose channel count does not meet 2^N, the deleted channel with the largest γ factor is restored repeatedly until the number of channels in that layer meets 2^N. The optimized pruning process is shown in Fig. 6.
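The recovery step can be sketched as follows (an illustrative re-implementation, not the authors' code), restoring the pruned channels with the largest γ until the kept count reaches the next power of two:

```python
def next_pow2(n):
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def r_prune(gammas, threshold):
    """Threshold pruning, then recover largest-gamma pruned channels up to 2^N."""
    kept = [i for i, g in enumerate(gammas) if g >= threshold]
    pruned = sorted((i for i, g in enumerate(gammas) if g < threshold),
                    key=lambda i: gammas[i], reverse=True)
    target = next_pow2(len(kept))
    kept += pruned[:target - len(kept)]        # restore best pruned channels
    return sorted(kept)

gammas = [0.9, 0.02, 0.6, 0.5, 0.01, 0.7, 0.03, 0.4, 0.005, 0.008]
print(r_prune(gammas, threshold=0.1))          # 5 survivors recovered up to 8
```

The recovered channels are the least harmful to re-add, since they had the largest scaling factors among those pruned.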
FIGURE 6. The process of R-pruning. We recover the channels (orange) with larger scaling factors after pruning so that the number of channels retained after pruning is equal to 2^N.

We use the set C to denote the channels retained after pruning at a certain layer, c_i denotes the i-th channel, and γ_i denotes the γ coefficient of the i-th channel. Channels with γ larger than the threshold γ_TH will be retained. In the optimized pruning algorithm, we recover several channels based on the ordering of γ, so that the number of retained channels satisfies 2^N. Although the model after the optimized pruning is slightly larger than after ordinary pruning, it has a significant speed advantage in actual deployment.
A large compression of the model inevitably results in a loss of accuracy, and although fine-tuning will restore some of the performance, there is still a loss of accuracy. Therefore, we use knowledge distillation to improve the accuracy of the model.
We train a teacher model on the dataset in advance and then use it for supervised training of the student model to achieve distillation. Specifically, when the student model is trained, the distillation loss function calculates the difference between the output predictions of the two models, and this is added to the loss of the student model as the whole training loss for the gradient update. Finally, higher performance and accuracy of the student model are obtained:

L = α L_distillation + β L_student (6)

After sparse training, pruning, and knowledge distillation, we obtained PG-YOLO. The accuracy of PG-YOLO after model compression exceeds that of the baseline YOLO-Ghost before compression. Compared with YOLOv5s, PG-YOLO compresses the volume several times over with almost no performance loss, which meets the requirement of lightweight deployment.
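The paper does not spell out the exact form of L_distillation, so the sketch below assumes a common soft-label formulation (temperature-softened outputs compared with a KL divergence) purely for illustration of Eq. (6); all values are placeholders:

```python
import math

def softmax(zs, t=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    m = max(z / t for z in zs)
    es = [math.exp(z / t - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(teacher_logits, student_logits, student_loss, alpha, beta, t=2.0):
    # soft-label term: divergence between temperature-softened teacher and
    # student outputs, combined with the student's own loss as in Eq. (6)
    l_d = kl(softmax(teacher_logits, t), softmax(student_logits, t))
    return alpha * l_d + beta * student_loss

print(distill_loss([2.0, 0.5], [1.5, 0.8], student_loss=1.2, alpha=0.5, beta=1.0))
```

When the student matches the teacher exactly, the distillation term vanishes and only the student's own loss remains.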

C. EXPERIMENTS AND DISCUSSION
The test environment of this experiment is the Ubuntu 16.04 operating system, and the PyTorch deep learning framework implements the compression algorithm of the PG-YOLO object detection model. The hardware configuration of the server is one NVIDIA RTX 3080 GPU and an Intel Core i7 CPU. We used the NVIDIA Jetson TX2 as the development board in order to perform detection on an edge device during the testing part of the experiment. The NVIDIA Jetson TX2 is an embedded vision computing system with a 256-core NVIDIA Maxwell GPU, a dual-core Denver2 CPU, and 8 GB of memory. At the same time, the TX2 has low power consumption and high performance, making it very suitable as a device for deploying algorithms at the edge.
We chose safety helmets, a target that may need to be detected in practical application scenarios of edge devices, to validate the algorithm. The experiments therefore use the helmet-related safety-helmet-wearing dataset (SHWD). The data are labeled in Pascal VOC format for training. According to the actual scene, 7571 pictures are used, including 5450 in the training set, 1515 in the test set, and 606 in the validation set. Fig. 7 shows images from the dataset and the corresponding labels.
First, we train YOLOv5s on this dataset without pre-trained weights for 100 epochs and record the results for comparison. Then we train YOLO-Ghost, obtained by replacing the convolution modules, with the epoch count and other settings consistent with those of YOLOv5s. Ghost convolution effectively reduces the computational effort of convolution through an inexpensive convolution operation. The ablation experiments in Table 1 show that applying the Ghost convolution module to YOLOv5s can significantly compress the size of the network and reduce its computational expenditure.
The C3 module is the main component of the backbone network in YOLOv5, which can accomplish feature extraction with less computation. However, in the original network structure, the second and third C3 modules of the backbone stack 3 Ghost bottlenecks, and we believe these bottlenecks are somewhat redundant. Therefore, we set the number N of these bottlenecks to N = {1, 2, 3, 4} and conducted a comparison experiment. The experimental results are presented in Table 2. The experiments show that the 3 bottlenecks in YOLOv5s do have redundancy, and the method performs best when the number of bottlenecks is 2, even slightly better than with 4 bottlenecks. We named the best-performing C3 module with N = 2 LightC3.
Sparse training is an important step in the pruning process, and selecting appropriate sparse training parameters affects the performance of the model after pruning. The purpose of choosing an appropriate sparsity coefficient is to select most of the channels that contribute little to network performance and cut them out in the pruning operation. In this way, a large degree of pruning can be completed while affecting performance as little as possible. If the sparsity factor is too large or too small, it cannot effectively filter channels. As shown in Fig. 8, when the sparsity factor is 0.002, the sparsity strength is small and cannot compress the channel weights to 0; therefore, when pruning, many channels with non-zero weights are pruned, which noticeably affects performance. When the sparsity factor is 0.006, most of the channel weights are compressed to 0, so even with a small pruning threshold, most of the channels are pruned. When the sparsity factor is 0.004, the sparsity is more appropriate, and most of the channel weights are close to 0; with a suitable pruning threshold, the channels with weights of 0 can be selectively pruned. Fig. 8 shows the distribution of weights after training with different sparsity coefficients: the orange color shows the weights that are retained, and the blue color indicates the weights that will be removed because they are close to 0. Through sparse training, most of the channel weights in the BN layers are close to 0.
Next, we prune the sparsely trained model. The optimized pruning algorithm ensures that the number of channels in each layer meets 2^N after pruning, which allows the pruned network to make the most efficient use of the GPU's computational power and provides a significant advantage in inference speed. The experimental results are shown in Table 3. The network after the ordinary pruning algorithm shows a small speed improvement, while the network after the R-pruning algorithm achieves a large speedup, even with a larger amount of computation (GFLOPs), because the optimized pruning algorithm makes the number of channels per layer satisfy 2^N.
After sparse training, we prune the network by different percentages. Networks with different pruning thresholds have different model sizes. Generally, the larger the network, the higher the accuracy, but for some pruning ratios the volume is greatly reduced while the precision loss is negligible. At 20% pruning, the accuracy of the model is close to the original YOLOv5s and the size is reduced to 4 MB, with a corresponding decrease in the number of parameters and computation. We want a model that balances volume and accuracy. While the maximum 60% pruning compresses the volume to 1.1 MB, the accuracy loss is large. Therefore, we choose 50% pruning and further improve the accuracy by knowledge distillation on this basis. Finally, we choose YOLOv5L, trained on this dataset, as the teacher network. The trained YOLOv5L has a volume of 89 MB and 46.5M parameters, while its mAP reaches 0.94. The pruned network is used as the student network. After distillation, the accuracy of the network improves. Detailed ablation experiments are presented in Table 5.
PG-YOLO thus achieves a remarkably small size while maintaining accuracy. We also tested other prevailing object detection algorithms and lightweight algorithms; the results are presented in Table 6. They show that PG-YOLO offers a good balance between accuracy and compactness: compared with YOLOv4, our method has very close accuracy and a huge advantage in volume, while compared with other lightweight algorithms, it leads in both accuracy and volume.
We then tested on edge devices, comparing several algorithms with PG-YOLO on an NVIDIA Jetson TX2. The results are shown in Table 7. PG-YOLO is the fastest, with an actual inference speed of 30.3 FPS, which meets the needs of real-world operation.
Compared with YOLOv5s, PG-YOLO has almost the same accuracy, while its volume is compressed by 8.75 times, its inference time is reduced by 32.7%, and its inference speed is improved by 10 FPS. This makes PG-YOLO well suited for deployment on edge devices with limited computing power and strict real-time requirements, and therefore for the edge devices that generate video data in an IIoT environment. PG-YOLO processes video data directly on IIoT edge devices with high accuracy in real time, improving the automation and intelligence of the IIoT system and avoiding the transmission latency of sending massive video data to the cloud.

IV. CONCLUSION
This paper investigates the problem of real-time processing of large amounts of video data in IIoT environments. Video data in IIoT can be processed with a high degree of automation and intelligence with the help of deep learning-based object detection algorithms. These algorithms need to be deployed on edge devices, because the cloud approach cannot meet the real-time performance required by IIoT; however, most of them are difficult to deploy on edge devices with limited performance. Therefore, this paper designs and validates PG-YOLO for edge devices in IIoT by replacing and optimizing the backbone network and combining it with an improved model compression method. Experiments show that PG-YOLO (1) reduces the model size and floating point operations (FLOPs), (2) reduces the inference time, and (3) sacrifices almost no accuracy. Therefore, PG-YOLO can be deployed on edge devices with limited performance in IIoT. It solves the practical problems faced by object detection algorithms in edge device applications and achieves high accuracy and real-time performance even on platforms with limited performance and storage space. In future work, one of our research directions is video data processing in IIoT, such as faster transmission or more efficient compression. On the other hand, the inference speed does not improve in proportion to the large compression of the volume, so how to further improve the inference speed is also a direction for our research.
CHENGXIN PANG (Member, IEEE) received the Ph.D. degree from the University of Technology of Troyes, Troyes, France, in 2008.
His research topics have included the first magneto-optic isolator at IEF, Paris-Sud University, Orsay, France, hybrid silicon photonics at Orange Laboratories, France, and work at LCF, IOGS, Palaiseau, France. His research interests include networking systems of artificial intelligence (AI) and artificial intelligence and the Internet of Things (AIoT).

XING HU received the Ph.D. degree from Shanghai Jiaotong University, in 2016. He is an Associate Professor with the University of Shanghai for Science and Technology. His research interests include computer vision, signal processing, and intelligent driving.

VOLUME 10, 2022