An Industrial Meter Detection Method Based on Lightweight YOLOX-CAlite

With the deployment of inspection robots, the demand for automatic detection and identification of pointer meters in pumping stations, substations, and laboratories has increased. This paper proposes the YOLOX-CAlite meter detection algorithm to address the shortcomings of existing detectors in this specific meter detection task. The core of YOLOX-CAlite is remedying the drawbacks of the original algorithm: a large backbone network, a large number of parameters, and a large computation volume. Data augmentation is applied on the input side. We removed the Focus structure from the original YOLOX and replaced it with a convolution-pooling form. The backbone was replaced with ShuffleNet V2, and the Ghost module was introduced into the neck to improve real-time performance. Meanwhile, the neck was changed to a BiFPN structure to enhance feature fusion, and coordinate attention was introduced in front of the detection head to enhance feature extraction. Finally, the binary cross-entropy loss in YOLOX was replaced by focal loss, and the multi-task joint loss of the algorithm was optimized using the complete-IoU (CIoU) loss function. Experimental results on our dataset show that the improved YOLOX-CAlite achieves an AP of 90.4%, an improvement of 1.4% and 2.2% over YOLOX-s and YOLOv5, respectively. The GFLOPs are reduced massively, from 26.6 to 8.8, a reduction of about 68%; the detection speed increases by 25% while the parameter size is reduced; and the model size shrinks from 17.2 M for YOLOX-s to 4.89 M, a reduction of about 71%. These improvements allow inspection robots to identify meters better in subsequent work.


I. INTRODUCTION
At present, digital instruments are widely used, but they are prone to failure in environments with electromagnetic interference. Pointer meters are highly immune to interference, low cost, and easy to maintain, and they are used on a large scale in a variety of industrial scenarios and laboratory environments [1].
Traditional meters are very precise and need to be checked for accuracy regularly. Inspection used to be performed manually, and manual collection no longer meets the demand for real-time data. Besides, subjective errors can lead to economic losses, which is why inspection robots [2] have been introduced to improve efficiency. However, inspection robots also have problems such as slow detection speed, low detection accuracy, missed detections, and false detections, and current robots are not yet well adapted to practical requirements [3].
The ability of the vision system to accurately identify meter readings is one of the criteria for the stable operation of intelligent inspection robots [4]. When the inspection robots carry out their work, the routes are pre-planned. The inspection robot takes images of the meters and identifies the meter information at the front end. In the vision system, instrument detection and recognition consist of two parts: positioning and reading. The accuracy of the meter positioning has a large impact on the overall detection and identification task. Positioning is part of target detection in computer vision. However, the specific nature of the environment in which the inspection robot works makes this problem different from the ordinary target detection problem.
The inspection robot first follows the inspection route to the designated image acquisition area. The camera on the robot then takes a picture of the meter to identify its position; finally, the image is used for reading identification. Ideally, the position of the meter at the same collection point is fixed across multiple inspection tasks. However, due to factors such as positioning errors in the inspection robot and mechanical errors in the camera, there are unpredictable errors in the actual images acquired, mainly small positional deviations. If a larger number of actual meter images can be acquired and the characteristics of various types of meter images with different degrees of offset can be sufficiently learned, the positioning accuracy can be improved. Most traditional target detection methods locate instruments by the grey-scale characteristics of the image. This results in poor robustness: they adapt only to certain specific conditions, are difficult to transfer from one type of meter to another, and involve a large number of parameters.
In the last few years, the main instrument localization algorithms proposed by relevant researchers include traditional detection algorithms based on feature detection and matching as well as deep learning-based detection algorithms. Hua et al. [5] used the Scale-Invariant Feature Transform (SIFT) algorithm to detect and match features of the instrument dial area. Gao et al. [6] used the Speeded-Up Robust Features (SURF) method to match against template images. With the rapid development of deep learning, the use of deep learning for instrument positioning is increasing. Zhou et al. [7] used a fully convolutional neural network to semantically segment instrument image data and accurately segment the position of instrument dials. Wang [8] used YOLOv2 to localize instruments, but its detection of small targets needs improvement. Xu et al. [9] used a modified YOLO9000 network to quickly detect the position of the enclosing frame of the instrument dial and remove interference from redundant background information. Liu [10] used the two-stage target detection algorithm Faster R-CNN to detect the location of the target instrument. Focusing on model lightweighting, Zhang et al. [11] proposed the lightweight convolutional neural network ShuffleNet. At present, more and more algorithms use deep learning for pointer meter detection, but the real-time performance and accuracy of these methods still need to be improved, and it is difficult to complete real-time, efficient detection of meters on computers with limited computing power. So it is difficult to meet the needs of industrial applications [12].
In order to improve the accuracy of the network for meter detection, we propose an improved lightweight meter detection algorithm adapted to local, specific pointer meter detection. Improvements are made to the network structure and loss function of YOLOX [13]. First, industrial instrumentation data were collected and prepared. The overall network structure was then based on YOLOX, a deep convolutional neural network of the one-stage target detection family. The Focus module of the original YOLOX detection structure was removed, and the backbone and neck structures were replaced with lightweight alternatives, enabling efficient fusion of target features. Coordinate attention [14] is inserted after the neck structure, in front of the detection head. Finally, focal loss [15] and CIoU [16] are added for loss optimization. The experimental results show that the YOLOX-CAlite algorithm achieves high accuracy and speed for industrial half-disk and full-disk pointer-type instrument detection. It solves the problems of large models and slow speed in industrial pointer-type instrument detection algorithms.

II. YOLOX ALGORITHM THEORY
The main improvement of YOLOX compared with other YOLO series models is that it builds an anchor-free end-to-end target detection framework on top of YOLOv3-DarkNet53 [17] by adding a decoupled head, stronger data augmentation, an anchor-free design, and the SimOTA sample matching method. The network model is divided into four parts: input side, backbone, neck, and prediction. The algorithmic framework is shown in Fig. 1.
Input: First, input sample images of different sizes are resized to 640 × 640; then Mosaic and MixUp image enhancement are applied. Mosaic splices four images into a new image by random scaling, random flipping, and colour gamut transformations, adding context to the detected objects fed into the neural network. MixUp is an additional enhancement strategy on top of Mosaic that blends two images in a certain RGB proportion; the new image requires the model to predict all of the original targets. Both methods improve robustness and the detection of small targets.
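As a minimal sketch of the MixUp blending described above (the function name, the Beta-sampled mixing ratio, and the box-tensor layout are illustrative assumptions, not the paper's exact implementation):

import torch

def mixup(img_a, img_b, boxes_a, boxes_b, alpha=0.5):
    # Blend two images pixel-wise in a random proportion and keep the
    # targets of both, so the model must predict every original box.
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    mixed = lam * img_a + (1.0 - lam) * img_b
    boxes = torch.cat([boxes_a, boxes_b], dim=0)
    return mixed, boxes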
Backbone: The Focus network structure is applied first. The original 640 × 640 × 3 image is sampled every other pixel to obtain four separate feature layers, which are then stacked to quadruple the input channels to 320 × 320 × 12; a convolution then produces a 320 × 320 × 32 feature map. DarkNet53 is then used as the feature extraction network to obtain three effective feature layers. In addition, SPP (spatial pyramid pooling) is used in the backbone to enlarge the receptive field of the network through maximum pooling with different kernel sizes. The Focus and SPP [18] network structures are shown in Fig. 1.
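A minimal sketch of the Focus slicing step described above (PyTorch; the follow-up convolution to 32 channels is omitted):

import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    # x: (B, 3, 640, 640). Take every other pixel in both spatial
    # directions to form four sub-images, then stack them on the
    # channel axis, giving (B, 12, 320, 320). A convolution afterwards
    # would yield the 320 x 320 x 32 feature map mentioned in the text.
    top_left     = x[..., ::2,  ::2]
    bottom_left  = x[..., 1::2, ::2]
    top_right    = x[..., ::2,  1::2]
    bottom_right = x[..., 1::2, 1::2]
    return torch.cat([top_left, bottom_left, top_right, bottom_right], dim=1)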
Neck: The three effective feature layers obtained in the backbone are enhanced by the FPN [19] (Feature Pyramid Network) + PAN [20] (Path Aggregation Network) extraction network. The feature layers are fused in the FPN+PAN from bottom to top. The purpose of feature fusion is to combine feature information from different scales. The FPN+PAN network structure is shown in Fig. 2.
Prediction: The main components are decoupled head, anchor-free, SimOTA (Simplified Optimal Transport Assignment), and loss calculation.
Decoupled Head: There are three decoupled head branches in the prediction part, which improve accuracy and convergence speed. Each decoupled head has three outputs: reg_output, cls_output, and obj_output. cls_output gives the category prediction score of the target box; judged as N binary classifications and processed by the sigmoid activation function, it has size 20 × 20 × 80 in the first head. obj_output determines whether the target box is foreground or background; after sigmoid processing, it has size 20 × 20 × 1.
reg_output predicts the coordinate information (x, y, w, h) of the target box, with size 20 × 20 × 4. The outputs of the three branches are fused by Concat to obtain feature information of size 20 × 20 × 85.
The second decoupled head outputs and concatenates its feature information to obtain 40 × 40 × 85 features, and the third obtains 80 × 80 × 85 feature information. The three outputs are reshaped and concatenated overall to get 8400 × 85 prediction information, which a transpose then turns into two-dimensional information of size 85 × 8400.
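The shape bookkeeping above can be reproduced with a few tensor operations; this is a shape-only sketch with random tensors standing in for the three head outputs:

import torch

# Three decoupled heads at strides 32/16/8 produce 20x20, 40x40 and
# 80x80 maps with 85 channels (4 reg + 1 obj + 80 cls).
outs = [torch.randn(1, 85, s, s) for s in (20, 40, 80)]
flat = [o.flatten(2).transpose(1, 2) for o in outs]   # (1, s*s, 85) each
preds = torch.cat(flat, dim=1)                        # (1, 8400, 85)
preds_t = preds.transpose(1, 2)                       # (1, 85, 8400)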
Anchor-free: Each feature map location predicts only one box instead of three anchors of different sizes. The four box values are predicted directly, reducing the number of parameters and GFLOPs and speeding up detection.
SimOTA: In order to find a globally high-confidence assignment for all gt (ground-truth) objects in the image, first determine the positive-sample candidate region; then, for each gt, calculate the sum of its top ten IoU values with the candidates and round it to obtain a dynamic number k. The k candidates with the smallest cost are taken as positive samples and the rest as negative samples, and both are used to calculate the loss.
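The dynamic-k rule can be sketched as follows (a simplification of the published SimOTA matching; the cost matrix and candidate filtering are assumed to be computed elsewhere):

import torch

def dynamic_k_matching(cost, ious, n_candidate=10):
    # cost, ious: (num_gt, num_candidates). For each gt, sum its top
    # ten IoUs to get k, then take the k lowest-cost candidates as
    # positive samples; everything else is negative.
    matching = torch.zeros_like(cost, dtype=torch.bool)
    n_candidate = min(n_candidate, ious.size(1))
    topk_ious, _ = torch.topk(ious, n_candidate, dim=1)
    dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1)
    for gt_idx in range(cost.size(0)):
        _, pos_idx = torch.topk(cost[gt_idx], k=int(dynamic_ks[gt_idx]),
                                largest=False)
        matching[gt_idx, pos_idx] = True
    return matching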
Loss calculation: The loss of YOLOX consists of three parts: Reg, Obj, and Cls. The Reg part judges the regression parameters of the feature points, the Obj part judges whether a feature point contains an object, and the Cls part judges the kind of object contained in the feature point. Both the Obj part and the Cls part adopt BCE loss (binary cross-entropy loss). In binary classification there are only two outcomes for the model to predict, so the loss can be expressed as

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $y_i$ is the label of sample $i$ (one for the positive class, zero for the negative class) and $p_i$ is the probability that sample $i$ is predicted to be positive.

Reg: The feature points corresponding to each real box are obtained from the prediction, and the prediction boxes of those feature points are taken out. The IoU loss is then calculated between the real boxes and the prediction boxes and composes the loss of the Reg part.

Obj: The feature points corresponding to each real box are obtained from the prediction. All feature points corresponding to real boxes are positive samples and all remaining feature points are negative samples. The cross-entropy loss is calculated from the predictions of whether the positive and negative feature points contain objects, and composes the loss of the Obj part.

Cls: The feature points corresponding to each real box are obtained from the prediction, and the class prediction results of those feature points are taken out. The cross-entropy loss is calculated between the classes of the real boxes and the class predictions of the feature points, and serves as the loss of the Cls part. The total loss is

$$Loss = loss_{Reg} + loss_{Obj} + loss_{Cls}$$

where $loss_{Reg}$ is the bounding-box regression loss, $loss_{Obj}$ is the confidence loss, and $loss_{Cls}$ is the classification loss.
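A minimal sketch of this composition in PyTorch (tensor shapes and the toy targets are illustrative; the per-part weighting used by the real implementation is omitted):

import torch
import torch.nn.functional as F

# Toy tensors standing in for matched predictions and targets.
obj_pred, obj_target = torch.randn(8400, 1), torch.rand(8400, 1).round()
cls_pred, cls_target = torch.randn(50, 80), torch.rand(50, 80).round()
matched_iou = torch.rand(50)            # IoU of matched box pairs

# BCE for Obj and Cls, IoU loss for Reg, summed as in the text.
loss_obj = F.binary_cross_entropy_with_logits(obj_pred, obj_target)
loss_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_target)
loss_reg = (1.0 - matched_iou).mean()
loss = loss_reg + loss_obj + loss_cls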

III. IMPROVED ALGORITHM
A. YOLOX-CALITE ALGORITHM
Considering that the original YOLOX algorithm has more GFLOPs and larger weights, we build an improved model, YOLOX-CAlite: the idea is to reduce the number of parameters, i.e., to prune the structure and optimize the accuracy of the whole algorithm. In this model, the original Focus structure is removed and replaced with a convolution-pooling form. Meanwhile, we introduce GhostNet [21] to reduce the overall network parameters. The backbone part is replaced with ShuffleNet V2 [22]; since our task involves only a few classes, we remove the 1024-channel convolution and 5 × 5 pooling from the ShuffleNet V2 backbone. The neck part is replaced with BiFPN [23] to enhance feature fusion, and coordinate attention is inserted in front of the detection head. The original loss is replaced by focal loss, and CIoU is used for the IoU calculation. The YOLOX-CAlite network structure is shown in Fig. 3.

B. REMOVE FOCUS
The Focus structure concentrates spatial information into channel information: after concatenation the input channels become four times as many, and a convolution then produces the new feature map, so that downsampling loses no information and the image information is preserved. However, this increases the amount of computation and the number of parameters. Moreover, the Focus information is stored in a shallow layer, which is not very meaningful. Removing the Focus layer reduces the cache footprint and eases the computational burden, which is conducive to chip-side deployment.

C. SHUFFLENET V2 BACKBONE
The ShuffleNet V2 network model has two block types. The first is shown in part (a) of Fig. 4. A channel split is performed on the input feature map: a feature map with n input channels is divided by channel segmentation into two branches of n′ and n − n′ channels. The left branch has no operations. The right branch has three convolution operations, and the two 1 × 1 grouped convolutions of ShuffleNet v1 are replaced with ordinary convolutions. The data in the two branches are then merged with a Concat + Channel Shuffle operation, which not only keeps the number of input and output channels of this basic module the same but also avoids an Add operation. In part (b), the number of output channels is twice the number of input channels, and the left and right branch processes are basically the same as in (a).
The backbone of the YOLOX-lite network structure mainly replaces DarkNet53 with ShuffleNet V2. To avoid multiple uses of the C3 layer and the high-channel C3 layer, the 1024-channel convolution and 5 × 5 pooling of the ShuffleNet V2 backbone are removed. The Shuffle block structure is shown in Fig. 4.
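The key Concat + Channel Shuffle step of the stride-1 block can be sketched as follows (the right-branch convolutions are elided; tensor sizes are illustrative):

import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # Reshape to (B, groups, C // groups, H, W), swap the two channel
    # axes, and flatten back so information mixes across the branches.
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

# Stride-1 block of part (a): split, transform one branch, concat, shuffle.
x = torch.randn(1, 64, 32, 32)
left, right = x.chunk(2, dim=1)   # channel split
# ...the right branch would pass through 1x1 conv, 3x3 depthwise, 1x1 conv...
out = channel_shuffle(torch.cat([left, right], dim=1))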

D. GHOST MODULE
There are redundancies in the YOLOX network: feature maps that affect accuracy are produced by convolution operations and fed into the next convolution layer, and this process involves a large number of network parameters that consume considerable computational resources. The Ghost module is therefore added to YOLOX, replacing convolutions in the neck and head with GhostConv, so that these redundant feature maps are obtained at a lower computational cost.
We assume the input feature map size is $h_1 \cdot w_1 \cdot n$, the output feature map size is $h_2 \cdot w_2 \cdot m$, and the convolution kernel size is $d \cdot d$. The computation of an ordinary convolution, $C_1$, can be expressed as

$$C_1 = h_2 \cdot w_2 \cdot m \cdot n \cdot d \cdot d$$

The Ghost module assumes the number of transformations of the feature map is $a$ and each linear operation kernel is $d_1 \cdot d_2$. Its computation $C_2$ can be expressed as

$$C_2 = h_2 \cdot w_2 \cdot \frac{m}{a} \cdot n \cdot d \cdot d + (a - 1) \cdot h_2 \cdot w_2 \cdot \frac{m}{a} \cdot d_1 \cdot d_2$$

Since the magnitude of $d \cdot d$ is similar to that of $d_1 \cdot d_2$, the compression ratio $r$ of the parameters is approximately equal to the acceleration ratio:

$$r = \frac{C_1}{C_2} \approx \frac{a \cdot n}{a + n - 1} \approx a$$

The GhostBottleneck has two structures: when Stride = 1 and no downsampling is performed, two Ghost convolution operations are applied directly; when Stride = 2 and downsampling is performed, a depthwise convolution with stride 2 is added. In this model, all GhostBottlenecks use Stride = 1.
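A minimal GhostConv sketch under these assumptions (ratio a = 2; the SiLU activation and batch norm placement are our choices, not necessarily the paper's):

import torch
import torch.nn as nn

class GhostConv(nn.Module):
    # A cheap pointwise convolution produces half the output channels;
    # inexpensive depthwise "linear" operations generate the remaining
    # (redundant) feature maps, which are concatenated to the first half.
    def __init__(self, c_in, c_out, ratio=2, dw_size=3):
        super().__init__()
        init_ch = c_out // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, c_out - init_ch, dw_size,
                      padding=dw_size // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(c_out - init_ch), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)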

E. NECK BiFPN
The feature maps of the deeper layers of the network have large receptive fields, strong semantic information, and high robustness, but they have low resolution and lose more detailed information. In contrast, shallow layers have small receptive fields, higher resolution, and more detailed information. Shallow networks are good for target localization but lack semantic information. The target detection task requires not only localization of targets but also classification of the detected targets, which requires the input feature maps at the final detection layer to be rich in both semantic and detail information.
The FPN+PAN structure used in the neck of the original YOLOX framework serves for multiscale connection and weighted feature fusion. Because FPN+PAN cascades feature maps by transforming them to the same size, it cannot make full use of features across different scales, which ultimately limits the detection accuracy of the network. The neck in YOLOX is therefore changed to the more effective BiFPN (bi-directional feature pyramid network) fusion structure to improve the detection model's efficiency. BiFPN is a weighted bi-directional feature pyramid network proposed in EfficientDet. It performs simple and fast multi-scale feature fusion and enriches feature semantic information. Because we have chosen a lightweight model, the number of channels is small, so the fusion does not add enough computation to slow down the speed. The integration steps are as follows. First, we obtain three effective feature layers from the backbone, C1, C2, and C3. P2 is obtained by up-sampling C3 and convolving it with C2. F1 is obtained by up-sampling P2 and then convolving with C1. F2 is obtained by down-sampling the features from F1 and stitching them with C2 and P2 in the channel dimension. F3 is obtained from F2 by max-pooling and then convolving with C3. The BiFPN structure is shown in Fig. 5.
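A wiring sketch of these fusion steps (assumptions: all three levels share one channel count, the learned fusion weights are omitted, and `conv` stands in for a per-level convolution block):

import torch
import torch.nn.functional as F

def bifpn_fuse(c1, c2, c3, conv):
    up   = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")
    down = lambda x: F.max_pool2d(x, kernel_size=2)
    p2 = conv(c2 + up(c3))           # top-down intermediate node
    f1 = conv(c1 + up(p2))           # highest-resolution output
    f2 = conv(c2 + p2 + down(f1))    # bottom-up step with skip from C2
    f3 = conv(c3 + down(f2))         # lowest-resolution output
    return f1, f2, f3

# Shape check with identity "convolutions":
c1, c2, c3 = (torch.randn(1, 96, s, s) for s in (80, 40, 20))
f1, f2, f3 = bifpn_fuse(c1, c2, c3, conv=torch.nn.Identity())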

F. LOSS FUNCTION
Focal loss is primarily designed to address the problem that one-stage target detection involves many a priori boxes, resulting in an imbalance between positive and negative samples. The focal loss formula is

$$L_{FL} = -(1 - p_t)^{\gamma}\log(p_t)$$

where $L_{FL}$ is the value of the focal loss for each pixel point on the image, $p_t \in [0, 1]$ is the output value for the positive class, and $(1 - p_t)^{\gamma}$ is the modulating factor. The modulating factor reduces the weight of samples that are easy to classify, allowing the model to focus more on hard-to-classify samples during training. $\gamma > 0$ is the focusing parameter, which makes the model concentrate on difficult and misclassified samples; the larger $\gamma$ is, the stronger the rebalancing, but too large a value affects overall accuracy.

We replace IoU with complete-IoU loss (CIoU loss) in the IoU calculation. The CIoU loss function takes into account the distance, overlap, and scale between the target and the anchor, making target box regression more stable and reducing problems such as divergence during training. At the same time, the CIoU loss function is sensitive to scale transformation and converges faster. The CIoU formula is

$$CIoU = IoU - \frac{l^2}{c^2} - \delta v$$

where $l$ is the distance between the center points of the prediction box and the real box, $\delta$ is a weighting parameter, $v$ is a parameter measuring the consistency of the aspect ratio, and $c$ is the diagonal distance of the smallest rectangular region enclosing the predicted and true boxes. The $\delta$ and $v$ terms are

$$\delta = \frac{v}{(1 - IoU) + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$

where $w$ is the width and $h$ is the height. The loss $L_{CIoU}$ is

$$L_{CIoU} = 1 - CIoU$$
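Both losses can be sketched directly from the equations above; the (x1, y1, x2, y2) box layout and the eps guards are our assumptions rather than the paper's exact code:

import math
import torch
import torch.nn.functional as F

def focal_loss(pred_logits, targets, gamma=2.0):
    # Down-weight easy samples by the modulating factor (1 - p_t) ** gamma.
    p = torch.sigmoid(pred_logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    bce = F.binary_cross_entropy_with_logits(pred_logits, targets,
                                             reduction="none")
    return ((1 - p_t) ** gamma * bce).mean()

def ciou_loss(pred, target, eps=1e-7):
    # pred, target: (N, 4) boxes as (x1, y1, x2, y2).
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared center distance l^2 and enclosing-box diagonal c^2.
    c_p = (pred[:, :2] + pred[:, 2:]) / 2
    c_t = (target[:, :2] + target[:, 2:]) / 2
    l2 = ((c_p - c_t) ** 2).sum(dim=1)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps
    # Aspect-ratio term v and its weight delta.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps))
                              - torch.atan(wp / (hp + eps))) ** 2
    delta = v / (1 - iou + v + eps)
    return (1 - iou + l2 / c2 + delta * v).mean()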
G. COORDINATE ATTENTION
In this model, in order to maximize the amplification of useful information and suppress useless information, we fuse multi-scale features and then weight them effectively with coordinate attention, allowing the model to autonomously learn the proportion of features needed for multi-scale feature fusion. This maximizes the weight of target information while suppressing background information, improving accuracy without appreciably increasing the number of parameters. The operation is divided into two steps, coordinate information embedding and coordinate attention generation; the structure is shown in Fig. 6.

Coordinate information embedding: The input $X$ is first encoded for each channel using pooling kernels of dimensions $(H, 1)$ and $(1, W)$ along the horizontal and vertical coordinate directions; these two transformations aggregate features along the two spatial directions. Thus, the output of the $c$th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$

Likewise, the output of the $c$th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$

where $z_c$ is the output associated with channel $c$. These two transforms aggregate features along the X and Y spatial directions, respectively, giving a direction-aware feature map.

Coordinate attention generation: After the transformations in the information embedding, the two outputs are concatenated and transformed with a convolutional transform function:

$$f = \delta\!\left(F_1\left([z^h, z^w]\right)\right)$$

where $F_1$ is a $1 \times 1$ convolution and $f$ is the intermediate feature map obtained by the downsampling operation $\delta$. $f$ is cut into two tensors $f^h$ and $f^w$, and

$$g^h = \sigma\!\left(F_h(f^h)\right), \qquad g^w = \sigma\!\left(F_w(f^w)\right)$$

are obtained using the $1 \times 1$ convolutions $F_h$ and $F_w$ and the $\sigma$ transformation. Expanded and applied to the input as attention weights, they give the final output

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
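A module-level sketch of these two steps (the channel-reduction ratio and the ReLU standing in for the activation of reference [14] are illustrative choices):

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    # Pool along H and W separately, encode jointly with a 1x1 conv,
    # split, and produce per-direction attention weights g_h, g_w
    # that multiply the input, as in the equations above.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.size()
        z_h = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * g_h * g_w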

IV. EXPERIMENTAL ENVIRONMENT AND SETTINGS
To ensure the training efficiency of the YOLOX-CAlite model, the experimental environment is configured as follows: the CPU is an Intel(R) Xeon(R) Gold 6130 with a base frequency of 2.10 GHz, the GPU is an NVIDIA GeForce RTX 2080 Ti discrete graphics card, the CUDA version is 11.0, the memory is 16 GB, the Python version is 3.7 with the PyTorch 1.7 framework, and the operating system is Windows 10.

V. DATASET CONSTRUCTION AND IMAGE PRE-PROCESSING
The scope of this paper mainly covers complex industrial and laboratory scenarios, where pointer meters are generally mounted on poles or pipes and come in a wide variety of sizes. Blurring of photographs caused by the environment or the shooting angle can affect the dataset. The experimental samples consisted mainly of 2500 images of substation meters downloaded from the web and 1000 images taken in our local laboratory. The 1000 images were expanded to 1500 using data augmentation, and the data were divided into a training set and a test set at a ratio of 8:2. Finally, the input images were fed into the neural network for training.

A. MODEL TRAINING
Before starting training, the sample images were uniformly resized to 640 × 640. Training used stochastic gradient descent with a momentum of 0.9 and an initial learning rate of 5e-4, with 8 sample images per training batch. Mosaic and MixUp were switched off for the last 15 training epochs. The loss function stabilized when training reached 300 epochs, yielding the model training weights.
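A sketch of this optimizer configuration (the placeholder model and the absence of weight decay are our assumptions, since the paper does not list further hyper-parameters):

import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)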

B. EVALUATION INDICATOR
In model evaluation, the combination of precision and recall of a model needs to be considered. This experiment therefore uses AP (average precision) to analyse the performance of the target detection algorithm and account for false detections. The AP combines different precision and recall points, and the area under the precision-recall curve is the AP value. In this experiment, AP@0.5:0.95 and AP@0.5 are selected as evaluation indicators, where AP@0.5:0.95 is the AP averaged over IoU thresholds from 0.5 to 0.95. The AP formula is as follows:

$$AP = \int_0^1 p(r)\,dr$$

where $p(r)$ is the precision at recall $r$.
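Numerically, the integral can be approximated as the area under the precision-recall curve; a simple trapezoidal sketch (interpolated-precision variants used by COCO-style evaluation are omitted):

import numpy as np

def average_precision(recall, precision):
    # AP = integral of p(r) dr, approximated by the trapezoidal rule
    # on recall points sorted in increasing order.
    r, p = np.asarray(recall), np.asarray(precision)
    order = np.argsort(r)
    return float(np.trapz(p[order], r[order]))

# e.g. average_precision([0.0, 0.5, 1.0], [1.0, 0.9, 0.6])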

C. RESULTS AND DISCUSSION
We compare the YOLOX-s, YOLOv5 [24], [25], YOLOX-lite, and modified YOLOX-CAlite network models in comparison experiments, validating meter detection on the test set. YOLOX-lite introduces the backbone and neck lightweighting; YOLOX-CAlite adds coordinate attention on top of YOLOX-lite. Fig. 7 shows the losses of three of the networks as three curves. In addition to the loss and AP curves, the four networks were evaluated quantitatively using GFLOPs, parameter size, weight size, average testing time, and AP. The AP@0.5 remains largely unchanged at around 0.98. The YOLOX algorithm converged after 300 iterations with an AP@0.5:0.95 of 0.897, and the YOLOv5 algorithm reached an AP@0.5:0.95 of 0.882. The YOLOX-lite algorithm reached an AP@0.5:0.95 of 0.89 after 300 iterations, while the YOLOX-CAlite algorithm reached an AP@0.5:0.95 of 0.904 after 500 iterations. Introducing coordinate attention after the backbone and neck lightweighting increases the parameter size slightly, but the AP@0.5:0.95 improves by 0.013. The AP curves for the four models are shown in Fig. 8.
Table 1 compares YOLOX, YOLOv5, and the two improved models on the test set at a resolution of 640 × 640. The YOLOX-CAlite algorithm provides the highest accuracy while reducing GFLOPs and parameter size. This shows that introducing coordinate attention into the network allows it to learn target features while reducing the loss of location information before the feature maps are fed to the detection head. This strengthens deep feature learning and takes full advantage of the learned features, improving the learning efficiency of the network. In addition, the performance improvement neither increases the parameter count substantially nor complicates the network. YOLOX-CAlite detects a single image frame in 40.8 ms; the detection speed does not change much compared with YOLOX and YOLOv5.

VII. APPLICATION EVALUATION
In order to demonstrate the detection effectiveness of the YOLOX-CAlite algorithm, the detected images are set to 640 × 640 pixels in this experiment. Fig. 9 shows the regions the algorithm focuses on; the areas of interest are mainly the meter face information. Fig. 10 shows the detection results of YOLOX and YOLOX-CAlite on substation meter images of varying quality. Comparing the detection results on the same images, we find that the accuracy is the same in most cases, but on some images YOLOX is less accurate than YOLOX-CAlite. Fig. 10 (a) shows the detection results of YOLOX and (b) those of our modified YOLOX-CAlite; YOLOX-CAlite in Fig. 10 (b) is more accurate than YOLOX on some of the detections. Fig. 11 shows the detection results of YOLOX, YOLOX-lite, YOLOv5, and YOLOX-CAlite on a specific pressure gauge in the laboratory. Comparing the detection results on the same images, we find that after lightweighting YOLOX, individual cases of incorrect detection occur in (b) and (d): false detections appear below and to the upper left of the target. With the addition of coordinate attention, the incorrect detections are eliminated in (c). The coordinate attention we add lets the network attend to larger regions without a large computational overhead and filters out the information most critical to the current task from the rest, i.e., it filters noise and alleviates the loss of location information caused by 2D global pooling. This shows that our improved YOLOX-CAlite reduces errors and is more robust.
Problems such as complex backgrounds, lens distortion, and image skewing in inspection instrument images can interfere with detection results. Despite these issues, our improved YOLOX-CAlite is lighter while maintaining accuracy compared to YOLOX and YOLOv5.

VIII. CONCLUSION
In our work, to solve the problems of the large number of parameters and high resource consumption of target detection models, this paper proposes a detection algorithm adapted to local, specific pointer meter detection. The proposed YOLOX-CAlite algorithm significantly reduces the computation, parameter size, and model size while preserving detection accuracy. It first removes the Focus structure and replaces it with a convolution-pooling form, then uses ShuffleNet V2 structures and a BiFPN structure to enhance feature fusion. Coordinate attention is introduced after the neck to better preserve the location information of features. In the loss calculation, the binary cross-entropy loss is replaced with focal loss, and CIoU is used to optimize the multi-task joint loss of the algorithm. The experimental results show that the GFLOPs of our algorithm are reduced by 68% compared with the original YOLOX, and the model size is reduced by 71% while maintaining high accuracy. The algorithm enables the inspection robot to autonomously identify meters in more complex industrial or laboratory environments. In future research, we will further explore lightweight improvements to target detection algorithms, improving detection accuracy and speed and reducing false detections while keeping the computation, parameter count, and model size as small as possible.