Anchor-Free Weapon Detection for X-Ray Baggage Security Images

Considering the real-time and high-precision requirements of image processing in X-ray baggage security screening, and problems such as the inflexibility and computational complexity of anchor-based object detection, this paper introduces anchor-free convolutional neural network object detection methods for detecting weapons (knives and handguns) in X-ray baggage security images. The advantage of the anchor-free approach over the anchor-based approach is that anchor box sizes do not have to be set, giving strong generalization ability; the absence of anchor boxes reduces the amount of computation and avoids the imbalance between positive and negative samples found in anchor-based methods. To fully evaluate the effectiveness of anchor-free methods for X-ray baggage screening image detection, a large number of images containing knives and handguns were collected and annotated in the early stages of this work to produce a dataset suitable for training. Six mainstream anchor-free methods (CornerNet, CenterNet, CornerNet-Lite, ExtremeNet, Objects as Points and You Only Look Once (YOLOx)) are introduced. For experimental completeness, this paper adds anchor-based comparison experiments, using Faster-RCNN, YOLOv3 and YOLOv5 to perform the same task. The experimental results show that the YOLOx, Objects as Points and ExtremeNet anchor-free methods used in this paper perform excellently in weapon detection in X-ray baggage security images. Among them, the mean average precision (mAP) of YOLOx combined with the CSPDarknet53 network reached 0.905, and the mAP of ExtremeNet combined with the Hourglass-104 network reached 0.900; the Objects as Points method also performed well. All of these methods performed better than the anchor-based methods compared in this paper. Therefore, we believe that the anchor-free approach is of practical value for weapon detection in X-ray luggage images.


I. INTRODUCTION
X-ray inspection equipment, as a widely used means of detecting security risks, has been installed increasingly often in key locations in crowded areas such as train stations and airports, serving as an important protective barrier against terrorist attacks. At present, the detection of dangerous goods still relies on human inspection of the screening images, which not only consumes time and manpower but also easily leads to misidentification and missed detections when the operating task is difficult. Therefore, automatic detection in X-ray images is a challenging topic worthy of research.

(The associate editor coordinating the review of this manuscript and approving it for publication was Jiju Poovvancheri.)

Deep learning-based image object detection techniques have shown very competitive performance in recent years. After convolutional neural networks achieved great success in classification tasks on ImageNet [1] in 2012, Girshick et al. [2] were the first to propose a framework for object detection with region-based convolutional networks, and since then a new phase of object detection has begun. Akcay et al. [25], for example, considered the use of convolutional neural networks [...]

[...] pseudo-colour image [11]: material with an equivalent atomic number less than 10 is organic and is coloured orange; material with an equivalent atomic number greater than 18 is inorganic and is coloured blue; and material with an atomic number between these two values, or a mixture of the two types, is coloured green.

All images used in this experiment were provided by a dual-energy X-ray detector manufactured by UNICOMP, which provides two energy images simultaneously: two sets of data are obtained during one radiograph, generating images corresponding to the high-energy and low-energy rays, respectively. The dual-energy detector has two scintillators, gadolinium oxysulfide (GOS) (153 mg/cm²) for low energy and caesium iodide CsI(Tl) for high energy. The measured object is moved by a conveyor belt at a speed of 22 cm/s; the maximum width of the scanned object is 650 mm, and the maximum height is 500 mm. We collected a large number of pistol and knife models and mixed them with ordinary objects and other interfering objects in suitcases. After the raw image was output by the X-ray scanning equipment, it was coloured according to atomic number and compressed to 960 × 640 resolution at 24-bit depth; no other post-processing was applied.
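The pseudo-colouring rule above can be sketched as a simple per-pixel mapping. This is an illustrative sketch only (the scanner's actual colouring firmware is not described in the paper); the exact RGB values chosen here are assumptions, while the atomic-number thresholds follow the text.

```python
import numpy as np

# Illustrative colours; only the Zeff thresholds (<10, >18) come from the text.
ORANGE = (255, 165, 0)   # organic: equivalent atomic number < 10
GREEN = (0, 200, 0)      # mixtures, or 10 <= Zeff <= 18
BLUE = (0, 0, 255)       # inorganic: equivalent atomic number > 18

def pseudo_colour(zeff: np.ndarray) -> np.ndarray:
    """Map a per-pixel effective-atomic-number array to an RGB image."""
    rgb = np.empty(zeff.shape + (3,), dtype=np.uint8)
    rgb[zeff < 10] = ORANGE
    rgb[zeff > 18] = BLUE
    rgb[(zeff >= 10) & (zeff <= 18)] = GREEN
    return rgb
```

In practice the effective atomic number per pixel would be estimated from the ratio of the high-energy and low-energy images produced by the dual-energy detector.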
Unlike anchor-based methods, an anchor-free method locates an object by finding its key points, and the key point generation strategy has a direct impact on the accuracy and speed of detection. This experiment introduces six anchor-free methods, namely, CornerNet, CenterNet, CornerNet-Lite, ExtremeNet [44], Objects as Points [43] and YOLOx [45], all of which use different combinations of methods for selecting key points and can produce different detection results. In this paper, key points are classified into three types: corner points, centre points, and extreme points. The locations of these key points are obtained by mapping the feature heatmap output by the backbone network to the location of the object. Apart from the YOLOx method, which uses the CSPDarknet53 network structure (a fusion of CSPNet and Darknet53), the other anchor-free methods adopt the Hourglass network as the backbone. Hourglass is a network model similar to an encoder-decoder; it can capture both local and global information, which is helpful for key point prediction. For comparison with anchor-based methods, this paper also performs the same experiments on several classic anchor-based methods (Faster-RCNN, YOLOv3 and YOLOv5) and compares the results with those of the anchor-free methods.

The main contributions of this paper are as follows. (1) This paper analyses the computational complexity and inflexibility of traditional anchor-based object detection algorithms, and introduces the latest anchor-free object detection algorithms for the task of detecting knives and handguns in X-ray baggage security images to address these problems. (2) Several recent anchor-free object detection algorithms are investigated, the advantages and disadvantages of the respective methods are analysed, and comparative experiments are conducted. (3) Given the paucity of knife and handgun detection data in X-ray luggage images, this paper collects and labels a large number of X-ray luggage images containing these two items to construct a new X-ray image detection dataset. Based on this dataset, a comprehensive evaluation of each of the above algorithms is carried out.

Experimentally, we conclude that the anchor-free methods have better practicability than the anchor-based methods introduced here for the task of weapon detection in X-ray baggage security images.

Research on X-ray baggage security imagery has been continuously updated with the development of computer vision, and previous research has undergone several phases: image enhancement [12], [13], [14], [15], [16], [...]. Although features from the accelerated segment test (FAST) performed more competitively, it was also concluded that the SIFT descriptors performed best, though not as well as on conventional images, and that the main problem was the lack of texture information in X-ray images. Franzel et al. [19] performed object detection on X-ray luggage images from multiple viewpoints, where a combination of histogram of oriented gradients (HOG) features and an SVM classifier was used for supervised learning to construct a classification model; the experimental results showed that the average single-view detection accuracy (AP) increased from 49.7% to 64.5%, with multiple views able to detect approximately 80% of handguns. Schmidt-Hackenberg et al. [...]. Akcay et al. [25] studied the application of deep neural networks for classification and object detection in X-ray baggage security imagery and achieved an accuracy of 0.994 on the classification task by combining the AlexNet network structure [1] with SVM classifiers. In addition, they used SW-CNNs, F-RCNNs [1], and YOLOv2 [6] for object detection, achieving a mean average precision (mAP) of 0.885 for six-class object detection and 0.974 for two-class object detection, with a detection efficiency of 100 ms per image, which shows that deep convolutional neural networks perform very well in the X-ray baggage security imagery detection task. Galvez [...]. In this paper, the latest anchor-free methods are introduced and applied to weapon detection in X-ray baggage security images, and the applicability of these anchor-free methods in this scenario is evaluated.

To find the object region, an anchor-based method extracts the bounding box for the region in which the object is located via a region proposal network (RPN), while an anchor-free method achieves the same end by generating key points for the object region. The generation of key points is based on heatmaps produced by the image attention mechanism, similar to the way humans observe images: the global image is quickly scanned to find the object area that needs to be focused on, and then more attention resources are devoted to this area to obtain more details about the object while suppressing useless information. This is the difference between the anchor-free and anchor-based mechanisms.
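To make the heatmap-based keypoint idea concrete, the following is a minimal sketch of turning a per-class keypoint heatmap into candidate object locations. The 3 × 3 max-pool used as non-maximum suppression follows the centre-point literature (e.g. Objects as Points); the tensor shapes and the top-k cut-off are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def decode_peaks(heatmap: torch.Tensor, k: int = 10):
    """heatmap: (C, H, W) per-class scores in [0, 1].
    Returns (classes, ys, xs, scores) of the top-k heatmap peaks."""
    c, h, w = heatmap.shape
    # A pixel survives only if it equals the max of its 3x3 neighbourhood,
    # i.e. it is a local maximum (cheap NMS via max pooling).
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1)[0]
    peaks = heatmap * (heatmap == pooled).float()
    scores, idx = peaks.view(-1).topk(k)
    cls = torch.div(idx, h * w, rounding_mode="floor")
    rem = idx % (h * w)
    ys = torch.div(rem, w, rounding_mode="floor")
    xs = rem % w
    return cls, ys, xs, scores
```

A full detector would then pair or group these peaks (corners, centres, extreme points) into bounding boxes, which is where the methods below differ.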

A. BACKBONE NETWORK

The deep feature maps extracted by convolutional neural networks have an attentional effect upon activation, which responds to regions of interest but easily loses deep features. To capture information from multiscale feature maps, Newell et al. [46] proposed the Hourglass network structure, motivated by the need to capture information at every scale. The network structure is hourglass-shaped, using a residual module as the basic network unit, with repeated top-down and bottom-up structures to infer the locations of the key points of the object. The Hourglass network used by the anchor-free methods in this paper makes some modifications to this design. Before entering the Hourglass modules, the image passes through a 7 × 7 convolution module with stride 2 and 128 channels, reducing the resolution by a factor of 4. Within the modified hourglass module, max-pooling downsampling is removed and stride-2 downsampling is used instead. The feature resolution is reduced 5 times, and the channels are increased to (256, 384, 384, 384, 512). This Hourglass module is named Hourglass-52, shown in Figure 1, and a stack of two such modules is called Hourglass-104.

[...] denotes the radius of the circle. The y_cij value decreases more slowly as the negative sample moves away from the positive sample. To maintain consistency between the penalty and the increase or decrease in distance, the penalty factor is set to (1 − y_cij), so the loss function for key point detection is:

L_det = (−1/N) Σ_c Σ_i Σ_j { (1 − P_cij)^α log(P_cij),                    if y_cij = 1
                             (1 − y_cij)^β (P_cij)^α log(1 − P_cij),       otherwise      (2)

Here, N is the number of objects in the image, α and β are hyperparameters that control the contribution to the loss, and P_cij is the predicted value on the prediction heatmap at location (i, j), i.e., the probability that the point is a corner of class c.

The method of this paper extracts the key point map (shown in Figure 5) during the detection process and predicts the corner, centre, and extreme points of the object. With the prediction of the key points, the anchor boxes generated by anchor-based methods are eliminated, while the object is still guaranteed a response in the feature map.
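As a sketch, the penalty-reduced focal loss of Eq. (2) can be written directly from the definitions above, with y the Gaussian-smoothed ground-truth heatmap and p the predicted heatmap. This is an illustrative implementation of the published CornerNet-style loss, not the authors' code; the clamping epsilon is an assumption for numerical stability.

```python
import torch

def keypoint_focal_loss(p: torch.Tensor, y: torch.Tensor,
                        alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    """p, y: heatmaps of shape (C, H, W); y == 1 marks positive key points."""
    eps = 1e-6
    p = p.clamp(eps, 1 - eps)
    pos = (y == 1).float()                  # positive key-point locations
    neg = 1.0 - pos
    pos_loss = pos * (1 - p) ** alpha * torch.log(p)
    # (1 - y)^beta reduces the penalty for negatives near a positive.
    neg_loss = neg * (1 - y) ** beta * p ** alpha * torch.log(1 - p)
    n = pos.sum().clamp(min=1)              # N, the number of objects
    return -(pos_loss + neg_loss).sum() / n
```

Predictions that place high scores exactly on annotated key points incur a small loss, while confident responses far from any positive are penalized heavily.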

To gain an overall understanding of these anchor-free methods, this subsection summarizes the overall flow of the methods to better understand their processes for handling data and the differences between them.

The experiments used the six anchor-free object detection algorithms described above: the CornerNet and CornerNet-Lite methods based on corner point detection, the CenterNet method based on a combination of corner and centre points, the Objects as Points method based on the centre point, the ExtremeNet method using extreme point detection, and the YOLOx method. The anchor-based methods (Faster-RCNN, YOLOv3 and YOLOv5) were also compared for experimental completeness.

Dataset: Since X-ray baggage security images are unconventional images with few sources of data acquisition, and even fewer datasets exist for object detection of knives and guns, the data for this experiment were obtained from an X-ray machine manufacturer, and several different types of knives, handguns, and other items were combined for X-ray scanning. From the tens of thousands of pictures, 10,233 X-ray pictures of knives and pistols were selected as the main material for the experiment. To obtain a more complete dataset, we carried out extensive image annotation work, using an annotation tool to create the labels required for the experiment from the positions of the knives and handguns in each image. On average, each image contains two to three labels.

Training Details: We trained the methods using the PyTorch framework with an image input size of 511 × 511 and an output size of 128 × 128. To reduce overfitting, standard data augmentation was used, including random horizontal flipping, random scaling, random cropping, and random colour jittering (adjusting the brightness, saturation, and contrast of the image); the training loss was optimized using Adam. The number of training iterations was 100,000, the learning rate was 2.5 × 10⁻⁴, and the batch size varied depending on the network size and number of stacks: the more parameters there were, the smaller the batch size. Training was performed on a single Nvidia GeForce Titan 1080 GPU, and each network took approximately two days to train.
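The optimisation setup described above can be sketched as follows. This is a hedged illustration only: the one-layer `model` and the squared-activation loss are stand-in placeholders, not any of the detectors or losses evaluated in the paper; only the optimizer choice, learning rate, and input size follow the text.

```python
import torch
from torch import nn, optim

# Placeholder "model": a single 7x7 stride-2 convolution, not a real detector.
model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
# Adam at the learning rate stated in the text.
optimizer = optim.Adam(model.parameters(), lr=2.5e-4)

def train_step(images: torch.Tensor) -> float:
    """One of the 100,000 training iterations (with a dummy loss)."""
    optimizer.zero_grad()
    loss = model(images).pow(2).mean()  # placeholder for the detection loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the real setup, `images` would be 511 × 511 augmented crops and the loss would be the keypoint detection loss of the chosen method.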

Evaluation: We evaluated the performance of the anchor-free methods on the X-ray baggage security image object detection task using the mAP and average recall (AR), averaged over three IoU thresholds, IoU ∈ {0.5, 0.75, 0.95}, which better reflects the localization quality of the detectors. To test the performance of these models, we set aside part of the dataset (images and labels) for testing, selecting 1,000 images for the validation set and 1,000 for the test set.
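The IoU underlying this protocol is the standard intersection-over-union of two axis-aligned boxes; a minimal helper, with the corner-format convention as an assumption:

```python
def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A predicted box counts as a true positive at a given threshold when its IoU with a ground-truth box of the same class meets that threshold; mAP and AR are then computed per threshold and averaged.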
Therefore, these three anchor-free methods have advantages over the anchor-based methods used in this paper for X-ray baggage screening images, as shown in Figure 7.

The experiment introduces six anchor-free methods for the detection of knives and handguns in X-ray baggage security images, and there is some research continuity between these methods. The CornerNet method locates an object through corner points. Because there are no anchor restrictions, combining the corner points into an accurate bounding box requires a very strong corner-combination algorithm, since global information is not available to determine whether two corner points belong to the same object; it is therefore easy to combine corner points of different objects into one bounding box. Accordingly, in determining whether a top-left corner and a bottom-right corner belong to the same object, CenterNet adds centre point information to further check whether the centre of the box formed by these two points contains a centre point with a high response value. Likewise, ExtremeNet predicts four extreme points plus a central point to increase the confidence of the extreme-point combination. From the results, CenterNet is more accurate than CornerNet, and ExtremeNet has the highest accuracy among these keypoint-combination methods, verifying that the centre point is indeed effective in improving detection accuracy. The YOLOx method assigns a 3 × 3 area at the centre location of each object as positive samples, which means that YOLOx also adopts the centre-point anchor-free strategy but expands this point to a certain range, further verifying the importance of the centre-point strategy for anchor-free methods. In terms of accuracy, YOLOx reached a mAP of 0.905 combined with the CSPDarknet53 network, ExtremeNet achieved a detection accuracy of 0.900 on the Hourglass-104 backbone network, and Objects as Points achieved an accuracy of 0.881 on the DLA-34 backbone network.
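The 3 × 3 centre-area positive-sample assignment attributed to YOLOx above can be sketched as follows; the output-grid stride and the corner box format are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def centre_positive_mask(boxes, grid_h, grid_w, stride):
    """Mark a 3x3 region around each object's centre cell as positive.
    boxes: list of (x1, y1, x2, y2) in input-image pixels."""
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        # Object centre projected onto the output grid.
        cx = int((x1 + x2) / 2 / stride)
        cy = int((y1 + y2) / 2 / stride)
        # 3x3 neighbourhood, clipped to the grid boundaries.
        y_lo, y_hi = max(cy - 1, 0), min(cy + 2, grid_h)
        x_lo, x_hi = max(cx - 1, 0), min(cx + 2, grid_w)
        mask[y_lo:y_hi, x_lo:x_hi] = True
    return mask
```

Expanding the single centre point to a small region gives each object several positive cells, which eases the positive/negative imbalance that single-point assignment would otherwise cause.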
Additionally, given the real-time nature of the detection task, Objects as Points worked well with a lighter-weight network structure. Overall, the anchor-free approach is simpler and more flexible and can be improved and developed further.

In the future, more classes of datasets can be constructed to further enrich the object detection dataset of X-ray baggage security images; in addition, with the emergence of better backbone network structures, the anchor-free method can achieve improved detection accuracy and speed accordingly.