Learning-based Image Synthesis for Hazardous Object Detection in X-Ray Security Applications

X-ray baggage inspection has been widely used for maintaining airport and transportation security. Towards automated inspection, recent deep learning-based methods have attempted to detect hazardous objects directly from X-ray images. Since it is challenging to collect a large number of training images from real-world environments, most previous learning-based methods rely on image synthesis for training data generation. However, these methods randomly combine foreground and background images, restricting the effectiveness of synthetic images for object detection. To solve this problem, in this paper, we propose a learning-based X-ray image synthesis method for object detection. Specifically, for each foreground object to be synthesized, we first estimate positions difficult to detect by the object detector. These positions and their corresponding confidence values are then used to construct a difficulty map, which is used for sampling the target foreground position for image synthesis. The performance analysis using various state-of-the-art object detectors shows that the proposed synthesis method can produce more useful training data compared with the conventional random synthesis method.


I. INTRODUCTION
Baggage inspection based on X-ray screening is an essential task for reducing the risk of crime and terrorist attacks and for preventing the propagation of pests and diseases [1]. In general, the X-ray images are visually inspected by trained human inspectors to detect dangerous objects. Although it may take less than a second to investigate each piece of baggage, each inspector has to check a large amount of baggage over a long time. The possibility of human error is thus non-negligible, even with specialized training. Therefore, an automated X-ray baggage inspection system based on computer vision techniques, such as feature-based detection methods [2]-[5], is needed to detect hazardous objects robustly.
Recently, motivated by the remarkable success of convolutional neural networks (CNNs) in solving computer vision problems, learning-based automated X-ray inspection methods have been proposed [6]-[9]. To ensure good performance, such learning-based approaches require a large dataset of X-ray images and their corresponding annotations. Several publicly available datasets can be used for object detection in X-ray images. Mery et al. [3] presented the GDX-ray database that contains five object categories: castings, welds, baggage, natural objects, and settings. Miao et al. [10] introduced a larger dataset called SIXray, which contains diverse types of hazardous objects in baggage with cluttered background items. However, the number of positive samples, i.e., images with hazardous objects, is much smaller than the number of negative samples (12,277 versus 1,050,302 samples). Although this class imbalance may reflect real-world application environments, it also makes network training difficult.
To overcome this lack of training images, many recent methods have focused on learning from synthetic data [8], [9], [11]-[13]. Considering that X-ray imaging can be modeled using the absorption law that characterizes the intensity distribution of X-rays through matter [14], Mery and Katsaggelos [15] introduced a solid mathematical model for synthesizing threat objects onto background baggage. Following this method, Jain et al. [11] synthesized X-ray images during training for data augmentation and demonstrated the effectiveness of the synthetic data using several standard object detection models, such as YOLOv2 [16] and Faster R-CNN [17]. Yang et al. [12] further introduced a generative adversarial network (GAN)-based approach for generating realistic hazardous objects. Zhu et al. [13] applied a data augmentation framework similar to that of Jain et al. [11] and demonstrated that the accuracy of the SSD model [18] can be increased by 5.6% in terms of the mean average precision (mAP) when the model is trained using the augmented dataset. Saavedra et al. [9] combined PGGAN [19] and the X-ray image synthesis technique [15].
X-ray and natural images differ clearly when multiple objects overlap each other. As shown in Figs. 1(a) and (b), because X-rays penetrate matter, both front and rear objects are visible in X-ray images [10]. However, Figs. 1(c) and (d) show that the occluded regions of the rear objects are generally not visible in natural images. Due to this difference, X-ray images have an advantage in that the target object may be synthesized at any desired position, not only at a limited or random location. However, the existing X-ray image synthesis methods, which overlay foreground objects at arbitrary locations regardless of the background content, cannot fully take advantage of synthesized images for object detection. To this end, in this paper, we propose a novel learning-based X-ray image synthesis method.
In our proposed method, an object detection network is first trained using X-ray images synthesized with hazardous objects at random positions. The hazardous objects are then synthesized at hard-to-detect locations estimated by the object detector during the learning process. In this simple but effective way, we can generate hard samples that further boost the object detection performance. The experimental results obtained by various detection networks demonstrate the superiority of the proposed synthesis method.
The rest of this paper is organized as follows. The related works are reviewed in Section II. The proposed method is detailed in Section III. Experimental results are provided in Section IV. Finally, our conclusion is given in Section V.
In summary, this paper presents two major contributions. (i) We propose a difficulty map that represents the locations at which the detector has difficulty finding the objects, without requiring any additional network. (ii) Using the difficulty map, we introduce a data synthesis technique that produces hard-to-detect samples to train the detector effectively.

II. RELATED WORK
Before explaining the proposed method, we briefly review related techniques in this section, including CNN-based object detection, X-ray computer vision algorithms, X-ray image synthesis, and data augmentation.

A. CNN-BASED OBJECT DETECTION
CNN-based methods have been very successful in the recognition and localization of objects. According to the design principle, these methods can be classified into two-stage methods [17], [20] and single-stage methods [18], [21], [22].
The two-stage methods first identify candidate bounding boxes using a deep network and then refine the candidates using another sub-network. To this end, Ren et al. [17] introduced the region proposal network (RPN), which performs efficiently by sharing full-image convolutional features with a subsequent detection network. Lin et al. [20] proposed a feature pyramid network (FPN), which combines low-resolution and high-resolution features via top-down paths and lateral connections. This feature pyramid contains rich semantics from all levels and can be built from a single-scale input image, thereby exhibiting effectiveness in terms of representational power, speed, and memory.
The single-stage methods detect objects via a single network inference. As a pioneering work, YOLO [21] used a unified detection network that predicts bounding boxes and classifies objects at the same time from an entire image. The computational efficiency and robustness of YOLO and its advanced versions [16], [23] have been demonstrated thoroughly. Liu et al. [18] designed a reduced VGG network architecture that extracts features from multi-layers, enabling the network to handle objects with various scales effectively. Lin et al. [22] adopted ResNet as a basic feature extractor and used a focal loss to address the class imbalance problem caused by the biased foreground-background ratio.

B. X-RAY COMPUTER VISION ALGORITHMS
In the area of baggage inspection, some computer vision algorithms based on a single view of a single energy have been reported. Riffo and Mery [2] proposed an automated detection algorithm based on visual codebooks. Mery et al. [3] used adaptive sparse representations [24] to detect objects under less constrained conditions, including variability in contrast, pose, intra-class appearance, image size, and focal distance. On the other hand, in the analysis of single-view dual-energy images, Baştan et al. [4] presented a bag of visual words (BoVW) model with several hand-crafted feature representations. Additionally, there are some methods based on single-energy multi-view images, using active vision [25], [26]. Support vector machine (SVM) classifiers and visual dictionaries have been proposed for dual-energy multi-view X-ray images [5], [27].
Recently, several methods based on deep convolutional neural networks have been proposed. Akçay et al. [28] suggested a CNN-based object classification method using transfer learning in order to overcome the limited amount of training data, and provided a performance comparison among CNN-based object detection algorithms for X-ray baggage security imagery [6]. Gu et al. [8] proposed automatic X-ray object detection using a feature enhancement module. Saavedra et al. [9] introduced a GAN strategy in data augmentation for threat object detection.

C. X-RAY IMAGE SYNTHESIS
Many studies assume that X-ray image formation obeys the Beer-Lambert law. Based on this assumption, at image location $(x, y)$, the pixel intensity of the X-ray image $I(x, y)$ is defined as
$$I(x, y) = I_0 \exp\left(-\int \mu(x, y, z)\, dz\right), \tag{1}$$
where $I_0$ is the beam intensity, $z$ represents the depth coordinate, and $\mu$ is the effective attenuation coefficient of the objects in the scene [29]. Based on this image formation model, Rogers et al. [30] introduced a data synthesis technique, called TIP, which generates synthesized threat images that have no significant differences compared with real threat images. More specifically, they synthesize images by multiplying the foreground mask $F(x, y)$ and background mask $B(x, y)$ as follows:
$$I(x, y) = I_0\, F(x, y)\, B(x, y), \quad F(x, y) = \exp\left(-\int \mu_F(x, y, z)\, dz\right), \quad B(x, y) = \exp\left(-\int \mu_B(x, y, z)\, dz\right), \tag{2}$$
where $\mu_F$ and $\mu_B$ represent the effective attenuation coefficients of the foreground and background masks, respectively. It is worth noting that when $N$ foreground masks $F_1, \ldots, F_N$ are overlapped in the image, $F(x, y)$ in (2) can be replaced with
$$F(x, y) = \prod_{i=1}^{N} F_i(x, y). \tag{3}$$
[Fig. 2]
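The multiplicative model above can be illustrated with a small sketch. It assumes that all images are already normalized transmission maps in [0, 1] (i.e., intensities divided by the beam intensity) and that every foreground mask has the same size as the background; the function names `transmission` and `compose` are hypothetical and not taken from [30]. Plain Python lists are used only to keep the sketch dependency-free.

```python
import math


def transmission(mu_path):
    """Beer-Lambert transmission exp(-sum(mu)) along one ray (unit depth step assumed)."""
    return math.exp(-sum(mu_path))


def compose(background, foregrounds):
    """Multiply normalized transmission images pixel by pixel.

    Overlapping several foreground masks corresponds to taking their product,
    which is then multiplied with the background.
    """
    out = [row[:] for row in background]  # copy so the input is not modified
    for fg in foregrounds:
        for y, row in enumerate(fg):
            for x, value in enumerate(row):
                out[y][x] *= value
    return out


# Toy 2x2 example: 1.0 means "no attenuation" at that pixel.
background = [[0.9, 0.8], [0.7, 1.0]]
knife = [[1.0, 0.5], [0.5, 1.0]]
blade = [[1.0, 0.5], [1.0, 1.0]]
image = compose(background, [knife, blade])
```

Because attenuation coefficients add along a ray, `transmission([a, b])` equals `transmission([a]) * transmission([b])` up to rounding, which is why the synthesis reduces to a pixel-wise product.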

D. DATA AUGMENTATION
To increase generalization performance and mitigate the overfitting problem simultaneously, functional solutions such as dropout regularization [31], batch normalization [32], and transfer learning [33] have been developed. In contrast to such techniques, data augmentation approaches focus on the training dataset, which is the root cause of the overfitting problem.
In general, data augmentation is conducted by simple transformations such as horizontal flipping, color space augmentations, and random cropping [34]. Moreno-Barea et al. [35] proposed noise injection as an additional data augmentation, demonstrating that adding noise to images for nine datasets in the UCI repository could help CNNs learn more robust features. Kang et al. [36] devised PatchShuffle Regularization (PSR), which is a kernel filter that randomly swaps pixel values in n × n sliding windows. Through experiments with different filter sizes and probabilities of shuffling the pixels at each step, the authors demonstrated the effectiveness of PSR by achieving a 5.66% error rate on CIFAR-10 compared with an error rate of 6.33%. Inspired by the dropout regularization mechanism, Zhong et al. [37] developed a random erasing method that performs dropout in the input data space rather than in the feature space to prevent overfitting problems effectively.
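As a rough illustration of input-space augmentation in the spirit of random erasing, the following sketch fills one randomly placed rectangle with noise. It is a simplified variant under stated assumptions, not the procedure of [37]: the rectangle size is drawn uniformly up to a hypothetical `max_frac` of each image side, and the function name `random_erase` is illustrative.

```python
import random


def random_erase(image, max_frac=0.5, rng=random):
    """Return a copy of `image` with one random rectangle replaced by noise.

    `image` is a list of rows of floats; `max_frac` bounds the rectangle's
    height and width as a fraction of the image size (an assumed parameter).
    """
    h, w = len(image), len(image[0])
    erase_h = rng.randint(1, max(1, int(h * max_frac)))
    erase_w = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - erase_h)
    left = rng.randint(0, w - erase_w)
    out = [row[:] for row in image]  # leave the input untouched
    for y in range(top, top + erase_h):
        for x in range(left, left + erase_w):
            out[y][x] = rng.random()  # noise value in [0, 1)
    return out


image = [[1.0] * 8 for _ in range(8)]
erased = random_erase(image, rng=random.Random(0))
```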
As described above, data augmentation can be applied to images in the input space, but it can also be applied in the feature space. Konno and Iwazume manipulated the modularity of neural networks after training, improving the performance on CIFAR-100 from 66% to 73% accuracy. Xie et al. [38] presented DisturbLabel (DL), which is an adversarial training technique that randomly replaces labels at each iteration. On the MNIST dataset with the LeNet CNN architecture, DL produced a 0.32% error rate compared with a baseline error rate of 0.39%.
The first GAN architecture proposed by Ian Goodfellow [31] is a framework for generative modeling through adversarial training. Such a network architecture can be applied to data augmentation tasks by generating new training data that results in better-performing classification models. Studies that apply GANs to data augmentation and report the resulting classification performance have been conducted in the field of biomedical image analysis [39]. Frid-Adar et al. [40] tested the effectiveness of generating liver lesion medical images using DCGAN. Classical augmentations alone attained 78.6% sensitivity and 88.4% specificity; by adding DCGAN-generated samples, the authors achieved 85.7% sensitivity and 92.4% specificity. In the literature of X-ray security inspection, Yang et al. [12] proposed a GAN-based data augmentation method to generate images of prohibited items, and Zhu et al. [13] improved SAGAN [41] to generate realistic prohibited item images.

III. PROPOSED METHOD
Fig. 3 illustrates the proposed X-ray image synthesis framework. In this section, we first define a difficulty map and then describe how the difficulty map is used to sample target foreground positions for the generation of hard training samples.

A. DIFFICULTY MAP EXTRACTION
Regardless of the difference between single-stage and two-stage approaches, most deep learning-based object detection networks produce locations of objects and their corresponding confidence values [18], [20], [22]. Therefore, if we feed the background image to an object detector, we can obtain foreground positions that can confuse the detector when evaluated after the image synthesis. We thus attempt to use this degree of confusion as valuable information for determining the target position of foreground objects.
We first feed the background image to the object detector and obtain the box predictions with confidence estimates. Note that the detection network outputs the positions of the box predictions whether there is a target object in the image or not. Let $(p_x^k, p_y^k)$ and $c_k$ denote the center position and the corresponding confidence value of the $k$-th box prediction, respectively. Given the randomly scaled foreground object we want to synthesize, we collect, among the multiple box candidates, the boxes whose intersection over union (IoU) with the foreground object is 50% or higher. We use these remaining boxes, referred to as foreground-shaped predictions (FSPs), and their confidence estimates to define our difficulty map.
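The FSP selection step can be sketched as follows. The text does not spell out how the IoU between a box prediction and the foreground object is measured before the object has a position, so this sketch makes an explicit assumption: the foreground box is centered on each prediction, so that only the two shapes are compared. The names `shape_iou` and `select_fsps`, and the `(cx, cy, w, h, conf)` tuple layout, are hypothetical.

```python
def shape_iou(box_wh, fg_wh):
    """IoU of two axis-aligned boxes that share the same center (assumption)."""
    (w, h), (fw, fh) = box_wh, fg_wh
    inter = min(w, fw) * min(h, fh)
    union = w * h + fw * fh - inter
    return inter / union


def select_fsps(predictions, fg_wh, threshold=0.5):
    """Keep predictions (cx, cy, w, h, conf) whose shape matches the foreground."""
    return [p for p in predictions if shape_iou((p[2], p[3]), fg_wh) >= threshold]


predictions = [
    (40.0, 30.0, 20.0, 10.0, 0.30),  # same shape as the foreground
    (80.0, 60.0, 60.0, 60.0, 0.10),  # far too large, rejected
    (15.0, 70.0, 16.0, 12.0, 0.05),  # slightly different shape, kept
]
fsps = select_fsps(predictions, fg_wh=(20.0, 10.0))
```

Note that no confidence threshold and no non-maximum suppression are applied here, since low-confidence predictions also carry information for the difficulty map.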
Let $D$ denote the difficulty map. At pixel coordinates $(x, y)$, $D$ is built from Gaussian functions centered at the FSPs, together with their confidence estimates, where $\sigma_k$ is the standard deviation of the Gaussian function for the $k$-th FSP. To avoid having difficulty values very close to zero, we set $\sigma_k$ as the center distance between the $k$-th FSP and its closest FSP. In this manner, difficulty values slowly decay between distant FSPs, which is advantageous for our probabilistic sampling of foreground positions. As illustrated in Fig. 4, the difference between the object detection process and the difficulty map extraction process is that object detection keeps the highest-scoring box predictions and applies non-maximum suppression to produce the final results, whereas difficulty map extraction uses all box predictions from FSPs without non-maximum suppression. However, the difficulty map can be obtained using the same detector network as that used in the detection process. Therefore, no additional network or loss function is needed to extract the difficulty map. Fig. 5 shows the difficulty maps for four foreground objects, shown in Fig. 6, which are the results of our case study for hazardous object detection. Note that Fig. 5 shows difficulty maps extracted using SSD as an example, and the difficulty map differs if the detection network changes. In the first and second rows of Figs. 5(b), (c), and (e), it can be seen that different angles of objects with the same size change the difficulty map. When the objects have both the same size and aspect ratio with different angles, considerably similar difficulty maps are generated, as shown in the first and second rows of Fig. 5(d). Moreover, the first and last rows in Fig. 5 show how the difficulty map changes according to the size of the object.
Because the values of the difficulty maps for larger objects have relatively uniform distributions, it is confirmed that the larger the foreground object is, the easier it is to detect for machine learning algorithms, as it is for humans. Fig. 5(e) shows the difficulty map of the razor blade, which is largely influenced by the background because the blade is smaller and thinner than the other objects. Using these difficulty maps, the proposed X-ray image synthesis is performed.
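The construction of the difficulty map described in this subsection can be sketched as below. The exact combination rule is an assumption made for illustration: each FSP contributes a Gaussian centered at its position, weighted by its confidence value, and the contributions are summed. The standard deviation follows the text (distance to the closest other FSP center). The function names are hypothetical, and the sketch requires at least two FSPs with distinct centers.

```python
import math


def nearest_center_distance(centers, k):
    """Distance from center k to the closest other center (needs >= 2 centers)."""
    cx, cy = centers[k]
    return min(
        math.hypot(cx - ox, cy - oy)
        for j, (ox, oy) in enumerate(centers)
        if j != k
    )


def difficulty_map(fsps, width, height):
    """Build a height x width map from FSPs given as (cx, cy, conf) tuples.

    Assumption: confidence-weighted sum of Gaussians; centers must be distinct
    so that every standard deviation is positive.
    """
    centers = [(cx, cy) for cx, cy, _ in fsps]
    sigmas = [nearest_center_distance(centers, k) for k in range(len(fsps))]
    dmap = [[0.0] * width for _ in range(height)]
    for (cx, cy, conf), sigma in zip(fsps, sigmas):
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                dmap[y][x] += conf * math.exp(-d2 / (2.0 * sigma ** 2))
    return dmap


# Two FSPs on a 10x5 grid; both get sigma = 5 (their mutual distance).
dmap = difficulty_map([(2, 2, 0.8), (7, 2, 0.4)], width=10, height=5)
```

Because each standard deviation equals the distance to the nearest neighbor, the map stays clearly above zero between the two centers instead of collapsing into isolated spikes.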

B. IMAGE SYNTHESIS USING THE DIFFICULTY MAP
After normalizing the difficulty map $D$ so that its values sum to one, we obtain the normalized map $\hat{D}$, which can be treated as a probability map. We then sample the target foreground position using $\hat{D}$. Fig. 7 shows the process of sampling the target foreground position. Note that the previous X-ray synthesis methods [9], [11]-[13] also perform probabilistic sampling, but using the 2D uniform distribution. In contrast, we sample the positions where the object detector may get confused. In other words, our method can generate hard training samples that can boost the performance of the object detector. The proposed X-ray image synthesis is performed during training of the object detector as online data augmentation. Therefore, the same background image can be used multiple times for the same object with different scales as well as for other objects.
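The sampling step can be sketched with the standard library alone. The sketch normalizes a difficulty map so that it sums to one and then draws a pixel with probability proportional to its value; the function names `normalize` and `sample_position` are hypothetical, and the map is assumed to contain non-negative values with a positive total.

```python
import random


def normalize(dmap):
    """Scale a non-negative map so that all values sum to one."""
    total = sum(sum(row) for row in dmap)
    return [[v / total for v in row] for row in dmap]


def sample_position(prob_map, rng=random):
    """Draw one (x, y) pixel with probability proportional to prob_map[y][x]."""
    width = len(prob_map[0])
    flat = [v for row in prob_map for v in row]  # row-major flattening
    index = rng.choices(range(len(flat)), weights=flat, k=1)[0]
    return index % width, index // width


# All difficulty mass sits at x=2, y=1, so that pixel is always drawn.
prob_map = normalize([[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
position = sample_position(prob_map, rng=random.Random(0))
```

Replacing `prob_map` with a constant map recovers the uniform sampling used by the random synthesis baseline.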
Given the foreground image $F$ and the background image $B$, X-ray image synthesis can be performed as in [8], [9], [42] according to the Beer-Lambert law [29]. Specifically, the synthesized X-ray image $I$ is obtained as follows:
$$I(x, y) = \begin{cases} F_r(x, y)\, B(x, y), & (x, y) \in \Omega, \\ B(x, y), & \text{otherwise}, \end{cases}$$
where $F_r$ denotes the randomly scaled version of $F$ placed at the sampled target position, and $\Omega$ is the set of pixels in $F_r$. $I$, $F_r$, and $B$ have normalized values in the range [0, 1]. Because the location and class of objects are known in the image synthesis process, the ground truth can also be obtained to train the detector. The synthesized image and ground truth pairs are used to train the detection model, which in turn extracts the difficulty maps.
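A minimal sketch of this synthesis step is given below. It assumes that the sampled position refers to the top-left corner of the foreground patch (the text does not state whether the corner or the center is used) and that the images are normalized to [0, 1]. The function name `synthesize` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical.

```python
def synthesize(background, foreground, top, left):
    """Multiply a normalized foreground patch into the background.

    Returns the synthesized image and the ground-truth box of the pasted
    object as (x_min, y_min, x_max, y_max), clipped to the image bounds.
    """
    height, width = len(background), len(background[0])
    out = [row[:] for row in background]  # pixels outside the patch stay as B
    for dy, row in enumerate(foreground):
        for dx, value in enumerate(row):
            y, x = top + dy, left + dx
            if 0 <= y < height and 0 <= x < width:
                out[y][x] *= value  # I = F_r * B inside the patch
    box = (
        max(left, 0),
        max(top, 0),
        min(left + len(foreground[0]), width),
        min(top + len(foreground), height),
    )
    return out, box


background = [[0.8] * 3 for _ in range(3)]
patch = [[0.5]]
image, gt_box = synthesize(background, patch, top=1, left=1)
```

Since the paste location is chosen by the synthesis procedure, the bounding box comes for free and can be paired with the object's class label as ground truth.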

IV. EXPERIMENTAL RESULTS
In this section, we demonstrate the superiority of the proposed X-ray image synthesis method by applying it to various object detection networks, including SSD [18], RefineDet [43], PFPNet [44], and RFBNet [45], and comparing it with the existing random synthesis method.

A. DATASET DESCRIPTION AND EXPERIMENTAL SETUP
Our experiments have been conducted using the GDX-ray database [46]. The database contains not only 200 test images for X-ray threat detection, but also 48 background images, along with 576, 144, 200, and 100 foreground images of a knife, shuriken, gun, and razor, respectively, that are suitable for X-ray image synthesis. Using these images, we generated training data by synthesizing X-ray baggage images using the existing random position synthesis method [9] and the proposed difficulty map-based method, respectively. Each training set contains 30k images, which is equivalent to the number of training iterations. For generating training data using the conventional synthesis method, we followed the authors' procedure using the source code provided [9]. All detection networks employed in our experiments were trained for 24k iterations with a learning rate of 1e-4, followed by 6k iterations with a learning rate of 1e-5. We used the Adam optimizer [47], and the batch size was set to 8. Our whole training process was conducted using a single NVIDIA TITAN X GPU.
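The step schedule described above can be written as a small function. This is only a sketch of the stated numbers; the function name `learning_rate` and the zero-based iteration index are assumptions.

```python
def learning_rate(iteration):
    """Step schedule: 1e-4 for iterations 0..23999, then 1e-5 for 24000..29999."""
    if not 0 <= iteration < 30_000:
        raise ValueError("iteration outside the 30k-iteration schedule")
    return 1e-4 if iteration < 24_000 else 1e-5


# Learning rate at the first step, just before the drop, and at the drop.
schedule_samples = [learning_rate(i) for i in (0, 23_999, 24_000)]
```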

B. PERFORMANCE EVALUATION
We evaluated the object detection performance using the average precision (AP) and mean AP (mAP). Table 1 shows a performance comparison of the synthesis methods on 200 real-world test images of the GDX-ray database. It can be seen that the proposed method improved the mAP scores by 3.2%, 5.0%, 3.0%, and 5.2% for SSD, RefineDet, PFPNet, and RFBNet, respectively.
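For reference, one common way to compute AP and mAP is sketched below. The text does not state which AP variant was used, so this sketch assumes all-point interpolation of the precision-recall curve and assumes that each detection has already been marked as a true or false positive by IoU matching. The function names are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) for all detections.
    num_gt: number of ground-truth objects of this class.
    """
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)
    # Precision envelope: each value becomes the best precision at any
    # equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


ap_example = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

In the toy example, the false positive ranked second lowers the precision for the second true positive to 2/3, giving an AP of 0.5 + 0.5 × 2/3 = 5/6.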
To demonstrate the effectiveness of the proposed method more clearly, the performance comparison needs to be performed on a more complicated test set; examples of the synthesized test images are shown in Fig. 8. The experimental results on this synthesized dataset are shown in Table 2. Note that the overall performance decreased due to the difficulty of object detection in cluttered scenes. However, the proposed method enabled more solid and consistent performance improvements for all tested object detection networks. Fig. 9 shows several object detection results obtained using the conventional and proposed synthesis methods on the real-world test images. As shown in Figs. 9(a) and (d), the conventional method failed to detect small occluded objects. In contrast, such objects can be correctly detected by applying our synthesis method. Furthermore, the proposed method reduced false alarms, as shown in Figs. 9(b) and (c).
The results of each method on the synthesized test images are illustrated in Fig. 10. Figs. 10(a), (b), and (d) show that the conventional method failed to detect occluded objects, which can be correctly detected by applying our proposed method. Moreover, although the conventional method produced false alarms on the more complicated test set, the proposed method provided accurate detection results.

V. CONCLUSION
A novel learning-based image synthesis method was proposed to train object detection networks for X-ray security applications. The proposed method extracts the difficulty map, which is used for sampling the target foreground position for image synthesis during the training process. By synthesizing foreground objects at hard-to-detect locations, more challenging training samples can be generated, yielding improved object detection performance. The experimental results show that the proposed method improves the performance of various object detection networks compared to the previous standard of random image synthesis.