Computer Vision on X-Ray Data in Industrial Production and Security Applications: A Comprehensive Survey

X-ray imaging technology has been used for decades in clinical tasks to reveal the internal condition of different organs, and in recent years, it has become more common in other areas such as industry, security, and geography. The recent development of computer vision and machine learning techniques has also made it easier to automatically process X-ray images and several machine learning-based object (anomaly) detection, classification, and segmentation methods have been recently employed in X-ray image analysis. Due to the high potential of deep learning in related image processing applications, it has been used in most of the studies. This survey reviews the recent research on using computer vision and machine learning for X-ray analysis in industrial production and security applications and covers the applications, techniques, evaluation metrics, datasets, and performance comparison of those techniques on publicly available datasets. We also highlight some drawbacks in the published research and give recommendations for future research in computer vision-based X-ray analysis.


I. INTRODUCTION
The need for a non-destructive procedure to examine the interior of objects and assess their structural patterns or constituent contents has led to many applications of X-ray technology in different fields. While the medical field was one of the first to use the technology for assessing the inner parts of the body [1], the use of X-ray technology is expanding considerably for industrial and security purposes [2], [3]. Factories can now assess whether there are anomalies or defects inside a product without destroying it [4], and border patrol officers at security gates can check for forbidden objects inside baggage without opening it [5].
Considering the need for fast production lines in competitive industries and the issues related to human-based image assessment, such as subjectivity and fatigue, automatic and reliable image processing methods are clearly necessary. Efficient automatic techniques for processing X-ray data are also needed at security gates for baggage checking due to the increasing number of travelers. Recent advances in computer vision (CV), machine learning (ML), and deep learning (DL) have the potential to provide efficient and reliable solutions to X-ray-based automatic and real-time object, anomaly, or defect detection and recognition.
As the need for automatic X-ray applications is growing in industrial production and security fields, several works have proposed different CV-based techniques for processing the data. To get an overall understanding of the current state of the research, an extensive review of the published articles is needed. There are also previously published related surveys: A book by Mery [?] broadly discusses computer vision algorithms for industrially relevant applications of X-ray testing, but does not cover the recent deep learning-based advances. A survey by Hou et al. [15] focuses on computer-aided weld defect detection from radiography images. Surveys by Mery et al. [16] and Akcay and Breckon [17] focus on computer vision for security applications. A summary of the available surveys on this topic is presented in Table 1, where check marks indicate the covered topics. We note that the previously published surveys on these topics cover only a specific application and/or just a limited type of CV methods. Furthermore, despite their obvious importance, the previously published surveys do not comprehensively discuss evaluation setups and metrics for different CV tasks in X-ray analysis.
In X-ray imaging, ionizing beams are generated by X-ray tubes and pass through the scanned object to be detected by the detectors on the other side of the object (Fig. 2). Depending on the mass density of the exposed object, the X-ray signal is attenuated, which leads to a lower intensity on the detector. In other words, the measured intensity decreases as the material density increases. Therefore, X-ray imaging can be used to carry out non-destructive assessments when mass density is a parameter of interest.
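The link between attenuation and detected intensity can be illustrated with the Beer-Lambert law, I = I0 * exp(-mu * t), which is standard physics but is not stated in the text above. The following minimal sketch assumes a monochromatic beam and a homogeneous material; the function name and the attenuation coefficients are illustrative placeholders, not measured values.

```python
import math

def transmitted_intensity(i0, mu, thickness):
    """Beer-Lambert law for a monochromatic beam through a homogeneous slab.

    i0        -- incident intensity (arbitrary units)
    mu        -- linear attenuation coefficient in 1/cm (illustrative values below)
    thickness -- path length through the material in cm
    """
    if mu < 0 or thickness < 0:
        raise ValueError("mu and thickness must be non-negative")
    return i0 * math.exp(-mu * thickness)

# Hypothetical coefficients: a weakly and a strongly attenuating material
# of the same thickness. The denser material yields the lower intensity.
light = transmitted_intensity(1000.0, mu=0.2, thickness=2.0)
dense = transmitted_intensity(1000.0, mu=1.5, thickness=2.0)
```

The exponential form is why the detected intensity falls off quickly, rather than linearly, as the attenuating material becomes denser or thicker.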
X-ray imaging techniques can be categorized based on the number of energy levels and the view they use. In terms of energy level, X-ray imaging techniques can be divided into mono- or multi-energy techniques (irrespective of the view or the number of X-ray beams). In mono-energy X-ray imaging technology, only one energy level is used for the radiated beams [18], which provides gray-scale images (see Fig. 3-a) according to the object mass density. This is a suitable X-ray imaging technology when dealing with mono-material objects (e.g., in additive manufacturing). Dual- and multi-energy X-ray imaging technologies use several energy levels to provide several X-ray images, leading to a better understanding of the objects' density and effective atomic number [19]. By using a look-up table [20], the measured values can be converted to a pseudo-colored image of the object (see Fig. 3-b), where various colors are assigned to different types of material. Thus, when different types of material are assessed (e.g., in baggage security checks), these X-ray imaging technologies can provide more information and make it easier to analyze the inner structure of the objects.
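The look-up-table step described above can be sketched as follows. This is only an illustration: the bin boundaries on the effective atomic number, the three-color palette, and the function names are assumptions made for the example (loosely following the orange/green/blue convention often seen in baggage scanners), not the table used in [20].

```python
from bisect import bisect_right

# Hypothetical upper bounds on effective atomic number (Z_eff) for each bin.
Z_EFF_BOUNDS = [10.0, 18.0]
# Illustrative RGB colors; actual scanner palettes are vendor-specific.
PALETTE = [
    (255, 140, 0),   # low Z_eff: organic-like material
    (0, 170, 0),     # medium Z_eff: mixed / light inorganic material
    (0, 90, 255),    # high Z_eff: metal-like material
]

def pseudo_color(z_eff):
    """Look up the display color for one pixel's estimated Z_eff."""
    return PALETTE[bisect_right(Z_EFF_BOUNDS, z_eff)]

def colorize(z_eff_image):
    """Apply the look-up table to a 2D list of Z_eff estimates."""
    return [[pseudo_color(z) for z in row] for row in z_eff_image]
```

A real system would typically also modulate brightness by the measured attenuation; that step is omitted here to keep the sketch short.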
In terms of the view, X-ray imaging techniques can be divided into 2D, multi-view, and 3D imaging categories. In 2D imaging, the X-ray beams are radiated by the X-ray tube to the object from only one direction, producing 2D images. In multi-view imaging, the objects of interest are exposed to the X-ray beams from various angles [21] (see Fig. 4-a), which provides more information on the object and facilitates the analysis. In 3D imaging, the output is in a 3D form and it can be produced in different ways. One approach, also known as tomography, is to capture 2D X-ray images of different layers of the object and then stack them on top of each other (see Fig. 4-b) to provide a 3D volume of it [6], [14], [22]. Another approach for 3D imaging is to combine multi-view imaging with image processing techniques to convert the 2D images from different angles into a 3D volume [23].
Finally, it should be mentioned that visual and X-ray imaging share some characteristics and challenges, such as noise, occlusion, and perspective. Therefore, many image processing methods developed for color images can also be used with X-ray images [24].
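The slice-stacking idea behind the tomographic approach can be sketched with plain Python lists. The function names and the [z][y][x] indexing convention are assumptions made for this example; practical pipelines would normally use an array library and a proper reconstruction algorithm.

```python
def stack_slices(slices):
    """Stack equally sized 2D slices (lists of rows) into a volume indexed [z][y][x]."""
    if not slices:
        raise ValueError("at least one slice is required")
    height, width = len(slices[0]), len(slices[0][0])
    for s in slices:
        if len(s) != height or any(len(row) != width for row in s):
            raise ValueError("all slices must have the same shape")
    # Copy the rows so that the volume does not alias the input slices.
    return [[list(row) for row in s] for s in slices]

def cross_section_xz(volume, y):
    """Extract the plane at a fixed row index y, giving an image indexed [z][x]."""
    return [list(layer[y]) for layer in volume]
```

Once the slices form a volume, cross-sections along other axes (such as the one returned by `cross_section_xz`) become available even though they were never imaged directly.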

B. Computer Vision Tasks
In this section, we briefly describe the main CV tasks relevant to different X-ray applications. We start by defining the general terms: computer vision, machine learning, and deep learning. The terms are often used interchangeably, but they cover different subsets of methods.
• Computer vision (CV) is a sub-field of Artificial Intelligence (AI) that focuses on processing images and videos captured by a variety of sensors (e.g., visual cameras, X-ray imaging sensors, depth sensors). Many CV methods are learning-based and thus also ML methods, but there also exist non-learning-based CV algorithms.
• Machine learning (ML) is a sub-field of AI, where methods learn to perform a task without being explicitly programmed to do so. To be able to learn, most ML models need to be trained using pairs of inputs and target outputs. ML methods can be applied to many CV tasks involving images/videos captured by a variety of sensors, but also to tasks involving various other data types, which are out of the scope of this paper.
• Deep learning (DL) is a sub-field of ML. DL methods involve multiple layers of data transformations, usually taking the form of neural layers, to progressively extract higher-level and more complex patterns from data.
Classification is a CV task aiming at assigning a data sample (e.g., an image or a video) into one class included in a set of predefined classes. The classes can represent properties, such as intact/damaged, or types of depicted objects. In single-label classification (commonly simply referred to as classification), the classes are mutually exclusive and only one label is assigned to each sample, while in multi-label classification samples can be assigned a varying number of labels. Based on the number of classes, (single-label) classification tasks can be categorized into binary and multi-class types. Binary classification refers to tasks with only two classes, while multi-class classification refers to tasks with more than two classes. Fig. 5-a shows an example of a binary classification task on casting defect detection, where the image on the left is classified as non-defective and the image on the right as defective.
FIGURE 5. [...] [25], b) Object detection in X-ray baggage security inspection [25], c) Semantic segmentation of an X-ray welding image [25], and d) Instance segmentation of glass fibers in industrial computed tomography [26].
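The difference between single-label and multi-label classification shows up in how raw model scores are turned into labels. The sketch below assumes a model that outputs one real-valued score per class; the softmax/argmax and sigmoid/threshold decision rules are common choices rather than something prescribed by the text, and the class names in the usage note are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def single_label(scores, classes):
    """Mutually exclusive classes: pick the single most probable class."""
    probs = softmax(scores)
    return classes[probs.index(max(probs))]

def multi_label(scores, classes, threshold=0.5):
    """Independent classes: keep every class whose sigmoid score passes the threshold."""
    return [c for s, c in zip(scores, classes)
            if 1.0 / (1.0 + math.exp(-s)) >= threshold]
```

For the same scores, for example `[2.0, 1.0, -3.0]` over `["gun", "knife", "scissors"]`, the single-label rule returns exactly one class, whereas the multi-label rule may return several or none.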
Object detection [27] aims at both localizing and identifying each object in an image/video. Bounding boxes are commonly used to represent where each object is in the image, while the identification can be seen as classification of the image patch depicting the object. Object detection allows counting different objects or following specific objects in videos. Fig. 5-b illustrates the output of object detection applied to an X-ray baggage security image. It can be seen that the trained ML model detected three types of objects of interest and put bounding boxes around them.
In semantic segmentation [28], the goal is to assign each pixel in an image to a class. Compared to classification and object detection tasks that provide the overall class or bounding boxes around the detected objects, semantic segmentation delivers an exact outline of the objects/content from different classes. Fig. 5-c presents an example of a semantically segmented X-ray welding image. In the segmented image, the exact pixel-wise locations of the defects are marked. Instance segmentation [29] is a slightly different task, where the goal is to segment the image according to the different instances of the same class, e.g., to count the instances. In [26], instance segmentation was applied for segmenting glass fibers in industrial computed tomography as illustrated in Fig. 5-d.
While most CV-based X-ray image analysis methods proposed in the literature can be categorized directly as methods targeting classification, object detection, or segmentation tasks, some works set their final objective beyond these tasks. For example, in [30], object detection is used to identify incorrect assembly, missing assembly, or transposition of internal components of a product and, in [31], defect segmentation is followed by estimation of different defect characteristics. In [32], features extracted from X-ray images were used to predict geometrical parameters of welding as a regression task.
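To make the distinction between semantic and instance segmentation concrete, the sketch below turns a binary semantic mask into per-instance labels with connected-component labeling. This is a deliberately simple stand-in that only works when instances do not touch; it is not the learning-based method of [26], and the function name is an assumption.

```python
from collections import deque

def label_instances(mask):
    """Assign a distinct positive id to each 4-connected foreground region.

    mask -- 2D list with truthy values for foreground pixels.
    Returns (labels, count), where labels has the same shape as mask.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    next_id = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and labels[y][x] == 0:
                # Unvisited foreground pixel: start a new instance and flood-fill it.
                next_id += 1
                labels[y][x] = next_id
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = next_id
                            queue.append((ny, nx))
    return labels, next_id
```

The semantic mask only says which pixels belong to the class; the returned labels additionally say which pixels belong together, which is what allows counting the instances.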

III. APPLICATION AREAS OF X-RAY TECHNOLOGY
In this section, a short overview of different industrial production and security applications that use X-ray images along with computer vision techniques is provided. The main covered research topics are additive manufacturing, casting, welding, security, electronic industry, and material sciences. The related methods are categorized based on their underlying computer vision objective, i.e., classification, detection, or segmentation.

A. Additive Manufacturing
Additive manufacturing (AM) technology, which is also known as 3D printing [33], is broadly used in diverse industrial applications with high material and geometric complexities, such as car manufacturing. AM technology can use several techniques, including directed energy deposition [31], powder bed fusion [34], binder jetting [31], and additive friction stir deposition [31], to build final or near-net-shape (i.e., initial roughly shaped) parts in a layer-by-layer manner directly from digital files. However, structural defects, such as pores, internal micro-cracks, air bubbles, surface pits, surface scratches, and porosity arrays, are inevitable in current AM processes [31]. Printing errors, cyberattacks, residual stress, powder materials, chamber environment, as well as printing parameters, namely heat source power, scan speed, hatch space, and layer thickness, are considered to be the possible reasons behind the mentioned defects [33].
X-ray computed tomography (XCT), as a non-destructive evaluation process, is widely used in AM processes to examine the internal and surface structure of produced parts to detect different defects [14]. Most of the CV-based techniques used for AM inspection aim at segmenting XCT data in 2D or 3D [6], [33], [34]. In [6], defect segmentation of cylindrical AM specimens belonging to the CoCr AM XCT dataset (see Section VI-A) was carried out using a local thresholding method on 2D slices. 3D defect segmentation of the same data based on a 3D fully-convolutional network was carried out in [33]. In [34], a 2D segmentation network was employed for automatic porosity segmentation of metallic AM specimens. The XCT data was processed as a stack of 2D images to provide porosity labels for the specimens, and different segmentation methodologies were evaluated.
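Local thresholding on a 2D slice can be sketched as a comparison of each pixel against the mean of its neighborhood. The version below is a generic local-mean rule written for illustration; the window radius, the offset, and the assumption that defects appear darker than the surrounding material are choices made for this example and do not reproduce the exact procedure of [6].

```python
def local_threshold(image, radius=1, offset=10.0):
    """Mark pixels that are darker than their neighborhood mean by more than `offset`.

    image -- 2D list of gray values; returns a 2D list of 0/1 flags.
    """
    h, w = len(image), len(image[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the (2*radius+1)^2 window, clipped at the image border.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if image[y][x] < local_mean - offset:
                mask[y][x] = 1
    return mask
```

Because the threshold adapts to the neighborhood, such a rule is less sensitive to slow intensity drifts across the slice than a single global threshold.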
In [31], an application going beyond the basic CV tasks was proposed as an inspection pipeline and applied for defect characteristics and pore evolution analysis in a binder jetting copper AM system.

B. Casting
Casting is a manufacturing process finding applications in complex industries, such as aerospace [2], [35] and automobile [4], [36] industries, and with materials, such as aluminum [37], [38] and titanium [2] alloys. Due to the limitations of the manufacturing techniques [37], castings can host several defects, such as holes and flaws, gas cavities, shrinks, slags, cracks, high- and low-inclusions, wrinkles, casting fins, shrinkage-holes, and incomplete fusion [2], [35], [36], which can lead to catastrophic failures of critical mechanical components [4], [37]. Therefore, it is crucial to implement a non-destructive testing system to detect internal and surface defects early in the manufacturing process to reduce the risks and save time and costs [36], [38].
To this end, X-ray imaging is becoming a useful technology to visualize the internal structure of castings and, combined with CV methods, it allows for automatically assessing the products and detecting anomalies [39]. In casting assessment, most of the studies consider the problem as a binary task (e.g., binary classification or segmentation) to differentiate between defective and non-defective castings. There are also some cases where more than one type of defect defines a multi-class detection or segmentation problem [2].
Several works in the automotive industry frame the problem as a binary classification task [37], [38], [40]-[42], where the goal is to classify X-ray images into defective and non-defective classes. In aerospace industry applications, several traditional classification methods were evaluated and compared in [35] for casting defect classification of image patches in supporting plates of aeromotors.
In aerospace industry applications, most of the related research focuses on detection of casting defects in aeroengines. In [2], defect candidate search and classification steps were applied to detect aerospace titanium casting defects. For detecting core failures in die casting, an unsupervised inspection framework was designed and introduced in [43]. A DL-based detection method that tries to boost the detection performance at both data augmentation and algorithm levels was used in [39]. Casting defects were localized using several convolutional neural network (CNN) architectures that were trained on a relatively small dataset in [44]. As an alternative method, a classifier was trained on image series and a sliding window-based approach was applied for localization. To increase the safety in the construction of road-worthy metallic components, several casting defect detection CV methods were assessed and compared in [36]. The defect detection approach proposed in [4] trains the network simultaneously for detection and instance segmentation on casting X-ray images. It was experimentally shown that simultaneous training for detection and instance segmentation led to a higher detection accuracy than training for detection alone.
Semantic segmentation methods for casting inspection were assessed in [45], [46]. The authors of [45] used only realistically simulated X-ray data to train a network to perform semantic segmentation on cast aluminum parts. Large defect scale variation, small inter-class differences, and annotation uncertainty issues were tackled in [46] for defect semantic segmentation.

C. Welding
Welding, a manufacturing process that joins two workpieces by melting them and causing coalescence, plays a critical role in a variety of production applications, such as aircraft, shipbuilding, and automobile production [47]. However, the instability of welding process parameters, such as welding current, voltage, speed, and nozzle height, as well as the structural component deformation might cause defects in the joints [47], [48]. These defects can reduce the quality of the product by affecting strength, stiffness, safety, and durability and cause catastrophic damage [49], [50]. Due to the different welding energy sources, environments, chemical and physical processes, and joining materials, welding is considered a complex and not fully understood process [51]. Therefore, weld quality evaluation done by experts carries limitations, such as subjectivity, misinterpretation, and time consumption [52], [53]. To overcome the limitations and to achieve real-time geometrical parameter monitoring, automatic non-destructive techniques to assess welding quality are needed [32]. The combination of X-ray imaging technology and CV methods is a commonly chosen approach in this regard.
To classify different welding defects, a CNN-based classification method was used in [54], [55]. In [55], imbalanced class distribution was addressed by using resampling methods to create a balanced dataset. In [32], an approach including feed-forward neural network (FNN) and Support Vector Machine (SVM) models was introduced for laser welding process monitoring and defect recognition. In [56], binary classifiers with low data requirements for generic automated surface inspection were presented. Defective weldings were classified in [51] employing a complex binary classifier consisting of an artificial neural network (NN) and a fuzzy logic system. A set of geometric features, such as shape measures (compactness, elongation, symmetry, etc.), was defined in [57] to characterize defects in X-ray data and then these features were used as inputs to a multi-class classifier that divides the problem into one-versus-one binary problems.
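One simple form of resampling for class imbalance is random oversampling, in which minority-class samples are duplicated until every class is as large as the biggest one. The sketch below illustrates that idea only; it is an assumption that this resembles the resampling in [55], and the function name and fixed seed are choices made for the example.

```python
import random

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes match the largest one."""
    rng = random.Random(seed)  # fixed seed so that the result is reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

Oversampling should be applied to the training split only; duplicating samples before splitting would leak copies of the same image into the test set.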
Studies in [48], [49], [58], [59] applied detection CV tools on welding X-ray data. The class distribution of welding X-ray image sets was balanced in [49] using two data augmentation approaches, and then the balanced data and a feature extraction-based transfer learning method were used to train two deep models. The models were then combined to perform defect detection via dividing the image into sub-images and separately classifying each sub-image. The method in [58] uses a hybrid automatic detection scheme including a location extractor of the weld region and a detector based on a binary classifier. The detector uses a sliding window and the trained binary classifier to detect the defective parts. A DL-based model was employed in [48] to automatically identify multiple welding defects and extract their location without any pre-processing.
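Detection by sliding a window over the image and classifying each patch can be sketched as follows. The `classify_patch` argument is a hypothetical stand-in for a trained binary classifier, and `toy_classifier` is a made-up rule used only to make the example runnable; neither reflects the actual models of [49] or [58].

```python
def sliding_window_detect(image, classify_patch, window=2, stride=2):
    """Return (x, y, width, height) boxes for every window the classifier flags.

    `classify_patch` takes a 2D patch (list of rows) and returns True when the
    patch looks defective.
    """
    h, w = len(image), len(image[0])
    boxes = []
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patch = [row[x:x + window] for row in image[y:y + window]]
            if classify_patch(patch):
                boxes.append((x, y, window, window))
    return boxes

def toy_classifier(patch, threshold=50):
    """Toy rule used only for the demo: flag patches with a dark mean gray value."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values) < threshold
```

With a stride smaller than the window, neighboring windows overlap and the same defect may be flagged several times, so practical detectors usually merge overlapping boxes afterwards.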
In [47], high-precision automatic weld defect segmentation for small defects was achieved by employing a deep neural network and data augmentation. In [60], an automatic welded joint segmentation technique is introduced that can localize weld beads, segment discontinuities (as potential defects), and finally, extract the features to classify the discontinuities. In [52], minimum intra-class and maximum inter-class variances were used to localize defects after applying a noise reduction method on X-ray data. Then, shape features were extracted and used to classify the defects. The potential weld joint defects were segmented using a background subtraction algorithm in [53], and then defects' features including average gray-scale difference to the background, gray-scale standard deviation, and the defect area were extracted and used by a classifier to differentiate real defects from all potential ones. In [59], defects were segmented by classifying each pixel using extracted feature vectors.
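Choosing a threshold that minimizes intra-class variance, which is equivalent to maximizing inter-class variance, is the criterion optimized by Otsu's method. The sketch below implements that classical criterion on a flat list of integer gray values; whether [52] uses exactly this formulation is an assumption.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class and pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and gray-value sum of the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not valid
        mean0, mean1 = sum0 / w0, (total_sum - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t
```

When several thresholds give the same separation, this version returns the lowest one, which is one of several reasonable tie-breaking conventions.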

D. Security
One of the most common areas where X-ray imaging is used is baggage inspection at, for example, security gates at railway stations, subway stations, and airports, for detecting prohibited items and threats [61]. However, the inspection and threat detection in this context are usually performed by humans [61], [62]. Fatiguing work schedules, complexity in catching contraband items, inexperienced operators, and squeezed and overlapped items can be named as limitations of the human detection operation [3], [5], [63]. On the other hand, especially during busy hours, quick evaluation and detection are urgent to prevent any delay in the passengers' transportation schedules [64], [65]. All these together emphasize the necessity of reliable and time-efficient methods to do the detection automatically. Several datasets are available for developing CV-based approaches as discussed in Section VI. Some sample images from one of the datasets, OPIXray [9], are shown in Fig. 6.
FIGURE 6. Samples of X-ray security images, available in OPIXray data [9].
To tackle the overlapping issue in X-ray images of tightly packed luggage, a multi-label classification network was used in [66] to recognize prohibited items such as guns, knives, scissors, etc. A multistage analyzer and classifier system was proposed in [67] to automatically perform threat recognition in different security monitoring environments to identify a wide range of firearm threats. To manage intra-class variability in contrast, pose, image size, and focal distance, a new representation approach was introduced in [68] to recognize objects. In [69], a method was adopted for binary classification of firearms versus other objects in baggage security X-ray images. A deep NN was employed in [70] as a security image classifier with the ability to overcome the data scarcity problem to classify among gun, knife, and electrical device classes. Ten diverse CV-based strategies were investigated in [24] for object recognition in X-ray security imaging. In [71], an anomaly detection method was proposed that categorizes anomalies into appearance and semantic anomalies. Unusual shape, texture, and density were considered as appearance anomalies and unfamiliar objects as semantic anomalies. A multi-scale CNN architecture was used in [72] to discriminate materials into classes, such as metal or organic substances, using dual-energy X-ray scanner images.
Most studies in security applications aim at detection of threats. Some works focus on binary detection tasks to discriminate a specific class of objects, e.g., firearms/firearm components [73]-[75]. The works in [5], [63], [74]-[77] aim at recognizing multiple different threat categories, such as knives and guns, whereas the work in [78] aims at categorizing several object types, such as laptops and mobile phones, either as benign or anomalous. An approach for extracting multi-level information and handling non-rigid deformations was employed in [79]. The problem of overlapping objects in X-ray security data was tackled in [80]. A patch-wise image classification method based on sparse representation of direction features was introduced in [81] and used for threatening object detection. The direction features were extracted to build a foreground dictionary used for assessing test data to detect the foreground. Another simple foreground-background segmentation technique based on color thresholds was applied in [64] as a preprocessing step for object detection in the X-ray images. To increase the trust in automatic detectors in baggage security imagery applications, a human-in-the-loop detection framework was presented in [82]. The framework gives a score to each prohibited item proposal and, based on the score, the baggage is assigned to safe, suspicious, or dangerous classes. In case of classification as suspicious, a human makes the final decision on the baggage.
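The score-based routing in the human-in-the-loop framework can be sketched as a three-way decision. Everything specific here is an assumption made for illustration: the two thresholds, the use of the highest proposal score to represent the bag, and the function name are not taken from [82].

```python
def triage(proposal_scores, suspicious_at=0.3, dangerous_at=0.7):
    """Map the highest proposal score in a bag to safe / suspicious / dangerous.

    proposal_scores -- scores in [0, 1], one per prohibited-item proposal.
    The two thresholds are hypothetical; suspicious bags go to a human operator.
    """
    top = max(proposal_scores, default=0.0)  # a bag with no proposals counts as safe
    if top >= dangerous_at:
        return "dangerous"
    if top >= suspicious_at:
        return "suspicious"
    return "safe"
```

The width of the band between the two thresholds controls how much work is handed to the operator: a wider band sends more bags to manual review.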
Using synthetic data is a way to tackle the lack of data in security image detection [61], [62], [83]. In [61], data for the CNN training process was obtained using a method that generated X-ray security images with multiple prohibited items. The study in [62] used a data augmentation method that first generates several RGB images of prohibited items. Subsequently, the images are transformed into X-ray format and combined with different backgrounds. Synthetically composed X-ray images of transformed threats and backgrounds were used in [83] to overcome the high data requirement challenge in CNN training. To assess the approach, a CNN-based object recognition method was trained with both real and synthetic data and the results were compared, showing promising results for the combination of real and synthetic data.
Some studies on automatic X-ray security data analysis applied image segmentation. The work in [84] presented an automatic segmentation method for security screening that first enhances images to improve performance and then applies color-based pixel segmentation to distinguish diverse materials (organic, inorganic, mixed, and opaque objects) from the background. A method using a DL model as a robust feature extractor and an adversarial auto-encoder to classify images into organic and inorganic classes considering the overlap among the materials was proposed in [3]. To find the most suitable object-level and sub-component-level anomaly detection strategy, several segmentation methodologies were assessed in [65]. Their performances were analyzed by applying them to an extensive dataset focusing on electronic items.
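Composing a threat object into a background image can be sketched with a multiplicative model: for transmission images normalized to [0, 1], attenuations along the same ray add up, so transmissions multiply. This physical simplification (single energy, no scatter, no noise) and the function name are assumptions made for the example and are not the specific generation procedures of [61], [62], or [83].

```python
def superimpose(background, threat, top, left):
    """Insert a threat patch into a background transmission image.

    Both images hold transmission values in [0, 1] (1 = no attenuation).
    The patch is placed with its upper-left corner at (top, left); parts that
    fall outside the background are ignored. The input is not modified.
    """
    out = [list(row) for row in background]
    for dy, threat_row in enumerate(threat):
        for dx, t in enumerate(threat_row):
            y, x = top + dy, left + dx
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = out[y][x] * t
    return out
```

Randomizing the position, rotation, and scale of the threat patch before insertion is a natural way to turn this into a data augmentation routine.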

E. Material Sciences
Following the recent developments in CV capabilities in combination with diverse X-ray imaging technologies, a new growing research topic focuses on the analysis of various materials using CV methods on X-ray images. It should be mentioned that the term material science here refers to the studies that focus on the properties of materials, not the production process of different materials. These methods can be employed to achieve realistic textile composite finite element models [85], fiber extraction models [26], links between microstructures and physical properties [86], and characterization and mappings of materials [87].
Atomic resolution images of materials can be obtained by illuminating randomly oriented particles with an X-ray free-electron laser beam and collecting the scattering patterns. In [88], neural networks were used for binary classification of such diffraction patterns of non-crystalline objects into single hit or non-single hit classes.
For detection of small-sized and dense void and inclusion defects in spacecraft composite structures, transfer learning and domain adaptation were used in [89]. In [90], detection of internal defects of aluminum conductor composite core was performed as a patch-wise classification task.
Several works apply 2D or 3D segmentation techniques on XCT images. A 2D segmentation method was employed in [91] on lab-based micro computed tomography (µ-CT) images of carbon fiber reinforced polymers to tackle the challenges caused by noise, low contrast between fiber and polymer, and unclear fiber gradients. A 3D instance segmentation method was developed in [26] for XCT scans of short glass fiber reinforced polymers. The model has an additional output for embedding learning, which allows a clustering algorithm to distinguish among various fiber instances. Different 2D and 3D semantic segmentation techniques were applied on XCT image data to study microstructures of materials in [86]. The ground truth information of 3D X-ray diffraction measurements was used to develop a grain-wise segmentation model for Al-Cu specimens with additional post-processing to enhance visible grain boundaries and reduce over-segmentation. A µ-CT image processing method to build digital material twins was presented in [85], where a deep learning model was applied on images of 2D glass and 3D carbon reinforcements to efficiently segment them based on extracted multi-scale features using data-driven convolutional filters. The scanned µ-CT images and images produced by computer-generated virtual reinforcement models were used to train the model.

F. Electronic Industry
The electronic industry, and more specifically semiconductor manufacturing, has been under rapid development in the last few decades [92], which increases the need for fast and accurate methods for tasks such as defect detection [92], analysis of unwanted particle deposition [93], and volumetric inspection [94]. Therefore, X-ray imaging technologies along with CV techniques have drawn researchers' attention also in this field.
In order to characterize, measure, and optimize the design and production of buried interconnects in advanced integrated circuit packages, XCT imaging was used in [23] to avoid cross-sectioning the chips, and then several DL-based 3D object detection and segmentation methods were used to identify the components and perform 3D metrology. In [93], a CNN model was trained on a set of energy-dispersive X-ray and scanning electron microscopy images to classify the chemical composition of particle defects on semiconductor wafers to decrease analysis time and error caused by human unpredictability. The internal wire bonding of chips is a process that can easily face interference and produce defects in the semiconductor enterprise capsulation step. Therefore, two algorithms were used in [92] to distinguish defective chips based on the standard template and similarity calculations among the neighbor chips. Also, as ML techniques need a lot of labeled data, a data synthesis procedure was employed in [94], where synthetic XCT images were produced during the miniature fabrication of thin silicon wafer layers with known orientation, position, and geometry features. These known data characteristics served as annotations for training an automatic ML-based feature extraction model.

G. Others
Besides the fields mentioned above, there are other industrial production fields that employ different CV techniques on X-ray images.A brief overview is provided here.
In [95], a two-stage method was employed for X-ray cargo image inspection to solve the empty container verification problem.First, a rule-based algorithm was adopted to discover the location of containers' positions in the images and, afterward, a DL method was used to identify the empty containers.Solder balls' head-in-pillow defects were inspected using an ML-based methodology in [96].These defects affect the solder balls' conductivity and consequently lead to intermittent failures.In another study [97] focusing on solder joints, solder voids and head-in-pillow defects were recognized.
In order to increase the safety in aircraft flying, a DL method was proposed in [98] for X-ray image-based nondestructive examination of aeronautics engines with multiple defect inspection paradigms.Another application of using X-ray images and CV techniques is assembly inspection of internal components [30].In this study, to ensure that all components of a complex product are assembled accurately, a multi-view X-ray imaging technique was used to obtain projection information on each internal component.Then, a deep CNN model was used to classify the internal components and provide their coordinates to compare and match the locations and consequently recognize transposition or dislocation faults.To detect and reject defective products in a mineral wool production line, a binary classifier was developed in [99].The goal of this work was to achieve fast classification for a real-time application that can outperform a thresholding-based method on the production line.To this end, the authors performed structured parameter pruning on the adopted deep learning model.X-ray-based adaptive defect detection in milled aluminum ingots surfaces was used in [100].Automatic segmentation of multi-class progressive matrix damage of aerospacegrade advanced composite laminate images obtained by nondestructive on-site mechanical tests coupled with synchrotron radiation computed tomography was considered in [101].In order to assess and classify tablets' internal defects, an Xray-based method was used in [102] to explore the impacts of a filler composition, roller compaction force, and magnesium stearate on tablets quality.Also, it was shown that the use of X-ray images with quantitative CV analysis can generate deeper mechanical knowledge of the compaction phenomenon in tableting.
A two-step CV methodology was proposed in [103] to detect voids and segment concrete samples. Another study on concrete fractures [104] noted that, due to the low number of pixels for each fracture in X-ray images, high-frequency noise, and weak contrast over fractures, conventional segmentation methods are limited in extracting continuous fractures, which leads to an overestimation of fracture aperture and thickness values. Therefore, an encoder-decoder network was adopted with a CNN to achieve rapid and precise detection of barely visible micro-fractures. The organic microcapsules in cement were classified in [105] into five categories, namely microcapsules, ruptured microcapsules, pores, adhesive objects, and others, using a CV-based classifier.

IV. COMPUTER VISION TECHNIQUES
In this section, the computer vision and machine learning methods used in the previous X-ray-related studies are summarized. While advances in deep learning during the last decade have made deep learning techniques the default solutions for many machine learning tasks, these techniques require large amounts of training data, which is not always available, and therefore, traditional techniques are still commonly used in many tasks. A major difference between traditional and deep learning methods is that deep learning methods can typically operate directly on high-dimensional raw data, such as X-ray images, while traditional techniques generally use as their inputs lower-dimensional features extracted from the raw data. Therefore, feature extraction techniques are important for traditional techniques, but not for deep learning-based methods. On the other hand, as deep learning models require large training datasets, different approaches that allow training models with less data have become important. We divide our description into two main categories, traditional and deep learning methods, and further into relevant subcategories.

A. Traditional Methods
Numerous CV techniques have been applied to a variety of image analysis tasks and applications. In this section, we provide a comprehensive overview of traditional (non-deep learning) CV methods that have been applied to different X-ray image analysis tasks.
1) Feature Extraction Techniques: Traditional ML methods typically cannot directly use the original high-dimensional raw data as their inputs. Thus, feature extraction methods are needed to transform the data into lower-dimensional features that conserve the information relevant for the analysis task at hand.
Scale Invariant Feature Transform (SIFT) [106] and Speeded Up Robust Features (SURF) [107] are used to extract features of local image patches and have been extensively used especially in object detection applications. In [108] and [69], SIFT and/or SURF were used to create Bag of Words (BoW) [109] representations of bag inspection images to classify them with SVMs. The well-known feature extraction methods Gabor features [110], Histogram of Oriented Gradients (HOG) [111], and Local Binary Pattern (LBP) [112] were used in [35] to extract features for casting defect detection with several traditional classifiers. LBP features were observed to give better results than Gabor or HOG features. Similarly, Gabor, LBP, HOG, SIFT, SURF, and other features, including features extracted from pretrained deep CNNs, were used in [41] for classifying small image patches (defect vs. no defect) of casting images, and the best performance was obtained with LBP features.
Many works also applied subspace learning methods, most commonly principal component analysis (PCA), for feature extraction. In [67], a supervised multi-label dimensionality reduction method, Multi-Output Proximity Embedding (MOPE) [113], was used as the embedding engine for feature extraction in threat classification from security images.
Adaptive Sparse Representation (XASR+) was introduced in [68] for object recognition in security screening. It is a learning-based representation, where several patches of each object in the training set are used to learn a representative sparse dictionary for the class. In the test phase, the unseen samples are classified using these dictionaries and sparse representation-based classification (SRC), which is introduced in Section IV-A2. The approach proposed in [68] led to promising results compared to other traditional feature extraction techniques.
It is also possible to use feature extraction as a preprocessing step before deep learning methods. In [100], Difference of Gaussians (DoG) and Mask Gradient Response-based Threshold Segmentation (MGRTS) were used for Region of Interest (RoI) extraction in aluminum ingot images for surface defect detection. The found RoIs were subsequently classified using a CNN.
2) Traditional Classifiers: Feedforward neural networks (FNNs), artificial neural networks (ANNs), fully-connected neural networks (FCNNs), and multi-layer perceptrons (MLPs) typically refer to the same approach: a set of neurons arranged in layers with a connection (weight) between every pair of neurons in subsequent layers. However, it should be noted that deep learning methods are also ANNs and many of them are also FNNs. The layers of traditional MLPs are commonly called fully-connected layers and are also used in deep learning. Traditional MLPs are multi-input and multi-output functions, but due to computational limitations, the input dimension cannot be very high if the network has only fully-connected layers.
During training, the network learns the connection weights so that the error between the predicted and ground-truth outputs is minimized. The error is quantified using a loss function, and the weights are updated by back-propagating the gradient of the loss so that the loss decreases.
Support Vector Machine (SVM) [114] is a classifier that was originally developed for binary classification and aims at finding a hyperplane that separates the features of samples in different classes by maximizing the margin between the classes. Non-linear decision boundaries can be obtained using SVM together with the kernel trick, e.g., by using a radial basis function (RBF) or polynomial kernel. SVMs can also be used for multiclass classification by formulating the problem as multiple one-vs.-one or one-vs.-all tasks.
In X-ray image processing, SVMs have been used in [41], [53], [57], [59], [67], [69], [84], [96], [108] as the classifier. In [41], both linear and RBF SVMs were used for classifying patches of cast images (defect vs. no defect). In [96], they were used to complement a CNN in solder ball defect inspection (defect vs. no defect). Multiple works on welding defect segmentation [53], [57], [59] first used some simple methods (e.g., thresholding) for segmenting the defect candidates and then a separate SVM to classify the candidates. Binary classification with different kernels was applied in [53], [59], whereas [57] applied multiclass SVMs using both one-vs.-one and one-vs.-all approaches. In [84], three SVM models with different kernels (linear, RBF, polynomial) were used to segment X-ray baggage security images into organic and inorganic material. In [69], [108], SVMs were used for binary classification of security images (gun vs. no gun). MOPE-SVM was used as the classifier in the threat classification systems in [67].
Random forest (RF) applies bootstrap aggregation to build multiple classification trees and then classifies the objects based on the majority vote of the trees. RF was compared against other traditional classifiers in [84] for X-ray baggage security segmentation, but it showed inferior performance. In [115], RF was compared against the k-nearest neighbors classifier in an imbalanced mineral phase segmentation task and the algorithms achieved similar performance.
K-nearest neighbors (k-NN) is a simple technique that assigns labels to test samples by calculating the distances from each test sample to all the training samples, finding the k nearest training samples, and then selecting the most frequent label (classification) or the average of the labels (regression). Because every prediction requires distances to all training samples, k-NN can become noticeably slow when the data size grows. It was applied in the same studies as RF [84], [115] with similar results.
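The k-NN decision rule described above can be illustrated with a minimal sketch. The two-dimensional feature vectors, the class names, and the function name `knn_predict` are hypothetical and are not taken from [84] or [115]; Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Distance from the query to every training sample (this is the costly part).
    dists = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    dists.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in dists[:k]]
    # The most frequent label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2D features for six training patches.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["no defect"] * 3 + ["defect"] * 3

print(knn_predict(train_x, train_y, (0.05, 0.10)))
print(knn_predict(train_x, train_y, (0.95, 1.00)))
```

The full scan over the training set in `knn_predict` is what makes the method slow on large datasets.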
Naive Bayes (NB) is a simple probabilistic classifier based on Bayes' theorem with a (naive) independence assumption among the features. It performed best among the traditional classifiers compared for X-ray baggage security segmentation in [84].
Logistic regression is used to evaluate class probabilities in binary classification tasks. It aims at finding optimal parameter values to fit a logistic function that models a binary target variable. The parameters are usually estimated using maximum likelihood estimation, which is equivalent to minimizing the cross-entropy loss. Logistic regression was applied in [103] to differentiate the features of aggregate and mortar pixels in concrete phase segmentation.
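As an illustration of fitting a logistic function by minimizing the cross-entropy loss, the following sketch uses plain gradient descent on a single scalar feature. The data, learning rate, number of epochs, and function names are assumptions chosen for illustration and do not reproduce the setup of [103].

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the mean cross-entropy."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean cross-entropy with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical normalized gray values for two pixel classes (0 and 1).
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)

print(sigmoid(w * 0.2 + b), sigmoid(w * 0.8 + b))
```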
Sparse representation-based classification (SRC) [116] computes a sparse representation for all training samples and decides the class of an unseen test sample by evaluating how well the sample can be constructed from the sparse representations of different classes. The assumption is that the sparse representations capture the central features of the images belonging to a certain class, and a better reconstruction means that the central features of a test sample follow the class characteristics. In [68], SRC was used together with the XASR+ representations introduced in Section IV-A1 for security screening.
3) Clustering Techniques: Clustering refers to a process of grouping the input samples so that similar items are assigned to the same cluster and dissimilar ones to different clusters. Segmentation can be seen as a clustering task, where regions corresponding to different objects should be assigned to different clusters, and the result can be evaluated via clustering metrics as described in Section V-B3. Clustering techniques are therefore also used in X-ray image segmentation.
The K-means clustering algorithm initially assigns each item randomly to one of K clusters. Then the algorithm proceeds iteratively by computing the centroid of each cluster and reassigning the items to the cluster of the closest centroid, repeating these steps until the algorithm converges. K-means was assessed in [115] for 3D mineral phase segmentation based on voxels' gray-scale values. Fuzzy C-means clustering (FCM) is a fuzzy version of K-means. While K-means assigns each item to one cluster, in FCM the items can belong to multiple clusters in a fuzzy manner. FCM was also used for 3D mineral phase segmentation in [115].
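The K-means iteration described above can be sketched as follows for scalar gray-scale values. This is a minimal illustration rather than the implementation used in [115]; the sample values, the function name `kmeans_1d`, and the rule for re-seeding an empty cluster are assumptions.

```python
import random


def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster scalar values with K-means, starting from a random assignment."""
    rng = random.Random(seed)
    # Initial step: assign every item randomly to one of the k clusters.
    labels = [rng.randrange(k) for _ in values]
    centroids = []
    for _ in range(max_iter):
        # Step 1: compute the centroid (mean) of each cluster.
        centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random item.
            centroids.append(sum(members) / len(members) if members else rng.choice(values))
        # Step 2: reassign each item to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        if new_labels == labels:  # Converged: assignments no longer change.
            break
        labels = new_labels
    return labels, centroids


# Hypothetical gray values from two material phases.
gray_values = [10, 12, 11, 200, 205, 198]
labels, centroids = kmeans_1d(gray_values, k=2)
print(labels, centroids)
```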
Simple linear iterative clustering (SLIC) [117] is another variant of K-means, where the distance measure combines feature similarity and spatial distance of the pixels. Also, the number of distance evaluations is limited to an area proportional to the superpixel size. SLIC was applied for subcomponent level segmentation in anomaly detection within X-ray security imagery in [65].

B. Deep Learning Methods
Deep learning has dominated many CV tasks by adopting deeper and more complicated neural architectures that make the networks capable of modeling more complex patterns and relations. X-ray image processing is not an exception and researchers have exploited the benefits of DL models, too. In this section, we present a comprehensive overview of DL methods applied to X-ray image analysis. We review the adopted deep architectures categorized according to the considered CV tasks, i.e., classification, detection, and segmentation, and we also cover different loss functions, data augmentation strategies, and other approaches for improving the performance of deep learning models.
A type of deep learning model that has been commonly used for image analysis is convolutional neural networks (CNNs) [118], which have led to high performance [49], [96]. CNN architectures consist of several layers with different properties. The most common types of layers are the convolutional layer, the pooling layer, and the fully-connected layer. Convolutional layers learn to extract useful features from the input images and each layer transforms the input data into a more abstract representation. Pooling layers are used to compress the feature maps and fully-connected layers make the final prediction based on the extracted features. The last layer provides the network output and the output format depends on the CV task at hand.
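To make the compression performed by pooling layers concrete, the sketch below applies max pooling over non-overlapping 2x2 windows, which halves each spatial dimension of a feature map. The window size, the stride, and the example values are assumptions; CNN architectures may use other pooling variants.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by taking the maximum of each 2x2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 feature map; the result is a 2x2 map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```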
1) Deep Classification and Backbone Architectures: In CNN architectures for classification, the output format usually follows one-hot encoding [118]: the output vector has one element for each class, and the value of each element is a form of predicted probability that the input image belongs to the corresponding class. The image is assigned to the class with the highest probability. This type of layer is typically used together with the categorical cross entropy loss function (see Section IV-B4) and the softmax activation function defined as
$$\hat{y}[i] = \frac{\exp(y[i])}{\sum_{j=1}^{C} \exp(y[j])},$$
where $y[i]$ and $\hat{y}[i]$ denote the $i$th element in the output vector before and after applying the softmax function, respectively, and $C$ is the number of output classes. Softmax activation confines the output element values between zero and one and makes the sum of the output elements equal to one. The exponential function highlights the probability of the most probable class, making the predictions clearer.
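A minimal sketch of the softmax activation is given below. Subtracting the maximum input before exponentiation is a common numerical-stability measure that does not change the result; it is an implementation choice of this sketch, and the input values are hypothetical.

```python
import math


def softmax(y):
    """Map raw network outputs to values in (0, 1) that sum to one."""
    shift = max(y)  # Subtracting the maximum avoids overflow in exp().
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for C = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```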
While the CNN architectures presented below were originally proposed for classification tasks, they can be used as backbone networks in other problems, such as object detection [36], or within Generative Adversarial Networks (GANs) used for data augmentation [44] simply by removing the output layer designed for classification and adding other types of layers.
Studies applying simple CNN architectures for X-ray image classification include [55], which used a simple CNN model for weld defect classification, [105], which used another simple CNN model for binary classification of automotive components into defective and non-defective classes, and [2], which used the Evenly Distributed CNN (ED-CNN) illustrated in Fig. 7-a for casting defect classification.
A simple and well-known classification backbone architecture is VGG [119], which contains stacked convolutional and max pooling layers. The VGG-16 architecture shown in Fig. 7-b was used in [97] for solder joint classification and in [40] along with spatial attention and bilinear pooling for casting defect classification. VGG-16 was used as a feature map extractor (backbone network) in an anomaly detector network for casting defect localization in [44] and in an object detection network for X-ray baggage security assessment in [64]. In [54], a different VGG variant was used for weld defect classification. Different simple CNN models including VGG-19, VGG-F, and VGG-2048 were compared in [41] for casting defect classification from X-ray data.
One of the most common and well-known classification architectures is the residual neural network (ResNet). ResNet is an extension of CNN models that was proposed to prevent the problems caused by very deep networks, in particular the vanishing gradient problem. To ease optimization and overcome the network degradation problem, ResNets include residual blocks that have skip connections over some layers (Fig. 7-c, left). They also apply batch normalization after each convolution layer. These factors make it easier to pass information through the networks, which allows a larger number of layers and a smaller error rate on both training and test sets than the earlier CNN models. Due to its benefits, different variants of ResNet with various depths, such as ResNet-18 [85], ResNet-34 [95], ResNet-50 [66], and ResNet-101 [44], have been used in X-ray data assessment studies, e.g., in casting defect recognition [44], [46] and detection of internal defects in the Aluminum Conductor Composite Core (ACCC) [90].
In order to improve the multi-scale performance of ResNet, a newer model known as Res2Net was introduced in [120]. In Res2Net, the residual blocks are replaced by hierarchical residual-like connections within one single residual block (Fig. 7-c, right). Res2Net was used as the generator in a GAN in [61] for data augmentation (see Section IV-B5). A modification of ResNet-50, DetNet, was proposed in [121] as a backbone network optimized for object detection, alleviating the loss of location information in feature maps caused by down-sampling operations. The DetNet-59 architecture shown in Fig. 7-d was adopted in [36] as a backbone for casting defect detection.
While convolutions focus on local relations in the data, different approaches for augmenting convolutions to capture long-range dependencies have been proposed. Self-attention [122] is an attention mechanism that can relate different positions of the data in order to compute a feature representation. A self-attention guided CNN was used in [38] to detect small casting defects. The overall structure of the employed model is shown in Fig. 7-e.
Considering the special characteristics of weld defects, the usual pooling strategies have poor dynamic adaptability. Therefore, an improved pooling strategy was proposed in [123]. In the proposed approach, different pooling methods were used depending on whether the pooling domain is outside the defective area or on the defect's edge.
Autoencoders are a special type of neural network that can be used for unsupervised feature extraction. They are composed of an encoder that turns input images into feature representations and a decoder that tries to reconstruct the input from the feature representation as the network's output. While the task would be trivial if the feature representation had the same dimensionality as the input (and output), the feature representation in autoencoders usually has a much lower dimensionality, which forces the network to learn representations that contain the most useful information for the reconstruction process. As the training requires only the images with no need for class labels, training can be performed in a fully unsupervised manner. The trained encoder can then be used as a feature extractor for other tasks, such as classification. This approach can be useful when there is a large unlabeled dataset available, but only a limited number of labeled training samples. For instance, a non-convolutional sparse autoencoder (SAE) shown in Fig. 7-f was utilized in [58] as an intrinsic feature extractor for welding defect detection. In [43], an unsupervised inspection system was built on top of a convolutional autoencoder (CAE) (Fig. 7-g) to inspect casting X-ray images with no labeling. In [124], a CAE was trained using abundant normal images of manufacturing production lines. The encoder was then combined with fully-connected layers for classification that were trained using a lower number of labeled samples of both normal and defective engines.
2) Deep Object Detection Architectures: Object detection networks aim at finding the locations of objects in addition to recognizing them. The number of objects can significantly vary in different images, which means that the output layers for object detection architectures cannot use a fixed fully-connected structure as is commonly done in classification.
The key design question in object detection architectures is how to locate the possible objects for deeper analysis. One approach would be to predefine all possible bounding box locations and sizes and exhaustively analyze whether they contain objects of interest. However, this approach would have an enormous number of bounding boxes to analyze and would be computationally too expensive. Therefore, most object detection architectures propose approaches for finding only the most promising subset of all the possible bounding boxes for further analysis.
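A short calculation shows why the exhaustive approach is impractical. Assuming axis-aligned boxes whose sides lie on pixel boundaries, a box is determined by choosing a column interval and a row interval, so an image of W x H pixels contains W(W+1)/2 x H(H+1)/2 distinct boxes. The image size below is a hypothetical example.

```python
def count_boxes(width, height):
    """Number of axis-aligned, pixel-aligned bounding boxes in a width x height image."""
    # A box spans a contiguous range of columns and a contiguous range of rows.
    column_intervals = width * (width + 1) // 2
    row_intervals = height * (height + 1) // 2
    return column_intervals * row_intervals


# Even a modest 640 x 480 image yields on the order of 10**10 candidate boxes.
print(count_boxes(640, 480))
```

For comparison, the selective search step of R-CNN keeps only about 2,000 candidate regions per image.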
Region-based CNN (R-CNN) [125] is one of the architectures commonly used for object detection. For a given image, R-CNN applies a selective search mechanism to extract approximately 2,000 RoIs. Afterward, each RoI is fed to a CNN to obtain the output features, and then a collection of SVM classifiers is used to recognize the type of object in the RoI (if there is any). Fast R-CNN [126] improves the efficiency of R-CNN by feeding the input image to the CNN only once instead of feeding every RoI separately; the features for the RoIs are then extracted from the overall feature map. An upgraded version of Fast R-CNN, called Faster R-CNN (Fig. 8-a), uses a separate network to predict RoIs instead of using the slow selective search. Faster R-CNN was applied in [98] to detect defects in tires and in [64], [78], [83], [127] to detect prohibited items in baggage. In [82], an additional branch called Part-based Detection Network (PDN) was added to Faster R-CNN (Fig. 8-b) to improve detection of occluded items in threat detection on X-ray security images.
You Only Look Once (YOLO) [128] is another commonly used object detection architecture. The main difference between YOLO and the region-based approaches is that YOLO uses a single CNN to predict both bounding boxes and class probabilities. Therefore, it can be trained in an end-to-end manner and it is much faster than the region-based approaches. There are multiple versions of YOLO architectures and many of them have also been employed in X-ray image processing. YOLOv2 [129] improves the original YOLO in multiple ways, such as adding batch normalization, removing fully-connected layers, and using anchor boxes. As a result, YOLOv2 manages to improve YOLO's recall and localization, while maintaining its classification accuracy. YOLOv2 was used in [88] with the architecture shown in Fig. 8-c for detection and classification of diffraction patterns in single-particle imaging. YOLOv3 [130] improves the accuracy of earlier versions by adding objectness scores to bounding box prediction, adding connections to the backbone network layers, and making predictions at three separate levels of granularity to improve performance on smaller objects. YOLOv3 was employed in [88] as a diffraction pattern detector on X-ray images with the structure illustrated in Fig. 8-d and in [76], [131] to detect dangerous objects in baggage security applications. It was also used in [132] to detect defects in casting products. To achieve better detection speed and accuracy, YOLOv4 was introduced in [133] with improvements in network structure, training method, loss function, and data enhancement in comparison to YOLOv3. YOLOv4 was used in [134] as a detector in an X-ray security inspection task. The fifth version of YOLO, known as YOLOv5, was used in [135] to detect casting defects.
Single Shot Multibox Detector (SSD) [136] is a one-stage object detection network that eliminates the proposal generation phase by discretizing the bounding box prediction space into a set of default boxes. It then calculates scores representing the presence of each object class in each box and finally adjusts the boxes to better match the object shapes. During inference, the predictions obtained from multiple feature maps with different resolutions are combined to capture various object sizes. This method was combined with VGG-16 and ResNet-101 in [4] as an object detector for casting assessment.
Another common one-stage object detection architecture, RetinaNet [137], uses focal loss and a feature pyramid network (Fig. 8-e). It achieves good performance with dense and small-scale objects, as the focal loss better addresses the problems caused by a major class imbalance between background and foreground classes. In X-ray image analysis, RetinaNet was used in [48], [73], [78] to detect defects in welding, firearms in baggage security imagery, and anomalies in cluttered security imagery, respectively.
In [5], a Cascaded Structure Tensor (CST) framework for detection and classification of heavily occluded baggage items from X-ray scans was proposed. The framework uses a non-convolutional CST approach for object proposal extraction and a CNN only for subsequent object recognition.
Path Aggregation Network (PANet) [138] can be used for both detection and segmentation tasks. It improves region-based networks by including bottom-up path augmentation to shorten the information path between lower layers and the topmost features, adaptive feature pooling to connect feature grids with all feature levels, and fully-connected fusion to enhance mask prediction. It was combined with ResNet-50 in [36] (shown in Fig. 8-f) to detect defects in casting products.
Feature Pyramid Networks (FPNs) [139] were introduced for detecting objects at different scales. They use the inherent pyramidal and multi-scale hierarchy of deep CNNs to build feature pyramids with marginal extra cost. An FPN with the structure shown in Fig. 8-g was used for detection of automobile casting aluminum parts in [39]. In [66], it was combined with a ResNet-50 and used for prohibited item detection in X-ray scanning images.
3) Deep Image Segmentation Architectures: In image segmentation, the network output is an image with the same dimensions as the input image, and the task is to predict the class of each pixel. Segmentation architectures typically have a structure similar to autoencoders, where an encoder learns to extract a descriptive lower-dimensional representation of the input and a decoder up-samples the feature map size back to the input size and produces the class predictions. The networks are often fully-convolutional networks, i.e., they do not contain any fully-connected layers.
U-Net [140] (Fig. 9-a) is one of the widely used models for image segmentation. It has an encoder-decoder structure with additional connections between them. The method has been used on both 2D and 3D (an extension upon the standard U-Net) X-ray data in [33], [86], [91]. In [33], defects in 3D AM X-ray images were segmented using a modified U-Net. U-Net was used to segment continuous carbon fiber reinforced composites in [91]. 2D and 3D U-Nets were applied in [86] to predict grain boundaries in Al-Cu alloy materials. A U-Net structure with additional skip connections was used in [47] for segmenting the locations of welding defects. A U-Net-like structure with residual connections was used in [34] for porosity segmentation in XCT scans of additively manufactured metal specimens.
Mask R-CNN [141] is an extension of Faster R-CNN (see Section IV-B2) with an additional output for predicting segmentation masks for each RoI. As this approach segments each instance of an object class independently, Mask R-CNN is an architecture for instance segmentation. In [78], Mask R-CNN (Fig. 9-b) was used to segment anomalies in cluttered security imagery. Casting defects and firearms in baggage were segmented using Mask R-CNN in [4], [73].
CH-Net [3] is a semantic segmentation model based on adversarial autoencoders (AAEs) [142] (see Fig. 9-c). It was proposed as a fast and memory-efficient method for baggage security image segmentation in [3]. DeepLabv3+, an encoder-decoder network with atrous separable convolutions in the encoder, was proposed in [143]. DeepLabv3+ was applied with ResNet-18 as a backbone for semantic segmentation of µ-CT images for creating digital material twins of fibrous reinforcements in [85].
4) Loss Functions: As quantifiers of the difference between a model's predicted and expected outcomes, loss functions are essential in training deep neural networks. During training, the models are guided towards minimizing this difference and, if the loss function does not represent well the problem at hand, the results will be suboptimal.
In regression tasks, where the goal is to learn to predict specific values as the network's output, mean squared error (MSE) is a common loss function. MSE loss is defined as
$$\mathcal{L}_{\mathrm{MSE}}(\mathbf{y}_s, \hat{\mathbf{y}}_s) = \frac{1}{N} \sum_{i=1}^{N} \left( y_s[i] - \hat{y}_s[i] \right)^2,$$
where $\mathbf{y}_s$ is a vector containing the target values for all the network's $N$ outputs for a specific sample $s$, $\hat{\mathbf{y}}_s$ is the predicted output vector, and $y_s[i]$ is the $i$th element of the target output. The final loss is the average loss over the training samples. While MSE can also be used in classification tasks, cross entropy loss, also called log loss, is preferred in classification tasks in general, and most of the works on X-ray image classification also use cross entropy loss. There are two commonly used versions of the cross entropy loss. Binary cross entropy loss is used in binary classification tasks as well as in multi-class multi-label classification tasks, where each sample may belong to multiple classes:
$$\mathcal{L}_{\mathrm{BCE}}(\mathbf{y}_s, \hat{\mathbf{y}}_s) = -\frac{1}{N} \sum_{i=1}^{N} \Big( y_s[i] \log \hat{y}_s[i] + \left( 1 - y_s[i] \right) \log \left( 1 - \hat{y}_s[i] \right) \Big).$$
Categorical cross entropy loss is used in multi-class single-label classification tasks:
$$\mathcal{L}_{\mathrm{CCE}}(\mathbf{y}_s, \hat{\mathbf{y}}_s) = -\sum_{i=1}^{N} y_s[i] \log \hat{y}_s[i].$$
Compared to MSE, cross entropy losses penalize output values that lead to wrong classification more. Similar to MSE loss, the binary cross entropy loss gives an equal weight to all the output elements, whereas the categorical cross entropy loss focuses on positive samples, i.e., learning which samples should be classified to a specific class instead of trying to learn which samples should not be classified to the class [118].
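The three losses can be compared on a single sample with the following sketch. It uses the standard textbook definitions; averaging the MSE and binary cross entropy over the N outputs and clipping the predictions with a small `eps` to avoid log(0) are assumptions of this sketch, and the target and prediction vectors are hypothetical.

```python
import math


def mse(y, y_hat):
    """Mean squared error over the N outputs of one sample."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy: every output element contributes equally."""
    total = 0.0
    for t, p in zip(y, y_hat):
        p = min(max(p, eps), 1.0 - eps)  # Clip to avoid log(0).
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)


def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Categorical cross entropy: only the element of the true class contributes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))


target = [0, 1, 0]            # One-hot target: the sample belongs to class 1.
good_pred = [0.1, 0.8, 0.1]   # Confident and correct prediction.
bad_pred = [0.7, 0.2, 0.1]    # Confident but wrong prediction.

for loss in (mse, binary_cross_entropy, categorical_cross_entropy):
    print(loss.__name__, loss(target, good_pred), loss(target, bad_pred))
```

In this example, the relative increase from the good to the bad prediction is larger for the cross entropy losses than for MSE, which reflects the stronger penalty on wrong classifications.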
In object detection tasks, the models need to predict the bounding box locations along with the corresponding class. Thus, they usually minimize both a regression loss (MSE) to learn the bounding box locations and a classification loss (cross entropy) to learn the classes. As semantic segmentation can be seen as pixel-wise classification, pixel-wise cross entropy loss is also commonly used in segmentation tasks. Below we briefly introduce some less common loss function choices that have been considered in X-ray image analysis tasks.
Triplet loss [144] is a loss function that tries to reduce the distance between data in the same class and increase the distance between data belonging to different classes. In [96], the triplet loss with cosine similarity was compared against the cross-entropy loss for soldering defect inspection and led to a higher accuracy.
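A minimal sketch of a triplet loss with a cosine-based distance is given below. The hinge form max(0, d(a, p) - d(a, n) + margin), the distance d = 1 - cosine similarity, the margin value, and the example embeddings are assumptions for illustration; the exact formulation used in [96] is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss using the distance d = 1 - cosine similarity."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)  # Same class: should be small.
    d_neg = 1.0 - cosine_similarity(anchor, negative)  # Other class: should be large.
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2D embeddings.
anchor, same_class, other_class = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(triplet_loss(anchor, same_class, other_class))  # Well separated: zero loss.
print(triplet_loss(anchor, other_class, same_class))  # Roles swapped: positive loss.
```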
Mutual-channel loss function was introduced in [145] and it consists of a discriminant component and a diversity component. This results in a set of feature channels, each of which reflects different locally discriminative regions for a particular class. This loss function was used for casting defect detection in [37] to focus on different discriminative regions without part annotations or bounding boxes of the defects.
Focal loss was designed to tackle multi-class object detection scenarios with a high imbalance between foreground and background classes [146]. It down-weights the loss of easy, well-classified examples, which gives a relatively higher weight to hard misclassified examples. It was used in [100] for an imbalanced dataset of milled aluminum ingot defects.
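The weighting behavior of focal loss can be illustrated with its commonly used binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The parameter values gamma = 2 and alpha = 0.25 and the example probabilities are assumptions of this sketch, not settings reported in [100].

```python
import math


def focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for a predicted foreground probability p and target in {0, 1}."""
    p_t = p if target == 1 else 1.0 - p          # Probability given to the true class.
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)                # Clip to avoid log(0).
    # The factor (1 - p_t)**gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # Foreground example classified correctly with confidence.
hard = focal_loss(0.10, 1)  # Foreground example that is badly misclassified.
print(easy, hard, hard / easy)
```

With gamma = 0, the modulating factor disappears and the function reduces to an alpha-weighted cross entropy.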
Although focal loss makes one-stage detectors focus more on hard samples for improving performance, the presence of a considerable number of hard outlier samples can cause a reduction in accuracy [134]. Gradient Harmonization Mechanism (GHM) loss was introduced in [147] to tackle this problem and it was used in [134] on a YOLOv4 model in a security threat detection application.
5) Data Augmentation: Data augmentation refers to the process of increasing the number of training samples by creating slightly modified versions of the available real data or by creating synthetic data. It can help to reduce overfitting. Common augmentation tricks, such as random rotation, cropping, or flipping, are often used for training deep learning models. Some more specific augmentation techniques that were used in X-ray-related studies are presented here.
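The common augmentation operations mentioned above can be sketched on a small image stored as a nested list. The function names and the 2x2 example image are hypothetical; in practice, such operations are usually applied with random parameters through an image processing library.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image):
    """Return the original image together with three modified copies."""
    return [image, flip_horizontal(image), flip_vertical(image), rotate_90_clockwise(image)]


image = [[1, 2],
         [3, 4]]
for variant in augment(image):
    print(variant)
```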
A common way to tackle the lack of a large number of images from real-world environments is learning-based image synthesis. However, these methods usually combine background and foreground images randomly, which limits the usefulness of the generated data. In X-ray security applications, a learning-based image synthesis method was proposed in [148]. In this method, a detector is first trained to estimate the positions at which each foreground object is difficult to detect. Then, a so-called difficulty map is created and the objects are synthesized at hard-to-detect locations using the difficulty map.
Attention-Guided Data Augmentation (AGDA), proposed in [37], creates new training samples from existing ones by suppressing the most discriminative parts found using attention maps. The method was applied for casting defect detection in [37].
Generative Adversarial Networks (GANs) [149] are a class of DL models that learn to generate new data samples with the same statistics as those in the training set. GANs consist of two competing networks: a generator that generates fake images and a discriminator that tries to distinguish between real and fake images, as shown in Fig. 10. To be able to fool the discriminator, the generator needs to learn to create realistic images. GANs were used in [150] to generate simulated data of defective aluminum casting and improve the balance of the dataset. A GAN-based method was used in [61] to synthesize X-ray security images. Several modifications of the original GANs have also been used for data augmentation in X-ray applications. The authors of [49] noted that generating data using prior human knowledge is not applicable for some specific types of welding defects (e.g., burn through and crack of weld) due to their complexity. Therefore, they used Wasserstein Generative Adversarial Networks (WGANs) [151] for the task. In WGANs, the discriminator gives a fakeness score for the generated samples instead of just classifying them into real and fake classes. Self-Attention Generative Adversarial Network (SAGAN) was used in [62] to first generate several images of prohibited items. Subsequently, the images are transformed to X-ray format using a CycleGAN and combined with different backgrounds. Also, Deep Convolutional GAN (DCGAN) and Spatial-and-Channel Attention Block and X-ray Wasserstein GAN Gradient Penalty (SCAB-XWGAN-GP) were used in [152], [153] for data augmentation in X-ray security applications.

V. EVALUATION SETUP AND METRICS
In order to evaluate the performance of new methods and to have a valid comparison among different algorithms, it is important to know the different evaluation protocols that have been used in previous studies. Commonly, CV and ML techniques are evaluated by dividing the datasets into non-overlapping training and test sets in order to reliably estimate the performance on unseen data. If the model to be trained has hyperparameters that need to be determined by the user, they are typically set using a third separate part of the dataset called the validation set. When no predefined validation set is given by the experimental protocol defined by the dataset, two approaches can be used. In the first one, the training set is divided into two non-overlapping sets, one used for training and the second used for validation. The second approach divides the training set into k non-overlapping subsets (sometimes called folds) and performs training and validation k times. Each time, data in a different fold is used for validation, while the data in the remaining k − 1 folds are used for training the models obtained by using different hyperparameter values. The average performance on all folds is calculated and the best hyperparameter values are those used in the model leading to the highest average validation performance. Then, the final model can be trained on the full training set using the best hyperparameter values. This procedure is commonly known as k-fold cross-validation. Especially for smaller datasets, an approach similar to k-fold cross-validation may be used for testing as well to get a more reliable performance estimate. In this case, the fold set aside should not be used for adjusting the model's hyperparameters; instead, the goal is to evaluate the method on a wider variety of test samples not seen during the training process. It can also be necessary to repeat the overall experiment multiple times and use average values if variations in the methods' performance are expected.
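The k-fold hyperparameter selection described above can be sketched as follows. This is an illustrative outline only: `k_fold_indices`, `select_hyperparameter`, and the user-supplied callback `train_and_score` are hypothetical names, and the callback is assumed to train a model on the given training indices and return a validation score where higher is better.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def select_hyperparameter(n_samples, k, candidates, train_and_score):
    """Return the candidate value with the highest mean validation score.

    train_and_score(value, train_idx, val_idx) is a hypothetical callback that
    trains with the given hyperparameter value and returns a validation score.
    """
    folds = k_fold_indices(n_samples, k)
    best_value, best_score = None, float("-inf")
    for value in candidates:
        scores = []
        for i in range(k):
            val_idx = folds[i]  # fold i is held out for validation
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(train_and_score(value, train_idx, val_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value, best_score
```

After the selection, the final model would be trained once on the full training set with the returned value, and the test set would be used only for the final performance estimate.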
In addition to the training setup, an important aspect of the evaluation is the selection of the evaluation metrics. Accepted and standard evaluation metrics must be employed to ensure fair comparisons. For meaningful evaluations, it is also important to understand which evaluation metrics are suitable for the task at hand. Below we introduce the most commonly used evaluation metrics for classification, detection, segmentation, and speed comparisons.

A. Classification Metrics
Many binary classification metrics rely on the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) that are defined as shown in the confusion matrix in Fig. 11. In multi-class classification, similar confusion matrices can be computed either considering all classes or independently for each class, and generalized evaluation metrics can be defined using these class-specific numbers.
1) Accuracy: Accuracy is perhaps the most commonly used classification metric and it is defined simply as the ratio of correct classifications to the total number of samples. In binary classification, this can be defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
While accuracy is an intuitive measure, it is not a good evaluation metric when the class distribution is not balanced. For example, if there are dangerous items in every 1000th bag, a classification method can achieve a very high accuracy of 0.999 by simply classifying everything as safe. However, in such applications missing a dangerous item is obviously much more critical than incorrectly labeling a safe bag as dangerous. The latter situation might only lead to an additional manual inspection, while the former may cause significant danger. Therefore, other evaluation metrics should be used in such imbalanced cases. Several works on X-ray image classification have also complemented accuracy with other classification metrics. The combination of accuracy, recall, precision, and F1 score (with possibly other metrics) has been used, e.g., in X-ray-based casting defect classification [37], [38], welding defect recognition [58], and concrete phase segmentation [103].
2) Recall: Recall, also called sensitivity or true positive rate (TPR), is a commonly used metric for binary classification that quantifies the number of correctly identified positive samples out of all positive samples in the data. It is defined as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Recall can be a good evaluation metric when it is important to correctly recognize samples of a particular class, while it is less critical to incorrectly label objects from the other class, as in the example of finding bags containing dangerous items. However, a perfect recall can always be obtained simply by labeling all the samples as positive. Therefore, recall alone is not a sufficient evaluation metric.
3) Precision: Precision is another commonly used metric for binary classification that quantifies the ratio of correct positive predictions to the total number of positive predictions. It is defined as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
A good precision can generally be achieved by labeling only a few, very certain cases as positive and, therefore, it is not a good measure by itself either. However, recall and precision complement each other and other metrics combining the two have been suggested.
4) F1 Score: F1 score is a widely used metric for binary classification that takes both recall and precision into account and can be seen as their harmonic mean. F1 is defined as:
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$
F1 score is typically considered a good single measure for binary classification and more suitable for unbalanced data than accuracy. However, it is not as intuitive to understand what a certain score means in practice.
5) Specificity: Specificity or true negative rate (TNR) is another metric for binary classification that can be used to complement recall/sensitivity. While recall focuses on the positive items, specificity focuses on the negative items. If the class assignment to positive and negative classes is reversed, specificity becomes equal to recall before the reversal. Perfect specificity can be obtained by classifying all the samples as negative. It is defined as:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Specificity was used to evaluate binary pixel-wise welding defect classification in [47].
6) Average Performance: Average performance is another performance score that considers both precision and recall. The score averages the performance of the method over all available classes; its definition is given in [24], where it was used for evaluating dangerous item classification.
7) MCC: Matthews Correlation Coefficient (MCC) can also represent the overall classification performance as a single value and it is considered to be a reliable metric also when the class distribution is very imbalanced. This metric is defined as follows:
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
MCC was used to measure welding defect classification performance in [51].
8) ROC: Receiver Operating Characteristic (ROC) curves and Precision-Recall curves can be used to evaluate binary classification methods when the result depends on a threshold. As explained above, perfect recall/sensitivity can be obtained by classifying all the samples as positive, while classifying all the samples as negative leads to perfect specificity or precision. When the threshold is varied so that the number of positive assignments grows from zero, a good classifier assigns the true positive items as positive before falsely assigning any negative item as positive. A method's ability to do so can be evaluated using ROC curves that plot true positive rates vs. false positive rates (= 1 − true negative rate) during the process, as shown in Fig. 12-a, or precision-recall curves that plot precision vs. recall pairs (Fig. 12-b), as the name suggests. In general, ROC curves are more suitable for balanced class distributions, while precision-recall curves are recommended for imbalanced cases [154]. ROC curves were used to evaluate firearm recognition in [69], while both ROC and Precision-Recall curves were used in [5] to compare algorithms for identifying normal and suspicious items.
9) AUC: Area Under Curve (AUC) allows representing ROC (or Precision-Recall) curves numerically by computing the relative area below the curve. A higher area corresponds to better performance. AUC score for ROC curves was used in [5] for the detection of different suspicious items in a security application.
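The threshold-free metrics above all follow from the four confusion-matrix counts. The sketch below, with the hypothetical helper `binary_metrics`, computes them and replays the example of one dangerous bag in 1000 with a classifier that labels every bag as safe. Reporting undefined ratios (zero denominator) as NaN is a design choice of this sketch, not a convention prescribed by the reviewed studies.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return common binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# One dangerous bag in 1000, and a classifier that labels every bag as safe.
all_safe = binary_metrics(tp=0, tn=999, fp=0, fn=1)
```

In this example the accuracy is 0.999 although the recall is 0, while precision and MCC are undefined because no sample is predicted positive, which illustrates why accuracy alone is misleading for imbalanced data.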
10) mAP - classification: Mean Average Precision (mAP) can be used as a classification metric for methods that rank the test items based on their estimated probability to contain certain objects. Each class-specific ranked list is used to compute precision and recall values for each rank. Here, recall is defined as the proportion of all positive examples that are ranked above a given rank, and precision is the proportion of all samples above that rank that are from the positive class [155]. This creates a precision-recall curve. Average Precision (AP) for the class is defined as the average of precision values at eleven equally spaced recall levels [0, 0.1, ..., 1]. Finally, mAP is the average over classes. As described in Section VI-C, mAP was adopted with the SIXray dataset [8] for the image-level classification task. In [108], AP is used as an evaluation metric in a binary handgun recognition task.
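A minimal sketch of the eleven-point AP computation for one class is given below. It assumes the interpolation used in the Pascal VOC protocol [155], where the precision at a recall level is taken as the maximum precision observed at any recall greater than or equal to that level; the function names are illustrative.

```python
def average_precision_11pt(scores, labels):
    """Eleven-point interpolated AP for one class.

    scores: predicted confidence per test item; labels: 1 for positive, 0 otherwise.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return float("nan")  # AP is undefined without positive examples
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        precisions.append(tp / rank)   # positives among the top-`rank` items
        recalls.append(tp / n_pos)     # share of all positives retrieved so far
    total = 0.0
    for level in range(11):            # recall levels 0, 0.1, ..., 1
        r = level / 10
        reachable = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(reachable) if reachable else 0.0
    return total / 11

def mean_average_precision(ap_per_class):
    """mAP is the plain average of the class-specific AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

A ranking that places all positive items first yields an AP of 1, and any positive item ranked below a negative one lowers the precision at the corresponding recall levels.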

B. Segmentation Metrics
Segmentation can be seen as pixel-wise classification and, therefore, classification metrics can also be used as segmentation metrics. This has been common also for X-ray segmentation tasks. For example, sensitivity, specificity, accuracy, Precision-Recall curves, and AUC on Precision-Recall curves were used to evaluate segmentation performance in welding defect localization in [47]. In [34], Precision, Recall, and MCC were used to evaluate AM porosity segmentation performance. In [115], TPR, false positive rate (FPR), and AUC for ROC were used to evaluate the performance in an imbalanced mineral phase segmentation task. Accuracy, Recall, Precision, F1 Score, ROC curve, and AUC on ROC were used to evaluate concrete segmentation in [103].
1) Dice Coefficient: Dice Coefficient can be used to compare the ground-truth segmentation mask A with the predicted segmentation mask B as
$$\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}$$
where ∩ denotes the intersection (common pixels) of the two masks and |A| denotes the number of pixels in A. In classification terms, the intersection corresponds to TP, whereas |A| = TP + FN and |B| = TP + FP. Thus, Dice coefficient is equivalent to F1 score. Dice coefficient was used to evaluate welding defect segmentation algorithms in [47] and fiber segmentation in [156] and [157].
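2) Intersection over Union (IoU) - segmentation: IoU can be used to evaluate segmentation methods by comparing the ground-truth segmentation mask A with the predicted mask B as
$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$$
Comparing the above equation with Dice Coefficient (F1 score) shows that they are similar. In fact,
$$\text{IoU} = \frac{\text{Dice}}{2 - \text{Dice}}$$
so each metric is a monotonic function of the other. Therefore, it is not meaningful to use both IoU and Dice Coefficient at the same time.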
IoU was used in [158] to evaluate additive manufacturing defect segmentation. In [33], the same task was performed as 3D segmentation, and also here, IoU was used as the performance metric. In this case, A and B consist of voxels. This metric was used in [159] with the name Jaccard Index to evaluate the segmentation performance in welding defects inspection.
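The two overlap measures and their relationship can be checked with a few lines of code. In this sketch, masks are represented as sets of pixel (or voxel) coordinates, which is an illustrative choice; the function names `dice` and `iou` are likewise hypothetical.

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two masks given as sets of pixel coordinates."""
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def iou(mask_a, mask_b):
    """Intersection over Union (Jaccard index) between two masks."""
    return len(mask_a & mask_b) / len(mask_a | mask_b)

# Two 2x2 squares that share one column of two pixels.
ground_truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
prediction = {(1, 0), (1, 1), (2, 0), (2, 1)}

d = dice(ground_truth, prediction)
j = iou(ground_truth, prediction)
# The identity IoU = Dice / (2 - Dice) holds up to floating-point rounding.
gap = abs(j - d / (2 - d))
```

Because the mapping between the two scores is monotonic, ranking methods by Dice or by IoU on the same mask pair gives the same order.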
3) Clustering Metrics - segmentation: Segmentation can also be seen as a clustering task, where regions corresponding to different objects should be assigned to different clusters. Therefore, different clustering metrics can be used for evaluating the performance of segmentation methods. In [84], the Davies-Bouldin index [160], Calinski-Harabasz index [161], Dunn index [162], and Hartigan index [163] were used to compare X-ray baggage image segmentation methods. In [26], Adjusted Rand Index (ARI) was used to evaluate a 3D fiber instance segmentation method by considering voxels as items to be clustered into uniform instances.

C. Detection Metrics
Object detection methods typically provide as output the bounding boxes for the detected objects, their predicted classes, and the corresponding confidence values for the predicted classes. Each image can contain multiple objects and each object must be first located and then recognized. Therefore, comparing object detection algorithms is more complicated than comparing classification methods. A commonly adopted approach is to report mAP at selected IoU thresholds as described below.
1) Intersection over Union (IoU) - detection: IoU defined in Section V-B2 is broadly used also for the evaluation of object detection methods, but not as an independent evaluation metric. A threshold on the IoU between a detected bounding box and a ground-truth bounding box is used to decide whether the detected bounding box is considered to match the ground-truth bounding box. These matches between detected and ground-truth bounding boxes are then used in mAP computation.
2) Mean Average Precision (mAP) - detection: In object detection, mAP is computed from class-specific precision-recall curves as in classification (see Section V-A10), but the difference is in computing the ranked lists used for computing the precision and recall values. In detection, an IoU threshold is first selected (a value of 0.5 is commonly used). Then, the detected bounding boxes for a class are ranked based on their confidence values. They are then assigned as true positives or false positives by comparing them with the ground-truth bounding boxes for the class and using the selected IoU threshold. If there are several detected bounding boxes corresponding to a single ground-truth bounding box, the detected bounding box with the highest IoU is considered as true positive, all the others as false positives. The ranked list with the corresponding true/false positive assignments can be used to compute the class-specific AP scores and, finally, the mAP score as described in Section V-A10.
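The assignment of true and false positives for one class and one image can be sketched as follows. Boxes are assumed to be axis-aligned tuples (x1, y1, x2, y2), and the function names are illustrative. The sketch follows the highest-IoU rule stated above for duplicate detections; note that some benchmark implementations instead resolve duplicates in confidence order, so results may differ slightly between toolkits.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_detections(detections, gt_boxes, iou_threshold=0.5):
    """Mark each detection as true or false positive.

    detections: list of (confidence, box) for one class.
    Returns a list of (confidence, is_true_positive) ranked by confidence.
    """
    best_per_gt = {}  # ground-truth index -> (IoU, detection index)
    for d, (_, box) in enumerate(detections):
        ious = [box_iou(box, gt) for gt in gt_boxes]
        if not ious:
            continue
        g = max(range(len(ious)), key=ious.__getitem__)  # best-overlapping ground truth
        if ious[g] >= iou_threshold:
            # Keep only the detection with the highest IoU for each ground truth.
            if g not in best_per_gt or ious[g] > best_per_gt[g][0]:
                best_per_gt[g] = (ious[g], d)
    true_positive_ids = {d for _, d in best_per_gt.values()}
    ranked = sorted(range(len(detections)), key=lambda d: detections[d][0], reverse=True)
    return [(detections[d][0], d in true_positive_ids) for d in ranked]
```

The resulting ranked list of true/false positive flags is exactly the input needed for the class-specific precision-recall curve and AP computation.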
3) Soft-IoU: In defect detection, defects such as gas cavities in casting parts cannot be considered independent objects in the same way as animals or humans. It can be equally correct to annotate cavities close to each other with a single bounding box or with several separate bounding boxes. While the IoU criterion is used to match a single output bounding box with a single ground-truth bounding box at a time, the Soft-IoU algorithm proposed in [36] can match multiple bounding boxes with a single bounding box or vice versa. mAP can be computed using Soft-IoU when it is more suitable than IoU.
4) Object Localization Accuracy: Object localization accuracy is used for evaluating object localization heatmaps using ground-truth bounding boxes. If the pixel of maximum response is inside one of the bounding boxes for the specific class, the detection is considered a true positive. Otherwise, it is considered a false positive. Finally, the object localization accuracy is computed as $\frac{TP}{TP + FP}$. This metric is used in a class-specific object localization task for the SIXray dataset [8] described in Section VI-C.
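A minimal sketch of this metric is shown below. It assumes that each heatmap is a 2D list indexed as heatmap[row][column], that boxes are given as inclusive pixel ranges (x1, y1, x2, y2), and that ties for the maximum response are broken arbitrarily; the function names are hypothetical.

```python
def is_hit(heatmap, gt_boxes):
    """True if the maximum-response pixel lies inside any ground-truth box."""
    _, row, col = max(
        (value, r, c)
        for r, line in enumerate(heatmap)
        for c, value in enumerate(line)
    )
    return any(x1 <= col <= x2 and y1 <= row <= y2 for (x1, y1, x2, y2) in gt_boxes)

def localization_accuracy(samples):
    """samples: list of (heatmap, gt_boxes) pairs for one class; returns TP / (TP + FP)."""
    hits = [is_hit(heatmap, boxes) for heatmap, boxes in samples]
    return sum(hits) / len(hits)
```

Because each heatmap contributes exactly one true or false positive, the score is simply the fraction of images whose peak response falls on an annotated object.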

D. Speed Metrics
There are several time-based metrics to assess the speed of the models, such as Frames Per Second (FPS), which was used in [38]. Other common speed metrics are training time [5], [33], [63] and evaluation time [4], [5], [44], [63], which can be reported for central processing unit (CPU)-based and graphics processing unit (GPU)-based executions.
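As an illustration of how FPS is typically obtained, the sketch below times a user-supplied inference function over a list of inputs. The name `measure_fps`, the callback `infer`, and the warm-up length are assumptions of this example; on GPUs, the device would additionally need to be synchronized before reading the clock, which is outside the scope of this sketch.

```python
import time

def measure_fps(infer, inputs, warmup=2):
    """Return processed inputs per second for the callable `infer`."""
    for x in inputs[:warmup]:
        infer(x)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed if elapsed > 0 else float("inf")

# Toy stand-in for a model: any callable taking one input works here.
fps = measure_fps(lambda x: sum(range(1000)), list(range(50)))
```

Reported speeds are only comparable when the hardware, input resolution, and batch size are the same, so these should be stated alongside the FPS value.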

VI. DATASETS
Datasets have a central role in the development of CV methodologies. They are necessary to adjust or train the models, and the availability of a public dataset also makes it possible to have a fair comparison among the performance of different CV methodologies. As seen in our review of CV studies on industrial or security-related X-ray images, most of the studies in the field used lab and industrial environment datasets that are not publicly available. In our opinion, this is one of the main reasons that research on this topic is not advancing at the same pace as other topics in CV where a variety of widely used public datasets exists. However, there are a few publicly available datasets that can be used to develop new methodologies and compare their performance with previously proposed approaches. The main characteristics of these datasets are summarized in Table 3 and more details for each dataset are provided in the following.

A. CoCr AM XCT
The CoCr AM XCT dataset is introduced in [6] and it is available at [22]. The dataset consists of 4,350 images of five cylindrical additive manufacturing specimens. In each specimen, a different minor variation along with geometric magnifications is applied, leading to small variations in voxel sizes. A 4-times optical magnification is used on all the specimens. Furthermore, a different exposure time is adopted for each sample.
Sample images from the CoCr AM XCT dataset are shown in Fig. 13, including images from all five specimens. It should be noted that this dataset was not provided for CV purposes and there is no ground-truth labeling. In [6], the Bernsen local thresholding method [164] was used for defect segmentation and this segmentation was later used as the ground-truth segmentation mask in [33] for evaluating the performance of a fully-convolutional 3D segmentation network. However, as convolutional neural networks have the potential to outperform simple thresholding-based methods such as the one used for the ground-truth generation, it cannot be guaranteed that this evaluation protocol leads to a fair comparison between more advanced methods.

B. GDXray+
The GDXray+ dataset [7] provides a collection of more than 21,100 X-ray images to develop, test, and evaluate CV and image analysis methods. The dataset is named GDXray according to the name of the Machine Intelligence Group performing the data collection (GRIMA X-ray database). The data can be used freely only for research and education purposes.
GDXray+ includes five groups of images (casting, welding, security, nature, and setting). Three of these five groups can be considered relevant for our review and they are briefly described in the following:
• The GDXray Casting dataset contains 2,727 X-ray images mainly from automotive parts, including aluminum wheels and knuckles, many of which contain casting defects. The casting defects in each image are labeled with tight-fitting bounding boxes. The size of the images in the dataset ranges from 256 × 256 pixels to 768 × 572 pixels. This group of X-ray images is arranged in 67 series. The description and applications of each series are available in [7]. Fig. 14 shows a random collection of the images from the GDXray Casting dataset. While the dataset does not define a default evaluation metric, mAP at IoU 0.5 has been used in [4], [44]. A random split into an 80% training set and a 20% test set was proposed and made publicly available in [44]. A new dataset obtained from the GDXray Casting data by cropping 32 × 32 pixel patches is introduced in [41]. This dataset consists of 47,520 X-ray casting images along with their labels.
• The GDXray Welding dataset includes 88 images of metal pipe welds with porosity defects and it contains pixel-wise ground-truth segmentation information for some of the images. This group of X-ray images is arranged in 3 series. The description and applications of each series are available in [7]. Two samples from this dataset along with their ground-truth segmentation are shown in Fig. 15.
• The GDXray Baggage dataset contains 8,150 X-ray baggage scans containing both occluded and non-occluded items with marked ground truths for handguns, razor blades, shurikens, and knives. This group of X-ray images is arranged in 77 series. The description and applications of each series are available in [7]. Some image samples from this dataset are shown in Fig. 16.

C. SIXray
SIXray is a pseudo-color X-ray security inspection dataset introduced in [8]. It includes over a million X-ray images that were collected at several subway stations using color X-ray scanners that assign various colors to different materials. The data is categorized into six common threat groups (gun, wrench, knife, scissors, pliers, and hammer). Some samples from this dataset are shown in Fig. 17.
To study the impact of class imbalance, three different subsets known as SIXray10, SIXray100, and SIXray1000 are defined as follows: SIXray10 has all 8,929 scans with suspicious items and ten times as many scans without suspicious items; SIXray100 contains all scans including suspicious items and 100 times as many non-suspicious scans; SIXray1000 has only 1000 images with suspicious items and all images without suspicious items. Each subset is randomly divided into a training set containing 80% of the images and a test set containing the remaining 20% of the images. Image-level annotations provided by human security inspectors are available for the whole dataset, while bounding box annotations of prohibited items are available only for the test datasets.
The original dataset paper provides results for two different tasks: image-level classification and object localization. Both tasks are evaluated separately for each class. For image-level classification, the methods should rank the test images based on their probability to contain a specific object and the results are evaluated using mAP (see Section V-A10), similar to the Pascal VOC classification challenge [155]. For object localization, the evaluated methods produce heatmaps for each class separately and the performance is evaluated using object localization accuracy (see Section V-C4), as in [165].
FIGURE 14. Samples of X-ray images from casting specimens, available in GDXray data [7].
FIGURE 15. Samples of X-ray images from welding specimens and their segmentation, available in GDXray data [7].
FIGURE 17. Samples of X-ray security images, available in SIXray data [8].

D. OPIXray
OPIXray [9] is a pseudo-color X-ray security dataset of occluded prohibited items. The backgrounds of all samples are scanned by the security inspection machine and the prohibited items are synthesized into these backgrounds using professional software. The dataset consists of 8,885 X-ray images of prohibited items from five categories: Straight Knife, Folding Knife, Utility Knife, Multitool Knife, and Scissors. All prohibited items are annotated manually with a bounding box by a professional inspector from an international airport. Each image contains at least one prohibited item, while some have more, and images are stored in JPG format with a resolution of 1225 × 954. Some samples from this dataset are shown in Fig. 6.
The dataset is partitioned into an 80% training set and a 20% test set. Furthermore, the test set is divided into three subsets (OL1-3) containing prohibited items with different occlusion levels. The task to be performed is object detection and the evaluation metric used in [9] is mAP with 0.5 IoU threshold.

E. PIDray
PIDray [10] is a large X-ray dataset including 47,677 real security images, each of which contains at least one prohibited item. Some prohibited items have been deliberately hidden. The images are collected from different scenarios including airports, railway stations, and subway stations using three different security inspection machines from different manufacturers, resulting in a variety of sizes, colors, and resolutions. The prohibited item categories are knife, gun, scissors, lighter, sprayer, baton, wrench, pliers, hammer, handcuffs, power bank, and bullet. Some samples of this dataset are shown in Fig. 18.
The dataset is divided into a 60% training set and a 40% test set. In addition, the data is split into three groups, namely easy, hard, and hidden. The images are annotated with both bounding boxes and segmentation masks. Therefore, this dataset can be used for classification, object detection, and instance segmentation. The performance metrics used in [10] are Average Precision and Average Recall averaged over 12 classes and 10 IoU thresholds between 0.5 and 0.95.

F. HiXray
The High-quality X-ray (HiXray) security inspection image dataset was introduced in [11]. It is the largest high-quality dataset for prohibited item detection, and it contains 45,364 pseudo-color X-ray images with 102,925 common prohibited items, which are categorized into 8 classes, namely portable charger 1 (lithium-ion prismatic cell), portable charger 2 (lithium-ion cylindrical cell), water, laptop, mobile phone, tablet, cosmetic, and nonmetallic lighter. The images are collected from a real-world airport security inspection and bounding box annotations are provided manually by professional security inspectors. The images are in JPG format with an average resolution of 1200 × 900, and on average, each image has 2.27 prohibited items. A sample of each class in the HiXray dataset is shown in Fig. 19.
The dataset is divided into training and test subsets with a 4:1 ratio. The dataset is proposed for detection tasks and can be used for more specific tasks such as small object or occluded object detection. The evaluation metric used in [11] is mAP with 0.5 IoU threshold.

G. CLCXray
The Cutters and Liquid Containers X-ray Dataset (CLCXray) [12] focuses particularly on the overlap problem in security images. While the OPIXray dataset also focuses on the overlap problem, CLCXray has more overlap between objects and similar backgrounds, as well as overlap between multiple objects. In addition, the images in OPIXray are synthetic, while CLCXray contains real images. Overall, compared to other security datasets, CLCXray has the most labeled images, the most labeled threat objects, the most threat categories, and more accurate bounding box annotations.
There are 9,565 pseudo-color X-ray security images that consist of 4,543 images collected from real subway scenes and 5,022 simulated images from manually designed baggage. The images are labeled by professionals in 12 categories including 5 classes of cutters (blade, knife, dagger, scissors, Swiss army knife) and 7 classes of liquid containers (cans, carton drinks, plastic bottle, glass bottle, vacuum cup, tin, spray cans). In total, there are more than 20,000 potentially dangerous items in the dataset, resulting in an average of more than two items per image. The images have resolutions between 373 × 200 and 732 × 1280 pixels. A sample of each category in the CLCXray dataset is shown in Fig. 20.
The dataset is divided using an 8:1:1 ratio into training, validation, and testing sets. Annotations are provided in COCO format. For evaluation, CLCXray adopts the COCO evaluation metrics [166]: $mAP$ is the mean average precision computed across 10 IoU levels of 0.5:0.05:0.95, $mAP_{50}$ is computed at a single IoU of 0.5, $mAP_{75}$ is computed at a single IoU of 0.75, $mAP_s$ is the mAP for small objects (area < $32^2$), $mAP_m$ is the mAP for medium objects ($32^2$ < area < $96^2$), and $mAP_l$ is the mAP for large objects ($96^2$ < area).
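The COCO-style settings can be made concrete with a short sketch. It assumes object areas measured in pixels and the COCO size thresholds of 32² and 96² pixels; the names `IOU_THRESHOLDS` and `size_bucket` are illustrative, and assigning objects that lie exactly on a boundary to the larger bucket is a choice of this example.

```python
# Ten IoU thresholds 0.50, 0.55, ..., 0.95 used for the averaged mAP.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def size_bucket(area):
    """Assign an object to the small, medium, or large group by pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

For example, a 30 × 30 pixel object (area 900) counts as small, while a 100 × 100 pixel object (area 10,000) counts as large.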

VII. PERFORMANCE COMPARISON
Due to the lack of available public datasets for industrial X-ray image processing, most studies have used private data, which makes it difficult to compare and verify the performance of different approaches for different tasks. Furthermore, even studies using public datasets use different evaluation protocols and metrics. Several works have not reproduced the results of some of the methods used in the comparisons but simply transferred them from prior work. Thus, direct comparison with the exact same experimental protocol is not possible. In some cases, this has even led to directly comparing results for different tasks (e.g., classification and detection).
In this section, we have collected results reported in previous works using public datasets. These results are divided based on the datasets and presented in Tables 4 to 12. Due to the above-mentioned problems, we report the results collected from different studies in separate blocks of rows with different background colors. Furthermore, we indicate where the comparative results have been collected from different studies, and in our discussion, we point out some of the clearest problems we observed.

A. CoCr AM XCT
As can be seen in Table 4, only one paper employed this dataset [33]. The paper assessed three variants of 3D U-Net on the dataset. The first one is a 3D U-Net with convolutional, batch normalization (BN), and rectified linear unit (ReLU) layers. The second one uses convolutional, group normalization (GN), and ReLU layers, and the last one is a residual symmetric 3D U-Net. mIoU and training time on GPU were chosen as the accuracy and time metrics and the results show that the residual symmetric 3D U-Net achieved the best accuracy, but it was slower to train. Since training time does not affect real-time operation, it can be neglected, and accuracy serves as the selection criterion among these three methods. However, as mentioned in Section VI-A, the ground-truth annotations were generated using a simple thresholding approach and it is unclear whether this leads to a meaningful comparison of the more advanced methods.

B. GDXray Casting
As can be seen in Table 5, two papers employed this dataset [4], [44]. In the first one [44], the performances of

C. GDXray Welding
Results for the GDXray Welding dataset are shown in Table 6. In [47], an improved U-Net was compared with U-Net and eGAN using five accuracy indices (recall/sensitivity, specificity, accuracy, AUC, and Dice), and it achieved better performance on four of the five indices. Two DL networks with two and three hidden layers were proposed in [58]. The three-hidden-layer network achieved better performance as evaluated by Precision, Recall, and F1. As the original dataset contains only 88 large welding images, both of the above studies applied cropping to create a larger dataset. However, different cropping makes the results incomparable. Furthermore, in [4] transfer learning from casting defect detection to welding defect detection was evaluated using the GDXray Welding dataset with again a different image cropping approach. In this work, a mAP mask of 0.85 was achieved.
The GDXray Welding data was also used as a part of the experiments in [48], where a Retina-based network was used for welding defect detection. However, here the dataset used in the experiments also contained the GDXray Casting data as well as privately collected welding X-ray images, and thus the results are even less comparable to other studies.

D. GDXray Security
As shown in Table 7, all available studies on the GDXray Security dataset applied different experimental protocols, such as different subsets of data, different splitting between train and test sets, and different classes included. Therefore, it is not possible to reliably compare the performance of the suggested methods between studies.
The authors of the GDXray dataset applied implicit shape model (ISM), adapted implicit shape model (AISM), ISM! (ISM!), AISM! (AISM!), SURF, and SIFT-based non-deep learning methods for threat detection in [167], and AISM led to the highest performance. To evaluate their performance, they computed ROC curves at three different IoU levels. They then reported AUC, the true positive rate (recall) at a false positive rate of 0.05 (R 0.05), as well as the true and false positive rates at the best operating point. In Table 7, we report AUC and the best operating point results as R and Sp (1 - false positive rate), but it should be noted that in the later works these metrics were also computed in a different manner.
In [168], Faster R-CNN, YOLOv2, and Tiny YOLO were used for object detection, but the performances were evaluated at the image level using classification metrics, apparently without considering the bounding box overlap in any way. Faster R-CNN achieved the best performance in terms of accuracy, F1 score, recall, and precision. The works in [5], [63] compared the performance of their proposed methods with the methods presented in [167], [168]. However, given the different experimental protocols, these comparisons are not reliable, as also noted by the authors themselves.
Although the mAP metric was used to assess detection performance in [82], [148], [169], it is not possible to reliably compare the performance due to the variations in the experimental protocols. In [82], adding a PDN branch to Faster R-CNN led to improved performance. In [148], a learning-based image synthesis method was proposed to generate more training data. This method was evaluated with four different detection architectures and compared against the random training data generation proposed in [171]. PFPNet was the best-performing architecture, while the proposed image synthesis approach consistently led to better results. In [169], Unlike the other studies on the GDXray Security dataset, classification was considered as the CV task in [24]. The performance of ten CV methods was assessed, and GoogleNet achieved the best test accuracy.
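The three ROC-derived quantities mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of [167]: the function names and toy scores are assumptions, and the "best operating point" is taken here to be the ROC point closest to the ideal corner (0, 1), which is only one of several common definitions.

```python
def roc_points(scores, labels):
    """ROC points (fpr, tpr) obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples sharing the same score in one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def recall_at_fpr(points, max_fpr=0.05):
    """Highest true positive rate reachable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


def best_operating_point(points):
    """ROC point closest to (0, 1); an assumed definition of 'best'."""
    return min(points, key=lambda p: p[0] ** 2 + (1 - p[1]) ** 2)


# Toy detector scores (made up): 4 positives, 5 negatives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]
pts = roc_points(toy_scores, toy_labels)
fpr, tpr = best_operating_point(pts)
print(auc(pts), recall_at_fpr(pts), tpr, 1 - fpr)  # AUC, R at 0.05, R, Sp
```

Which detections count as true positives depends on the IoU level chosen, so each IoU level produces its own curve and its own set of these three numbers.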
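To illustrate why image-level scoring of a detector is not comparable to overlap-based scoring, the sketch below evaluates the same prediction both ways. The helper names, the class label, and the box coordinates are hypothetical; boxes are assumed to be given as (x1, y1, x2, y2).

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def image_level_hit(gt_objects, predictions):
    """True if any predicted class occurs in the image; location is ignored."""
    gt_classes = {cls for cls, _ in gt_objects}
    return any(cls in gt_classes for cls, _ in predictions)


def box_level_hit(gt_objects, predictions, iou_thr=0.5):
    """True only if a prediction matches a same-class object with enough overlap."""
    return any(p_cls == g_cls and box_iou(p_box, g_box) >= iou_thr
               for p_cls, p_box in predictions
               for g_cls, g_box in gt_objects)


# Hypothetical image: the right class is predicted in the wrong place.
gt = [("gun", (10, 10, 50, 50))]
pred = [("gun", (60, 60, 100, 100))]
print(image_level_hit(gt, pred))  # counted as correct at the image level
print(box_level_hit(gt, pred))    # counted as a miss at the box level
```

A detector can therefore score well under image-level classification metrics while localizing objects poorly.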

E. SIXray
As explained in VI-C, the original dataset paper [8] defines three different subsets of data (SIXray10, SIXray100, and SIXray1000) as well as two tasks: image-level classification evaluated by mAP and object localization evaluated by localization accuracy. The original dataset paper also provides baseline results for three different network architectures with and without class-balanced hierarchical refinement (CHR) on these tasks on each subset, as reported in Table 8.
In [5], [63], the performance is directly compared against results copied from [8], but due to a different experimental protocol, this is questionable. In [5], [63], the classification is carried out at the object-proposal level, not at the image level as in [8]. Furthermore, while not clearly described, it appears that only the detected object proposals are considered, i.e., completely undetected objects will not harm the classification performance. The computation of localization accuracy is not described.
In [80], the focus is on foreground-background separation. The proposed method FBS is compared against four detection methods, namely, YOLOv4, De-Occlusion Attention Module (DOAM), CHR, and RGBS, on SIXray10 data using mAP (detection) at 0.5 IoU as the evaluation metric.
The remaining studies [66], [134], [170] on the SIXray dataset did not use the subsets defined in [8]; instead, they picked only the images containing prohibited items (8929 images). Image-level multi-label classification was tackled in [66], whereas [134], [170] focused on detection. In [170], the results are directly compared with image-level classification results picked from [8] despite the different subset of the dataset used and the different CV task evaluated. Furthermore, different splitting into training and test sets makes the results from the detection papers [134], [170] incomparable. In addition, the papers do not report the IoU threshold used for mAP evaluation, which also makes future comparisons with the reported results unreliable.
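The sensitivity of AP to the unreported IoU threshold can be illustrated with a small single-class sketch. It assumes greedy matching of score-ranked detections to ground-truth boxes and all-point interpolation of the precision-recall curve; these choices, the function names, and the toy boxes are assumptions for illustration and do not reproduce any particular paper's evaluation code.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr):
    """AP for one class.

    detections: list of (image_id, score, box)
    gt_boxes:   dict mapping image_id to a list of ground-truth boxes
    """
    n_gt = sum(len(boxes) for boxes in gt_boxes.values())
    if n_gt == 0:
        return float("nan")
    matched = {img: [False] * len(boxes) for img, boxes in gt_boxes.items()}
    tp_flags = []
    for img, _, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes.get(img, [])):
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True   # first sufficiently overlapping hit
            tp_flags.append(1)
        else:
            tp_flags.append(0)            # poor overlap or duplicate detection
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # Make precision non-increasing in recall (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy data (made up): one exact detection and one with IoU 0.6.
toy_gt = {"img1": [(0, 0, 10, 10)], "img2": [(0, 0, 10, 10)]}
toy_dets = [("img1", 0.9, (0, 0, 10, 10)), ("img2", 0.8, (0, 0, 10, 6))]
print(average_precision(toy_dets, toy_gt, iou_thr=0.5))
print(average_precision(toy_dets, toy_gt, iou_thr=0.75))
```

In this toy case the identical set of detections scores an AP of 1.0 at a 0.5 threshold and 0.5 at a 0.75 threshold, which is why a reported mAP without its threshold cannot be compared against.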

F. PIDray
At the time of writing this paper, the main paper that introduced the PIDray dataset [10] was the only one reporting results on this dataset. Two tasks, i.e., detection and segmentation, were considered, and several methods were evaluated for both tasks using AP obtained by averaging over multiple IoU levels and all categories. We report these results in Table 9.
The methods evaluated in [10] included the proposed Selective Dense attention Network (SDANet), which is an architecture based on Cascade Mask-RCNN [172] that has a ResNet-101 network as its backbone. It can be seen that SDANet achieved the best performance compared to the others, with overall detection AP and segmentation AP of 61.6 and 49.8, respectively.
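Averaging AP over IoU levels and categories can be sketched as below. The ten thresholds from 0.50 to 0.95 in steps of 0.05 follow the common COCO convention and are an assumption here, as are the function name, the category names, and the toy per-threshold AP values, which are made up purely to show the arithmetic.

```python
def mean_ap_over_ious(ap_table):
    """Average AP first over IoU thresholds, then over categories.

    ap_table: dict mapping category name to {iou_threshold: AP}.
    """
    per_category = []
    for by_threshold in ap_table.values():
        per_category.append(sum(by_threshold.values()) / len(by_threshold))
    return sum(per_category) / len(per_category)


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Made-up AP values for two hypothetical categories.
toy_table = {
    "category_a": {thr: 0.8 for thr in IOU_THRESHOLDS},
    "category_b": {thr: (0.6 if thr <= 0.70 else 0.2) for thr in IOU_THRESHOLDS},
}

overall = mean_ap_over_ious(toy_table)
at_050_only = sum(by_thr[0.5] for by_thr in toy_table.values()) / len(toy_table)
print(overall, at_050_only)
```

With these toy values the multi-threshold average is about 0.6 while the value at 0.5 IoU alone is 0.7, so numbers obtained under the two conventions should not be placed in the same column.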

G. OPIXray
The original OPIXray dataset paper [9] provides several results for the object detection task evaluated using mAP at a 0.5 IoU threshold, focusing on the performance of the proposed De-occlusion Attention Module (DOAM). In Table 10, we report the results for three architectures (SSD, YOLOv3, and the fully convolutional one-stage object detector (FCOS) [173]) with and without DOAM. FCOS+DOAM obtained the best performance.
Compared to other datasets, there appears to be more consistency in the way the OPIXray dataset has been used in the evaluations, and therefore, comparison of results across papers is possible. In [80], YOLOv4 was evaluated by itself and with different additional modules, namely DOAM, CHR, RGBS, and the proposed FBS. The proposed FBS approach achieved the best performance in terms of mAP (shown in Table 10), η, and recall, while DOAM achieved the best performance in terms of precision and F1. In [11], the authors of the OPIXray dataset paper proposed the Lateral Inhibition Module (LIM) and provided some additional results that are reliably comparable with the original OPIXray results. The best performance was achieved by YOLOv5 combined with LIM. The Adaptive Training Sample Selection (ATSS) model [174] was evaluated in [12] by itself, with DOAM, and with two proposed Label-aware Mechanisms. Label-aware classification (LAcls) achieved the best performance. In [170], Cascade R-CNN + Information-exchange Enhanced Feature Pyramid Network (IEFPN) led to the best accuracy in comparison to the other implemented methods. The dataset has also been used for image-level multi-label classification in [66]. The performance of the proposed SXM-Net was compared with ResNet-50, ResNet-50 + FPN, and CHR. As the results show, the suggested method achieved the best performance.

H. HiXray
At the time of preparing this article, the HiXray dataset paper [11] was the only one providing results on the dataset. Three object detection architectures (SSD, FCOS, YOLOv5) were evaluated as such, with DOAM, and with the proposed LIM. The performance was evaluated using mAP at 0.5 IoU. As shown in Table 11, the combination of YOLOv5 and LIM led to the highest performance.

I. CLCXray
At the time of writing this article, only the CLCXray dataset paper [12] provides results on the CLCXray dataset. Several approaches for object detection are evaluated using the COCO evaluation metrics as described in Section VI-G. We report results for the best-performing approaches along with some results for well-known detection architectures in Table 12. Based on the main metric, mAP, the proposed method using LAcls reached the best performance.

VIII. CONCLUSION
As a non-destructive technology, X-ray imaging is finding use in different industrial and security applications to assess the inner structure or contents by measuring mass distributions (absorption rate). Automatic assessment of X-ray images, in terms of detection, classification, and segmentation, can be achieved by applying CV-based methods. In this paper, a review of CV studies on X-ray data applications in industrial production and security areas was presented.
While a large number of recent studies have focused on this topic and many advances have been made, as is evident from our review, we observed considerable room for improvement in experimental evaluation across the field. To advance as a field, the proposed approaches and obtained results need to be comparable across studies. At the moment, a large part of the studies use only private data. Furthermore, even the studies using public datasets use varying experimental setups that often make the results incomparable.
We recommend that all works comparing their results with prior work first carefully check the computer vision task, noting that the same evaluation metrics can be used in classification, detection, and segmentation, while the results are naturally not comparable across tasks. Furthermore, we recommend carefully following the experimental protocols, including data subsets, splitting into training and test sets, and evaluation metrics, noting that any variation makes the results incomparable. It should also be noted that common metrics, such as mean average precision, have multiple implementations, and therefore every work should also carefully report the details of its selected metric.
New, larger datasets for different applications are still needed. For newly published datasets, it is naturally important to document all the details of the experimental protocol. To avoid variations in later works using the same dataset, we also recommend publishing the implementations for running evaluations on the dataset.
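One way to keep the train/test split identical across studies, when a dataset does not ship with an official split file, is to derive the assignment deterministically from the image identifier. The sketch below is only one possible approach under that assumption; the function name, the salt string, the file names, and the 20% test fraction are all hypothetical.

```python
import hashlib


def assign_split(image_id, test_fraction=0.2, salt="protocol-v1"):
    """Deterministically map an image identifier to 'train' or 'test'.

    The assignment depends only on the identifier and the salt, so any
    group using the same values reproduces exactly the same split.
    """
    digest = hashlib.sha256(f"{salt}:{image_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2 ** 32   # pseudo-uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical file names; the split is stable across runs and machines.
image_ids = [f"img_{i:05d}.png" for i in range(1000)]
splits = [assign_split(image_id) for image_id in image_ids]
print(splits.count("train"), splits.count("test"))
```

Where a dataset paper already publishes its split, that split should of course be used instead.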
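As a hedged illustration of what documenting the protocol alongside an evaluation implementation could look like, the sketch below records the choices discussed in this survey in a machine-readable form. The class name, the field names, and the example values (including the split file path) are hypothetical and not taken from any existing dataset release.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of everything needed to reproduce an evaluation."""
    dataset: str
    subset: str
    task: str               # "classification", "detection", or "segmentation"
    split_file: str         # path to the published train/test split
    metric: str
    iou_thresholds: tuple   # empty for image-level classification
    ap_interpolation: str   # e.g. "all-point" or "11-point"


# Example values are illustrative only.
protocol = EvalProtocol(
    dataset="SIXray",
    subset="SIXray10",
    task="detection",
    split_file="splits/sixray10_v1.json",
    metric="mAP",
    iou_thresholds=(0.5,),
    ap_interpolation="all-point",
)

# Serialize so the protocol can be published next to the reported numbers.
serialized = json.dumps(asdict(protocol), indent=2)
print(serialized)
```

Publishing such a record together with the evaluation script lets later works confirm that the task, subset, split, and metric variant match before comparing numbers.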

FIGURE 3. Different X-ray imaging techniques based on the number of beam sources: a) mono-energy, b) multi-energy.

FIGURE 4. Different X-ray imaging techniques based on the view: a) multi-view, b) 3D view.

FIGURE 5. Computer Vision task examples, a) Binary classification for casting defect detection [25], b) Object detection in X-ray baggage security inspection [25], c) Semantic segmentation of an X-ray welding image [25], and d) Instance segmentation of glass fibers in industrial computed tomography [26].

FIGURE 13. Samples of X-ray images from cylindrical AM specimens, available in CoCr AM XCT dataset [6].

TABLE 2. List of important abbreviations used throughout the paper in alphabetical order.

TABLE 3. X-ray datasets in industrial and security applications.

Samples of X-ray security images, available in HiXray data [11].

TABLE 4. Performance comparison on CoCr AM XCT dataset. mIoU: mean Intersection over Union.

TABLE 5. Performance comparison on GDXray Casting dataset. mAP bbox: mean Average Precision at 0.5 IoU for bounding box predictions; mAP mask: mean Average Precision at 0.5 IoU for mask predictions.

TABLE 6. Performance comparison on GDXray Welding dataset. η: Accuracy, AUC: Area Under Curve.

TABLE 7. Performance comparison on GDXray Security dataset. mAP: mean Average Precision, Sp: Specificity, η: Accuracy, AUC: Area Under Curve, R: Recall, P: Precision.

TABLE 8. Performance comparison on SIXray dataset. mAP cla: mean Average Precision (classification), mAP det: mean Average Precision (detection).

TABLE 9. Performance comparison on PIDray dataset. AP: Average Precision.

TABLE 10. Performance comparison on OPIXray dataset. mAP det: mean Average Precision (detection), mAP cla: mean Average Precision (classification).

TABLE 11. Performance comparison on HiXray dataset. mAP: mean Average Precision.