Three-Dimensional Residual Neural Architecture Search for Ultrasonic Defect Detection

This study presents a deep-learning (DL) methodology using 3-D convolutional neural networks (CNNs) to detect defects in carbon fiber-reinforced polymer (CFRP) composites through volumetric ultrasonic testing (UT) data. Acquiring large amounts of ultrasonic training data experimentally is expensive and time-consuming. To address this issue, a synthetic data generation method was extended to incorporate volumetric data. By preserving the complete volumetric data, complex preprocessing is reduced, and the model can utilize spatial and temporal information that is lost during imaging. This enables the model to utilize important features that might be overlooked otherwise. The performance of three architectures was compared. The first architecture is prevalent in the literature for the classification of volumetric datasets. The second demonstrated a hand-designed approach to architecture design, with modifications to the first architecture to address the challenges of this specific task. A key modification was the use of cuboidal kernels to account for the large aspect ratios seen in ultrasonic data. The third architecture was discovered through neural architecture search (NAS) from a modified 3-D residual neural network (ResNet) search space. In addition, domain-specific augmentation methods were incorporated during training, resulting in significant improvements in model performance, with a mean accuracy improvement of 22.4% on the discovered architecture. The discovered architecture demonstrated the best performance with a mean accuracy increase of 7.9% over the second-best model. It was able to consistently detect all defects while maintaining a model size smaller than most 2-D ResNets. Each model had an inference time of less than 0.5 s, making them efficient for the interpretation of large amounts of data.


I. INTRODUCTION
COMPOSITES are versatile materials that are widely used in many industries due to their superior mechanical properties, such as corrosion resistance, specific strength, and specific stiffness. Carbon fiber-reinforced polymer (CFRP) is a widely used composite in the aerospace industry, making up over 50 wt% of the two most recent long-range aircraft, the Airbus A350 and Boeing 787, and up to 70-80 wt% for private jets and helicopters [1]. CFRP is fabricated by layering multiple carbon ply sheets and curing them with a thermoset polymer.
Shaun McKnight, Christopher MacKinnon, S. Gareth Pierce, Ehsan Mohseni, Vedran Tunukovic, Charles N. MacLeod, and Randika K. W. Vithanage are with the Sensor Enabled Automation, Robotics, and Control Hub (SEARCH), Center for Ultrasonic Engineering (CUE), Electronic and Electrical Engineering Department, University of Strathclyde, G1 1XQ Glasgow, U.K. (e-mail: shaun.mcknight@strath.ac.uk).
Tom O'Hare is with Spirit AeroSystems, BT3 9DZ Belfast, U.K.
Digital Object Identifier 10.1109/TUFFC.2024.3353408
Nondestructive evaluation (NDE) refers to a suite of techniques employed to inspect components without causing any damage.

Highlights
• Neural architecture search (NAS) is leveraged to discover an effective architecture for detection in ultrasonic volumetric data.
• The discovered architecture was able to achieve 100% accuracy on the experimental test data, a 7.8% improvement over the next best-performing architecture.
• This work demonstrates a deep-learning defect detection technique for volumetric ultrasonic data. It provides a solution to the challenge of automated ultrasonic inspection.
Radiography, thermography, electromagnetic methods, and ultrasound are among the most widely used NDE techniques. These methods allow inspection of components with varying levels of complexity and size, and the choice of NDE technique depends on the nature of the component and the defects to be detected. The application of these NDE techniques has significantly improved the reliability and safety of various structures and components across numerous industries. Ultrasonic testing (UT) is a versatile technique that can be used to inspect components made of various materials, including composites, and is based on the transmission, propagation, and reception of ultrasonic waves. UT has become widely adopted and standardized for volumetric inspection in the aerospace industry due to its relatively easy and hazard-free implementation compared to radiography and its ability to detect a wide range of volumetric defects [3], [7], [10], [11]. In UT, sound waves are excited on the surface of a component, and the reflected/scattered waves from internal scatterers provide valuable information about the volumetric discontinuities of the component. Currently, phased arrays are the preferred technology for generating the initial sound wave owing to their operational flexibility. Phased arrays employ independently controllable UT elements that enable more complex electronic scanning and imaging, such as beam steering, dynamic depth focusing, and variable subapertures [9]. By controlling each individual element (or subaperture of elements) of a linear phased array, depth-wise sectional images (B-scans) can be created in a single scan [Figs. 1 and 2(b)]. When combined with mechanized scanning perpendicular to the length of a linear phased array, complete 3-D volumetric scan data of components can be generated by stacking multiple individual B-scans together at known positions [Fig. 2(b)]. This technique is highly valuable for assessing the structural integrity of large and complex components and has significant implications for the reliability and safety of aerospace structures.
UT data are commonly visualized as images, either by selecting a B-scan directly or as an amplitude or time-of-flight C-scan, where either the maximum response amplitude or the time index of the maximum response amplitude within the volume is imaged to produce a top-down section view across the sample [13].
The integration of robotics into NDE has revolutionized large-scale inspection processes by enabling efficient automated inspection of large components [14]. While robotic scanning offers greater flexibility and significantly reduces scan time compared to manual scanning, the interpretation of results remains a tedious and time-consuming task in the industry. To interpret results in line with existing standards, there is a requirement for highly trained and qualified operators [10], [15], [16], [17], [18], [19]. Despite the significant improvements brought about by robotic NDE, the need for expert human interpretation of results persists. This highlights the need for further research and development of automated data interpretation techniques that can supplement or even replace human interpretation, to improve the efficiency and reliability of NDE in various industries. By reducing the dependence on human interpretation, automation can potentially enhance the consistency, repeatability, and traceability of NDE processes while reducing inspection time and costs.
The interpretation of UT scan results by human operators presents two significant drawbacks, namely, poor time efficiency and the risk of human error [17]. Low levels of automation for data interpretation are feasible for mass-produced parts with precisely known geometries, but this approach typically relies on hard-coded features such as predefined time gating, filtering, and amplitude thresholding, which may not be adequate for complex tasks involving changes in manufacturing conditions, variations in geometry, or defect characterization [19]. There is a clear necessity for an automated approach to interpreting UT data, which could leverage deep learning (DL). Such an approach should integrate seamlessly with robotic inspection systems to substantially enhance the quality and efficiency of large component inspection. Using DL would lead to shorter signal interpretation times and faster UT automation uptake in aerospace and other industries, where DL has been identified as a key requirement for transitioning from low to high levels of industrial automation [19].
Despite the potential benefits of applying DL techniques to ultrasonic signal analysis for composite components, its uptake has been limited [19]. A shortage of training data is one of the main challenges hindering research developments in this area. This shortage, combined with industrial concerns about the interpretability of DL models and their compliance with standards, has presented challenges for the effective use of DL techniques. As a result, the adoption of DL in UT signal analysis for composite components has been slow, despite its promising potential to enhance the accuracy and efficiency of defect detection and characterization.
Synthetic datasets are widely used in machine learning (ML) to augment small training datasets [20], and they have been successfully implemented for UT of composites with encouraging results for 2-D classification of C-scan images [21]. Part of this work builds upon that previous article by extending one of the synthetic data generation methods to make it applicable to full 3-D volumes. The synthetic datasets are based on simulations from semi-analytical physics-based software that has been shown to produce experimentally accurate defect responses [22], [23]. This software offers a less computationally expensive alternative to finite element analysis (FEA), allowing the simulation of composite responses based on bulk material properties [24].
When ML is used to interpret UT NDE data in the literature, it is typically applied to A-scan time traces or 2-D images constructed from A-scans [25], [26], [27], [28], [29], [30], [31]. Compared to B-scans, A-scans lack all spatial information and, since the introduction of phased arrays, they are nowadays rarely used alone by human operators to characterize defects. C-scans preserve detailed spatial information; however, constructing the 2-D image from the volumetric data necessitates the compression of temporal information. While C-scans excel at capturing intricate spatial details, this temporal compression results in minimal representation of through-depth features. Compression of A-scans to C-scans often removes useful features such as the backwall response, which can be important when detecting defects with a low reflective index such as porosity [32]. Furthermore, to produce C-scan images, appropriate gating must be applied to remove the front wall surface response, which can be challenging when trying to detect near-surface defects. In the aerospace industry, operators typically start with a C-scan to gain a complete picture of defect responses and then move to analysis of B-scans for further information about the nature of the responses [33], [34]. While current ML approaches in the literature make use of data in formats that are easily interpreted by humans (images or time traces), ML algorithms are not limited to image-level analysis and have proved very capable at interpreting 3-D volumetric data [35], [36]. By implementing algorithms capable of volumetric interpretation, all spatial and depth information is retained, which gives the algorithms more relevant features to learn from and removes the need for image preprocessing and gating.
Convolutional neural networks (CNNs) have been used effectively for decades in a wide variety of image and volumetric analysis tasks, with models such as the residual neural network (ResNet) typically having tens of millions of parameters [37] and still being widely used as backbones or standalone architectures [38]. However, these networks are typically applied to data of similar dimensions or data that have been scaled to give even dimensionality along each axis. UT data have extreme aspect ratios due to the differing sample-rate requirements in the spatial and time dimensions. Compressing the data in the time dimension to match the spatial dimension, normally dictated by the subaperture pitch and the scan acquisition rate, would result in a substantial loss of depth information. Alternatively, the spatial dimensions could be upscaled to match the number of samples in the time dimension, but this is highly inefficient, creating data instances that would require large amounts of memory and would make training intractable. Therefore, retaining the original dimensionality and aspect ratio of the UT data is highly preferable. Using CNNs to interpret images with high aspect ratios is not new, and the use of rectangular kernels instead of square kernels in CNNs has given positive results for classification of speech signals, which have high aspect ratios when represented as spectrogram images [39]. This article makes use of a similar approach for volumetric data.
Network architecture design is a key component of effectively leveraging ML techniques. Traditionally, network design heuristics and "rules of thumb" would be used in tandem with domain expert knowledge to construct a specific architecture. Automatic architecture design, or neural architecture search (NAS), is a development of this approach in which a practitioner can leverage compute to aid the process of architecture selection. This process, which can be considered a subset of hyperparameter optimization, generally involves an iterative process of selecting, training, and evaluating architectures. In its simplest form, a "random search" involves repeating the above process until some threshold or limit in terms of performance or computation budget is reached. More complex approaches to NAS often focus on efficient model evaluations, making use of proxy evaluation methods [40], [41] or efficient sampling algorithms [42], [43], attempting to make the largest improvement with each evaluation.
This article presents a comparative analysis of the performance achieved by three separate architectures for defect detection in volumetric ultrasonic data. The first, VoxNet [44], is prevalent in the literature for volumetric classification problems; the second presents modifications to VoxNet for this task using a traditional network design approach; and the third is an architecture discovered through NAS.
VoxNet is a 3-D CNN initially proposed for classification of light detection and ranging (LiDAR), red green blue depth (RGBD), and computer aided design (CAD) data. It has since been used as a backbone for different volumetric classification tasks [45]. In addition, notable contributions of this study to knowledge in the field encompass the introduction of domain-specific augmentations, which exert a substantial impact on classification performance. Furthermore, synthetic data generation techniques are leveraged from prior 2-D work to generate 3-D UT datasets from semi-analytical simulations, effectively addressing one of the prominent challenges encountered in the application of DL for NDE: the scarcity of effective training data.
This work presents a novel DL architecture designed to process volumetric UT data, in contrast to prior methods relying on time-series data or 2-D image-based approaches, which diminish spatial or temporal features while often requiring additional processing. The main contributions of this work are as follows.
1) Interpretation of volumetric UT data, instead of images or time signals. This reduces preprocessing requirements and allows the model to learn from a richer set of features.
2) Introduction of two domain-specific methods for data augmentation, helping with the domain transfer from synthetic to experimental data.
3) Discovery of a novel 3-D CNN architecture through NAS.

A. Pipeline Overview
In this work, automated data interpretation is simplified by inspecting the complete volumetric data, eliminating image processing steps such as gating to remove front and back wall responses while preserving all spatial and temporal information. While the models are trained on synthetic data, they are tested using experimentally collected UT data from samples with manufactured defects that aim to mimic delamination. Manufactured defects are commonly used in the literature to act as test cases and to qualify NDE techniques and operators where naturally occurring defects are not always available [6], [27], [28]. An overview of the simulation and DL pipeline is presented in Fig. 3, which also shows how NAS fits into this process.

II. METHODOLOGY

A. Experimental Data Collection
Experimental ultrasonic data were acquired from CFRP samples, both with and without artificially introduced defects, to serve as test data. To imitate delamination defects, which are among the most common defects in composites [31] and a significant life-limiting failure mode [46], flat-bottom holes were drilled from the backside of the samples. Prior to introducing defects, clean scans of each sample were taken to form a defect-free test set. The use of the same CFRP base sample ensured that the trained models learned defect-specific responses rather than the underlying properties of different composite samples. The ultrasonic data were acquired at room temperature using a robotically deployed unfocused linear phased array, an Olympus Inspection Solutions RollerFORM-5L64 [47], which had a central frequency of 5 MHz and was made up of 64 elements with a pitch of 0.8 mm and an elevation of 6.4 mm. The elements were driven at 100 V with a receiver gain of 22.5 dB, and the sample rate was 100 MHz. The pulse repetition frequency was set to collect a B-scan every 0.8 mm at a scan speed of 10 mm/s, which was controlled using a fully automated robotic system built around a KUKA KR 90 R3100 extra HA industrial robot (Fig. 5) [48]. Robotic scanning enabled the concatenation of encoded B-scans to form volumetric datasets. To ensure steady coupling of the roller probe to the surface of the component and consistent transfer of acoustic wave energy into the sample at different scanning positions, force-torque compensation was used to control the distance from the sample's surface with feedback from the force axis perpendicular to the sample. This was accomplished by integrating a Schunk GmbH & Company FTN-GAMMA-IP65 SI-130-10 force-torque sensor, mounted between the robot's flange and the roller probe, which ensured a constant scanning force of 70 N and maintained consistent tire compression throughout the scan. Water was used as an acoustic couplant in the scanning process. This data acquisition setup is widely used in industry and has been employed for data collection on large composite aerospace components [49].

B. CIVA Simulations
Due to the lack of available experimental training data, a simulated dataset was constructed for training. This was done using CIVA, a semi-analytical physics-based commercial NDE simulation software [50]. CIVA has the ability to accurately model wave propagation and its interaction with defects, which has been experimentally validated for UT [22], [23]. In addition, the software is computationally efficient when compared to alternatives such as FEA. Full control of the simulated domain enabled the modeling of defects and material properties similar to the experimental domain. However, the use of semi-analytical software instead of FEA had limitations, in that the software was unable to model responses from ply interactions and lacked the noise seen in experimental data. As a result, differences existed between the simulations and measured experimental responses, leading to the use of the synthetic data generation steps discussed in Section II-D to reduce the differences between the simulated and experimental domains. For further information on the difference between the simulated and experimental domains and the need for accurate synthetic data, please refer to the previous work on this topic [21].
To set up the simulation, the individual layers of composite were constructed and used to generate equivalent homogeneous material properties for the experimental CFRP samples. A single ply layer was constructed and alternated repeatedly, with eight layers at orientations of 0°, 45°, −45°, and 90°, to match the experimental sample as closely as possible. The resulting multilayer structure was homogenized into a homogeneous medium with mechanical properties equivalent to those of the multi-ply composite, with the fiber density set to 50% to best match the experimental sample's density of 1440 kg/m³.

To simulate the waveform, a sinusoidal wave of 5 MHz was employed, accompanied by a Hanning filter that provided a bandwidth of 66% at 12 dB.

For running multiple, sequential simulations, a parametric study was set up using the composite bulk properties previously calculated and varying the diameter and depth of defects. Flat-bottom hole defects were simulated with diameters from 3.0 to 15.0 mm, increasing every 0.5 mm, and with depths from 1.5 to 7.0 mm from the surface, in increments of 1.5 mm. A defect-free simulation was also run to provide the basis for defect-free synthetic data. Both the front and back wall surface reflections were included in the model. The full set of simulations took less than 15 h on a desktop computer with a 24-core 3.79-GHz CPU and 128 GB of memory.

C. Signal Processing and Dataset Generation
The resolution of the UT data in the array dimension was constrained by the element pitch, and the scan width was restricted by the number of elements in the array. This limited the inspection data to 64 voxels in the array dimension. To match this, 64 B-scans were selected in the scan dimension to create cuboidal datasets [Fig. 2(b)]. The distance between the elements was 0.8 mm, and the robotic scanning speed was regulated with the pulse repetition frequency to ensure a B-scan offset of 0.8 mm. This enabled the generation of volumes with square voxels in the spatial domains, along both the probe and scan directions. By utilizing this approach, a standardized volumetric resolution was achieved that was consistent throughout the dataset.
The data collected in both the experimental and simulated domains were in the form of radio frequency A-scans, also known as amplitude scans. To obtain 3-D volumetric datasets from these sources, a number of signal processing steps were performed. Initially, the A-scans were centered at zero amplitude and enveloped by taking the absolute value of the Hilbert transform, as shown in Fig. 6(a). The Hilbert transform was used to obtain the analytic signal, which is useful for calculating the instantaneous response of a time series. This is a standard signal processing technique used when generating C-scan images from time-series ultrasonic data. Subsequently, each volumetric dataset was normalized between 0 and 1 by dividing by its maximum peak amplitude.
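As a rough illustration, the sketch below applies these steps to a volume shaped (scan, array, time); the function name and axis convention are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_and_normalize(volume: np.ndarray) -> np.ndarray:
    """Zero-center each A-scan, envelope it with the Hilbert transform,
    and normalize the whole volume by its peak amplitude."""
    centered = volume - volume.mean(axis=-1, keepdims=True)  # remove dc offset per A-scan
    env = np.abs(hilbert(centered, axis=-1))                 # magnitude of the analytic signal
    return env / env.max()                                   # scale the volume to [0, 1]
```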
Once the data were normalized, the offset in the time domain was compensated for by aligning the peak front wall response to the origin. This ensured that features were correctly aligned in the time domain and helped to account for any variability in the acoustic path length between individual transducers and the surface of the sample. Fig. 6(b) shows how the time shifting was done for an individual A-scan with the Hilbert transform applied, and Fig. 7 shows the effect of this on the complete ultrasonic volume.
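A minimal sketch of this alignment is given below, operating on the enveloped volume from the previous sketch; the size of the front-wall search window is a placeholder assumption.

```python
import numpy as np

def align_to_front_wall(volume: np.ndarray, window: int = 200) -> np.ndarray:
    """Shift each enveloped A-scan so its front-wall peak sits at time zero."""
    aligned = np.zeros_like(volume)
    n_time = volume.shape[-1]
    for idx in np.ndindex(volume.shape[:-1]):            # every A-scan in the volume
        a_scan = volume[idx]
        shift = int(np.argmax(a_scan[:window]))          # front-wall peak index
        aligned[idx][: n_time - shift] = a_scan[shift:]  # move the peak to the origin
    return aligned
```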

D. Synthetic Data Generation Method
Our previous studies have shown that semi-analytical simulated data alone are not representative enough of the experimental domain [21]. Therefore, methods are needed for translating the simulated domain closer to the experimental domain. Fully statistical methods of generating noise are advantageous, as they can be resampled continuously to keep generating unique noise profiles that are in line with experimental data. In this article, we extend previous work on generating 2-D synthetic images and propose a new approach for adding noise to complete volumetric UT data.
The previous study [21] concluded that A-scan level noise was the best fully generative statistical method for adding noise. In addition, all the other approaches, except for the simulated A-scan noise, introduced noise at an image level, which is intractable for volumetric data. To adapt the methodology described in the original article to full volumetric data, unique noise profiles for each A-scan were generated and subsequently summed with the simulated responses past the front wall. To temporally align the responses, the time shift of the front wall response was performed both when generating the noise distributions from experimental data and when combining the simulated responses with the generated noise profiles [Fig. 6(b)].
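A sketch of this step is shown below. The per-sample noise statistics (here assumed Gaussian, with placeholder arrays `noise_mean` and `noise_std` estimated from experimental data) stand in for the statistical distributions detailed in [21].

```python
import numpy as np

rng = np.random.default_rng()

def add_ascan_noise(sim_volume: np.ndarray, noise_mean: np.ndarray,
                    noise_std: np.ndarray, front_wall: int = 0) -> np.ndarray:
    """Sum a freshly drawn noise profile with each simulated A-scan past the
    front-wall sample, so every A-scan receives a unique noise realization."""
    noisy = sim_volume.copy()
    for idx in np.ndindex(sim_volume.shape[:-1]):
        profile = rng.normal(noise_mean, noise_std)              # unique profile per A-scan
        noisy[idx][front_wall + 1:] += profile[front_wall + 1:]  # only past the front wall
    return np.clip(noisy, 0.0, 1.0)                              # keep the normalized range
```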
Fig. 8 shows an example of the addition of noise to simulated data at an A-scan level, and Fig. 9 demonstrates this for a complete ultrasonic volume.

TABLE I: SUMMARY OF THE DATASETS PRODUCED
The statistical noise distributions of the A-scans were calculated from a separate hold-out sample with the same layup and thickness as the test samples. For further details on building up the noise profiles, please refer to the previous work [21].
A summary of the datasets generated from the experimental and synthetic data is given in Table I.

E. Augmentation Methods for Synthetic Training Data
The generalizability of ML models is a critical aspect of their performance. One approach to improving generalizability is to augment the training data. Augmenting the training data makes the task more challenging by adding noise at the training stage, reducing the likelihood of overfitting and often improving performance in the target domain. This is particularly important when the target (experimental) domain differs from the training (synthetic) domain. As demonstrated in Fig. 9(a), there is little variation between simulated A-scans. However, this is not the case for experimentally acquired data. Received amplitudes are affected by the sensitivity of individual elements, variations of couplant on the surface of the sample, and the roughness of surface finishes (particularly for manufactured defects). The anisotropy of CFRP can also result in variations in attenuation, which impacts the received amplitudes. Surface roughness and local changes in fiber density due to the material's inherent anisotropy produce small changes in time of flight. Traditional augmentation methods, such as those used for images (e.g., crop, mix-up, and flipping), do not model these variations well and can produce unrealistic examples.
Therefore, in this study, we introduce two types of augmentation that were generated online for each minibatch during training. These augmentations aim to mimic the inter-element response variability observed within the UT probes used for data collection.
The first type of augmentation is related to the magnitude of the response measured by the UT elements, which varies due to many factors not included in the simulation, such as manufacturing tolerances of the sample and the UT array probe, wear and tear of the probe and electrical wires/connections, or interlayer multiple scattering of the sound waves. To mimic these while preserving the correct normalization, each A-scan was scaled by a constant past the front wall. The scale factor was sampled from a uniform distribution between 80% and 120%. An example of this is given in Fig. 10(a).
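A sketch of this scaling augmentation for a minibatch of volumes is given below (PyTorch assumed; the tensor layout is illustrative).

```python
import torch

def amplitude_scale(batch: torch.Tensor, low: float = 0.8, high: float = 1.2) -> torch.Tensor:
    """Scale each A-scan past the front-wall sample by a factor drawn
    uniformly from [0.8, 1.2]; batch layout (N, C, scan, array, time)."""
    factors = torch.empty(*batch.shape[:-1], 1).uniform_(low, high)  # one factor per A-scan
    out = batch.clone()
    out[..., 1:] = batch[..., 1:] * factors                          # front-wall sample fixed
    return out
```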
The second type of augmentation mimics changes in ultrasonic travel time seen by different elements. This can be caused by a variety of factors, such as variations in component sound speed due to the anisotropic nature of composites or departure from the central frequency for certain elements. To simulate this, 1-D interpolation was used to randomly stretch or compress the signal in the time domain. The dilation amount was randomly sampled from a uniform distribution for each A-scan, up to ±15 samples. An example of this is given in Fig. 10(b).
By introducing these augmentation methods, we aim to improve the generalizability of the models to the experimental domain. The online nature of these augmentations means that they can be easily incorporated into the training process without the need for additional data collection or preprocessing steps. To ensure a consistent length of data in the time domain, each A-scan was padded with zeroes to a length of 1024 samples during training.
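The dilation and padding steps might look as follows; the use of linear interpolation and the exact handling of the padded tail are assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def dilate_and_pad(a_scan: np.ndarray, max_shift: int = 15, length: int = 1024) -> np.ndarray:
    """Randomly stretch or compress one A-scan by up to ±15 samples via 1-D
    interpolation, then zero-pad (or truncate) to a fixed 1024 samples."""
    n = a_scan.shape[0]
    delta = int(rng.integers(-max_shift, max_shift + 1))  # dilation in samples
    new_axis = np.linspace(0.0, n - 1, n + delta)         # resampled time axis
    dilated = np.interp(new_axis, np.arange(n), a_scan)   # 1-D linear interpolation
    padded = np.zeros(length, dtype=a_scan.dtype)
    m = min(dilated.size, length)
    padded[:m] = dilated[:m]                              # zero-pad to a constant length
    return padded
```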

III. NETWORK ARCHITECTURES
In this article, we investigated the performance of three different 3-D CNN architectures for binary classification of 3-D defect and defect-free UT data with extreme aspect ratios.
The first 3-D CNN, VoxNet, was designed for similar volumetric classification tasks and acts as a baseline architecture. For low aspect ratios, CNNs (such as VoxNet) typically make use of square or cubic kernels, which are appropriate for their equal (or near equal) aspect ratios. The use of CNNs on data with more extreme aspect ratios is less common, and the aspect ratio is particularly extreme for UT data, where the ratio between the time and spatial dimensions is 16.
To overcome this challenge, a task-specific architecture was handcrafted by adapting VoxNet in a manner that follows the traditional approach to architecture design.This custom network is specifically designed to tackle the extreme aspect ratio problem and enhance the overall classification performance.
As an alternative to traditional architecture design, NAS was employed to develop a third architecture for comparison. For each model, the Adam optimizer [51] was used with a constant learning rate of 0.001, β1 of 0.9, and β2 of 0.999. A batch size of 8 was utilized in the training process. The chosen loss function was binary cross entropy, with a sigmoid activation function applied to the final layer to facilitate classification.
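A minimal PyTorch sketch of this training configuration is shown below; `model` and `loader` are placeholders for any of the three networks and a DataLoader built with batch size 8.

```python
import torch

def make_optimizer_and_loss(model: torch.nn.Module):
    """Adam with the stated hyperparameters; BCEWithLogitsLoss folds the
    final-layer sigmoid into binary cross entropy for numerical stability."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    return optimizer, torch.nn.BCEWithLogitsLoss()

def train_one_epoch(model, loader, optimizer, loss_fn) -> float:
    """One pass over the synthetic training set."""
    model.train()
    running = 0.0
    for volumes, labels in loader:
        optimizer.zero_grad()
        logits = model(volumes).squeeze(1)   # one raw score per volume
        loss = loss_fn(logits, labels.float())
        loss.backward()
        optimizer.step()
        running += loss.item()
    return running / max(len(loader), 1)
```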
Due to the small amount of experimental test data, there was a likelihood of noisy results during both the training and testing phases. To mitigate this, each model was trained ten times with varying random initializations, and the individual results were averaged across the performance metrics. This gives a better representation of each model's performance by averaging out any noisy results due to the small datasets.
During the training phase, a fixed validation set comprising 30% of the total test data was randomly selected from each class of experimental data. This set was used to monitor the model's performance and minimize the risk of overfitting. The models were trained with early stopping: the binary cross-entropy loss on the validation data was monitored for improvement with a patience of ten epochs, and if there was no improvement for ten consecutive epochs, the training process was halted. The model parameters with the lowest validation loss were used to evaluate the classification performance on the test set. This approach ensured that the final model's defect detection performance was evaluated using the parameters with the best ability to generalize to the target domain, as opposed to those of a model that had overfitted to the synthetic domain.
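A sketch of this early-stopping loop, reusing `train_one_epoch` from the previous sketch and an assumed `validation_loss` helper:

```python
import copy
import math

def fit(model, train_loader, val_loader, optimizer, loss_fn, patience: int = 10):
    """Train until validation BCE fails to improve for ten consecutive
    epochs, then restore the parameters with the lowest validation loss."""
    best_loss, best_state, stale = math.inf, None, 0
    while stale < patience:
        train_one_epoch(model, train_loader, optimizer, loss_fn)
        val_loss = validation_loss(model, val_loader, loss_fn)  # assumed helper
        if val_loss < best_loss:                                # improvement: reset patience
            best_loss, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
    model.load_state_dict(best_state)  # evaluate the test set with these weights
    return model
```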

A. Evaluation Metrics
To quantitatively assess the binary classification performance of each network, mean accuracy, precision, recall, and F1 scores were calculated according to the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)
Precision = TP / (TP + FP)   (2)
Recall = TP / (TP + FN)   (3)
F1 = 2 · Precision · Recall / (Precision + Recall)   (4)

where TP is the true positive count, TN is the true negative count, FP is the false positive count, and FN is the false negative count, with positives denoting the presence of a defect. Each result was individually averaged using a simple mean across the ten training cycles.
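The sketch below computes these metrics directly from raw confusion counts.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion counts, with a
    'positive' denoting the presence of a defect; zero division is guarded."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```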

B. VoxNet: Baseline Architecture
Introduced by Maturana and Scherer [44], VoxNet is a 3-D CNN designed to tackle classification problems for 3-D data that can be represented as voxels forming an occupancy grid. Originally tested on LiDAR, RGBD, and CAD data, it has since been used as the backbone for methods tested on ModelNet40 [45].
While the data from UT for this task differs from the datasets previously employed with VoxNet, the process of converting data into a voxel-based format within the VoxNet pipeline is well aligned to the 3-D representation of UT data.As a result, VoxNet was employed to establish baseline model performance metrics for this task.
VoxNet is constructed using two 3-D convolutional layers with cubic kernels, followed by a pooling layer and two fully connected layers (Fig. 11). For further details on the model, please refer to the original article [44]. VoxNet's total number of parameters is 235 M.
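For reference, a sketch of this layout is given below; the kernel sizes, channel counts, and hidden width follow the original VoxNet description and are assumptions in the context of this task (the large first fully connected layer is what drives the parameter count).

```python
import torch
import torch.nn as nn

class VoxNetSketch(nn.Module):
    """Two 3-D convolutions, one pooling layer, and two fully connected
    layers, as in the VoxNet layout described above."""
    def __init__(self, in_shape=(64, 64, 1024)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy volume
            n_feat = self.features(torch.zeros(1, 1, *in_shape)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(n_feat, 128), nn.ReLU(), nn.Linear(128, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```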

C. Hand Designed Architecture
The second architecture, referred to as CustomNet, demonstrates a conventional approach to architectural design. In this context, adaptations to VoxNet have been implemented to contemporize and enhance its performance specifically for the given task.
Our UT dataset stands apart from previous VoxNet datasets due to its higher dimensionality coupled with notable differences between the spatial and temporal dimensions. To effectively handle these unique attributes, adjustments were made to the model's architecture. Specifically, the number of convolutional layers was increased to enhance the extraction of meaningful features from the complex dataset. In addition, cuboidal kernels with nonuniform dimensions were employed in the initial four blocks (refer to the constant blocks in Fig. 12) of the model. This approach aimed to address the uneven dimensionality inherent in the data, ultimately equalizing the dimensions and contributing to a more robust feature representation throughout the network (Fig. 12). After this, a feature block with cube kernels of equal dimensionality could be used (refer to the feature block in Fig. 12). In the process of updating VoxNet, we incorporated convolutional layers for pooling instead of the previously employed max-pooling layers. In addition, the rectified linear unit (ReLU) was substituted with LeakyReLU, and batch normalization was introduced. To mitigate overfitting, dropout and global average pooling were employed to reduce the number of features for classification, avoiding the use of large fully connected layers. These modifications are geared toward improving the model's performance by incorporating contemporary practices that have shown substantial performance benefits, as highlighted in previous studies [52].
The final architecture is given by the diagram in Fig. 12. The total parameter size of the network was estimated to be 1.28 M parameters.
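The core idea of the constant blocks can be sketched as below: a kernel elongated in time plus a time-only strided convolution, so that four such blocks reduce a 64 × 64 × 1024 volume to 64 × 64 × 64 before cube kernels are applied. The channel counts and exact kernel sizes are illustrative, not taken from Fig. 12.

```python
import torch.nn as nn

def constant_block(cin: int, cout: int) -> nn.Sequential:
    """Cuboidal-kernel block sketch: extract features with a kernel that is
    longer along the time axis, then halve only the time dimension with a
    strided convolution (used here in place of max pooling)."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=(3, 3, 7), padding=(1, 1, 3)),
        nn.BatchNorm3d(cout),
        nn.LeakyReLU(),
        nn.Conv3d(cout, cout, kernel_size=(1, 1, 2), stride=(1, 1, 2)),  # time-only downsample
    )

# Four stacked blocks equalize the 16:1 time-to-space aspect ratio:
# 1024 / 2**4 = 64, matching the 64 x 64 spatial dimensions.
```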
D. NAS Discovered: 3-D ResNet-Based NAS

1) Neural Architecture Search: The final architecture was developed through NAS of a ResNet search space modified to account for 3-D convolutions and operations. One of the challenges in applying NAS to a new domain task is the design of the search space. For this task, a new search-space framework is introduced, utilizing a novel search space based on a ResNet-like structure. A fixed stem was used to downsample the data by a factor of 4 in the spatial dimensions and a factor of 8 in the time dimension, while aiming to retain information by increasing the number of channels to 64. A further downsample block with average pooling, followed by two to four residual blocks, was then searched individually. An overview of the structure can be seen in Fig. 13. The residual blocks and bottleneck features of the ResNet architecture are retained while searching the operations for each edge within the residual block. This provided a large diversity of architectures, which is key to attaining good performance in a novel application, while also ensuring that many networks conformed to successful design principles. Each residual block contained two fixed point-wise convolutions used to downsample and upsample the number of channels. Fig. 14 shows an example of a residual block denoting the searched and fixed operations.
These blocks were then stacked in groups, with the resolution downsampled between groups. Equation (5) gives the probability of a new group being created for each residual block; otherwise, the block was added to the current group. This makes groups unlikely to be extremely long or short.

The primitive operations of a search space are the list of operations that can be assigned to the edges of a network architecture. The implemented approach incorporated a standard set of operations commonly found in the NAS literature, comprising convolutions, pooling, and skip connections, which are widely recognized and utilized within the field. These operations were all 3-D due to the dimensionality of the data. In contrast to standard practice, which makes use of separable convolutions, the approach presented in this study deployed both depth-wise and point-wise convolutions as the fundamental convolutions within the search space. This significantly reduced the number of parameters in each operation of the architecture, greatly reducing the computational cost. Specifically, the depth-wise convolutions were applied with equidimensional cube kernels of size 3, 5, or 7, coupled with dilation values that ranged from 1 to 4. Skip connections, point-wise convolutions, and average and max pooling operations were also searched. For the pooling operations, equidimensional cube kernels of size 3, 5, or 7 with a dilation value of one were employed. The search also encompassed the Gaussian error linear unit (GELU) activation function and batch normalization, as well as the absence of activation and normalization operations. This allowed for architectures with fewer activation and normalization functions, which has been shown to be beneficial [52]. The searched downsample operation had a fixed kernel size of two in the spatial dimensions and four in the temporal dimension, with a dilation of one. Throughout the relevant operations, a stride of one was employed. A simple random search was applied to this search space for 80 iterations. Each model was evaluated using the validation dataset, with the lowest validation loss across training taken as the evaluation metric. For each searched architecture, a model was retrained with new initializations three times, and the mean evaluation metrics were used when selecting the discovered architecture; this ensured a more accurate estimate of model performance. Cross validation could not be used, as the combination of NAS and domain transfer would have resulted in data leakage between the NAS stage and the final model test evaluation stage. Fig. 15 provides an overview of the NAS process and demonstrates how the separation of the validation and test sets was maintained in the context of the complete model pipeline given in Fig. 3. The final discovered architecture had 1.03 M parameters and is given in Fig. 16, with the details of the residual blocks given in Fig. 17.
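A condensed sketch of this random search is shown below. The architecture encoding, the number of searched edges per block, and the helpers `build_model` and `train_and_validate` are illustrative assumptions; only the primitive set, kernel and dilation ranges, 80 iterations, and three retrainings per candidate come from the description above.

```python
import random

PRIMITIVES = ["skip", "avg_pool", "max_pool", "pointwise_conv", "depthwise_conv"]

def sample_edge() -> dict:
    """Draw one searched operation for an edge of a residual block."""
    edge = {"op": random.choice(PRIMITIVES),
            "activation": random.choice(["gelu", None]),  # GELU or no activation
            "norm": random.choice(["batchnorm", None])}   # batch norm or none
    if edge["op"] == "depthwise_conv":
        edge["kernel"] = random.choice([3, 5, 7])         # equidimensional cube kernels
        edge["dilation"] = random.randint(1, 4)
    elif edge["op"].endswith("pool"):
        edge["kernel"] = random.choice([3, 5, 7])         # pooling uses a dilation of one
    return edge

def sample_architecture() -> list:
    """Two to four residual blocks, each with a few searched edges."""
    return [[sample_edge() for _ in range(3)]
            for _ in range(random.randint(2, 4))]

# Random search: keep the candidate with the lowest mean validation loss over
# three retrainings with fresh initializations. Assumed helpers:
# build_model(arch) -> nn.Module, train_and_validate(model) -> float val loss.
best_arch, best_loss = None, float("inf")
for _ in range(80):
    arch = sample_architecture()
    losses = [train_and_validate(build_model(arch)) for _ in range(3)]
    if sum(losses) / 3 < best_loss:
        best_arch, best_loss = arch, sum(losses) / 3
```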

IV. RESULTS
Table II provides average confusion matrices for the test results of the VoxNet, CustomNet, and NAS discovered models. Table III presents a summary of each method's performance, displaying the mean and standard deviation across various performance metrics. The architecture discovered by NAS consistently produced ideal results when trained using data augmentation, with a mean classification accuracy of 1.00 and a standard deviation of 0.00 across the ten separate training iterations, demonstrating high confidence in the model's conclusions and a robust design for the target domain.
Table IV demonstrates the impact of data augmentation on the best-performing NAS model. Discarding data augmentation completely during training had a significant impact on classification performance, with a 22.4% drop in mean accuracy and a standard deviation increase of 17.8%, which demonstrated a significant reduction in statistical confidence. While the addition of amplitude scaling augmentation alone improved the mean accuracy, it was only by 3%. This demonstrates the importance of using both augmentation methods in parallel for increased generalizability to the experimental domain.
Table V provides a summary of model sizes and inference times for a single batch of test data. The NAS discovered architecture has a total size 16.7% smaller than CustomNet and only 5.2% the size of VoxNet, while CustomNet was 12 times faster at inference than the next fastest network, VoxNet.

V. DISCUSSION
VoxNet demonstrated that it was able to learn features from synthetic data and performed reasonably well on experimental data, with a mean F1 score of 0.825. However, its significant standard deviation in accuracies between training instances demonstrates that the architecture was not well optimized for the problem. CustomNet improved on the accuracy of VoxNet substantially, by 14.8%, while also reducing the standard deviation of the results by 8.9%, which indicated more consistent generalizability to the experimental domain. This illustrates the benefit of tailoring architectural modifications to the needs of specialized tasks. The experimental results demonstrated that the architecture discovered by NAS greatly outperformed the other two in terms of classification accuracy. While none of the models used in this article are large, and all are considerably smaller than typical 2-D ResNets and other CNNs [37], the NAS model was able to achieve the highest performance with a significantly lower model size, at only 5.2% the memory requirement of VoxNet. The black-box nature of DL makes it difficult to specify which features lead to this improvement in performance. This complexity is in fact a large motivator for NAS, as the design space is too large for a human to efficiently find an optimal network architecture. The authors believe that the addition of skip connections and the ability to vary operations at different depths, added by the NAS, had a significant positive impact on performance. This demonstrates the importance of utilizing NAS to optimize CNNs.
Due to its large fully connected layers, VoxNet has a far greater number of total parameters than the other two networks, resulting in a model that occupies far more memory. While CustomNet and the NAS discovered architecture have a comparable number of parameters, the discovered network is far smaller. This is a result of many of its operations being far more efficient, such as the separation of point-wise and depth-wise convolutions. While the discovered architecture is the smallest, its inference takes the longest due to the greater architectural complexity of the model and its operations. That said, all models have acceptable inference times and can process eight samples in under half a second. However, CustomNet is notably 12 times faster at inference than the second fastest network, VoxNet, which could be an advantage in some industrial settings.
When trained without data augmentation, the NAS model performed significantly worse. Furthermore, the performance was only slightly improved by adding amplitude scaling augmentation alone. For the best performance, both augmentation methods were needed in combination. This indicates that, despite accurate synthetic data generation, data augmentation still has a significant role in producing generalizable models for the experimental domain.
While ideal classification was achieved consistently for the discovered architecture when trained with data augmentation, this was tested on the detection of manufactured defects only. Specifically, back-drilled holes are perpendicular to the propagating sound wave and act as ideal reflectors. This makes them comparably easier to detect than other defects. While samples with naturally occurring defects are challenging to access, in future work, the authors aim to expand the simulation scope and test the models on naturally occurring defects, which will likely prove more challenging to detect. For more challenging detection and characterization tasks, a more sophisticated search optimization algorithm could be employed to discover architectures more efficiently.
The achieved classification results suggest that the synthetic data generation process is a viable approach for producing fully synthetic 3-D UT volumetric datasets that closely map to the experimental domain and enable the development of effective classifiers. However, given the substantial improvement in classification performance achieved through the data augmentation methods, it is important to acknowledge that disparities between the synthetic and experimental domains persist. This observation underscores the necessity of augmentation techniques to further enhance the generalizability of the model. Nonetheless, the data augmentation methods employed in this study proved to be highly effective in aiding not only generalizability but also the transfer of knowledge across domains.
The key benefits of analyzing the complete 3-D volumetric data instead of processed images were the ability to learn from a richer set of features, the reduction in preprocessing requirements, and the potential reduction in inference time by analyzing the complete volume all at once. The impact on inference time is challenging to quantify; however, comparing against the compute required to process 64 B-scan images (the equivalent spatial scan data) without parallelization for equivalent 2-D classifiers, there is the potential for up to a 64 times saving in inference time for the same scan area. Despite these advantages, there are still potential benefits to analyzing UT data as images. One of these is the multiple opportunities for detection of a single defect across B-scans. It is likely that defects will span multiple B-scan images, and as such, by analyzing each B-scan, there are multiple chances to detect an individual defect. This means that an individual defect can still be detected even if individual defective images are incorrectly classified. However, the opportunities for characterization and localization of defects are far greater when retaining the volumetric spatial information, and this work opens future prospects for 3-D classification and segmentation, which would be much more challenging using C-scans or B-scans alone.
The research outcomes demonstrated the considerable potential of employing 3-D CNNs in conjunction with well-designed data augmentation techniques and optimized architecture search spaces to address challenging 3-D classification tasks characterized by extreme aspect ratios, as observed in the context of UT. Insufficient utilization of data augmentation severely hampered the model's ability to generalize to experimental datasets, leading to suboptimal classification performance. Likewise, choosing an unsuitable model architecture could result in a failure to capture crucial features necessary for accurate classification. Consequently, it is imperative to thoroughly consider both aspects during the design of a classification model for 3-D UT data to ensure optimal performance.

VI. CONCLUSION
DL has demonstrated prior success in ultrasonic NDE when applied to either time-series or image data. However, analyzing only time-series or image data can result in a significant loss of information in either the temporal or spatial domains. This article proposes the use of 3-D CNNs to classify complete volumetric ultrasound data without compression, retaining all spatial and temporal information. This approach not only reduced the need for accurate gating when constructing C-scan images but also decreased the amount of signal processing required. To train the models, synthetic data were generated from semi-analytical simulations, while experimentally collected ultrasonic responses from manufactured defects were used for testing. Two forms of data augmentation were implemented based on physical variations seen in experimental ultrasonic responses to improve classification performance in the experimental domain. Furthermore, the performance of three different architectures, one existing in the literature, one hand-designed based on current practices, and one designed by NAS from a ResNet space modified for 3-D, was compared.
The first architecture, VoxNet, performed reasonably well on experimental data, achieving a mean F1 score of 0.825. However, its notable standard deviation in accuracies during training suggests suboptimal architecture optimization for this task. CustomNet greatly improved on VoxNet with an accuracy increase of 14.8% while reducing the standard deviation in accuracy by 8.9%, demonstrating an architecture better optimized for this task.
The third architecture, designed by NAS, gave the best results when trained with data augmentation, providing 100% classification accuracy. The impact of online domain-specific augmentation was notable, with a 22.4% decrease in mean accuracy for the NAS model when augmentation was omitted.
Overall, this work demonstrated that it is possible to train successful DL models to classify full volumetric ultrasonic data for NDE. The issue of a lack of data in most NDE situations was addressed by successfully implementing synthetic data generation in 3-D. The work highlighted the importance of appropriate architecture selection and effective data augmentation when translating between synthetic and experimental domains, with both factors essential in achieving high classification accuracy.
The focus of this work was on the use of volumetric datasets, and while 100% classification accuracy was achieved through effective NAS, the authors recognize that back-drilled holes are generally simple defects for human operators to detect. Future work aims to increase the complexity of the task by detecting a wider range of more challenging defects and expanding the simulation scope to better cover naturally occurring defects, where performance can be measured against human operators in a more realistic industrial scenario. Further work also aims to extend the problem to defect classification and sizing.

Manuscript received 19 December 2023; accepted 9 January 2024. Date of publication 12 January 2024; date of current version 27 February 2024. This work was supported by the Spirit AeroSystems/Royal Academy of Engineering Research Chair for In-Process Non-Destructive Testing of Composites under Grant RCSRF 1920/10/32. (Corresponding author: Shaun McKnight.)

Fig. 1. Demonstration of how individual probe elements can make up a linear phased array that can produce B-scan and C-scan images.

Fig. 2. (a) Representation of how A-scans are stacked to form B-scans. (b) How B-scans are stacked to create a full UT volume.

Fig. 3. Overview of the pipeline for automated volumetric UT classification.
Composite samples measuring 254.0 × 254.0 × 8.6 mm (W × D × H) were provided by Spirit AeroSystems and were manufactured to the BAPS 260 specification with woven fabric and Cycom 890 resin using a vacuum-assisted resin transfer molding process. Of the three samples, two contained defects. The first sample contained 15 flat-bottom holes measuring 3.0, 6.0, and 9.0 mm in diameter, with each defect diameter drilled to depths of 1.5, 3.0, 4.5, 6.0, and 7.5 mm from the front surface. The different defect sizes were spaced 30 mm apart, with different depth defects spaced 35 mm apart. The second sample contained 25 flat-bottom holes, drilled to the same depths as the first sample but with additional defect diameters of 4.0 and 7.0 mm, as shown in Fig. 4. All defects were manufactured to tolerances of ±0.3 mm in depth and ±0.2 mm in diameter.

Fig. 5. Overview of the experimental scan setup of KUKA KR90, force-torque sensor, and ultrasonic roller probe used for data acquisition.

Fig. 6. (a) Example of relative amplitude response from simulations, normalized signal, and Hilbert transform applied to the original signal. (b) Demonstration of how individual A-scans are time-shifted to the front wall response.

Fig. 7. (a) Volumetric data with Hilbert transform applied only. (b) Volumetric data with time shifting to the central response of the front wall peak. Both figures have been thresholded to remove the lowest 10% of amplitudes to aid visual clarity.

Fig. 8. (a) Frame of 64 simulated A-scans for a simulated defect response. (b) Corresponding A-scans with synthetically added noise for the same defect response.

Fig. 9. (a) Complete ultrasonic volume of simulated A-scans for a defect response. (b) Corresponding synthetically noised volume for the same defective response. Both figures have been thresholded to remove the lowest 10% of amplitudes to aid visual clarity.

Fig. 10. (a) Example of how scaling augmentation is applied to an individual A-scan. (b) Example of how dilation augmentation and padding are completed for an individual A-scan.

Fig. 11. VoxNet architecture. Here, Conv(f, d, s) indicates the number of filters f, filter size d, and stride s of the convolutional layer.

Fig. 15. Overview of the process for NAS implementation.

TABLE II: AVERAGE CONFUSION MATRICES FOR VOXNET, CUSTOMNET, AND THE NAS DISCOVERED ARCHITECTURE

TABLE III: COMPARISON OF CLASSIFICATION RESULTS ACROSS THE DIFFERENT ARCHITECTURES. THE MEANS AND STANDARD DEVIATIONS ARE PRESENTED AS MEAN ± STD

TABLE IV: COMPARISON OF THE EFFECTS OF DATA AUGMENTATION ON THE NAS DISCOVERED ARCHITECTURE. THE MEANS AND STANDARD DEVIATIONS ARE PRESENTED AS MEAN ± STD

TABLE V: COMPARISON OF MODEL SIZES AND INFERENCE TIME FOR EACH ARCHITECTURE