Explainable Deep Learning System for Advanced Silicon and Silicon Carbide Electrical Wafer Defect Map Assessment

The recent increase in demand for Silicon-on-Chip devices has had a significant impact on the industrial processes of leading semiconductor companies. The semiconductor industry is redesigning its internal technology processes to optimize costs and production yield. A key role in achieving this target is played by the intelligent early identification of wafer defects. The Electrical Wafer Sorting (EWS) stage allows an efficient analysis of wafer defects by processing the visual map associated with the wafer. The goal of this contribution is to provide an effective solution for the automatic evaluation of EWS defect maps. The proposed solution leverages recent deep learning approaches, both supervised and unsupervised, to perform a robust classification of EWS defect patterns in different device technologies, including Silicon and Silicon Carbide. The method embeds an end-to-end pipeline for supervised EWS defect pattern classification, including a hierarchical unsupervised system to assess novel defects in the production line. The implemented “Unsupervised Learning Block” embeds ad-hoc designed Dimensionality Reduction combined with Clustering and Metrics-driven Classification Sub-Systems. The proposed “Supervised Learning Block” includes a Convolutional Neural Network trained to perform a supervised classification of the Wafer Defect Maps (WDMs). The proposed system has been evaluated on several datasets, showing effective performance in the classification of the defect patterns (average accuracy of about 97%).

it with the production yield [6].

One of the most used approaches to characterize production defects in semiconductor wafers is based on the visual analysis of the defect maps at the Electrical Wafer Sorting (EWS) stage, in which a series of electrical conformance tests is performed (short-circuit tests, leakage, parasitic capacitance, and so on) [7].

Specifically, the EWS binarized Wafer Defect Maps (WDMs) are considered an excellent tool for identifying predictive markers of production yield or of issues in the upstream manufacturing lines.

More in detail, the binarized WDMs are obtained at the end of the Front-End manufacturing process (Fig. 2), where the designed devices are embedded in disc-shaped wafers and tested by a probing machine. The probing machine verifies the device functionality through electrical tests, assigning a test-outcome color to each device and distinguishing fully, partially or not working devices, thereby creating a defect map. The binarization of WDMs consists in assigning the white color (value ''1'') to partially or not working devices, while the black color (value ''0'') is assigned to fully working devices and background.

FIGURE 2. Front-end pipeline description for binarized WDM generation.

The pipeline herein proposed is based on the wafer defect analysis at the end of the Front-End manufacturing, i.e., when the binarized WDM has been generated. Therefore, from a careful monitoring of the so generated WDM, semiconductor manufacturers will be able to build correlation models with the issues upstream in the production lines or to predict the impact of a specific defect pattern on the production yield, defining proper recovery policies.

The main contribution of this work is the development of a deep pipeline for a robust and intelligent classification of defect patterns both in Silicon (Si) technology and in the production of Silicon Carbide (SiC) devices.
In subsequent development (currently being designed) we will deal with the correlation between the classified wafer defect patterns and the issues upstream in the production process, and therefore with the related yield.

This work is arranged into three main sections: related works, where several approaches to the defect pattern recognition problem are briefly described; materials and methods, where the proposed approach is discussed from mathematical and computational perspectives; and experiments and results, in which the performance and benchmark comparisons of the designed approach are outlined. The final section also includes a description of the delivered tool, named STAI-EWS. This tool embeds the pipeline described in this contribution and is currently in use in Silicon and Silicon Carbide technology production lines.

An interesting approach has been presented in [16] by proposing an Ensemble Convolutional Neural Network based

In [17] researchers proposed a Gaussian Mixture of Variational Autoencoder (GMVAE) which extracted visual features from the source WDM; by means of an ad-hoc Dirichlet process, they were able to provide a robust WDM clustering. This approach has been benchmarked against traditional Bayesian non-parametric models using the adjusted Rand index (ARI) and the adjusted mutual information (AMI) as measures of similarity between clusters, obtaining 0.76 as the highest value for both ARI and AMI.

In [18] the authors proposed a statistical pre-processing technique on a custom dataset containing 6 wafer lots, consisting of a binarization of the wafer maps, filling the inner test points of the wafer with the surrounding median value, and reducing the noise with a median filter.
At the end of the pre-processing stage, variational autoencoders are used as feature extractors to decompose high-dimensional wafer maps into a low-dimensional latent representation. Finally, traditional K-means or hierarchical clustering was applied and the similarity was evaluated with the Silhouette Score. Unfortunately, the mentioned authors provided only the 2D latent plot representation of their method, without any performance metric.

The authors of [20] proposed a combination of three techniques: distributed K-Means++ for clustering, statistical pattern mining by FPGrowth [21], and finally a deep classifier based on a 5-layer CNN backbone, in order to obtain a robust defect map classification of a custom input wafer dataset. The method seems very promising, as they reported an F1-score of 95.00%.

The authors of [22] proposed a Stacked Convolutional Sparse Denoising Auto-Encoder (SCSDAE) in which the designed convolutional layers were used to extract wafer visual features. The collected features are then processed by the auto-encoder part of the architecture in order to retrieve an internal unsupervised latent representation of those features, suitable for a robust feature-related defect clustering. The method showed 95.13% accuracy using 5-fold cross-validation.

A promising approach has been shown in [23], in which the authors proposed an approach based on dimensionality reduction of the input defect map distribution followed by autoencoder-based processing. Specifically, the input defect maps were fed to Principal Component Analysis (PCA), which extracts features. The so collected features will be processed

The second part of the proposed pipeline is composed of the ''Supervised Learning Block'', structured with an ad-hoc designed Convolutional Neural Network trained to perform a supervised classification of the resized input WDMs.

As introduced, in common semiconductor production lines there is a concrete need to correctly identify and classify defect patterns as predictive markers of manufacturing issues and production yield. Furthermore, it becomes necessary to characterize novel and unknown defect patterns related to new issues in the upstream production process, which need to be properly investigated.

The classical manufacturing issues which produce wafer defects mainly concern failures, impurities or degradation of the production lines [27], [28]. For the work herein described, it is worth mentioning the case of Silicon Carbide (SiC). The SiC-based manufacturing pipelines show defect patterns which are usually significantly different from the silicon-based ones (a more detailed description of the datasets can be found in IV-A1 and IV-B1). For this reason, an advanced ''unsupervised'' pipeline suitable for identifying new defect patterns is investigated. In this way, the herein proposed pipeline will be able to catch new issues in the upstream production lines through a hybrid approach (unsupervised/supervised) that will be able to correctly characterize defect patterns.

Each of the designed parts of the proposed full pipeline will be described in the following sub-sections.

The designed Unsupervised Learning Block is composed of four sub-systems: Resize and Filter, Dimensionality Reduction, Hierarchical Clustering and Metrics-driven Classification. Each of the mentioned sub-systems will be described in detail.

The input of this sub-system is the set of high-resolution binarized WDMs (usually at the classical wafer dimension, i.e., 20,000 × 20,000 spatial resolution), resized (using the bicubic algorithm [29]) to an ad-hoc reduced spatial dimension by the resize block. From our internal investigation, an optimal resolution for the herein analyzed application is 61 × 61. However, the spatial resolution resizing does not have any significant impact on the overall performance of the proposed unsupervised pipeline, to the extent that the defect pattern information is preserved. Due to the adopted Resize Block, each pixel of the processed WDM image no longer represents a single die (device) but may represent a set of dies (devices), according to the adopted photo-lithography process [30].

Before applying the dimensionality reduction and hierarchical clustering techniques as reported in Fig. 3, a Filter Block is preliminarily applied to the resized input WDM. This filter discards defect maps whose patterns have a low impact on the upstream production issues. Specifically, a wafer map showing few defective dies (i.e., the so-called ''Spot Wafer Map'' as in Fig. 4a) or a defect map with no defective dies (i.e., the so-called ''Empty Wafer Map'' as in Fig. 4b) will be discarded, as such maps do not produce any significant impact on the production lines but only add to the computational cost of the pipeline. In order to characterize the defect maps as ''Spot'' or ''Empty'', ad-hoc thresholds have been defined.
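The Filter Block can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source says that ad-hoc thresholds exist but does not report them, so the value `SPOT_MAX_DEFECTS` and the function names below are hypothetical.

```python
from typing import List

# Binarized WDM: 1 = partially/not working die, 0 = working die or background.
Wafer = List[List[int]]

# Hypothetical threshold; the paper only states that ad-hoc thresholds are used.
SPOT_MAX_DEFECTS = 5


def count_defects(wdm: Wafer) -> int:
    """Number of pixels flagged as defective (value 1)."""
    return sum(sum(row) for row in wdm)


def classify_for_filter(wdm: Wafer, spot_max: int = SPOT_MAX_DEFECTS) -> str:
    """Return 'empty', 'spot' or 'keep' for one resized WDM."""
    defects = count_defects(wdm)
    if defects == 0:
        return "empty"
    if defects <= spot_max:
        return "spot"
    return "keep"


def filter_batch(batch: List[Wafer], spot_max: int = SPOT_MAX_DEFECTS) -> List[Wafer]:
    """Keep only the maps that proceed to dimensionality reduction."""
    return [w for w in batch if classify_for_filter(w, spot_max) == "keep"]
```

Counting defective pixels is the simplest possible criterion; a production filter could equally use a defect ratio relative to the wafer area.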

As reported in Fig. 3 the introduced Resize and Filter sub-system will enable dimensionality reduction and

the so reshaped vector) will represent a specific dimension in the related high-dimensional space, for a total of 3,721 dimensions. These samples are known as data-points. This dimensionality reduction approach is a key process of the unsupervised clustering of WDMs. For this reason, the unsupervised sub-system was designed to process batches of WDMs, specifically the whole set of wafers produced at each production cycle (in our tests, 350 wafer maps based on Silicon Carbide technology were processed, on average, per weekly cycle).

The UMAP algorithm is based on the following main parts: High-Dimensionality-to-Graph Block and Graph projection Block.

The target of this block is to build a weighted graph associated to the input set of WDMs (high-dimensional space). Let us introduce the mathematical assumptions needed to reduce the dimensionality of the high-dimensional space associated to wafer maps. Specifically, the authors have assumed that data-points are uniformly distributed over the input high-dimensional space. Considering that this assumption is not always satisfied in a real application, we have applied a Riemannian metric (G_r) that allows the input data-points to be considered as uniformly distributed in the input space, thus making

in the input high-dimensional space. This circle radius can

are able to connect more or less data-points to the graph. An instance of the so generated 2D simple graph is reported in Fig. 6.

After that, we normalize the measure of distance between the edges in the graph (i.e., the weights) by associating a fuzzy topological representation of the graph, in which distance values range between zero and one [31].

The use of UMAP offers considerable advantages (compared to classical dimensionality reduction techniques such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and t-Distributed Stochastic Neighbor Embedding (t-SNE)), as it preserves the global and local features of the high-dimensional input space in the projected low-dimensional space by optimizing the degree of dimensionality for feature representation.

The second step of the UMAP algorithm is the projection of the input high-dimensional graph into a low-dimensional one. Basically, with this step the authors want to build a new low-dimensional weighted graph by optimizing a cross-entropy-based function that embeds the weights associated to the edges of both graphs (the input high-dimensional one and the projected one, to be defined by the optimization process). Eq. 1 reports the adopted cross-entropy function of (µ, υ, A), where A is a reference set, i.e., the set of the input high-dimensional wafer defect data-points, and µ and υ are the related weights defined in α → [0, 1] due to the mentioned fuzzy representation. A connected graph is associated with the so created low-dimensional space (as a result of the previous optimization). The output of UMAP processing is then a set of features

Silicon Carbide WDMs is reported in Fig. 7.
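As a hedged illustration of the cross-entropy objective referred to as Eq. 1, the sketch below uses the fuzzy-set cross-entropy commonly given in the UMAP literature. It is an assumption that Eq. 1 has exactly this form; the function name, the dictionary representation of µ and υ, and the clipping constant `eps` are hypothetical choices made for this example.

```python
import math
from typing import Dict, Hashable


def fuzzy_cross_entropy(
    mu: Dict[Hashable, float],
    nu: Dict[Hashable, float],
    eps: float = 1e-12,
) -> float:
    """Cross-entropy between two fuzzy sets defined on the same reference set A.

    mu: membership weights in [0, 1] of the high-dimensional graph.
    nu: membership weights in [0, 1] of the low-dimensional graph.
    Elements missing from nu are treated as having weight 0.
    """
    total = 0.0
    for element, m in mu.items():
        n = nu.get(element, 0.0)
        # Clip away from 0 and 1 so that the logarithms stay finite.
        m = min(max(m, eps), 1.0 - eps)
        n = min(max(n, eps), 1.0 - eps)
        total += m * math.log(m / n) + (1.0 - m) * math.log((1.0 - m) / (1.0 - n))
    return total
```

The first term rewards keeping strongly connected data-points close in the projection, while the second penalizes placing weakly connected data-points close together. The value is zero when both graphs carry identical weights.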

The so obtained low-dimensional data-points will be fed as

while the data-points outside the ''r-neighborhood'' are defined as ''noise''. After that, we leveraged the following definitions related to core-objects:

Definition D_1: Two core-objects are considered r-reachable if the data-points in the related core-objects are nested all together;

Definition D_2: N core-objects are density-connected if they are directly or transitively r-reachable;

Definition D_3: A cluster (C) can be defined, with respect to its radius (r) and smoothing factor (m_p), as a non-empty subset of density-connected core-objects.

We can also define other properties related to the distance between core-objects:

Definition D_4: The core distance d_core of a core-object a_p (with reference to its radius r and smoothing factor m_p) is the distance between a_p and its nearest neighbor in m_p;

Definition D_5: A core-object is considered an r-core-object if the correlated radius r is greater than or equal to the core distance of a_p.

After the definition of the core-objects, HDBSCAN provides an internal graph reconstruction starting from the input low-dimensional data-points and the previously defined core-objects. This graph is usually named the Mutual Reachability Graph and it is defined as follows:

Definition D_6: The Mutual Reachability Graph is a weighted graph with the data-points configured as graph vertices, while for each edge (data-point connection) an ad-hoc weight is defined as a measure of the mutual reachability distance of the related data-points.

Definition D_7: The mutual reachability distance d_mr is defined as the maximum among the core distance of a_p, the core distance of a_q and the distance between the two core-objects a_p and a_q. Eq. 2 gives the mathematical representation of d_mr:

$$ d_{mr}(a_p, a_q) = \max\{\, d_{core}(a_p),\ d_{core}(a_q),\ d(a_p, a_q) \,\} \qquad (2) $$

At this point, HDBSCAN provides a mutual reachability graph by connecting core-objects and by weighting each connection through the mutual reachability distance d_mr.
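Definitions D_4 and D_7 can be illustrated with a short Python sketch. It assumes the usual HDBSCAN reading in which the core distance of a point is the distance to its m_p-th nearest neighbour, and it uses the Euclidean distance; both choices, as well as the function names, are assumptions made for illustration.

```python
import math
from typing import Sequence

Point = Sequence[float]


def core_distance(points: Sequence[Point], idx: int, m_p: int) -> float:
    """Distance from points[idx] to its m_p-th nearest neighbour (itself excluded)."""
    dists = sorted(
        math.dist(points[idx], q) for j, q in enumerate(points) if j != idx
    )
    return dists[m_p - 1]


def mutual_reachability(points: Sequence[Point], p: int, q: int, m_p: int) -> float:
    """Maximum of the two core distances and the direct distance between p and q."""
    return max(
        core_distance(points, p, m_p),
        core_distance(points, q, m_p),
        math.dist(points[p], points[q]),
    )
```

For the three collinear points (0, 0), (1, 0) and (5, 0) with m_p = 1, the isolated third point has core distance 4, so its mutual reachability distance to the second point is 4 even though the two core distances of the first pair are only 1. Sparse points are thereby pushed away from dense regions.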

Through ad-hoc thresholding applied to the overlapping edges of the mutual reachability graph, the connection scheme of the graph can be re-configured by optimizing its connection complexity. To do that, HDBSCAN embeds the usage of a Minimum Spanning Tree (MST) approach [35], [38]. MST re-configures and reduces the complexity of the input densely connected graph through a classical graph-theory approach, which provides a new graph with a minimal set of edges that connects all the components. An instance of an MST-optimized graph associated with an input set of Silicon Carbide WDMs is reported in Fig. 8.
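The MST reduction can be sketched with Prim's algorithm. The source does not state which MST algorithm is used, so Prim's algorithm is an assumption here, as are the adjacency-dictionary representation and the function name.

```python
import heapq
from typing import Dict, List, Tuple

# graph[u][v] = weight of the undirected edge (u, v), e.g. a mutual reachability distance.
Graph = Dict[int, Dict[int, float]]


def minimum_spanning_tree(graph: Graph) -> List[Tuple[int, int, float]]:
    """Return the MST edges (u, v, weight) of a connected undirected graph."""
    if not graph:
        return []
    start = next(iter(graph))
    visited = {start}
    heap = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(heap)
    tree: List[Tuple[int, int, float]] = []
    while heap and len(visited) < len(graph):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # this edge would close a cycle
        visited.add(v)
        tree.append((u, v, w))
        for nxt, w_next in graph[v].items():
            if nxt not in visited:
                heapq.heappush(heap, (w_next, v, nxt))
    return tree
```

A graph with n connected vertices is reduced to exactly n − 1 edges, which is the minimal set of connections mentioned in the text.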

The target of the unsupervised pipeline which embeds UMAP and HDBSCAN is to provide a final hierarchical structure which highlights the key groups of clusters associated to the input set of similar wafer defect patterns. Based on the performed analysis, we have obtained a non-hierarchical, MST-optimized and densely connected weighted graph. Therefore it is necessary to construct from this graph a hier-

clusters (C_i) are split or merged according to the density value λ_i, where eligible clusters are the ones that survive the changes in the density λ.

Eq. 3 reports the integral equation related to the ''excess of mass''. Fig. 10 shows an instance of the excess of mass approach on the probability density function of the clusters.

Through the approach described by Eq. 3 and in Fig. 10, the authors were able to retrieve the optimized number of clusters C_i through the density λ_i. For instance, the two meaningful clusters C_1 and C_2 are merged at the corresponding minimum density level λ_min related to C_1 and C_2, creating cluster C_3; the merged cluster C_3 will then be merged with another cluster C_4 according to the minimum density level λ_{i+1}, and so on.

Finally, the output of the HDBSCAN sub-system is a set of core-objects representing the final set of optimized clusters. These clusters are re-mapped back to the UMAP block in order to associate them with the source data-points. An instance of the mentioned UMAP re-mapping is reported in Fig. 11.
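For reference, the excess of mass of a cluster is usually written in the HDBSCAN literature as shown below. This is a sketch under the assumption that Eq. 3 follows that standard form; f(x) denotes the estimated density and λ_min(C_i) the density level at which cluster C_i first appears, and both symbols are introduced here for illustration. The second line is the discrete counterpart commonly used in practice (often called cluster stability), where λ_max(x_j, C_i) is the level beyond which the data-point x_j no longer belongs to C_i.

```latex
E(C_i) = \int_{x \in C_i} \bigl( f(x) - \lambda_{\min}(C_i) \bigr)\, dx

S(C_i) = \sum_{x_j \in C_i} \bigl( \lambda_{\max}(x_j, C_i) - \lambda_{\min}(C_i) \bigr)
```

Under this reading, a cluster is retained when its own value exceeds the combined value of the clusters it would be split into, which matches the idea of clusters surviving the changes in λ.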

In Fig. 11, clusters related to input defect maps (in Silicon Carbide technology) with significantly similar features have been highlighted and grouped by color.

At the end, the set of optimized clusters will be processed by the following Metrics-driven Classification Sub-System.

In detail, the target of this sub-system is to assess the matching between the identified defect map clusters (from UMAP and HDBSCAN) and the well-classified defect classes stored in the database available in the pipeline.

To do that, we have integrated the K-Means approach [40] with the target of retrieving only one centroid for each cluster (basically K=1). K-means is also applied to the well classified

VOLUME 10, 2022

The following Eq. 4 shows the applied metric comparison, i.e., the cosine similarity between the two centroids:

$$ \cos(\theta) = \frac{WDM_{new} \cdot WDM_{DB}}{\lVert WDM_{new} \rVert \, \lVert WDM_{DB} \rVert} \qquad (4) $$

where WDM_new is the computed K-means cluster centroid related to the WDM clusters, while WDM_DB is the same related to the well-classified WDMs stored in the Database.
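The centroid matching can be sketched as follows. With K=1 the K-means centroid of a cluster reduces to the mean of its members, which is what `centroid` computes. The 0.90 default threshold is the value reported in the experiments of this work; the function names and the dictionary of database centroids are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Mean vector of a cluster, i.e. the K-means centroid for K=1."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def best_match(
    new_centroid: Vector,
    db_centroids: Dict[str, Vector],
    threshold: float = 0.90,
) -> Tuple[Optional[str], float]:
    """Most similar database class, or None when the score is below the threshold."""
    label, score = max(
        ((name, cosine_similarity(new_centroid, c)) for name, c in db_centroids.items()),
        key=lambda item: item[1],
    )
    return (label, score) if score >= threshold else (None, score)
```

A cluster whose best score falls below the threshold returns `None`, which corresponds to a candidate novel defect pattern to be inspected.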

The cosine similarity score ranges from −1 to 1, where −1 represents high dissimilarity of the centroids while, conversely, 1 represents high similarity of the input data.

The designed Supervised Learning Block takes as input the defect patterns stored in the internal database, possibly updated by the Unsupervised Learning Block.

To perform the mentioned supervised classification, an ad-hoc designed Deep Convolutional Neural Network has been implemented. It is composed of 5 convolutional layers having a kernel size of 3 × 3, with padding and stride set to 1. For each convolutional layer, a ReLU activation function followed by Batch Normalization is applied. The number of kernels is doubled at each layer, starting from 64 up to 512. Starting from the second layer, Max-pooling of size 2 × 2 with stride set to 2 is applied. The so designed Convolutional Neural Network backbone is described in detail in Table 1.

Specifically, we have designed two types of deep convolutional network backbones (differing in the input layer and in the final layers that embed the fully connected ones) in order to identify the best-performing one and to facilitate the benchmark comparison phase. More details about the two implemented backbones are now given.

The Big CNN. This first backbone embeds an input layer at 224 × 224 × 3 as data resolution/channels, and a final stack of two fully connected layers which embed 100,352 and 1,024 neurons respectively.

The Small CNN. This second backbone embeds a single-channel input layer at 64 × 64 and a final set of two fully connected layers composed of 8,192 and 1,024 neurons respectively. As introduced, the need for two deep architectures is mainly for performance validation, as well as for a more robust benchmarking of the proposed solution, since some of the literature solutions against which our method has been compared have inputs of 224 × 224 × 3 or single-channel inputs. Table 1 reports the details of the implemented deep backbones.
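The fully connected sizes quoted for the two backbones can be checked with simple shape arithmetic. The sketch below assumes the per-layer channel schedule (64, 128, 256, 512, 512), which is one way to read "doubled at each layer, starting from 64 up to 512" for five layers; only the last value affects the result. The function names are hypothetical.

```python
from typing import Sequence


def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial size after a convolution (3x3, stride 1, padding 1 keeps the size)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial size after max-pooling (2x2, stride 2 halves the size)."""
    return (size - kernel) // stride + 1


def flattened_features(
    input_size: int,
    channels: Sequence[int] = (64, 128, 256, 512, 512),  # assumed schedule
) -> int:
    """Number of inputs to the first fully connected layer."""
    size = input_size
    for layer, _ in enumerate(channels, start=1):
        size = conv_out(size)
        if layer >= 2:  # max-pooling is applied from the second layer onwards
            size = pool_out(size)
    return size * size * channels[-1]
```

With four pooling stages, a 224 × 224 input shrinks to 14 × 14 and a 64 × 64 input to 4 × 4. Multiplying by 512 channels gives 100,352 and 8,192 respectively, matching the sizes stated for the Big CNN and the Small CNN.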

As reported in Table 1, the final number of well-defined WDM classes has been set to 45, although this number can vary significantly according to the new classes that may emerge from the unsupervised clustering block. More details about these defect map classes are reported in the next sections.

The experimental results we have collected relate to this setup, although similar considerations can be extended to any number of defect map classes. However, an attempt is made to minimize the number of defect pattern classes in order to efficiently characterize production. Furthermore, as new defect classes are identified, they are analyzed and resolved in the upstream production line, thus contributing to the maintenance of a minimum number of defect classes.

This section reports experimental results for the Unsupervised Learning and Supervised Learning approaches and some details about the STAI-EWS application we have developed to perform the tests through a user-friendly tool.

Fig. 14b shows defective dies at the center with straight vertical lines and spots of good dies; Fig. 14d shows an amplified version of the ring-like pattern; Fig. 14f shows an amplified, inner half-moon-like pattern on the left side of the wafer; Fig. 14g shows an amplified version of the ring-like pattern with good dies arranged vertically at the center of the wafer, similar to SiC_6 in Fig. 13a; Fig. 14v shows a wafer full of good dies with a horizontal centered line and spots arranged like a checkerboard of good dies; Fig. 14w shows the right half side of the wafer with defective dies and a straight horizontal line of good dies.

From this internal dataset, a subset of 225 unlabelled mixed WDMs is randomly chosen and arranged in a 3D Surface Plot as reported in Fig. 15. This plot allows predominant patterns to be spotted by visual inspection, thanks to the binarization of WDMs, where good dies have value ''0'' and defective dies have value ''1''. By stacking up binarized WDMs along the pixel coordinates (x and y axes), the sum of ''1'' values (z axis) represents the spatial distribution of defective dies on the wafer. A high value of z at a specific coordinate point represents a predominant defective pattern.

This dataset has been used to evaluate the unsupervised block of the proposed pipeline. The defect maps embedded in this dataset have been previously analyzed by engineers of STMicroelectronics; in this way, we were able to evaluate the outcomes of the unsupervised analysis more accurately.

- ''Distance metric'' is the metric used to compute distances in high dimensional space.
where n is the number of dimensions.

environment. According to [41], Manhattan distance should be used in high-dimensional space scenarios as it is more robust to outliers, but at the same time it is affected by the curse of dimensionality drawbacks. Euclidean distance, instead, affects the generation of the embeddings by squaring the distance of far-away data-points x_i and y_i. In practical applications, the authors opted for one metric rather than the other according to the statistical distribution of the data.
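The difference between the two metrics discussed above can be made concrete with their standard definitions over n dimensions. It is an assumption that the equations of this section use exactly these forms; the function names are illustrative.

```python
import math
from typing import Sequence


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of absolute coordinate differences over the n dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For x = (0, 0, 0, 0) and y = (1, 1, 1, 10), the Manhattan distance is 13 and the Euclidean distance is about 10.15. The single far-away coordinate contributes 10 of 13 to the Manhattan sum but 100 of 103 to the squared Euclidean sum, which illustrates why squaring makes the Euclidean distance more sensitive to outlying coordinates.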

The combination of the parameters indicated in Tables 2 and 3 enabled the evaluation of 30,780 models for cluster generation starting from the input WDMs. For this reason, we have used ad-hoc performance indexes to evaluate the performance of each model. Specifically, we have adopted the Silhouette Score, the Calinski-Harabasz Index and the Davies-Bouldin Index, defined as follows.

where:

- ā is the average distance between samples in the same class;

- b is the average distance between samples in the nearest clusters.
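With the two quantities defined above, the Silhouette Score of a sample is commonly computed as (b − ā) / max(ā, b) and then averaged over all samples. The sketch below assumes that the index adopted here follows this standard form and uses the Euclidean distance; the function names and the list-of-clusters input are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def mean_distance(p: Point, others: Sequence[Point]) -> float:
    """Average Euclidean distance from p to a non-empty set of points."""
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette_sample(p: Point, same: Sequence[Point], others: Sequence[Sequence[Point]]) -> float:
    """Silhouette of one sample; `same` holds the other members of its cluster."""
    if not same:
        return 0.0  # usual convention for singleton clusters
    a = mean_distance(p, same)
    b = min(mean_distance(p, cluster) for cluster in others)
    denom = max(a, b)
    return 0.0 if denom == 0.0 else (b - a) / denom


def silhouette_score(clusters: List[List[Point]]) -> float:
    """Mean silhouette over all samples (at least two clusters are required)."""
    scores = []
    for i, cluster in enumerate(clusters):
        others = [c for j, c in enumerate(clusters) if j != i]
        for k, p in enumerate(cluster):
            scores.append(silhouette_sample(p, cluster[:k] + cluster[k + 1:], others))
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest samples assigned to the wrong cluster.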

The Calinski-Harabasz Index (CHI) [43], [44] is the ratio of the between-cluster dispersion to the within-cluster dispersion. The index is computed as in Eq. 8:

where:

- k is the number of clusters;

- R_q is the set of samples in cluster q;

- r_q is the center of cluster q;

- R_E is the center of E;

- n_q is the number of samples in cluster q.
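Using the symbols listed above, the CHI is usually computed as the between-cluster dispersion divided by (k − 1), over the within-cluster dispersion divided by (n_E − k), where n_E is the total number of samples. The sketch below assumes Eq. 8 follows this standard form; the function names and the list-of-clusters input are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def mean_point(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(clusters: List[List[Point]]) -> float:
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    k = len(clusters)
    all_points = [p for cluster in clusters for p in cluster]
    n_e = len(all_points)
    if k < 2 or n_e <= k:
        raise ValueError("need at least two clusters and more samples than clusters")
    center_e = mean_point(all_points)  # center of the whole dataset E
    between = 0.0
    within = 0.0
    for cluster in clusters:
        center_q = mean_point(cluster)  # center of cluster q
        between += len(cluster) * squared_distance(center_q, center_e)
        within += sum(squared_distance(p, center_q) for p in cluster)
    if within == 0.0:
        raise ValueError("within-cluster dispersion is zero")
    return (between / (k - 1)) / (within / (n_e - k))
```

For the two one-dimensional clusters {0, 1} and {10, 11}, the between-cluster dispersion is 100 and the within-cluster dispersion is 1, giving an index of 200.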

A higher value of the CHI means that clusters are dense and well separated.

Table 4.

(i.e., 1), allowed to preserve local density by obtaining a better clustering performance despite the increasing number of clusters. This allowed us to find 41 clusters with a cluster sample size ranging from 2 to 8, with some bigger clusters containing 15 or 25 WDMs. After the clustering was performed as in the previous paragraph, we proceeded by computing the K-means centroids. Moreover, we have compared the related centroids with the well-classified SiC dataset of STMicroelectronics by applying a threshold of 90% with a Cosine Similarity validation. Some of the collected outcomes have been reported in Figs. 16, 17.

Although K-Means is a classical approach which often shows limits in the determination of the centroids of clusters, in the application herein proposed we noticed that the combination with Cosine Similarity allowed a robust match between unlabelled clusters. As introduced, the SiC dataset has been previously annotated by engineers of STMicroelectronics, so that we have checked the similarity between the clusters identified by the unsupervised pipeline and the classes already identified. In our experiments only 7 clusters have a lower value of Cosine Similarity, between 70 and 80%, and only 2 clusters have been misclassified with classes from the SiC dataset (further inspection revealed a bad clustering and a wrong centroid generation

As reported in Fig. 18 the dataset includes 9 different patterns (including ''none'' with different morphology).

As shown in Fig. 19, where: 1) defective dies arranged at the center; 2) defective dies arranged as a Donut-like shape; 3) a group of defective dies located on the edge; 4) defective dies along the edge; 5) a group of defective dies located anywhere; 6) a wafer full of defective dies; 7) random defective dies; 8) single circular scratch; 9) none (normal) Wafer Maps with few defective dies.

To cover the mentioned imbalance issue, we significantly reduced the ''none'' class to only 1,000 random samples.

(Fig. 20), 3 mixed type (Fig. 21) and four mixed type (Fig. 22), for a total amount of 38 defect patterns at 52 × 52 fixed resolution.

As introduced, the final dataset used is a combination of the previous ones, containing a total of 71,266 WDMs with 45 different classes re-arranged and grouped in a more balanced way. This full dataset has been split into training, validation and test sets according to an 80-10-10 hold-out methodology.

The designed supervised deep system has been trained by using single datasets as well as combined ones. Preliminarily, an ad-hoc data augmentation method has been employed, including random rotation and horizontal and vertical flips.
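The 80-10-10 hold-out split and the flip/rotation augmentations can be sketched with the standard library only. This is an illustration under stated assumptions: the source does not specify the shuffling seed, the rounding rule or the rotation angles, so the fixed seed, the integer rounding and the restriction to 90-degree rotations are hypothetical choices, as are the function names.

```python
import random
from typing import List, Tuple

Wafer = List[List[int]]


def hold_out_split(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them 80/10/10 into train, validation and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 80 // 100
    n_val = n_samples * 10 // 100
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


def horizontal_flip(wdm: Wafer) -> Wafer:
    """Mirror the map left to right."""
    return [row[::-1] for row in wdm]


def vertical_flip(wdm: Wafer) -> Wafer:
    """Mirror the map top to bottom."""
    return [row[:] for row in wdm[::-1]]


def rotate_90(wdm: Wafer) -> Wafer:
    """Rotate the map by 90 degrees clockwise."""
    return [list(row) for row in zip(*wdm[::-1])]
```

With 71,266 samples this rounding yields 57,012 training, 7,126 validation and 7,128 test maps. Flips and 90-degree rotations keep a binarized map binary, whereas arbitrary angles would require interpolation and re-thresholding.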

As introduced, we have implemented two deep network backbones (Big CNN and Small CNN). Both models have been trained for 100 epochs in the PyTorch framework, version 1.10 [47], with CUDA 11.4, running on a workstation based on an Intel Core i9-12900K with 64 GB of DDR4-3600MHz RAM coupled with an NVIDIA RTX 3060 with 12 GB of VRAM. For benchmark comparison, we have tested State-Of-The-Art (SOTA) backbones both pre-trained (on ImageNet) and trained from scratch [48]. The Adam algorithm [49] has been used as optimizer with an initial learning rate of 1e−4, and the Cross Entropy function has been used as

where x is the input tensor, y is the target class label, w is the weight (rescaled by the weight given to each class), C is the number of classes and N spans the minibatch.
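The class-weighted cross-entropy described by these symbols can be written out in plain Python. The sketch assumes the weighted-mean form documented for PyTorch's `CrossEntropyLoss` with class weights and mean reduction; whether the training used exactly this reduction is not stated in the source, and the function name is hypothetical.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],  # x: N rows of C raw class scores
    targets: Sequence[int],             # y: N target class indices
    class_weights: Sequence[float],     # w: one weight per class
) -> float:
    """Weighted mean over the minibatch of -w[y_n] * log softmax(x_n)[y_n]."""
    total = 0.0
    weight_sum = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable log-sum-exp over the C classes.
        peak = max(x)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in x))
        total += -class_weights[y] * (x[y] - log_sum_exp)
        weight_sum += class_weights[y]
    return total / weight_sum
```

With uniform logits over C classes the loss equals log(C) regardless of the weights, which is a convenient sanity check at the start of training.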

In Table 5

was reported while we reported in Table 6 the related confusion matrix converted to overall accuracy, precision, recall,

Table 6, it is evident how our architectures significantly outperform the method proposed in [4], which catches up only on the ''none'' class, which is strongly imbalanced and in any case not very significant for the analyzed WDM assessment. Therefore, the robustness of the proposed method is evident in relation to the defect classes that are most relevant in the analysis of the defect patterns. More details about Dual-stage WMFPR and our proposed architectures can be found in Table 6.

1001
About the MixedWM38 dataset the authors of [46] 1002 described a split of their dataset (training and validation set) 1003 as 80% and 20% providing a performance assessment of their 1004 method based on the usage of precision and recall metrics. 1005 In Table 7 we have reported the benchmarks comparison 1006 between the method reported in [46] named DC-Net against 1007 our proposed ones. As showed in Table 7 our proposed solu-1008 tions (in both the designed configurations) outperformed the 1009 DC-Net approach designed in [46] by an average of 4% in 1010 overall accuracy. The performances related to the classifica-1011 tion of single defect-classes (both native and mixed) showed 1012 that our method outperforms the DC-Net approach [46] 1013 on average, confirming the effectiveness of the proposed 1014 approach. More details about DC-Net and our proposed archi-1015 tectures can be found in Table 7.   we have used the accuracy in training, validation and testing 1024 phase. In Table 8  maps of a pre-trained network is more difficult to converge to 1032 feature maps associated with WDMs. A network that builds 1033 its own feature maps from scratch is able to learn better and 1034 therefore perform better. 1035 We also tested architecture based on Vision Trans-1036 former [50] (ViT RGB at 224 × 224 × 3 and ViT at 64 × 1037 64 × 1 spatial resolutions) which however underperformed 1038 compared to ours. We believe that this result is to be further 1039       are defined as the integral-path of the gradients along the 1081 straight-line path from the baseline a to the input a. The 1082 integrated gradient along the i th dimension for input a and 1083 baseline a (with m, the number of steps in the Riemann 1084 approximation of the integral) is defined by the following 1085 Eq. 12: Grad-CAM [54], [55] is a method to produce visual expla-1088 nation of underlying Convolutional Neural Network models 1089 making them more explainable. 
Grad-CAM uses the gradient information flowing into the last convolutional layer of the network to assign values to each neuron for a particular outcome. Given a localization map related to the class C, Grad-CAM computes the gradient of the score of class C (before the Softmax) with respect to the feature map of the previous activated convolutional layer. The gradient so computed is global-average pooled over the width (i) and height (j) dimensions to obtain the neuron weighting.

[...] where A(l_i) is the corresponding attention weight map at the i-th layer (for i to j, i.e., from the first layer to the last ones).
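The two gradient-based attributions described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses the standard Riemann-sum form of integrated gradients and the standard Grad-CAM map (ReLU of the weighted sum of feature maps), and the toy model, its gradient function `toy_grad`, and all numeric values are hypothetical and are not the networks evaluated in this work.

```python
def integrated_gradients(grad_fn, a, baseline, m=50):
    """Riemann-sum approximation of integrated gradients.

    grad_fn(x) must return the gradient of the model score at x.
    a is the input, baseline is the reference point, m is the
    number of steps along the straight-line path.
    """
    n = len(a)
    grad_sum = [0.0] * n
    for k in range(1, m + 1):
        alpha = k / m
        # Point on the straight line from the baseline to the input.
        point = [baseline[i] + alpha * (a[i] - baseline[i]) for i in range(n)]
        grad = grad_fn(point)
        for i in range(n):
            grad_sum[i] += grad[i]
    return [(a[i] - baseline[i]) * grad_sum[i] / m for i in range(n)]


def grad_cam(feature_maps, grads):
    """Grad-CAM on nested lists shaped [channel][height][width].

    The weights are the global-average-pooled gradients; the map is
    the ReLU of the weighted sum of the feature maps.
    """
    weights = []
    for channel in grads:
        count = len(channel) * len(channel[0])
        weights.append(sum(sum(row) for row in channel) / count)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, feature_maps):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    cam = [[max(0.0, v) for v in row] for row in cam]
    return weights, cam


def toy_grad(x):
    # Hypothetical model F(x) = x0**2 + 3*x1, so grad F = [2*x0, 3].
    return [2.0 * x[0], 3.0]


attributions = integrated_gradients(toy_grad, a=[2.0, 1.0],
                                    baseline=[0.0, 0.0], m=50)

# Hypothetical gradients and feature maps: 2 channels of size 2 x 2.
toy_grads = [[[1.0, 3.0], [1.0, 3.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_maps = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [5.0, 0.0]]]
weights, cam = grad_cam(toy_maps, toy_grads)
```

For the toy model the attributions approach 4 and 3 as m grows, so their sum approaches F(a) − F(a′), which is a convenient sanity check on the number of steps. In a real pipeline the gradients would come from the automatic differentiation of the deep learning framework in use rather than from a hand-written function.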

To show the behaviour of the models, the aforementioned XAI methods are now applied to an instance of the "Loc" wafer defect pattern (Fig. 26). As reported in Figs. 27-36, for each of the tested deep backbones we computed the explainability methods in order to reconstruct the internal representation used by the network to classify the related wafer patterns. The first aspect to highlight is that, although the defect pattern is single, the networks internally activate several similar classes such as Center, Edge-Loc and Loc (as highlighted in the Prediction plots shown in Fig. 27a, 28a). In any case, the output of the network is the most representative class of this internal map.

[...] both pre-trained (Fig. 29, 31, 33) and trained from scratch (Fig. 30, 32, 34), showed the same behaviour, i.e., they were not able to make correct predictions, as confirmed by the XAI based on Grad-CAM and Integrated Gradients, which did not produce any significant activation maps. It is interesting to highlight that the tested deep models trained from scratch were able to make a better prediction, associated with highly significant activation maps, such as VGG19, ResNet-152 and DenseNet-161 [...]

[...] user will be able to run inference on such input WDMs through the well-trained Convolutional Neural Network. Feed-forward inference can be done either on a single WDM or on a group of defect maps. The related classification of the input WDMs will be delivered with associated reports.

• Unsupervised WDMs clustering option. As described in IV-A, this option lets the user perform unsupervised clustering of the input WDMs, followed by a downstream comparison with the internal database to look for new wafer defect pattern classes. The GUI of the STAI-EWS tool exposes multiple parameters for UMAP and HDBSCAN, such as the dimensionality reduction factors, filtering parameters, thresholds configuration, and so on. A related 3D surface plot is created to give an overview of the input WDMs under the adopted dimensionality reduction and hierarchical clustering configuration. The STAI-EWS tool also lets the user enable the K-Means centroids computation and the related cosine similarity. As introduced, when novel defect patterns are found the internal database is automatically updated and the related CNN is re-trained accordingly (this option can be disabled by the user). Fig. 37 shows an instance of the unsupervised sub-system embedded in the STAI-EWS tool.
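The comparison between a new cluster and the internal database can be illustrated with a short Python sketch. It assumes that each cluster is summarized by its mean vector (the K-Means centroid for a single cluster) and that a cluster counts as a candidate novel pattern when its best cosine similarity to the known class centroids falls below a threshold. The function names, the 2D embeddings, and the 0.9 threshold are all hypothetical and are not taken from the STAI-EWS tool.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_cluster(cluster_vectors, known_centroids, threshold=0.9):
    """Return (label, similarity); label is None for a candidate novel pattern.

    threshold is an assumed, user-configurable value.
    """
    c = centroid(cluster_vectors)
    best_label, best_sim = None, -1.0
    for label, reference in known_centroids.items():
        sim = cosine_similarity(c, reference)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return None, best_sim
    return best_label, best_sim


# Hypothetical reduced embeddings of two known classes.
known = {"Center": [1.0, 0.0], "Edge-Loc": [0.0, 1.0]}

known_label, known_sim = match_cluster([[0.9, 0.1], [1.1, -0.1]], known)
novel_label, novel_sim = match_cluster([[1.0, 1.0], [1.0, 1.0]], known)
```

Here the first cluster is matched to "Center", while the second one is equally far from both known centroids and is therefore flagged as a candidate new class, which is the case that would trigger a database update in a workflow like the one described in this item.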

• The Configuration-Management of the Database. This section allows the user to configure the STAI-EWS tool internal defect maps database including the ability [...]

VOLUME 10, 2022

[...] His research interests include the areas of artificial intelligence applied to medical data processing, detection and segmentation in endoscopic video imaging systems, visual-knowledge ontology modeling, processing of radio-astronomical images, and temporal series analysis.

[...] Italy, in 1985 and 1991, respectively. In more than 30 years of research activity, he has achieved several important results in various fields and, more specifically, has gained extensive expertise in technology transfer from basic research ideas to prototypes and then to products and applications. This expertise has been built up by combining advanced research work (within or in cooperation with universities, research labs, and small/medium enterprises) with its application to technologies and products within STMicroelectronics. He has innovated front-end and back-end technologies in the field of power devices, introducing new Si power structures (using trench and thin wafers) and power structures in semiconductors like SiC and GaN. SiC power devices are now in full mass production within STMicroelectronics. He has authored more than 250 publications in international refereed journals and holds more than 50 patents.