Quantification of Damages and Classification of Flaws in Mono-Crystalline Photovoltaic Cells Through the Application of Vision Transformers

This work introduces new, effective methodologies for detecting, analyzing, and classifying the diverse defects that may occur throughout the production process of photovoltaic panels. Specifically, it proposes a novel approach that combines image processing and Vision Transformers (ViT) to address this challenge. The results comprise a lightweight ViT-based flaw-type classifier, along with computational tools that calculate the length of cracks and the proportional damaged area caused by flaws without requiring the training of additional models. The proposed ViT-µ model achieved high accuracy in flaw detection and classification for solar cells, with rates of nearly 98% and 94%, respectively, obtained after a mere one-hour training duration. Moreover, this study introduces a weakly supervised method for visualizing the detected defects within a solar cell by using attention maps.


I. INTRODUCTION
Production of photovoltaic panels (PPs) has experienced exponential growth in the last decade, driven by the significant benefits that harnessing solar energy offers [1], [2], [3], [4]. According to the International Energy Agency (IEA), the installed solar capacity increased from 639 GW in 2022 to 1262 GW in 2023. From a monetary perspective, the rise of this sector is also evident. For instance, clean energies, led by solar, are set to attract a global investment of USD 1.7 trillion in 2023 [5].
The high demand for PPs is indeed a reality and, at the same time, challenges manufacturers to continuously innovate their processes [6], [7]. From the production to the deployment of the PPs, constant changes have been adopted to improve the panels' efficiency and to reduce the cost and time of their fabrication [8], [9]. Among the different stages of the fabrication process, one of paramount interest is the detection and classification of cracks and cold solder joints (cold solders) on the solar cells. The importance of this stage arises from the direct relationship between the aforementioned flaw types and the overall quality, as well as the effective lifespan, of the PPs. (The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.)
Typically, across the majority of PP manufacturing facilities, the identification of cracks and cold solder joints is performed by trained personnel. The assessment procedure starts with each assembled PP passing through an electroluminescence (EL) test. The EL test returns an image highlighting all possible anomalies resulting from the assembly process. The image is then inspected and judged by a qualified worker. If a flaw is detected, the PP is rectified and sent back to repeat the EL test; otherwise, the PP continues to the final manufacturing stages. Since the process is subject to human interpretation, the evaluation is prone to detection errors. Moreover, the time spent evaluating each PP varies, entailing considerable delays in the production line.
Different methods to speed up the PP flaw inspection process have been widely explored in recent years [10], [11], [12]. Alongside the advancement of technology, computer vision techniques have predominantly been used to address the problem. In the same manner, the application of image processing along with Machine Learning (ML) has notably dominated the latest developments in this field [13], [14], [15], [16]. In this sense, and with the aim of contributing to this topic, the present work adopts an ML approach, specifically a model architecture known as the Vision Transformer (ViT).
The ViT technique was introduced in 2021 [17] and is based on the transformer architecture originally used in natural language processing. It splits an image into patches, where each patch can be considered a ''word'' in a ''sentence'' and the sentence is equivalent to the entire image. The network then aims to understand the significance of each patch and its spatial position within the image through an attention mechanism [18]. A significant improvement over convolutional neural networks (CNNs), as mentioned in [17], is that ViTs excel at capturing global pixel relationships, in contrast to CNNs, which focus on capturing relationships among adjacent pixels. On the other hand, ViTs face a critical drawback: the amount of data required to achieve task generalization comparable to CNNs is significantly larger. To address this limitation, the authors in [19] propose the integration of knowledge distillation and adjustments to hyperparameters. By applying these changes to the conventional model, the Data-efficient image Transformer (DeiT) is created, which achieves CNN-like task generalization while being trained with the same amount of data. Consequently, DeiT proves to be more suitable for transfer learning and fine-tuning than the ViT.
In this study, we leverage a database provided by BYD (Build Your Dreams), a manufacturer of solar panels, to develop a novel approach for identifying and categorizing damage in solar cells. Specifically, we present a lightweight flaw detector and multi-class classifier version of the ViT (ViT-µ), along with an algorithm that calculates the proportional damaged area for each cell based on the type and location of the flaw. Our analysis focuses on three commonly reported classes during the inspection process, namely cracked cells, cells with cold solders, and undamaged cells. To further improve the accuracy of our approach, we explore image-processing techniques that complement the ML model. Finally, we apply different techniques to segment the cell image depending on the type of flaw detected. For cracked cells, we employ a deterministic approach based on image processing, while for cold solders, we use weakly supervised segmentation. Overall, our work represents a significant advancement in solar cell inspection, with potential implications for PP production, as it provides a lightweight classifier and a means of performing segmentation that requires little to no training. To the best of our knowledge, no previous literature has presented results of this nature prior to the findings of this study.
The remainder of this manuscript is organized as follows. Section II presents a bibliographic review. Section III details the main features of the database. Section IV revisits the ViT model architecture and presents the other models used in this work. Section V introduces the image processing algorithms used in this paper. Section VI introduces the cell damage measurement algorithms. Section VII presents and discusses the obtained results. Finally, Section VIII concludes this work.

II. RELATED WORKS
This section offers a review of relevant literature involving ML techniques for flaw detection in PPs. The review sets the stage for our study by highlighting the existing research landscape and the specific contributions we aim to make.
Recently, the authors in [20] introduced a deep-learning-based model to detect and classify defect types in solar cells. A novel feature fusion method based on ResNet152-Xception, along with a transfer learning technique, is employed to tackle the problem. The authors incorporated data augmentation (DA) and class weighting techniques into their framework to mitigate challenges stemming from dataset size and class imbalance. Results reported 96.17% accuracy in the binary classification task (presence or absence of defects) and 92.13% for multi-class classification considering nine defect types. In an associated study presented in [21], the authors unveil an evolutionary-algorithm-based deep CNN pruning method named PSOPruner. The study aims to optimize the number of parameters of deep CNN models. Comparisons between PSOPruner and traditional deep CNN methods, in terms of accuracy versus the number of parameters, validate the model's usefulness, especially for mobile and embedded devices. Despite the great contributions presented in [20] and [21], the authors neither explore any ViT-type model nor propose any damage quantification algorithm for solar cells.
In a similar vein, the authors in [22] use a Support Vector Machine (SVM) and a CNN to tackle the problem. Mono-crystalline and poly-crystalline solar cells were considered for the training stage. The ML models reported 83% and 88% accuracy, respectively. In a congruent study, the authors in [23] introduce a CNN-based classifier. A novel DA method based on image processing and Generative Adversarial Networks (GANs) is also presented. Despite the higher number of samples in the database due to the DA, the classifier reached 84% accuracy on the validation set. In [24], a transfer learning approach using ResNet-50 and weights pre-trained on ImageNet is adopted. The authors vary the values of the normalization layer and reach an accuracy of 91% on the test set, while also using the heatmap obtained from the last convolutional layer to perform segmentation of cell images with cracks. More recently, the authors in [25] present a solar cell flaw detector employing a modified Shifted-window Transformer (Swin), which is an updated form of the ViT. Novel techniques, such as the use of convolutional layers within the ViT architecture and cross-window-based multi-head self-attention, were introduced there. More interestingly, the authors demonstrate how ViT networks outperform more traditional ML approaches in terms of accuracy. As for studies regarding damage quantification, [26] proposes using a UNet model to perform a semantic segmentation of the image, highlighting every pixel belonging to an affected area.
Although the authors in [25] implemented the ViT approach for PP flaw detection, no flaw-type classification was carried out; and while [26] evaluates the affected area caused by a crack, it requires label images to be drawn manually for the training of the model. Additionally, no damage quantification of cold solders on solar cells has been carried out yet. In general, it is worth noting that the classification systems presented in prior studies primarily focus on assessing the degree of cellular damage rather than delving into the underlying nature of the flaw. Addressing these shortcomings is also a focus of this research.

III. PRELIMINARIES
In this section, the primary aspects of the database are presented, offering an in-depth understanding of its essential details. Furthermore, a flowchart is introduced, serving as a visual representation of the step-by-step sequence followed in this study.

A. DATABASE
Our research is centered on mono-crystalline photovoltaic modules. For this study, we obtained 80 full-panel electroluminescence images from the BYD company, resulting in a total of 5362 cell images after excluding any uncertain cells. They were carefully assessed and classified based on the presence or absence of flaws. The quantity of images per class is distributed as follows:
• Crack: 999 images;
• Cold solder: 791 images;
• No flaw: 3572 images.
Among these images, a total of 4228 have been designated for training and validation purposes, while the remaining 1134 images are reserved for testing.

B. PP IMAGES MAIN FEATURES
Each panel image has the following characteristics:
• Height = variable, between 2668 and 2674 pixels;
• Number of cells = 72.
The identification of each of the 72 cells can be achieved by referring to Fig. 1, where each cell's label consists of a letter denoting the column and a number denoting the row. For example, as shown in Fig. 1, the cell in the lower right corner of the solar panel is indicated by the label 6A.

C. WORKFLOW
The flowchart presented in Fig. 2 summarizes the sequence of processes that make up the solution presented in this article. A detailed explanation of each step is given in the upcoming sections.
In brief, the procedural framework presented herein comprises a sequence of steps as delineated in the subsequent sections: segmenting the panel image into distinct cell images, as expounded in Section V-A; subjecting each individual cell image to the cell cleansing process described in Section V-B; and employing each processed cell image as input to the ViT-µ classifier model, elucidated in Section IV. The subsequent course of action depends upon the outcome of the flaw classification. In the event that no flaw is discerned within the image, the current cell evaluation is over and the subsequent cell image is subjected to classification. In the event of crack detection, the procedure is extended to encompass the reapplication of the image binarization technique outlined in Section V-B, followed by the application of the algorithms detailed in Sections VI-A and VI-B. These algorithms are instrumental in assessing the extent of damage and the length of the identified crack, respectively. Lastly, in cases where a cold solder anomaly is identified, an attention map is procured from the model and employed to estimate the affected region, in accordance with the exposition provided in Section VI-C.

IV. CLASSIFIER
This section elucidates the fundamental principles underlying the ViT model and highlights the key modifications made to achieve the desired outcomes.
The training and testing samples were processed on a single NVIDIA GeForce GTX 1050 Ti, with the sole exception of the Swin-T model, which had to be processed in system RAM (i.e., on the CPU) due to GPU memory limitations.

A. VISION TRANSFORMER
In [17], the ViT was presented as a means of incorporating the transformer technology and attention mechanism into image processing, in contrast to their original use in language processing. Specifically, the transformer encoder is modified, providing a novel method to train an image classifier without relying on convolution layers. As shown in Fig. 3, the image is partitioned into R patches, each with dimension p × p, which are subsequently flattened into p² × 1 vectors. The patches are then forwarded through a linear projection layer, specifically a multi-layer perceptron (MLP), yielding R tokens of dimension d × 1. Afterward, a position embedding particular to each token is added to it, together with a class-token concatenated at the start of this token vector, with the position embedding value of 0. The class-token is learned by concatenating all tokens into a single Rd × 1 vector and passing it through an MLP. The resulting two-dimensional matrix of size R × d is then passed through T transformer encoder layers. Solely the output of the encoder located at the same position as the class-token of the last transformer layer is directed into the MLP classifier head.
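As an illustration, the patch-splitting front end described above can be sketched in a few lines of NumPy (a minimal sketch; the patch size of 16 used below is only an example, since the actual hyper-parameters are given in Table 1):

```python
import numpy as np

def patchify(image: np.ndarray, p: int) -> np.ndarray:
    """Split a (H, W) image into R = (H//p)*(W//p) flattened p*p patches,
    as in the ViT front end: each row of the output is one p^2-long token."""
    h, w = image.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by p"
    return (image
            .reshape(h // p, p, w // p, p)   # group rows/cols into p-blocks
            .transpose(0, 2, 1, 3)           # bring the two block axes together
            .reshape(-1, p * p))             # flatten each p x p patch

# A 400 x 400 cell image with 16 x 16 patches yields 625 patches of length 256.
cell = np.zeros((400, 400), dtype=np.float32)
tokens = patchify(cell, 16)
print(tokens.shape)  # (625, 256)
```

Each row of `tokens` would then pass through the linear projection to obtain the d-dimensional embeddings.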
Each transformer encoder layer has the same architecture, as shown in Fig. 3. It consists of a layer normalization followed by a multi-head attention (MHA) layer in which all heads compute different self-attentions. Subsequently, the resulting output undergoes another normalization layer, followed by an MLP, and is then combined with the input of the second normalization layer. The MHA layer serves as a pivotal component of the transformer architecture. It comprises H heads of self-attention mechanisms that establish a correlation-matrix-like relationship among the tokens, connecting each of them with both itself and the other tokens within the vector. The self-attention mechanism requires three inputs: Query (U), Key (K), and Value (V). To construct them, these three matrices are obtained by multiplying a single input matrix Φ with three distinct trainable weight matrices, as defined by equations (1), (2), and (3):

U = Φ W^Q_j, (1)
K = Φ W^K_j, (2)
V = Φ W^V_j, (3)

with each head and the MHA output then computed as

S_a = softmax(U K^T / √d_k) V,
M_h = γ(S_1, ..., S_H) ξ.

Here, d_k signifies the length of the vector K, S_a represents the vector corresponding to a single head of self-attention, Φ denotes the input matrix, M_h indicates the output of the multi-head attention, γ symbolizes the concatenation function, ξ refers to a trainable weight matrix applied to the different heads, and W^Q_j, W^K_j, and W^V_j correspond to the weight matrices applied to Φ to obtain the query, key, and value, respectively, in the specific head j.
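To make the mechanism concrete, a single self-attention head can be sketched as follows (a minimal NumPy illustration with toy dimensions; the matrix names follow the notation above, but the weights here are random stand-ins, not the trained model's):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    """One head: U = X W_q, K = X W_k, V = X W_v, then
    S_a = softmax(U K^T / sqrt(d_k)) V."""
    U, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(U @ K.T / np.sqrt(d_k))  # token-to-token affinities
    return weights @ V, weights

R, d, d_k = 4, 8, 8                   # toy token count and dimensions
X = rng.normal(size=(R + 1, d))       # R patch tokens plus the class token
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out, attn = self_attention_head(X, W_q, W_k, W_v)
```

Each row of `attn` is a probability distribution describing how strongly one token attends to every other token; the MHA layer concatenates H such heads and projects them with ξ.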
Furthermore, a visualization technique, such as attention flow or attention rollout [27], can be employed to generate an image that highlights the regions of the cell deemed most relevant by the ViT model during the classification process. This attention map can aid in image cleanup or even in the estimation of the affected region of the cell. For instance, if the image is classified as containing a crack, the attention map is expected to point towards the crack as the most significant region of the image. This provides a weakly supervised manner of applying segmentation to the cell image.

Some modifications were introduced to the conventional architecture of the ViT model to enhance the results for the given scenario. Firstly, the flattened outputs of all the MLPs of the final transformer layer were utilized as the classifier's input. This approach led to better accuracy without greatly affecting the training or inference time. Secondly, the hyper-parameters, network architecture, and training parameters were set according to Table 1. Additional details not contained in the table are the presence of data augmentation in the shape of random horizontal flips, random rotations, and random zooms, and a normalization layer after the transformer layers and before the MLP classifier. Since this modified model is, despite the other changes, a greatly reduced form of the vanilla ViT, it is called ViT-µ from this point on.
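The attention rollout computation [27] mentioned above can be sketched as follows (a minimal NumPy version operating on hypothetical per-layer attention tensors; in practice these matrices are extracted from the trained model):

```python
import numpy as np

def attention_rollout(layer_attentions):
    """Attention rollout: average the heads of each layer, add the identity to
    model the residual connection, re-normalize the rows, and multiply the
    per-layer matrices to trace attention back to the input patches."""
    rollout = None
    for attn in layer_attentions:              # attn shape: (heads, tokens, tokens)
        a = attn.mean(axis=0)                  # fuse heads by averaging
        a = a + np.eye(a.shape[-1])            # account for skip connections
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows as distributions
        rollout = a if rollout is None else a @ rollout
    return rollout
```

The class-token row of the result, reshaped to the patch grid and upsampled, yields the kind of attention map used for the weakly supervised segmentation in Section VI-C.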

B. COMPARISON WITH OTHER MODELS
In order to compare the model proposed in this paper and find the best possible outcome for the problem at hand, various other models, listed in Table 2, were trained or fine-tuned using the same database.
Table 2 presents the evaluation of five additional detection methods: four pre-trained deep learning methods (two of them transformer-based) and an SVM. All pre-trained models utilized in this study were first trained on the ImageNet dataset [28] and then fine-tuned using our own database for the classification of images into our three classes.

V. IMAGE PROCESSING
Within this section, the panel splitting technique and the algorithm for cleansing solar cells are introduced. A comprehensive explanation of each task is subsequently provided.

A. PANEL DIVISION
The objective of panel division is to partition a single PP image into distinct individual images for each cell. In the database, the image dimensions may vary, as elaborated in Section III. Therefore, it is crucial to employ a technique that accounts for this variability in image size. Initially, panel boundaries are identified using a morphological opening operation with both vertical and horizontal kernels. Subsequently, a logical OR operation is performed on the resultant outputs, thereby facilitating the detection of the panel boundaries found with the vertical and horizontal kernels. The next step involves removing the borders to obtain an image exclusively containing the cells. This is accomplished by identifying the coordinates of the first pixel in each direction where the border terminates (i.e., pixel values below 255) and utilizing these coordinates to crop the panel from the image, as depicted in Fig. 4.
Finally, to acquire the individual cell images, the width and height are divided by the corresponding number of cells (12 and 6, respectively, in this scenario). The resulting values correspond to the dimensions of the individual cell images, as illustrated in Fig. 5. A resizing operation is then performed to normalize the size of all images to 400 × 400 pixels.
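The division and resizing steps can be sketched as follows (a hedged NumPy version that assumes the panel is already cropped as in Fig. 4 and uses nearest-neighbor resampling, since the interpolation method is not specified in the text):

```python
import numpy as np

def split_panel(panel: np.ndarray, n_cols: int = 12, n_rows: int = 6,
                out_size: int = 400) -> list:
    """Split a cropped grayscale panel image into n_cols x n_rows cell
    images and resize each to out_size x out_size pixels."""
    h, w = panel.shape
    ch, cw = h // n_rows, w // n_cols       # per-cell dimensions
    cells = []
    for r in range(n_rows):
        for c in range(n_cols):
            cell = panel[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            # nearest-neighbor resize to out_size x out_size
            yi = np.arange(out_size) * cell.shape[0] // out_size
            xi = np.arange(out_size) * cell.shape[1] // out_size
            cells.append(cell[np.ix_(yi, xi)])
    return cells
```

Applied to a cropped panel of roughly 2670-pixel height, this yields the 72 normalized 400 × 400 cell images described above.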

B. CELL CLEANSING
The objective of the cell cleansing process is to produce two distinct outcomes: binary and RGB (or grayscale, depending on the model used) cell images devoid of busbars. The binary images serve as input for the algorithms discussed in Section VI, while the others are utilized as input for the ML model elucidated in Section IV. The cell cleansing process can be outlined as follows.

1) IMAGE BINARIZATION
The process of image binarization involves removing most of the noise present in the cell image while retaining the stronger lines, such as busbars and cracks. We use an adaptive Gaussian inverted threshold followed by the Zhang-Suen thinning procedure [29].
The Gaussian thresholding process is performed in two steps. First, a convolution operation is executed by convolving the image with a Gaussian kernel matrix. Then, the image goes through an inverse binary thresholding procedure, with the threshold constant being set as C. Mathematically, both steps can be written respectively as

Q(i, j) = Σ_{m=-w_y}^{w_y} Σ_{n=-w_x}^{w_x} G(n + w_x, m + w_y) R(i + n, j + m),

A(i, j) = 255 if R(i, j) < Q(i, j) − C, and 0 otherwise,

where the variables i and j are the horizontal and vertical pixel positions, R is the matrix representing the image, Q is the result of the convolution, A is the thresholding output, G_{W_y × W_x} is the kernel matrix containing the Gaussian weights, w_x = (W_x − 1)/2, and w_y = (W_y − 1)/2, in which W_x and W_y are respectively equal to the horizontal and vertical size of the window representing the neighborhood. Replicate padding is used at the borders of the image.
Conversely, the Zhang-Suen thinning algorithm [29], also referred to as skeletonization, is applied to a binary image to reduce the thickness of each line to approximately one pixel. This algorithm examines the 3 × 3 neighborhood of each white pixel in the image and assesses its conditions, which are elaborated further in [29], to decide the pixel's value. It takes A as input and outputs Z.
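The adaptive inverse thresholding step can be sketched as follows (a hedged SciPy version: a Gaussian filter stands in for the windowed convolution, and the sigma and C values are illustrative rather than the paper's settings; the Zhang-Suen thinning of [29] would then be applied to the result):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_gaussian_threshold_inv(img: np.ndarray, sigma: float = 5.0,
                                    C: float = 4.0) -> np.ndarray:
    """Inverse adaptive thresholding: a pixel becomes white (255) when it is
    darker than its Gaussian-weighted local mean by more than C."""
    # mode="nearest" reproduces the replicate padding mentioned in the text
    local_mean = gaussian_filter(img.astype(float), sigma, mode="nearest")
    return np.where(img.astype(float) < local_mean - C, 255, 0).astype(np.uint8)
```

Because the threshold is local, dark crack and busbar lines are kept even when the overall brightness of the cell varies.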
An example demonstrating the application of the aforementioned algorithms on a photovoltaic cell image is depicted in Fig. 6.

2) HOUGH TRANSFORM MAP
The classic Hough transform is used to specifically detect long lines with angles close to 0, 90, 180, or 270 degrees (i.e., either horizontal or vertical) [30]. In the context of a photovoltaic cell image, those lines are generally border lines or busbars. The Hough transformation maps pixel lines from image space to one dot in parameter space. In this work, the normal parameterization was applied, given as

ρ = x cos α + y sin α. (6)

In this context, ρ represents the distance between the line to be mapped and the origin (or the length of the normal line). Furthermore, α denotes the angle formed by the normal line and the x-axis, ranging from 0° to 180°. Lastly, x and y refer to the coordinates of an individual point. These variables are also shown in Fig. 7a, which illustrates a noticeable property of this transform: a single point in image space maps into infinite points in parameter space. At the same time, when two or more points are aligned into a line in image space, this line maps into a single point in parameter space, as shown in Fig. 7b. Fig. 8c shows the accumulator matrix output of the Hough transform, which accumulates in parameter space the corresponding lines of every single white pixel in the image space. To obtain the lines of interest, a threshold operation is applied to this matrix to find the most active points in parameter space.
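A minimal version of the voting scheme of (6) can be sketched as follows (a hedged NumPy illustration with coarse bins; OpenCV's `HoughLines` provides an optimized equivalent):

```python
import numpy as np

def hough_accumulator(binary: np.ndarray, n_rho: int = 200, n_alpha: int = 180):
    """Minimal Hough accumulator: every white pixel votes for all (rho, alpha)
    pairs satisfying rho = x cos(alpha) + y sin(alpha), one bin per degree."""
    h, w = binary.shape
    rho_max = np.hypot(h, w)
    alphas = np.deg2rad(np.arange(n_alpha))
    acc = np.zeros((n_rho, n_alpha), dtype=np.int32)
    ys, xs = np.nonzero(binary)
    for x, y in zip(xs, ys):
        rhos = x * np.cos(alphas) + y * np.sin(alphas)
        idx = np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        acc[idx, np.arange(n_alpha)] += 1     # one vote per angle bin
    return acc

def keep_axis_aligned(lines):
    """Keep only the near-horizontal/vertical lines used to build the map M
    (alpha given in degrees, matching the ranges in the text)."""
    return [(r, a) for (r, a) in lines
            if 0 < a < 1 or 89.5 < a < 90.5 or 179 < a < 180]

binary = np.zeros((20, 20), dtype=np.uint8)
binary[:, 5] = 1                      # a vertical busbar-like line at x = 5
acc = hough_accumulator(binary)       # all 20 pixels vote into one bin at alpha = 0
```

Thresholding `acc` and filtering with `keep_axis_aligned` yields the (ρ, α) pairs that are drawn white to form the map M.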
The Hough transform ends by obtaining the ρ and α of every line. However, since we are only interested in the horizontal and vertical lines, we only consider the lines for which 0° < α < 1°, 89.5° < α < 90.5°, or 179° < α < 180°. These lines are then drawn white on a black background, generating the map, while all other detected lines are ignored and hence not drawn. Fig. 8 depicts the main steps of the Hough transform applied to a binary photovoltaic cell image. Following our naming convention, the input to this procedure is the image matrix Z and the map output is named M.

3) INPAINTING
Although the image binarization process cleans and denoises the cell images, it could also delete the flaws caused by cold solders. In this context, to highlight both the cracks and the cold solder flaws, the inpainting algorithm presented in [31] was incorporated. The inpainting algorithm is a method used to correct images by replacing certain pixels based on a binary map. Here we make use of the result obtained from the Hough transform, depicted in Fig. 8b, as the binary map. This map determines the specific regions within the images where the inpainting procedure is to be applied.
The algorithm operates by iteratively processing the regions identified by the 1s in the map, starting with their borders. For each border point b belonging to a region Ω of 1s in the Hough transform binary map M, the algorithm considers nearby points within a neighborhood of size ϵ. These neighboring points, each with coordinates denoted as c, are outside the region Ω but close to the border. The algorithm estimates the desired value of the image at position b using

F_c(b) = F(c) + ∇F(c) · (b − c), (7)

where F is the output image matrix, starting off as a copy of the input image Z, F(b) is the value of the image F at the coordinates b (similarly for F(c)), and F_c(b) is the approximation of the value at point b based solely on the information from the single neighboring point c [31].
A more exact approximation can be obtained by considering every point of the neighborhood ϵ that is outside Ω. This set of points, inside ϵ and outside Ω at the same time, is denoted as B_ϵ. Then, F(b) can be calculated in the following manner:

F(b) = Σ_{c ∈ B_ϵ} w(b, c) F_c(b) / Σ_{c ∈ B_ϵ} w(b, c), (8)

in which w is a weighting function detailed in [31].
After calculating the value of F at a point b, the value of M at position b is changed from 1 to 0, and the algorithm redefines which points are at the border of Ω and what Ω itself is. It then repeats the procedure from (7) until Ω becomes an empty set. A comparison between the original image and the inpainted figure is shown in Fig. 9.
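The border-inward sweep can be sketched as follows (a deliberately simplified stand-in for [31]: the neighbors are averaged uniformly and the gradient term of (7) is dropped; OpenCV's `cv2.inpaint` with the `INPAINT_TELEA` flag implements the full method):

```python
import numpy as np

def simple_inpaint(img: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Simplified inpainting in the spirit of [31]: each masked pixel on the
    current border of the region Omega is replaced by the mean of its
    already-known 8-neighbors, then Omega shrinks until it is empty."""
    F = img.astype(float).copy()
    M = mask.astype(bool).copy()
    while M.any():
        border = []
        ys, xs = np.nonzero(M)
        for y, x in zip(ys, xs):
            nbrs = [F[j, i]
                    for j in range(max(0, y - 1), min(F.shape[0], y + 2))
                    for i in range(max(0, x - 1), min(F.shape[1], x + 2))
                    if not M[j, i]]          # only already-known neighbors
            if nbrs:                         # pixel lies on the border of Omega
                border.append((y, x, np.mean(nbrs)))
        if not border:
            break                            # fully masked image: nothing known
        for y, x, v in border:               # update F and shrink Omega
            F[y, x] = v
            M[y, x] = False
    return F
```

In this work, the mask role is played by the Hough map M, so the busbar pixels are replaced by values interpolated from the surrounding cell texture.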

VI. SOLAR CELL DAMAGE QUANTIFICATION ALGORITHMS
The quantification of damage in solar cells holds significant importance, as it directly correlates with the efficiency and lifespan of a PP. To tackle this problem, in this section we present a crack length measurement tool and two algorithms to approximate the proportional damage in solar cells.

A. PROPORTIONAL DAMAGED AREA GENERATED BY CRACKS
Cracks may have distinct impacts on the energy production of photovoltaic cells. According to [32], the harm incurred is proportional to the area rendered inactive by a crack in a cell. To gain a better comprehension of this, an in-depth examination of the components of the cell is required. Fig. 10 displays a photovoltaic cell and the principal elements that facilitate the current flow: the busbars and the fingers. In this context, an area is considered damaged when a finger has all of its paths toward the busbars interrupted. This definition enables the analysis of solar cells based on the location of the damaged regions, which can be categorized into two types: the edge regions, where the fingers are connected to only one busbar (regions 0 and 5 in Fig. 11); and the internal regions, where the fingers are connected to two busbars (regions 1 to 4 in Fig. 11). Fig. 12 provides illustrations of crack interruptions occurring in both edge and internal regions.
Considering the problem description above, an algorithm was developed to determine the damaged area more precisely, using a binarized version of the inpainted image F, hereby named Z′, as input. The output is a binary image in which white pixels correspond to the damaged area of the cell; the algorithm also calculates the percentage of the image occupied by said pixels, thereby estimating the solar cell's proportional damage. Fig. 13b shows the result after the execution of the algorithm.
To reduce the time taken to process each image, a region-wise pixel detection approach was implemented. This involves selecting only the regions of the cell containing white pixels and performing a pixel-by-pixel search on them. If a pixel is found, the algorithm checks whether it has a clear vertical path to a busbar. In the edge regions, this means a clear path up (region 5) or down (region 0), whereas in the internal regions (1 to 4), each pixel must have a vertical path going either up or down. If no clear path is found, the pixel is deemed damaged, and all the pixels on the path to a crack or edge (regions 0 or 5) can also be identified, reducing the processing time. The image splitting process is shown in Fig. 14. After obtaining the damaged area image, the proportional affected area is calculated via

PAF = (Number of white pixels / Total number of pixels) × 100. (9)
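The per-pixel path test and equation (9) can be sketched as follows (a hedged fragment: the region-wise bookkeeping and the busbar positions of the full algorithm are omitted, and the path test shown applies the internal-region rule):

```python
import numpy as np

def has_clear_path(binary: np.ndarray, y: int, x: int) -> bool:
    """Internal-region rule: the finger pixel at (y, x) is undamaged if it has
    an unobstructed vertical path (no white crack/busbar pixel) up OR down."""
    clear_up = not binary[:y, x].any()
    clear_down = not binary[y + 1:, x].any()
    return clear_up or clear_down

def proportional_damaged_area(damage: np.ndarray) -> float:
    """Equation (9): PAF = white pixels / total pixels * 100."""
    return float((damage > 0).sum()) / damage.size * 100.0
```

Pixels failing the path test are painted white into the damage image, whose white-pixel ratio then gives the PAF percentage.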

B. CRACK LENGTH MEASUREMENT
In order to estimate the length of the cracks within a solar cell, the image Z′ is harnessed as input after applying a short cycle of morphological closing operations, since it is already available from the algorithm in Section VI-A. The procedure requires a contour detecting algorithm that outputs bounding boxes, such as the one in [33], which is the one used in this work. The algorithm generates a collection of coordinates that define rectangular boundaries around the individual discernible objects present in the image.
It is assumed that each object corresponds to a distinct crack.The length of each crack is approximated using the diagonal measurement of its respective rectangle.
To complete the process, it is necessary to convert the measured crack length from pixels to millimeters. To do so, the ratio of the actual size of the cells to the size of the images is calculated. The cells have a dimension of 158.75 × 158.75 mm and each image is of size 400 × 400 pixels, allowing us to conclude that one pixel corresponds to 0.396875 mm.
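These steps can be sketched as follows (a hedged version in which SciPy's connected-component labeling stands in for the contour detector of [33]):

```python
import numpy as np
from scipy import ndimage

MM_PER_PIXEL = 158.75 / 400   # = 0.396875 mm, from the cell/image dimensions

def crack_lengths_mm(binary: np.ndarray) -> list:
    """Approximate each crack's length by the diagonal of the bounding box of
    its connected component, converted from pixels to millimeters."""
    labeled, n = ndimage.label(binary > 0)
    lengths = []
    for sl in ndimage.find_objects(labeled):
        h = sl[0].stop - sl[0].start          # bounding-box height in pixels
        w = sl[1].stop - sl[1].start          # bounding-box width in pixels
        lengths.append(np.hypot(h, w) * MM_PER_PIXEL)
    return lengths
```

Each connected white region of Z′ is assumed to be one distinct crack, matching the assumption stated above.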
In practice, the algorithm outputs the numerical values representing the measured lengths of the cracks. Nevertheless, for the reader's convenience, Fig. 15 depicts a graphical representation of the crack lengths of an example image.

C. PROPORTIONAL DAMAGED AREA GENERATED BY COLD SOLDERING
Similar to cracks, the harm caused by cold solders can be measured quantitatively by determining the affected area. In this case, the area of interest is the one corresponding to the stain caused by the flaw. Such subtle details are quite difficult to measure using image processing techniques alone, so the use of ML was again considered. Since the ViT-µ is already being used for flaw classification, the visualization of its attention by means of attention rollout was considered as a way of highlighting the cold solder stain.
The output of the attention rollout is a grayscale image with values between 0 and 255; however, a clear boundary is needed to define what is and what is not a flawed area. To define it, a simple threshold function was used, described in (10):

D(i, j) = 255 if A(i, j) > µ(A) + β σ(A), and 0 otherwise, (10)

where D is the image containing the cold solder damaged area, A is the raw attention rollout output, σ(A) is the standard deviation of the entire image A, µ(A) is the likewise global mean, and β is a constant whose optimal value depends on which segmentation method is being used, albeit it usually lies between 0.25 and 1.5. For ViT-µ, the optimal value of β is 0.8. At this point, (9) is also used to determine the proportional damaged area. Fig. 16 contains images from each step of this process. Additionally, performance evaluation metrics are discussed in Section VII.
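The thresholding step can be sketched as follows (a hedged sketch: the exact functional form of the threshold is an assumption consistent with the description of the global mean, standard deviation, and β; β = 0.8 is the value reported for ViT-µ):

```python
import numpy as np

def threshold_attention(A: np.ndarray, beta: float = 0.8) -> np.ndarray:
    """Binarize the attention rollout A at mu(A) + beta * sigma(A); pixels
    above the threshold are marked as cold-solder damage (255)."""
    t = A.mean() + beta * A.std()      # assumed threshold form, see lead-in
    return np.where(A > t, 255, 0).astype(np.uint8)
```

The resulting binary image D is then fed to equation (9) to obtain the proportional damaged area.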

VII. RESULTS
In this section, we present and thoroughly discuss the results obtained in the present work. Our focus is directed toward assessing the accuracy of the ViT-µ model and analyzing the functionality and efficacy of the attention maps as a flaw visualization tool in solar cells.

A. CLASSIFIER EVALUATION
Two distinct evaluation tasks were undertaken to assess the ViT-µ classifier model. The first task involved a binary classification objective, focusing solely on detecting the presence or absence of a flaw. The corresponding results for this scenario are presented in Table 3.
The second task was a 3-class classification, considering all the previously mentioned classes (''crack,'' ''cold solder,'' and ''no flaw''). The obtained results for this task are presented in Table 4. The training dataset consisted of images from all three classes, and the same model was employed for both tasks. The sole distinction was in the flaw detection task, where the statistics for the ''Crack'' and ''Cold Solder'' classes were combined and treated as a single class denoted ''Flaw.'' In general, the flaw detection task outperformed the 3-class classifier in terms of accuracy. The results suggest that most errors occur within the flaw classes themselves. This is evident from Table 5, which shows that the number of errors involving the ''No Flaw'' category (24 errors) is considerably lower than the number of errors between the ''Crack'' and ''Cold Solder'' categories (31 errors). It is worth noting that a significant proportion of errors were misclassifications as ''Cold Solder,'' which was expected. This can be attributed to the weighting of classes during training, where the ''Cold Solder'' class had a higher weight value (four) compared to the ''No Flaw'' and ''Crack'' classes (weight value of one). The weighting was chosen to improve the recall score of the ''Cold Solder'' class, which was around 0.73 in initial tests.
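The effect of the class weighting described above can be illustrated with a small, hypothetical weighted cross-entropy computation (the paper does not give its training code; the weights [1, 1, 4] are the ones stated in the text):

```python
import numpy as np

# Weights stated in the text: "No Flaw" and "Crack" get weight 1,
# "Cold Solder" gets weight 4, so cold-solder samples contribute more
# to the loss, pushing the model toward higher recall on that class.
CLASS_WEIGHTS = np.array([1.0, 1.0, 4.0])  # [no_flaw, crack, cold_solder]

def weighted_cross_entropy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean weighted cross-entropy. probs is an (N, 3) array of softmax
    outputs; labels is an (N,) array of integer class ids."""
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]  # prob of true class
    losses = -np.log(picked + eps) * CLASS_WEIGHTS[labels]
    return float(losses.mean())
```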

B. SEGMENTATION EVALUATION
The damaged area algorithms can be viewed as a form of semantic segmentation into a boolean image; as such, the Sørensen-Dice index [34] was deemed a fitting evaluation metric. To perform this evaluation, 150 images were randomly selected from the test dataset and label images were created by hand. This reduced dataset consists of 75 cold solder images and 75 cracked cell images.

1) SØRENSEN-DICE INDEX
The Sørensen-Dice index, or Dice coefficient, is a metric for evaluating semantic segmentation tasks [34]. Mathematically, it is given by

Dice = 2TP / (2TP + FP + FN), (11)

where TP is the number of pixels considered true positives, FP the number of false positives, and FN the number of false negatives. More intuitively, (11) can be expressed in terms of set operations. Let G denote the set of every pixel predicted to be in the affected area, and S the set of every pixel in the label image identified as affected area. Then, the Dice coefficient can be rewritten as

Dice = 2|G ∩ S| / (|G| + |S|). (12)

More interestingly, comparing (11) with the F1-score formula,

F1 = 2TP / (2TP + FP + FN),

the similarity between these metrics is evident, showing that they are closely related.
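The set form of the Dice coefficient above translates directly into a few lines of NumPy over boolean masks (the function name is illustrative):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, label: np.ndarray) -> float:
    """Sørensen-Dice index between two boolean masks:
    2|G ∩ S| / (|G| + |S|), where G is the predicted affected area
    and S the hand-labeled one."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    intersection = np.logical_and(pred, label).sum()
    denom = pred.sum() + label.sum()
    # Two empty masks agree perfectly, so return 1.0 instead of 0/0.
    return 2.0 * intersection / denom if denom else 1.0
```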
2) COLD SOLDERING SEGMENTATION
As mentioned in Section VI-C, some form of attention visualization image is required to execute the segmentation. For the ViT-µ and the DeiT-T, the attention rollout method was applied; for ResNet-50, the heatmap plot for the cold solder class was used. Results, in terms of Dice coefficient values, are presented in Table 6 and Fig. 17. Different values of β were evaluated to find the optimal value for each model: 0.8 for the ViT-µ, 0.6 for the DeiT-T, and 1.2 for ResNet-50. In addition, examples visually comparing the segmentation of these models are shown in Fig. 18.
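For reference, attention rollout itself reduces to a short recurrence: per-layer attention matrices are head-averaged, augmented with the residual connection, re-normalized, and multiplied together. A minimal sketch, assuming the per-layer attentions are available as arrays of shape (heads, tokens, tokens):

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout (Abnar & Zuidema, 2020): propagate attention
    through all layers. `attentions` is a list of per-layer attention
    matrices, each of shape (heads, tokens, tokens)."""
    tokens = attentions[0].shape[-1]
    rollout = np.eye(tokens)
    for layer in attentions:
        a = layer.mean(axis=0)                  # average over heads
        a = a + np.eye(tokens)                  # account for the residual
        a = a / a.sum(axis=-1, keepdims=True)   # re-normalize rows
        rollout = a @ rollout                   # compose with lower layers
    return rollout
```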
The Dice index values indicate that, on average, the ViT-µ model segments cold solder images better than the pre-trained models. At the same time, Fig. 18 shows that ViT-µ has a better average performance on images with both strong and weak cold solder stains, even though the DeiT-T model better approximates the shape of a strong stain.

3) CRACK SEGMENTATION
The crack damage estimation described in Section VI-A achieved a Dice coefficient of 0.601. While this result did not exceed the findings in [26], it is worth emphasizing that both crack and cold soldering segmentation have the advantage of not requiring the manual creation of labeled images for each cell image in the dataset.
Furthermore, as depicted in Fig. 13, the manually drawn label image is not flawless, which could be another contributing factor to the relatively lower Dice index value obtained.

C. COMPARISON WITH OTHER MODELS
To validate the ViT-µ model, five pre-existing models (listed in Table 2) were applied to the task of flaw classification. The corresponding results are presented in Table 6.
The obtained results highlight a few key points. First, the ViT-µ model achieved slightly lower accuracy than the other, larger models, but the difference is not significant. Additionally, it demonstrated superior performance in segmenting cold solder images. An interesting addendum is that the modified ResNet-50 model achieved better results when using cell images without any preprocessing (0.962 accuracy) than when the entire cell cleansing algorithm was applied (0.942 accuracy). This indicates that the proposed image processing algorithm is not a perfect fit for every ML method and should be used cautiously.
The first two points present a contradiction when it comes to cold solder images: should the focus be on more precise classification or on better segmentation to accurately assess their condition? To address this dilemma, a solution is proposed: use the Swin-T (or an equivalent) model for classification and, if an image is classified as a cold solder, subsequently pass it through the ViT-µ model. As only the ViT-µ's attention visualization is of interest here, the model's classification head can be removed for inference, reducing its number of parameters by over half, down to 1.01 million. This approach allows both models to run together on a single machine, effectively addressing the conflicting objectives.
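The proposed two-stage pipeline can be sketched as a thin wrapper around the two models. The `classify` and `segment_attention` callables below are hypothetical stand-ins for the real Swin-T and headless ViT-µ inference calls:

```python
# Two-stage inspection: a stronger classifier (e.g. Swin-T) assigns the
# class, and only images labelled "cold solder" are also passed to the
# headless ViT-u for attention-based segmentation. The model calls are
# represented by injected callables, not real model code.

COLD_SOLDER = "cold solder"

def inspect_cell(image, classify, segment_attention):
    """Return (predicted_class, attention_map_or_None)."""
    predicted = classify(image)
    if predicted == COLD_SOLDER:
        # Run the (cheaper, headless) ViT-u only when segmentation
        # of the cold solder stain is actually needed.
        return predicted, segment_attention(image)
    return predicted, None
```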

VIII. CONCLUSION
The ML algorithms employed in this study exhibited promising results, particularly in the context of flaw detection, and the outcomes achieved thus far are highly satisfactory.
Remarkably, based on the data provided by BYD, our ViT-µ model proved to be a strong competitor to the baseline algorithms, with at most an approximately 4% reduction in flaw classification accuracy while having only 10% of the number of parameters of the most successful model, the Swin-T, and achieving over 97% accuracy on the flaw detection task. This outcome underscores the effectiveness of our approach. In fact, comparing the results presented in Table 6, our ViT-µ model emerges not only as a cost-effective alternative but also as a robust weakly supervised method for accurately estimating the affected area of cold solders.
Moreover, training and applying models previously used only as flaw detectors as flaw classifiers demonstrates how malleable they can be. Namely, our proposed strategy of using the ViT-µ model as a standalone estimator for cell damage, complemented by another model with superior classification capabilities, shows great potential, as it leverages the strengths of both models. Importantly, when used only for segmentation inference, our ViT-µ model boasts remarkable efficiency, with a modest parameter count of just over 1 million, compared to the Swin-T model's 22 million parameters. By integrating these models, the increase in overall resource requirements is negligible, making the combination a practical and effective tool for assisting humans on the production line. The ViT-µ model excels at identifying questionable cells for further evaluation and at accurately estimating the extent of their damage, thereby contributing to improved operational efficiency and quality control.
The damage quantification algorithms as a whole have demonstrated outcomes that surpass the minimal adequacy standard, effectively measuring the extent of damage in each cell. In particular, the measurement of damaged areas for cold soldering exhibits great potential, as it requires no additional labor prior to training apart from labeling the cell images with their corresponding class. Similarly, crack damage evaluation can be carried out with minimal additional preparation, such as parameter adjustments.
One important aspect of these measurements is that data obtained from a single cell can be combined with data from other cells within the panel, enabling a comprehensive evaluation of the product's quality based on the manufacturer's specifications.
Regarding potential improvements to the tool, primary options include training a deeper network with randomized weights, which has shown better results for cold solder segmentation, as depicted in Table 6. A more comprehensive and accurately labeled dataset, along with individual classifiers for each category that leverage their specific characteristics, would also be desirable.

FIGURE 2. Flowchart of the proposed algorithm.

FIGURE 4. Output of the panel border removal.

FIGURE 6. Example of the process of image binarization.

FIGURE 10. Diagram showing the fingers and busbars of a photovoltaic cell, taken from [32].

FIGURE 11. Six regions in which the cell image was divided for proper calculation of the affected area.

FIGURE 12. Illustration showing how cracks cause interruption of current in a cell, based on [32].

FIGURE 13. Damaged area caused by a crack.

FIGURE 14. The processed cell image was partitioned into six regions.

FIGURE 16. Estimated damaged area caused by cold soldering.

FIGURE 17. Segmentation performance for different values of β in every model that supports it.

FIGURE 18. Comparison between the different models' segmentation using their respective optimal value of β.

TABLE 1. Hyper-parameters and training parameters used for ViT-µ.

TABLE 2. List of models used in this paper.

TABLE 3. Results obtained by the ViT-µ for the flaw detection task.

TABLE 4. Results obtained by the ViT-µ classifier for all 3 classes.

TABLE 5. Confusion matrix generated by the ViT-µ classifier for all 3 classes.

TABLE 6. Summary of the results obtained by all classifiers used in this work.