DPiT: Detecting defects of Photovoltaic Solar Cells with Image Transformers

Solar energy is one of the most important clean and renewable alternatives to traditional fuels. The collection of solar energy mainly relies on photovoltaic solar cells. Defects such as microcracks and finger interruptions can significantly reduce the efficiency of photovoltaic solar cells. To address this problem, defect detection on solar cells has attracted the attention of many researchers. In this paper, we propose a transformer based network to detect defects on solar cells efficiently and effectively. First, we introduce convolutions into the transformer to make the input embeddings, the positional information of patches and the spatial context more accurate and precise. Secondly, a cross window based multi-head self-attention (CW-MSA) is proposed to enlarge the window relation modeling capacity via a strong attention mechanism, serving as an effective alternative to the original counterpart. Finally, we propose a multi-scale aggregation block that merges low-level features into deep, semantically strong features by attention to obtain accurate geometry information. Extensive experiments on the elpv dataset demonstrate that DPiT consistently brings significant improvements over its strong baseline, Swin Transformer, with little extra computational overhead. Visualization results show that the proposed DPiT detects various complex defects correctly. In particular, DPiT achieves an impressive 91.7% top-1 accuracy and greatly outperforms other competitive counterparts.


I. INTRODUCTION
With the rapid development of technology, people are paying more attention to environmental protection than ever before. Clean and renewable energy will become progressively more important as time goes on. As a good alternative to traditional energy, solar energy is one of the most important resources for solving the energy issue. People mainly harvest solar energy with solar cells, which can be divided into monocrystalline and polycrystalline cells. Polycrystalline solar cells are the most widely used solar cells in the world, and their crack detection is much harder than that of monocrystalline cells. As shown in Fig. 1, photovoltaic solar cells may get damaged during the solar energy collection process. For example, thermal stress and improper operations can crack photovoltaic solar cells and cause cell breakages. These defects are the main reason why the power efficiency of photovoltaic solar cells declines gradually. To prevent solar cells from suffering severe electrical power loss, the defects need to be detected promptly to keep the production process stable.
As one of the most challenging problems, defect inspection of solar cells has attracted much attention from researchers in the past few years. Previous works mainly detect defects of solar cells by physical methods. The work [1] introduces current-voltage (IV) characteristics for silicon cell characterization. Besides, thermography, which is implemented only on infrared images, has been adopted for defect detection [2]-[4]. However, the work [2] points out that infrared images have a lower resolution, which makes microcrack detection much more difficult. Electroluminescence imaging [5], [6] is an improved alternative to thermography. Indeed, electroluminescence (EL) imaging is one of the important technologies for defect visualization of photovoltaic solar cells [7], and the automatic processing of EL data makes it possible to analyze defects in high-resolution solar cell images. The work [5] introduces Fourier image reconstruction into the field of defect detection in EL data. Tseng et al. [8] first list some regions with high confidence scores and find the target regions by binary feature clustering. Some previous works [9] detect defects of solar cells with hand-crafted methods; for example, the work [9] proposes dimensional matched filters to inspect the cracks on solar cells.
Methods based on deep learning have dominated image classification and detection in the past few years. The Convolutional Neural Network (CNN) has been widely used in computer vision and plays an important role in image classification, object detection and other tasks, showing that CNNs have led the development of computer vision. For example, the pioneering work AlexNet [10] proved that a CNN can achieve better performance than traditional image classification methods. He et al. [11] introduce ResNet to show that residual learning is able to considerably ease the optimization of deep convolutional neural networks. As for object detection, Fast R-CNN [12] and Faster R-CNN [13] first obtain multiple regions of interest and extract region features; these regions are further fed into the detection head for classification and bounding box regression. Inspired by these milestone works, some works [14]-[16] have shown that CNNs also achieve impressive performance on defect detection. The work [17] adopts a novel method called the normalized Lp norm to solve the problem of weakly supervised segmentation of cracks on solar cells. Recently, a U-Net with a region proposal network [18] was proposed to accurately localize and segment polycrystalline silicon defects. Besides, the method adopts dilated convolutions to enlarge the receptive field and make defect segmentation more precise.
However, a convolutional operation only processes a local neighborhood and fails to capture the long-range dependencies among pixels [19]. Different from CNNs, the transformer is able to model global relations [20]. The transformer architecture was first proposed for machine translation in the field of natural language processing [21]. The encoder network in the transformer is composed of self-attention operations in conjunction with multi-layer perceptrons (MLP), obtaining remarkable results in capturing long-range dependencies. Transformers have since become the state-of-the-art method in many NLP tasks [21]. The success of the transformer in NLP has led to many attempts at adapting the transformer architecture to images. Recently, many visual transformers [22], [23], [25] have been proposed and achieve state-of-the-art performance in the field of image classification. ViT [22] first splits an image into many non-overlapping patches and models the relations among these patches with a transformer encoder. DeiT [23] points out that ViT only outperforms the state-of-the-art CNN methods when the training data is sufficient, and introduces better hyperparameter settings and a distillation strategy to train a data-efficient ViT. CPVT [24] proposes dynamic positional encodings implemented by convolution with zero paddings so that the transformer can receive images with different resolutions. The attention maps become very similar in the deep layers of a transformer, which may lead to attention collapse. To solve this problem, DeepViT [25] introduces a re-attention mechanism that combines the attention maps of different heads to enhance their diversity. Other works, such as T2T-ViT [26] and PVT [27], explore efficient transformer design principles. To reduce the computation cost of the transformer, Swin Transformer [28] introduces Window based MSA (W-MSA) and Shifted Window based MSA (SW-MSA).
This hierarchical transformer can serve as a general-purpose backbone and achieves state-of-the-art performance on image classification and other dense recognition tasks.
However, there is still a gap between the data of polycrystalline solar cells and general images. For example, semantically strong features in deep layers are generally utilized for classification and detection in current frameworks, while low-level features are usually deemed rich in geometric patterns and stripes, contrary to high-level features. As for the defect detection of polycrystalline cells, defects such as microcracks and finger interruptions are mainly contained in the low-level features. It is intuitive that the stripes in the image of a polycrystalline solar cell carry rich discriminative information, which is beneficial for detecting defects in the image. Inspired by this, we propose DPiT, which builds upon the Swin Transformer to effectively and efficiently detect the defects of polycrystalline solar cells. The patch embedding layer in the Swin Transformer directly splits the image into non-overlapping windows, failing to model more precise information, so we add a fully convolutional network to enhance feature extraction. As for the self-attention layer, we introduce Cross Window based MSA (CW-MSA) to replace the SW-MSA and enlarge the window relation modeling capacity. We also model spatial context in the MLP, which is otherwise applied only along the channel dimension. Dynamic positional encodings based on convolution with zero paddings are added before every stage of the transformer. Besides, we introduce a bottom-up pathway, parallel to the backbone forwarding, to gradually integrate both high-level and low-level features with an attention mechanism.
To demonstrate its robustness and effectiveness, we test our proposed DPiT on the elpv dataset. The results show that DPiT consistently outperforms the original Swin Transformer and other CNN based networks by a large margin.
The main contributions of this paper are as follows:
• We introduce convolutions into the patch embeddings, positional encodings and MLP to make the input features of the transformer, the positional information of patches and the spatial context more accurate and precise.
• Cross Window based MSA (CW-MSA) is proposed to replace the original SW-MSA and enlarge the window relation modeling capacity via a strong attention mechanism.
• We propose a multi-scale aggregation block that merges low-level features into deep, semantically strong features by attention before classification to obtain accurate geometry information.
• Our proposed DPiT outperforms the baseline Swin Transformer and achieves very impressive performance on the elpv dataset.

A. OVERALL FRAMEWORK
As shown in Fig. 2, our proposed DPiT builds upon the Swin Transformer architecture. The first step of the classification pipeline is to split the input 2D image with resolution H × W into a sequence of non-overlapping 2D patches. This sequence of patches is then fed into a patch embedding block and projected to a dimension of 48. Following the default settings of the Swin Transformer, patches of size 4 × 4 are merged as the inputs of the transformer. The hierarchical backbone of DPiT is composed of 12 DPiT blocks, and the patches are downsampled via patch merging layers. We split the transformer into 4 stages as shown in Fig. 2. Stage 1 receives the output of the linear embedding layer with shape 48 × H/4 × W/4 and generates extracted features with the same resolution but a larger channel dimension C. A patch merging layer is inserted at the bottom of "Stage 2", which merges the features of groups of 2 × 2 adjacent patches. The downsampled features are further fed into transformer blocks after this neighboring feature merging. In a similar way, "Stage 3" and "Stage 4" receive inputs with resolutions H/8 × W/8 and H/16 × W/16 and downsample them by 2×. Finally, a global average pooling layer and a linear classifier are applied to the output features of "Stage 4". On top of this backbone, we incorporate four effective components tailored to the defect detection task of polycrystalline solar cells.
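The shape bookkeeping above can be sketched as follows. This is a minimal illustration, not the actual model; the channel schedule after the embedding (96 doubling per stage) is an assumption following common Swin configurations.

```python
# Sketch of the hierarchical feature shapes in a Swin-style backbone like DPiT:
# a 4x patch embedding, then 2x spatial downsampling and channel doubling per stage.

def stage_shapes(h, w, c=96, num_stages=4, patch=4):
    """Return (channels, height, width) of the feature map entering each stage."""
    shapes = []
    h, w = h // patch, w // patch          # patch embedding: 4x reduction
    for i in range(num_stages):
        shapes.append((c * (2 ** i), h // (2 ** i), w // (2 ** i)))
    return shapes

print(stage_shapes(224, 224))
```

For a 224 × 224 input, Stage 1 sees a 56 × 56 token grid; each later stage halves the resolution and doubles the channels via patch merging, ending at 7 × 7 before global average pooling.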

B. PATCH EMBEDDING
The original patch embedding block in the Swin Transformer partitions the input 2D image into non-overlapping patches. We reshape the input image x ∈ R^{3×H×W} to obtain the patch grids. A linear embedding layer is then applied so that the transformer can use a constant latent dimension C as the feature channel through all of its blocks. This patch partition process is very simple, but it cannot capture low-level information in the earlier layers and ignores the relations among pixels near the borders of the patches. To effectively learn representations of the low-level information in the image, we additionally introduce convolutions into the patch embedding block to perform resolution reduction, exploiting their strong inductive biases. As shown in Fig. 3, we insert 4 layers of 3 × 3 convolutions, each followed by a BN (Batch Normalization) [29] layer and a GELU activation [30] function, to directly extract features. To avoid breaking the initial behavior of the pre-trained model, we design these 4 convolution layers as a separate extraction path and initialize the weight of the last BN layer to zero. The extraction process of the patch embedding block can be illustrated as:

z_0 = F_linear(x) + F_conv(x),

where z_0 denotes the patch embeddings, and F_linear and F_conv represent the original linear patch embedding and the convolutional patch embedding, respectively. We do not add position embeddings to the patch embeddings in this block. The patch embeddings z_0 are the input to the transformer encoder.
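The zero-initialization trick above can be demonstrated with a minimal numpy sketch. The linear path below is a faithful patch-flattening projection; the convolutional path is a random placeholder standing in for the four 3 × 3 conv + BN + GELU layers, with the last BN scale (gamma) set to zero so that the block initially behaves exactly like the plain linear embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_patch_embed(x, proj, patch=4):
    # x: (H, W, 3) image; split into non-overlapping patch x patch tiles,
    # flatten each tile and project it to C channels with `proj`.
    h, w, cin = x.shape
    gh, gw = h // patch, w // patch
    p = x.reshape(gh, patch, gw, patch, cin).transpose(0, 2, 1, 3, 4)
    p = p.reshape(gh * gw, patch * patch * cin)
    return p @ proj                                # (num_patches, C)

x = rng.standard_normal((8, 8, 3))
proj = rng.standard_normal((4 * 4 * 3, 48))
conv_features = rng.standard_normal((4, 48))       # placeholder conv-path output
gamma = 0.0                                        # last BN scale, zero-initialized

z0 = linear_patch_embed(x, proj) + gamma * conv_features
# At initialization the conv path is silent, so pre-trained behavior is preserved.
assert np.allclose(z0, linear_patch_embed(x, proj))
```

As training updates gamma away from zero, the convolutional path gradually contributes low-level features without disturbing the pre-trained starting point.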

C. TRANSFORMER BLOCK
The transformer block in the transformer encoder consists of a multi-head self-attention (MSA) layer and an MLP. Unlike the standard multi-head self-attention layer in other visual transformers, a window based multi-head self-attention (W-MSA) layer is adopted in the baseline model, together with a shifted window partitioning approach that introduces cross-window connections. This cyclic shift method can still be improved, since only a few pixels of each window are shifted during the window based self-attention computation. The MLP in the transformer block is composed of 2 fully-connected layers with a GELU non-linearity in between, which lacks the ability to model spatial relations. To solve these problems, we introduce cross-window multi-head self-attention (CW-MSA) to capture salient window relations and an MLP with a spatial context module (SC-MLP) to promote the spatial modeling power of the MLP, as shown in Fig. 4. The forward process of two consecutive transformer blocks can be illustrated as:

ẑ_l = W-MSA(LN(z_{l−1})) + z_{l−1},  z_l = SC-MLP(LN(ẑ_l)) + ẑ_l,
ẑ_{l+1} = CW-MSA(LN(z_l)) + z_l,  z_{l+1} = SC-MLP(LN(ẑ_{l+1})) + ẑ_{l+1},

where l, ẑ and z denote the index of the transformer block, the output of the self-attention layer and the output of the MLP, respectively; W-MSA, CW-MSA and SC-MLP denote the window based multi-head self-attention layer, the cross-window multi-head self-attention layer and the MLP with spatial context augmentation.

D. WINDOW BASED SELF-ATTENTION
In our implementation, an image is partitioned into (h/M) × (w/M) non-overlapping windows. We first split a window into M × M patches and compute the self-attention matrix of all patches within the local window in the window based self-attention layer. The computation process of W-MSA is:

Attention(Q_1, K_1, V_1) = Softmax(Q_1 K_1^T / √d + B_1) V_1,

where Q_1, K_1, V_1 denote the query, key and value with shape M^2 × d, d is the channel dimension of the feature, and B_1 is the position embedding bias. W-MSA is able to model the global relations among all the patches in a local window. By computing the dot-product, the similarity between different patches in the window is calculated, resulting in long-range and global attention. To further enhance the capacity for global context, we design CW-MSA to capture cross-window relations. The forwarding of CW-MSA is similar to W-MSA:

Attention(Q_2, K_2, V_2) = Softmax(Q_2 K_2^T / √d + B_2) V_2,

where Q_2, K_2, V_2 denote the query, key and value computed across windows and B_2 is the corresponding position embedding bias.

The MLP (multi-layer perceptron), which applies linear transformations to the features, plays a crucial role in almost all transformer-based architectures. The MLP performs point-wise operations applied to each token separately. It is composed of two fully-connected layers with a non-linear activation in between. Given an input feature z from the self-attention layer, the MLP transforms the feature as follows:

MLP(z) = σ(z W_1 + b_1) W_2 + b_2,

where W_1 ∈ R^{C×K} and W_2 ∈ R^{K×C} denote the weights of the first and the second layer, respectively. In our implementation, we set K as 4C. b_1 and b_2 are the biases, and σ is the non-linear activation, which is GELU in most visual transformers.
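The windowed attention formula above can be sketched for a single M × M window. The projections and the bias B are random stand-ins for the learned parameters; this is an illustration of the Softmax(QKᵀ/√d + B)V computation, not the paper's implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, wq, wk, wv, bias):
    # tokens: (M*M, d) patches of one local window.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d) + bias)  # (M^2, M^2) patch relations
    return attn @ v

rng = np.random.default_rng(0)
M, d = 4, 32
tokens = rng.standard_normal((M * M, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
bias = rng.standard_normal((M * M, M * M))       # relative position bias B
out = window_attention(tokens, wq, wk, wv, bias)
assert out.shape == (M * M, d)
```

CW-MSA follows the same computation with queries, keys and values gathered across windows rather than within a single one.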
Obviously, the MLP only performs linear and non-linear transformations within each window, enhancing the representation ability of all windows, but the spatial information of the windows is ignored. In this paper, we introduce a spatial context module into the MLP to form SC-MLP, which extracts spatial representations and merges them into the semantic information. As shown in Fig. 4, we first adopt a convolution layer W_k^l with a kernel size of 3 × 3 and a softmax function to obtain the spatial attention weights A_l. The attention weights A_l are then multiplied by the input features ẑ_l. A 1 × 1 convolution W_p^l is used to linearly transform the spatial information into semantic information. Finally, we normalize the global context into channel attention weights with a softmax function, and the output z_l is obtained by scaling the features with these channel attention weights. The computation process of SC-MLP can be written as:

A_l = Softmax(W_k^l ẑ_l),  z_l = F_scale(ẑ_l, Softmax(W_p^l (A_l ⊙ ẑ_l))),

where F_scale is the channel-wise multiplication between the attention map and the features, and z_l denotes the output features of SC-MLP. We also tried different kernel sizes for W_k^l, such as 5 and 7; the convolution with a kernel size of 3 achieves the best speed-accuracy trade-off, so we adopt it by default.

In Swin Transformers, the attention maps in the self-attention layer are added with the same number of learnable absolute positional encoding vectors. The shape of these positional encodings is fixed, which prevents the model from handling input sequences that are shorter or longer than the original ones. To solve this problem, many visual transformers resize the fixed positional encoding vectors by bicubic interpolation, which still hurts the performance of the model. Recent works have observed that the tokens on the borders can be aware of their absolute positions if zero paddings are used, as illustrated in Fig. 6.
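The spatial-then-channel attention path of SC-MLP can be illustrated with a simplified numpy sketch. To keep it short, the 3 × 3 convolution W_k is replaced here by a per-token linear map, so this shows the data flow (spatial softmax, weighted aggregation, 1 × 1 transform, channel scaling) rather than the exact module.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, c = 49, 96                                    # tokens x channels
z = rng.standard_normal((n, c))
w_k = rng.standard_normal((c, 1))                # stand-in for the 3x3 conv W_k
w_p = rng.standard_normal((c, c))                # the 1x1 conv W_p

a = softmax(z @ w_k, axis=0)                     # spatial attention over tokens
context = (a * z).sum(axis=0)                    # (c,) attended global context
channel = softmax(context @ w_p)                 # channel attention weights
out = z * channel                                # F_scale: channel-wise scaling
assert out.shape == z.shape
```

The key point is that the output is the input feature rescaled per channel by a context vector that was itself pooled with spatial attention, giving the MLP a view of where defects sit in the window, not just what each token contains.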
Inspired by this, we introduce dynamic positional embeddings into the window features before every stage of the transformer. As shown in Fig. 6, we adopt a very simple depth-wise convolution with kernel size k (k ≥ 3) and zero padding of (k−1)/2 to dynamically assign positional information. The dynamic positional embedding is added back to the main-stream features. In our experiments, we tested depth-wise convolutions with k = 3, 5, 7 and a regular convolution with k = 3.

It is intuitive that the stripes in the image of a polycrystalline solar cell carry rich discriminative information, which is beneficial for detecting defects in the image. In an image classification network, the final high-level output features are semantically strong but lack precise low-level information, while low-level features are usually deemed rich in geometric patterns and stripes. Inspired by this, we propose a multi-scale aggregation block (MAB) to fuse as much low-level information as possible and enhance the defect detection ability of the transformer.
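The dynamic positional embedding can be sketched as a plain depth-wise convolution with zero padding. This is a naive loop implementation for clarity only; tokens near the border see the zero padding, which is what makes their outputs position-dependent.

```python
import numpy as np

def depthwise_conv2d(x, kernels, k=3):
    # x: (C, H, W) feature map; kernels: (C, k, k), one kernel per channel.
    c, h, w = x.shape
    pad = (k - 1) // 2                           # zero padding of (k-1)/2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = (xp[ch, i:i + k, j:j + k] * kernels[ch]).sum()
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 14, 14))
kernels = rng.standard_normal((8, 3, 3))
pos = depthwise_conv2d(feat, kernels)
feat = feat + pos                                # added back to the main stream
```

With a constant input and an all-ones kernel, an interior output sums a full 3 × 3 neighborhood while a corner output sums only the 2 × 2 unpadded part, so border tokens become distinguishable by position.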

G. MULTI-SCALE AGGREGATION BLOCK
Unlike some previous works [31], [32] that integrate multi-scale features by convolutions, we propose to facilitate multi-scale feature fusion by cross-attention. It is worth noting that we are the first to adopt attention to aggregate multi-level features. As shown in Fig. 7, we first downsample the low-level feature x_{i−1} of stage i with a 3 × 3 convolution of stride 2. We then adopt an attention mechanism to fuse the downsampled low-level feature x_{i−1} and the higher-level feature x_i. An attention function can be described as mapping a query and a set of key-value pairs to an output; accordingly, we treat the low-level feature x_{i−1} as the query and the high-level feature x_i as the key and value. Our motivation is simple: treating the low-level feature as the query selects the most similar features, which contain rich geometric stripe information, to help defect detection. We apply a linear layer on x_{i−1} to get the query and two linear layers on x_i to get the key and value. Different from the aforementioned W-MSA and CW-MSA, we model the relationships among all patches in the image to get precise and accurate low-level information. A residual connection is added to ease optimization. We insert two MABs at Stage 3 and Stage 4 to construct a bottom-up pathway parallel to the transformer. The forward process of the MAB is:

x̂_i = Softmax((W_q^i x_{i−1})(W_key^i x_i)^T / √d_i)(W_v^i x_i) + x_i,

where x̂_i denotes the output of the MAB and d_i is the channel dimension of the feature at stage i. W_q^i, W_key^i and W_v^i denote the linear layers in the MAB.
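The cross-attention fusion above can be sketched in a few lines of numpy. The shapes and the choice of attaching the residual to the high-level feature are illustrative assumptions; the point is that the low-level feature supplies the query while the high-level feature supplies key and value.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 196, 96
x_low = rng.standard_normal((n, d))              # downsampled low-level feature
x_high = rng.standard_normal((n, d))             # high-level feature at stage i
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

q, k, v = x_low @ wq, x_high @ wk, x_high @ wv   # query from low level
attn = softmax(q @ k.T / np.sqrt(d))             # relations across all patches
fused = attn @ v + x_high                        # residual connection (assumed)
assert fused.shape == (n, d)
```

Unlike the windowed attention in the backbone, this attention matrix spans all patches, so geometric stripe information from anywhere in the image can be pulled into the deep features.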

H. TRAINING LOSS
Assume the data loaded from the dataset is composed of a sequence of image-label pairs {(X_1, Y_1), ..., (X_N, Y_N)}, where X_i and Y_i denote the input image and the corresponding defect label, and N denotes the number of images. We use the standard cross-entropy loss to optimize the transformer during the training stage:

L = −(1/N) Σ_{i=1}^{N} [Y_i log p_i + (1 − Y_i) log(1 − p_i)],

where p_i denotes the predicted probability that Y_i is 1. We only consider the cases Y_i = 0 and Y_i = 1, since there are only two classes in our defect detection task.
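The binary cross-entropy above is straightforward to compute; a small sketch, with clipping added for numerical stability:

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    # L = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    p = np.clip(p, eps, 1.0 - eps)               # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])               # defect labels
p = np.array([0.9, 0.1, 0.8, 0.2])               # predicted defect probabilities
loss = bce_loss(p, y)
# Confident correct predictions score lower than an uninformative 0.5 guess.
assert loss < bce_loss(np.array([0.5] * 4), y)
```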

A. DATASETS
We conduct experiments on the elpv dataset [33]-[35] of solar cell images extracted from high-resolution electroluminescence images of photovoltaic modules. The elpv dataset consists of 2624 samples of 8-bit grayscale images of functional and defective solar cells with varying degrees of degradation. The defects in the images are known to reduce the power efficiency of solar modules. The training set, which consists of 2298 images from 2 classes, is used for training. We report the top-1 accuracy on a single crop by default on the validation set, which has 326 images. The labels of the dataset are defect likelihoods assigned by experts and fall into four categories. A label of 0 percent denotes that the expert is certain the solar cell is not defective. A label of 33 percent denotes that the expert is not sure the solar cell is not defective. A label of 67 percent denotes that the expert is not sure the solar cell is defective. Finally, a label of 100 percent denotes that the expert is certain the solar cell is defective. The detailed label distribution is shown in Table 1.

B. EVALUATION METRICS
We evaluate the performance of our proposed model with the standard evaluation metrics: accuracy, precision, recall and F_1 score, calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F_1 = 2 · Precision · Recall / (Precision + Recall),

where TP and TN denote the numbers of images with and without defects that are correctly classified, and FP and FN denote the numbers of images without and with defects that are incorrectly classified, respectively. Accuracy represents the probability of correct classification by our proposed transformer. Precision represents the proportion of images detected as defective that are actually defective. Recall denotes the proportion of defective images that are correctly detected by our model. The F_1 score is the harmonic mean of precision and recall.
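These four metrics follow directly from the confusion counts; a minimal sketch (positive class = defective):

```python
import numpy as np

def metrics(pred, true):
    # pred, true: arrays of 0/1 labels (1 = defective).
    tp = int(((pred == 1) & (true == 1)).sum())
    tn = int(((pred == 0) & (true == 0)).sum())
    fp = int(((pred == 1) & (true == 0)).sum())
    fn = int(((pred == 0) & (true == 1)).sum())
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)           # harmonic mean
    return acc, prec, rec, f1

pred = np.array([1, 1, 0, 0, 1, 0])
true = np.array([1, 0, 0, 0, 1, 1])
acc, prec, rec, f1 = metrics(pred, true)         # tp=2, tn=2, fp=1, fn=1
```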

C. IMPLEMENTATION DETAILS
We train DPiT on 8 GeForce GTX 1080Ti GPUs for 200 epochs (including 10 epochs of linear warm-up) and adopt a cosine decay learning rate scheduler. The AdamW optimizer is utilized for training with a batch size of 256. Unless otherwise noted, we use an initial learning rate of 0.0001 and a weight decay of 0.05. Data augmentation strategies such as mixup [36] and cutmix [37] are excluded during training since they may erase the regions containing defects, while most other augmentation strategies, including color jitter, rotation and flipping, are adopted. For initialization, we follow Kaiming initialization for all convolution layers.
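The schedule described above can be sketched as a per-epoch learning rate function. The decay-to-zero floor is an assumption; the stated recipe (200 epochs, 10 warm-up epochs, base learning rate 1e-4) is taken from the text.

```python
import math

def lr_at(epoch, base_lr=1e-4, total=200, warmup=10):
    """Cosine decay learning rate with linear warm-up."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup    # linear ramp to base_lr
    progress = (epoch - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Warm-up reaches the base LR exactly where the cosine branch starts,
# so the schedule is continuous at the hand-off.
assert lr_at(9) == 1e-4 and lr_at(10) == 1e-4
```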

D. MAIN RESULTS
In this subsection, we show the main results on the elpv dataset in Table 2. We first perform a comprehensive component-wise ablation study on Swin-Small, where the different components are gradually inserted to thoroughly analyze the effect of each one. As shown in Table 2, adding the fully-convolutional patch embedding path to the transformer lets DPiT-A achieve 90.5 top-1 accuracy, an absolute gain of 0.7 points. This improvement shows that the low-level information captured by the fully convolutional path is essential to defect detection. DPiT-B gains a further 0.3 points by replacing the vanilla Shifted Window based MSA with Cross-Window MSA. We also find that the spatial and channel attention captured by SC-MLP improves the classification performance of DPiT-C by 0.4 points, indicating that our proposed SC-MLP has stronger capacity for both spatial and semantic modeling. Compared to DPiT-C, DPiT-D with dynamic positional encodings boosts the accuracy of the model by another 0.3 points. Finally, the insertion of the MAB enriches the deep features with more low-level information and obtains a further 0.3 points. Besides, we also evaluate the precision, recall and F_1 score of our proposed models in Table 3. The gradual insertion of our modules arranges all models (from Swin-Small to DPiT) in increasing order on all metrics, which further verifies the effectiveness of DPiT.

E. COMPARISON WITH OTHER METHODS
We report the results of DPiT on the elpv dataset and compare it with other recent competitive photovoltaic defect detection methods. We focus only on accuracy, since precision, recall and F_1 score are omitted in some papers. It is worth noting that the input resolution of our model is only 224 × 224, significantly smaller than the resolutions used by other methods. The baseline Swin-Small already surpasses all these CNN-based approaches at an input resolution of 224 × 224, proving the powerful global relation modeling ability of the transformer. By adding the fully convolutional patch embedding path, the accuracy is boosted to 90.5, clearly higher than the baseline. Finally, with the cross-window based MSA, SC-MLP, dynamic positional encodings and multi-scale aggregation block included, the accuracy is further improved to 91.7, superior to the other CNN-based methods. Previous work reveals that large scale training trumps inductive bias in transformers, meaning that a transformer can achieve much better performance than a CNN when large scale data is available for training. However, we only use 2624 EL images to train our DPiT, so we speculate that DPiT could achieve even better performance with more training data.

F. ABLATION STUDIES ON PATCH EMBEDDING
As shown in Table 5, we conduct additional experiments to find the optimal setting of our proposed fully convolutional patch embedding. We first use two 3 × 3 convolutions with stride 2 to downsample the input image, followed by 3 × 3 convolution layers with stride 1 for feature refinement. The "3 × 3 Conv stride 1" column in Table 5 allows us to vary the capacity and computational cost of the patch embedding block. To investigate the trade-off between accuracy and parameters, we conduct experiments on DPiT with various numbers of 3 × 3 convolutions. More 3 × 3 convolutions lead to better classification performance but bring additional parameters and computation cost. To achieve a good balance between accuracy and complexity, we use two 3 × 3 convolutions for downsampling and two 3 × 3 convolutions for refinement.

TABLE 1. Label distribution of the elpv dataset.

Type            | Train: 0%  33%  67%  100% | Test: 0%  33%  67%  100% |    Σ
Polycrystalline |        683  132   37  301 |       237   46   13  101 | 1550
Monocrystalline |        438   87   41  249 |       150   30   15   64 | 1074
Σ               |       1121  219   78  550 |       387   76   28  165 | 2624

TABLE 2. Performance of our proposed methods on Swin-Small. Results are evaluated on the elpv dataset. "DPiT-A" denotes that the convolutional patch embedding path is inserted into Swin-Small. "DPiT-B" denotes that the Shifted Window based MSA in DPiT-A is replaced with Cross-Window MSA. "DPiT-C" denotes that SC-MLP is adopted in DPiT-B instead of the regular MLP. "DPiT-D" denotes the insertion of dynamic positional encodings in DPiT-C. Our proposed DPiT is constructed when all the proposed modules are implemented in Swin-Small.

G. ABLATION STUDIES ON SC-MLP
As shown in Table 6, we conduct additional experiments to examine the settings of our proposed SC-MLP. "add" denotes addition fusion without the last softmax operation, and "mul" denotes that the feature is multiplied by the normalized channel attention weights. The kernel size of 1 achieves the worst performance, since such a convolution lacks the ability to capture spatial information. Besides, the top-1 accuracy of "mul" is higher than that of "add", showing the superiority of multiplication. Therefore, we use the multiplication operation for feature aggregation by default in all the experiments.

H. ABLATION STUDIES ON DYNAMIC POSITIONAL ENCODINGS
Ablation experiments are conducted to find a proper kernel size for the dynamic positional encodings. In Table 7, we vary the kernel size of the module and obtain similar results (91.1% ∼ 91.3% top-1 accuracy) for kernel sizes of 3, 5, 7, 9 and 11, showing that the performance is robust to the kernel size. DPiT earns the maximum accuracy increase when the kernel size is 11; this relatively improved performance is a natural result of the larger receptive field.

I. ABLATION STUDIES ON MULTI-SCALE AGGREGATION BLOCK
The multi-scale fusion strategy of the multi-scale aggregation block (MAB) is similar to the milestone work FPN, except for the direction of the information flow. Both MAB and FPN merge precise low-level features with high-level features of coarser resolution. However, FPN propagates the high-level features at the top-most level down to the lower levels, aiming to enhance the semantic information contained in the low-level features that carry accurate localization information; this top-down information flow is the reverse of the bottom-up flow of the backbone. Different from the design principle of FPN, our motivation for introducing MAB is to make the deep features more discriminative by merging them with low-level features that contain accurate geometric stripe information. Another difference between FPN and MAB is the fusion operation. FPN adopts addition fusion followed by a regular 3 × 3 convolution to aggregate multi-scale features, but a vanilla 3 × 3 convolution only processes a local neighborhood and ignores global relations, whereas the relations among pixels from different scales are important for feature fusion. As shown in Table 8, we compare several fusion methods to find the best setting for MAB. Surprisingly, "Conv-Conv", "Conv-MSA" and "MSA-Conv" achieve the same performance, while the top-1 accuracy of "MSA-MSA" is slightly higher than the other three methods. This is a natural result of the global relation modeling ability of cross-attention.

J. VISUALIZATION OF DETECTION RESULTS
We visualize several detection results of DPiT on elpv in Fig. 8. These results demonstrate that our proposed model is able to focus on the defective areas in the solar cells.

K. VISUALIZATION OF TRAINING PROCESS
In this subsection, we analyze the different behaviors of Swin-Small and our proposed DPiT during training. The accuracy curves of Swin-Small and DPiT are shown in Fig. 9. Both transformer-based methods reach around 90% accuracy quickly, a natural result of training on small scale data. Notably, the red curve of DPiT is generally above the blue curve of Swin-Small, reflecting its superior classification performance. After around 20 epochs, the accuracy curves become stable and stay around 90%. As can be seen in Fig. 10, both transformers converge very quickly, and the losses of both models fluctuate after around 20 epochs. The scale of the elpv dataset is small for the transformer architecture, which may lead to fast convergence and loss fluctuation.

IV. CONCLUSION
In this paper, we propose a novel transformer based network to detect the defects of photovoltaic solar cells effectively. We first analyze the limitations of the Swin Transformer for photovoltaic defects detection.