JCS: An Explainable Surface Defects Detection Method for Steel Sheet by Joint Classification and Segmentation

For surface defect images captured from a practical steel production line, variations in the shape, size, location and texture of defect objects cause inter-class similarity and intra-class difference among defect images. Although attractive results have been achieved by some surface defect classification and segmentation methods, they still fall short of the needs of real-world applications because they lack adaptiveness. Considering that a surface defect image can be decomposed into a defect foreground image and a defect-free background image, this paper develops a novel joint classification and segmentation (JCS) approach to surface defect detection for steel sheet. It comprises a classification method based on class-specific and shared discriminative dictionary learning (CASDDL) and a segmentation method based on double low-rank based matrix decomposition (DLMD). In the proposed CASDDL method, we learn a shared sub-dictionary as well as several class-specific sub-dictionaries to explicitly capture the common information shared by all classes and the class-specific information belonging to each class. We adopt a mutual incoherence constraint on each sub-dictionary, a Fisher-like discriminative criterion, and a low-rank constraint on the coding vectors to improve the discriminative ability of the learned dictionary. In the proposed DLMD method, we formulate the segmentation task as a double low-rank matrix factorization problem, and introduce Laplacian and sparse regularization terms into the matrix decomposition framework. Experimental results demonstrate that the proposed JCS method achieves comparable or better performance than state-of-the-art methods in classifying and segmenting surface defects of steel sheet.


I. INTRODUCTION
Automated surface defect classification and segmentation based on machine vision are two essential and related tasks in the quality management of industrial products. In a real-time machine-vision-based surface defect detection system, the classification task distinguishes normal images from abnormal images, which is highly beneficial for improving the efficiency and accuracy of defect segmentation, whereas the segmentation task detects the locations and boundaries of defects, highlighting the critical defect regions for high-level image understanding [1].

The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Zunino.
As shown in Fig. 1, both the classification and segmentation tasks are challenging for the following reasons. Heterogeneous and scattered defects: the number and type of defects are generally unknown in advance, and different surface images often have different imaging qualities, i.e., low contrast between a defect and its surrounding surface tissue results in fuzzy defect boundaries. Cluttered and complicated background: the non-defective background may also differ greatly across images; several types of defect may appear in a single defect image, exhibiting substantial stochastic variability in shape, size, gray level, texture and location; inter-type defects may share visual similarities, while intra-type defects may differ visually. Over the past two decades, many efforts have been devoted to more efficient and accurate defect classification and segmentation methods [2], [3]. These approaches focus on two aspects, feature extraction and classifier design, and are basically customized for a predefined or specific type of defect. Besides, the low computational speed of these methods limits real-time detection. These factors motivate researchers to develop new methods for surface defect classification and segmentation.
Most recently, deep learning methods based on convolutional neural networks (CNN) and generative adversarial networks (GAN) have achieved remarkable performance in image classification and segmentation, and some studies have attempted to adopt them for defect detection [4], [5]. As mentioned in [6] and [7], these deep learning models are complex with many parameters: training them requires a huge number of expert-labelled samples and complex optimization algorithms, and their complex network structures consume significant computing resources, which is a major challenge in industrial environments. Moreover, defective samples are difficult to obtain because the probability of defect occurrence in industrial manufacturing is very low. In particular, these deep learning models lack sufficient theoretical support and mostly rely on human experience, which limits their practical use.
Lately, dictionary learning has been successfully applied to many machine vision problems [8]-[10]. Sparse representation-based classification (SRC) [11] used the original training data directly as a dictionary, and Aharon et al. [12] proposed the K-SVD method to learn an over-complete dictionary from the training data. Ramirez et al. [13] developed a structured incoherence regularization term for dictionary learning (DLSI) to promote independence between different sub-dictionaries. Ling et al. [14] developed a class-oriented discriminative dictionary learning (CODDL) method to emphasize class discrimination of dictionary atoms and representation coefficients. Fan et al. [15] exploited discriminative Fisher embedding dictionary transfer learning (DFEDTL) to preserve the inter-class differences and intra-class similarities of training samples. As shown in Fig. 1, the defect object in a surface image can be regarded as a local anomaly against a relatively homogeneous background. The background texture is useful for reconstruction rather than discrimination. In the aforementioned dictionary learning methods, most atoms are used to represent the non-defective background, so only a small fraction of atoms represents class-specific defects. Therefore, the discrimination between class-specific sub-dictionaries of different defect objects diminishes, greatly degrading classification performance. An intuitive remedy is to capture and separate those shared components from the training samples. Recent research has yielded promising results using the idea of a shared dictionary, in which different classes not only have class-particular parts but also share commonality [16], [17]. Gao et al. [18] constructed a joint dictionary learning algorithm to learn several category-specific sub-dictionaries and a shared sub-dictionary by imposing a cross-incoherence constraint between different sub-dictionaries and a self-incoherence constraint within each sub-dictionary.
Wang and Kong [19] established category-specific and shared dictionary learning (COPAR) by exploiting the particularity and commonality across all classes. Lin et al. [20] constructed a class-shared, class-specific and disturbance dictionary via robust, discriminative and comprehensive dictionary learning (RDCDL). However, these methods overlook the low-rank property of the sub-dictionaries or of the coding vectors over the shared sub-dictionary. Therefore, Jiang and Lai [21], Rong et al. [22], and Wen et al. [23] introduced a low-rank constraint on the dictionary decomposition. Furthermore, Vu and Monga [24] proposed a low-rank constraint on the shared dictionary (LRSDL) to encourage its subspace to be low-dimensional and its corresponding representations to be similar. Du et al. [25] presented low-rank graph preserving discriminative dictionary learning (LRGPDDL) by introducing a low-rank constraint on each sub-dictionary. Chen et al. [26] combined an adaptive dictionary learning strategy with an adaptive low-rank representation (ALRR) method for classification. These works show that incorporating a low-rank regularization term into the dictionary learning framework enhances the robustness of the learned dictionary and achieves impressive classification results.
Inspired by the ideas of a shared sub-dictionary and a low-rank constraint, we develop a class-specific and shared discriminative dictionary learning (CASDDL) model for surface defect classification of steel sheet. Based on the observation that different classes of defect images share similar backgrounds, the CASDDL-based classification method constructs c class-specific sub-dictionaries associated with the corresponding classes and one shared sub-dictionary for all classes. With these sub-dictionaries, the exclusive and shared features of a surface defect image can be explicitly separated. CASDDL introduces incoherence-promoting constraints on all sub-dictionaries and a low-rank constraint on the coding vectors over the shared sub-dictionary, making the learned dictionary more compact, discriminative and robust. Also, a Fisher-like regularization term on the coding vectors over the class-specific sub-dictionaries ensures more coherence for within-class coding vectors and more disparity for between-class coding vectors.
Once a surface image is classified as defective, the defect object in it should be located and segmented. Studies based on robust principal component analysis (RPCA) [27] have shown that matrix decomposition techniques are excellent unsupervised methods for separating and segmenting the region of interest (ROI) from an image. RPCA assumes that an image can be represented as a combination of a highly redundant part (i.e., background regions) and a sparse part (i.e., foreground object). Mathematically, the feature matrix of an input image can be decomposed into a low-rank matrix corresponding to the background and a sparse matrix corresponding to the foreground object. Prior knowledge and regularization terms have been incorporated into the original RPCA model to improve segmentation speed and accuracy [28], [29]. Cen et al. [30] and Li et al. [31] designed low-rank matrix reconstruction models for defect inspection. Yan et al. [32] performed smooth-sparse decomposition (SSD) with regularized high-dimensional regression to decompose a defect image and separate anomalous regions. Cao et al. [33] presented prior knowledge guided least squares regression (PG-LSR) based on low-rank representation to detect diverse defects. Huang et al. [34] applied a texture prior to construct a weighted low-rank reconstruction (W-LRR), which is only suitable for defect images with regular or near-regular texture. Wang et al. [35] studied the entity sparsity pursuit (ESP) to identify surface defects. These methods do not consider the low-rank characteristic of the defect foreground and the defect-free background simultaneously, and they ignore the spatial and pattern relations of these regions, which may affect the final segmentation performance.
Motivated by the above analysis, this paper develops a double low-rank based matrix decomposition (DLMD) model for surface defect segmentation of steel sheet. Based on a unified low-rank assumption characterizing both the defect foreground and the defect-free background, the DLMD-based segmentation approach consists of two steps: first, the defect foreground image and the defect-free background image are separated from the surface defect image; second, an optimization strategy is applied to improve the accuracy of the defect foreground image, leading to higher segmentation performance.
To sum up, we propose a joint classification and segmentation (JCS)-based defect detection approach that provides explainable classification and segmentation results for steel sheet. As illustrated in Fig. 2, the proposed JCS approach first identifies the surface defect through a classification branch via the CASDDL model. It then discovers the locations and areas of the surface defect through a segmentation branch via the DLMD model. With the explainable classification results and the corresponding defect segmentation, JCS largely simplifies and accelerates the detection process for quality experts. This paper is an extension of our previous works [36] and [37]. Our main contributions are summarized as follows:
• We propose a CASDDL approach to train a discriminative dictionary for surface defect classification of steel sheet. It not only encourages intra-class samples to deliver similar feature representations, but also minimizes inter-class sample correlations.
• We develop a DLMD approach to segment various types of defects from surface defect images of steel sheet. It needs no training process, directly decomposing the surface defect image into a defect foreground image and a defect-free background image.
• The feasibility and advantages of the proposed JCS method combining CASDDL and DLMD are evaluated by extensive experiments and comparisons with other state-of-the-art methods, which show that it clearly improves both the subjective and objective quality of surface defect detection for steel sheet.
The remainder of the paper is organized as follows. In Section 2, we briefly introduce related work on surface defect classification and segmentation, dictionary learning, and RPCA. Section 3 presents the proposed JCS detection approach, including the CASDDL-based defect classification model and the DLMD-based defect segmentation model. In Section 4, we validate the proposed JCS approach in extensive experiments and compare it with other state-of-the-art methods. Conclusions and future work are provided in Section 5.

II. RELATED WORK

A. SURFACE DEFECT CLASSIFICATION AND SEGMENTATION
For classifying surface defects, various customized feature extraction methods have been developed for a variety of problems. Representative feature extraction methods mainly include grayscale, shape, texture, morphological operators, and Fourier, Gabor and wavelet transforms. These features are then combined with powerful classifiers, such as artificial neural networks and support vector machines. Borwankar and Ludwig [38] used the discrete wavelet transform and rotated wavelet transform for feature extraction and a KNN classifier for classification. Luo et al. [39] exploited a generalized completed local binary patterns framework and a simple nearest-neighbor classifier for steel surface defect classification. Ashour et al. [40] developed a method combining the discrete shearlet transform and the gray-level co-occurrence matrix to classify surface defects of hot-rolled steel strips.
Traditional surface defect segmentation methods can be mainly divided into three categories: statistical-based, filter-based and model-based methods. Statistical-based methods, such as statistical moments, mathematical morphology and maximum entropy, evaluate the spatial distribution of pixel intensities; they are sensitive to lighting, noise and outliers. In contrast, filter-based methods, such as the discrete Fourier transform, discrete Gabor transform and discrete wavelet transform, use the energies of the filter responses as features to segment defects. These methods require periodic texture structures, so they may not suit random textures, and they are also ill-suited to localizing defect regions in the spatial domain. Model-based methods, such as level sets, Markov random fields, fractal models and partial differential equations, construct specific models of image feature distributions and have high computational complexity.

B. DICTIONARY LEARNING
Mathematically, dictionary learning can be formulated as follows:

min_{D,x} ||y − Dx||_2^2 + λ θ(D, x)    (1)

where ||·||_2 denotes the l2 norm, y ∈ R^d denotes a given d-dimensional feature vector of a training sample, x ∈ R^K denotes the coding vector of y over the dictionary D = [d_1, d_2, . . . , d_j, . . . , d_K] ∈ R^{d×K}, d_j ∈ R^d denotes the j-th atom of D, θ(D, x) denotes a regularization term constraining D or x, and λ is a positive parameter that balances the trade-off between the reconstruction error ||y − Dx||_2^2 and θ(D, x). For the classification task, discriminative dictionary learning has demonstrated that a well-learned dictionary D can greatly boost classification performance. The discrimination can be built into the dictionary, the coding vectors, or both. Several regularization terms, such as sparsity, low-rank, neighborhood preservation on a graph, entropy, and incoherence constraints on sub-dictionaries, have been introduced into the learning process to promote the discriminative power of the learned dictionary.
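Eq. (1) is typically minimized by alternating two steps: sparse coding with D fixed, then a dictionary update with x fixed. The following minimal numpy sketch uses an l1 sparsity penalty as θ(D, x), ISTA for the coding step and a least-squares dictionary update; it is an illustrative choice, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding: the proximal operator of the l1 norm.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dictionary_learning(Y, K, lam=0.1, n_iter=20, code_steps=50, seed=0):
    """Alternating minimization for min_{D,X} 0.5*||Y - DX||_F^2 + lam*||X||_1.
    Y: (d, n) training matrix; K: number of atoms. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0, keepdims=True)      # unit-norm atoms
    X = np.zeros((K, n))
    for _ in range(n_iter):
        # (a) Fix D, update X by ISTA (proximal gradient on the lasso).
        L = np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the gradient
        for _ in range(code_steps):
            X = soft_threshold(X - (D.T @ (D @ X - Y)) / L, lam / L)
        # (b) Fix X, update D by least squares, then renormalize the atoms.
        D = Y @ X.T @ np.linalg.pinv(X @ X.T + 1e-8 * np.eye(K))
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D, X
```

In a discriminative variant such as CASDDL, the regularizer θ(D, x) would be replaced by the structured terms discussed in Section III.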
Optimizing Eq. (1) is usually carried out by an iterative method composed of two steps: (a) fixing D to update x; (b) fixing x to update D, each of which can be solved efficiently by many algorithms [41]. With the learned dictionary D, a test sample ŷ is assigned to class k* if it satisfies

k* = arg min_k ||ŷ − D l_k(x)||_2^2

where x is the coding vector of ŷ, and l_k(x) denotes a vector that keeps only the entries of x associated with the k-th class and sets the others to zero. As a result, ŷ is assigned to the class k* with the minimum reconstruction error ||ŷ − D l_{k*}(x)||_2^2.

VOLUME 9, 2021

C. ROBUST PRINCIPAL COMPONENT ANALYSIS

RPCA shows that the low-rank representation performs well in discovering the global structure of data, revealing the relationships among samples: the within-class affinities are dense while the between-class affinities are all zeros [42]. RPCA can be formulated as follows:

min_{L,S} rank(L) + λ||S||_0  s.t.  F = L + S    (2)

where F ∈ R^{m×n} is the input matrix, L ∈ R^{m×n} and S ∈ R^{m×n} are the two decomposed matrices, rank(·) denotes the rank of a matrix, ||·||_0 denotes the l0 norm (the number of non-zero elements of a matrix), and λ > 0 is a trade-off parameter between L and S. Since Eq. (2) is NP-hard, rank(L) can be replaced by the nuclear norm ||L||_*, and ||S||_0 by the l1 norm ||S||_1 or the l2,1 norm ||S||_2,1, where ||·||_* equals the sum of the singular values of a matrix, ||·||_1 equals the sum of the absolute values of all entries in a matrix, and ||·||_2,1 equals the sum of the l2 norms of the columns of a matrix. Several optimization algorithms have been proposed to solve RPCA [43], such as the alternating direction method of multipliers and the inexact augmented Lagrangian multipliers (inexact ALM) method.

Suppose L ∈ R^{m×n} is a matrix of rank r, and its singular value decomposition (SVD) is svd(L) = UΣV^T, where Σ = diag{σ_i}_{1≤i≤r} is the diagonal matrix with σ_1, σ_2, . . . , σ_r on the diagonal and zeros elsewhere, σ_i is the i-th singular value of L, and U ∈ R^{m×r} and V ∈ R^{n×r} are the left and right singular matrices, respectively. The traditional soft-thresholding shrinkage operator is

[τ_λ(Σ)]_ij = sign(Σ_ij) · max(|Σ_ij| − λ, 0)

where Σ_ij stands for the (i, j)-th element of Σ. Each singular value shrinks equally by subtracting the same constant λ, which means that all singular values contribute equally. Given a weight vector w ∈ R^r, the non-uniform singular value thresholding operator can instead be defined as [44]:

[τ_w(Σ)]_ii = max(σ_i − w_i, 0)

The larger singular values, which carry the principal information of the image, should be reduced as little as possible: the larger a singular value is, the more it contributes to the major information. By assigning different weights, different singular values are treated differently and can shrink adaptively according to the specific content of the image. For a surface defect image, the matrix singular values have clear physical meanings: larger singular values correspond to the major projection directions and should be shrunk less to preserve the major components, which improves the accuracy of the low-rank reconstruction and enhances the adaptivity of defect segmentation.
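The RPCA decomposition and the singular value thresholding operator discussed above can be sketched as follows. This is a minimal inexact ALM implementation with the common default λ = 1/√max(m, n); the optional weight vector in `svt` illustrates the non-uniform operator. It is not the paper's DLMD solver.

```python
import numpy as np

def svt(M, tau, w=None):
    """Singular value thresholding. With weights w (one per singular value),
    larger singular values can be shrunk less, as in the non-uniform operator."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    t = tau if w is None else tau * np.asarray(w)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def rpca_ialm(F, lam=None, rho=1.5, n_iter=200, tol=1e-7):
    """Inexact ALM sketch for min ||L||_* + lam*||S||_1 s.t. F = L + S."""
    m, n = F.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(F, 2)          # standard initialization
    L = np.zeros_like(F); S = np.zeros_like(F); P = np.zeros_like(F)  # P: multiplier
    normF = np.linalg.norm(F)
    for _ in range(n_iter):
        L = svt(F - S + P / mu, 1.0 / mu)     # low-rank update (nuclear-norm prox)
        T = F - L + P / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)  # sparse update
        R = F - L - S
        P = P + mu * R                        # dual ascent on the constraint
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(R) / normF < tol:
            break
    return L, S
```

On synthetic low-rank-plus-sparse data this recovers both components; in the segmentation setting, F would be the superpixel feature matrix.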

III. OUR SURFACE DEFECT DETECTION APPROACH
Our JCS detection approach consists of an explainable classification branch to identify the defect and a segmentation branch to discover the defect areas. The proposed CASDDL classification model identifies whether the surface image is defective or not, along with convincing visual explanations. To provide complementary pixel-level prediction, the proposed DLMD segmentation model recognizes fine-grained defect areas in the surface defect image. By combining these two models for better performance, JCS provides informative detection results for surface defects of steel sheet.

A. EXPLAINABLE CLASSIFICATION
The proposed CASDDL-based classification method mainly comprises two stages: discriminative dictionary learning and defect classification.

1) DISCRIMINATIVE DICTIONARY LEARNING

a: FORMULATION OF CASDDL
Let Y = [Y_1, Y_2, . . . , Y_c] ∈ R^{d×N} denote the training samples from c classes, where Y_j contains the samples of the j-th class. D = [D_1, D_2, . . . , D_c, D_{c+1}] ∈ R^{d×K} denotes the learned dictionary of K atoms, where D_j (j = 1, 2, . . . , c) ∈ R^{d×k_j} denotes the j-th class-specific sub-dictionary trained from the corresponding training samples Y_j, D_{c+1} ∈ R^{d×k_{c+1}} denotes a shared sub-dictionary trained from the whole training set Y, K = Σ_{j=1}^{c+1} k_j, and k_j denotes the number of atoms in the j-th sub-dictionary. Let X_i denote the coding matrix of Y_i over D, which splits into X_i^class (the part over the class-specific sub-dictionaries) and X_i^{c+1} (the part over the shared sub-dictionary D_{c+1}). To enhance the discriminative capability of the dictionary, it is ideally desired that, for each class, the samples have non-zero coding vectors concentrated on the corresponding atoms, whereas the coding vectors at other atoms are zero. As shown in Fig. 3, a sample is supposed to be represented only by the corresponding class-specific sub-dictionary, and not by the other class-specific sub-dictionaries. The discriminative capability of the learned dictionary can be enhanced by forcing all other class-specific sub-dictionaries to have poor representative ability for non-corresponding samples. Different sub-dictionaries should have low coherence, which guides the learned dictionary to be discriminative. Moreover, in terms of intra-class compactness and inter-class separability, the coding vectors of samples from the same class should be similar, while those of samples from different classes should be dissimilar. The coding vectors over the shared sub-dictionary should be similar, so the corresponding coding matrix should be low-rank, which addresses the redundant information in the shared sub-dictionary and makes the coding vectors more compact.
Based on the above discussion, the proposed CASDDL can be modelled as the following optimization problem:

min_{D,X} Z_1 + Z_2 + Z_3 + Z_4 (with trade-off parameters on each term)

where Z_1 = Z_reconstruction(Y, D, X) denotes the reconstruction error term; Z_2 = Z_incoherence(D_i, D_j) denotes the sub-dictionary incoherence term; Z_3 = Z_exclusiveness(X_i^class) denotes the discriminative promotion term for the coding vectors over all class-specific sub-dictionaries; and Z_4 = Z_lowrank(X_i^{c+1}) denotes the low-rank preserving term for the coding vectors over the shared sub-dictionary.
To learn a representative and discriminative structured dictionary D, each class-specific sub-dictionary D_i should represent the samples of the i-th class well, but not those of other classes. The most important property of the shared sub-dictionary is to represent samples from all classes.
(i) RECONSTRUCTION ERROR TERM Z_1

The term ||Y_j − D V_j X_j||_F^2 (j = 1, 2, . . . , c) makes sure that each class Y_j has a good representation over the corresponding class-specific sub-dictionary D_j, where V_j ∈ R^{K×k_j} is the selection operator that selects the j-th class-specific sub-dictionary D_j from D: each column of V_j has only one non-zero element 1, whose location is the column index of the corresponding class-specific sub-dictionary atom in D. A companion term ensures that the shared sub-dictionary D_{c+1} also contributes to representing Y_j. Hence, the reconstruction error term Z_1 is defined from these two parts.

(ii) SUB-DICTIONARY INCOHERENCE TERM Z_2

To obtain the desired discriminative capability of the learned dictionary D, different sub-dictionaries should be as orthogonal as possible, which ensures that each class-specific sub-dictionary exclusively represents the corresponding samples well. Therefore, the cross-incoherence ||D̄_j^T D_j||_F^2 and the self-incoherence ||D_j^T D_j − I_{k_j}||_F^2 are supposed to be small, where D̄_j is the sub-matrix obtained by removing D_j from D, and I_{k_j} is a k_j × k_j identity matrix. Adding these two terms effectively reduces the redundancy among sub-dictionaries, which directly speeds up computation. Hence, the sub-dictionary incoherence term Z_2 is defined from these two parts, where setting n_{c+1} = N alleviates the imbalance between the number of samples and the number of atoms of the sub-dictionaries.
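As a small numeric illustration, the cross-incoherence between sub-dictionaries can be computed directly. The uniform pairwise form below is a simplification of Z_2: the per-term weights and the self-incoherence part are omitted.

```python
import numpy as np

def incoherence_penalty(sub_dicts):
    """Cross-incoherence of a list of sub-dictionaries [D_1, ..., D_{c+1}]:
    sum of ||D_i^T D_j||_F^2 over all pairs i != j.
    Small values mean the sub-dictionaries span nearly orthogonal subspaces."""
    total = 0.0
    for i, Di in enumerate(sub_dicts):
        for j, Dj in enumerate(sub_dicts):
            if i != j:
                total += np.linalg.norm(Di.T @ Dj, 'fro') ** 2
    return total
```

For two sub-dictionaries built from disjoint coordinate axes the penalty is exactly zero, while duplicated atoms make it large, which is what the term penalizes during learning.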

(iii) DISCRIMINATIVE PROMOTION TERM Z_3

Based on Fisher's linear discriminant criterion, which maximizes the ratio of the between-class scatter to the within-class scatter, we minimize the within-class scatter matrix S_W(X) and maximize the between-class scatter matrix S_B(X):

S_W(X) = Σ_{i=1}^{c} Σ_{x_l^i ∈ X_i} (x_l^i − u_i)(x_l^i − u_i)^T,  S_B(X) = Σ_{i=1}^{c} n_i (u_i − u)(u_i − u)^T

where x_l^i denotes the coding vector of the l-th training sample over the i-th class-specific sub-dictionary, and u_i = (1/n_i) Σ_l x_l^i and u are the mean vectors of X_i and X, respectively. By directly constraining the coding vectors, the separability and discriminability of coding vectors from different classes are further enhanced. Hence, the discriminative promotion term Z_3 is defined from tr(S_W(X)) − tr(S_B(X)) (with an additional Frobenius-norm term on X, as is common in Fisher-based dictionary learning, to keep the term convex).

(iv) LOW-RANK PRESERVING TERM Z_4

As the nuclear norm ||·||_* is the convex relaxation of rank(·), the low-rank preserving term Z_4 can be defined as Z_4 = ||X^{c+1}||_*.

Taking all of the above into consideration, we obtain the CASDDL model of Eq. (8). Eq. (8) can be divided into two sub-problems: updating X with D fixed, and updating D with X fixed. To learn a more discriminative dictionary, the K-means algorithm is used to initialize the dictionary: each class-specific sub-dictionary is initialized with the cluster centers of the corresponding class's training samples, and the shared sub-dictionary is initialized with the cluster centers of the whole training set. As the dissimilarity between different cluster centers is high, the initial atoms of the class-specific sub-dictionaries already have approximately discriminative ability. We summarize CASDDL in Algorithm 1.
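The Fisher-like scatter traces used in Z_3 can be computed from a coding matrix as follows; this sketch assumes the class layout of X is given by a label vector, which is an illustrative convention.

```python
import numpy as np

def fisher_terms(X, labels):
    """Return (tr(S_W), tr(S_B)) for coding matrix X of shape (K, n),
    where labels[i] is the class of column i. Z_3 in the text penalizes
    tr(S_W) - tr(S_B), so small within-class / large between-class scatter
    of the coding vectors is rewarded."""
    u = X.mean(axis=1, keepdims=True)              # global mean coding vector
    sw = sb = 0.0
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        uc = Xc.mean(axis=1, keepdims=True)        # class mean u_c
        sw += ((Xc - uc) ** 2).sum()               # tr(S_W): within-class scatter
        sb += Xc.shape[1] * ((uc - u) ** 2).sum()  # tr(S_B): between-class scatter
    return sw, sb
```

For codes that are tight within each class and far apart between classes, tr(S_B) dominates tr(S_W), so the penalty tr(S_W) − tr(S_B) is strongly negative, which is the configuration the learning encourages.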
Initialize: each class-specific sub-dictionary {D_j}_{j=1,2,...,c} is initialized by K-means on Y_j, and the shared sub-dictionary D_{c+1} is initialized by K-means on Y. While not converged:

(1) Update X_i^class. With D and X_i^{c+1} fixed, Eq. (9) can be rewritten as Eq. (10) and further compacted into Eq. (11). According to [45], a two-step iterative shrinkage/thresholding (TwIST) algorithm can be adopted to solve Eq. (11): after the first derivative of R(X_i^class) with respect to X_i^class is calculated (see the Appendix), the iterates are obtained with the soft-thresholding shrinkage operator τ_σ(·).

(2) Update X_i^{c+1}. According to the inexact ALM algorithm, introducing the auxiliary variable H = X_i^{c+1}, Eq. (14) can be rewritten as an augmented Lagrangian, where ⟨·, ·⟩ denotes the inner product of two matrices, ||·||_F denotes the Frobenius norm (so ||·||_F^2 equals the sum of squares of all matrix entries), P is a Lagrange multiplier, and µ > 0 is a penalty parameter.
This leads to Eq. (16); the detailed procedure for solving Eq. (16) is presented in Algorithm 2.
(1) Update H. Differentiating the objective with respect to H and setting the derivative to zero gives the update for H.

(2) Update X^{c+1}. We then obtain the singular value thresholding update, where (U, Σ, V) = svd(H^{(k+1)} + P^{(k)}/µ^{(k)}), svd(·) denotes the SVD operation, Σ = diag{σ_i}_{1≤i≤r} is the diagonal matrix with σ_1, σ_2, . . . , σ_r on the diagonal and zeros elsewhere, σ_i is the i-th singular value of H^{(k+1)} + P^{(k)}/µ^{(k)}, and U and V are the left and right singular matrices, respectively.
Update D_{c+1}. With X and all the class-specific sub-dictionaries fixed, Eq. (24) can be rewritten as Eq. (31). Similar to Eq. (29), Eq. (31) can be solved by the CORE algorithm.

2) DEFECT CLASSIFICATION
The proposed CASDDL especially emphasizes class discrimination of both the dictionary atoms and the coding vectors, which not only contributes to learning a class-oriented discriminative dictionary, but also results in discriminative coding vectors. Different from traditional classification methods that treat the coding vector merely as input to a sophisticated classifier, we directly exploit the discriminative capability of the coding vector in a simple and efficient classification scheme, without adding any parameters to be learned.
For a test sample ŷ, we use the learned dictionary D to compute its coding vector x̂ = [x̂_1; x̂_2; . . . ; x̂_i; . . . ; x̂_c], where x̂_i is the coding sub-vector associated with the class-specific sub-dictionary D_i (i = 1, 2, . . . , c). Considering the discrimination of x̂, if ŷ belongs to class i, then x̂_i will be larger than the other parts. Therefore, the class of ŷ is determined by the minimum class-wise reconstruction error, identity(ŷ) = arg min_i ||ŷ − D_i x̂_i||_2^2.
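The residual-based decision rule can be sketched as follows. The coding step is assumed to have already produced the class-specific sub-vectors and (optionally) a shared component; the function signature is illustrative, not the paper's exact formulation.

```python
import numpy as np

def classify(y, sub_dicts, D_shared, x_parts, x_shared):
    """Assign y to the class whose class-specific part best explains it:
    k* = argmin_i ||y - D_i x_i - D_shared x_shared||_2.
    sub_dicts: list of class-specific sub-dictionaries D_i,
    x_parts:   list of coding sub-vectors x_i (one per class),
    D_shared, x_shared: shared sub-dictionary and its coding part."""
    shared_part = D_shared @ x_shared
    residuals = [np.linalg.norm(y - Di @ xi - shared_part)
                 for Di, xi in zip(sub_dicts, x_parts)]
    return int(np.argmin(residuals))
```

Because the dictionary was trained so that x̂_i is large only for the true class, the residual of the correct class-specific reconstruction is the smallest, and no extra classifier parameters are needed.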

B. ACCURATE SEGMENTATION
The proposed DLMD-based segmentation method mainly comprises four stages: superpixel over-segmentation, feature extraction, feature matrix decomposition, and defect segmentation.

1) SUPERPIXEL OVER-SEGMENTATION
To capture the structural information of defects, we adopt the adaptive simple linear iterative clustering (ASLIC) superpixel algorithm [47] to partition the surface defect image into several non-overlapping sub-regions. It generates regularly shaped superpixels in both textured and non-textured regions, and only the number of superpixel sub-regions K needs to be specified. A larger K should be chosen if the potential defect object is small and morphologically complex, producing more deformable shapes that enclose the region containing the potential defect object, and vice versa. Since the number of superpixel sub-regions is far smaller than the number of pixels, this eases the computational burden and improves efficiency.
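A toy SLIC-style clustering illustrating the over-segmentation step (grayscale only, brute-force assignment over all clusters). Real ASLIC restricts the search to a window around each center and adapts the compactness m per cluster; both refinements are omitted here for brevity.

```python
import numpy as np

def simple_slic(img, K=16, m=10.0, n_iter=5):
    """SLIC-like superpixels on a grayscale image of shape (h, w).
    Returns an (h, w) integer label map. Illustrative sketch only."""
    h, w = img.shape
    step = int(np.sqrt(h * w / K))                    # grid spacing of seeds
    ys, xs = np.meshgrid(np.arange(step // 2, h, step),
                         np.arange(step // 2, w, step), indexing='ij')
    cy, cx = ys.ravel().astype(float), xs.ravel().astype(float)
    cg = img[ys.ravel(), xs.ravel()].astype(float)    # gray value of each center
    Y, X = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    for _ in range(n_iter):
        # Combined (gray, spatial) distance, as in SLIC: m trades compactness.
        d = ((img[None] - cg[:, None, None]) ** 2
             + (m / step) ** 2 * ((Y[None] - cy[:, None, None]) ** 2
                                  + (X[None] - cx[:, None, None]) ** 2))
        labels = d.argmin(axis=0)
        for k in range(len(cg)):                      # recompute cluster centers
            mask = labels == k
            if mask.any():
                cy[k], cx[k] = Y[mask].mean(), X[mask].mean()
                cg[k] = img[mask].mean()
    return labels
```

On a synthetic two-tone image the labels respect the intensity edge, which is the behavior the segmentation stage relies on: superpixels should not straddle defect boundaries.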

2) FEATURE EXTRACTION
Gray-scale features, Gabor filters with eight directions on two scales, and steerable pyramid filters with four directions on two scales are computed and stacked vertically to construct a 25-dimensional feature vector for each pixel. For each superpixel sub-region, its feature vector is the mean of the feature vectors of all pixels it contains, which is robust to noise. All sub-region feature vectors are normalized to unit column vectors and stacked together to construct a feature matrix F ∈ R^{d×K}, where d is the dimension of the feature vector and K is the number of superpixel sub-regions.
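Given per-pixel features (however they are computed) and a superpixel label map, the pooling and normalization described above can be sketched as:

```python
import numpy as np

def superpixel_feature_matrix(pixel_feats, labels):
    """Build the feature matrix F (d x K): mean-pool the per-pixel feature
    vectors inside each superpixel, then normalize each column to unit l2 norm.
    pixel_feats: (d, h, w) per-pixel features; labels: (h, w) superpixel ids."""
    d = pixel_feats.shape[0]
    ids = np.unique(labels)
    F = np.zeros((d, len(ids)))
    for k, sp in enumerate(ids):
        mask = labels == sp
        F[:, k] = pixel_feats[:, mask].mean(axis=1)   # noise-robust pooling
    # Unit-normalize each column (guard against all-zero features).
    F /= np.maximum(np.linalg.norm(F, axis=0, keepdims=True), 1e-12)
    return F
```

Each column of F is then the descriptor of one sub-region, which is exactly the matrix decomposed in the next stage.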

3) FEATURE MATRIX DECOMPOSITION

a: FORMULATION OF DLMD
As shown in Fig. 4, we decompose the surface defect image I into a defect-free background image B and a defect foreground image E. Applying the ASLIC algorithm and stacking the feature vectors of all superpixel sub-regions yields the feature matrix F of the original defect image I; in the same feature space, a feature matrix L represents the background image B and a feature matrix S represents the defect foreground image E. Therefore, F = L + S, where each column of these matrices is the feature vector of an individual superpixel sub-region. Both the background image B and the defect foreground image E contain multiple homogeneous, highly similar sub-regions. The two feature matrices L and S therefore contain redundant information and can be assumed to be low-rank, since similar sub-regions form a low-dimensional feature subspace. Moreover, to reduce the influence of noise and improve robustness to uneven illumination, we assume that the background additionally has a sparse property and lies in a sparse feature subspace.
Based on the above analysis, the proposed DLMD can be modelled as the following optimization problem:

min_{L,S} rank(L) + rank(S) + η Ω(L, S) + τ ||L||_2,1  s.t.  F = L + S

where Ω(L, S) denotes the regularization term that enlarges the margin and reduces the coherence between the feature subspaces induced by L and S, and η > 0 and τ > 0 are regularization parameters (the rank terms are relaxed to nuclear norms in the solver).
The local invariance assumption based Laplacian regularization term (L, S) can be defined as follows: where, M ∈ R K ×K is a Laplacian matrix; tr (·) denotes the trace of a matrix; s i , s j denotes the i-th and j-th column of S; w ij of affinity matrix W ∈ R K ×K denotes the weight that represents the feature similarity between sub-regions R i and R j . Supposing that each sub-region of surface defect image is represented by a node, the Laplacian matrix M is defined: where, l 2,1 norm-based penalty term L 2,1 aims to characterize the noise or illumination interference of surface defect image. VOLUME 9, 2021 According to inexact ALM algorithm, introducing the auxiliary variables H = L, J = S, Eq. (34) can be defined as follows: where, P 1 , P 2 and P 3 are Lagrange multipliers; µ> 0 is a penalty parameter. The detailed procedure of solving Eq. (35) is presented in Algorithm 3. x Update H In order to solve H , we can further simplify Eq. (35) as follows: The optimal solution can be obtained as follows: where Z (:, j) denotes the j-th column of matrix Z .
2) Update J. With the other variables fixed, the optimal J minimizes

τ tr(J M J^T) + ⟨P_3, S − J⟩ + (μ/2)||S − J||²_F.

Differentiating with respect to J and setting the derivative to zero gives

2τ J M + μJ − μS − P_3 = 0,

and the closed-form solution is

J = (μS + P_3)(2τM + μI)^{−1}.

3) Update L. With the other variables fixed, Eq. (35) can be transformed into

min_L ||L||_* + (μ/2)(||F − L − S + P_1/μ||²_F + ||L − H + P_2/μ||²_F),

which can be rewritten as

min_L ||L||_* + μ||L − Q_L||²_F,  Q_L = (F − S + H + (P_1 − P_2)/μ)/2.

Its optimal solution is L = w_{4μ}(Q_L), where w_{4μ}(·) denotes the non-uniform singular value thresholding operator and {σ_i}, i = 1, 2, ..., r, are the singular values of Q_L.

4) Update S. Similarly, Eq. (35) can be transformed into

min_S ||S||_* + μ||S − Q_S||²_F,  Q_S = (F − L + J + (P_1 − P_3)/μ)/2,

and its solution is S = w_{4μ}(Q_S).

5) Update the multipliers and μ:

P_1 ← P_1 + μ(F − L − S),  P_2 ← P_2 + μ(L − H),  P_3 ← P_3 + μ(S − J),  μ ← min(ρμ, μ_max),

where ρ = 1.1 and μ_max = 10^5.
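The two proximal operators used in these updates can be sketched as follows. `svt` implements uniform singular value thresholding (the non-uniform operator w_{4μ} would additionally weight each σ_i, which is omitted here), and `l21_shrink` implements the column-wise shrinkage of the H-update:

```python
import numpy as np

def svt(Q, tau):
    """Singular value thresholding: proximal operator of tau*||.||_* at Q.

    Uniform thresholds; a non-uniform variant would apply a different
    threshold to each singular value.
    """
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(Z, tau):
    """Column-wise shrinkage: proximal operator of tau*||.||_{2,1} at Z."""
    out = np.zeros_like(Z)
    for j in range(Z.shape[1]):
        nrm = np.linalg.norm(Z[:, j])
        if nrm > tau:                      # small columns are set to zero
            out[:, j] = (nrm - tau) / nrm * Z[:, j]
    return out
```

Shrinking whole columns at once is what encourages entire sub-region feature vectors, rather than isolated entries, to vanish.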

4) DEFECT SEGMENTATION
Each column of L = (l_1, l_2, . . . , l_K) and S = (s_1, s_2, . . . , s_K) is the feature vector of the corresponding superpixel sub-region of the decomposed background image B and defect foreground image E, respectively. Then, we transfer L and S from the feature domain to the spatial domain for visualization: the gray-value of each superpixel sub-region is set to the maximum value of the corresponding feature vector and is then allocated to the corresponding pixels to visualize the background image B and the defect foreground image E, as shown in Fig. 2.
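A minimal sketch of this feature-to-spatial-domain visualization, assuming a superpixel label map is available:

```python
import numpy as np

def visualize(feature_matrix, labels):
    """Map a feature matrix back to the spatial domain.

    Each sub-region's gray-value is the maximum of its feature vector
    (one column of the matrix), broadcast to all pixels of that region
    via the label map.
    """
    gray = feature_matrix.max(axis=0)   # one gray-value per sub-region
    return gray[labels]                  # assign to corresponding pixels
```

Applying this to L and S yields the visualized background image B and defect foreground image E, respectively.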
To enhance the completeness of defect objects and suppress the background noise in the defect foreground image E, the following regression optimization is adopted:

min_s Σ_{i=1}^{K} w^b_i s_i² + Σ_{i=1}^{K} w^f_i (s_i − 1)² + (1/2) Σ_{i,j} w_ij (s_i − s_j)²,   (49)

where w^f_i and w^b_i denote the gray-values of the i-th sub-region in the defect foreground image E and the background image B, respectively, and s_i ∈ s = (s_1, s_2, . . . , s_K)^T denotes the enhanced gray-value of the i-th sub-region of the defect foreground image E. Eq. (49) can be reformulated in matrix form as

min_s s^T D_b s + (s − 1)^T D_f (s − 1) + s^T M s,   (50)

where D_f = diag(w^f_1, . . . , w^f_K) and D_b = diag(w^b_1, . . . , w^b_K); 1 ∈ R^{K×1} denotes the all-ones vector; and M ∈ R^{K×K} denotes the same Laplacian matrix as in Eq. (33). Differentiating Eq. (50) with respect to s and setting the derivative to zero, we have

(D_b + D_f + M) s = D_f 1,

and its solution is

s = (D_b + D_f + M)^{−1} D_f 1.

Through Eq. (49), the gray-values of the defect sub-regions in the defect foreground image E become larger, so the defect object is further highlighted. Finally, the shape, location and size of the surface defect can be easily localized and segmented through a simple thresholding operation.
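This closed-form solution can be sketched in a few lines; D_f and D_b are diagonal matrices built from the foreground and background gray-values of the sub-regions:

```python
import numpy as np

def refine_foreground(w_f, w_b, M):
    """Closed-form regression optimization s = (D_b + D_f + M)^{-1} D_f 1.

    w_f, w_b: per-sub-region gray-values from the foreground and background
    images; M: the graph Laplacian over sub-regions. The quadratic objective
    pulls likely-defect regions toward 1, background regions toward 0, and
    smooths values across similar neighbors.
    """
    Df, Db = np.diag(w_f), np.diag(w_b)
    # D_f @ 1 is simply the vector w_f
    return np.linalg.solve(Db + Df + M, w_f)
```

With M = 0 the solution degenerates to s_i = w^f_i / (w^f_i + w^b_i), which makes the foreground/background trade-off explicit.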

IV. EXPERIMENT
In this section, various experiments, including parameter analysis, convergence analysis, robustness to noise, and comparisons between our method and several state-of-the-art methods, are conducted to verify the proposed JCS method.

A. EXPERIMENTAL SETUP
Two typical surface defect classes (Patch, Scratch) and a defect-free class are selected in the following experiments. There are 300 grayscale images (200×200 pixels) per class, and the pixel-level ground truth of each defect image is manually annotated, with white denoting defective pixels and black denoting defect-free pixels. We evaluate classification results using the classification accuracy N_R/N, where N_R is the number of correctly classified test samples and N is the total number of test samples. All surface images are normalized and resized to 40×40 pixels, then randomly divided into training and test samples at a 1:1 ratio. Each experiment is repeated ten times, and the average values and standard deviations of the classification results are reported. We evaluate segmentation results using qualitative and quantitative metrics: the qualitative metrics refer to human subjective assessment of segmentation performance (i.e., the boundary of the defect object is clear and the contrast between defect and background is obvious); the quantitative metrics are the precision-recall (P-R) curve, receiver operating characteristic (ROC) curve, average F-measure (F_β) curve, area under the ROC curve (AUC) and mean absolute error (MAE). Suppose a pixel belonging to a defect is defined as a positive example and a pixel belonging to the background as a negative example. With the symbols TP (true positive), FP (false positive), TN (true negative) and FN (false negative) counted accordingly, Precision = TP/(TP + FP), Recall = TP/(TP + FN), and MAE = (1/N) Σ_{n=1}^{N} (1/(H×W)) Σ_{x,y} |S_n(x, y) − G_n(x, y)|, where N, H and W denote the number, height and width of the surface defect images, and S_n and G_n denote the predicted defect map and the ground truth of the n-th image.
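The pixel-level metrics can be sketched as follows; the 0.5 binarization threshold and β² = 0.3 are common conventions assumed here, not values stated in the paper:

```python
import numpy as np

def accuracy(n_correct, n_total):
    """Classification accuracy N_R / N."""
    return n_correct / n_total

def seg_metrics(pred, gt, beta2=0.3):
    """Pixel-level precision, recall, F-measure and MAE for one image.

    pred: predicted defect map in [0, 1]; gt: binary ground truth
    (1 = defective pixel, 0 = defect-free pixel).
    """
    b = pred >= 0.5                              # assumed threshold
    tp = np.sum(b & (gt == 1))
    fp = np.sum(b & (gt == 0))
    fn = np.sum(~b & (gt == 1))
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f = (1 + beta2) * prec * rec / max(beta2 * prec + rec, 1e-12)
    mae = np.abs(pred - gt).mean()               # per-image MAE term
    return prec, rec, f, mae
```

Averaging the per-image MAE terms over all N images yields the dataset-level MAE defined above.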
B. CLASSIFICATION RESULTS ANALYSIS
1) PARAMETERS ANALYSIS
Let k_c and k_s denote the numbers of atoms of a class-specific sub-dictionary and the shared sub-dictionary, respectively. We vary k_c from 10 to 45 with an interval of five, and k_s from 2 to 30 with an interval of four. For each parameter combination, we compute the classification accuracy over all sub-dictionary combinations in terms of mean value, and report the classification accuracy in Table 1. The bottom row of Table 1 gives the mean classification accuracy for one β over different γ, and the right column gives the mean classification accuracy for one γ over different β. As shown in Table 1, the classification accuracy rises as β increases at first, but increasing β beyond a proper value degrades the classification performance. The classification accuracy also degrades with a small value of β, which shows that the discriminative coding-vector term is useful in learning a class-oriented dictionary. Comparably, a larger value of γ captures more of the inter-class similarity, and the shared sub-dictionary more readily captures the common features. However, too large a value of γ decreases the representation ability of the shared sub-dictionary, and the classification performance is degraded. For γ, we empirically observe that a value in the range [0.5, 0.9] always achieves an acceptable result. Furthermore, the classification accuracy with γ = 0 is lower than that with γ = 0.7, which illustrates the importance of the low-rank term.
From Table 1, the highest classification accuracy is achieved when α = 0.1, β = 0.8 and γ = 0.7, and this parameter combination is adopted in the following experiments. Besides, we observe that the classification accuracy is robust to different parameter combinations, remaining greater than 89% in most cases.

2) CONVERGENCE ANALYSIS
Although Eq. (8) is non-convex, the optimization algorithm adopts an alternating updating fashion, and the convergence of each sub-problem can be guaranteed. On the one hand, for updating X with D fixed, the optimal solution is obtained by the TwIST and ALM algorithms. On the other hand, in the process of updating D with X fixed, each atom is optimally renewed for its sub-problem, and the optimal solution is obtained by the CORE algorithm. As a consequence, the objective function is non-increasing during the whole process of alternately updating X and D. In addition, we provide empirical evidence of the good convergence behavior of CASDDL in Fig. 5. As the number of iterations increases, the error curve gradually decreases and eventually becomes stable, while the accuracy curve increases for different combinations of sub-dictionaries. This shows that the proposed CASDDL enjoys good convergence behavior.

3) COMPUTATIONAL COMPLEXITY
The drawback of CASDDL is that it is computationally complex. Although dictionary learning can be performed in parallel and off-line, it is still important to know how long the dictionary learning process takes. Several experimental parameters affect the running time of CASDDL, including the number of classes, the number of training samples, the dictionary size and the dimension of the feature vectors.

4) ROBUSTNESS TO NOISE
We evaluate the robustness of the proposed CASDDL by corrupting the original surface images with additive Gaussian noise at different signal-to-noise ratios (SNR): 24 dB, 20 dB and 16 dB. As shown in Table 2, the classification accuracy decreases slowly as the noise level increases; CASDDL still achieves 80.81% classification accuracy even at 20 dB noise, so it can be considered relatively insensitive to noise.

5) NUMBER OF ATOMS IN SUB-DICTIONARY
Let k_c and k_s denote the numbers of atoms of a class-specific sub-dictionary and the shared sub-dictionary, respectively. As shown in Table 3, increasing k_c leads to higher classification performance. The possible reason is that more discriminative information can be captured by a larger class-specific sub-dictionary. When k_c is fixed, the classification accuracy drops as k_s increases. The possible reason is that a small shared sub-dictionary is enough to capture the shared features of defect images, whereas a larger shared sub-dictionary tends to absorb class-specific features, causing some discriminative information to be lost. The proposed CASDDL always achieves high classification accuracy despite different numbers of atoms, which indicates that it can reconstruct defect images well even if the learned dictionary has a small size. In fact, a larger dictionary may have stronger representative ability and achieve better classification performance, at the expense of an increased computational load. Therefore, we should make a tradeoff between classification performance and computational efficiency. When k_c = 30, further increasing k_c improves the classification accuracy only marginally (∼1%). When k_s = 2 and k_c = 30, CASDDL still attains a high classification accuracy of 94.89%, and this parameter combination is used in the following experiments.

6) VISUALIZATION OF CODING VECTORS
The proposed CASDDL aims to obtain highly discriminative coding vectors, through the learned discriminative dictionary, to achieve surface defect classification. Fig. 6 illustrates that the coding vectors of training and testing samples are approximately block-diagonal, which further demonstrates the class-label discriminative information carried by the coding vectors.
As shown in Fig. 7 and Table 4, compared to SRC, the baseline method in this experiment, CASDDL improves the classification accuracy by a margin of more than 24%. Among the above approaches, ALRR performs best; it is superior to ours by 0.11% in accuracy but inferior to ours in stability. Besides, CASDDL outperforms LRGPDDL by a significant margin of over 2.5%.

C. SEGMENTATION RESULTS ANALYSIS 1) PARAMETERS ANALYSIS
The two regularization parameters η and τ in Eq. (34) are chosen by 5-fold cross-validation, and the experimental results measured by the AUC metric are shown in Table 5. They show that when the values of η and τ are set properly, the proposed DLMD achieves better segmentation performance. When η is small, the performance is very sensitive to changes of τ; when η is large, the performance is insensitive to τ. In particular, it is better to set η much larger than τ in order to penalize the feature matrix of the defect-free background image towards sparsity. The segmentation performance reaches a high level when η = 1.25 and τ = 0.25, and this parameter combination is used in the following experiments.

2) CONVERGENCE ANALYSIS
We empirically evaluate the convergence of the proposed DLMD through experiments at different numbers of iterations, measured by the relative error.
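An assumed form of this relative-error criterion, measuring the residual of the constraint F = L + S at each iteration (the paper's exact formula is not reproduced here), is:

```python
import numpy as np

def relative_error(F, L, S):
    """Residual of F = L + S, normalized by the magnitude of F.

    A small value indicates the ALM iterations have (nearly) satisfied
    the decomposition constraint; an assumed convergence measure.
    """
    return np.linalg.norm(F - L - S) / max(np.linalg.norm(F), 1e-12)
```

The iterations can be stopped once this value falls below a chosen tolerance.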

3) ROBUSTNESS TO NOISE
We evaluate the robustness of the proposed DLMD by corrupting the original surface images with additive Gaussian noise at different SNRs: 24 dB, 20 dB, 16 dB and 12 dB. As shown in Table 6, as the SNR decreases gradually, the AUC and MAE remain at relatively good levels; in particular, when SNR = 16 dB, the AUC still remains around 0.8. In general, the proposed DLMD method can be considered robust to noise.
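A sketch of this corruption step: additive Gaussian noise scaled so that the signal-to-noise ratio matches a target value in dB:

```python
import numpy as np

def add_gaussian_noise(image, snr_db, rng=None):
    """Corrupt an image with additive Gaussian noise at a target SNR (dB).

    Noise power is set to signal power / 10^(SNR/10), so lower SNR values
    (e.g. 12 dB) mean stronger corruption.
    """
    rng = np.random.default_rng(rng)
    p_signal = np.mean(image.astype(float) ** 2)
    p_noise = p_signal / 10 ** (snr_db / 10.0)
    return image + rng.normal(0.0, np.sqrt(p_noise), image.shape)
```

The same routine covers both the 24/20/16 dB classification experiments and the 24/20/16/12 dB segmentation experiments.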

a: QUALITATIVE COMPARISON
The qualitative comparison results between the proposed DLMD and the other methods are shown in Fig. 9. Most of the methods can handle simple defect images with relatively homogeneous backgrounds (i.e., columns 5 and 10). For complex defect images containing multiple objects (i.e., columns 6, 11 and 12) or with visually indistinguishable backgrounds (i.e., columns 3 and 4), some parts of the background are falsely classified as defects. By contrast, the proposed DLMD successfully separates the defect objects from the image background and locates defects precisely, achieving the goal of ''highlighting the foreground and suppressing the background''.

b: QUANTITATIVE COMPARISON
The six methods are evaluated by P-R curves, ROC curves, AUC values, F-measure curves and MAE values, which are illustrated in Fig. 10 and Table 7, respectively. They show that the proposed DLMD significantly outperforms the other five methods. In particular, its precision remains above 90% over a large threshold range, which reflects better segmentation performance. Most AUC values are higher than 70%, and DLMD achieves 84.53%, a 9.53% improvement over the 75.00% achieved by ESP. The MAE of DLMD is typically the lowest among all the methods; compared with ESP, DLMD improves by 9.53% in AUC and 3.44% in MAE. These experimental results illustrate that the proposed DLMD is effective for segmenting a variety of defects from surface defect images, even when the types and number of defects are unknown and exhibit diverse shapes, scales, directions and locations. Besides, the double low-rank constraint of DLMD contributes to the good segmentation performance.

V. CONCLUSION
In this paper, we develop the JCS method, comprising the CASDDL and DLMD models, to perform surface defect detection for steel sheet. Based on the anomaly characteristics of defects in surface defect images of steel sheet, we propose the CASDDL method to learn a discriminative dictionary that consists of several class-specific sub-dictionaries associated with the corresponding classes and a shared sub-dictionary shared by all classes, in which the class-specific sub-dictionaries are responsible for exploiting class-specific information and the shared sub-dictionary captures and separates the common information. By introducing low-rank, mutual incoherence and Fisher-like discriminative constraints, the method effectively reduces redundancy in the training samples. Moreover, we formulate a double low-rank decomposition model to obtain a high-quality defect foreground image directly, which provides a robust way to segment surface defects. Experimental results verify the effectiveness and robustness of JCS for detecting surface defects of steel sheet.

APPENDIX
Computing ∇ X class i X class