Two-Stream Convolutional Neural Network Based on Gradient Image for Aluminum Profile Surface Defects Classification and Recognition

In this article, a novel two-stream convolutional neural network based on gradient images is proposed, for the first time, to classify and recognize surface defects of aluminum profiles. Recent feature fusion methods based on two-stream networks have shown promising performance for defect classification and recognition. We use data enhancement to obtain a large number of samples and so reduce overfitting during training. The image gradient is calculated with the Sobel operator and normalized so that the data lie between zero and one on a common scale. The two-stream model takes the original RGB image of the aluminum profile and its corresponding gradient image as inputs, extracts features through two sub-networks, fuses the features at the ReLU6 layer with a wavelet transform fusion strategy, and passes the fused features through a concatenate layer to an SVM classifier for classification and recognition. The hyperparameters are tuned with Bayesian optimization, using the cross-validation classification error to choose the best-performing configuration. Experiments report the accuracy and estimated generalization error of single-stream and two-stream networks with different feature fusion strategies on different fusion layers, and show that the proposed model has good convergence, accuracy, stability and generalization. On this basis, the article also proposes several methods for future research on other defects.


I. INTRODUCTION
Driven by the growing worldwide demand for infrastructure investment and by rapid industrialization, aluminum profiles, which offer high strength, light weight, corrosion resistance, long service life, rich color and other advantages, are widely used in automobile manufacturing, rail transit, equipment and machinery manufacturing, consumer durables, the aerospace industry and other industries [1]. The surface quality of an aluminum profile is therefore very important and can greatly affect product performance. However, owing to the extrusion equipment and extrusion technology, compression deformation, spray grouts and other operating factors, the aluminum profile surface can develop abrasion marks, traffic marks, dirty points, convex powder, orange peel and other defects, which degrade the quality of the profile [2]. In many manufacturing industries, human inspection is still an important part of the production process, but it cannot guarantee stable and accurate detection. Automated defect detection is therefore needed, and machine vision is gradually replacing traditional human inspection because its high recognition rate and detection stability satisfy customer requirements [3]. In general, machine-vision-based detection of aluminum profile surface defects consists of four steps: image capture, image preprocessing, feature extraction and defect recognition. Among them, feature extraction directly affects detection accuracy and can even cause detection failures [4]. Traditional machine vision feature extraction methods include Histogram of Oriented Gradient (HOG), Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Oriented FAST and Rotated BRIEF (ORB), Local Binary Pattern (LBP) and Haar-like features (HAAR). SIFT and HOG are both based on the histogram of gradient directions in the image. The SURF operator is an improvement of SIFT. ORB adopts the FAST (Features from Accelerated Segment Test) algorithm to detect feature points and can be used for real-time feature detection. LBP is a feature operator that describes the local texture of an image and has rotation invariance and gray invariance. Haar-like features combined with the AdaBoost algorithm form the most commonly used face detection algorithm. Haar-like features use only feature templates to capture the light and dark patterns of the image, and each template slides over the image at different sizes and positions to calculate the feature value [5], [6]. Compared with the above feature extraction methods, the features extracted by deep learning have better applicability, generalization and expressive ability. Convolutional neural networks in particular have outstanding feature extraction and classification performance and are widely used for image processing, classification and understanding [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Larbi Boubchir.
Many scholars have studied effective feature extraction methods for defect detection systems. The following work addresses defect detection for aluminum profiles. Wei and Bi [8] proposed a technique for recognizing aluminum profile surface defects with a multiscale defect-detection network based on deep learning. Liu et al. [9] proposed an aluminum profile texture feature extraction method based on the GLCM algorithm and the Gabor wavelet transform, and used an SVM with an RBF kernel to classify the features. Neuhauser et al. [10] proposed a method of surface defect classification and detection based on a neural network that relies on data enhancement and transfer learning, which are the key to network training. Li and Liu [11] studied the recognition of aluminum plate surface defects with a three-layer BP neural network.
Although only a few articles address the detection of aluminum profile surface defects, there is much research on defect detection for other materials, such as steel surfaces. Fu et al. [12] proposed a deep-learning-based method built on a compact yet effective convolutional neural network model, which emphasizes the training of low-level features and incorporates multiple receptive fields. Wang et al. [13] proposed a convolutional neural network (CNN) that automatically extracts features to distinguish defect-free from defective images for product quality control. Weimer et al. [14] proposed a CNN architecture for industrial inspection and designed different configurations to study the influence of different hyperparameters on the detection results.
Although the previous methods have achieved good results, some problems remain to be solved. Firstly, many models apply machine learning algorithms to the RGB image alone and do not effectively use information from other sources or representations. Secondly, a single information source cannot completely and effectively reflect the characteristics of an object. Thirdly, because surface defects can occur in any size, shape and orientation, the standard defect feature descriptions obtained by traditional feature extraction methods in machine vision may lead to low classification and recognition accuracy.
To address these problems, a convolutional neural network can be adopted to overcome the difficulties of different feature representations, and a two-stream convolutional neural network can extract the structural information of the image in one stream and its intensity and color information in the other. To improve defect classification accuracy, methods that combine local feature details with the global features of the whole picture have been adopted for defect detection on different materials and for other tasks. Hao et al. [15] proposed a hyperspectral image classification method based on a two-stream network with two subnetworks: one stream takes the spectral values of each input pixel encoded by a stacked denoising auto-encoder, and the other takes the corresponding image patch processed by a deep convolutional neural network. Yu and Liu [16] designed a two-stream framework for aerial scene classification that uses convolutional neural networks to extract features from the original RGB stream and its corresponding saliency stream. Yan et al. [17] predicted image quality with a two-stream convolutional network whose two subnetworks take images and gradient images as input respectively, attending simultaneously to detailed structural features and to intensity information.
Although there is a lot of research on multi-stream networks for classifying and identifying defects in various materials, few two-stream neural network methods integrate global and local features for the classification and detection of aluminum profile defects. To obtain better accuracy and recognize more types of defects, the architecture of the traditional convolutional neural network needs to be improved. In addition, the open aluminum profile database contains few samples, so it is difficult to meet the requirements of training a deep neural network, which needs a large number of training samples. Moreover, the storage capacity and computing performance required by well-known convolutional neural network models are important constraints for real-time defect detection. Therefore, a two-stream convolutional neural network that takes the RGB image and the gradient image as its two inputs and fuses information from multiple sources is adopted for aluminum profile surface defect classification and recognition. It combines local structure details with global image features to identify defects in aluminum profile images effectively and easily.

II. DATA PREPROCESSING
A. DATA ENHANCEMENT
Insufficient training samples of aluminum profiles can lead to low detection accuracy, poor fitting and low robustness of the detection model. To prevent this, the number of input image samples is increased by properly transforming the input images, which effectively solves the lack of training samples and satisfies the requirements for accuracy and stability [18]. Firstly, each aluminum profile image is resized to 227 × 227. Then the following transformations are used to augment the data in this article: flip, brightness adjustment and rotation. The flip operation includes vertical flip, horizontal flip, and vertical flip after horizontal flip. Brightness adjustment changes the gray level of the image through a non-linear mapping with a specified gray range before and after the transformation. The rotation operation rotates the image by a certain angle around its center point using bilinear interpolation. The rotated output is first made large enough to contain the entire rotated image, so that no pixel beyond the original image bounds is lost, and is then cropped so that the output image has the same size as the input image [19], [20]. Figure 1 shows the diagram of data enhancement for an aluminum profile image.
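The three transformations can be illustrated with a minimal sketch on a grayscale image stored as nested Python lists. The function names, the gamma form of the non-linear mapping, and the choice to rotate directly into a same-size canvas (equivalent to rotating and then cropping around the center) are assumptions made for illustration, not the implementation used in this article.

```python
import math

def flip_horizontal(img):
    """Mirror every row (left-right flip)."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in img[::-1]]

def adjust_brightness(img, gamma, in_range=(0, 255), out_range=(0, 255)):
    """Clip to in_range, apply a gamma curve, rescale to out_range.
    With this form, gamma < 1 brightens and gamma > 1 darkens."""
    lo, hi = in_range
    out_lo, out_hi = out_range
    result = []
    for row in img:
        new_row = []
        for v in row:
            t = (min(max(v, lo), hi) - lo) / (hi - lo)
            new_row.append(out_lo + (out_hi - out_lo) * t ** gamma)
        result.append(new_row)
    return result

def rotate_bilinear(img, angle_deg):
    """Rotate about the image center with bilinear interpolation.
    The output keeps the input size; samples taken from outside the
    source image contribute 0."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))

    def pixel(r, c):
        return img[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse mapping: locate the source position of each output pixel.
            x = cx + (c - cx) * cos_a + (r - cy) * sin_a
            y = cy - (c - cx) * sin_a + (r - cy) * cos_a
            x0, y0 = math.floor(x), math.floor(y)
            fx, fy = x - x0, y - y0
            top = (1 - fx) * pixel(y0, x0) + fx * pixel(y0, x0 + 1)
            bottom = (1 - fx) * pixel(y0 + 1, x0) + fx * pixel(y0 + 1, x0 + 1)
            out[r][c] = (1 - fy) * top + fy * bottom
    return out
```

Composing `flip_vertical(flip_horizontal(img))` gives the third flip variant described above.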

B. GRADIENT IMAGE OF THE ORIGINAL RGB IMAGE
Under normal circumstances, the gradient image robustly reflects the detailed structures of an image under variations of intensity and color, and the gradient information of the defect area is particularly useful for defect classification and recognition [21]. We believe that two-stream feature fusion based on the gradient image stream and the original RGB image stream will improve detection performance.
As an important part of data preprocessing, the image gradient algorithm enhances the contrast between the defect area and the homogeneous background, because the gradient direction is the direction of the maximum rate of change of the image gray level, and the image gradient therefore strengthens the edge information in the image. The defect areas of an aluminum profile usually show a large rate of change in gray level, so the gradient algorithm can eliminate most of the redundant background information while retaining the more significant edge information. When calculating the gradient image of an aluminum profile image, different operators are compared to ensure that the calculated gradient image is the most representative. Considering the characteristics of the different defects, the Sobel operator is selected on the basis of an experimental comparison of gradient algorithms with different edge detection operators [22]. The image gradient is generally obtained from the derivative of the image function, which expresses the rate of change of the image gray level. In most cases, the derivative is approximated by a difference. The difference form of the Sobel operator is shown in Formula 1 [23].
The image gradient is calculated by the Sobel operator with a 3 × 3 kernel (9 pixels) that slides over the image in two directions and is convolved with the image. The operator computes a weighted gray-level difference between the upper and lower, and the left and right, neighbors of each pixel, and an edge is detected where this difference reaches an extreme value. There are two Sobel operators: one detects horizontal edges and the other detects vertical edges. The calculation formulas are shown in Formula 2 and 3 [24].
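For reference, the standard 3 × 3 Sobel formulation, written with the symbols f(x, y), G_x, G_y and G used in this section, can be stated as follows. The kernel signs and orientation follow one common convention and are an assumption here, since several equivalent variants are in use; the second expression for G is the usual simplification that avoids the square root.

```latex
\begin{aligned}
G_x &= \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * f(x, y), &
G_y &= \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * f(x, y), \\
G &= \sqrt{G_x^{2} + G_y^{2}}, &
G &\approx |G_x| + |G_y|.
\end{aligned}
```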
Here, f(x, y) is the image gray value at position (x, y), G_x and G_y are the results of convolving the image with the Sobel operators in the horizontal and vertical directions respectively, and G is the final edge amplitude, which is sometimes simplified as shown in formula 4 to reduce the amount of calculation [25].
With the above calculation method, the gradient image combining the X direction and the Y direction is computed for the original aluminum profile RGB images with different types of defects. The gradient image is then normalized so that the data lie between zero and one on a common scale. Figure 2 shows the original RGB images and the corresponding gradient images [26].
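The gradient computation and the normalization step can be sketched as below for a single-channel image held in nested lists. This is a minimal illustration: the border handling (edge replication), the min-max form of the normalization and the function names are assumptions, and the article does not state whether the gradient is taken per color channel or on a gray-level conversion.

```python
import math

# 3 x 3 Sobel kernels; the sign convention is one of several in common use.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to left-right change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to top-bottom change

def sobel_gradient(gray):
    """Return the gradient magnitude sqrt(Gx^2 + Gy^2) for every pixel."""
    h, w = len(gray), len(gray[0])

    def px(r, c):
        # Replicate the nearest border pixel outside the image (assumption).
        return gray[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    mag = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    v = px(r + i - 1, c + j - 1)
                    gx += SOBEL_X[i][j] * v
                    gy += SOBEL_Y[i][j] * v
            mag[r][c] = math.hypot(gx, gy)
    return mag

def normalize01(img):
    """Min-max scale the values into [0, 1]; a constant image maps to zeros."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [[0.0] * len(row) for row in img]
    return [[(v - lo) / (hi - lo) for v in row] for row in img]
```

On an image with a single vertical step edge, `normalize01(sobel_gradient(img))` is 1 along the step and 0 in the flat regions, which matches the intent of suppressing homogeneous background.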

III. PROPOSED METHOD
A. TWO-STREAM CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE
The traditional convolutional neural network, generally operated on an RGB image or a gray image, extracts effective feature maps from a single image by nonlinear transformation. The RGB image carries important information by itself, including color, texture, shape and spatial relation features. Compared with the original RGB image, the gray image contains no color information, so its information content is greatly reduced and the image processing cost is reduced accordingly. These traditional methods, which train a single-stream convolutional neural network on RGB or gray images, do not explicitly handle local structure characteristics. To extract local structural detail as well, we select the gradient image as a second network input for feature extraction, because the image gradient clearly reflects the high-frequency information of an image and contains the most significant detail features for identifying defects in aluminum profile images effectively and easily. The gradient image is also more robust under different illumination conditions: the gradient feature captures the relationship between adjacent pixels in the aluminum profile image, which is insensitive to illumination. However, the experiments described in Section IV show that, on a single-stream network, using the gradient image alone gives a defect classification accuracy similar to or worse than using the original RGB image alone. In addition, overfitting occurs in both cases [17].
Although deep learning with a single-stream network performs reasonably well in the above works, the results are still some distance from satisfactory. It also remains an open question whether multi-feature fusion is advantageous or whether one image suffices. The image gradient keenly reflects the structural components of an image, such as its edges, and does so robustly under variations of intensity and color. The gradient amplitude reflects changes in texture detail, and the histogram of gradient directions describes the appearance and shape of local objects. For these reasons, the two-stream technique is commonly used to improve performance by fusing features extracted by two independent convolutional neural networks. Features extracted from the gradient image reflect image edges and other details well and reduce the difficulty of extracting effective features from the RGB image stream. The gradient image stream focuses on detailed structural features, while the RGB image stream pays more attention to intensity information and content features. As far as we know, the multi-feature fusion method based on two independent convolutional neural network streams (RGB image stream and gradient image stream) has not yet been applied to aluminum profile defect classification and detection.
Because aluminum profile defects are varied, subtle and complex in shape, and are hard to find without careful observation, they remain difficult to distinguish even when they are manually marked. To improve detection accuracy, we design a two-stream convolutional neural network model (TSCNNM) that takes the original RGB image of the aluminum profile and its corresponding gradient image as inputs, extracts features through two networks, and fuses the features on a concatenate layer before passing them to the classification layer. In the two-stream network used in this article, the gradient image is trained together with the original RGB image as a network input in order to retain more edge information. The two-stream network architecture shown in Figure 3 consists of two independent streams that take an original RGB image and the gradient image of that RGB image as input respectively. The domain knowledge of aluminum profile defects is fully considered in this design. The network proposed for defect classification consists of two convolutional neural networks, which process the complete RGB defect image and the corresponding gradient image respectively. The output features of the two streams are then spliced and fused on one layer. Finally, the classifier produces the classification probability from the fused features [27], [28].
An appropriate convolutional network must be selected for defect classification and identification to ensure an acceptable processing rate and accuracy. Each stream of the two-stream network architecture in this article is based on AlexNet [29]. To meet the network input requirements, all images are standardized to a size of 227 × 227. The complete original RGB defect image and the corresponding gradient image are then fed to their respective feature extraction networks. Each feature extraction network consists of five convolution layers and one fully connected layer for extracting feature maps effectively. The design of the convolution layers is identical to that of AlexNet. A Rectified Linear Unit (ReLU) immediately follows each convolution layer. Response-normalization layers follow only the conv1 and conv2 layers, and the three pooling layers follow the norm1 layer, the norm2 layer and the fifth convolution layer respectively. The fully connected layer is followed by a dropout layer to prevent overfitting. The feature vectors of the fully connected layers (after the ReLU) of the two streams are then spliced on the concatenate layer, which joins the outputs of the two feature extraction networks. The fused feature is input to one fully connected layer of 1024 nodes that comes after the concatenate layer, and its output is fed into an SVM classifier that maps the features to a probability distribution over defect classes for the final decision. The final defect type prediction covers 10 classes [30], [31].

B. FEATURE EXTRACTION
The purpose of the convolution operation is to extract different features of the input. The early convolution layers can only extract low-level features such as edges, lines and corners. The later convolution layers iteratively build more complex features from these low-level features. In the first convolution layer, ninety-six convolution kernels of size 11 × 11 × 3 slide over images of size 227 × 227 × 3 with a stride of 4 pixels. The resulting convolution feature is passed through the ReLU and normalization operations to obtain the rectified feature map, and a max-pooling operation with a 3 × 3 kernel and a stride of 2 completes the convolution and pooling of this layer and provides the output to the next layer. The other convolution layers operate in basically the same way as the first, except that the convolution kernels differ. The convolution kernel size is 5 × 5 × 48 in the second layer, 3 × 3 × 256 in the third layer, and 3 × 3 × 192 in the remaining layers. In addition, the first, second and fifth convolutions are followed by an overlapping pooling operation with a 3 × 3 window and a stride of 2 [29].
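The spatial size of each feature map follows from the usual output-size rule floor((n + 2p − k) / s) + 1. The sketch below traces the sizes through the five convolution layers and three pooling layers for a 227 × 227 input. The padding values and the output channel counts of layers two to five are those of the standard AlexNet and are assumptions here, since this section states only the kernel sizes and the 96 kernels of the first layer.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, output channels or None for pooling).
# Padding and channel counts beyond conv1 follow standard AlexNet (assumption).
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]

def trace_shapes(input_size=227, channels=3):
    """Return (name, height, width, channels) after every layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, pad, out_channels in LAYERS:
        size = out_size(size, kernel, stride, pad)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, size, size, channels))
    return shapes

if __name__ == "__main__":
    for shape in trace_shapes():
        print(shape)
```

Under these assumptions the last pooling layer yields a 6 × 6 × 256 map, that is 9216 values per stream entering the fully connected layer.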
Taking an aluminum profile image with a convex powder defect as an example, the RGB image and the corresponding gradient image are input to their respective convolutional neural networks to obtain the feature map of each convolution layer. Figure 4 shows the feature maps of the original RGB image with the convex powder defect (top) and those of the corresponding gradient image (bottom) extracted from the five convolution layers. The image features extracted by the two sub-network streams are not identical: some separate the background, while others capture contour features, color difference features, and even solid color features.
After the features of each layer have been extracted from the original RGB image and the gradient image in this way, we normalize the two sets of features by their standard deviation and then merge them by weighting in the feature fusion stage. The purpose is to bring data of different scales to the same scale, so that the two sets of features are comparable.

C. FEATURE FUSION
There are many feature fusion methods, such as Sum fusion, Maximum fusion, Concatenation fusion and Wavelet transform fusion. The choice of fusion method and of fusion layer affects the classification and recognition accuracy. Because a single image feature cannot fully describe the image, two identical sub-network streams are trained at the same time in this article to extract different and effective image features from different image inputs. The output features of the two streams at a given layer are weighted, fused and fed into the corresponding SVM classifier for training to obtain the final classification results. To improve the accuracy of defect detection for aluminum profiles, we compared the effect on accuracy of different fusion layers under different fusion strategies. The comparison results are listed in Section IV. Based on this comparison, we selected the fusion strategy and fusion layer that give the better accuracy [32], [33].
We have evaluated different feature fusion strategies and compared their accuracy for aluminum profile defect classification. The Sum fusion strategy computes the sum of the values at the same location of the two feature maps, as shown in formula 5.
Here, 1 ≤ i ≤ H, 1 ≤ j ≤ W, f^a_{i,j} is the feature value of the feature map of the original RGB image at point (i, j), and f^b_{i,j} is that of the corresponding gradient image [34].
The Concatenation fusion strategy is shown in formula 6. The spliced feature matrix is a vertical concatenation, in which the matrix f^b is appended below the last row of the matrix f^a [34].
The Maximum fusion strategy, which takes the maximum value at the same position of the two feature maps, is shown in formula 7 [34].
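The three element-wise strategies can be summarized in a short sketch operating on two feature maps of equal size stored as nested lists. The function names are hypothetical, and the sketch covers only the unweighted form described in the text.

```python
def sum_fusion(fa, fb):
    """Add the values at the same location of the two feature maps."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(fa, fb)]

def max_fusion(fa, fb):
    """Keep the larger value at each location."""
    return [[max(a, b) for a, b in zip(row_a, row_b)] for row_a, row_b in zip(fa, fb)]

def concat_fusion(fa, fb):
    """Append the rows of fb below the last row of fa."""
    return [row[:] for row in fa] + [row[:] for row in fb]

if __name__ == "__main__":
    rgb_map = [[1, 5], [3, 0]]
    grad_map = [[2, 4], [0, 7]]
    print(sum_fusion(rgb_map, grad_map))
    print(max_fusion(rgb_map, grad_map))
    print(concat_fusion(rgb_map, grad_map))
```

Sum and Maximum fusion keep the H × W size of the inputs, whereas Concatenation fusion doubles the number of rows, so the layer that follows must be sized accordingly.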
In the Wavelet transform fusion strategy, a 5-level wavelet decomposition of the two feature images is performed with the 'db2' wavelet, and the maximum of the approximation signals and the minimum of the detail signals are taken for fusion [35].
The two-scale formula of the two-dimensional image wavelet decomposition is shown in formula 8 [36].
Here, ϕ is the scale function and ψ is the wavelet function.
The feature maps of the aluminum profile image and of the corresponding gradient image are each decomposed into sub-images containing high-frequency and low-frequency information by the two-dimensional Mallat algorithm of the wavelet transform.
The low-frequency information contains the main outline of the image and reflects its approximate and average characteristics. The high-frequency information contains the details of the image, such as bright lines, boundaries and area contours.
The N-level decomposition of an aluminum profile feature map can be realized by the Mallat algorithm. Each decomposition level produces four sub-images, which represent the low-frequency component, the horizontal high-frequency component, the vertical high-frequency component and the diagonal high-frequency component of the decomposed image. Different decomposition levels have different effects. The number of levels should be moderate; otherwise the low-frequency contour is either not separated well or is lost to too great an extent. In this article, 5-level decomposition is used, and the fusion is then carried out with appropriate fusion rules at this number of decomposition levels. The selection of fusion rules is an important part of image fusion and directly determines the quality of the fused image. Many experiments were carried out with a variety of fusion rules, and the rules with the highest detection rate were selected. In this article, the high-frequency part is fused by the minimum detail fusion rule and the low-frequency part by the maximum approximation fusion rule, whereas the weighted average method is the rule generally used for low-frequency coefficients [37], [38].
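The decompose, fuse and reconstruct sequence can be sketched in a dependency-free form. Note the substitutions: the sketch uses the Haar wavelet instead of 'db2' to keep the filters trivial, it requires even map sizes at every level, and it interprets the "minimum detail" rule as keeping the coefficient of smaller magnitude. All three are assumptions made for illustration, not the configuration evaluated in this article.

```python
def haar_dwt2(img):
    """One-level 2-D Haar decomposition (even height and width required).
    Returns [approximation, row detail, column detail, diagonal detail]."""
    h, w = len(img), len(img[0])
    bands = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            i, j = r // 2, c // 2
            bands[0][i][j] = (a + b + d + e) / 2  # low-frequency approximation
            bands[1][i][j] = (a + b - d - e) / 2  # top row minus bottom row
            bands[2][i][j] = (a - b + d - e) / 2  # left column minus right column
            bands[3][i][j] = (a - b - d + e) / 2  # diagonal difference
    return bands

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, t, u, v = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + t + u + v) / 2
            img[2 * i][2 * j + 1] = (s + t - u - v) / 2
            img[2 * i + 1][2 * j] = (s - t + u - v) / 2
            img[2 * i + 1][2 * j + 1] = (s - t - u + v) / 2
    return img

def _combine(xa, xb, pick):
    """Apply pick(a, b) element by element to two equally sized maps."""
    return [[pick(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(xa, xb)]

def wavelet_fuse(fa, fb, levels=1):
    """Fuse two maps: maximum rule on the coarsest approximation,
    smaller-magnitude rule on the detail bands of every level."""
    a_ll, *a_details = haar_dwt2(fa)
    b_ll, *b_details = haar_dwt2(fb)
    if levels > 1:
        approx = wavelet_fuse(a_ll, b_ll, levels - 1)
    else:
        approx = _combine(a_ll, b_ll, max)
    smaller = lambda a, b: a if abs(a) <= abs(b) else b
    details = [_combine(da, db, smaller) for da, db in zip(a_details, b_details)]
    return haar_idwt2(approx, *details)
```

Fusing a map with itself returns the map unchanged, which is a convenient check that the decomposition and reconstruction are consistent.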
The effect of Wavelet transform fusion is that if a target is more significant in the gradient image than in the original RGB image before fusion, the target from the gradient image is retained after fusion and the same target from the original RGB image is ignored. In this way, the wavelet transform coefficients of the targets in the two images dominate at different resolution levels, so that the salient objects of both the gradient image and the original image are preserved in the final fused image. The flow chart of wavelet decomposition and fusion is shown in Figure 5. We also evaluated fusion on different layers, and the accuracy differs greatly depending on the layer at which the features are fused. Fusing the features from the conv5 layer or earlier layers before they are input into the classification model is also described as feature level fusion, as shown in Figure 6 [39], [40].
Feature fusion before the fully connected layers does not achieve higher accuracy for aluminum profile defect detection, and the results are even worse than expected. In contrast, fusing the outputs of the first fully connected layer gives better results that satisfy the defect detection requirements for the aluminum profile data set in this article, and it is more flexible. This fusion method can be regarded as a combination of the output classifications of the two streams, and its structure is shown in Figure 7 [39], [40].
In the proposed method, the two features from fully connected layer 6 are integrated by Wavelet transform fusion and fed to the SVM, which is trained to generate the final detection result. The two streams do not contribute equally to detecting defects, so they should be given different weights in order to make the best use of the advantages of each stream. Therefore, the weighted average method is often used in wavelet transform fusion.

D. OPTIMIZE CLASSIFIER
In a deep neural network, the hyperparameter combination is not easy to tune, because training the network is very time-consuming and the hyperparameters cannot be optimized by gradient descent like ordinary parameters. The time cost of evaluating one hyperparameter configuration is very high. The choice of hyperparameters has a great influence on the final performance of the model, and different models have different optimal hyperparameter combinations. The influence of individual hyperparameters also varies widely: some, such as the regularization coefficient, have limited influence on model performance, while others, such as the learning rate, have great influence. It is therefore critical to optimize the hyperparameters with an appropriate method in order to choose the best-performing configuration [41].
In this article, the classifier is a Support Vector Machine (SVM) model. The optimizer is Bayesian Optimization, an adaptive hyperparameter search method that uses the hyperparameter combinations tested so far to predict the next combination with the maximum expected benefit, and thereby selects the model and its hyperparameter values [42]. We then compute the cross-validation classification error for each model and use the standard categorical cross-entropy loss to optimize the two-stream network. After the optimization, the model is trained on the whole training data set to obtain the optimized model, which is then applied to the test data to check its performance [43].
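The objective evaluated for each candidate configuration, the k-fold cross-validation classification error, can be sketched as follows. The sketch does not reproduce the SVM or the Bayesian optimizer: a nearest-centroid classifier stands in for the SVM purely so that the example is self-contained, and all function names are hypothetical. An optimizer would call `cv_error` once per candidate configuration and keep the configuration with the lowest value.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size.
    Samples are assumed to have been shuffled beforehand."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error(fit, predict, X, y, k=5):
    """Mean misclassification rate over k held-out folds."""
    errors = []
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in fold)
        errors.append(wrong / len(fold))
    return sum(errors) / len(errors)

# Stand-in classifier (NOT the SVM of the article): nearest class centroid.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_centroid(model, x):
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], x))
    return min(model, key=sq_dist)
```

For example, `cv_error(fit_centroids, predict_centroid, X, y, k=5)` returns a value between 0 and 1, where 0 means every held-out sample was classified correctly.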

IV. EXPERIMENTS AND DISCUSSION
The reasonableness, validity, robustness and classification performance of the two-stream convolutional neural network model proposed in this article are verified on an aluminum profile defect data set covering 10 categories of visual defects occurring on the surface. Firstly, AlexNet is used as the basic network to realize the two-stream convolutional neural network model in the experiment. In this case, the two-stream network is pre-trained. During the training stage, the training RGB image samples and the corresponding gradient image samples are input to the respective streams of the two-stream network, and a concatenate layer is used to fuse the RGB image and gradient image features. We then analyze the proposed method experimentally. We compare our method with single-stream network structures whose input is either the original RGB image or the gradient image. We also study how different feature fusion strategies on different layers of the AlexNet network influence the defect classification performance for aluminum profiles. Finally, we compare the optimized network with the un-optimized network. To demonstrate the robustness of our network, the samples are randomly reordered five times, and the values obtained in network training and testing are reported for each ordering.

A. EXPERIMENTAL DEVELOPMENT ENVIRONMENT AND DATA
We conducted our experiments in Matlab on a training system with twelve-core 2.5 GHz dual CPUs, 128 GB of memory and an NVIDIA Tesla M40 24 GB GPU to carry out the calculations of the proposed two-stream network. The data used in this article are mainly from [44]. We selected nine sample types with a single defect and one defect-free sample type. The selected samples comprise a total of 1745 aluminum profile images, manually divided into 10 categories. The ten categories are non-conducting (n-c), abrasion mark (AM), horizontal stripe shallow recessing (HSSR), orange peel (OP), traffic mark (TM), crater formation (CF), coating cracking (CC), dirty points (DP), convex powder (CP) and flawless sample (FS). All images have an original resolution of 2560 × 1920 and are cropped to 227 × 227 pixels to retain the aluminum profile pixels and remove unnecessary background pixels. They are randomly divided into training, validation and testing samples, and the number of each image type is shown in Table 1. The samples in the data set are augmented by data enhancement, and the sample counts after augmentation are also shown in Table 1. The number of corresponding gradient images is the same as the number of original images.
The whole data set used in the experiment is divided into two parts. The first part contains 1655 original samples and 8882 samples after data enhancement, which are randomly shuffled and split into 70% training samples and 30% validation samples within each category for each evaluation. The remaining part, which contains 90 original samples and 545 samples after data enhancement, is used to test the trained classifier.
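The per-category 70%/30% split described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Matlab code used in the experiments: the function name `stratified_split`, the fixed seed and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Shuffle sample indices within each category, then split each category
    into a training part and a validation part."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        val.extend(idx[cut:])
    return train, val

# Toy labels only; these are not the sample counts of Table 1.
labels = ["AM"] * 10 + ["FS"] * 20
train_idx, val_idx = stratified_split(labels)
```

Splitting inside each category keeps the class proportions of the training and validation sets equal, which matters when some defect types have far fewer samples than others.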
In training the network, we set the learning rate to 0.001, the batch size to 10, the validation frequency to 30, the momentum to 0.9 and the number of epochs to 8 (an epoch is one full pass of the network over the training set). Other parameters are left at their default values.
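To make the roles of the learning rate and the momentum concrete, the sketch below applies the common stochastic gradient descent with momentum update. It assumes this standard update rule, since the solver is not spelled out here beyond its parameters; the function name `sgdm_step` and the toy objective are hypothetical.

```python
def sgdm_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update (assumed rule):
    v <- momentum * v - lr * grad, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy objective f(w) = w^2 with gradient 2w, used only to exercise the update.
w, v = [1.0], [0.0]
for _ in range(10):
    w, v = sgdm_step(w, [2.0 * w[0]], v)
```

With a momentum of 0.9, each step reuses 90% of the previous update direction, which smooths the trajectory when mini-batches are as small as 10 samples.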

B. COMPARISON OF THE PROPOSED TWO-STREAM NETWORK AND THE SINGLE-STREAM NETWORK
In this section, we compare the two-stream network with the single-stream network to demonstrate, through experimental analysis, the reasonableness of the network designed in this article. We then design experiments that fuse the features from different layers of the network and feed them to the SVM for classification. For the aluminum profile defect classification task, a variety of evaluation metrics are used, including accuracy, Receiver Operating Characteristic, precision, F1-score and so on [45].
Firstly, we compare the validation and test results of aluminum profile defect classification and detection between the two-stream and single-stream networks. We use accuracy and the estimated generalized classification error (EGC-Error) to measure the difference between the networks. The specific data are shown in Table 2. All defect classification and detection experiments in this article are conducted on the same data set, with the same training and validation split settings and the same test data.
We compare the accuracy of the single-stream networks and the two-stream network when features from different layers are fed to the SVM classifier. Table 2 lists the test accuracy and the EGC-Error of the single-stream and two-stream networks on the aluminum profile image data set. Table 2 shows that the two-stream network achieves better accuracy no matter which layer is used for feature fusion with the wavelet transform. The trends of the test accuracy and the estimated generalized classification error for the three networks on different layers are shown in Figure 8.
Figure 8 shows the test accuracy and estimated generalized classification error trend curves for three networks: the two-stream network and two single-stream networks, one for original RGB images and one for gradient images. The red curve denotes the two-stream network on the test set, and the black and blue curves denote the single-stream networks on the test set. Table 2 and Figure 8 therefore indicate that, on the data set used in this article, the proposed multi-feature fusion strategy achieves better detection accuracy for aluminum profile defects than the single-feature methods.

C. COMPARISON OF DIFFERENT FEATURE FUSION STRATEGIES ON DIFFERENT LAYERS
To evaluate the effects of different feature fusion strategies on different layers, several groups of experiments were designed. The specific data are shown in Table 3. We compare the accuracy of four feature fusion strategies: Sum fusion, Maximum fusion, Concatenation fusion and Wavelet transform fusion. To find the most suitable fusion layer for defect classification and detection on this data set, we also study how the choice of fusion layer influences the accuracy. The candidate feature fusion layers are the conv5, relu5, pool5, fc6, relu6, drop6 and fc7 layers.
The test accuracy for different feature fusion strategies on different layers is compared in Figure 9 (left), and the corresponding estimated generalized classification error on the test set is compared in Figure 9 (right).
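The four fusion strategies can be illustrated on two small feature vectors with the Python sketch below. Sum, Maximum and Concatenation fusion follow directly from their names. The wavelet variant is only an assumed rule for illustration (one-level Haar transform, averaged approximation coefficients, larger-magnitude detail coefficients) and is not claimed to match the formulation used in this article. All function names and the toy vectors are hypothetical.

```python
import math

def sum_fusion(a, b):
    """Element-wise sum of the two feature vectors."""
    return [x + y for x, y in zip(a, b)]

def max_fusion(a, b):
    """Element-wise maximum of the two feature vectors."""
    return [max(x, y) for x, y in zip(a, b)]

def concat_fusion(a, b):
    """Stack both vectors; the fused feature is twice as long."""
    return list(a) + list(b)

def haar_forward(v):
    """One-level Haar transform of an even-length vector."""
    s = math.sqrt(2.0)
    approx = [(v[i] + v[i + 1]) / s for i in range(0, len(v), 2)]
    detail = [(v[i] - v[i + 1]) / s for i in range(0, len(v), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def wavelet_fusion(a, b):
    """ASSUMED rule: average the approximation coefficients and keep the
    detail coefficient with the larger magnitude, then invert."""
    aa, ad = haar_forward(a)
    ba, bd = haar_forward(b)
    approx = [(x + y) / 2.0 for x, y in zip(aa, ba)]
    detail = [x if abs(x) >= abs(y) else y for x, y in zip(ad, bd)]
    return haar_inverse(approx, detail)

# Toy features standing in for the RGB-stream and gradient-stream activations.
rgb_feat = [1.0, 3.0, 0.0, 2.0]
grad_feat = [2.0, 2.0, 4.0, 0.0]
fused = wavelet_fusion(rgb_feat, grad_feat)
```

Sum, Maximum and this wavelet rule keep the feature length unchanged, whereas Concatenation doubles it, so the classifier that follows receives a larger input under Concatenation fusion.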
As shown in Table 3 and Figure 9, Maximum fusion and Wavelet transform fusion perform better in our experiments, while Concatenation fusion performs worst. The classification accuracy obtained by fusing convolutional layer features is clearly lower than that obtained by fusing on the FC6 layer. The total training, validation and test times on the whole data set for Maximum fusion, Sum fusion, Concatenation fusion and Wavelet transform fusion are 674.872 s, 1380.963 s, 1756.028 s and 690.016 s respectively, so Maximum fusion and Wavelet transform fusion require less computation than the others. However, the estimated generalized classification error with Wavelet transform fusion is the lowest. Therefore, from the analysis of the experimental data, feature fusion on the ReLU6 layer with Wavelet transform fusion is, overall, more suitable for aluminum profile defect detection on the data set used in this article. The accuracy of the adopted method is 96.37% on validation and 94.44% on test, and the estimated generalized classification error is 0.010.
The accuracy and loss curves of the adopted model on the training set and validation set as the number of iterations increases are shown in Figure 10.
The accuracy curve increases and the loss curve decreases smoothly as the number of training iterations increases, and the curves are relatively stable with little fluctuation. As shown in Figures 9 and 10, the proposed method obtains a satisfactory result with good convergence within 8 epochs and 384 iterations.
With the optimal network architecture determined, we now compare the positive predictive values and true positive rates of each class for the proposed method; the specific data are shown in Table 4. The evaluation metrics, including precision, recall, specificity and F1-score, for each aluminum profile defect class are compared in Figure 11.
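The per-class metrics named above follow from a confusion matrix by their standard definitions, as in the minimal Python sketch below. The function name `per_class_metrics` and the three-class toy matrix are hypothetical and are not the values of Table 4.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.
    Returns one dict per class with precision (positive predictive value),
    recall (true positive rate), specificity and F1-score."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return metrics

# Toy 3-class confusion matrix (illustrative values only).
toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 1, 9]]
m = per_class_metrics(toy)
```

Reading a row of the confusion matrix shows which classes a given defect is mistaken for, which is the kind of analysis applied to the per-class results.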
As can be seen in Table 4 and Figure 11, abrasion mark is easily mistaken for convex powder, flawless sample, traffic mark and dirty points. In addition, a small traffic mark or abrasion mark defect, or a flawless sample with a defective background, is easily mistaken for dirty points. According to the above results, the two-stream network makes good use of the structure and visual information of multiple streams to obtain effective features. The high-level feature fusion scheme for the two-stream network performs better than the low-level feature fusion scheme or the single-stream network. The gradient image plays an important role in improving image representation and recognition. The proposed Wavelet transform fusion strategy achieves a better fusion effect.
The above results show that the accuracy of most early feature fusion schemes is less than 83%. In contrast, the accuracy of later feature fusion is about 10% higher. Feature fusion on the FC6, ReLU6 and Drop6 layers achieves a better detection accuracy of 92%. The adopted method performs better than the single-stream networks, improving accuracy by about 3% and about 7% respectively. The experimental results show that the proposed network structure can effectively achieve better detection accuracy.

D. OPTIMIZATION INFLUENCE
We often need to optimize the system parameters to achieve the desired detection results. This section shows how hyperparameter optimization is used to optimize a classification model with training predictor and response data. Experimentally, the optimum model and its hyperparameter values are obtained by using the Bayesian Optimization function and computing the cross-validation classification error; the resulting model is expected to classify new data optimally. Because hyperparameter adjustment can cause overfitting, the trained optimizable model does not always have a higher accuracy than the previously trained models, so we compare it with them and run the optimization for longer to try to get better results.
In the optimization process, the accuracy and generalized classification errors are compared before and after optimization. A higher accuracy and a lower error value indicate that the classifier generalizes better. The comparison of the accuracy and the generalized classification error is shown in Table 5. The minimum observed objective is then compared with the estimated minimum objective, as shown in Figure 12. The total number of function evaluations is set to 30. The curve of the estimated objective is nearly consistent with that of the minimum observed objective. The minimum estimated objective value is 0.001730, compared with a minimum observed objective value of 0.001725. The total optimization time is 2038.486 seconds.

V. FUTURE WORK
As the proposed neural network system has not yet achieved the best effect, it is still necessary to improve the performance of defect detection. Firstly, because the samples available for training a deep neural network are insufficient, more varied aluminum profile surface defect samples will be provided for training the network in the future. Secondly, to improve detection accuracy for the defect classes with poor performance, we will improve the neural network structure through deeper analysis. We can consider adding a network stream to the method in this article to realize a multi-stream network. We can generate ROIs around defects in the aluminum profile area, using a region segmentation method or an RPN (Region Proposal Network), as the input of a third neural network stream for multi-feature fusion to obtain better extracted features. For example, the defect-free aluminum profile image shown in Figure 13 has an abrasion mark defect, indicated by the yellow rectangle, in its background area, so it is likely to be predicted as an abrasion mark sample during training. By extracting the smallest ROI that covers the whole aluminum profile, indicated by the red rectangle in Figure 14, the defect location in the image can be analyzed intuitively. The purpose of this method is to reduce the influence of similar defects in the background area on the detection accuracy.
Finally, the proposed network can also deal with the defect detection of other materials, such as steel surface defects, infrastructure surface cracks, rail surface defects and others.

VI. CONCLUSION
Firstly, we propose a novel two-stream convolutional neural network that extracts features from an RGB image stream and a gradient image stream and effectively fuses multiple features for defect classification and detection of aluminum profiles, which is an important innovation of this article. Each network stream has a feature learning process from low level to high level. Secondly, we introduce the image enhancement method and the method of obtaining the gradient image for image preprocessing. Because deep learning mostly relies on a large amount of training data, data enhancement methods are used to obtain a large number of samples in order to prevent overfitting. The image gradient is calculated with the Sobel operator and normalized to transform the data between zero and one under the same dimension. Thirdly, a large number of experiments are carried out on the single-stream network and the two-stream network with different feature fusion strategies on different fusion layers, using the original RGB images and the corresponding gradient images of aluminum profiles, to prove the effectiveness of the proposed two-stream network.

FIGURE 1. The data enhancement diagram for aluminum profile image.

FIGURE 2. The aluminum profile RGB images with defects and the corresponding gradient images.

FIGURE 3. The two-stream network architecture.

FIGURE 4. Feature maps of the original RGB image with convex powder defect (top) and the corresponding gradient image (bottom).

FIGURE 7. Fully connected layer late fusion strategy.

TABLE 2.

FIGURE 8. Comparison of the test accuracy and EGC-Error for single-stream and two-stream network on different layers.

FIGURE 10. The accuracy and loss curve during network training process with Wavelet transform fusion strategy on relu6 layer.

FIGURE 11. Comparisons on evaluation metrics for ten aluminum profile defect types.

FIGURE 12. Comparison of the observed min objective with estimated min objective.

FIGURE 13. Flawless sample with abrasion mark defect in background area.

FIGURE 14. ROI of flawless sample with the red rectangle.
CHUNMEI DUAN is currently pursuing the Ph.D. degree with the Institute of Light Alloy, Central South University, Changsha, China. She is also a Senior Engineer. Her research interests include computer vision, deep learning, and defect classification and identification.
TAOCHUAN ZHANG received the Ph.D. degree from Central South University. He is currently an Associate Professor. His research interests include signal processing, deep learning, and fault diagnosis.

TABLE 1. Classification statistics of aluminum profile data set.

TABLE 3. Comparison of different feature fusion strategies on different layers for the two-stream network. Comparison of the test accuracy and EGC-Error with different feature fusion strategies on different layers.

TABLE 4. Comparison on evaluation metrics for each class.

TABLE 5. Comparison of accuracy and estimated generalized classification error before and after optimization.