A Novel Multi-Branch Channel Expansion Network for Garbage Image Classification

Due to the lack of training data, deep learning has rarely performed well in garbage image classification. We choose the TrashNet data set, which is widely used in this field, and try to overcome its data deficiencies by optimizing the network structure. We find that deeper networks and short-circuit connections, which are generally accepted in deep learning, do not work well on the TrashNet data set. By analyzing and modifying the network structure, we propose an effective method to improve network performance on the TrashNet data set. This method widens the network by expanding branches and then uses add layers to fuse the feature information. It makes full use of feature information at a slight additional computational cost. When this method is used to replace the core structure of the Xception network, the performance of the network improves greatly. Finally, the proposed M-b Xception network achieves 94.34% classification accuracy on the TrashNet data set and has certain advantages over some state-of-the-art methods on multiple indicators. The Python code can be downloaded from https://github.com/scp19801980/Trash-classify-M_b-Xception.


I. INTRODUCTION
With the development of human society, environmental pollution is becoming more and more serious [1], and it does great harm to the earth and all its organisms [2]. Much of this pollution is caused by domestic garbage. The decomposition of some domestic garbage may lead to high concentrations of chemical substances in the environment [3], damaging the ecological environment. Some domestic wastes rarely biodegrade at all. For example, plastics are ubiquitous pollutants in marine environments around the world [4]. Therefore, the first step in solving waste pollution is to classify waste according to its nature. Many countries require wastes to be sorted before dumping [5].
Nevertheless, for those who lack professional knowledge, it is very difficult to identify all kinds of domestic garbage accurately. An intelligent garbage classification system can solve this problem. Applied to intelligent garbage bins or smart phones, it can guide people to dump domestic garbage correctly. At present, the difficulty is that intelligent garbage classification systems cannot classify garbage images accurately.
The associate editor coordinating the review of this manuscript and approving it for publication was Jun Wang.
In the past decade, owing to improvements in computing power and theory, deep learning has entered a period of rapid development [6]. Deep learning has now penetrated all aspects of computer vision and has achieved exciting results in image classification, target detection and image semantic segmentation tasks [7]. The main advantages of deep learning over other machine learning methods are its powerful modeling ability [8] and its end-to-end learning, which frees people from heavy manual work [9]. Deep learning is highly dependent on data [10]. As a relatively new field, garbage classification had no standardized data sets for neural networks to learn from, so in 2016 Mindy Yang and Gary Thung established the TrashNet database for garbage image classification [11]. Since then, work on garbage image classification based on deep learning has gradually increased. Although the TrashNet data set has been widely used, the works based on it have not achieved good results because of its small number of images and its lack of feature information.
For garbage image classification based on deep learning, people have gradually tended to use deeper neural networks. In fact, image processing based on deep learning is mostly developing towards deeper networks [12]. Generally speaking, deepening the network makes the work of each layer simpler, which helps to obtain a better nonlinear representation ability [13] and a better fit for complex feature information. However, deepening the network does not guarantee an improvement. A deeper neural network is usually appropriate for target detection, image semantic segmentation, or complex scene classification [14]. For TrashNet, a small data set with a single background, the difficulty comes from the data set itself, i.e., the small amount of feature information, the small number of data samples, and the large similarity among classes. In this case, classification performance can hardly be improved by increasing network depth. Therefore, we should consider the particularity of garbage classification and explore a more effective method.
In summary, the main contributions of this article are listed as follows.
• We analyze the characteristics of the TrashNet data set and the reason why deeper neural networks are not suitable for it, and we show that an excessively deep neural network reduces the accuracy. We shortened the Xception network and achieved higher accuracy on the TrashNet data set.
• A new method that expands the branches of a specific network layer is proposed. This method widens the network structure and extracts feature information more effectively, which is beneficial for garbage image classification.
• The proposed network broadening method is used to improve the shortened Xception network, and a network structure with better performance is obtained. We call it M-b Xception. Moreover, compared with some state-of-the-art methods, the M-b Xception provides higher classification accuracy and a more balanced single-category classification ability, which allows it to be used in practical applications.
The rest of this article is organized as follows. Section 2 discusses related work on garbage image classification. In Section 3, the TrashNet data set is tested and a method based on multi-branch feature information fusion is proposed. We use this method to build the M-b Xception, which performs better on the TrashNet data set. In Section 4, we first evaluate the M-b Xception and two contrast networks in several different aspects and demonstrate the advantage of the M-b Xception. After that, we optimize the number of channels of the M-b Xception to further improve the accuracy, and then compare it with some recent works. Finally, conclusions are provided in Section 5.

II. RELATED WORK
In the past few years, researchers have done a lot of work on garbage image classification, which can be divided into two categories: methods based on traditional machine learning and methods based on end-to-end learning systems.

A. TRADITIONAL MACHINE LEARNING METHODS
1) SUPPORT VECTOR MACHINE (SVM)
SVM is a powerful classification algorithm developed from Vapnik's statistical learning theory [15], which handles machine learning tasks on the basis of optimization theory [16]. Its main operation is to find an optimal hyperplane that separates two categories. For multi-class problems, multiple SVM classifiers can be combined, or a one-time solution method can be used, in which the parameters of all categories are optimized with a single formula. SVM provides different processing methods for linearly separable and nonlinearly separable problems [17]. In 2016, Mindy Yang et al. applied the SVM algorithm to the TrashNet data set, with a final accuracy of 63%.

2) K-NEAREST NEIGHBOR (KNN)
The KNN classification algorithm is one of the simplest methods in data classification [18]. It has no training stage. In the KNN algorithm, the category of a sample to be classified is determined by the categories of its nearest one or several samples [19]. KNN is widely used in image classification and prediction because of its efficiency and simple implementation [20]. In 2018, Bernardo S. Costa et al. used the KNN algorithm to classify the six kinds of garbage images in the TrashNet data set, with a final accuracy of 88% [21].

3) RANDOM FOREST (RF)
The RF algorithm generates a number of different data sets by sampling and then trains a classification tree on each data set. Each tree participates in the final decision on the prediction result [22]. The advantages of the RF algorithm are its robustness when dealing with missing data and its reliability when dealing with tasks with many variables [23]. It also trains quickly [24]. In 2018, Mandar Satvilkar used the RF algorithm to classify garbage images on the TrashNet data set, achieving 62.61% accuracy [25].

4) EXTREME GRADIENT BOOSTING (XGBoost)
The XGBoost algorithm is an improvement on the GBDT (Gradient Boosting Decision Tree) algorithm [26], which is a supervised learning algorithm [27]. The idea is to build a certain number of classification and regression trees so that the value predicted by the tree ensemble is as close to the true value as possible and has the greatest generalization ability [28]. The advantages of the XGBoost algorithm are that it is resistant to overfitting and that it can specify a default branch direction for missing or specified values. In 2018, Mandar Satvilkar used the XGBoost algorithm to classify garbage images on the TrashNet data set, achieving 70.1% accuracy.
VOLUME 8, 2020

B. END-TO-END LEARNING SYSTEM
Generally speaking, traditional machine learning methods have achieved good results in image processing thanks to their long development time and effective theoretical system [29]. Nevertheless, these methods usually consist of several independent steps, so they need a lot of storage space for the intermediate results [30]. The implementation process is cumbersome and not intelligent enough. The emergence of end-to-end learning systems solves this problem. As long as the training data and test data are given in advance, an end-to-end learning system automatically calculates the error between its predictions and the real data. It then computes the gradients with back propagation, updates the weights, and searches for the minimum of the loss function with gradient descent [31]. The whole convergence process is accomplished independently and coherently. In recent years, there have been many garbage image classification works based on end-to-end learning, and they all use the TrashNet data set.
In early 2018, Kennedy Tom proposed the OscarNet network (fine-tuned from VGG19), which achieved 88.42% classification accuracy [32]. In October 2018, Bernardo S. Costa et al. proposed a fine-tuned AlexNet network with 91% accuracy and a fine-tuned VGG16 network with 93% accuracy [21]. Stephen L. Rabano et al. proposed a fine-tuned MobileNet network, which achieved 87.2% accuracy [33]. In December 2018, Rahmi Arda Aral et al. tested multiple classic networks on the TrashNet data set. They achieved 89% accuracy using Inception-ResNet V2 and 89% accuracy using DenseNet121. They also fine-tuned models pre-trained on the ImageNet data set, where the fine-tuned DenseNet121 achieved 95% accuracy and the fine-tuned Inception-ResNet V2 achieved 94% accuracy [34]. In June 2019, Victoria Ruiz et al. achieved an accuracy of 87.71% with the Inception network, 88.34% with the Inception-ResNet network and 88.66% with the ResNet network [35]. These methods are based on classic networks with outstanding performance in large-scale target detection, image semantic segmentation and multi-category image classification competitions. Moreover, these classic networks were improved with common deep learning methods, without considering the particularity of the TrashNet data set. Our work fills this gap.

A. DATA SET AND ITS SPECIFICITY
1) TrashNet DATA SET
The data set used in this article is TrashNet, which was produced by Mindy Yang and Gary Thung in 2016. It contains 2527 RGB images: 501 glass images, 594 paper images, 403 cardboard images, 482 plastic images, 410 metal images and 137 trash images. The background of all images is white, and they were all taken under sufficient illumination. All images are 512 × 384 pixels in size. Fig. 1 shows some garbage images from the TrashNet data set. Unlike other classification data sets, each image in the TrashNet data set contains only a single object, which may make the task easier for human eyes, but not for computers. A convolutional neural network has a feature extraction ability far beyond that of human eyes. For computers, it is not difficult to find the details at all positions in an image by using a trained model [36]. However, for the TrashNet images, which contain only a single object, few features can be extracted, so the fault tolerance is poor. No other objects in the image can provide extra feature information. When a sample object differs somewhat from the rest of its class, the difference in the extracted features becomes large. This is one of the difficulties of the garbage classification task. Another difficulty is the limited amount of data in the TrashNet data set. Deep learning relies on large-scale data to train massive numbers of parameters through cumbersome gradient back-propagation optimization [37]. When the data is insufficient, training becomes very difficult and may overfit significantly [38]. Given these two challenges, we next test the TrashNet data set quantitatively.
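The class counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that only uses the numbers stated in the text; the helper names `class_shares` and `imbalance_ratio` are illustrative and do not come from the authors' code.

```python
# Class counts as reported for TrashNet in the text above.
CLASS_COUNTS = {
    "glass": 501,
    "paper": 594,
    "cardboard": 403,
    "plastic": 482,
    "metal": 410,
    "trash": 137,
}


def class_shares(counts):
    """Return each class's share of the whole data set."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def imbalance_ratio(counts):
    """Return the ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())


total_images = sum(CLASS_COUNTS.values())   # matches the 2527 images stated above
shares = class_shares(CLASS_COUNTS)
ratio = imbalance_ratio(CLASS_COUNTS)       # paper (594) versus trash (137)

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {share:6.1%}")
print(f"largest/smallest class ratio: {ratio:.2f}")
```

The trash class is more than four times smaller than the paper class, which is one concrete view of the limited and uneven data discussed above.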

2) DEEP CNN IS NOT SUITABLE FOR THE TrashNet DATA SET
With the improvement of computing power and the solutions to the vanishing gradient problem, people tend to use deeper neural networks for image classification in complex scenes. The advantage is obvious. A deeper network means better nonlinear representation ability, the ability to learn more complex transformations, and the ability to fit more complex feature inputs [39]. However, experiments show that a deeper network is not beneficial for image classification on the TrashNet data set.
Firstly, the Xception network is used to test the data set. The Xception network structure is shown in Fig. 2. The part of the network that receives the 14 × 14 × 728 feature maps is called the core structure. Owing to limited space, it is not expanded in the figure. The core structure is a 9-layer unit linearly stacked 8 times. Each 9-layer unit consists of 3 ReLU layers, 3 separable Conv2D layers and 3 batch normalization (BN) layers. Here, the depthwise separable convolution (DSC) is implemented with the separable Conv2D layer of the Keras framework rather than the depthwise Conv2D layer. The separable Conv2D layer must be used together with an activation layer [40].
After that, some network layers of Xception are removed to explore the relationship between network depth and network performance on the TrashNet data set. Here, the non-core structure of Xception is preserved. The reason is that the non-core structure contains few convolution layers. If the number of convolution layers were reduced further, the performance of the network would be greatly affected [41], [42]. Therefore, we chose to shorten the network by removing layers from the core structure of Xception.
For clarity, the complete core structure of the Xception network is shown in Fig. 3. By removing part of the network layers, 7 new network structures are constructed and named L-w Xception (Lightweight Xception) 1-7. The core structures of L-w Xception 1-7 are shown in Fig. 3, and their non-core structure is exactly the same as that of the Xception network.
The 7 newly constructed networks L-w Xception 1-7 and the Xception are trained under the same conditions. Their accuracy and number of parameters are listed in Table 1. Table 1 shows that, as the network is shortened, the number of network parameters decreases continuously and the accuracy even increases slightly. This shows that a deeper network does not play a positive role in classification on the TrashNet data set. In this experiment, the L-w Xception 1, which has the shortest core structure, has the best performance. In the following experiments, the L-w Xception 1 is adopted and simply referred to as "L-w Xception".

B. THE PROPOSED METHOD
1) THE PROPOSED NETWORK BROADENING METHOD
Depth and width are two important attributes of a convolutional neural network. Enough depth gives the network good nonlinear representation ability, and enough width lets the network learn abundant features [43]. In general, we prefer to deepen a network rather than widen it when improving the network structure. This is because, when the depth and width of the network are small, deepening the network usually yields higher performance gains [44]. However, when the network is deep enough, further deepening does not improve the network performance but makes the network more difficult to train and may even make its performance decline [45]. Some related works have also shown that shallow and wide networks may work better than deep and narrow ones [46]. Sergey Zagoruyko et al. showed that a wide ResNet can achieve at least as much accuracy as a deep ResNet [47]. In 2016, Junting Pan et al. proposed a shallow network and a deep network, which proved to have similar prediction errors in saliency prediction [48]. Work by Zifeng Wu et al. has shown that well-designed shallow networks can perform better than many deep networks [49]. In the previous experiments, the shallow L-w Xception network proved more suitable for the TrashNet data set than the deep Xception network. Therefore, we propose a network broadening method to further improve the network performance.
The proposed network broadening method builds branches for the target network layer and fuses the feature information among the branches with the add layer. It is important to note that the use of the add layer here differs from the traditional residual connection. The traditional residual connection usually adds a short-circuit mechanism to a linear network, as shown in Fig. 4. Its idea is to map the information in a lower layer of the network to a higher layer by identity. In calculation, it can be understood as putting the input x into the output as the initial result, so that the output is H(x) = f(x) + x [50]. Even when f(x) = 0 in a deep network, H(x) can still equal x. The advantage is that the gradient can be transmitted without loss. Therefore, the negative effect of back propagation, i.e., the problem that the layers near the input learn more slowly than the layers near the output, is solved. The add layer used in this article instead fuses the information of corresponding channels horizontally, that is, it sums the outputs of multiple branches channel by channel, as shown in Fig. 5. There are three ways to increase the width of a network. The common methods are to expand the number of channels or to use a concatenate layer for channel merging. The add layer can also widen the network. The difference is as follows: enlarging the number of channels directly increases the number of features extracted by each convolution layer; the concatenate layer stacks the channels horizontally, which increases the number of features while the information in each channel does not increase; the add layer increases the amount of information in each channel while the number of feature maps does not increase. Now, the three network widening methods are compared from the perspective of the number of network parameters.
Let K represent the size of the convolution kernel, N the number of channels of the layer, and M the number of feature maps of the previous layer. The number of parameters of a Conv2D layer can be calculated as

$$\mathrm{Params} = K \times K \times M \times N. \tag{1}$$

Taking the structure of Fig. 5 as an example, the size of the input image is 256 × 256, and the convolution type is Conv2D. Neither the add layer nor the concatenate layer adds parameters of its own, but the concatenate layer increases the number of channels and therefore the number of parameters of the following network layer. The total number of parameters of the convolution layers is 1962 when the number of convolution layer channels is doubled directly, 1386 with the concatenate layer and 1206 with the add layer.
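The parameter comparison can be sketched in code. The example below assumes the bias-free count K × K × M × N for a Conv2D layer and a hypothetical two-stage layout (a widened stage followed by one further convolution); the channel numbers are illustrative and are not the Fig. 5 configuration, so the totals are not expected to match 1962, 1386 and 1206.

```python
def conv2d_params(kernel, in_maps, out_channels, bias=False):
    """Parameter count of a Conv2D layer: K * K * M * N, plus N if bias is used."""
    weights = kernel * kernel * in_maps * out_channels
    return weights + (out_channels if bias else 0)


def widen_by_doubling(kernel, c_in, c, c_next):
    """One convolution with 2c channels feeds the following layer."""
    return conv2d_params(kernel, c_in, 2 * c) + conv2d_params(kernel, 2 * c, c_next)


def widen_by_concat(kernel, c_in, c, c_next):
    """Two parallel convolutions of c channels, concatenated to 2c channels."""
    return 2 * conv2d_params(kernel, c_in, c) + conv2d_params(kernel, 2 * c, c_next)


def widen_by_add(kernel, c_in, c, c_next):
    """Two parallel convolutions of c channels, summed back to c channels."""
    return 2 * conv2d_params(kernel, c_in, c) + conv2d_params(kernel, c, c_next)


# Hypothetical sizes: 3x3 kernels, RGB input, 8 channels per branch, 16 in the next layer.
sizes = dict(kernel=3, c_in=3, c=8, c_next=16)
print("doubling:", widen_by_doubling(**sizes))
print("concat:  ", widen_by_concat(**sizes))
print("add:     ", widen_by_add(**sizes))
```

In this toy layout, doubling and concatenation cost the same because both hand 2c channels to the next layer, whereas the add layer hands over only c channels. This is the effect described above: the fusion layers are parameter-free themselves, and the saving comes from the layer that follows.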
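Next, the three network widening methods are compared in terms of time complexity. The time complexity of a convolutional neural network is evaluated as

$$\mathrm{Time} \sim O\left(\sum_{l} S_l^{2} \cdot K_l^{2} \cdot M_l \cdot N_l\right), \tag{2}$$

where the sum runs over the convolution layers l. The feature map size S is determined jointly by the size P of the input matrix, the size K of the convolution kernel, the Padding, and the Stride. The relationship can be represented as

$$S = \frac{P - K + 2 \times \mathrm{Padding}}{\mathrm{Stride}} + 1. \tag{3}$$

Because the add layer does not increase the number of channels passed to the following layer, using the add layer to increase the network width has a clear advantage in time complexity. Lower time complexity means less training time, faster prediction, and lower demand for computing power.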
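As a worked illustration of the quantities involved, the sketch below uses the standard output-size relation S = (P − K + 2 × Padding) / Stride + 1, rounded down, and the usual per-layer multiply-accumulate count S² × K² × M × N. Both formulas are assumptions stated here for the example, and the function names are illustrative.

```python
def feature_map_size(p, k, padding=0, stride=1):
    """Output size of a convolution: floor((P - K + 2 * padding) / stride) + 1."""
    return (p - k + 2 * padding) // stride + 1


def conv_time_cost(s, k, in_maps, out_channels):
    """Multiply-accumulate count of one convolution layer: S^2 * K^2 * M * N."""
    return s * s * k * k * in_maps * out_channels


# A 256 x 256 input (the size used in the text) with a 3 x 3 kernel.
same_size = feature_map_size(256, 3, padding=1, stride=1)   # padding keeps the size
valid_size = feature_map_size(256, 3, padding=0, stride=1)  # no padding shrinks it
halved = feature_map_size(256, 3, padding=1, stride=2)      # stride 2 halves it

# Cost of the layer that follows a fusion step, with hypothetical channel counts:
# after an add layer it sees c = 8 input maps, after a concatenate layer 2c = 16.
cost_after_add = conv_time_cost(same_size, 3, 8, 16)
cost_after_concat = conv_time_cost(same_size, 3, 16, 16)
print(same_size, valid_size, halved, cost_after_concat / cost_after_add)
```

With these illustrative numbers, the layer after a concatenate layer costs twice as much as the same layer after an add layer, since its cost grows linearly with the number of input maps M.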
Next, some experiments are performed to compare the three network broadening methods quantitatively. In the experiments, each of the three methods is used to double the width of the core structure of the L-w Xception network, as shown in Fig. 6. Fig. 6 does not show the full network structure. The omitted parts are the same as those of the Xception.
In Fig. 6, method 1 doubles the number of convolution layer channels in the core structure. Since the add layer can only connect network layers with the same number of channels, we add a 1 × 1 convolution layer, followed by a BN layer. Method 2 expands a branch of the core structure and connects the outputs of the two branches with a concatenate layer. Since the number of channels after the connection becomes 1456, we add a 1 × 1 convolution layer and a BN layer. Method 3 uses the proposed network broadening method. A branch of the core structure is expanded, and the outputs of the two branches are connected with the add layer. Since the number of channels before and after the connection remains unchanged, there is no need to add a 1 × 1 convolution layer in the residual connection. The three methods are trained under the same settings, and the experimental results are shown in Fig. 7. Moreover, in order to analyze the influence of the residual connection on network performance, we also remove the residual connection from the three methods, i.e., we remove the yellow line and the network layers with yellow borders in Fig. 6. These experimental results are also shown in Fig. 7. It can be seen that, under the same conditions, the proposed network broadening method achieves higher accuracy than method 1 and method 2 with fewer parameters, whether or not there is a residual connection. This shows that the network broadening method with the add layer is more suitable. Moreover, for the proposed method, removing the residual connection further improves the network performance. Therefore, we choose the network broadening method without residual connection as the final method.
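The different channel behaviour of the add layer and the concatenate layer can be sketched with plain Python lists. This is a toy illustration only: each branch output is modelled as a list of channels, each channel as a short list of values, and the function names are hypothetical.

```python
def add_fusion(branches):
    """Element-wise sum of branch outputs; the channel count stays unchanged."""
    fused = [list(channel) for channel in branches[0]]
    for branch in branches[1:]:
        if len(branch) != len(fused):
            raise ValueError("add fusion needs branches with equal channel counts")
        for c, channel in enumerate(branch):
            for i, value in enumerate(channel):
                fused[c][i] += value
    return fused


def concat_fusion(branches):
    """Stack the channels of all branches; the channel count is summed."""
    fused = []
    for branch in branches:
        fused.extend(list(channel) for channel in branch)
    return fused


branch_a = [[1.0, 2.0], [3.0, 4.0]]   # 2 channels with 2 values each
branch_b = [[0.5, 0.5], [1.0, 1.0]]

added = add_fusion([branch_a, branch_b])        # still 2 channels, richer values
stacked = concat_fusion([branch_a, branch_b])   # 4 channels, values untouched
print(len(added), len(stacked))
```

Scaled up to the core structure, two 728-channel branches give 728 channels after the add layer and 1456 channels after the concatenate layer, which is why method 2 needs the extra 1 × 1 convolution and method 3 does not.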

2) THE PROPOSED LEARNING NETWORK
According to the characteristics of the TrashNet data set, the Xception network is improved with the proposed channel expansion method. The network structure is shown in Fig. 8. After a large number of tests on the Xception network, we found that the network layers reached after four downsampling steps have the best effect on feature extraction for the TrashNet data set. Therefore, we broaden this part of the network structure. Specifically, the original linear structure repeated 8 times is changed into a parallel structure with 8 branches. The number of branches is set to 8 because this makes full use of the feature extraction ability of this part of the structure. In addition, it ensures that the proposed network has the same number of network layers as the Xception network, which is convenient for quantitative analysis and comparison.
We call this new network M-b Xception (Multi-branch Xception). The M-b Xception achieves an accuracy of 0.9325 on the TrashNet data set, which is 1.75% higher than that of the Xception. We also tried adding a residual connection to the core structure of the M-b Xception, but the results were not satisfactory. The accuracy obtained with the residual connection was 0.88% lower than without it, so we ultimately chose not to add the residual connection to the core structure of the M-b Xception.
In the forward propagation of the proposed learning network, the input of the first convolution layer of the l-th branch is

$$Z^{(l)}_{n} = \sum_{m} W^{(l)}_{mn}\, a_{m} + b^{(l)}_{n}, \tag{4}$$

where $W^{(l)}_{mn}$ is the weight that connects the m-th neuron of the input to the n-th neuron of the target convolution layer, $a_{m}$ is the output of the m-th neuron in the previous layer, and $b^{(l)}_{n}$ is the corresponding bias. Although the eight branches share the same information source, the information assigned to their first convolution layers differs because of the different weights and biases, which are controlled by back propagation. The first convolution layer of each of the 8 branches receives the information of its branch and then carries out the cross-correlation operation. The output is given by

$$a^{(l)} = g\left(Z^{(l)}\right). \tag{5}$$

Here, $a^{(l)}$ and $Z^{(l)}$ are the output and input of the first separable Conv2D layer of the l-th branch, respectively, and g(·) represents the cross-correlation operation. The outputs of the 8 convolution layers are passed to the next network layer of their own branch until the last layer is completed. The branches are independent and do not interfere with each other. At last, the outputs of the branches are fused by 7 add layers.
There are four short-circuit connections in the non-core structure of the network. Consider the network structure between two short-circuit connections as a short-circuit block. According to the chain rule, the back propagation gradient from a deep short-circuit block D to a shallow short-circuit block S can be defined as

$$\frac{\partial loss}{\partial x_{S}} = \frac{\partial loss}{\partial x_{D}} \cdot \frac{\partial x_{D}}{\partial x_{S}} = \frac{\partial loss}{\partial x_{D}} \left(1 + \frac{\partial}{\partial x_{S}} \sum_{i=S}^{D-1} F(x_{i}, W_{i})\right). \tag{6}$$

In formula (6), loss is the error distance between the actual output value and the label value, and $F(x_{i}, W_{i})$ denotes the mapping learned by the weight layers of the i-th short-circuit block. The gradient $\partial loss / \partial x_{S}$ consists of two parts. One part, $\frac{\partial loss}{\partial x_{D}} \cdot \frac{\partial}{\partial x_{S}} \sum_{i=S}^{D-1} F(x_{i}, W_{i})$, is propagated through the weight layers, and the other part, $\partial loss / \partial x_{D}$, is propagated directly [51]. Gradients in the deep layers of the network can therefore spread to the shallow layers, which ensures that the non-core structure of the M-b Xception does not cause the gradient to vanish. In the core structure of the M-b Xception, the back propagation process is shown in Fig. 9. The output of the 8 branches after passing the 7 add layers is

$$z = x_{1} + x_{2} + \cdots + x_{8} = \sum_{i=1}^{8} x_{i}. \tag{7}$$

It can be seen that $\partial z / \partial x_{i} = 1$ holds for every integer i from 1 to 8. In Fig. 9, a represents the derivative that is back-propagated to the core structure of the network. Suppose $a = \partial L / \partial z$; according to the chain rule, $b = \partial L / \partial x_{i} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x_{i}} = \frac{\partial L}{\partial z} \cdot 1 = a$. Therefore, in the core structure of the M-b Xception, gradients are transmitted to each branch without loss. In addition, the network structure of each branch is relatively short, so the core structure of the M-b Xception network does not cause the gradients to vanish.
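The claim that the add fusion passes the gradient to every branch unchanged can be checked numerically. The following is a minimal sketch in which each branch output is reduced to a single scalar; the values, the helper names and the finite-difference check are illustrative assumptions, not part of the authors' implementation.

```python
def fuse_with_adds(branch_outputs):
    """Fuse branch outputs one add at a time; n branches need n - 1 adds."""
    fused = branch_outputs[0]
    adds = 0
    for output in branch_outputs[1:]:
        fused = fused + output
        adds += 1
    return fused, adds


def numeric_grad(branch_outputs, i, eps=1e-6):
    """Central finite-difference estimate of dz/dx_i for the fused output z."""
    up = list(branch_outputs)
    down = list(branch_outputs)
    up[i] += eps
    down[i] -= eps
    return (fuse_with_adds(up)[0] - fuse_with_adds(down)[0]) / (2 * eps)


# Eight toy scalar branch outputs x_1 ... x_8.
branch_outputs = [0.3, -1.2, 0.8, 2.0, 0.1, -0.4, 1.5, 0.7]

z, n_adds = fuse_with_adds(branch_outputs)
grads = [numeric_grad(branch_outputs, i) for i in range(len(branch_outputs))]
print(n_adds, [round(g, 6) for g in grads])
```

Every estimated derivative is 1 up to floating-point error, so an upstream gradient ∂L/∂z reaches each of the 8 branches with the same magnitude, and 8 branches are fused with exactly 7 add operations.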

A. THE PARAMETER SETTINGS
In this section, the data enhancement settings and optimization settings are listed. To ensure that the experimental results reflect only the performance of the network structures, and thus verify the effectiveness of the proposed method, data enhancement is used in the training of all networks with exactly the same settings.

1) DATA ENHANCEMENT
Due to the lack of images in the TrashNet data set, data enhancement is used to suppress overfitting during training [52]. It expands the data samples in the training data set [53] through operations such as cutting, scaling and rotation. The data enhancement settings are listed in Table 2.

2) THE OPTIMIZATION AND REGULARIZATION SETTINGS
Deep learning methods usually require many training iterations before a model achieves excellent performance. If the parameters are not initialized correctly, training may take a long time or fall into a local minimum [54]. After a large number of experimental tests, the optimization parameter settings chosen for our model are listed in Table 3.

B. THE EXPERIMENTAL COMPARISON OF THE THREE NETWORKS
In this section, we further examine the performance of the three representative networks from the methodology. These networks are the Xception, the L-w Xception obtained by shortening the Xception core structure, and the M-b Xception obtained by network broadening.

1) COMPARISON OF ACCURACY, NUMBER OF PARAMETERS, AND TIME COMPLEXITY
Table 4 shows the accuracy, the number of parameters and the single-image training time of the three networks. The experiments were run on Windows 10 with an Intel Core i5-8300K, 8 GB of RAM and a GeForce GTX 1050 Ti. Since the L-w Xception is obtained by shortening the Xception, it has far fewer parameters than the Xception. The M-b Xception is obtained by extending the branches of the L-w Xception. Since the number of network layers and the number of channels in each layer of the M-b Xception are exactly equal to those of the Xception, the two networks also have the same number of parameters. It is satisfactory that the S-I training time (single-image training time) of the M-b Xception is only a little higher than that of the L-w Xception, which indicates that the channel expansion method proposed in this article widens the network at the cost of a relatively small increase in time complexity.
In general, the M-b Xception network trades an acceptable increase in time complexity for an accuracy 1.31% higher than that of the L-w Xception. Compared with the Xception, the accuracy of the M-b Xception network is 1.75% higher without increasing the number of network parameters or the time complexity.
In addition, the three networks are also trained without data enhancement, and other settings remain unchanged. The results show that the accuracy of Xception is 0.8431, that of L-w Xception is 0.8497, and that of M-b Xception is 0.8540. This shows that the performance improvement comes from the improvement of network structure, and further proves the effectiveness of the proposed method.

2) COMPARISON OF Grad-CAM VISUALIZATION
Grad-CAM (Gradient-weighted Class Activation Mapping) is a visualization technique proposed by Selvaraju et al., which produces a localization map that highlights the important regions in an image [55]. In this article, Grad-CAM is adopted as the visualization method. The visualization results of some test images with the Xception, the L-w Xception, and the M-b Xception are shown in Fig. 10. Fig. 10 shows that the regions extracted by the M-b Xception are more complete and accurate, and that they are the most consistent with the important regions of the garbage images. This indicates that the M-b Xception has the strongest ability to extract garbage image features, followed by the L-w Xception, and finally the Xception. This result is consistent with the accuracy comparison of the three networks.

3) COMPARISON OF TRAINING PROCESS
The visualization of the training process reflects the learning effect of a network through its convergence speed and fitting effect, but it cannot accurately reflect the performance gap between networks. The fitting effect of the M-b Xception is better than that of the Xception, which shows that the M-b Xception has a good fitting ability for the TrashNet data set. Therefore, its final verification loss is lower than that of the Xception. In this experiment, the visualization results of the training processes of the L-w Xception and the M-b Xception are very similar, so they are not shown separately. In the following experiments, we quantitatively analyze the performance of these networks through a large number of experimental results. For the L-w Xception network and the M-b Xception network, the comparison results reflect the key differences between them.

4) COMPARISONS OF ROBUSTNESS
Some studies have shown that neural networks are vulnerable to external interference [56]-[59]. Network models are prone to making false predictions when the ambient light around the object changes, or when the object is obscured [60], [61]. For garbage classification, the most common situation in real life is that garbage samples are incomplete or obscured. Therefore, some experiments are carried out to test the robustness of the Xception, the L-w Xception, and the M-b Xception.
A gray square with RGB = (192, 192, 192) is used to occlude the images of the original test set. Nine new test sets are established according to the different occlusion locations, as shown in Fig. 12. The occlusion location of new test set 1 is the location numbered 1, the occlusion location of test set 2 is the location numbered 2, and so on.
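The construction of the occluded test sets can be sketched as follows. This is an illustrative sketch only: the 3x3 row-major numbering and a square that fills one grid cell are assumptions (the actual locations and size are defined by Fig. 12), and the function names `occlusion_box` and `occlude` are hypothetical. Only the gray value (192, 192, 192) is taken from the text.

```python
GRAY = (192, 192, 192)  # occluder colour stated in the text


def occlusion_box(height, width, position, grid=3):
    """Return (top, left, bottom, right) of grid cell `position`.

    Positions are 1-based and row-major; this layout is an assumption.
    """
    if not 1 <= position <= grid * grid:
        raise ValueError("position must be between 1 and grid * grid")
    row, col = divmod(position - 1, grid)
    top, bottom = row * height // grid, (row + 1) * height // grid
    left, right = col * width // grid, (col + 1) * width // grid
    return top, left, bottom, right


def occlude(image, position, grid=3, color=GRAY):
    """Return a copy of `image` (rows of RGB tuples) with one cell painted gray."""
    height, width = len(image), len(image[0])
    top, left, bottom, right = occlusion_box(height, width, position, grid)
    return [
        [color if top <= y < bottom and left <= x < right else pixel
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Build the nine occluded variants of a tiny all-black 6x6 image.
image = [[(0, 0, 0)] * 6 for _ in range(6)]
variants = {pos: occlude(image, pos) for pos in range(1, 10)}
```

Applying `occlude` with a fixed position to every image of the original test set would give one of the nine new test sets.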
Next, the nine new test sets are used to test the three trained network models. Fig. 13 shows the overall accuracy of the three models on the nine test sets. It can be seen that the performance of the L-w Xception is the worst, which shows its poor robustness. Although the accuracy of the L-w Xception on the original test set is higher than that of the Xception, this does not mean that the L-w Xception also performs well under other conditions, such as occlusion. In practical applications, the performance of garbage image classification may depend more on the robustness of the model. In this experiment, the M-b Xception model shows the best performance, as it does on the original image set. This shows that the M-b Xception model has good robustness and further supports its effectiveness.
Fig. 14 shows the single-category accuracy of the three models when garbage images are occluded at different positions. The results show that in most cases, the M-b Xception model performs best, followed by the Xception, and finally the L-w Xception. The accuracy of the M-b Xception model in the glass, paper, and plastic categories is significantly higher than that of the other two models. It also has slight advantages in the cardboard, metal, and trash categories. It should be noted that the samples in these categories, especially trash, are messy and vary considerably within a class, which makes it difficult to achieve significant improvements in accuracy.
The above experiments show that the proposed M-b Xception model has good robustness. Therefore, the M-b Xception model can provide reliable classification results even if the test sample is incomplete or occluded.

C. OPTIMIZATION OF CONVOLUTION CHANNEL NUMBER OF THE M-B Xception CORE STRUCTURE
The width multiplier α in MobileNet and the scale factor s in ShuffleNet play the same role, i.e., they quickly adjust the number of channels in the convolution layers of the network [62], [63]. This design is intended to make it easier to optimize the network structure for a given data set. It also reflects that finding a number of convolution channels suited to the data set is an important problem in network optimization.
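The effect of such a multiplier can be sketched in a few lines. The example below is a minimal sketch, not code from MobileNet or ShuffleNet: the helper name `scale_channels`, the base widths, and the rounding to a multiple of 8 are all assumptions made for illustration (rounding to a hardware-friendly multiple is a common implementation convention, not something stated in this article).

```python
def scale_channels(base_channels, multiplier, divisor=8):
    """Scale a layer's channel count and round to the nearest multiple of `divisor`."""
    scaled = base_channels * multiplier
    rounded = int(scaled + divisor / 2) // divisor * divisor
    # Never return fewer than `divisor` channels.
    return max(divisor, rounded)


# Hypothetical base widths of four convolution layers.
base_widths = [32, 64, 128, 256]
for alpha in (0.5, 1.0, 1.25):
    print(alpha, [scale_channels(c, alpha) for c in base_widths])
```

A single scalar thus widens or narrows every layer at once, which is why it is a convenient knob when searching for a channel count that suits a particular data set.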
In this article, we change the number of convolution channels in the core structure of the M-b Xception network to achieve better performance on the TrashNet data set. Fig. 15 shows some typical experimental results. It can be seen in Fig. 15 that when the number of convolution channels in the core structure is increased to 896, the network performance is the best, and the accuracy is increased by 1.09%.
However, when the number of convolution channels is further increased to 1024, the accuracy is reduced. The reason is that for a narrow network, increasing the number of convolution channels can improve the network performance [64]. However, once the number of convolution channels is sufficient for the current task, a further increase brings negative effects. For example, the increase in parameters may make the network more difficult to converge, and thus affect the network performance [65]. Fig. 16 compares the confusion matrices with and without the channel number optimization. The vertical axis shows the real category, and the horizontal axis shows the predicted category. Each main diagonal element is the percentage of images identified correctly. It shows that when the channel number of the core structure is set to 896, the accuracy of the three hard-to-predict categories, i.e., glass, metal, and trash, increases by 4%, 2%, and 8%, respectively. This greatly improves the classification balance of the M-b Xception, and reduces the occurrence of unreliable detection for a certain category in practical applications.

D. COMPARISON BETWEEN THE M-B Xception AND SOME STATE-OF-THE-ART WORKS
In this section, we compare the M-b Xception with some state-of-the-art works. All of these works were evaluated on the TrashNet data set.

1) COMPARISON OF OVERALL ACCURACY
The comparison results of the M-b Xception and other related new works are shown in Fig. 17. The methods used for comparison include four machine learning methods, i.e., SVM, XGB, RF, and KNN, and 10 deep learning methods. Among them, the fine-tuned DenseNet121 proposed by R. A. Aral et al. has a slightly higher accuracy than the method in this article, but that method belongs to transfer learning: it uses weights pretrained on the ImageNet data set. In contrast, the model in this article is trained on the TrashNet data set from randomly initialized weights, so the accuracy comparison is not entirely fair. R. A. Aral et al. also trained DenseNet121 on the TrashNet data set from randomly initialized weights, and obtained an accuracy of 0.89, which is 5.34 percentage points lower than our accuracy. Therefore, apart from this transfer-learning result, the M-b Xception network proposed here performs better on the TrashNet data set than the other existing networks. It should be noted that the accuracy of the M-b Xception in Fig. 17 is different from that in Table 4. The reason is that the number of convolution channels of the M-b Xception is optimized in Section IV.C.

2) COMPARISON OF SINGLE CATEGORY ACCURACY
Next, the confusion matrices of several other methods are converted into single-category accuracies. In practical applications, we cannot predict which type of garbage will need to be classified at any given moment. Here we assume that each kind of garbage to be classified appears with equal probability. Therefore, we add the accuracies of the six kinds of garbage and divide the sum by 6 to get the average value, so as to evaluate the accuracy in practical applications. The comparison results are listed in Table 5. To compare the overall performance of different methods, the single-category accuracy comparison results are shown in Fig. 18.
In Fig. 18, the other four methods are shown as bars and the M-b Xception as a broken line. It shows that the recognition accuracies of the M-b Xception across the six categories are generally higher than those of the other methods. The M-b Xception shows no significant difference in recognition accuracy between categories. Compared with other methods, the M-b Xception has an obvious advantage in the average accuracy over the six classes.
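The conversion from a confusion matrix to single-category accuracies and their unweighted average can be sketched as follows. This is a minimal sketch under the convention stated in this article (rows are real categories, columns are predicted categories); the function names and the three-class toy matrix are hypothetical and do not reproduce any result from Table 5.

```python
def per_class_accuracy(confusion):
    """Fraction of each real class (row) that was predicted correctly."""
    accuracies = []
    for k, row in enumerate(confusion):
        total = sum(row)
        accuracies.append(row[k] / total if total else 0.0)
    return accuracies


def mean_class_accuracy(confusion):
    """Unweighted mean of the per-class accuracies (equal class odds assumed)."""
    accuracies = per_class_accuracy(confusion)
    return sum(accuracies) / len(accuracies)


# Toy 3-class confusion matrix with made-up counts.
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
class_acc = per_class_accuracy(toy_confusion)
average_acc = mean_class_accuracy(toy_confusion)
```

For the six TrashNet categories the same function divides by 6, matching the averaging described above. Because every class carries the same weight, a method that does poorly on a small class such as trash is penalized more than it would be by overall accuracy.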

3) COMPARISON OF F1 SCORE
The F1 score is a comprehensive evaluation index that balances precision and recall, and can reflect the overall performance of a model. The single-category F1 scores of the above methods are calculated, and the results are shown in Fig. 19. We can see that the M-b Xception has obvious advantages in the cardboard, glass, metal, paper, and plastic categories, which further supports the effectiveness of this method.
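For reference, the single-category F1 score is the harmonic mean of precision and recall, which reduces to 2TP / (2TP + FP + FN) for each class. The sketch below computes it from a confusion matrix whose rows are real categories and whose columns are predicted categories; the function name `per_class_f1` and the toy counts are hypothetical and are not the values plotted in Fig. 19.

```python
def per_class_f1(confusion):
    """Return one F1 score per class from a square confusion matrix."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # real k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp   # predicted k, real other
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return scores


# Toy 2-class confusion matrix with made-up counts.
toy_confusion = [
    [8, 2],
    [1, 9],
]
f1_scores = per_class_f1(toy_confusion)
```

Unlike single-category accuracy, which only looks at one row of the matrix, F1 also counts images of other classes that were wrongly assigned to the class in question.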

V. CONCLUSION
In this article, a novel network improvement method based on channel expansion is proposed for garbage image classification. It can make full use of feature information at slight additional computational cost. When some common network improvement methods hardly work, the proposed method can greatly improve the network performance. Compared with the Xception network, the M-b Xception network has higher accuracy on the TrashNet data set. It also shows good robustness in occlusion experiments. Compared with some related new methods, the M-b Xception provides higher accuracy and F1 score. It also has a more balanced predictive ability, which allows it to be used in practical applications. In the near future, we will continue to study end-to-end learning systems, and quantitatively analyze the impact of small changes in network structure on classification performance. In addition, it is more pressing to minimize the network size while maintaining high accuracy. Last but not least, our work will be ported to mobile phones.
RUIYANG XIA is currently pursuing the bachelor's degree with Qiqihar University, Qiqihar, China. He has applied for two patents of invention. His research interests include digital image processing and machine learning. His research project won the provincial Students Awards.