A Layer-Wise Theoretical Framework for Deep Learning of Convolutional Neural Networks

As research attention in deep learning has been focusing on pushing empirical results to a higher peak, remarkable progress has been made in the performance race of machine learning applications in the past years. Yet deep learning based on artificial neural networks still remains difficult to understand as it is considered as a black-box approach. A lack of understanding of deep learning networks from the theoretical perspective would not only hinder the employment of them in applications where high-stakes decisions need to be made, but also limit their future development where artificial intelligence is expected to be robust, predictable and trustable. This paper aims to provide a theoretical methodology to investigate and train deep convolutional neural networks so as to ensure convergence. A mathematical model based on matrix representations for convolutional neural networks is first formulated and an analytic layer-wise learning framework for convolutional neural networks is then proposed and tested on several common benchmarking image datasets. The case studies show a reasonable trade-off between accuracy and analytic learning, and also highlight the potential of employing the proposed layer-wise learning method in finding the appropriate number of layers in actual implementations.


I. INTRODUCTION
Convolutional neural networks (CNNs) have been successfully utilized for various applications with image inputs, such as image classification, pattern recognition, object detection and image segmentation. Numerous CNN structures have been proposed over the years in attempts to obtain better empirical results for different applications. From a simple structure with just several convolutional layers in LeNet [1], deeper and larger convolutional neural networks have been constructed over time, such as AlexNet [2] and VGG [3]. The learning of these deep networks, deep learning (DL), is largely based on backpropagation (BP) of the gradient information of the loss function [4], where the training requires 2 passes: forward and backward. In the forward pass, the data is propagated from the input layer to the output layer to produce an output error, and then in the backward pass, the gradient of the loss function based on the error is propagated backward to the front layers to adjust the weights. Although CNN models trained by BP have achieved great successes, the majority of the achievements are from the empirical perspective. The theoretical understanding of these successes is still lacking, and the deep models trained by the BP algorithm are well known to be black boxes [5] and thus difficult to explain. In high-stakes decision making, humans cannot put their trust in something of which they have no knowledge. Improving the theoretical understanding of deep networks and their learning algorithms is crucial for their robustness and trustworthiness. While gaining understanding of deep learning networks, some trade-offs, such as a decrease in model accuracy, could be acceptable [6].

The associate editor coordinating the review of this manuscript and approving it for publication was Nikhil Padhi.
Developing a theoretical framework for a better understanding of deep learning would pave the way for artificial intelligence (AI) to be deployed in an extensive scale and in higher-stakes applications in the future.
Besides the black-box property, some inherent issues of backpropagation have been known over the years, such as backward locking (the weights in a specific layer have to wait for the signal to propagate through other layers before they can be updated) [7], the memory reuse problem [8], and biological implausibility [9]. Recently there has been increasing interest in an alternative to the BP algorithm, layer-wise learning of deep neural networks, in which the large black boxes are dissected into smaller pieces. The idea was originally used as a pre-training method called greedy layer-wise pre-training [10], where the whole network still needs fine-tuning using backpropagation of global errors after being pre-trained. The method as a complete learning algorithm without fine-tuning of whole networks has attracted more interest in the machine learning (ML) community recently under different concepts such as forward thinking [11], or learning using local errors [9], [12]. The approach is believed to produce fairly comparable accuracy to global backpropagation in a number of applications. However, despite the reasonably good performance, the theoretical analysis was not sufficiently considered, and there was no convergence analysis for these algorithms [9]-[12].
The problem of convergence analysis in deep networks is well known to be very challenging. The optimization problems in deep networks are high-dimensional, highly non-convex and thus very complicated to analyze. This is one of the reasons why the convergence issue has not received sufficient consideration from ML scientists. A learning algorithm without ensured convergence could pose a risk when employed in systems that require stability and robustness, such as in control and robotics [13]-[15]. Neglecting the convergence issue can also hinder the future development and deployment of deep learning. Thus, more efforts have been made recently in analyzing the convergence of the gradient descent method in network learning.
For ease of analysis, most results have considered only shallow networks with one or two hidden layers, and different techniques have been used to prove the convergence in learning of the shallow networks [16]-[20]. In [16], the convergence of a two-layer network was analyzed by adding an identity mapping to the standard structure. For networks with standard structures, most studies have considered over-parameterized networks [17], [18] where the widths of the hidden layers are assumed to be very large so that the problem can be analyzed mathematically. Moderating the over-parameterization property was the aim of some recent studies [19], [20], but the results are still limited to shallow networks.
Recently, a few works have been devoted to analyzing the convergence issue in deep networks [21]-[23]. In [21], an analytic layer-wise learning framework with guarantee of convergence was developed for multilayer fully connected networks. The learning algorithm can be applied to both classification and regression problems in offline and online robotic applications such as real-time robot control. However, the framework is limited to fully connected networks (FCNs) and therefore cannot be used for full CNNs, whose structure is different from that of FCNs. In [22], [23], over-parameterized networks were analyzed for deep learning by assuming a huge width in each inner layer. Although these works have contributed towards the theoretical understanding of deep learning networks trained by gradient descent, there is still a huge gap between these theoretical analyses and practical experiments. In particular, the theoretical assumption of a huge width in deep networks makes these results impractical. As a result, there were also no experimental results to support the theories given in those studies [22], [23].
In this paper, we develop a layer-wise theoretical framework for learning deep convolutional neural networks. As compared to fully connected networks with only dense weight matrices, the weight sharing in convolutional filters and the presence of pooling layers in CNNs create a unique problem which cannot be directly solved by techniques used in FCNs. The convolutional neural networks are in fact a more general model for image classifications as they consist of convolutional layers, pooling layers and also fully connected layers. In this paper, we explicitly introduce the matrix representations for different types of layers in CNNs to derive a general model of deep CNNs. The weight sharing property of the convolutional layers is clearly formulated in the model. Based on the model, a layer-wise learning algorithm for CNNs is proposed and the convergence is analyzed. The proposed method does not require the assumption of over-parameterization. The method is then tested on several common image datasets and the results show a reasonable trade-off between test accuracy and analytic learning. Though there is a trade-off in test accuracies in some case studies, the results also show that some deep CNNs may not need as many convolutional layers as in their original structure to achieve reasonable accuracies. This demonstrates the possibility of using the layer-wise learning method as an indicator to determine the appropriate number of layers in final implementations of the models.
The rest of this paper is organized as follows. Section II presents the problem formation, where the matrix representations of the mathematical operations taking place in CNNs are introduced and the full equation of CNN models is derived. Section III presents the learning algorithms, where the theoretical analysis is provided to prove the convergence of the proposed algorithms. Section IV presents the case studies, where different CNN structures and datasets are employed to demonstrate the efficacy of the proposed method.

II. PROBLEM FORMATION
We consider the problem of supervised learning with a dataset containing input images and their labels. The aim is to approximate the true mapping between the input images and their labels for predicting the labels of unknown images. Denoting $X \in \mathbb{R}^{n \times n \times n_c}$ as an input image (where $n$ is the width and also the height, and $n_c$ is the number of channels (or depth) of the image), and $y \in \mathbb{R}^{p}$ as the output variable representing the image's label (where $p$ is the number of classes in the classification problem), the true mapping $f$ can be described as

$$y = f(X) \tag{1}$$

VOLUME 10, 2022

The mapping $f$ in (1) is to be approximated by a CNN. An example of CNNs is shown in Fig. 1. Normally, a CNN can be divided into 2 main parts: the convolutional (Conv) part, which includes convolutional layers and pooling layers, and the fully connected (FC) part, which includes fully connected (or dense) layers. Between the 2 parts, there is a flattening operation which reshapes the 3D volume of neurons into a 1D list of neurons. For simplicity, the pooling layers in our work are accompanied with the preceding Conv layers and referred to as pooling operations within the Conv layers. The Conv part can thus be referred to as Conv layers, and the FC part can also be referred to as FC layers.
In this section, we introduce the matrix representations of various operations in a CNN, including convolution and pooling in a Conv layer, and flattening at the end of the Conv part.
A. MATRIX REPRESENTATIONS

1) CONVOLUTION WITH 2D IMAGES
Before considering the more complex 3D inputs, we first study the case of 2D inputs where $n_c = 1$. Examples of 2D inputs are black and white images which can be found in the MNIST [24] or Fashion MNIST [25] datasets. The convolution operation with a 2D image is illustrated in Fig. 2. Consider an input image of size $n \times n$ (pixels), with $n$ being the width (and also the height) of the image, and a filter with a dimension of $f \times f$, as shown in Fig. 3a and Fig. 3b respectively.
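Convolving the input image with the filter results in an $m \times m$ convolution output where $m = (n - f + 2p)/s + 1$, with $p$ being the size of zero padding and $s$ being the stride of the convolution operation. Fig. 3c shows the 2D convolution output with size of $m \times m$. For simplicity of presentation, we present the case of no padding and stride 1, which means $m = n - f + 1$. The matrix representations for the general case nevertheless take the same form. The resulting output elements of the convolution operation can be computed as follows

$$z_{11} = x_{11} v_{11} + x_{12} v_{12} + \cdots + x_{1f} v_{1f} + x_{21} v_{21} + \cdots + x_{ff} v_{ff} \tag{2}$$

Equation (2) can be rewritten as

$$z_{11} = x^f_{11} v$$

where

$$x^f_{11} = [x_{11}, x_{12}, \cdots, x_{1f}, x_{21}, \cdots, x_{ff}]$$

$$v = [v_{11}, v_{12}, \cdots, v_{1f}, v_{21}, \cdots, v_{ff}]^T \tag{7}$$

Similarly, we have

$$z_{12} = x^f_{12} v, \qquad x^f_{12} = [x_{12}, x_{13}, \cdots, x_{1(f+1)}, x_{22}, \cdots, x_{f(f+1)}]$$

Generally, for $i, j = 1..m$,

$$z_{ij} = x^f_{ij} v \tag{10}$$

Note that $x^f_{ij}$ is simply a row vector (with $f^2$ elements) containing all the pixels of the input image that correspond to the output neuron at the $i$th row and $j$th column. In other words, $x^f_{ij}$ is the vector form of the neuron's receptive field. We can combine all $z_{ij}$, and the equations for the output can be given in a compact form as follows

$$z = X^f v \tag{11}$$

where

$$z = [z_{11}, z_{12}, \cdots, z_{1m}, z_{21}, \cdots, z_{mm}]^T \in \mathbb{R}^{m^2 \times 1} \tag{12}$$

is a vector which represents the convolution output,

$$X^f = \begin{bmatrix} x^f_{11} \\ x^f_{12} \\ \vdots \\ x^f_{mm} \end{bmatrix} \in \mathbb{R}^{m^2 \times f^2} \tag{13}$$

is defined as the filter-dependent input matrix, which is constructed from the input image based on the filter design, and $v \in \mathbb{R}^{f^2 \times 1}$ defined in (7) represents the unknown weights of the filter.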
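When there are $n_f$ filters to extract $n_f$ features from the input images as shown in Fig. 2, equation (11) can be extended as follows

$$Z = X^f V$$

where $V = [v_1, v_2, \cdots, v_{n_f}] \in \mathbb{R}^{f^2 \times n_f}$ with $v_i$ representing the weights of the $i$th filter, and $Z = [z_1, z_2, \cdots, z_{n_f}] \in \mathbb{R}^{m^2 \times n_f}$ with $z_i$ being the resulting output feature when convolving the input with the $i$th filter.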
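As an illustration of the matrix form above, the following NumPy sketch builds the filter-dependent input matrix for a small 2D image (no padding, stride 1) and checks the product against a direct sliding-window computation. The function names, the toy sizes and the random data are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    """Output width m = (n - f + 2p)/s + 1, assuming the division is exact."""
    return (n - f + 2 * p) // s + 1

def filter_dependent_input_matrix(x, f):
    """Build X^f (m^2 x f^2) from an n x n image, no padding, stride 1.

    Row (i*m + j) holds the receptive field of output neuron (i, j),
    flattened row by row, matching the ordering of the filter vector v.
    """
    n = x.shape[0]
    m = conv_output_size(n, f)
    rows = [x[i:i + f, j:j + f].reshape(-1) for i in range(m) for j in range(m)]
    return np.stack(rows)

# Hypothetical toy sizes: 6x6 image, four 3x3 filters.
rng = np.random.default_rng(0)
n, f, n_f = 6, 3, 4
x = rng.standard_normal((n, n))
filters = rng.standard_normal((n_f, f, f))

Xf = filter_dependent_input_matrix(x, f)   # shape (m^2, f^2)
V = filters.reshape(n_f, -1).T             # shape (f^2, n_f), column i is filter i
Z = Xf @ V                                 # shape (m^2, n_f)

# Reference: direct sliding-window sum of products.
m = conv_output_size(n, f)
Z_direct = np.zeros((m * m, n_f))
for k in range(n_f):
    for i in range(m):
        for j in range(m):
            Z_direct[i * m + j, k] = np.sum(x[i:i + f, j:j + f] * filters[k])
```

The weight sharing is visible in the fact that every row of `Xf` is multiplied by the same column of `V`.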

2) CONVOLUTION WITH 3D INPUTS
We now consider an input of size $n \times n \times n_c$ with $n$ being the width (and also the height) and $n_c$ being the number of channels (or the depth). For the input layer of a CNN, $n_c$ should be 1 for gray images (as in the previous subsection) and 3 for RGB color images. To do the convolution, each filter should have the dimension of $f \times f \times n_c$. The convolution operation with 3D inputs is illustrated in Fig. 4.
Similar to the 2D inputs, it is possible to represent the convolution operations in a similar form as in (10) by defining $x^f_{ij}$ and $v$ as follows

$$x^f_{ij} = [x_{ij1}, \cdots, x_{(i+f-1)(j+f-1)1}, \cdots, x_{ijn_c}, \cdots, x_{(i+f-1)(j+f-1)n_c}]$$

and

$$v = [v_{111}, \cdots, v_{ff1}, \cdots, v_{11n_c}, \cdots, v_{ffn_c}]^T \tag{17}$$

where $x_{ijk}$ and $v_{ijk}$ are respectively the pixel values of the input as seen in Fig. 5 and the weights of the filter as seen in Fig. 6, with the entries of both vectors listed in the same order. Again, note that $x^f_{ij}$ is simply a row vector (with $f^2 \cdot n_c$ elements) containing all the pixels in the receptive field of the output neuron at the $i$th row and $j$th column, and $v$ is a column vector (with $f^2 \cdot n_c$ elements) containing all the weights of the filter. Hence, the output can be calculated as

$$z_{ij} = x^f_{ij} v \tag{18}$$

Similarly, denoting $z \in \mathbb{R}^{m^2 \times 1}$ (a column vector of $m^2$ elements) and $X^f \in \mathbb{R}^{m^2 \times f^2 n_c}$ (a matrix of $m^2$ rows and $f^2 \cdot n_c$ columns) as in (12) and (13) respectively, we also have

$$z = X^f v \tag{19}$$

When there are $n_f$ filters to extract $n_f$ features from the input images, equation (19) can be extended as follows

$$Z = X^f V \tag{20}$$

where $V = [v_1, v_2, \cdots, v_{n_f}] \in \mathbb{R}^{f^2 n_c \times n_f}$ with $v_i$ representing the weights of the $i$th filter, and $Z = [z_1, z_2, \cdots, z_{n_f}] \in \mathbb{R}^{m^2 \times n_f}$ with $z_i$ being the resulting output feature when convolving the input with the $i$th filter.
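For multi-channel inputs, the same construction can be vectorized. The sketch below uses NumPy's `sliding_window_view` to gather all receptive fields at once; the channel-first ordering inside each row is an assumption of this example (any ordering works as long as the filter vectors use the same one), and all names and sizes are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_dependent_input_matrix_3d(x, f):
    """x: (n, n, n_c) -> X^f: (m^2, f^2 * n_c), no padding, stride 1.

    Each row lists one receptive field channel by channel
    (assumed ordering: channel, then row, then column).
    """
    windows = sliding_window_view(x, (f, f), axis=(0, 1))  # (m, m, n_c, f, f)
    m = windows.shape[0]
    return windows.reshape(m * m, -1)

# Hypothetical toy sizes: 5x5 RGB-like input, four 2x2x3 filters.
rng = np.random.default_rng(1)
n, n_c, f, n_f = 5, 3, 2, 4
x = rng.standard_normal((n, n, n_c))
filters = rng.standard_normal((n_f, f, f, n_c))

Xf = filter_dependent_input_matrix_3d(x, f)
# Reorder each filter to (channel, row, column) so it matches the rows of Xf.
V = filters.transpose(0, 3, 1, 2).reshape(n_f, -1).T   # (f^2 * n_c, n_f)
Z = Xf @ V                                             # (m^2, n_f)

# Reference: direct sum of products over every receptive field.
m = n - f + 1
Z_direct = np.array([
    [np.sum(x[i:i + f, j:j + f, :] * filters[k]) for k in range(n_f)]
    for i in range(m) for j in range(m)
])
```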

3) ACTIVATION FUNCTION
After the convolution operation, an activation function is applied to each element of the resulting matrix $Z$ in (20) to produce the matrix $R$, which has the same dimension as $Z$ ($m^2$ rows and $n_f$ columns).
A commonly used activation function is ReLU [26].

4) POOLING OPERATIONS
The function of the pooling layer is to down-sample the feature maps in each channel. A kernel of size $k_p \times k_p$ is used to summarize the key feature of the region it covers in each channel of the feature maps. The kernel is moved with a stride of $s \times s$. We consider the case where the kernel size and the stride of the pooling layer are equal, i.e. $k_p = s$. In this case, with an input of size $m \times m \times n_f$ as in Fig. 7, the output of the pooling operation has a size of $\frac{m}{s} \times \frac{m}{s} \times n_f$. With $R$ being the matrix representation of the pooling input (output of equation (21) above) and $T$ being the matrix representation of the pooling output, the pooling operation can be represented as follows:

As the pooling operation is done independently for each channel, it can be represented as follows for the $i$th channel of the pooling input $R$ (in total $n_f$ channels),

where $r_i$ is the $i$th column of $R$, and $t_i$ is the $i$th column of $T$.
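One way to see how pooling acts on a single column $r_i$ is to write it as an explicit matrix product. The sketch below builds such a matrix for average pooling (fixed) and for max pooling (a selection matrix that depends on the input). The function names and the explicit construction are assumptions of this example and are not the paper's notation.

```python
import numpy as np

def average_pooling_matrix(m, s):
    """Matrix P of shape (m/s)^2 x m^2 with t = P r for average pooling (kernel = stride = s)."""
    q = m // s
    P = np.zeros((q * q, m * m))
    for a in range(q):
        for b in range(q):
            for di in range(s):
                for dj in range(s):
                    P[a * q + b, (a * s + di) * m + (b * s + dj)] = 1.0 / (s * s)
    return P

def max_pooling_matrix(r, m, s):
    """Input-dependent 0/1 selection matrix: each row picks the largest entry of its window."""
    q = m // s
    P = np.zeros((q * q, m * m))
    for a in range(q):
        for b in range(q):
            idx = [(a * s + di) * m + (b * s + dj) for di in range(s) for dj in range(s)]
            P[a * q + b, idx[int(np.argmax(r[idx]))]] = 1.0
    return P

# Hypothetical toy channel: a 4x4 feature map stored as a length-16 vector, pooled 2x2.
rng = np.random.default_rng(2)
m, s = 4, 2
r = rng.standard_normal(m * m)

t_avg = average_pooling_matrix(m, s) @ r
t_max = max_pooling_matrix(r, m, s) @ r

# Reference values computed by reshaping into (block row, row in block, block col, col in block).
blocks = r.reshape(m // s, s, m // s, s)
ref_avg = blocks.mean(axis=(1, 3)).reshape(-1)
ref_max = blocks.max(axis=(1, 3)).reshape(-1)
```

With `s = 1` the average pooling matrix reduces to the identity, which corresponds to a layer without pooling.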

5) FLATTENING OPERATION
At the end of the convolutional part of a CNN, a flattening operation is required to convert the 3D output feature map into a list of elements before being fed to the fully connected layers. For the matrix representation, the flattening operation is just a rearrangement of the elements in the matrix $T \in \mathbb{R}^{(m^2/s^2) \times n_f}$ into a vector. We can write the matrix $T$ as

$$T = [t_1, t_2, \cdots, t_{n_f}]$$

where $t_i$ is the $i$th channel of the output of the pooling layer. The flattening operation can be expressed as

We consider a deep CNN with $n_{conv}$ convolutional layers and $n_{FC}$ fully connected layers as illustrated in Fig. 8. At the $j$th conv layer, we denote $n_j$ as the width and the height, $n_{cj}$ as the number of channels (or the depth) of the input volume, $n_{fj}$ as the number of filters, $m_j$ as the width and the height of the output volume after the convolution operation, and $s_j$ as the stride of the pooling operation. The output volume of the $j$th conv layer can be computed by

and the depth is

For a convolutional layer that does not have the pooling operation, the pooling matrix of that layer is an identity matrix. At the last conv layer, we have

The flattening operation occurs at this layer; the input of the FC part is hence computed as follows

The output of the CNN can thus be computed as

where $W_j$ and $\phi_j$ are the weight matrix and the vector of activation functions at the $j$th FC layer respectively. The output activation function vector can also be denoted as $\sigma = \phi_{n_{FC}}$. Therefore, the overall equation is given as

III. FORWARD PROGRESSIVE LEARNING OF CONVOLUTIONAL NEURAL NETWORKS

A. FORMULATION
In forward progressive learning (FPL) of CNNs, the entire CNN is trained part by part in sequence. That is, the convolutional part is trained first and then the fully connected part is trained afterwards. The convolutional part of the CNN is learned in a layer-wise manner as illustrated in Fig. 9. Each time, a convolutional subnet which contains 1 convolutional layer and 1 FC layer is trained. The subnet, which can be referred to as two-layer training CNN in this work, is detailed in Fig. 10. The presence of the pooling action in the subnet depends on whether the convolutional layer in the original structure of the entire CNN has the pooling action or not. When learning the subnet of the j th convolutional layer with filter weight matrix V j , an FC layer with pseudo weight matrix W j is used. After the subnet is trained, the FC layer is discarded. V j is then frozen and the computed output T j of the j th convolutional layer is used as the input X j+1 of the subnet of the next convolutional layer. The learning process continues until reaching the last convolutional layer.
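The control flow described above (train a subnet, discard its pseudo FC layer, freeze the filter weights, pass the layer output on) can be sketched as follows. To keep the example short and runnable, dense layers stand in for the convolutional layers, and the subnet trainer is a deliberately simple stand-in (random hidden weights plus a least-squares output layer), not the update laws developed in this paper. All function names, sizes and data are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def toy_train_subnet(inputs, targets, width, rng):
    """Stand-in trainer (NOT the paper's update laws): random hidden weights V
    and a least-squares pseudo output layer W for one two-layer subnet."""
    V = rng.standard_normal((inputs.shape[1], width)) / np.sqrt(inputs.shape[1])
    H = relu(inputs @ V)
    W, *_ = np.linalg.lstsq(H, targets, rcond=None)
    return V, W

def forward_progressive_learning(inputs, targets, widths, train_subnet, rng):
    """Train one subnet per layer, front to back, without a global backward pass."""
    frozen = []
    for width in widths:
        V, W_pseudo = train_subnet(inputs, targets, width, rng)
        frozen.append(V)             # V_j is frozen after its subnet is trained
        del W_pseudo                 # the pseudo FC layer is discarded
        inputs = relu(inputs @ V)    # output of layer j is the input of subnet j+1
    return frozen, inputs

# Hypothetical toy data: 50 samples, 10 features, 3 one-hot classes.
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 10))
Y = np.eye(3)[rng.integers(0, 3, size=50)]
frozen, feats = forward_progressive_learning(X, Y, [8, 6, 4], toy_train_subnet, rng)
```

Because `train_subnet` is passed in as an argument, any subnet training rule with the same signature could be substituted without changing the outer loop.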
The subsequent subsection will introduce in detail the subnets (two-layer training CNNs) in the FPL of CNNs.

B. TWO-LAYER TRAINING CNNs
Fig. 10 shows a two-layer training CNN as the subnet of the $j$th convolutional layer of the entire CNN, with one convolutional layer and one FC layer. The output of the two-layer training CNN can be expressed as follows

$$y = \sigma(W_j \phi_j) \tag{43}$$

where $\phi_j$ is the vector obtained by flattening the pooled and activated output of the convolution $X^{f_j}_j V_j$. Here, $n_j$ denotes the width and the height and $n_{cj}$ the number of channels (or the depth) of the input volume, $f_j$ the kernel size of the filters, $n_{fj}$ the number of filters, $m_j$ the width and the height of the output volume after the convolution operation, $s_j$ the stride of the pooling operation, and $n_{fl,j}$ the number of neurons after the flattening; $\phi_j$ is the vector after flattening, $W_j \in \mathbb{R}^{p \times n_{fl,j}}$ is the FC pseudo weight matrix, and $y \in \mathbb{R}^{p \times 1}$ is the output vector of the network. The flattening operation can be expressed as

Equation (43) can be rewritten as follows

where $W_{j,i} \in \mathbb{R}^{p \times (m_j^2/s_j^2)}$ denotes a sub-matrix of $W_j$, $v_{j,i}$ is the $i$th column vector of matrix $V_j$, and $\varphi_{j,i}$ is the $i$th column vector of the activation output matrix. It can be seen that $v_{j,i}$ and $\varphi_{j,i}$ correspond to the weight of the $i$th filter and the respective output ($i = 1..n_{fj}$).
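The role of the sub-matrices $W_{j,i}$ can be checked numerically: if the flattening stacks the channels $t_1, \ldots, t_{n_f}$ one after another, then the FC product splits into one term per filter. The column-stacking order is an assumption of this sketch (it is the order that matches sub-matrices of width $m_j^2/s_j^2$), and the sizes and variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy sizes: q2 = m^2/s^2 pooled neurons per channel, n_f channels, p classes.
rng = np.random.default_rng(4)
q2, n_f, p = 4, 3, 5
T = rng.standard_normal((q2, n_f))      # pooled output, one column per filter
W = rng.standard_normal((p, q2 * n_f))  # FC pseudo weight matrix

# Flatten by stacking the columns t_1, ..., t_nf (assumed ordering).
phi = T.reshape(-1, order="F")

# Full product versus the per-filter decomposition with sub-matrices W_i.
full = W @ phi
per_filter = sum(W[:, i * q2:(i + 1) * q2] @ T[:, i] for i in range(n_f))
```

This per-filter split is what allows each filter's weights to be related separately to the output.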
Each two-layer training CNN is trained through 2 phases: pre-training and fine-tuning. For the pre-training phase, the FC layer ($W_j$) can be trained by using any standard learning algorithm for the output weights, such as least squares methods [27], [28] or the one-layer update algorithm for FCNs [21]. The fine-tuning phase for the subnet is presented in the subsequent subsection.
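As one possible reading of the least squares option for pre-training, the sketch below fits the FC weights in closed form while the convolutional features are held fixed. It treats the output as linear and adds a small ridge term for numerical stability; both simplifications, the function name and the `ridge` parameter are assumptions of this example rather than the procedure of [27], [28] or [21].

```python
import numpy as np

def pretrain_fc_least_squares(Phi, Y, ridge=1e-6):
    """Closed-form fit of W (p x n_fl) minimizing ||Y - W Phi||^2 + ridge * ||W||^2.

    Phi: (n_fl, N) flattened features for N samples, held fixed.
    Y:   (p, N) targets. `ridge` is a hypothetical stabilizing parameter.
    """
    n_fl = Phi.shape[0]
    A = Phi @ Phi.T + ridge * np.eye(n_fl)
    return np.linalg.solve(A, Phi @ Y.T).T

# Hypothetical toy check: targets generated by a known linear map are recovered.
rng = np.random.default_rng(5)
n_fl, p, N = 8, 3, 200
Phi = rng.standard_normal((n_fl, N))
W_true = rng.standard_normal((p, n_fl))
Y = W_true @ Phi
W_fit = pretrain_fc_least_squares(Phi, Y)
```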

C. FINE-TUNING ALGORITHM
In this section, we develop update laws to fine-tune concurrently the filter weights $V_j$ and the pseudo output weights $W_j$ of the two-layer training CNN. With sufficient neurons in the hidden layer, there exist optimal weight matrices $V_j$ and $W_j$ such that the output of the training CNN given in (43) can approximate the target $y$ in (1). At the $k$th learning step, we have

The weight matrices $V_j$ and $W_j$ are updated incrementally by 2 update laws. Their estimated values at the $k$th step of learning are denoted as $\hat{V}_j(k)$ and $\hat{W}_j(k)$ respectively. The estimated output $\hat{y}(k)$ at the $k$th step is defined as

The output estimation error at the $k$th step can be calculated by $e(k) = y(k) - \hat{y}(k)$. Hence,

which can be written as

Properties of $\phi_j(k)$. Similar to (45), we have
For average pooling, j,i =ˆ j,i (k). For max pooling, j,i ≈ j,i (k) in the fine-tuning phase. In this phase, it is also possible to use the following property where j,i (k) ∈ R m 2 j ×m 2 j is a diagonal matrix whose diagonal entries are defined aŝ withφ j,i,l (k) being the l th -row, l th -column entry of j,i (k), x j,rl (k) being the l th row of X f j j (k). Therefore, Properties of δ(k). Let the activation functions σ be monotonically increasing and their derivatives be bounded above by f σ , the following properties will hold: i, The corresponding elements of e(k) in (48) and δ(k) in (49) have the same sign, i.e.
ii, The absolute value of each element of e(k) in (48) is less than or equal to f σ times the absolute value of the corresponding element of δ(k) in (49), i.e.
Based on the output estimation error $e(k)$, the learning law to update the estimated weight $\hat{W}_j(k)$ is proposed as follows

where $\alpha_2$ is a positive scalar and $L(k) \in \mathbb{R}^{p \times p}$ is a positive diagonal matrix. In (58), let $w_{j,h}$ denote the $h$th column vector of matrix $W_j$, $\hat{w}_{j,h}(k)$ the $h$th column vector of $\hat{W}_j(k)$ and $\hat{\phi}_{j,h}(k)$ the $h$th element of vector $\hat{\phi}_j(k)$. The update law (58) can be rewritten in vector form as

The learning law to update the estimated filter weight matrix $\hat{V}_j(k)$ based on the output estimation error $e(k)$ is proposed as

where $\hat{v}_{j,i}(k)$ denotes the $i$th column vector of matrix $\hat{V}_j(k)$ and $\alpha_1$ is a positive scalar. It can be seen that $v_{j,i}$ corresponds to the $i$th filter among the $n_{fj}$ filters of the convolution operation. $P_i(k)$ is chosen as

To prove the convergence, an objective function is defined as

Using (59) and (60), the objective function at the next step of learning can be expressed as

A change of the objective function from the current learning step $k$ to the next learning step $k + 1$ can therefore be calculated as

Denoting

From (50), we have

Next, substituting into (64) gives

Replacing (61) and (55) into $\xi(k)$ leads to $\xi(k) \approx 0$. Thus, (67) becomes

At the fine-tuning phase where the errors are adequately small, the last 2 terms in (69) are negligible as they are of $O^3$ while the other terms are of $O^2$. Equation (69) becomes

Using the properties (56), (57) yields

then the change of the objective function satisfies $\Delta V(k) \le 0$ for any $e(k)$. Hence, the value of the objective function satisfies $V(k + 1) \le V(k)$. Moreover, the function $V(k)$ is bounded from below as it is non-negative; we thus have that $V(k)$ converges. So, from (71), $e(k)$ converges when $k$ increases.

Remark 1: For networks with an insufficient number of hidden neurons, it can be guaranteed that

and

Therefore, there exists an ultimate bound such that the error always stays within the bound after reaching it. Note from (74) that this ultimate bound tends to zero when the bound of the NN approximation error tends to zero.
Remark 2: While the convergence issue in deep learning has only received more attention from the ML community recently due to the interest in explainable AI (XAI), convergence and stability analyses have always been important for neural network-based learning control since the early days of its research [13]-[15]. For robotic applications, neural network-based control has also been an active research topic [29]-[35]. These problems are mostly formulated as regression problems for dynamic control, and most of the studies have focused on shallow networks only. In [21], multilayer fully connected networks were employed for both regression and classification tasks in robotic systems, where a Lyapunov-like method was developed for convergence analysis of deep learning networks. It has thus bridged the gaps between the fields of deep learning, control and robotics, so that deep dense networks can be used reliably in robotic applications. These formulations are all based on dense or fully connected networks, but for image classification tasks, deep convolutional neural networks have been shown to be more effective. The convolutional neural networks can be treated as more general networks for image classification tasks as they consist of convolutional layers, pooling layers and fully connected layers.

IV. CASE STUDIES
In this section, the proposed FPL method for CNNs is evaluated based on different network architectures and datasets. Comparisons are made with the stochastic gradient descent (SGD) method.

A. SVHN
The first classification task is based on the SVHN dataset [36]. It is a dataset that contains real-world images of house numbers in 10 classes, one class for each digit. It has 73,257 digits for training and 26,032 digits for testing. Each of these images has the size of 32×32 pixels. The SVHN dataset is similar to the classical MNIST dataset [24], but it contains natural images of house numbers instead of images of handwritten digits.
The VGG11 [3] was used for the classification task of SVHN. The architecture of the VGG11 is shown in Fig. 15. The VGG11 has 8 convolutional layers with pooling operations at conv 1, 2, 4, 6 and 8, and 3 fully connected layers. The activation functions of the convolutional layers and inner fully connected layers are ReLU, and the activation function of the output layer is sigmoid.

1) CONVERGENCE ISSUE IN SGD
The SGD optimizer was first used to train the VGG11. For consistency with the FPL method, the batch size for SGD was chosen as 1. The number of epochs was set at 300. The learning rate was initially set as 0.01, and scheduled to be halved after the 150th epoch. The learning converged at first, but then failed to converge after the 100th epoch, as can be seen in Fig. 11 and Fig. 12.
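The learning-rate schedule stated above can be written as a small function. This is a sketch of the stated settings only (0.01 initially, halved once after epoch 150); the function name and the 1-indexed epoch convention are assumptions of this example.

```python
def sgd_learning_rate(epoch, initial_lr=0.01, halve_after=150):
    """Learning rate for a 1-indexed epoch: constant, then halved once after `halve_after`."""
    return initial_lr if epoch <= halve_after else initial_lr / 2

# Learning rate for each of the 300 epochs used in this experiment.
schedule = [sgd_learning_rate(epoch) for epoch in range(1, 301)]
```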

2) NETWORK PERFORMANCE
To illustrate the performance of the proposed method, the convolutional part of the VGG11 was trained by FPL through training 8 conv subnets sequentially, each of which was a two-layer training CNN, and the fully connected part of the network was trained by FPL through training 2 FC subnets. There were 2 phases in learning of each subnet: pre-training and fine-tuning. In the pre-training phase, the output layer, i.e. the fully connected part, of the subnet was trained in 2 loops by using the one-layer update algorithm of fully connected networks [21]. In this phase, the learning gain was automatically calculated. In the fine-tuning phase, the update laws (58) and (60) were used, where the convolutional part was updated. There were about 200-600 loops for fine-tuning of each conv subnet and 100 loops for each FC subnet. For the conv subnets, the training stopped at the 200th, 500th or 600th loop when overfitting was likely to start happening. The initial gain for the fine-tuning of all subnets was set at 0.01. To make sure that the initial gain was not too large for convergence, in the first several loops, the gain was automatically reduced by checking the condition (73). After the gain was adjusted to a suitable value, the new value of the gain matrix was kept for the successive loops of fine-tuning. Fig. 13 shows the best test accuracy for each subnet. It can be seen that the test accuracy increases when adding a new convolutional layer for the first 4 convolutional layers and then maintains similar accuracies when adding the remaining convolutional and fully connected layers. The accuracy for the last hidden fully connected layer is 93.91%.
For comparison, the SGD was used again but with a smaller learning rate (0.005) so that the convergence did not fail as in subsection IV-A1. There were again 300 epochs. The obtained results are shown in Table 1. It can be seen from the table that there is a trade-off in the performance where the proposed FPL can guarantee convergence but the test accuracy (full net) is slightly lower than that of SGD.

3) THE POSSIBILITY OF PRUNING TOP LAYERS
Besides ensuring convergence, another advantage of using the proposed layer-wise learning is the possibility of determining an appropriate number of convolutional layers. As seen in Fig. 13, the test accuracy peaks after adding the first few convolutional layers and then does not improve further with more layers added. To see if the trend happens for a new test set with unseen data as well, we have also tested the trained subnets with the MNIST dataset, where the classification task is similar. The tests were done for both the training set and the test set of MNIST. To avoid confusion, we shall refer to the training set and test set of MNIST in this case as test set A and test set B respectively. To do the tests, we extended the number of channels of the input images from 1 to 3, to match the number of channels of the color images of SVHN. It can be seen from Fig. 14 that the trend in the test set of SVHN is also present for both test set A and test set B obtained from MNIST. This indicates the possibility of pruning the convolutional layers based on the proposed layer-wise method. The layer-wise learning can also be terminated when the accuracies do not further improve by adding one or two more layers.
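The termination rule mentioned above (stop when one or two additional layers bring no further improvement) could be implemented along the following lines. The function name, the `patience` and `min_gain` parameters, and the accuracy values are all made up for illustration; the numbers are not results from this paper.

```python
def select_depth(accuracies, patience=2, min_gain=0.1):
    """Return the number of layers to keep, given per-subnet test accuracies (in %).

    Scanning front to back, a layer counts as an improvement only if it beats the
    best accuracy so far by more than `min_gain`; the scan stops once `patience`
    consecutive layers fail to do so.
    """
    best_idx = 0
    for idx in range(1, len(accuracies)):
        if accuracies[idx] > accuracies[best_idx] + min_gain:
            best_idx = idx
        elif idx - best_idx >= patience:
            break
    return best_idx + 1

# Hypothetical per-layer accuracies that plateau after the 4th layer.
example_accuracies = [80.0, 88.0, 91.5, 93.0, 93.05, 93.02, 93.1, 93.0]
depth = select_depth(example_accuracies)
```

In a layer-wise setting the scan can run during training, so layers beyond the selected depth need not be trained at all.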
For comparison, the entire VGG11 network trained by SGD was also tested with the MNIST dataset. The results for FPL (at the conv4 subnet and full net) and SGD (full net) are shown in Table 2. It can be seen that the FPL generalizes better for this specific problem even with the use of fewer layers and parameters.
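The text does not state how the MNIST images were extended from 1 to 3 channels for these tests. A minimal sketch, assuming the single grayscale channel is simply replicated three times, is given below; the function name and the random stand-in data are hypothetical.

```python
import numpy as np

def to_three_channels(gray_images):
    """(N, H, W) grayscale -> (N, H, W, 3) by replicating the channel (assumed method)."""
    return np.repeat(gray_images[..., np.newaxis], 3, axis=-1)

# Stand-in for a small batch of 28x28 grayscale images.
rng = np.random.default_rng(6)
gray = rng.random((5, 28, 28))
rgb_like = to_three_channels(gray)
```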

B. FASHION MNIST
Fashion MNIST [25] is a dataset of fashion products in 10 classes: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot. The dataset contains 60,000 training examples and 10,000 test examples. Each example is a 28×28 grayscale image. The dataset was intended to serve as a replacement for the classical MNIST database.

1) AlexNet-LIKE CNN
We also conducted experiments on a CNN with no hidden fully connected layers in the full net, aiming to test the FPL algorithm for convolutional layers which has been developed in this paper. An AlexNet-like CNN whose architecture is shown in Fig. 16 was chosen to achieve that aim. The design of the network was inspired by the original structure of the AlexNet first introduced in [2]. All of the input images were resized to 32 × 32 before being fed into the network. The AlexNet-like CNN has 5 convolutional layers with pooling operations at conv 1, 2 and 5. The activation functions of the convolutional layers are ReLU, and the activation function of the output layer (the only fully connected layer) is sigmoid.

[Table caption: Number of parameters and neurons in the VGG11 whose structure is given in Fig. 15.]
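The resizing method used to bring the 28 × 28 Fashion MNIST images to 32 × 32 is not specified in the text. The sketch below shows one simple possibility, nearest-neighbour resizing in plain NumPy; the choice of method, the function name and the stand-in data are assumptions of this example.

```python
import numpy as np

def resize_nearest(images, out_size=32):
    """(N, H, W) -> (N, out_size, out_size) by nearest-neighbour sampling (assumed method)."""
    _, h, w = images.shape
    row_idx = np.floor(np.arange(out_size) * h / out_size).astype(int)
    col_idx = np.floor(np.arange(out_size) * w / out_size).astype(int)
    return images[:, row_idx][:, :, col_idx]

# Stand-in for a small batch of 28x28 grayscale images.
rng = np.random.default_rng(7)
small = rng.random((4, 28, 28))
resized = resize_nearest(small)
```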
The network was trained by FPL through training 5 subnets sequentially, each of which was a two-layer training CNN. There were 2 phases in learning of each subnet: pre-training and fine-tuning. In the pre-training phase, the output layer, i.e. the fully connected part, of the subnet was trained in 2 loops using the one-layer update algorithm which has been developed in [21]. In this phase, the learning gain was automatically calculated. In the fine-tuning phase, the update laws (58) and (60) were used. There were 200-600 loops for training of each subnet. The training stopped at the 200th, 400th or 600th loop when overfitting was likely to start happening. The initial gain for the fine-tuning of all subnets was set at 0.001. To make sure that the initial gain was not too large for convergence, in the first several loops, the gain was automatically reduced by checking the condition (73). After the gain was adjusted to a suitable value, the new value of the gain matrix was kept for the successive loops of fine-tuning. Fig. 17 shows the best test accuracy for each subnet. It can be seen that the test accuracy increases when adding a new convolutional layer for the first 4 convolutional layers and then maintains a similar value when adding the last convolutional layer. The accuracies for the 3rd, 4th and 5th layers are in fact quite similar.
The SGD optimizer was also used to train the AlexNet-like CNN for 500 epochs. The learning rate was initially set to 0.001 and the momentum was 0.9. The highest test-set accuracy recorded during training is shown in Table 4. The accuracy for the training set in  From Table 4, it is noted that the test and training accuracies of FPL are comparable with those of SGD. Since the network is not deep compared with VGG11, the reduction in parameters obtained by implementing the subnet is not significant, but a similar test accuracy can be maintained.

2) VGG11
We then continued the experiments on Fashion MNIST with VGG11. All input images were again resized to 32 × 32 before being fed into the network. The network was trained in the same way as for SVHN. Fig. 18 shows the best test accuracy for each subnet. The test accuracy increases as each of the first 3 convolutional layers is added and then stays at similar values when the 4th, 5th, and 6th convolutional layers are added. The accuracies for the remaining successive layers are the same (93.81%).
SGD was also used to train the entire VGG11. Table 5 shows that the test and training accuracies of FPL (full net and subnet) are comparable with those of SGD. In addition, the proposed training method indicates that a subnet can be implemented with fewer parameters but similar accuracy.

C. CIFAR-10
The next classification task is based on the CIFAR-10 database [37], which contains color images of objects in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. It has 50,000 examples in the training set and 10,000 examples in the test set. Each image has a size of 32×32 pixels.
VGG11 was used for the CIFAR-10 classification task. The network was trained in the same way as for Fashion MNIST. Fig. 19 shows the best test accuracy for each subnet. The test accuracy increases as each of the first 4 convolutional layers is added and then stays at similar values when the remaining convolutional and fully connected layers are added. The accuracy for the last hidden fully connected layer is 87.30%.
SGD was also used to train the entire VGG11. Table 6 shows that the test accuracy of FPL is lower than that of SGD (about 2% lower for the full net and 1.6% lower for the subnet). Although there is a trade-off in performance, convergence can now be ensured in training, and the model can be implemented with fewer parameters while achieving an accuracy similar to that of the full net trained by FPL.

D. KMNIST
The next classification task is based on the KMNIST database [38], which contains 10 classes of Japanese characters. It has 60,000 images for training and 10,000 images for testing. Each image has a size of 28×28 pixels.
VGG11 was used for the KMNIST classification task. The images were resized to 32×32 before being fed into the network. The network was trained in the same way as for Fashion MNIST. Fig. 20 shows the best test accuracy for each subnet. The test accuracy increases as each of the first 4 convolutional layers is added and then stays at similar values when the remaining convolutional and fully connected layers are added. The accuracy for the last hidden fully connected layer is 93.63%.
SGD was also used to train the entire VGG11. Table 7 shows that the training and test accuracies of FPL (full net and subnet) are similar to those of SGD.

E. DISCUSSION
The experimental results show that there is a trade-off between test accuracy and analytic learning. The test accuracies are comparable with those of SGD for some tasks and slightly lower for others. Convergence is observed in all case studies of the proposed method, whereas convergence cannot be assured when training with SGD if the learning rate is not chosen carefully by trial and error, as reported in subsection IV-A1.
With the FPL method, convergence can be guaranteed and the learning gain can be tuned automatically using condition (73). Another benefit of FPL is the possibility of reducing the number of layers. As can be seen from Fig. 13, Fig. 14, and Fig. 18 to Fig. 20, and as discussed in section IV-A, the test accuracies peak after the first few convolutional layers and then remain at similar values when more layers are added. Therefore, with the FPL method, one could consider pruning the top layers of the full network or terminating the layer-wise training earlier to save computational resources and training time. Table 3 shows the number of parameters and neurons of VGG11, which indicates how many could be saved by removing the top layers of the full network. Considering the results of every subnet, FPL can also perform better than SGD in some cases.
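One simple way to turn the plateau observation above into a depth choice is to pick the shallowest subnet whose test accuracy lies within a small tolerance of the best subnet. The sketch below is an illustration only: the function name, the tolerance of 0.5 percentage points, and the accuracy values are hypothetical and are not results from the paper.

```python
def smallest_adequate_subnet(accuracies, tolerance=0.5):
    """Return the 1-based depth of the shallowest subnet whose test accuracy
    (in percent) is within `tolerance` percentage points of the best one.

    The selection rule and the default tolerance are assumptions made for
    illustration; the paper does not prescribe a specific criterion.
    """
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    best = max(accuracies)
    for depth, accuracy in enumerate(accuracies, start=1):
        if best - accuracy <= tolerance:
            return depth


# Illustrative per-subnet test accuracies (placeholder values, not paper data)
illustrative = [80.0, 88.0, 91.5, 92.8, 93.0, 92.9]
print(smallest_adequate_subnet(illustrative))  # 4
```

A tighter tolerance favors deeper subnets, while a looser one saves more parameters at the cost of accuracy.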

V. CONCLUSION
In this paper, deep convolutional neural networks have been analyzed and trained using the forward progressive learning framework. The convergence of the proposed framework is guaranteed by theoretical analysis. The proposed method has been validated on several classification tasks with popular benchmarking datasets. The experimental results show that the proposed method yields accuracies comparable to those of the gradient descent method in most cases. In some classification tasks, there is a trade-off between performance and guaranteed convergence: the end-to-end gradient descent method performs slightly better than the layer-wise approach in terms of accuracy, whereas the proposed layer-wise approach guarantees the convergence of the learning algorithm and can also be used as an indicator of the appropriate number of layers in the final implementation of a model. The case studies have shown that the final implementations of some widely used CNNs do not need as many convolutional layers as their original structures to achieve reasonable accuracies. Additionally, while the gradient descent method performs well on each specific dataset, the comparison of generalization, in which the MNIST dataset was used as the test set for models trained on the SVHN dataset, has shown that the gradient descent method does not generalize well to similar tasks. In contrast, the proposed method generalized better to similar tasks in this case study.