Homogeneous Vector Capsules Enable Adaptive Gradient Descent in Convolutional Neural Networks

Adam Byerly and Tatiana Kalganova

Neural networks traditionally produce a scalar value for an activated neuron. Capsules, on the other hand, produce a vector of values, which has been shown to correspond to a single, composite feature wherein the values of the components of the vectors indicate properties of the feature such as transformation or contrast. We present a new way of parameterizing and training capsules that we refer to as homogeneous vector capsules (HVCs). We demonstrate, experimentally, that altering a convolutional neural network (CNN) to use HVCs can achieve superior classification accuracy without increasing the number of parameters or operations in its architecture as compared to a CNN using a single final fully connected layer. Additionally, the introduction of HVCs enables the use of adaptive gradient descent, reducing the dependence of a model's achievable accuracy on the finely tuned hyperparameters of a non-adaptive optimizer. We demonstrate our method and results using two neural network architectures. For the CNN architecture referred to as Inception v3, replacing the fully connected layers with HVCs increased the test accuracy by an average of 1.32% across all experiments conducted. For a simple monolithic CNN, we show HVCs improve test accuracy by an average of 19.16%.


I. INTRODUCTION
In [1], the authors argued that standard convolutional neural networks are "misguided" in their usage of neurons that are composed of singular scalars to summarize their activation. The authors (a) proposed the concept of a "capsule", which is comprised of multiple scalar values, and (b) posited that these capsules would be capable of recognizing a "visual entity over a limited domain of viewing conditions and deformations" and that a capsule's members would include both the probability that the entity is present and a set of "instantiation parameters" that "may include the precise pose, lighting and deformation relative to the canonical version of that entity." In their work, they (c) demonstrated that capsules could learn the x and y coordinates of a visual entity and (d) made a convincing case that capsules could learn to identify "any property of an image that we can manipulate in a known way."

Fig. 1. The standard approach to transforming the final convolutional layer into class predictions involves flattening the final set of feature maps and then classifying through one or more fully connected layers of weights.

Fig. 2. Our approach is to reshape the final set of feature maps into j n-dimensional vector capsules, where j·n is the total number of weights coming out of the final set of feature maps. The final classification is done, rather than with scalar output neurons, with y n-dimensional vector capsules. The trainable weights between these two sets of vector capsules are also a set of vectors, rather than matrices, and rather than using matrix multiplication between the layers, we use the Hadamard product (element-wise multiplication). This necessitates that these final two sets of vector capsules be of the same n dimensions; thus, we refer to them as homogeneous vector capsules.

Research into capsules did not progress much until a pair of papers was pre-published on arXiv in late 2017 [2], [3]. The first of these two papers (Sabour, Frosst, & Hinton, 2017) received an especially significant amount of attention because it published results on par with the state of the art for both the standard MNIST [4] and smallNORB [5] datasets using relatively shallow networks in combination with capsules. Additionally, the network described in the first paper was shown to be highly effective at segmenting highly overlapped digits from the MNIST data. Both papers utilized an iterative routing mechanism between layers of capsules. The method in the first paper was referred to as "Dynamic Routing"; the second paper used a different method based on the Expectation-Maximization algorithm [6]. The architecture described in the second paper (Hinton, Sabour, & Frosst, 2018) improved upon the state-of-the-art classification accuracy for smallNORB by 45%. The architectures described in both papers used two layers of capsules to make the final classification and used matrix multiplication between them. In both papers, in addition to learning the weights used in the matrix multiplications via backpropagation, a routing algorithm was employed to iteratively refine the weights of the matrices. The authors interpret the first set of capsules as "parts", the second set as "wholes", and the routing algorithm as a method for finding agreement about which whole is best described by a particular set of parts.
Both papers published results on relatively small datasets. In both cases, this was due to the high computational cost associated with using a routing algorithm. Additionally, the architecture from the first paper requires a large number of parameters per output class (147,456) just for the weights between capsule layers, making datasets with a large number of output classes (like the 1,000 classes in ImageNet) intractable.
Another important thread of neural network research is choosing the best optimization algorithm and its hyperparameters. Stochastic Gradient Descent (SGD) with momentum is simple and effective but requires careful tuning of both the learning rate η and the schedule for decaying that learning rate as training progresses. Though guidance has emerged in the form of rules of thumb [7], the choice of learning rate and decay scheme nonetheless remains a matter of trial and error, heavily dependent on the data being trained on. As such, alleviating the need to carefully tune a single learning rate has emerged as an important research area.
The most successful strategy for alleviating the need to carefully tune the learning rate has been to maintain a separate learning rate for every trainable parameter and to adapt each of these learning rates based on the magnitudes of previous gradient updates to that parameter. This approach is referred to generally as adaptive gradient descent. Research into it began in earnest with AdaGrad [8] and has continued to be an active area of research up to the present, with the most popular adaptive method currently being Adam [9]. Adaptive methods of gradient descent are popular for several reasons. First, because they adapt a learning rate for every parameter, they are able to learn sparse yet highly informative features differently than denser information that may be less predictive. Second, they reduce the need for careful tuning of the learning rate and learning rate decay by allowing the learning rate to be "learned" from the data. Third, they tend to approach convergence much earlier in training than non-adaptive methods for the same data and network.
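As a concrete illustration of the per-parameter mechanics, here is a minimal AdaGrad-style update in NumPy (a sketch of the idea introduced in [8], not code from any of the cited works; the function name is ours):

    import numpy as np

    def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-10):
        """One AdaGrad-style update: each parameter gets its own effective rate."""
        accum += grad ** 2                            # per-parameter history of squared gradients
        theta -= lr * grad / (np.sqrt(accum) + eps)   # frequently updated parameters slow down
        return theta, accum

Parameters tied to rarely occurring features accumulate little squared-gradient history and therefore retain relatively large effective learning rates.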
Unfortunately, adaptive gradient descent methods have some weaknesses. First, sparsely occurring features that are not highly informative can have outsized influence relative to less sparsely occurring features. Second, empirically, they are prone to overfitting, creating a generalization gap between in-sample and out-of-sample predictions. This has led some researchers to state that the generalization gap of adaptive gradient descent methods is an open problem [10] and others to recommend not using adaptive methods at all [7]. Indeed, the best performing convolutional neural networks (CNNs) of the past few years have all used non-adaptive gradient descent methods and hand-tuned learning rate decay schemes [11]-[16].
In this paper, we present a new way of parameterizing and training a pair of capsule layers that we call homogeneous vector capsules (HVCs) and show, empirically, that modifying CNNs to use them can achieve superior classification accuracy compared to equivalent network architectures using a final fully connected layer preceding the output predictions. HVCs use the Hadamard product (element-wise multiplication) between capsule layers rather than the traditional matrix multiplication of [2] and [3]. Additionally, HVCs avoid the expense of an iterative routing procedure, relying solely on the weights learned during backpropagation. Finally, we were able to (a) train a simple monolithic CNN to superior accuracy when using HVCs (a 63% improvement in top-1 classification accuracy and a 35% improvement in top-5 classification accuracy compared to the baseline without HVCs) with the Adam optimizer using default settings and (b) train the Inception v3 architecture, modified to use HVCs, to accuracies comparable to the baseline using the Adam optimizer, both with default settings and with a slowly decaying base learning rate. Using an adaptive gradient descent method to train a network to accuracy comparable to that achieved with a finely tuned learning rate and decay schedule solves an open problem in convolutional neural network research.

II. RELATED WORK
Morzhakov et al. [17], inspired by the work of Hubel & Wiesel [18], put forth a neural network architecture similar to that used by [2] in that it utilized vector neurons, rather than scalar neurons, which shared common inputs and outputs. As their work was inspired by the physiology of primate brains, they characterized the structure as minicolumns, the term used for the analogous structure in primate brains. It is noteworthy that their architecture did not use any analog to the routing mechanism employed by [2] and [3]. While performing comparably to traditional CNNs on the MNIST dataset, their architecture performed worse than that employed by [2].
Roy et al. [19] compared the effects of various forms of image degradation (additive white Gaussian noise, salt-and-pepper noise, etc.) on MobileNet [20], VGG16 & VGG19 [11], Inception v3 [13], and CapsNet [2] and found that CapsNet was far more robust to the degradation methods they tested than any of the others. They hypothesize that this is due not only to the presence of the capsule neurons and/or dynamic routing, but also to the shallower nature of CapsNet, having gone through fewer layers of convolutions.
Nair et al. [21] ventured to apply the CapsNet architecture proposed by [2] to datasets more complex than MNIST: Fashion-MNIST [22], SVHN [23], and CIFAR-10 [24]. Additionally, they experimented with a greater range of affine deformations than the small amount of translation used in the original experiments. Their conclusion was that the CapsNet architecture is "unlikely to work on other classification tasks, let alone machine learning tasks in general." They also concluded that the design was "not making full use of routing to encode" the spatial relationships between the components of the objects the network was classifying. They hypothesized that a neural network, as opposed to a routing algorithm, would better accomplish the goal of reweighting the coefficients used to determine the agreement between capsule layers. This method was experimented with by [25], though they were unable to produce any significant results. Additionally, they hypothesized that for data more complex than MNIST, deeper networks may be required. We agree with these last two hypotheses, and for our experiments, (1) we use a neural network approach, rather than a routing approach, when transforming between capsule layers, and (2) we use deeper networks for the more complex (than MNIST) ImageNet classification dataset.
Fang et al. [26] applied a capsule network to the task of protein gamma-turn prediction, rather than to image classification, the first such application of capsule networks in the bioinformatics domain. Novel to their experiments is that they prepended the capsules portion of the network with an inception block in the style of Szegedy et al. [13] rather than a simple convolution. They achieved a new state-of-the-art performance on the GT320 benchmark [27] for gamma-turn prediction with an MCC (Matthews correlation coefficient, the metric used for this task) of 0.45, beating the previous state of the art of 0.38.
To the best of our knowledge, prior to our work, no convolutional neural network employing capsules had been trained on and for the full ImageNet ILSVRC 2012 classification challenge dataset. This is likely because, when using matrix multiplication between layers of capsules, the number of parameters becomes intractable on currently available hardware.
One of the best performing CNNs published in the past several years that has been trained on and for the ImageNet classification dataset is Inception v3 [13]. To train their architecture, the authors used the RMSProp optimizer, which is indeed designed to be an adaptive gradient descent method. However, they set the ε parameter under the radical in the denominator of the per-parameter adaptive term to 1.0:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t \tag{1}$$

where $v_t$ is the exponential moving average of the past squared gradients for the parameter and $g_t$ is the current gradient.
The intended purpose of the ε parameter is to provide numeric stability by mitigating the danger of division by zero, and thus implementations default this value to 1×10^-10, which would create a range of possible values for the per-parameter adaptive term of 0 to 1×10^5. By using a value of 1.0, they limit this range to 0 to 1, setting an upper bound five orders of magnitude less than intended for this term. While still technically adapting each parameter, the range of adaptation is so dampened that we would characterize RMSProp with an ε of 1.0 as quasi-adaptive at best. As such, we agree with Chen and Gu [10] that effectively utilizing (truly) adaptive gradient descent methods with convolutional neural networks remains an open problem relative to Inception v3.
The Adam optimizer [9] has an analogous per-parameter adaptive term based on the past squared gradients (in addition to another term, not relevant to this discussion, for past gradients, which gives Adam a momentum-like behavior):

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \tag{2}$$

where $\hat{v}_t$ is the bias-corrected exponential moving average of the past squared gradients for the parameter and $\hat{m}_t$ is the aforementioned momentum-like term.
Here again, implementations default the ε to 1×10^-10. Since in Adam the ε is moved out from underneath the radical, Adam is able to adapt each parameter by five orders of magnitude more than RMSProp (with a range of 0 to 10^10). In this paper, except when establishing baseline results for the Inception v3 network, we use the Adam optimizer with the implementation default ε of 1×10^-10.
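The effect of the ε placement is easy to verify numerically. The following sketch evaluates each adaptive term at its bound as $v_t \to 0$ (a check of the formulas above, not the cited authors' code):

    import numpy as np

    v = 0.0  # the EMA of squared gradients; each bound is reached as v -> 0

    print(1 / np.sqrt(v + 1.0))      # 1.0:  RMSProp as configured in [13] (eps = 1.0 under the radical)
    print(1 / np.sqrt(v + 1e-10))    # 1e5:  RMSProp with the default eps under the radical
    print(1 / (np.sqrt(v) + 1e-10))  # 1e10: Adam, with eps outside the radical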

III. CAPSULE LAYERS CONFIGURATION
Sabour et al. [2] proposed two final layers of capsules: the first has 8 dimensions shaped as a vector, and the second has 16 dimensions, also shaped as a vector. The transformation between the two layers of capsules is a typical matrix multiplication, wherein every pair of capsules has an associated 16×8 matrix of trainable parameters that is multiplied by the 8-dimensional vector capsule, with the results summed to form the input to the 16-dimensional capsule. In (3), an equivalent transformation, simplified to two and four dimensions for clarity, is presented:

$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \\ w_{4,1} & w_{4,2} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \tag{3}$$

A problem with this transformation becomes apparent when viewing it as an overdetermined system of linear equations in matrix form: every dimension in the second layer of capsules beyond the dimensions in the first layer is at best redundant and, more probably, due to the random initialization of the weights, a challenge for the optimization algorithm used during back-propagation, which must reconcile multiple differing losses derived from each activation in the previous layer.
Also, it should be noted that each dimension of the second layer of capsules is a linear combination of all dimensions of the first layer of capsules. This is a desirable property in a fully connected layer in a neural network. However, given the interpretation and empirical verification in the work of Sabour et al. [2] of the dimensions of a capsule as being distinct features of a given sample, it is our hypothesis that this entangling of distinct features from one layer into all features of the next layer is an undesirable property.
In their follow-up work, Hinton et al. [3] switched to using an equivalent number of dimensions in neighboring capsule layers, though they did not cite alleviating the problem of an overdetermined system as their motivation for doing so. Additionally, they shaped their capsules as matrices rather than vectors. The authors noted that this reshaping reduced the number of trainable parameters (for every pair of capsules) from the product of the dimensions of the two layers of capsules to the number of dimensions of a single layer. This method of matrix capsules requires that the number of dimensions in neighboring layers be both equivalent and a perfect square. In (4), an equivalent transformation, simplified to four dimensions, is presented:

$$\begin{bmatrix} v_{1,1} & v_{1,2} \\ v_{2,1} & v_{2,2} \end{bmatrix} = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{bmatrix} \begin{bmatrix} u_{1,1} & u_{1,2} \\ u_{2,1} & u_{2,2} \end{bmatrix} \tag{4}$$

In addition to alleviating the problem of an overdetermined system and significantly reducing the number of trainable parameters, this formulation results in only the square root of the total number of features in the first layer being entangled with each feature in the second layer.
We propose a new method for the transformation from one layer of capsules to the next. Rather than the typical transformation matrix, the proposed method uses a transformation vector, and rather than the typical matrix multiplication, it uses the Hadamard product (element-wise multiplication). This method is shown in (5), simplified to four dimensions for clarity:

$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \end{bmatrix} \odot \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix} = \begin{bmatrix} w_1 u_1 \\ w_2 u_2 \\ w_3 u_3 \\ w_4 u_4 \end{bmatrix} \tag{5}$$

This method goes back to using vectors for the shape of the capsules and requires that neighboring layers of capsules be of equivalent dimension; thus we call these homogeneous vector capsules. With the constraint of requiring equivalent dimensions in the capsule layers, this method comes with the following benefits (see the sketch after this list):
1) Because this method uses the Hadamard product rather than typical matrix multiplication, the drawback of using the more intuitive vector shape for a capsule is removed, as the number of trainable parameters per pair of capsules stays equal to the number of dimensions in the capsules (as in Hinton et al. [3]), rather than being that number of dimensions squared (as in Sabour et al. [2]).
2) By the nature of the Hadamard product, this method cannot suffer from the problem of an overdetermined system.
3) It fully disentangles features in the dimensions of the first layer of capsules from differing dimensions in the subsequent layer of capsules; i.e., each dimension in the first layer maps to one and only one dimension in the second layer.
4) It eliminates all of the addition operations used in matrix multiplication, for a modest reduction in computational cost.
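To make the shapes concrete, here is a minimal NumPy sketch of an HVC classification head under our reading of Fig. 2 and (5); the aggregation over input capsules and all names are our own assumptions for illustration, not the published implementation:

    import numpy as np

    def hvc_transform(U, W):
        # U: (j, n) input capsules reshaped from the final feature maps
        # W: (j, y, n) one trainable n-dimensional weight vector per capsule pair
        # Hadamard product per pair, then (we assume) a sum over the input capsules
        return np.einsum('jn,jyn->yn', U, W)  # (y, n) output capsules

    j, n, y = 512, 8, 1000                    # the monolithic CNN's configuration
    U = np.random.randn(j, n)
    W = np.random.randn(j, y, n)
    V = hvc_transform(U, W)                   # one 8-dimensional capsule per class
    logits = np.linalg.norm(V, axis=-1)       # Euclidean norms feed the softmax
    # Trainable parameters per capsule pair: n (= 8) here,
    # versus 16*8 = 128 in [2] and 16 in [3].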

IV. EXPERIMENTAL SETUP AND RESULTS
For the first of the two sets of experiments we conducted, we wanted to test homogeneous vector capsule performance on the ImageNet ILSVRC 2012 classification dataset using a simple monolithic CNN (as opposed to one that uses Inception blocks [12][13] or residual connections [14][15]). We adopted the data augmentation and random cropping methods used in the Inception v3 experiments [13], but with a smaller crop of 224×224 as used in the VGG experiments [11]. This smaller crop size was used to facilitate faster execution per epoch, enabling more epochs to be tested. We designed two experiments that each used the same stem of operations described in TABLE I. Both experiments used the Adam optimizer with the proposed defaults [9], batch normalization for every set of weights, ReLU activations, softmax classification, and cross entropy loss. The difference between the two experiments was in the layers that followed the convolution and pooling operations. In the first experiment, after flattening the output of the final convolutional layer, we applied a 50% dropout rate [28] prior to the final classification output of 1,000 neurons. In the second experiment, we instead used two layers of 8-dimensional (n=8 in Fig. 2) homogeneous vector capsules, the first of which consisted of 512 capsules (j=512 in Fig. 2; a number that results directly from reshaping the output of the previous layer when using n=8) and the second of which consisted of 1,000 8-dimensional capsules (y=1000 in Fig. 2; one for each class). For this experiment, prior to the softmax classification, the capsules in the final layer were reduced to their Euclidean norms.

As shown in TABLE II and Fig. 3, the first of these two experiments resulted in a maximum top-1 accuracy of 28.19% and top-5 accuracy of 52.41% after only 19 epochs, before overfitting began to occur. This experiment was stopped after 65 epochs of execution, as continued use of computational resources seemed unwarranted. The second experiment achieved a top-1 accuracy of 45.91% and top-5 accuracy of 70.96% after 326 epochs, and no overfitting had yet occurred. As such, continued execution had the potential to reach an even higher maximum; however, the moving average over the last 50 epochs showed only a 0.3% improvement, so continued use of computational resources seemed unwarranted. Clearly, this accuracy falls short of current state-of-the-art results for the ImageNet classification dataset, nor would we expect to approach the state of the art with such a simple monolithic CNN with only 5.5M parameters. Rather, this result demonstrates that homogeneous vector capsules enable an adaptive gradient descent method (Adam) to achieve a higher accuracy and avoid the generalization gap associated with using adaptive gradient descent methods with CNNs as described in [7], while using a nearly identical number of parameters (see Table III). The differences are due to the fact that every neuron in the fully connected layer has a trainable bias parameter, adding 1,000 parameters to the models using fully connected layers, whereas every dimension in a pair of HVC layers has two trainable batch normalization parameters.
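As a back-of-the-envelope check on those parameter counts, the weight totals for the two classification heads can be computed directly (a sketch; the batch normalization accounting is omitted, and the variable names are ours):

    # The monolithic CNN's head: j*n = 4096 flattened values feeding 1,000 classes.
    j, n, y = 512, 8, 1000

    fc_params = (j * n) * y + y   # 4,097,000: weights plus one bias per output neuron
    hvc_params = j * y * n        # 4,096,000: one n-dimensional vector per capsule pair

    print(fc_params - hvc_params)  # 1,000 fewer weight/bias parameters with HVCs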
For the second set of experiments we conducted, we wanted to test homogeneous vector capsule performance using the Inception v3 network architecture, one of the best performing CNNs published in the past several years. We attempted to recreate the architecture described in [13] as faithfully as possible while referencing the associated code published on GitHub. We conducted two baseline experiments (experiments 1 and 2 in TABLE V), identical in all aspects aside from the learning rate decay schedule. The first of the two baselines used the schedule published in [13]. The second baseline used an alternate schedule not published in the paper, but published with the GitHub code, that produced marginally better accuracies. Both baselines achieved slightly lower accuracies for single-model results than reported in [13].
Why our recreation resulted in this slightly lower accuracy is an open question, but the following hypothesis may explain the discrepancy. As is standard practice when evaluating the performance of a single-model neural network, evaluations in our experiments used only one set of values for the parameters, as they existed after the completion of a given epoch of training. However, in their paper the authors state that "[m]odel evaluations are performed using a running average of the parameters computed over time" without providing a formula for that "running average" (simple? exponential?). So, while reported as single-model results, the authors may have actually discovered something similar to Snapshot Ensembling [29] (1.5 years prior to its publication), which averages (or takes a majority vote of) the model's past parameter values. (See also Fast Geometric Ensembling and Stochastic Weight Averaging, methods that build on the success of Snapshot Ensembling [30][31].)

Our first alteration to the baseline (experiment 3 in TABLE V) involved no change to the network architecture; instead of training with the RMSProp optimizer and either learning rate decay schedule, we used the Adam optimizer with default settings. As expected, this achieved test accuracies much lower than the baselines, confirming prior work showing the generalization gap associated with adaptive gradient descent methods when applied to CNNs (see Fig. 6 and Fig. 7).
Notes for TABLE V:
a) Inception v3 implemented as described in [13].
b) Inception v3 implemented as described in [13], but using a learning rate decay schedule not mentioned in the paper but published on their GitHub site, consisting of a base learning rate of 0.1 exponentially decayed by 0.16 every 30 epochs.
c) Inception v3 implemented as described in [13], but replacing the RMSProp optimizer and learning rate decay schedule with the Adam optimizer using default settings.
d) Inception v3 implemented as described in [13], but replacing the final fully connected layer with HVCs and replacing the RMSProp optimizer and learning rate decay schedule with the Adam optimizer using default settings.
e) Inception v3 implemented as described in [13], but replacing the final fully connected layer with HVCs and replacing the RMSProp optimizer and learning rate decay schedule with the Adam optimizer using a base learning rate of 0.001 exponentially decayed by 0.96 per epoch, with a minimum base learning rate of 1×10^-6.
For the final two experiments (experiments 4 and 5 in TABLE V), we altered both classification outputs of the Inception v3 architecture to use homogeneous vector capsules rather than fully connected classification layers. For the auxiliary classifier, we replaced the flatten operation, the output neurons, and the fully connected weights between them with a set of homogeneous vector capsules following the preceding convolutional layer's output. We used 16 8-dimensional capsules (j=16 in Fig. 2; a number that results directly from reshaping the output of the previous layer when using n=8) and 1,000 8-dimensional capsules (y=1000 in Fig. 2; one for each class). For the main classifier, we replaced the flatten operation, the dropout, the output neurons, and the fully connected weights between them with a set of homogeneous vector capsules following the preceding convolutional layer's output. We used 256 8-dimensional capsules (j=256 in Fig. 2; again a result of reshaping the previous layer's output with n=8) and 1,000 8-dimensional capsules (y=1000 in Fig. 2). For these experiments, prior to the softmax classification, the capsules in the final layer were reduced to their Euclidean norms. Both the baseline architecture and the architecture modified to use HVCs have 24.45M trainable parameters; using HVCs negligibly decreases the number of trainable parameters, by less than 0.01% (see Table VI).

Both of these final two experiments used the same network architecture and were trained with the Adam optimizer. The first (experiment 4 in TABLE V) used the default settings for the Adam optimizer. For the second (experiment 5 in TABLE V), we adopted a simple base learning rate decay schedule wherein the initial default learning rate η of 0.001 was exponentially decayed by 0.96 every epoch until reaching a minimum of 1×10^-6.
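That schedule is simple enough to state in a few lines (a sketch; we assume one decay step per epoch, as the text describes, and the function name is ours):

    def base_learning_rate(epoch, initial=1e-3, decay=0.96, floor=1e-6):
        """Exponentially decayed base learning rate with a hard floor."""
        return max(initial * decay ** epoch, floor)

    # epoch 0 -> 0.001, epoch 50 -> ~1.3e-4, and from roughly epoch 170 onward -> 1e-6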
The HVC-altered network, when using the default learning rate without decay, slightly underperformed the baseline experiment that used the learning rate decay schedule published in [13], in both top-1 (by 0.06%) and top-5 (by 0.01%) accuracies. It also slightly underperformed the baseline experiment that used the learning rate schedule found on GitHub, in both top-1 (by 0.27%) and top-5 (by 0.25%) accuracies.
The HVC-altered network, when using the decaying learning rate, outperformed the baseline experiment that used the learning rate decay schedule published in [13], in both top-1 (by 0.46%) and top-5 (by 0.15%) accuracies. It also outperformed the baseline experiment that used the learning rate schedule found on GitHub in top-1 accuracy (by 0.27%) but underperformed it in top-5 accuracy (by 0.09%). It thus outperformed the baselines on three of four metrics and came within 0.09% on the fourth.

V. DISCUSSION
The addition of convolutional layers to neural networks resulted in considerably better performance on image classification tasks compared to networks composed entirely of fully connected layers [4]. This is correctly attributed to the convolutional layers' ability to extract localized features that are more complicated than a single pixel. The feature extractors do this by assigning meaning to the spatial relationships among pixels that are close to each other. Such meaning is absent when using fully connected layers: as the term "fully connected" implies, every pixel is able to be associated with every other pixel, without regard to their relative positions in the image. Giving meaning to spatial relationships among the pixels can be understood as enforcing constraints upon which neurons are allowed to be associated with each other via trainable parameters. Understood in this way, the success of convolutional neural networks can be seen as resulting, in part, from applying constraints on which neurons are allowed to affect other neurons in the next layer.
We interpret homogeneous vector capsules as performing a similar function at the output stage of a convolutional neural network as convolutional layers perform at the input stage. In the traditional design of the classification stage of a CNN, as depicted in Fig. 1, every neuron is able to adapt independently during back-propagation. We hypothesize that this fact, combined with the fact that adaptive gradient descent methods adapt independent learning rates for every parameter, imparts two orders of adaptability, or, stated another way, too much freedom to adapt to the training data. This would indeed result in overfitting and a generalization gap, as has been observed when using adaptive gradient descent with CNNs. By reshaping the output of the final convolutional layer into vectors and then connecting those vectors to a classification layer also composed of vectors, we are constraining groups of n-dimensional vectors of neurons to train together.

Additionally, our experiments showed that when using HVCs, comparable results are achieved using the Adam optimizer with or without a decaying base learning rate, though the training progressed differently in the two cases. By slowly decaying the default base learning rate of 0.001 (by a factor of 0.96 per epoch; experiment 5 in TABLE V), we observed that the loss and accuracy at each epoch were similar to those values when training the baseline models. Additionally, both baselines and the HVC-enabled network with the slowly decaying learning rate had clearly converged by around 150 epochs, though we let the models train for up to 175 epochs. Without the decaying base learning rate (experiment 4 in TABLE V), loss and accuracy were similar in earlier epochs but exhibited greater variance across later epochs. As such, we let this experiment train for an additional 50 epochs. It achieved its best accuracies at epoch 175 for top-1 and epoch 211 for top-5. The plot in Fig. 8 clearly shows that without the decaying learning rate, the variance in the test loss across epochs was higher.
The higher variance and slower convergence when not using a decaying base learning rate can be understood as yet another case in which applying constraints proves beneficial. With adaptive gradient descent methods, there is a base learning rate η that is the same for all parameters and a separate per-parameter learning rate that is adapted based on previous gradient updates to that parameter. The two are multiplied together to determine each parameter's actual update. With the Adam optimizer, the suggested base learning rate η is 0.001 and the range of possible values for the per-parameter update is 0 to 10^10. After being multiplied together, this gives a range of possible per-parameter updates of 0 to 10^7. With the decay scheme we used, we started with each parameter having a range of possible update values of 0 to 10^7, which we gradually reduced over the epochs of training to 0 to 10^4. This is similar to, but far less extreme (by four orders of magnitude) than, the dampening effect caused by using a large ε in the denominator of the per-parameter term of an adaptive gradient descent method (contra its intended purpose). Further, this dampening is applied gradually over time as the parameter values descend the loss landscape, rather than statically for the duration of training (as in the case of a large ε).
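The bounds above follow from straightforward multiplication; a quick numeric check (all values taken from the text):

    adaptive_max = 1e10   # Adam's per-parameter term bound with eps = 1e-10 outside the radical

    print(1e-3 * adaptive_max)  # 1e7: largest per-parameter update at the default base learning rate
    print(1e-6 * adaptive_max)  # 1e4: largest update once the base learning rate reaches its floor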

VI. CONCLUSION
In general, we hypothesize that fully connected layers of scalar-valued neurons are indeed "misguided" (as per Hinton et al. in [1]). Specifically, using them after the convolutional layers in a CNN works against the goal of preserving meaning in the spatial relationships within the features of an image. The first layer of capsules in a pair of HVCs groups outputs from the preceding convolutional layer together, preserving the spatial relationships that have been learned as meaningful. By "routing" them to a second layer of capsules via trainable vectors, groups of capsules (the first layer of HVCs) that have preserved feature extractions from the convolutional layers are allowed to learn when they should be associated with each other to make a classification prediction (the second layer of HVCs).
In this paper, we have shown that adopting HVCs for classification, rather than classifying with a fully connected layer, achieves significantly superior classification accuracy in a simple monolithic CNN (a 63% improvement in top-1 classification accuracy and a 35% improvement in top-5 classification accuracy over the baseline architecture) and achieves comparable classification accuracy in a more complex CNN (Inception v3). For both models, this was accomplished without adding to the parameter count of the model (indeed, with <0.01% fewer parameters when using HVCs).
This enables convolutional neural network researchers to:
1) Use adaptive gradient descent methods when training CNNs without experiencing a generalization gap.
2) Save the time and compute cycles otherwise spent searching for the best learning rates and learning rate decay schedules for a non-adaptive gradient descent method, and instead use an adaptive gradient descent method that does not require this fine-tuning.

Fig. 4. The simple monolithic network's test loss, trained with Adam, before and after introducing homogeneous vector capsules. Prior to the introduction of HVCs, test loss reaches a minimum after 19 epochs and then a large generalization gap emerges (the top orange line in the graph). After the introduction of HVCs, the network's test loss continues to improve, even after more than 300 epochs (the bottom blue line in the graph).

Fig. 5. The simple monolithic network's top-5 test accuracy, trained with Adam, before (the bottom orange line in the graph) and after (the top blue line in the graph) introducing homogeneous vector capsules.

Fig. 6. The plot of training loss for the first 30 epochs of the Inception v3 experiments shown in TABLE V. The top light-blue line (experiment 3 in TABLE V) shows the Inception v3 baseline architecture (without HVCs) trained with the Adam optimizer using default settings. The other four experiments all exhibit better loss and less variance.

Fig. 7. The plot of top-5 test accuracy for the first 30 epochs of the Inception v3 experiments shown in TABLE V. The bottom light-blue line (experiment 3 in TABLE V) shows the Inception v3 baseline architecture (without HVCs) trained with the Adam optimizer using default settings. The other four experiments all exhibit better accuracy and less variance.

Fig. 8. Test loss of the Inception v3 network with HVCs, trained with the Adam optimizer. The top green line (experiment 4 in TABLE V), showing more variance, used the Adam optimizer with the default settings and no learning rate decay schedule. The bottom blue line (experiment 5 in TABLE V) also used the Adam optimizer with the default settings, but with a slowly decaying base learning rate.

TABLE I. THE STEM OF COMMON OPERATIONS USED IN THE SIMPLE MONOLITHIC CNN. (*Max-pooling is a sub-sampling operation that involves no trainable parameters.)

TABLE II. COMPARISON OF SIMPLE MONOLITHIC NETWORK PERFORMANCE. RESULTS ARE REPORTED FOR SINGLE-CROP, SINGLE-MODEL EXPERIMENTS.

TABLE V. COMPARISON OF INCEPTION V3 [13] NETWORK PERFORMANCE. RESULTS ARE REPORTED FOR SINGLE-CROP, SINGLE-MODEL EXPERIMENTS USING THE NON-BLACKLISTED SUBSET OF VALIDATION IMAGES IN THE IMAGENET ILSVRC 2012 CLASSIFICATION DATASET.