Improved Bilinear Pooling With Pseudo Square-Rooted Matrix

Bilinear pooling is a feature aggregation step applied after the convolutional layers of a deep network; it encodes a matrix of local features into a fixed-size bilinear representation and improves performance in many image classification tasks. Since its emergence, this pooling has seen two major improvements: Compact Bilinear Pooling (CBP) and square-root normalization. Recently, the combination of these two elements has been widely studied. However, due to the lack of good normalization solutions, existing combination approaches are less efficient when plugged into different networks and less compatible with existing CBP techniques. To solve this problem, in this paper, we propose to apply Newton iterations, a fast square-root normalization method, to produce a new normalized matrix called the pseudo square-rooted matrix. This new matrix can then be encoded by a CBP technique into a compact and normalized bilinear representation. In order to further accelerate the normalization process, our approach has two variants which can handle feature matrices extracted by different networks. Tested on three fine-grained image classification datasets, it provides competitive classification performance while consuming less computational time than prior works.

Given N local feature vectors of dimension D extracted by the convolutional layers and stacked in a matrix X ∈ R^{D×N}, bilinear pooling encodes them in a D × D matrix as:

A = (1/N) XX^T + εI    (1)

where ε is a small positive value, e.g. 1e-7, added to the diagonal values of the matrix A. Compared with average or max pooling, widely used in CNNs, bilinear pooling records richer second-order statistics between the different feature dimensions [2]. Furthermore, it is an orderless pooling approach and is robust to spatial variance of the inputs. With these advantageous properties, bilinear pooling has shown its effectiveness for multiple computer vision tasks, such as texture synthesis [3], style transfer [4], segmentation [5] or visual question answering [6], [7]. In particular, bilinear pooling shows superior performance for fine-grained image classification [8], on which we concentrate in our experiments.
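As a minimal illustration of (1), the following NumPy sketch (with illustrative VGG-like sizes, not the authors' code) pools a D × N matrix of local features into the D × D bilinear representation:

```python
import numpy as np

def bilinear_pool(X, eps=1e-7):
    """Bilinear pooling of local features X (D x N) into a D x D matrix,
    following (1): A = (1/N) X X^T + eps * I."""
    D, N = X.shape
    return (X @ X.T) / N + eps * np.eye(D)

# Toy example: 512-dimensional features at 28 x 28 spatial locations (VGG-like).
X = np.random.randn(512, 28 * 28)
A = bilinear_pool(X)          # (512, 512) symmetric positive definite matrix
```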
Following the pioneering work of [8], some works such as [9], [10] extended bilinear pooling to reach higher performance, but those studies are out of our scope. Instead, in this paper, we focus on two improvements dedicated to the pooling itself, i.e. compact bilinear pooling and square root normalization.

B. Compact Bilinear Pooling
Original bilinear representations are of cumbersome size. In [8], since the feature dimension D is equal to 512 for the VGG-16 network, A ∈ R^{D×D} has more than 250,000 elements. Such a big representation is not practical in many respects; one of them is that it leads to a heavy classifier, containing numerous parameters to be trained. Even worse, for ResNet-50 [11], D is equal to 2048. Therefore, a more compact bilinear representation should be considered. Gao et al. applied existing kernel approximations, such as Random Maclaurin and Tensor Sketch, to produce compact bilinear representations [12]. They showed that the size can be 32 times smaller while keeping almost identical performance. Kong and Fowlkes considered a low-rank bilinear SVM to run classification based on bilinear representations [13].
The key advantage of the latter approach is that it avoids explicitly computing the bilinear representation.
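To give an intuition of how CBP avoids building the full D × D matrix, here is a compact NumPy sketch of the Tensor Sketch variant used in [12]; the projection dimension and feature sizes are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

def tensor_sketch_cbp(X, d=8192, seed=0):
    """Tensor Sketch approximation of sum-pooled outer products (CBP, [12]).
    X: D x N local features; returns a d-dimensional compact bilinear vector."""
    D, N = X.shape
    rng = np.random.RandomState(seed)
    # Two independent count-sketch hash functions (bin index h and sign s).
    h = [rng.randint(0, d, size=D) for _ in range(2)]
    s = [rng.choice([-1.0, 1.0], size=D) for _ in range(2)]

    def count_sketch(X, h, s):
        C = np.zeros((d, N))
        np.add.at(C, h, s[:, None] * X)   # scatter-add feature rows into d bins
        return C

    C1 = count_sketch(X, h[0], s[0])
    C2 = count_sketch(X, h[1], s[1])
    # Circular convolution of the two sketches via FFT, per spatial location.
    P = np.fft.ifft(np.fft.fft(C1, axis=0) * np.fft.fft(C2, axis=0), axis=0).real
    return P.sum(axis=1)                  # sum over the N locations

phi = tensor_sketch_cbp(np.random.randn(512, 784))   # 8192-d instead of 512^2
```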

C. Square Root Normalization
Besides the size issue, the representation A is a Symmetric Positive Definite (SPD) matrix. It lies on a Riemannian manifold, and training a linear classifier in such a non-Euclidean space, as done by the original work [8], is clearly sub-optimal. In order to map the SPD matrix into a Euclidean space, multiple works [14], [15], [16] suggested normalizing A into its square-rooted matrix A^{1/2}, such that A = A^{1/2}A^{1/2}. Among these, one solution relies on Newton iterations, which speed up this normalization process on GPUs [15]. Li et al. further introduced Newton iterations into the gradient propagation of the normalization to achieve a faster training process [16].
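For reference, the coupled Newton (Newton-Schulz) iterations behind this normalization can be sketched in a few lines of NumPy; the trace pre-normalization and the helper name are illustrative choices, not taken verbatim from [15], [16]:

```python
import numpy as np

def newton_sqrt(A, num_iters=5):
    """Coupled Newton-Schulz iterations approximating A^{1/2} and A^{-1/2}
    for an SPD matrix A (pre-normalized by its trace so the iterations converge)."""
    D = A.shape[0]
    I = np.eye(D)
    norm = np.trace(A)
    Y = A / norm          # Y_k converges to (A / norm)^{1/2}
    Z = I.copy()          # Z_k converges to (A / norm)^{-1/2}
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm), Z / np.sqrt(norm)   # A^{1/2}, A^{-1/2}
```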

D. Combination
Compact bilinear pooling and square root normalization improve bilinear pooling in different manners, but it is not trivial to combine them [17], [18]. To solve this combination problem, there exist several works whose workflows and properties are respectively summarized in Fig. 1 and Table I. Instead of using existing CBP techniques, MPN-COV [19] applies a 1×1 convolutional layer to reduce the feature dimension before bilinear pooling. The resultant compact representation is then normalized via Newton iterations, a fast way to obtain the square-rooted matrix of an SPD matrix. The problem of such an approach is that the convolutional layer for dimension reduction has to be pre-trained (as well as the whole network) on a large-scale dataset such as ILSVRC2012. Gou et al. [18] proposed an approach called MoNet that normalizes the feature vectors X into L ∈ R^{D×N} so that A^{1/2} = LL^T. Then, the matrix L is transformed into a compact representation of A^{1/2} with an existing CBP technique [12], [20]. Unfortunately, this approach needs a Singular Value Decomposition (SVD), which is not efficient in deep networks. In order to solve the SVD issue, iPCCP proposes to produce a compact and normalized bilinear representation via Newton iterations [21]. The authors calculate the square root of the Gram matrix, i.e. X^T X, and use it as an additional input of the following CBP module. However, this is one inconvenience of iPCCP because it requires re-implementing every existing CBP technique. Moreover, iPCCP runs efficiently only when N ≪ D. Another CBP technique, SRM [20], needs smaller binary (+1, −1) projection matrices and makes the CBP faster. To go one step further, its authors propose SRM+, which is able to encode a compact normalized bilinear representation with SRM. However, it requires processing the matrix A^{1/2}, which cannot be derived efficiently from A when the feature dimension D is large.
As summarized in Table I, all these approaches only partially solve the combination problem and there is still room for a better solution. We need a solution that can use a general CBP module, does not rely on any SVD, and is efficient both when N ≪ D and when N ≫ D. To this end, in this paper, we introduce a new normalized matrix, the pseudo square-rooted matrix, which is compatible with most existing CBP techniques and produces a compact and normalized representation. To calculate our pseudo square-rooted matrix more efficiently, we leverage the normalization approach from [22], which was applied to the Fisher score representation. Different from the original one, the new approach has two variants and is more flexible: it stays efficient whether the input features satisfy N ≪ D or N ≫ D. We test our approach on three fine-grained image classification datasets and three deep networks. The results show that our approach achieves competitive classification accuracy and runs more efficiently than other state-of-the-art methods. Lastly, we run an experiment to visualize the normalized features as heatmaps.
Our contributions are multiple:
• We propose a complete solution that combines compact bilinear pooling with matrix normalization.

II. OUR APPROACH
For simplicity, in the rest of the paper, we denote (1/√N)X as X and omit the small term εI, so that (1) becomes A = XX^T. Given the matrix X, its singular value decomposition (SVD) is expressed as:

X = QΣV^T

where Q ∈ R^{D×D} and V ∈ R^{N×N} are unitary matrices and Σ ∈ R^{D×N} is a rectangular diagonal matrix. Inspired by [18], we look for a matrix L such that A^{1/2} = LL^T, but we define it differently as:

L = X^{1/2}_pseudo = QΣ^{1/2}_pseudo V^T

where Σ^{1/2}_pseudo is calculated by square rooting the diagonal elements of Σ. Since X is not an SPD matrix, there is no square-rooted matrix X^{1/2}. But, like square root normalization, we scale its singular values (the diagonal elements of Σ) with a square root and obtain a normalized matrix X^{1/2}_pseudo. Hence, we call X^{1/2}_pseudo the pseudo square-rooted matrix.
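To make this definition concrete, the small NumPy sketch below (toy sizes, not the authors' code) builds X^{1/2}_pseudo from the SVD and checks that LL^T recovers A^{1/2}:

```python
import numpy as np

D, N = 64, 200
X = np.random.randn(D, N) / np.sqrt(N)             # already rescaled by 1/sqrt(N)

# SVD of X: Q (D x D), singular values, V^T (N x N).
Q, sv, Vt = np.linalg.svd(X, full_matrices=True)

# Rectangular diagonal Sigma and its pseudo square root (square-rooted diagonal).
Sigma = np.zeros((D, N)); np.fill_diagonal(Sigma, sv)
Sigma_pseudo = np.zeros((D, N)); np.fill_diagonal(Sigma_pseudo, np.sqrt(sv))

L = Q @ Sigma_pseudo @ Vt                           # pseudo square-rooted matrix

# Check: L L^T equals A^{1/2}, the SPD square root of A = X X^T.
A = X @ X.T
U, w, _ = np.linalg.svd(A)
A_sqrt = U @ np.diag(np.sqrt(w)) @ U.T
print(np.allclose(L @ L.T, A_sqrt))                 # True (up to numerical error)
```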
Let us now explain how to efficiently obtain X^{1/2}_pseudo from X ∈ R^{D×N} in the two cases N ≫ D and N ≪ D. Since the fast Newton iterations cannot be applied to non-SPD matrices such as X, we use a simple alternative: we apply them twice to either XX^T or X^T X.

A. Case 1 (N ≫ D)
As illustrated in Fig. 1 (Ours, left), we start by evaluating A = XX^T. Using the SVD notations, we have:

A = XX^T = QΣΣ^T Q^T = QΣ̄HH^T Σ̄Q^T = QΣ̄^2 Q^T

where a temporary matrix H = [I_D | 0] ∈ R^{D×N} (I_D is the D × D identity matrix) is introduced such that Σ = Σ̄H, with Σ̄ ∈ R^{D×D} diagonal. Then, we apply Newton iterations on the SPD matrix A and the outputs converge to A^{1/2} = QΣ̄Q^T and A^{-1/2} [15]. Next, we apply Newton iterations again on the input A^{1/2} to obtain A^{1/4} and A^{-1/4} = QΣ̄^{-1/2}Q^T. Finally, we have access to X^{1/2}_pseudo according to:

X^{1/2}_pseudo = A^{-1/4}X = QΣ̄^{-1/2}Q^T QΣV^T = QΣ^{1/2}_pseudo V^T

B. Case 2 (N ≪ D)

As illustrated in Fig. 1 (Ours, right), we start by evaluating the Gram matrix B = X^T X ∈ R^{N×N} [23], which is equal to:

B = X^T X = VΣ^T ΣV^T = VΣ̄H^T HΣ̄V^T = VΣ̄^2 V^T

where H = [I_N | 0]^T ∈ R^{D×N} and Σ = HΣ̄, with Σ̄ ∈ R^{N×N} diagonal. Similarly to Case 1, after two consecutive Newton iterations, we obtain B^{-1/4} = VΣ̄^{-1/2}V^T and evaluate X^{1/2}_pseudo as:

X^{1/2}_pseudo = XB^{-1/4} = QΣV^T VΣ̄^{-1/2}V^T = QΣ^{1/2}_pseudo V^T

In both cases, the Newton iterations always process the smaller of the two possible SPD matrices and thus speed up the normalization step. Furthermore, in both cases, X^{1/2}_pseudo (X^{1/2}_pseudo)^T = A^{1/2}. Hence, like MoNet [18], applying CBP techniques to X^{1/2}_pseudo outputs a compact bilinear representation which approximates A^{1/2}.
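Putting the two cases together, a NumPy sketch of the whole procedure could look as follows; the sizes, the iteration count and the trace-normalized Newton-Schulz helper are illustrative assumptions rather than the authors' exact implementation:

```python
import numpy as np

def newton_sqrt(A, num_iters=5):
    """Newton-Schulz iterations: approximations of A^{1/2} and A^{-1/2} for SPD A."""
    I = np.eye(A.shape[0]); norm = np.trace(A)
    Y, Z = A / norm, I.copy()
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm), Z / np.sqrt(norm)

def pseudo_sqrt(X, num_iters=5):
    """Pseudo square-rooted matrix computed with two Newton passes on the
    smaller SPD matrix: XX^T (D x D) if N >= D, X^T X (N x N) otherwise."""
    D, N = X.shape
    if N >= D:                                     # Case 1 (N >> D)
        A = X @ X.T
        A_half, _ = newton_sqrt(A, num_iters)      # first pass:  A^{1/2}
        _, A_m14 = newton_sqrt(A_half, num_iters)  # second pass: A^{-1/4}
        return A_m14 @ X                           # X^{1/2}_pseudo = A^{-1/4} X
    else:                                          # Case 2 (N << D)
        B = X.T @ X
        B_half, _ = newton_sqrt(B, num_iters)
        _, B_m14 = newton_sqrt(B_half, num_iters)  # B^{-1/4}
        return X @ B_m14                           # X^{1/2}_pseudo = X B^{-1/4}

# Toy iSQRT-like case (D=256, N=784) and ResNet-like case (D=2048, N=196).
L1 = pseudo_sqrt(np.random.randn(256, 784) / np.sqrt(784))
L2 = pseudo_sqrt(np.random.randn(2048, 196) / np.sqrt(196))
```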

III. EXPERIMENTS

A. Datasets
We run experiments on three fine-grained image classification benchmarks: Caltech-UCSD Birds-200-2011 [24] (Bird), FGVC-Aircraft Benchmark [25] (Aircraft) and Stanford Cars Dataset [26] (Car). The Bird dataset provides images of 200 bird species, with 5994 training images and 5794 test images. The Aircraft dataset contains 100 aircraft variants with 67 training images and 33 test images per category. The Car dataset is composed of 16185 images of 196 classes of cars; following a roughly 50-50 split per class, there are 8144 training images and 8041 test images. Note that we always make use of the official training-test splits released with the three datasets.

B. Experimental Settings
Models: We use VGG-16 (VGG) and ResNet-50 (ResNet) pretrained on ILSVRC2012 [27] and extract the feature vectors X after the layer relu5_3 in VGG and before the global average pooling in ResNet-50. Li et al. also trained their own ResNet-50 on ILSVRC2012, called iSQRT, where the global average pooling is replaced with MPN-COV [16]. Following the implementation in [21], we extract local features X ∈ R^{256×784} after the last convolutional layer of this network and center each feature vector x_i ∈ X to a zero mean.

Training and evaluation: During training and evaluation, we adopt the same image pre-processing as [16], [21] for fair comparison. On the Bird and Car datasets, we resize input images to 448 × 448 for training and evaluation. For the Aircraft dataset, a center crop of 448 × 448 from a resized image of 512 × 512 is used as input image. The only data augmentation applied during training is a random horizontal flip. At test time, we either average the predictions on the input image and on its flipped version, or take the prediction on the input image alone as the final one. If not specified, all the results are averaged over 5 runs.
For the experiments with the iSQRT backbone, we follow the finetuning strategy from [16], [21]: the optimizer is stochastic gradient descent with a mini-batch size of 10, a weight decay of 0.001 and a momentum of 0.9. The learning rate is 6e-3 for the classifier and 1.2e-3 for the rest of the layers. All the experiments stop after 100 epochs.
For the experiments with ResNet-50 and VGG-16, like other approaches [2], [15], [18], [20], the training phase is composed of two steps. In the first step, we only train the newly added classifier for 100 epochs with a learning rate of 1, a mini-batch size of 32, a weight decay of 1e-5 and a momentum of 0.9, while the parameters of the other layers are frozen. Then, we finetune the whole network for another 100 epochs with a learning rate of 2.7e-2 and a weight decay of 1e-4. For both steps, the learning rate is divided by 10 when the training loss has not dropped below 0.99 times its running minimum for 10 epochs.
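As an illustration of this learning rate rule (not the authors' code), the schedule can be written as a small Python helper; the function name and arguments are hypothetical, and the behavior roughly corresponds to a relative-threshold reduce-on-plateau policy (e.g. PyTorch's ReduceLROnPlateau with mode='min', factor=0.1, patience=10, threshold=0.01, threshold_mode='rel') driven by the training loss:

```python
def plateau_schedule(epoch_losses, lr=1.0, factor=0.1, patience=10, rel_threshold=0.01):
    """Divide the learning rate by 10 whenever the training loss has not dropped
    below 0.99 x its running minimum for `patience` consecutive epochs."""
    best, bad_epochs, lrs = float("inf"), 0, []
    for loss in epoch_losses:
        if loss < best * (1.0 - rel_threshold):   # significant improvement
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # plateau detected
                lr *= factor
                bad_epochs = 0
        lrs.append(lr)
    return lrs                                    # learning rate to use at each epoch
```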

C. Results
Tables II and III display the classification accuracy and statistical significance of our approach with the three backbones, as well as the results provided in the reference papers for comparison. Note that reaching the highest accuracy is not our goal. The comparison with other methods is meant to show that our approach is compatible with various backbones and achieves competitive performance.
Compatibility with different backbones: According to Table II, our method overall outperforms the classical bilinear pooling as well as SRM+ [20]. A probable reason is that SRM+ is not efficient enough to be used with high-dimensional features such as those from ResNet-50 (see Table IV). The poor performance of MPN-COV is probably due to the high number of layers (dimension reduction + classifier) that require more training data to avoid overfitting. iPCCP does not have any reported accuracy on these datasets.

Efficiency: To show the efficiency of related works which produce compact and normalized bilinear representations, we measure the computation time needed to run training/evaluation for one mini-batch with three CBP techniques (see Table IV). The ResNet-50 and iSQRT backbones are selected because their feature vectors before bilinear pooling respectively satisfy the two studied cases: N ≪ D and N ≫ D. When N ≪ D, MoNet and SRM+ are much less efficient because the former runs the SVD and the latter feeds Newton iterations with big SPD matrices of size 2048 × 2048. On the contrary, our solution and iPCCP avoid the SVD computation and the input matrix is much smaller (196 × 196).
When N ≫ D, our approach and SRM+ are clearly more efficient because, for iPCCP, the SPD matrix fed to the Newton iterations is of size 784 × 784, while our approach and SRM+ use smaller SPD matrices (256 × 256).
We notice that each previous approach is efficient in one case at most, while our approach handles both cases thanks to its flexible structure and satisfies the three efficiency criteria reported in Table I.
Number of Newton Iterations: As shown in Table V, 5 iterations keep a good balance between running efficiency and classification performance. The SVD can also produce comparable accuracy, but at the cost of low efficiency.
Visualization: In Fig. 2, we visualize the L2-norm of the features before (X) and after (X^{1/2}_pseudo) normalization for some images. These illustrations suggest that our normalization uncovers more discriminative parts, since the corresponding feature vectors are enhanced. With these more detailed features, the subsequent bilinear representation is more powerful.

IV. CONCLUSION
We propose an approach to efficiently combine compact bilinear pooling and square root normalization, two essential improvements of bilinear pooling. Inspired by the previous approach MoNet, our approach efficiently normalizes the input feature vectors into a pseudo square-rooted matrix. The normalized matrix can then be encoded by most of the existing CBP techniques. In our experiments, our approach shows strong compatibility with different deep networks and needs less computation time than recent alternatives.