Recurrent Affine Transform Encoder for Image Representation

This paper proposes a Recurrent Affine Transform Encoder (RATE) for image representation learning. We propose a learning architecture that enables a CNN encoder to learn the affine transform parameters of images. The architecture decomposes an affine transform matrix into two transform matrices and learns them jointly in a self-supervised manner. RATE is trained on unlabeled image data without any ground truth and infers the affine transform parameters of input images recurrently. The inferred parameters can be used to represent images in canonical form, greatly reducing image variations due to affine transforms such as rotation, scaling, and translation. Unlike the spatial transformer network, RATE does not need to be embedded into another network and trained with the aid of that network's learning objective. We show that RATE learns the affine transform parameters of images and achieves impressive image representation results in terms of invariance to translation, scaling, and rotation. We also show that incorporating RATE into an existing classification model improves classification performance and robustness against distortion.


I. INTRODUCTION
Achieving invariance between well-posed and misaligned images is a desired property in computer vision and many other imaging domains [1]-[6]. Downstream tasks that can benefit from this invariance include object recognition [4], [5], simultaneous localization and mapping (SLAM) [7], image registration [8], and defect removal [9]. The study of image alignment is a long-standing, well-established problem in computer vision [10], [11].
Traditional feature descriptors such as HOG [12] or SIFT [13] measure the correspondence between two images by detecting and matching local features, then pruning mismatches using geometric constraints. These approaches work in many situations but are vulnerable to intra-class variations and non-rigid deformations. Convolutional Neural Network (CNN) based feature descriptors [14] mitigate this constraint by learning more representative features. However, in many cases, robustly measuring the correspondence between a pair of images is difficult, for example under intra-class variability in background, pose, and affine transform. Joint image alignment [15] aims to align a collection of images, which reduces the intra-class variability. The work in [16] first introduced deep learning into unsupervised joint image alignment, using network depth-based features to adjust statistics of the specific data being aligned. Recently, Gradient-Aligned Convolution [17] achieved rotation invariance by applying a pixel-level gradient alignment operation before regular convolution. The attention method in [18] combines spatial and channel attention mechanisms to reduce the spatial variance of the object. Alternatively, the eigenvector approach of [19] applies a scale and orientation correction to images based on the eigenvectors and eigenvalues of the image covariance matrix. In adaptive Gabor convolutional networks [20], the convolutional kernels are adaptively multiplied by Gabor filters to extract invariant information from images.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenbing Zhao.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The spatial transformer network (STN) [4] is used in [2] to tackle the joint image alignment problem on larger datasets with higher variability. STN [4] can spatially transform the input images by embedding the spatial transformer block into a target network or algorithm. Inspired by STN, Inverse Compositional Spatial Transformer Networks (IC-STN) [21] further enhance the alignment ability of the STN by adopting a recurrent transform strategy. The intuition of the aforementioned approaches is to fulfill the target network's learning objectives, such as classification or object recognition. However, those methods alone do not have a training objective and hence cannot learn the image affine transform parameter independently to perform image alignment.
One advantage of STN and IC-STN is their compact network structure: they contain only a CNN encoder. By contrast, other alignment algorithms usually have many parameters and complex structures. For example, [22]-[28] use an autoencoder architecture.
In summary, previous approaches for achieving invariance to affine transforms in images have three limitations: i) Some of these algorithms only contain a spatial invariance module [4], [17]-[21] embedded in a neural network designed for classification, object recognition, or other tasks. As this module has to be trained via the learning objectives associated with those tasks, it is unable to learn the image transform parameters independently. ii) Since this module is trained with learning objectives other than the affine transform, the existing methods only learn the affine transform indirectly or implicitly and hence may not learn the exact affine transform but rather a mixed spatial transform matrix, which may lead to suboptimal affine image alignment. iii) Some of these algorithms require a complicated network structure [22]-[28] instead of a standard CNN.
To address the aforementioned issues, we propose a Recurrent Affine Transform Encoder (RATE) that can be used for unsupervised joint image alignment with a compact network architecture. The proposed RATE provides a learning architecture that enables a CNN encoder to learn the affine transform parameter of images. We express an image $X_o$ as the multiplication of an affine matrix $M_o$ that describes its pose and a canonical image base $X_b$. If we purposely transform the image $X_o$ with a predefined affine matrix $M$, we obtain another transformed image $X_t$. Both $X_o$ and $X_t$ can be expressed as different transformed versions of the same image $X_b$. The proposed learning architecture decomposes the predefined affine transform matrix $M$ into two separate transform matrices $M_o$ and $M_t$ to circumvent the problem of the unknown ground truth of $M_o$ and $M_t$, and hence enables unsupervised learning by self-supervision. The proposed RATE is trained on unlabeled image data without any ground truth. The trained RATE infers the affine transform parameter of input images, which can be used to represent images in canonical form to greatly reduce image variations due to affine transforms such as rotation, scaling, and translation, as shown in Figure 1. We refer to Appendix A for a list of abbreviations.
In the remainder of the paper, we first review the spatial transformer network (STN) in Section II. We introduce the RATE in Section III, while in Section IV, we show classification performance and image alignment results. We further conduct an in-depth ablation study in Section V. We offer concluding remarks in Section VI.
Contributions: In summary, the main contributions of this paper are: (i) We propose a compact network that can learn the affine transform parameters and infer the canonical image base of the dataset in a self-supervised manner. (ii) We demonstrate through extensive experiments that downstream tasks such as image alignment or classification can benefit from the learned canonical image base.

II. BACKGROUND: SPATIAL TRANSFORMER NETWORK
The intuition of the Spatial Transformer Network (STN) [4] is to fulfil learning objectives such as image classification or object recognition by spatially transforming the input images. Figure 2 illustrates the STN for a classification task, where $X_o$ is the original input image, $E_1$ is the spatial transformer block, $M$ is the estimated alignment matrix, $\hat{X}_b$ is the aligned image, and $E_2$ is the classification network. The loss function is $L = \min \|y - \hat{y}\|_2^2$, where $y$ and $\hat{y}$ are the ground truth and predicted image class labels.

III. THE PROPOSED RATE

A. TRANSFORM MATRIX FROM AFFINE PARAMETERS
We use variables $(x, y)$ to represent the spatial coordinates of an image $X$ and define a column vector $\mathbf{x} = (x, y, 1)^T$. Then, an affine transform of an image $X$ by a transform matrix $M$ can be expressed by the matrix multiplication $M\mathbf{x}$. The transform matrix $M$ mixes different affine parameters together, making the physical meaning of the transform unclear. To represent images with clear affine parameters, we propose to construct the transform matrix $M$ from affine parameters in a fixed sequence: rotation $\theta$, horizontal and vertical zooms $(p, q)$, and translations $(x, y)$. We denote a set of affine parameters as $c = \{\theta, p, q, x, y\}$. During training, we can control the affine transform range as rotation $\theta \in [-\varepsilon_\theta, \varepsilon_\theta]$, horizontal and vertical zooms $p, q \in [1 - \varepsilon_{pq}, 1 + \varepsilon_{pq}]$, and horizontal and vertical translations $x, y \in [-\varepsilon_{xy}, \varepsilon_{xy}]$.
Considering all possible combinations of affine transforms, there are many ways to construct the transform matrix from a set of affine parameters. For illustration purposes, we construct the transform matrix $M$ as follows (Section V discusses other ways to construct the transforms, and Appendix B shows how to construct a transform matrix $M$ that includes the skew transform):
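As a hedged illustration of such a construction, the Python sketch below composes $M$ from $c = \{\theta, p, q, x, y\}$. The function name `affine_matrix`, the use of NumPy, and in particular the composition order (rotation applied first, then zoom, then translation, acting on column vectors) are assumptions made for this sketch; the text only states that a fixed sequence is used.

```python
import numpy as np


def affine_matrix(theta_deg, p, q, tx, ty):
    """Build a 3x3 homogeneous matrix from c = {theta, p, q, x, y}.

    Assumption: rotation is applied first, then zoom, then translation,
    so M = T @ S @ R when acting on column vectors (x, y, 1)^T.
    """
    t = np.deg2rad(theta_deg)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t), np.cos(t), 0.0],
                  [0.0, 0.0, 1.0]])
    S = np.diag([p, q, 1.0])
    T = np.array([[1.0, 0.0, tx],
                  [0.0, 1.0, ty],
                  [0.0, 0.0, 1.0]])
    return T @ S @ R


# A 90-degree rotation followed by a unit shift along x.
M = affine_matrix(90.0, 1.0, 1.0, 1.0, 0.0)
# (1, 0) is rotated to (0, 1) and then shifted to (1, 1).
point = M @ np.array([1.0, 0.0, 1.0])
```

Keeping the three factors separate is what lets each entry of $c$ retain a single physical meaning; a different fixed order would give a different, equally valid, parameterization.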

B. DECOMPOSITION OF AFFINE TRANSFORM
An affine transform links two images, one before and one after the transform, but an encoder infers the affine parameters from a single input image. To let the encoder network learn the affine transform, we propose to decompose an affine transform into two parts. We express an image $X_o$ as the combination of an affine matrix $M_o$ that describes its pose and a canonical image base $X_b$. Transforming $X_o$ with a predefined affine matrix $M$ gives a second image $X_t$, which is itself a transformed version of the same base $X_b$ with pose $M_t$ (see Figure 3). Thus, from one image, we generate a pair of images $X_o$ and $X_t$ for training the transform encoder. To map the transform from image space to affine parameters, we encode both $X_o$ and $X_t$ to affine parameters $\hat{c}_o$ and $\hat{c}_t$ using a learned encoder $E$. The estimated affine matrices $\hat{M}_o$ and $\hat{M}_t$ are then constructed from $\hat{c}_o$ and $\hat{c}_t$.
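The decomposition can be written out explicitly. The following is a sketch that assumes transforms act by left multiplication and compose in the order shown; the text does not state the convention, so the ordering of the factors is an assumption:

```latex
\begin{aligned}
X_o &= M_o X_b, \\
X_t &= M X_o = M M_o X_b = M_t X_b, \\
M_t &= M M_o \quad\Longrightarrow\quad M = M_t M_o^{-1}.
\end{aligned}
```

Under this reading, the known matrix $M$ constrains the pair $(M_o, M_t)$ even though neither pose is observed on its own, which is what allows training without ground-truth poses.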

C. LEARNING FRAMEWORK OF THE PROPOSED RATE
The main learning framework of RATE is illustrated in Figure 4. The proposed Recurrent Affine Transform Encoder (RATE) is a small convolutional network with a few convolutional layers and one fully connected layer. The network structure used in this paper is shown in Figure 5 (the exact number of layers and the kernel sizes vary slightly across datasets because of the different image sizes). Algorithm 1 describes the procedure to compute the affine regularization loss $L_{\text{affine}}$. The loss function of the proposed RATE is $L_{\text{affine}} = \min \|M - \hat{M}\|_2^2$.
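Algorithm 1 is not shown in this text, so the following Python sketch is only one plausible reading of the loss. It assumes that $\hat{M}$ is formed as $\hat{M}_t \hat{M}_o^{-1}$, that the norm is the squared Frobenius norm, and that matrices are built from parameters in rotation, zoom, translation order. The names `params_to_matrix` and `affine_loss` are hypothetical.

```python
import numpy as np


def params_to_matrix(c):
    """Hypothetical helper: c = (theta_deg, p, q, tx, ty) -> 3x3 matrix.

    Assumed composition order: M = T @ S @ R (rotate, zoom, translate).
    """
    theta_deg, p, q, tx, ty = c
    t = np.deg2rad(theta_deg)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t), np.cos(t), 0.0],
                  [0.0, 0.0, 1.0]])
    S = np.diag([p, q, 1.0])
    T = np.array([[1.0, 0.0, tx],
                  [0.0, 1.0, ty],
                  [0.0, 0.0, 1.0]])
    return T @ S @ R


def affine_loss(M, c_o_hat, c_t_hat):
    """Squared Frobenius distance between M and M_t_hat @ inv(M_o_hat)."""
    M_o_hat = params_to_matrix(c_o_hat)
    M_t_hat = params_to_matrix(c_t_hat)
    M_hat = M_t_hat @ np.linalg.inv(M_o_hat)  # assumed form of the estimate
    return float(np.sum((M - M_hat) ** 2))


# The predefined transform M is known by construction; the poses are only estimated.
M = params_to_matrix((20.0, 1.0, 1.0, 0.0, 0.0))
good = affine_loss(M, (10.0, 1.0, 1.0, 0.0, 0.0), (30.0, 1.0, 1.0, 0.0, 0.0))
bad = affine_loss(M, (10.0, 1.0, 1.0, 0.0, 0.0), (10.0, 1.0, 1.0, 0.0, 0.0))
```

In this toy case the pair of estimates whose relative rotation matches $M$ gives a loss near zero, while a pair with no relative rotation is penalized; only the predefined $M$ acts as the supervision signal.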

D. RECURRENT INFERENCE
Inspired by the recurrent alignment strategy applied in IC-STN [21], we found that the proposed affine transform encoder can also be applied recurrently (RATE) to improve the alignment performance during inference. Figure 6 illustrates the image alignment inference block (AIB) of the proposed RATE. Within the alignment inference block, the encoder estimates the transform latent vector of the input image, from which the estimated canonical image base is obtained. Figure 7 illustrates the complete image alignment inference process of the proposed RATE. The first estimated canonical image base $\hat{X}_{b1}$ and estimated transform latent vector $\hat{c}_o$ are obtained by feeding the input image $X_o$ to the AIB. The second estimated canonical image base $\hat{X}_{b2}$ and estimated transform latent vector $\hat{c}_{b1}$ are obtained by feeding $\hat{X}_{b1}$ to the AIB again. By repeating this process, we obtain the estimated canonical image base $\hat{X}_{bn}$ and estimated transform latent vector $\hat{c}_{b(n-1)}$. Ideally, $\hat{X}_{bn}$ is the final result we wish to obtain. However, because each affine transform interpolates the image, the resolution degrades with every additional transform. Hence, we calculate the product of the estimated inverse transform matrices $\hat{M}^{-1}$ and apply the affine transform only once to the input image $X_o$, obtaining the aligned image $\hat{X}_{b(\text{all})}$ and reducing the loss of resolution caused by multiple affine transforms.
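The accumulation of inverse transforms can be sketched in Python as follows. This is a simulation under stated assumptions, not the authors' implementation: the image is stood in for by its residual pose matrix, `stub_encoder` is a hypothetical stand-in for the trained encoder that recovers only 80% of the remaining rotation, and only rotation is modeled.

```python
import numpy as np


def rotation(theta_deg):
    """3x3 homogeneous rotation matrix."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t), np.cos(t), 0.0],
                     [0.0, 0.0, 1.0]])


def stub_encoder(pose):
    """Hypothetical encoder: estimates 80% of the residual rotation angle."""
    angle = np.rad2deg(np.arctan2(pose[1, 0], pose[0, 0]))
    return rotation(0.8 * angle)


def recurrent_alignment(pose_o, encoder, n_steps):
    """Accumulate the inverse pose estimates into a single alignment matrix."""
    total = np.eye(3)          # product of all inverse estimates so far
    residual = pose_o.copy()   # pose of the currently aligned image
    for _ in range(n_steps):
        M_inv = np.linalg.inv(encoder(residual))
        total = M_inv @ total        # compose matrices, not warped images
        residual = M_inv @ residual  # simulated result of one alignment step
    return total, residual


pose_o = rotation(40.0)
total, residual = recurrent_alignment(pose_o, stub_encoder, n_steps=4)
remaining_deg = np.rad2deg(np.arctan2(residual[1, 0], residual[0, 0]))
```

Because `total` is a single matrix, it can be applied to the input image in one warp, so the image is interpolated once regardless of the number of steps.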

IV. EXPERIMENTAL RESULTS
In this section, we show the image representation results of the proposed RATE on four datasets: MNIST, SVHN [29], traffic sign [30], and scene classification (https://www.kaggle.com/nitishabharathi/scene-classification). We also show the image representation results on Tiny-ImageNet [31] with the Vision Transformer architecture [32] in Appendix C. We further show the image representation results on the human face datasets CelebA [33] and CelebA-HQ [34] in Appendices D and E. For quantitative results, we show the classification accuracy in Figures 8 to 11, where we train the classification model (e.g., CNN, STN + CNN, and IC-STN + CNN) on the original dataset without augmentation and test on the dataset with small distortion to examine the robustness of the model. RATE is pretrained with the same range of distortion in a self-supervised manner. We keep the distortion in a small range to simulate real-world scenarios (e.g., $\varepsilon_\theta = 15^\circ$, $\varepsilon_{pq} = 0.1$, and $\varepsilon_{xy} = 0.1w$, where $w$ is the width of the image). We further divide the distortion range into four quartiles to evaluate the robustness of the model (e.g., "no distortion," "25%," "50%," and "100%"). The comparisons are made between the classification accuracy of the model with and without RATE alignment (dashed line and solid line, respectively). For methods with RATE alignment (e.g., RATE + CNN, RATE + STN + CNN, RATE + IC-STN + CNN), the training and testing images are first aligned by RATE to a canonical image base before being fed to the classifier.
For qualitative results, we show the image alignment results with larger distortion (e.g., $\varepsilon_\theta = 45^\circ$) for visualization in Figures 12 to 15. We use IC-STN-4 (four recurrent alignment steps), as recommended in [21]. For RATE, we find it sufficient to align the image only once for small distortion. We illustrate a detailed recurrent alignment strategy for large distortion (e.g., $\varepsilon_\theta = 180^\circ$) on the human face dataset CelebA in Appendix D. We train all the models with a batch size of 128, the Adam [35] optimizer with a learning rate of 0.0002, and 30 epochs. The classification accuracy is the average of 5 runs.
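One way to generate test distortions at a given level is sketched below in Python. It assumes that a level such as 25% scales every range bound linearly and that parameters are drawn uniformly within the scaled bounds; neither detail is stated in the text, and `sample_distortion` is a hypothetical name.

```python
import numpy as np


def sample_distortion(rng, level, width,
                      eps_theta=15.0, eps_pq=0.1, eps_xy_frac=0.1):
    """Draw (theta, p, q, tx, ty) at a fraction `level` of the full range.

    Assumptions: bounds scale linearly with `level`; sampling is uniform.
    """
    e_theta = level * eps_theta              # degrees
    e_pq = level * eps_pq                    # zoom deviation from 1
    e_xy = level * eps_xy_frac * width       # pixels
    theta = rng.uniform(-e_theta, e_theta)
    p, q = rng.uniform(1.0 - e_pq, 1.0 + e_pq, size=2)
    tx, ty = rng.uniform(-e_xy, e_xy, size=2)
    return theta, p, q, tx, ty


rng = np.random.default_rng(0)
# One draw per distortion level for a 32-pixel-wide image.
samples = {level: sample_distortion(rng, level, width=32)
           for level in (0.0, 0.25, 0.5, 1.0)}
```

At level 0 the draw collapses to the identity parameters (no rotation, unit zoom, no translation), which corresponds to the "no distortion" setting.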

A. MNIST CLASSIFICATION
Due to the simplicity of the MNIST dataset, we use a larger range of distortion: scale range $\varepsilon_{pq} = 0.2$ and translation range $\varepsilon_{xy} = 0.2w$. In Figure 8, we observe that the classification accuracy of the models without RATE alignment (solid lines) is not robust to distortion, as shown by the curvature of the lines, although IC-STN + CNN performs better than STN + CNN and CNN. By contrast, the classification accuracy of the models with RATE alignment (dashed lines) is robust against distortion, as shown by the flatness of the lines. The accuracy varies by less than 1% for all the models with RATE alignment from no distortion to 100% distortion.

B. SVHN CLASSIFICATION
As MNIST images are grayscale with a plain background, we further test the proposed RATE on the colored Street View House Numbers (SVHN) dataset [29], which also contains more background variations. As can be seen from Figure 9, the models with RATE alignment are more robust against perturbation. The models with RATE alignment achieve higher accuracy than those without RATE alignment (e.g., RATE + CNN vs CNN) even on test datasets without distortion. RATE + IC-STN + CNN achieves the highest classification accuracy on the test dataset without perturbation and outperforms IC-STN + CNN, which shows that the image alignment strategy of the proposed RATE is also helpful for classification without distortion.

C. TRAFFIC SIGN CLASSIFICATION
We then evaluate the proposed RATE on the German Traffic Sign Recognition Benchmark (GTSRB) [30], which contains 39,209 training and 12,630 test images from 43 classes taken under various real-world conditions. The GTSRB dataset is more challenging than the previous ones since some images are taken with motion blur and at low resolution (e.g., 15 × 15). In Figure 10, the models with RATE alignment are more robust against distortion. The distortions barely affect the classification accuracy of RATE + STN + CNN and RATE + IC-STN + CNN. We also observe that RATE + CNN outperforms the CNN by a large margin and achieves comparable results to STN + CNN and IC-STN + CNN on the test dataset without distortion.

D. SCENE CLASSIFICATION
Finally, we show how RATE can be applied to higher resolution images in the real world. We evaluate our proposed method on the scene classification dataset, which contains 17,034 training images with 150 × 150 pixels from 6 classes such as buildings, forests, etc. We split the dataset into training and validation datasets with a ratio of 8:2. In Figure 11, the overall classification accuracy is low because we only use a simple CNN as the baseline. The models with RATE alignment achieve higher accuracy than those without RATE alignment (e.g., RATE + CNN vs CNN). RATE + IC-STN + CNN achieves the highest classification accuracy on the test datasets across all distortions and outperforms IC-STN + CNN, which shows that the image alignment strategy of the proposed RATE is helpful for the classification with or without distortion.

E. IMAGE ALIGNMENT RESULTS
In this section, we compare the image alignment results of RATE, STN, and IC-STN on images with larger distortion for visualization, where we increase the rotation to $\varepsilon_\theta = 45^\circ$ (see Figures 12 to 15). All the models are trained using augmented data with the same range as the distortion. The proposed RATE is trained in a self-supervised manner, while STN and IC-STN are trained in a supervised manner. We show more image alignment results on the human face datasets CelebA and CelebA-HQ in Appendices D and E and detailed illustrations of the recurrent alignment strategy in Appendix D.
FIGURE 11. Classification accuracy on the Scene dataset. The notations are the same as in Figure 8.
In Figures 12 to 15, the first row shows variations of a sampled image, followed by the aligned versions produced by STN, IC-STN, and the proposed RATE. Besides the alignment of rotation distortion, we highlight the alignment of translation in Figure 12 and of scaling in Figures 13 and 14.
In Figures 12 and 14, we observe that the images aligned by STN and IC-STN are larger than those aligned by RATE. This is because the goal of STN and IC-STN is to achieve higher classification accuracy, so they tend to make the prominent objects larger. The goal of RATE, however, is to align the images to the canonical image base of the dataset. Hence the size of the aligned image obtained by RATE tends to be the average size of the objects in the dataset. For example, the size of the objects aligned by RATE is similar to the average size of the training samples of the same object. Despite the smaller size of the prominent objects aligned by RATE, which may not favor the classification task, models trained with RATE still outperform or achieve comparable results to STN and IC-STN in classification tasks, which shows the effectiveness of the image representation of the proposed RATE. We also observe that IC-STN is more robust than STN to rotation variations (see Figures 12 and 13), which may explain why IC-STN tends to outperform STN on classification tasks. Figure 15 shows that the proposed RATE also achieves better alignment results than STN and IC-STN in more complicated scenarios.

F. DEFECTS ALIGNMENT
In this section, we show an application of the proposed RATE: building defect alignment. Buildings may develop defects over time, and inspection photos may be taken from different angles and distances during different inspections. Hence, it is difficult to judge from photos with different perspectives whether the defects have become more severe over time. Figure 16 illustrates that, by using RATE trained on photos of buildings, different photos of a building can be aligned to the same canonical pose. Thus, the defects on the building can be easily compared over time.

V. ABLATION STUDY: DIFFERENT AFFINE PARAMETER COMBINATIONS
In this section, we show the effect of learning different affine parameter combinations by the encoder on classification accuracy. We use RATE + CNN on the traffic sign dataset as the testbed. In Tables 1 and 2, the rows show the affine parameter combinations learned by RATE, and the columns show the distortion applied to the testing data; the RATE + CNN classifier is trained on data without augmentation. In Table 1, we also include a less commonly used affine transform: skew ("K"). We observe that RATE + CNN performs better than CNN alone. We also observe that RATE trained with translation learning (RST and RSTK) performs slightly better than RATE trained without it (RS).
In Table 2, we evaluate the effect of different affine transform sequences on classification accuracy. It shows that the RATE trained with different sequences of affine transforms has similar performance to that trained with the same sequence as in the testing data. This shows that learning different sequences of transforms has little impact on classification.

VI. CONCLUSION
We proposed a Recurrent Affine Transform Encoder (RATE) that can be used for unsupervised image representation learning with a compact network architecture. The proposed RATE provides a learning architecture that enables a CNN encoder to learn the affine transform parameter of images and infer the canonical image base of the dataset in a self-supervised manner. The learning architecture decomposes an affine transform matrix into two separate transform matrices to circumvent the problem of unknown ground truth and hence enables unsupervised learning by self-supervision. Downstream tasks such as classification can benefit from the proposed RATE by incorporating it with other models (e.g., CNN, STN + CNN, and IC-STN + CNN). RATE alone can also be used for applications such as image alignment and building defect alignment. In an in-depth ablation study, we investigated the effect of different combination sequences of affine transforms on classification performance. In future work, we will explore more transforms such as perspective transform and 3D transform.

APPENDIX A ABBREVIATIONS
All abbreviations used in this paper and their corresponding full names are listed in Table 3.

APPENDIX B AFFINE MATRIX CONSTRUCTION INCLUDING SKEW TRANSFORM
In this section, we show how to construct the affine transform matrix M that includes the skew transform. The last matrix on the right of Equation 2 is the skew transform matrix.

APPENDIX C RATE WITH VISION TRANSFORMER ON LARGE-SCALE IMAGE DATASET
Apart from conventional CNN architectures, transformer-based architectures [32], [36]-[38] have recently achieved remarkable performance in visual recognition tasks. In this section, we show how to incorporate the proposed RATE with the vision transformer architecture (ViT) [32] and evaluate their performance (ViT vs RATE + ViT) on the large-scale image dataset Tiny-ImageNet [31] both quantitatively (see Figure 18) and qualitatively (see Figure 19). Tiny-ImageNet [31] is a large-scale image dataset that consists of 100k training images from 200 classes. We evaluate the performance on the validation dataset since the labels for the testing dataset are not available. To obtain a transformer-based classifier on Tiny-ImageNet, we use the ViT [32] (ViT-Base) pretrained on ImageNet-21k [39] as the feature extractor, then remove the pre-trained prediction head and attach a zero-initialized $D \times K_1$ feedforward layer, where $D = 784$ is the input dimension of the MLP head and $K_1 = 200$ is the number of classes in Tiny-ImageNet. The network is fine-tuned on the training data of Tiny-ImageNet.
To train RATE for Tiny-ImageNet, we also use the ViT (ViT-Base) pretrained on ImageNet-21k [39] as the feature extractor, then remove the pre-trained prediction head and attach a zero-initialized $D \times K_2$ feedforward layer, where $D = 784$ is the input dimension of the MLP head and $K_2 = 5$ is the number of affine parameters (see Figure 17). The network is then trained on the training data of Tiny-ImageNet in a self-supervised manner (see Algorithm 1 and Figure 4). During inference, the affine parameters are estimated from the input image with RATE. The canonical image base $X_b$ is then obtained from the input image and the affine parameters. The canonical image base $X_b$ is fed to the ViT classifier for classification during training and testing of RATE + ViT.
FIGURE 19. Image alignment on the Tiny-ImageNet dataset. The first row shows variations of an image followed by its aligned versions generated by the proposed RATE.
We use the code and the pre-trained model from [40]. We train the models (both RATE and the ViT classifier) with a batch size of 128, 3000 iterations, and the SGD optimizer with a learning rate of 0.003. Due to the limitation of GPU memory, we also set the gradient accumulation step to 3 so that the weights are updated every 3 batches. In Figure 18, we show the classification results of the ViT with and without RATE alignment on testing images at different levels of distortion. To simulate real-world scenarios, we keep the distortion in a small range, from no distortion to a maximum of $\varepsilon_\theta = 15^\circ$, $\varepsilon_{pq} = 0.1$, $\varepsilon_{xy} = 0.1w$ (indicated by 100% in Figure 18). The results are the average of 3 trials. As can be seen from Figure 18, the ViT model with RATE alignment is more robust against perturbation than the ViT model alone. At the 100% distortion level, the difference between ViT and RATE + ViT is nearly 3 percent, which further supports the effectiveness of the proposed RATE on image alignment tasks.
In Figure 19, we show the image alignment results of the proposed RATE on the Tiny-ImageNet dataset. For visualization purposes, the rotation range is increased to $\varepsilon_\theta = 45^\circ$. Compared to simple datasets such as MNIST or SVHN, which only contain similar types of objects, Tiny-ImageNet includes a large variety of objects, such as animals and vehicles, and sophisticated backgrounds. Despite this challenge, Figure 19 suggests that RATE can learn the canonical image base across different types of objects.

APPENDIX D ILLUSTRATION ON RECURRENT ALIGNMENT STRATEGY
In this section, we show the recurrent image alignment strategy of the proposed RATE on the CelebA human face dataset [33]. In Figures 20 and 21, the distortion range is $\varepsilon_\theta = 180^\circ$, $\varepsilon_{pq} = 0.1$, $\varepsilon_{xy} = 0.1w$. The first row shows the distorted images, followed by the intermediate alignment results $x_{b1}$ and $x_{b4}$. The last row is the final alignment result $x_{b(\text{all})}$. We observe that $x_{b1}$ already aligns the distorted images almost to the canonical base. With more iterations of alignment, the small remaining distortion is further fine-tuned. However, more iterations of affine transform make the images blurry. Hence, we combine all the affine transforms and transform only once to obtain $x_{b(\text{all})}$.

APPENDIX E IMAGE ALIGNMENT RESULTS ON CelebA-HQ
Image samples aligned on the CelebA-HQ dataset (1024 × 1024) [34] are illustrated in Figures 22 and 23. The RATE network for the CelebA-HQ dataset is shown in Table 4. Only a few additional convolution layers are needed, even for high-resolution images such as 1024 × 1024.