Global Convolutional Neural Networks With Self-Attention for Fisheye Image Rectification

Fisheye images are attracting attention in computer vision applications such as autonomous vehicles and virtual reality because of their wide field of view (WFoV). However, fisheye images have geometric distortions caused by the refractive index of the lens. Conventional fisheye rectification methods require multiple images to calculate the distortion coefficients and the intrinsic parameters of the lens. This means that if the fisheye lens is changed, the same operation has to be repeated. Deep learning, on the other hand, can rectify images with different distortion coefficients, and end-to-end learning requires no feature engineering. To improve the performance of fisheye image rectification, we propose global convolutional neural networks with self-attention. The proposed method employs dilated convolutional neural networks (D-CNNs) to enlarge receptive fields, and self-attention to extract the most important features of input images. In this way, the proposed method can extract global features from input images. To better train and evaluate the proposed method, we generate fisheye images from the Place2 dataset with Cartesian and polar coordinates, and label them with the original images (ground truth). We also schedule the learning rate with cosine annealing and use an integrated loss function. The experimental results show that the proposed method performs well in both qualitative and quantitative evaluations.


I. INTRODUCTION
Fisheye images are used in many computer vision tasks [1], [2], [3], [4], [5], [6], [7], [8] because they capture a wide field of view (WFoV). However, images taken with a fisheye lens also suffer from severe geometric distortion, so fisheye image rectification is necessary. There are two families of fisheye image rectification methods: pattern based methods and deep learning based methods. Most earlier fisheye image rectification methods are pattern based, such as [9], [10], and [11]. In Zhang et al. [9], fisheye images are rectified by planar patterns with a closed-form solution, followed by a non-linear refinement based on the maximum likelihood criterion. Shah et al. [10] employ a non-linear transformation between points in the world coordinate system and their corresponding locations on the image plane. The other family of fisheye image rectification methods is deep learning based, such as [12], [13], [14], [15], [16], and [17]. These methods use many distorted images to train rectification networks, which learn distortion patterns from a large number of images. Generally, deep learning based methods perform better than pattern based methods. Most of the deep learning based methods employ convolutional neural networks (CNNs). Networks using CNNs have many parameters, and the features of each region are reduced through pooling layers. However, when a pooling operation is performed, existing features are lost. In addition, these features are computed over a local neighborhood, which means that repeated convolutional operations are required to extract the overall features of input images.

The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Sharif.
A mathematical model for fisheye image rectification requires specialized knowledge and complex calculations [17]. Also, since it only calculates the distortion rate for a specific pattern of fisheye image, it cannot rectify fisheye images with various patterns. A deep learning model, on the other hand, can rectify various patterns of fisheye images, and the rectified image is obtained simply by inputting a fisheye image, without complicated calculations.
In this paper, we propose global convolutional neural networks with self-attention to rectify fisheye images. As shown in Fig. 1, the output of our networks is the rectified image. The proposed method achieves an end-to-end process from fisheye images to rectified images.
We employ dilated convolutional neural networks (D-CNNs) [18] to enlarge the receptive field of filters. Previously, in order to enlarge the receptive field, features were reduced through pooling layers and the convolutional operation was repeated. In image reconstruction, an auto-encoder consists of an encoder that extracts features through convolutional operations and a decoder that reconstructs images from those features through de-convolutional operations. The dilated convolution operation, on the other hand, enlarges the receptive field by inserting zeros between the filter weights and shows good performance [19] without using an auto-encoder. At the same time, we employ self-attention to find which features of the input image deserve more attention. Our contributions can be summarized as follows:
• We propose a method that combines CNNs and D-CNNs and applies self-attention to all convolutional layer outputs to rectify fisheye images. We also simplify the model by adjusting its number of parameters.
• In the image generation process, the distortion coefficients are set randomly to generate fisheye images with different distortions. The generated images are used as input to train our networks.
• We schedule the learning rate of the proposed model with the cosine annealing technique, and define and use an integrated loss function for efficient fisheye image rectification.
The rest of the paper is organized as follows: Section II introduces previous related works. The generation of fisheye images from original images is detailed in Section III-A. The architecture of the proposed method is described in Section III-B. In Section III-C, an integrated loss function applied to our networks is proposed. Finally, Section IV introduces the experimental details and the qualitative and quantitative evaluations.

II. RELATED WORK
Early fisheye distortion rectification converts the fisheye image into a corrected image through a rectification model that obtains the intrinsic parameters of the fisheye lens. Several methods have addressed radial distortion in fisheye lenses. These methods obtain the parameters of the camera lens and rectify the fisheye image with a model of radial distortion. Kang [21] predicted the parameters from a single image, using the radial distortion of lines that are parallel and straight in space. The method uses the minimum vanishing point dispersion constraint to estimate both radial and decentering lens distortion, but it has the disadvantage of correcting the distortion only locally, in a specified area. Zhang et al. [9] proposed a method of correcting a single image through a model whose parameters are calculated from multiple input images. This method is widely used for 3D camera calibration. However, it has the disadvantage that images taken from various angles must already be available for rectification.
Another distortion rectification approach is to use projection. Abidi et al. [22] take a survey approach, using a measuring plate to find different quadratic surfaces along the x and y axes of the Cartesian coordinates and then correcting the image by projecting onto a plane. However, this method is limited because it is an approximation rather than a search for the ideal coordinates. Melo [23] and Barreto et al. [24] introduced methods that detect straight lines in an image and use them to correct the image, but these methods depend heavily on the accuracy of straight-line detection.
Recently, many image rectification methods based on deep learning have been proposed. Deep learning methods may share some limitations of existing methods, such as requiring multiple images to calibrate a specific fisheye lens, but they can achieve good performance without complex calculations for image rectification. Rong [25] and Yin et al. [26] used convolutional neural networks to rectify image distortion. They predict the distortion coefficients and then apply algorithms to rectify images. However, this approach requires a separate operation to rectify the image from the coefficients. Xue et al. [27] proposed an end-to-end model that receives a distorted image as input and predicts a corrected image. This method performs well through the design of the model structure alone, without image pre-processing. Yin et al. [26] construct a deep CNN model to extract image features and feed the obtained features to a scene parsing network and a distortion parameter estimation network. However, it is not clear whether the parsed scenes play an important role in fisheye image rectification. In Yang et al. [28], the predicted parameters are employed to correct the strong distortion in the fisheye image, and the authors synthesize the corresponding distortion from the original distortion-free image. Their method performs well, but the rectification near the image edges is imperfect.
In this paper, we propose global convolutional neural networks with self-attention to rectify fisheye images. Additionally, we employ dilated convolutional neural networks to enlarge the receptive field of filters, and self-attention to find which features of the input image deserve more attention.

III. PROPOSED METHOD
In this section, we describe the details of the proposed model. The generation of fisheye images from original images is detailed in Section III-A. The architecture of the proposed method is described in Section III-B. In Section III-C, an integrated loss function applied to our networks is proposed.

A. BASICS OF IMAGE DISTORTION
The Cartesian coordinates consist of the $x$ and $y$ axes. In contrast, the polar coordinates consist of the distance $r$ from the origin and the angle $\theta$. Fisheye images can be generated through radial distortion. The origin of the image coordinates is moved to the center, and world coordinates are then obtained by reflecting the actual pixel distance. From Fig. 2, we can obtain equations (1) and (2).
Equation (3) shows the relationship between Cartesian and polar coordinates.
In equation (4), $r_d$ is the distortion radius, $r_u$ is the normal radius, and $k_i$ is the $i$-th polynomial coefficient. The image can be radially distorted using equation (4). A distorted image is obtained by converting the polar coordinates back into Cartesian coordinates using the distortion radius.
Usually $k_3$ has little effect on radial distortion, so we only use $k_1$ and $k_2$: we distort the image by setting $k_1$ and $k_2$ to random values and $k_3$ to 0.
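The point-wise mapping described in this subsection can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the common even-order polynomial form $r_d = r_u (1 + k_1 r_u^2 + k_2 r_u^4 + k_3 r_u^6)$, which is consistent with the symbols defined above but may differ in detail from equation (4), and all function names are hypothetical.

```python
import math

def to_polar(x, y):
    """Cartesian (x, y) -> polar (r, theta), with the origin at the image center."""
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(r, theta):
    """Polar (r, theta) -> Cartesian (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)

def distort_radius(r_u, k1, k2, k3=0.0):
    """Assumed model: r_d = r_u * (1 + k1*r_u^2 + k2*r_u^4 + k3*r_u^6)."""
    r2 = r_u * r_u
    return r_u * (1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3)

def distort_point(x, y, k1, k2):
    """Radially distort one normalized point; only the radius changes, not the angle."""
    r_u, theta = to_polar(x, y)
    return to_cartesian(distort_radius(r_u, k1, k2), theta)

# A point on the x axis at unit radius moves outward along the same ray.
print(distort_point(1.0, 0.0, k1=0.1, k2=0.05))
```

Because $k_3$ defaults to 0 in `distort_radius`, the sketch follows the choice made above of using only $k_1$ and $k_2$.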

B. NETWORK ARCHITECTURE
The structure of our networks is shown in Fig. 5 and consists of three types of layers: base layers, dilated convolutional layers, and self-attention layers. Our networks are trained in a supervised manner; they use these three layers to extract global features and implement end-to-end rectification from fisheye images to rectified images. D-CNNs use enlarged filters to expand the receptive field. In the convolution operation, the extended receptive field has less dimensional loss, so it is possible to extract global features. Self-attention extracts the most important features, and residual mapping improves the training efficiency of the networks. We conduct ablation studies on the proposed method to verify how each layer of our networks affects the performance. The details of each layer are introduced in the following subsections.

1) BASE LAYER
The base layer is designed to extract features from the input for fisheye image rectification. Many deep learning studies suggest that CNNs trained with large amounts of data perform well in various computer vision tasks, such as image classification and object detection, so we employ CNNs for our base layer. Five base layers are used to extract features from the image. The numbers of filters in these convolutional layers are 16, 32, 64, 128, and 3, respectively. The output of the dilated convolutional layer is used as an input to the base layer. Also, self-attention is connected to the output of each convolutional layer to extract global features.

2) DILATED CONVOLUTIONAL LAYER
Dilated convolutional neural networks (D-CNNs) are an extension of CNNs that uses enlarged filters. One way to improve the performance of networks using CNNs is to enlarge the receptive field. However, in traditional CNNs, enlarging the receptive field increases the amount of computation in the networks, so in general the filter size is reduced and the depth of the networks is increased by stacking layers. In contrast, D-CNNs use an enlarged filter, termed a dilated convolutional filter, that is controlled by a dilation rate. The dilation rate is the spacing between the values in a filter. The dilated convolutional filter is shown in Fig. 3, where (a) is a traditional 3 × 3 convolutional filter and (b) shows that a 3 × 3 dilated convolutional filter with a dilation rate of 2 in two dimensions has the same field of view as a 5 × 5 filter. Similarly, (c) is a dilated convolutional filter with a dilation rate of 3. By adjusting the dilation rate, the receptive field can be enlarged with the same amount of computation as a 3 × 3 convolutional filter. The dilated convolution also shows better performance than an auto-encoder [19]. The number of dilated convolutional layers and the number of filters are the same as in the base layer.
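The relationship between dilation rate, field of view, and computation can be checked with a short sketch. It builds the dilated version of a 3 × 3 filter by placing zeros between neighboring weights, following the description of the dilation rate as the spacing between filter values; the helper names are illustrative only.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated filter along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def dilate_kernel(kernel, dilation):
    """Place (dilation - 1) zeros between neighboring weights of a square kernel."""
    k = len(kernel)
    size = effective_kernel_size(k, dilation)
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * dilation][j * dilation] = kernel[i][j]
    return out

kernel = [[1.0] * 3 for _ in range(3)]
for rate in (1, 2, 3):
    dilated = dilate_kernel(kernel, rate)
    nonzero = sum(1 for row in dilated for weight in row if weight != 0.0)
    # Columns: dilation rate, field of view per axis, number of active weights.
    print(rate, len(dilated), nonzero)
```

For rates 1, 2, and 3 the field of view grows to 3, 5, and 7 per axis while the number of active weights stays at 9, which is why the computation matches that of a 3 × 3 filter.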

3) SELF-ATTENTION LAYER
Recently, attention mechanisms have become popular in models that need to capture global dependencies [29]. In particular, self-attention [30] calculates the attention score within a single image by attending to all positions of that image. Self-attention is illustrated in Fig. 4. Given input features $X$, three different sets of features, queries $Q$, keys $K$, and values $V$, are calculated using linear transformations of $X$. The self-attention weight is then calculated by a scaled dot product between $Q$ and $K$, where the scaling factor $d$ is the dimension of the vectors in $Q$ and $K$. The self-attention layer uses this attention weight and the value matrix $V$ to compute its output (attention score).
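The computation described above can be sketched with small Python lists. This is a hedged illustration rather than the paper's layer: it assumes the standard scaled dot-product form of [30], $\mathrm{softmax}(QK^{T}/\sqrt{d})\,V$, with each row of `x` standing for the feature vector at one position; the weight matrices `w_q`, `w_k`, `w_v` and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return (output, attention weights) for input features x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])  # dimension of the query and key vectors
    scores = matmul(q, transpose(k))
    # Assumed scaling by sqrt(d), as in the standard formulation.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, two channels
output, weights = self_attention(x, identity, identity, identity)
print(weights)
```

Each row of `weights` sums to 1, so every output position is a weighted average of the value vectors at all positions, which is how the layer gathers global context.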

C. LOSS FUNCTION
Our loss function is defined in equation (9). First, we use the $L_2$ loss, which is the sum of the squares of the differences between the network output and the ground truth.
Here $y_i$ and $\hat{y}_i$ are the ground truth and the value predicted by the networks, respectively. However, when the networks are trained with the $L_2$ loss alone, the output has splotchy artifacts.
To compensate for these artifacts, the $L_{SSIM}$ loss is added, which helps to produce visually pleasing images. In the definition of SSIM [31], $\mu_x$ and $\mu_y$ are the mean intensities of $x$ and $y$, which measure luminance; $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, which measure contrast; $\sigma_{xy}$ is the cross-covariance of $x$ and $y$; and $C_1$ and $C_2$ are constants.
We empirically set coefficient α to 0.85. The training goal is to minimize this loss function (see equation (9)).
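A rough sketch of such an integrated loss is given below. Several parts are assumptions rather than details taken from the paper: the weighting $\alpha L_{SSIM} + (1 - \alpha) L_2$, the choice $L_{SSIM} = 1 - \mathrm{SSIM}$, the use of a single global window instead of local windows, and the constants $C_1 = 0.01^2$ and $C_2 = 0.03^2$ for intensities in [0, 1]. Images are flattened to lists of floats and the function names are hypothetical.

```python
def l2_loss(y_true, y_pred):
    """Sum of squared differences between ground truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM computed once over the whole signal (single-window simplification)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def integrated_loss(y_true, y_pred, alpha=0.85):
    """Assumed combination: alpha * (1 - SSIM) + (1 - alpha) * L2."""
    l_ssim = 1.0 - ssim_global(y_true, y_pred)
    return alpha * l_ssim + (1.0 - alpha) * l2_loss(y_true, y_pred)

target = [0.1, 0.5, 0.9, 0.3]
prediction = [0.2, 0.4, 0.7, 0.3]
print(integrated_loss(target, prediction))
```

With $\alpha = 0.85$ the structural term dominates under this assumed weighting, and the loss approaches zero as the prediction approaches the ground truth.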

IV. EXPERIMENTS

A. FISHEYE IMAGE GENERATION FOR TRAINING
Training the proposed networks requires pairs of fisheye and real images, which remains a crucial problem. To generate fisheye images, we randomly set $k_1$ and $k_2$ (see equation (4)) within a range limited to 0.01 to 0.2. We then use the calculated distortion radius to transform the pixels of the image from polar coordinates to Cartesian coordinates and obtain fisheye images. The results of fisheye image generation are shown in Fig. 6.
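The generation procedure can be sketched on a small grayscale image stored as a list of rows. This is an illustrative approximation, not the authors' pipeline: the polynomial radial model with $k_3 = 0$, the inverse mapping from output pixel to source pixel, the nearest-neighbor sampling, and the zero fill for samples that fall outside the image are all assumptions, and the function names are hypothetical.

```python
import math
import random

def sample_coefficients(rng, low=0.01, high=0.2):
    """Draw k1 and k2 uniformly from the range stated in the text."""
    return rng.uniform(low, high), rng.uniform(low, high)

def make_fisheye(image, k1, k2):
    """Warp a 2-D image (list of rows, at least 2 x 2) by inverse mapping."""
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    out = [[0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            # Normalized Cartesian coordinates with the origin at the image center.
            x, y = (u - cx) / cx, (v - cy) / cy
            r, theta = math.hypot(x, y), math.atan2(y, x)
            # Assumed polynomial radial model with k3 = 0.
            r_src = r * (1.0 + k1 * r ** 2 + k2 * r ** 4)
            # Polar back to Cartesian pixel coordinates, nearest-neighbor sampling.
            su = int(round(cx + cx * r_src * math.cos(theta)))
            sv = int(round(cy + cy * r_src * math.sin(theta)))
            if 0 <= su < w and 0 <= sv < h:
                out[v][u] = image[sv][su]
    return out

rng = random.Random(0)
k1, k2 = sample_coefficients(rng)
image = [[10 * row + col for col in range(5)] for row in range(5)]
fisheye = make_fisheye(image, k1, k2)
```

Pairing each `fisheye` output with its source `image` gives the kind of (distorted, ground truth) pair used for supervised training.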

B. EXPERIMENTAL DETAILS
We resize the images of the Place2 dataset [20] to 256 × 256, then generate fisheye images (Section III-A) and label them with the original images. We use the AdamW [32] optimizer to train our networks and schedule the learning rate with cosine annealing [33] (see equation (13)). The initial learning rate is set to 0.001.
In equation (13), $LR_t$ is the current learning rate calculated by cosine annealing, and $LR^i_{max}$ and $LR^i_{min}$ are the maximum and minimum values of the learning rate. $T_{cur}$ is the current epoch within the cycle (decay steps), and $T_i$ is the length of the cycle over which cosine annealing is performed. The learning rate of the proposed method is shown in Fig. 8.
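A single annealing cycle can be sketched as follows. The sketch assumes the standard cosine annealing form of [33], $LR_t = LR_{min} + \tfrac{1}{2}(LR_{max} - LR_{min})(1 + \cos(\pi T_{cur}/T_i))$, a minimum learning rate of 0 (not stated in the text), and a cycle length of 10; it does not model the restarts in which the initial rate is decreased and the period increased. The function name is hypothetical.

```python
import math

def cosine_annealing_lr(t_cur, t_i, lr_max=0.001, lr_min=0.0):
    """Learning rate after t_cur of t_i steps in one cosine annealing cycle."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

# One cycle: the rate starts at lr_max and decays smoothly toward lr_min.
for epoch in range(11):
    print(epoch, cosine_annealing_lr(epoch, t_i=10))
```

The rate falls slowly at the start and end of the cycle and fastest in the middle, reaching half of the initial value at the midpoint.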
The batch size is set to 16, and the total number of training images from the Place2 dataset is set to 60K. We apply the Rectified Linear Unit (ReLU) activation function to all base layers. Following [34], we apply the LeakyReLU activation function to the dilated convolutional layers. In addition, the workstation configuration used in the experiment is shown in Table 1.

VOLUME 10, 2022

FIGURE 5. The architecture of the proposed method. The entire networks consist of three layers: D-CNNs, CNNs, and self-attention layers. D-CNNs use enlarged filters to expand the receptive field. The filter size of CNNs matches the receptive field range of D-CNNs. In addition, self-attention is applied to the output of CNNs to extract global features, and residual mapping is applied to optimize our networks.

C. QUALITATIVE EVALUATION
To demonstrate the effectiveness of the proposed method, we provide a qualitative comparison with recent works in terms of visual performance, as shown in Fig. 7. Chao et al. [12] advance self-supervised learning strategies to rectify fisheye images. Their GAN-based model learns pixel-level distortion flows with unique cross rotation. However, artifacts are easily found in the rectified images output by the model. Li et al. [13] propose a general framework to rectify different types of geometric distortion in distorted images. They estimate the distortion parameters and use them to generate rectified images. Their framework is not as robust as other fisheye image rectification methods because it aims to rectify various types of distortion in addition to the radial distortion of fisheye images. Yang et al. [14] propose a parallel complementary structure to rectify fisheye images. This complementary mechanism corrects features through an encoder, which reduces the degree of distortion by successive convolution and pooling operations, and a decoder, which performs flow estimation. This method shows performance comparable to that of our method. As shown in Fig. 9, our method has better quality in the local region.

D. QUANTITATIVE EVALUATION
To verify the robustness of the proposed method, we quantitatively compare it with other fisheye image rectification methods: Chao et al. [12], Li et al. [13], and Yang et al. [14]. Specifically, we select Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) as evaluation metrics. Our quantitative comparison results are shown in Table 2.
As shown in Table 2, the proposed method is superior to the other methods in both PSNR and SSIM. Chao et al. [12] and Li et al. [13] show poor performance in PSNR and SSIM, and Yang et al. [14] show performance comparable to that of the proposed method.
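For reference, PSNR can be computed as in the sketch below, which uses the standard definition $10 \log_{10}(MAX^2 / MSE)$. The assumption of 8-bit images (peak value 255) flattened to lists and the function name are illustrative; the exact evaluation settings used for Table 2 are not given here.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

ground_truth = [100, 100, 100, 100]
rectified = [110, 90, 110, 90]
print(psnr(ground_truth, rectified))
```

Higher values mean the rectified image is closer to the ground truth.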

E. ABLATION STUDY
We analyze the effectiveness of the base layers, dilated convolutional layers, and self-attention layers of the proposed method. First, we remove the dilated convolutional layers to confirm that they are needed (w/o dilated convolutional layers). The removed dilated convolutional layers are replaced with base layers. Second, we determine how extracting global features affects the performance of the networks by removing the self-attention layers (w/o self-attention layers). Additionally, we check whether the optimization of the networks is affected when the residual mapping layers are removed (w/o residual-mapping layers). As shown in Table 3, the variant without self-attention layers shows the lowest performance, and the variant without dilated convolutional layers, which performs best among the ablated variants, still performs worse than the networks with all layers. Therefore, each layer plays an important role in improving the performance of the proposed networks.

… [14], the output of the proposed method, the ground truth. The results of this experiment are superior to those of related studies.

FIGURE 8. Cosine annealing with an initial learning rate of 0.001, decay steps of 10. For each annealing, the initial learning rate is decreased and the period is increased.

F. COMPARISON OF MODEL PARAMETERS
We compare the number of parameters of the proposed model with those of the comparison models to demonstrate the advantages of the proposed model. As shown in Table 4, the proposed model has 21 times fewer parameters than Chao et al. [12], 25 times fewer than Li et al. [13], and 47 times fewer than Yang et al. [14]. Therefore, the proposed model is compact and can be used in environments with low computing power.

V. CONCLUSION
In this paper, we propose global convolutional neural networks with self-attention to rectify fisheye images. We employ dilated convolutional neural networks to enlarge the receptive field of filters at the same computational cost as traditional convolutional neural networks, and we apply self-attention to the global features, which lets all positions of the input image interact with each other and identifies the features that deserve more attention. To better train and evaluate the proposed method, we generate fisheye images from the Place2 dataset with Cartesian and polar coordinates and label them with the original images. We also schedule the learning rate with cosine annealing and use an integrated loss function. To demonstrate the robustness of the proposed method, we perform qualitative and quantitative evaluations against related methods. In addition, an ablation study confirms that each layer of the proposed method contributes to the performance of the networks. The experiments show that the proposed method performs well both visually and numerically.