Accent for Visible and Infrared Registration (AVIR): Attention Block for Increasing Patch Matching Rate Through Edge Emphasis

In this paper, we propose an efficient attention module for deep learning networks that match visible and thermal infrared (TIR) images. The proposed deep learning model judges whether a heterogeneous sensor pair is correctly matched, and the attention module raises the matching rate through an edge-emphasizing structure. This paper contributes in three aspects: 1) The first is a comparison of Convolutional Neural Network (CNN) structures for heterogeneous sensor registration. When stacked heterogeneous sensor data are the input of a single CNN, matching can be treated as a classification problem. Therefore, this paper reports results not only for networks designed for heterogeneous sensor matching, but also for various deep learning networks used for classification. 2) The second is the design of an efficient attention module. The experiments show that the module can replace many convolution blocks while achieving better performance. The attention module uses a 1×k filter and a k×1 filter to extract horizontal and vertical edges and convolves the two resulting paths. 3) The third is a deep learning model suited to registering complex aerial visible and TIR data. To compare the various methods, we describe the calibration process for aerial visible and TIR data obtained directly from a drone. Using the calibrated data, this paper presents an AVIR attention block-based architecture that shows the best matching results with a minimal increase in parameters.


I. INTRODUCTION
With the recent development of technology, the use of and interest in Artificial Intelligence (AI) are increasing. AI is used for Automatic Target Recognition (ATR), autonomous driving systems, medicine, mechanics, and security. Detection and recognition using a single sensor, a common and important issue in these fields, have turned from a blue ocean into a red ocean. A single sensor is clearly limited in recognition and detection because each sensor's characteristics bring both advantages and disadvantages. A visible camera, which uses the reflective band of the light spectrum, offers a high-resolution field of view and high spatial resolution, but it cannot see at night. Thermal Infrared (TIR) imagery shows only the rough outline of an object at low resolution, but because it uses emitted radiance, it can distinguish objects both day and night and provides thermal information about the object [1]. Our research team wanted to perform object detection through a heterogeneous sensor fusion network using visible and infrared (IR) data, which have different advantages. During pre-processing for this network, we confirmed that the frames per second (FPS) of the two videos differed due to telecommunication and hardware limitations despite identical settings. This problem occurs when two cameras are attached at the same location but acquire data with different devices. In addition, each video suffered from inter-frame interruption and frame omission due to telecommunication. Since the frames of the visible and IR videos did not correspond, we tried to find the same frame in the two videos by measuring their correlation through an image matching network. In the process, we conducted a study of matching networks, and this paper reports that study.
Visible and IR cameras are passive sensors. Their data capture different characteristics depending on the wavelength, from the spectrum reflected from the target under natural or artificial illumination to the emitted spectrum. Images acquired in different spectra therefore have different characteristics. Visible sensors use red, green, and blue (RGB) bands. An Electro-Optical (EO) sensor covers a wider spectrum than the visible band, so EO/IR systems cover the range from ultraviolet (UV) through visible and IR. Each band has a different wavelength range (UV: 0.25~0.38 µm, visible: 0.38~0.75 µm, and infrared: 0.75~14 µm). Visible, near IR (NIR), short-wave IR (SWIR), and mid-wave IR (MWIR) sensors measure reflected radiance, while long-wave IR (LWIR) sensors measure emitted radiation [1]. Therefore, treating every type of heterogeneous sensor matching, across different spectral bands, with the same matching network risks ignoring these physical characteristics. We perform matching using visible and TIR data, which are aerial drone data acquired in the LWIR band.
Our contribution to the matching of visible and TIR images can be summarized in three aspects.
--First, we compare both matching networks and classification networks on the matching task.
--Second, we propose a useful attention module for heterogeneous sensor matching.
--Third, we suggest a suitable network for visible and TIR matching and perform the matching on directly acquired complex aerial drone data. This paper explains the utility of the proposed edge attention by applying the attention block to heterogeneous sensor matching and shows that a stacked input can efficiently determine matching. In addition, by using the stacked input, we show that the proposed network is robust through comparison with classification networks as well as matching networks. To learn TIR data, we used complex aerial drone data, which also demonstrates the robustness of our network. This paper shows the construction of a matching network through analysis of various parameters and presents an efficient network for classifying the matching results of visible and TIR images. Stacked inputs have been used for matching problems in various cases [2], [3], [4]. Since such a stacked input can be treated as a single 2-channel or 4-channel input, matching networks need to be compared with classification networks. Additionally, judging the matching of heterogeneous sensors with a classification network is a task that has not been done before.
From the point of view of heterogeneous matching, we tried to find a comprehensible module for alignment through an attention module, which works similarly to human thinking. Figure 1 shows an example of registration and misregistration side by side. Which characteristics tell you whether the matching in this figure is right or wrong? Our research team built the matching ground truth by hand and judged that the primary cue in human visual judgment of matching between visible and TIR images is the alignment of their edges, and the secondary cue is texture. Yellow lines indicate that the visible and TIR images are matched well, and red lines show a mismatch, because the bus parts in the lower image do not match. Line information is important for judging image matching, and the texture of the object is also necessary. Based on this, we propose a module that works similarly to human thinking. The rest of this paper is organized as follows: 2. Related Work, 3. Proposed Method, 4. Ablation Study, and 5. Conclusion. In 2. Related Work, we explain various examples of fusion networks and then focus on heterogeneous matching based on image sensors. This section is divided into A) Heterogeneous Data Matching, B) Deep Learning for Classification, and C) Attention Module. A) Heterogeneous Data Matching describes examples of applying deep learning as well as conventional methods to matching heterogeneous data. Because we conducted a comparative experiment with classifiers in the experimental section, B) Deep Learning for Classification describes the related classifiers. C) Attention Module introduces attention modules, a field that has recently been in the spotlight in Convolutional Neural Network (CNN) research. 3. Proposed Method explains the proposed network and its feasibility. The AVIR block using the attention module, AVIRNet built from these blocks, and a loss function effective for binary classification are introduced in this section. 4. Ablation Study covers the acquisition procedure for the experimental data and various network comparison experiments. In 5. Conclusion, we conclude by explaining the effectiveness and practicality of this work and future work.

II. RELATED WORK
This section consists of three parts. A. Heterogeneous Data Matching introduces various cases of deep learning used for image matching. B. Deep Learning for Classification describes the history of deep learning classifiers and introduces the convolution-based networks compared in this paper. We also explain why classification algorithms are used for matching heterogeneous sensors. C. Attention Module briefly explains the attention module and its usage examples. Before the sub-sections, we describe the fields in which various sensors are used.
Many sensors are being used in various fields. Lidar can acquire depth information of an object using point cloud and has the advantage of high precision, but it cannot obtain the color and surface characteristics of the object [5], [6]. X-rays used in medicine are produced when very fast-moving electrons collide with heavy atoms. Although there is an advantage of short-time examination, only rough information of soft tissue (subcutaneous-tissue /muscle/ligament) can be grasped using X-ray diffraction. Computed Tomography (CT) has the advantage of being able to check the cross-sectional view of an object, but it is also insufficient to measure soft tissues in detail. Such various single sensors have limitations, advantages, and disadvantages in the acquisition process and image information.
Multi-sensor fusion is being studied to overcome and complement the limitations of a single sensor. Fusion technologies in the autonomous driving field mainly combine visible cameras and Lidar [5], [6]. Gong et al. [5] implemented fusion for 3D object detection using point cloud and visible information, and Caltagirone et al. [6] used deep learning-based fusion for path detection. In the medical field, heterogeneous sensors such as CT and X-ray have been fused. CT and Chest X-ray (CXR) fusion was implemented with deep learning for the diagnosis of COVID-19 [7], and Panwar et al. [8] also suggested a deep learning framework for the detection of COVID-19 and supported its reliability using GradCAM [9]. These attempts are efforts to go beyond the limits of a single sensor.
Heterogeneous images are no exception to these fusion efforts. There are many cases of EO and IR fusion [2], [10]-[13]. A dual-tree complex wavelet transform (DTCWT) technique based on region segmentation was proposed for the fusion of airborne infrared and visible images [10]. Low-intensity visible and thermal infrared images were fused by reinforcing low-light information with IHS conversion and then fusing frequency bands [11]. Sensor fusion was performed using the CNN-based DeepFuse network and a learning loss based on the structural similarity index measure (SSIM) [12]. H. Li and X.-J. Wu [13] proposed DenseFuse, a dense block-based network, and used a learning loss that incorporates SSIM. In these works, sensor fusion results were acquired through deep learning.
For such fusion, the two sensors must first be matched, and fusion must build on correct matching. Using the reflective bands of 3 RGB channels and 1 NIR channel as inputs, a dense block type network was designed to determine matching or mismatching [2]. Matching was performed on EO aerial data acquired at different times using a Siamese network [14], and a 2ch network was suggested as a similarity measurement network between the visible band and near IR (NIR) [3]. Its input is the visible image converted to gray level together with the NIR image, and the network consists of three convolution layers in total. Zhang et al. [15] proposed sFcNet, which is based on a Siamese network. For EO/Near Infrared (NIR), EO/TIR, and EO/Synthetic Aperture Radar (SAR) pairs, a first feature was acquired through convolution layers from the high-resolution EO image, and a second feature was obtained from the heterogeneous sensor image. The matching score was determined by using the second feature as a filter on the first feature. Wang et al. [16] proposed an algorithm that learns by vectorizing the patches of each image and finds the matching points. In the following three sections, we introduce papers related to the techniques used in this paper.

A. Heterogeneous Data Matching
For matching heterogeneous sensors, Scale Invariant Feature Transform (SIFT) and other feature-based extraction methods have mainly been studied, but similarity measurement using deep learning is now being developed. Ma et al. [17] extracted matched pairs for aerial photographs using SIFT. In the process, they computed the gradient magnitude of the Gaussian scale-space image with Sobel filters to make the descriptor robust. Ye et al. [18] measured the structural similarity between images and registered heterogeneous EO, SAR, and Lidar data using a histogram of orientated phase congruency (HOPC) descriptor. Li et al. [19] performed multi-modal image matching using the Radiation-variation Insensitive Feature Transform (RIFT). RIFT uses phase congruency instead of image brightness to detect feature points, and extracts corner points and edge points for optical-optical, infrared-optical, Synthetic Aperture Radar (SAR)-optical, map-optical, and day-night matching.
There are examples that combine feature-based and area-based fusion [20], [21], and examples of line feature-based fusion [22], [23]. Recently, matching algorithms using deep learning are also being studied, and their frameworks are shown in Figure 2. Figure 2(a) is a Siamese network, where input A and input B enter as separate inputs to the same network with shared weights [24]-[30], and Figure 2(b) is the case where the network structure is the same but the weights are not shared. Figure 2(c) shows a stacked input entering a single network that measures the similarity [2], [4]. J. Zbontar and Y. LeCun [24] performed stereo matching of two visible images taken from different angles, and He et al. (2018) [25] found matching points for EO data acquired in different weather and at different times. Han et al. [26] constructed a feature network and a metric network in MatchNet, which considers various patch sizes. He et al. (2019) [27] proposed multi-support patches Siamese networks (MSPSNs) and studied registration using satellite multispectral data (e.g., Landsat-5/8, ZY-3, and GF-1). Various scales were reflected by adjusting the patches to sizes of 24×24, 48×48, and 97×97. There are also networks that match through multiple paths. Figure 3 shows an n-stream network. P. L. Suárez et al. [4] used a 2-stream network. En, S., Lechervy, A., and Jurie, F. [29] proposed a three-stream network (TS-Net) that constructs a layer with two paths for a single input and uses two Siamese networks to obtain three stream outputs. Balntas, V., Johns, E., and Tang, L. [30] proposed PN-Net (a triplet network), a 3-stream network trained on a matching input pair w and x together with a different input y. Aguilera et al. [28] also proposed a quadruplet network called Q-net, which takes two pairs of EO and NIR inputs, four inputs in total.
Baruch, E.B. and Keller, Y. [31] showed better performance on the VEDAI, CUHK, and VIS-NIR data sets than Aguilera et al. [28]; their network is an example of Figure 2(b), in which network parameters are not shared. They designed a block that combines Figure 2(a) and 2(b). Networks corresponding to Figure 2(a)-(c) have been built and tested [3], [4]. Aguilera et al. [3] proposed a 2ch network, which uses a stacked input. Zagoruyko, S. and Komodakis, N. [32] further tested 2-stream networks and finally demonstrated the robustness of the 2-channel 2-stream network. Suárez et al. [4] designed a two-channel network similar to that of Aguilera et al. [3] but with fewer parameters, and achieved higher performance than Aguilera et al. [3] in matching visible and NIR images. Higher performance than Aguilera et al. [3] was also obtained for visible and NIR aerial images using the dense block [2]. Feature vectors have also been used for matching [33]. Chen et al. [34] presented FSNet, a kind of Siamese network, for the registration of heterogeneous images. These developments in deep learning have greatly improved the performance of matching between two heterogeneous images. With reference to this history, we conducted our study using deep learning. As channel-stacked networks have developed, we considered it necessary to examine heterogeneous sensor matching with well-developed deep learning classifiers.

B. Deep Learning for Classification
Recent research on deep learning has been inspired by the shape of the brain. LeNet-5 [35] is a classic deep learning network used for handwritten character classification. It classifies a 32×32 input using three convolution layers, two subsampling layers, and one fully connected layer. Deep learning networks have evolved toward deeper convolution layers while solving the vanishing gradient problem through weight initialization [36], [37], batch normalization [38], etc. VGG [39] uses 3×3 convolutions in networks of various depths to improve classification performance, and ResNet [40] improves classification performance by using a residual block. ResNeXt [41], which combines filter grouping controlled by cardinality with the residual block of ResNet [40], recorded a low top-5 error. Deep learning networks have also developed into forms that accumulate channels, such as DenseNet [42], which improves classification performance by concatenating and reusing the outputs of previous convolution blocks. Dual Path Network (DPN) [43] combines the advantages of ResNet and DenseNet by using both residual and dense paths in a dual-path structure. MobileNet [44] simplifies networks to fit mobile devices. To this end, depthwise convolution and pointwise convolution are used to reduce the number of parameters and the amount of computation. ShuffleNet [45] uses pointwise group convolution and channel shuffle to build a small model that, like MobileNet, reduces the number of parameters and the amount of computation.
CSPNet [46], short for Cross Stage Partial Network, divides the feature map of the base layer into parts, applies convolution to only one part, and then merges the result with the rest. EfficientNet [47], which shows high performance on the public classification data sets CIFAR-10 and CIFAR-100 [48], improves performance through compound scaling, which jointly changes the size of existing models by width scaling, depth scaling, and resolution scaling. Using EfficientNet, the best performance at the time of writing on CIFAR-10 and CIFAR-100 was obtained through learning with a teacher network and a student network [49]. Recently, ViT [50], BiT [51], and Swin [52], which applied the transformer used in natural language processing to image classification, are also showing high performance on ImageNet [53]. As can be seen in Figure 2(c), stacked data of heterogeneous images are used as a single input, so it is worth considering image classification networks for registration. Figure 4 is a schematic history diagram of the image classifiers explained above. Networks written in red are used for comparison with the proposed network. Since we propose a matching network using a CNN-based attention module, we compared it with the CNN-based networks highlighted in red in Figure 4. These classification networks showed the best performance on public data sets. Since a stacked 2-channel or 4-channel input is a valid input for a classification network, we judge that classification networks can be used for matching. Also, the optimal input size of the network presented in this paper is 128×128, which is large enough to apply to the classification networks.

C. Attention Module
We tried to improve the performance of the matching network by applying attention modules, which have improved the performance of classifiers. Therefore, this section briefly describes the progress of the attention module. Attention modules have improved classification performance [54]-[57]. The SE block [54] is an example that improves performance through channel-oriented attention: channel attention is computed through global average pooling, and channels are emphasized through fully connected layers. The BAM block [55] uses channel attention through global average pooling and spatial attention through 1×1 convolution. Woo et al. [56] presented the CBAM block, which additionally uses features from max pooling as well as global average pooling and, unlike the simultaneous use of channel and spatial attention in BAM, applies channel attention first and spatial attention afterward. Residual blocks are then applied to reinforce the input features. A residual attention network was designed and applied to a classifier [57]. RCAB [58] was designed to perform channel attention, and the residual-in-residual concept was applied to image super-resolution. A channel-wise and spatial attention residual (CSAR) block was designed and used for super-resolution [59]. The advantage of channel attention is that it gives weights to important channels, but it is limited in that it does not see spatial characteristics. To address this spatial weakness, attention block studies have moved toward designing spatial modules, and existing spatial attention blocks use a square filter. We interpret a square filter as also emphasizing unnecessary spatial information in matching, where judging edge information is important. In this paper, we demonstrate the validity of the AVIR block, which emphasizes edge components, through comparison with SE, BAM, and CBAM used in classifiers.
The next section describes the proposed network and loss, and additional settings are given in the experimental part.

III. PROPOSED NETWORK
The problem with existing matching networks is that they could not achieve a high matching rate for the TIR and visible pairs of drone aerial data. The data we used for this research are drone data that we acquired directly, and they contain various complex objects such as buildings, trees, grass, and vehicles. TIR images contain the emitted radiance characteristics of various objects, so they differ from visible images, which use reflected radiance. The data also contain many complex features rather than just distinct edge information. The datasets used in [60], [61] contain human radiance information and indoor-oriented data pairs, so they have distinct characteristics that separate background from people. It is therefore necessary to discuss networks for complex aerial data. In this paper, the proposed network using the AVIR block is named AVIRNet. Its input is a 2-channel stacked input in which the gray-scale visible image and the 1-channel LWIR image are concatenated. It was designed with the various single-channel inputs used in the aerial field (i.e., panchromatic, LWIR, SWIR, and MWIR images) in mind. The network is very concise: it consists of 5 convolution blocks, 5 AVIR blocks, and 5 pooling layers, and the features after the last 2D convolution are passed through global average pooling to the fully connected layer. Each of the 5 convolution blocks includes a 3×3 filter with zero padding of 1 and a stride of 1, batch normalization [38], and ReLU. Finally, the scalar output is compressed to the range between 0 and 1 through the sigmoid function, where 0 means mismatching and 1 means matching. Figure 5 shows the schematic structure of the network, and Table 1 describes each block of AVIRNet. CB is a convolution block, which consists of a 3×3 convolution, batch normalization, and ReLU. AT stands for the attention module, MP for max pooling, GP for global average pooling, and FC for a fully connected layer. The filters of AT are of three types, listed in the order horizontal, vertical, and spatial. P stands for padding and S stands for stride. Figure 6 shows the AVIR block. Max pooling and average pooling each produce a different feature from the previous feature. After the horizontal and vertical convolutions, the feature concentrates on edge information. A 1×1 convolution combines the horizontal and vertical edge information, and the sigmoid function compresses the result. The AVIR block is described as follows. When designing the attention map, we judged from manual registration that edge information deserves the most focus. To this end, we designed an edge module with filters along the horizontal and vertical axes so that it includes less spatial information. Let $F$ be a feature after convolution. Channel-wise max pooling and average pooling of the feature yield $F_{max} \in \mathbb{R}^{1 \times H \times W}$ and $F_{avg} \in \mathbb{R}^{1 \times H \times W}$. The poolings are denoted $F_{max} = MaxPool(F)$ and $F_{avg} = AvgPool(F)$, and the edge module can be expressed as a horizontal convolution module and a vertical convolution module as in (1) and (2).
Equation (1) uses a horizontally long filter so that horizontal information can be examined in depth, and (2) uses a vertically long filter, so each resulting feature makes horizontal or vertical information easy to exploit. For the padding of each filter, the quotient of the filter size divided by 2 is applied along the horizontal and vertical axes on both sides of the feature. [A; B] denotes the concatenation of A and B. Equation (1) is a $1 \times k$ filter oriented along the x-axis, and the left of Figure 7 is an example convolution diagram using a 1×5 filter. Likewise, (2) is a $k \times 1$ filter oriented along the y-axis, as shown in the middle of Figure 7. In addition, the combined feature in (3) is extracted by a spatial convolution over the horizontal and vertical information, and the final value of each pixel is compressed to the range 0 to 1 through the sigmoid function, marked σ in (4). The final attention map has the dimension $\mathbb{R}^{1 \times H \times W}$. The strength of this edge module is that it extends the judgment of edge information to overall spatial information by treating horizontal and vertical information as separate paths. By changing the horizontal, vertical, and spatial filter sizes, we determine the filter sizes most appropriate for the matching rate of aerial data through an ablation study.
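The data flow of the edge module described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: equations (1)-(4) are not reproduced in this text, so the wiring (both paths reading [F_max; F_avg], then a spatial convolution over the concatenated paths) follows the prose description, and the function names and fixed kernel values are hypothetical stand-ins for learned weights.

```python
import math

def channel_pool(feature):
    """Channel-wise max and average pooling of a C x H x W nested list."""
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    f_max = [[max(feature[ch][i][j] for ch in range(c)) for j in range(w)]
             for i in range(h)]
    f_avg = [[sum(feature[ch][i][j] for ch in range(c)) / c for j in range(w)]
             for i in range(h)]
    return f_max, f_avg

def conv_same(maps, kernels):
    """Zero-padded convolution (cross-correlation) summed over the input maps.

    maps: list of H x W maps; kernels: one kh x kw kernel per map (odd sizes).
    The padding is floor(k / 2) on each side, so the output stays H x W.
    """
    h, w = len(maps[0]), len(maps[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for fmap, kernel in zip(maps, kernels):
        for i in range(h):
            for j in range(w):
                for a in range(kh):
                    for b in range(kw):
                        y, x = i + a - ph, j + b - pw
                        if 0 <= y < h and 0 <= x < w:
                            out[i][j] += fmap[y][x] * kernel[a][b]
    return out

def avir_attention(feature, k_h, k_v, k_s):
    """Assumed wiring: pool, horizontal and vertical paths, fuse, sigmoid."""
    f_max, f_avg = channel_pool(feature)
    pooled = [f_max, f_avg]               # [F_max; F_avg]
    f_h = conv_same(pooled, k_h)          # 1 x k path (horizontal)
    f_v = conv_same(pooled, k_v)          # k x 1 path (vertical)
    fused = conv_same([f_h, f_v], k_s)    # spatial convolution over [F_h; F_v]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in fused]

# Toy 2 x 3 x 4 feature with a vertical edge between columns 1 and 2.
feature = [[[0, 0, 1, 1]] * 3, [[0, 0, 2, 2]] * 3]
edge = [-1.0, 0.0, 1.0]                   # hypothetical fixed weights
k_h = [[edge], [edge]]                    # two 1 x 3 kernels
k_v = [[[v] for v in edge]] * 2           # two 3 x 1 kernels
k_s = [[[1.0]], [[1.0]]]                  # two 1 x 1 kernels
attention = avir_attention(feature, k_h, k_v, k_s)
```

In this toy case the attention values in the middle row rise next to the vertical edge and stay at the neutral value 0.5 in the flat region, which is the behavior the edge emphasis is meant to produce. How the attention map is applied back to the feature is not covered by this sketch.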
The learning loss combines a binary cross entropy loss for finding the correct answer with a smoothing term. A human can judge matching as 1 and mismatching as 0, but in machine learning, which works with probability distributions, a perfect integer result cannot be produced. Therefore, the smoothing term was added to keep the output from reaching exactly 0 or 1, and an ablation study shows that this improves performance over using binary cross entropy alone. $L_{BCE}$ in (5) denotes the binary cross entropy loss, $y$ is the ground truth, and $f(x; \theta)$ is the output for the input $x$ under the model parameters $\theta$. Equation (6) is the loss with the smoothing term that was used in the experiments. The ratio between the binary cross entropy loss and the smoothing term is determined by the value of ϵ, which was set to 0.05 in the experiments of this paper. $f(x; \theta)$ is the probability result, that is, the value after the sigmoid function, and $y$ is the desired learning target, 0 for mismatching and 1 for matching.
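As a rough illustration of how a smoothing term can be weighted against binary cross entropy by ϵ, the sketch below uses the common label-smoothing form, in which the smoothing term is the cross entropy against a uniform target of 0.5. This form is an assumption: equations (5) and (6) are not reproduced in this text, and the function names are hypothetical.

```python
import math

def bce(target, prob, tiny=1e-12):
    """Binary cross entropy for one sample; prob is the sigmoid output."""
    prob = min(max(prob, tiny), 1.0 - tiny)  # guard against log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def smoothed_bce(target, prob, eps=0.05):
    """(1 - eps) * BCE plus eps * smoothing term (assumed: uniform target 0.5)."""
    return (1.0 - eps) * bce(target, prob) + eps * bce(0.5, prob)

loss_confident = smoothed_bce(1.0, 0.999)
loss_calibrated = smoothed_bce(1.0, 0.975)
```

Because cross entropy is linear in the target, this assumed form is the same as training against the softened target (1 - ϵ)·y + ϵ/2, so with ϵ = 0.05 a matching pair is pulled toward 0.975 rather than exactly 1, in line with the motivation given above.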

A. Dataset
Drones with visible/TIR cameras are among the newest devices that can be used flexibly in civilian and defense applications to monitor objects day and night. We acquired drone data for surveillance and reconnaissance research and confirmed that the two videos have different frame rates because they come from different devices. To solve this problem, a matching network was implemented to check the correlation between the two images. For the visible/TIR data set, a matching set was constructed from aerial data measured with a DJI M200 drone and a DJI XT2 camera at four locations in Gyeongsan, Gyeongsangbuk-do, Republic of Korea. The XT2 records with a thermal camera and a visual camera at the same time. The spectral range of the thermal camera is 7.5 to 13.5 µm, and its measurable temperature range is -25 to 135˚C. Its spatial resolution is 640×512, and its frame rate is 30 fps.
The visual camera has a FOV of 57.12˚×42.44˚ and a maximum resolution of 3840×2160. Figure 8 shows the DJI M200 drone and DJI XT2 camera. Figure 9 shows examples of the aerial images. The overall process for the acquired drone dataset is as follows. The EO/IR drone data pairs were taken from the same position and angle. Due to the characteristics of the drone, when we checked the acquired images, barrel distortion appeared in the EO images and pincushion distortion in the TIR images. To correct this geometrically, we measured the internal and external parameters using Matlab and performed geometric correction with the camera calibration toolbox [62]. The two images were then registered through a homography.
Among the Matlab toolboxes, Camera Calibrator estimates the internal and external parameters of a camera and identifies lens distortion through its camera calibration function. Calibration requires chessboard images acquired with the visible/IR camera. In aerial photography, the focus is at a far distance, so correcting with a chessboard image measured at short distance may cause a large error in long-distance image registration. Thus, a black-and-white chessboard-like image, as shown in Figure 10, was obtained by imaging a 60 cm × 60 cm tile sculpture from above, and the obtained internal parameters are shown in Figure 11. Figure 10 was obtained after correcting for lens distortion. The transformation between two coordinate systems by homography is given in (7). Since the two sensors acquire images from a single drone, the image matching parameters need to be obtained only once, and all photos taken at a similar height can use the same homography matrix. In a coordinate transformation using homography, once four matching pairs of points p and p' are obtained, the coordinates can be transformed through the matrix in (7). In the formula, if the coordinates x, y, and 1 on the right are p, then the coordinates wx', wy', and w are p' [63].
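The mapping from p = (x, y, 1) to p' = (wx', wy', w) can be illustrated with a short sketch. The matrix values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the calibration result of this paper, and the function name is illustrative.

```python
def apply_homography(h, point):
    """Map p = (x, y, 1) to p' = (wx', wy', w) and return (x', y')."""
    x, y = point
    wx = h[0][0] * x + h[0][1] * y + h[0][2]
    wy = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return wx / w, wy / w

# Hypothetical matrix: scale by 1.5 and shift by (10, -5).
H_EXAMPLE = [[1.5, 0.0, 10.0],
             [0.0, 1.5, -5.0],
             [0.0, 0.0, 1.0]]
mapped = apply_homography(H_EXAMPLE, (100.0, 200.0))
```

Dividing by w is what makes the matrix defined only up to scale: multiplying every entry of the matrix by the same nonzero constant gives the same mapped point, which is why four point pairs (eight equations) suffice to fix its eight free parameters.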
Finally, Figure 12 was obtained through lens distortion correction and homography correction to match visible and TIR. The problem with data generated in this way is that data generated by manual matching of heterogeneous sensors has pixel errors. The first cause is an error caused by geometric correction. The second cause is the error due to the distance between the stereo cameras. The minimum pixel error of the data we finally computed can yield an error of 0 to 5 pixels, and since the error can prevent the network learning, we scale the original data by 1/4 to reduce the error to 0 to 1.25 pixels. In addition, due to the limitation of manual homography stereo matching, the error widened toward the outside, so the experiment was conducted using 1024×1024 in the center of the image. In various papers, there have been cases in which data sets are constructed with an input size of 64×64 and a smaller input size of 36×36. However, considering the nature of the aerial data, the input size was selected as 128×128 in consideration of the data information. Too small image patch does not have enough information for registration. Through preprocessing, we obtain matching patches at Figure 13. We filmed a variety of environments, including farmland, settlements, rivers, college towns, roads, and forests. Scene 1 is the university interior and forest, scene 2 is the driveway and building, scene 3 is the village, forest, parking lot and driveway, and scene 4 is around the university's main gate, farmland, and stream. Since the data cannot be disclosed due to internal security issues, a clear designation is omitted. The data structure was divided into 4 data taken at different locations as shown in Table 2. Training data consist of scenes 1, 2, and 3 and validation data consist of scenes 3 which is different of training data but has similar aerial property with training data. 
In addition, the test was conducted using the parameters of the epoch with the lowest validation loss, and the test set was built from scene 4. Because aerial drone data often contain landscapes that were not seen during training, we judged that evaluating on scene 4, which is unrelated to the training set, is the appropriate way to verify that the network is robust.
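The preprocessing arithmetic described in this subsection can be sketched as follows. The sketch only illustrates how a 1/4 rescaling shrinks the worst-case pixel error and how a centered crop window is located. The helper names and the 1920×1080 frame size are assumptions made for the example; the sensor resolution and the order of the scaling and cropping steps are not specified here.

```python
from typing import Tuple


def scaled_error(max_error_px: float, scale: float) -> float:
    """Pixel error that remains after resizing the image by `scale`."""
    return max_error_px * scale


def center_crop_box(width: int, height: int, crop: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    if crop > width or crop > height:
        raise ValueError("crop is larger than the image")
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Worst-case error of 5 pixels, rescaled by 1/4 as in the text.
print(scaled_error(5, 1 / 4))  # 1.25

# Hypothetical frame size, used only to show where the central window falls.
print(center_crop_box(1920, 1080, 1024))  # (448, 28, 1472, 1052)
```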

B. Experiment Setting
For the experiments, the AdamW optimizer [64] was used with a learning rate of 0.001 and beta values of 0.9 and 0.999. The learning rate was adjusted using the cosine annealing scheduler [65]. The networks were trained for 100 epochs with a training batch size of 16. In addition, the data loader stage was configured as shown in Figure 14 so that the network learns from a variety of mismatching pairs. The label is 1 for a matching patch and 0 for a mismatching patch. When the distance r is set in the data loader, a mismatching patch is taken at a random angle from a position that deviates by r from the center of the reference image, so that learning is robust at various pixel distances. This configuration increases the diversity of mismatching pairs: the angle is drawn from the range 0 to 360°, and r is drawn at random from the range [50, 70], so the input is not fixed in advance. During training, a random flag with a probability of 1/2 determined whether a matching or a mismatching pair was generated. Figure 14 shows the data loader.
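The sampling rule of the data loader can be sketched as follows. This is an illustrative reading of the description above, not the released implementation: the function name sample_pair_center is hypothetical, the distance r and the angle are assumed to be drawn uniformly, and handling of patches that would fall outside the image is omitted.

```python
import math
import random
from typing import Tuple

Point = Tuple[float, float]


def sample_pair_center(
    ref_center: Point,
    rng: random.Random,
    r_range: Tuple[float, float] = (50.0, 70.0),
) -> Tuple[Point, int]:
    """Return the center of the second patch and its label (1 match, 0 mismatch)."""
    # Random flag: matching and mismatching pairs each occur with probability 1/2.
    if rng.random() < 0.5:
        return ref_center, 1
    # Mismatching pair: move r pixels away from the reference center
    # in a random direction between 0 and 360 degrees.
    r = rng.uniform(*r_range)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    cx, cy = ref_center
    return (cx + r * math.cos(theta), cy + r * math.sin(theta)), 0


rng = random.Random(0)  # fixed seed so the example is repeatable
for _ in range(5):
    center, label = sample_pair_center((256.0, 256.0), rng)
    print(label, center)
```

Drawing r from a range rather than fixing it exposes the network to mismatching pairs at several offsets, which is the diversity the data loader is designed to provide.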

C. Ablation Study
In this paper, a total of 11 networks were tested: 5 matching networks and 6 classification networks. We use the overall accuracy (OA) of (11) and the mean accuracy (MA) of (12), because a random flag with a fixed seed is used in the data loader. TP means that the classification result is positive and the result is correct. FP means that the classification result is positive and the result is incorrect. FN means that the classification result is negative and the result is incorrect. TN means that the classification result is negative and the result is correct. We use the true positive rate (TPR) of (8) and the true negative rate (TNR) of (9) to calculate the accuracy.
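The metrics can be computed from the four confusion counts as in the sketch below. Equations (8) to (12) are not reproduced in this section, so the standard definitions are assumed: TPR = TP/(TP+FN), TNR = TN/(TN+FP), OA as the fraction of all correct decisions, and MA as the mean of TPR and TNR. The function name and the counts in the example are hypothetical.

```python
from typing import Dict


def matching_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute TPR, TNR, OA, and MA from confusion counts (standard definitions assumed)."""
    positives = tp + fn  # pairs whose true label is "matching"
    negatives = tn + fp  # pairs whose true label is "mismatching"
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    tpr = tp / positives
    tnr = tn / negatives
    oa = (tp + tn) / (positives + negatives)
    ma = (tpr + tnr) / 2.0
    return {"TPR": tpr, "TNR": tnr, "OA": oa, "MA": ma}


# Hypothetical counts with unbalanced classes, where OA and MA differ.
result = matching_metrics(tp=90, fp=20, tn=30, fn=10)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With unbalanced classes OA is dominated by the larger class, while MA weights both classes equally, which is why reporting both is informative when the random flag does not produce exactly equal class counts.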
Table III shows the results according to the operation used in the AVIR block. (*) denotes broadcast element-wise multiplication. We expected broadcast element-wise multiplication to act directly on the features, and (+) was added to avoid excessive data loss. This is because, when the AVIR block is applied to the feature directly through broadcast element-wise multiplication, most values except those at the edges become close to 0, and (+) can prevent these values from vanishing. In this experiment, applying only broadcast element-wise multiplication gave the best performance. We used batch normalization, which kept the feature values from becoming completely zero, so learning could proceed through edge emphasis. Figure 15 shows the features at each stage of the AVIR block. At every attention block, the vertical features are from (2) and the horizontal features are from (1). The features labeled as area are from (4). Looking at these features intuitively, it can be confirmed that edge information is detected in the first AVIR block. When the filter is elongated, the network learns edge information by itself. As the network deepens, various convolution layers are applied, and the shape of the feature extracted from the AVIR block is transformed. We extract the features from every attention block, and the feature sizes are 128×128, 64×64, 32×32, 16×16, and 8×8 from the left. The output of the first AVIR block is an edge-emphasized image of the visible and TIR inputs. The network learns the shape of the edges automatically, and the visualization confirms this. Table IV compares the results according to the loss. The smoothing term of the proposed loss was found to have a large effect. From a machine learning point of view, hard integer labels of exactly 1 and 0 do not occur as probabilities. Therefore, using the smoothing term resulted in good performance.
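The effect of a smoothing term can be illustrated with a generic label-smoothed binary cross-entropy. The sketch below is not the proposed loss itself, whose exact form is defined elsewhere in the paper; the smoothing factor eps = 0.1 and the function names are assumptions made for the example.

```python
import math


def smooth_label(y: float, eps: float = 0.1) -> float:
    """Move a hard label (0 or 1) toward 0.5 by the smoothing factor eps."""
    return y * (1.0 - eps) + 0.5 * eps


def bce(p: float, target: float) -> float:
    """Binary cross-entropy between a predicted probability p and a target."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


soft_one = smooth_label(1.0)  # about 0.95 when eps = 0.1
# With a hard label the loss keeps decreasing as p approaches 1,
# whereas with the smoothed label it is lowest near p = soft_one.
for p in (0.90, 0.95, 0.999):
    print(f"p={p}: hard={bce(p, 1.0):.4f}  smoothed={bce(p, soft_one):.4f}")
```

With the smoothed target, an overconfident prediction such as p = 0.999 is penalized more than p = 0.95, which discourages the network from pushing its outputs to exactly 0 or 1.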
Table V shows the results according to the filter size of the AVIR block. Heuristically, the highest matching rate was obtained with the two filter-size parameters set to 5 and 1, respectively. Using the AVIR block gives better performance than not using it. In the test, the OA with the AVIR block at these settings was about 8.401% higher than without the AVIR block. Table VI compares the matching networks. The (×) mark in the table means that the network could not learn the data. In this experiment, AVIRNet obtained the best result. In their original experiments, the Dense-based network [2], 2ch-2stream [32], and TS-Net [29] were all trained to match visible and NIR images, both in the reflection band, whereas 2ch [3] and Domain Siamese [66] were trained on visible and TIR. Although we ran the Domain Siamese network [66] under the same settings, learning did not proceed. The data used in that work have clearer boundaries and clearer characteristics than ours, because the experiment was conducted using [60], [61], which contain clear temperature differences in an indoor space. In addition, the TIR data used there, which center on people, carry different information from the aerial data used in this paper. We also ran experiments using data of the input size used in each paper, but none of the networks learned. Furthermore, even networks that learned with 128×128 data did not learn when the data were resized to the size used in each paper. Since the input was 128×128, we could also run the experiment with classifier networks. To experiment with classifiers for different input sizes, we removed the flatten function before the fully connected layer of all networks, as well as the multiple fully connected layers after the convolutional layers. We replaced these fully connected layers with a single fully connected layer using global average pooling.
This change was made because the input size differs between networks and because the networks did not learn when the flatten-based fully connected layer of the existing matching networks [2], [3], [32] was used. For VGG 5, 6, and 7, our research team confirmed that a network with a single fully connected layer using flatten did not learn, so we obtained the results shown in Table VII by using global average pooling instead of vector flatten. In addition, we confirmed that 2ch-2stream [32] learned after the fc layer using flatten was replaced with an fc layer using global average pooling and the convolution filter size was changed from 5×5 to 3×3. Because vector flatten ignores channel information and spatial information, we do not consider it suitable for learning complex data and large images. VGG includes 3 fc layers in the network itself, but they were removed and replaced with 1 fc layer using global average pooling.
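The practical difference between flatten and global average pooling can be seen in a small sketch. The feature maps below are toy nested lists with hypothetical sizes, not activations of the networks in Table VII, and the helper names are introduced only for this example.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def make_feature_map(channels: int, height: int, width: int) -> FeatureMap:
    """Toy feature map in which every value of channel c equals c."""
    return [[[float(c)] * width for _ in range(height)] for c in range(channels)]


def flatten(fmap: FeatureMap) -> List[float]:
    """Concatenate all values; the length grows with the spatial size."""
    return [v for channel in fmap for row in channel for v in row]


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over its spatial positions; one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmap
    ]


for size in (8, 4):
    fmap = make_feature_map(channels=4, height=size, width=size)
    print(size, len(flatten(fmap)), len(global_average_pool(fmap)))
# flatten yields 256 and 64 values, while global average pooling yields 4 values in both cases
```

Because the pooled vector always has one entry per channel, a single fully connected layer with a fixed number of inputs can follow the convolutional layers regardless of the input size.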
We also constructed the networks in Table VII to test how the number of convolution layers in the existing classification networks affects matching. VGG 5, 6, and 7 do not exist in the official paper. They use the first 2, 3, and 4 convolution blocks (convolution + batch normalization + ReLU) of VGG9 with 2, 3, and 3 max pooling layers, respectively. Likewise, ResNeXt with 11, 20, 38, 47, and 56 layers do not exist as official models. The names 11, 20, 29, 38, 47, and 56 denote the use of 1 to 6 residual blocks of ResNeXt29, respectively. The experiment confirmed that, unlike in classification, a large number of FLOPs and parameters does not lead to good results in matching, and that VGG19, ResNeXt38, and DenseNet121 perform well among the existing networks. However, since a matching point must be found in a large image, the number of network parameters and the number of floating point operations (FLOPs) have a large effect on the time needed to find it. AVIRNet is 1.792% higher in validation OA and 1.897% higher in test OA than VGG19, which is the second highest. The multiply-accumulate (MAC) count is 4.89 GMac for VGG19 and 0.83 GMac for AVIRNet, a 5.89× difference. The number of parameters is 20.04M for VGG19 and 3.91M for the proposed network, a 5.12× difference. Next, we examine the robustness of the edge module in matching by comparing it with existing attention modules. The attention blocks used for comparison are SE [54], which is channel attention, and BAM [55] and CBAM [56], which use channel attention and spatial attention at the same time. Each was tested by substituting it for the AVIR block of AVIRNet, and the AVIR block obtained the highest matching rate. This result indicates that emphasizing edge information is more efficient in matching than emphasizing spatial information. Figure 16 shows the matching score map of a sliding window on the test dataset.
We run a sliding window to visualize the matching score map. The stride of the sliding window is 2 pixels, and the result shows the highest value at the center point. We do not recommend training with too short an r distance in the data loader. A pair with a very small r is treated as nonmatching, but although its label is set to 0, it may actually be a positive because of the image pixel error, and the resulting value can be larger than the threshold value of 0.5. In conclusion, the highest value was obtained at the center point (128, 128) in this example.
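The sliding-window procedure can be sketched as follows. The trained matching network is replaced here by a hypothetical stand-in, toy_score, which returns 1 for identical patches and smaller values as the mean absolute difference grows; the 16×16 search image and 4×4 patch are toy sizes chosen so that the example runs quickly.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]


def crop(img: Image, top: int, left: int, size: int) -> Image:
    """Cut a size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def toy_score(a: Image, b: Image) -> float:
    """Stand-in for the matching network: 1.0 for identical patches."""
    n = len(a) * len(a[0])
    mad = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n
    return 1.0 / (1.0 + mad)


def score_map(search: Image, ref: Image,
              score_fn: Callable[[Image, Image], float],
              stride: int = 2) -> List[List[float]]:
    """Score the reference patch against every window of the search image."""
    size = len(ref)
    tops = range(0, len(search) - size + 1, stride)
    lefts = range(0, len(search[0]) - size + 1, stride)
    return [[score_fn(ref, crop(search, t, l, size)) for l in lefts] for t in tops]


def argmax_2d(scores: List[List[float]]) -> Tuple[int, int]:
    """Grid index (row, column) of the highest score."""
    _, i, j = max((v, i, j) for i, row in enumerate(scores) for j, v in enumerate(row))
    return i, j


# Toy search image with unique pixel values; the reference patch is cut at (6, 4).
search = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
ref = crop(search, 6, 4, 4)
scores = score_map(search, ref, toy_score, stride=2)
row, col = argmax_2d(scores)
print((row, col), (row * 2, col * 2))  # (3, 2) (6, 4)
```

The grid index of the peak multiplied by the stride gives the pixel offset of the best-matching window, which corresponds to the center-point peak reported for the score map.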

VI. CONCLUSION
With a large input, a network can handle registration through learned features rather than edge information. Therefore, a matching result that is strong on the validation set can be low on a test set that was not used for learning, and such results may cause unexpected problems in automatic registration. We applied the edge attention module to construct a network that derives robust matching results even for unlearned data, although the input is larger than that of the existing matching networks. In addition, we proposed a matching network suitable for flight data matching and, with only 11 convolution layers, obtained 1.897% higher overall matching accuracy than VGG19, which performed best among the existing classification networks. Efficiently reducing the network parameters while judging matches better than existing deeper networks through the attention module is meaningful because it resembles human visual perception. As a result, the proposed block showed 6.233% higher performance than the SE [54] block, which showed the best performance among the compared attention modules. We expect the matching rate to increase not only through the addition of convolutions in various framework configurations of the matching network, but also through the use of the attention module. This process is expected to make it easier to design a fusion model for detection by preprocessing EO/IR data to find matching points. Recently, transformers have shown excellent performance in object classification and detection. They outperform CNN-based technology when trained on large amounts of data. With the advancement of these technologies, the matching of heterogeneous images should also be studied in the direction of deriving high accuracy by learning from large amounts of data.
This is because edges are harder to detect at night than in the daytime, and light scattering is enough to prevent common edges from being detected across heterogeneous images. Thus, deriving more precise results by learning from large amounts of data should yield excellent performance in day and night surveillance and reconnaissance. Our research team plans to conduct research using transformers in the future.