Multiscale Image Matching for Automated Calibration of UAV-Based Frame and Line Camera Systems

Unmanned aerial vehicles (UAVs) equipped with integrated global navigation satellite systems/inertial navigation systems together with frame and/or line cameras are used for a variety of applications. Geometric system calibration is crucial for delivering accurate products from UAV-based imaging systems. This article presents automated geometric calibration strategies for UAV-based frame and line camera systems to estimate accurate system calibration parameters without the need for ground control points or manual measurements of tie points. The matching strategy used in this article to establish conjugate features among overlapping frame camera images is based on a traditional Structure from Motion technique augmented with several layers of matching outlier removal. On the other hand, a new strategy relying on ortho-rectified images is introduced for automated feature matching in line camera scenes. Then, a general bundle adjustment procedure with system calibration capabilities for frame and line cameras is presented, where the derived automated tie points are used for estimating accurate geometric system calibration parameters. The proposed approach is evaluated using four datasets—two datasets captured by frame cameras and two datasets captured by line cameras. The results show that the developed automated calibration strategy is capable of producing the same level of absolute accuracy when compared to using manually measured tie points for both frame camera and line camera systems. Results also indicate that the presented automated system calibration approach can be applied to systems even with significant deviation of actual system parameters from their nominal values, and still produce accurate estimates of calibration parameters.

I. INTRODUCTION

Unmanned aerial vehicles (UAVs) equipped with imaging sensors are used in a variety of applications for accurate 3-D reconstruction. UAVs provide ease of deployment and a variety of choices for imaging and GNSS/INS sensors [1]. Among potential applications, the use of UAVs equipped with RGB and hyperspectral imaging systems in precision agriculture [2]-[5] has expanded rapidly in recent years due to their relatively low cost and ability to provide data with high spectral, spatial, and temporal resolutions [6]. However, to produce accurate 3-D data, system calibration is needed for consumer-grade UAV imaging systems. In this article, "consumer-grade" refers to UAV systems that are equipped with relatively low-cost GNSS/INS units and imaging systems that require system calibration by the user instead of by the manufacturer or in high-end laboratory settings.
Beyond estimating the internal characteristics of the different sensors, geometric system calibration includes spatial calibration, i.e., determining the mounting parameters relating the GNSS/INS unit and camera coordinate systems, as well as temporal calibration, i.e., estimating the time delay between the GNSS/INS-recorded event markers and the actual time of image exposure. Since geometric system calibration is vital to achieving higher levels of accuracy, it has been a major focus of existing research. Spatial and temporal system calibration has been performed on a variety of frame camera imaging sensors, with many approaches requiring ground control points (GCPs) to achieve the desired accuracy [7], [8]. Establishing such ground targets is expensive and labor intensive, and more importantly, the distribution and number of GCPs are usually less than optimal for providing adequate control for determining system calibration parameters. To overcome this limitation, recent research has focused on UAV-based RGB camera [9] and hyperspectral scanner [10] system calibration without the need for GCPs. Yet, even without using GCPs, those techniques have only been applied using manually measured tie points in overlapping images. Manual tie point measurements are known to be accurate but are very time consuming and labor intensive. One should note that although several studies focusing on automated identification of tie features among overlapping images have been introduced in the literature, the majority of these approaches target applications such as image registration, object detection, and multiview 3-D reconstruction. To the best of the authors' knowledge, automated feature extraction and matching for UAV-based system calibration has not been considered by existing research. Due to the excessive volume of data collected by hyperspectral cameras, the vast majority of these sensors adopt a single line of detectors in their focal plane (i.e., they are line cameras).
On the other hand, the majority of RGB and thermal cameras onboard UAV systems use a frame imaging mechanism. Frame and line camera systems capture imagery through different data acquisition mechanisms, as depicted in Fig. 1. The main difference between frame and line cameras is that a frame camera scene is captured in one instance by a single image, whereas a line camera scene is built by concatenating multiple images captured sequentially. Here, a scene is considered a 2-D coverage of an area on the ground, while an image represents the output of a single exposure of the light-sensitive elements (pixels) in the image plane. Given this geometry, a frame camera's image coordinates consist of x and y values that vary with the image point location, bounded by the sensor's angular field of view (AFOV). For a line camera, the y image coordinate varies, but x is always constant, depending on the sensor placement in the image plane (for the line camera, the x-axis of the image coordinate system is aligned along the flight direction). These acquisition mechanisms cause the automated tie point generation and system calibration process to differ for frame and line camera systems.
For a robust estimation of system calibration parameters with minimum correlation between such parameters, specific flight missions are required. As discussed in [9], to conduct spatial and temporal calibration, it is recommended to derive the system parameters using opposite flying directions at different flying heights, as well as variation in the linear and angular velocities. In addition to potential differences among overlapping UAV images, such as changes in illumination, viewpoint, and distortions, data acquisition from multiple flying heights results in imagery with varying scale. Such factors adversely affect automated feature extraction and matching in frame and line camera imagery. Fig. 2 illustrates raw imagery acquired at different flying heights. Although several scale-invariant image matching algorithms such as the scale invariant feature transform (SIFT) [11] and speeded-up robust features (SURF) [12] have been developed, when dealing with UAV-based images with significant scale differences, these algorithms result in a high percentage of outliers. To illustrate the impact of scale difference on the automated matching process, three different stereo-pairs acquired by a frame camera from 1) 40-m flying height, 2) 20-m flying height, and 3) 40- and 20-m flying heights are considered for image matching. The selected stereo-pairs are shown in Fig. 3. Table I shows the matching results for the three stereo-pairs. In this table, initial matches are based on the SIFT algorithm, with the number of inliers derived through random sample consensus (RANSAC)-based outlier removal [13]. As reported in Table I, the number of initial correspondences identified in images with different scale is significantly smaller than that for images at the same scale. In addition, while only 14%-18% of initial SIFT matches are detected as outliers for stereo-pairs at the same scale, this value jumps to 37% when using images captured from different flying heights.
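The hypothesize-score-keep-best logic behind the RANSAC inlier counts in Table I can be illustrated on synthetic data. The article's pipeline uses epipolar models (e.g., the Nistér five-point algorithm introduced later); purely for illustration, the sketch below substitutes a simpler scale-plus-translation model mimicking a 40-m vs. 20-m flying-height pair. All data, thresholds, and iteration counts are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic correspondences: 80 inliers related by a 2x scale change (a 40 m
# vs. 20 m flying-height pair) plus a translation; 40 gross outliers.
n_in, n_out = 80, 40
src = rng.uniform(0, 1000, (n_in + n_out, 2))
dst = 2.0 * src + np.array([50.0, -30.0])
dst[n_in:] = rng.uniform(0, 2000, (n_out, 2))   # corrupt the last 40 pairs

def ransac_scale_translation(src, dst, iters=500, tol=3.0):
    """Hypothesize a scale+translation from 2 random correspondences, score
    by the number of pairs within tol pixels, and keep the best inlier mask."""
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(src), size=2, replace=False)
        base = np.linalg.norm(src[j] - src[i])
        if base < 1e-9:
            continue                                  # degenerate sample
        s = np.linalg.norm(dst[j] - dst[i]) / base    # scale hypothesis
        t = dst[i] - s * src[i]                       # translation hypothesis
        resid = np.linalg.norm(s * src + t - dst, axis=1)
        mask = resid < tol
        if mask.sum() > best.sum():
            best = mask
    return best

inliers = ransac_scale_translation(src, dst)
```

With this setup, the returned mask recovers essentially all 80 true inliers while rejecting the gross outliers, mirroring how the inlier/outlier split in Table I is obtained.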
Incorporating such matching results in the bundle adjustment (BA) with self-calibration process can lead to inaccurate system parameters. Hence, to increase the robustness of the reconstructed object space when dealing with image datasets exhibiting significant scale differences, removing matching outliers is a crucial step.
In addition to the challenges arising from scale differences in imagery, in line camera systems, turbulence in the UAV trajectory means that concatenating sequential scan lines [as shown in Fig. 1(b)] results in wavy patterns in the generated scenes, which further complicates the process of automated image matching. One should note that although a gimbal mount could mitigate such wavy patterns in line camera scenes, it is not capable of handling large payloads such as the coaligned cameras used in this research. Fig. 4 shows two samples of overlapping line camera scenes captured from the same flying height (40 m) but in two different flying passes, along with the corresponding zoomed-in regions. As can be seen in Fig. 4, raw scenes exhibit a wavy pattern and, consequently, corresponding features look quite different [see Fig. 4(c) and (d)]. Hence, for reliable extraction of conjugate features in line camera scenes, the impact of the wavy pattern needs to be considered.
This article focuses on fully automated spatial and temporal calibration of frame and line cameras onboard GNSS/INS-assisted UAV imaging systems without the need for GCPs, while using a general BA framework. First, a review of existing work dealing with spatial and temporal calibration for frame and line camera systems is presented. Next, the system specifications and dataset descriptions are provided. The methodology starts by describing the Structure from Motion (SfM) approach used for automated feature extraction and matching for frame cameras, focusing on the matching outlier removal strategies. Next, a new automated feature matching strategy for line camera systems is discussed, which applies the SIFT algorithm to partially ortho-rectified images instead of raw images. Here, the term "partially" refers to the fact that only a rough digital surface model (DSM) as well as nominal system calibration parameters are used in this procedure. Then, the general mathematical model used for the GNSS/INS-assisted BA with system self-calibration for both frame and line cameras is introduced. The experimental results for frame and line cameras onboard UAV systems are then presented, where the ability of the proposed approaches to identify a sufficient number of well-distributed tie features is evaluated. Through these experiments, the resulting system calibration parameters are investigated using four UAV datasets. Last, conclusions and recommendations for future work are discussed.
II. RELATED WORK

Spatial and temporal system calibration is essential for accurate 3-D reconstruction from consumer-grade UAV imaging systems and has led to much research over the years. Spatial calibration has been extensively covered in past research [14]-[18]. However, temporal calibration has been addressed to a lesser extent. Chiang et al. [19] proposed a two-step geometric calibration method for a fixed-wing UAV-based imaging system. First, exterior orientation parameters (EOP) were obtained through indirect georeferencing with the help of GCPs. Then, the differences between such EOP and the interpolated GNSS/INS position and orientation of the inertial measurement unit (IMU) body frame were derived. Those differences were then used in a calibration algorithm to solve for lever arm components, boresight angles, and time delay. Blázquez [7] introduced a one-step approach for spatio-temporal calibration of manned-aircraft multisensor systems. This approach focused on modifying the sensor model to include a time synchronization parameter while using specific flight and GCP configurations. Rehak and Skaloud [8] addressed system calibration for consumer-grade cameras on micro aerial vehicles. In their approach, lever arm components were derived using a static field calibration introduced in [20], and then camera interior orientation parameters (IOP) and boresight angles were estimated in a self-calibration procedure using GCPs. Next, given the estimated camera IOP and spatial system calibration parameters, the authors modified the mathematical model to include time delay as a parameter in the BA procedure. Yet, all the above system calibration techniques required GCPs for acceptable absolute accuracy and focused on frame camera systems. LaForest et al. [9] presented two approaches, direct and indirect, for estimating geometric calibration parameters for UAV-based frame camera systems without the need for GCPs.
The direct approach modified the existing collinearity equations to incorporate time delay as a system parameter along with lever arm components and boresight angles while using manually derived tie points. On the other hand, in order to use existing bundle block adjustment packages, the indirect approach exploited a bias impact analysis of the system parameters and their correlations to first solve for time delay and then estimate the spatial calibration parameters. Even though several studies have addressed spatial and temporal calibration of frame cameras, research on line camera system calibration has primarily focused on spatial calibration. Kocaman et al. [21] focused on modifying the adjustable parameters for self-calibrating triangulation of airborne linear array cameras. Habib et al. [10] performed rigorous boresight calibration for a hyperspectral line camera equipped with a GNSS/INS unit onboard a UAV.
System calibration using a bundle block adjustment requires image coordinate measurements of tie points, among other potential observations and unknowns. Manually established tie points are known to be accurate but very time consuming to obtain. To address this challenge, some studies have focused on self-calibration while reducing or eliminating the need for manual measurements. Habib et al. [14], [22] focused on LiDAR and frame camera self-calibration using manually derived straight lines and planar patches. Recent research has also focused on SfM techniques for automated in-situ estimation of camera internal parameters. Gneeniss et al. [23] developed an in-flight camera IOP refinement procedure using a least squares surface matching algorithm to align SfM-derived point clouds to LiDAR control points. In a similar study, Zhou et al. [24] used photogrammetric points as well as LiDAR data for refining camera parameters without the need for GCPs. However, there is still a gap when it comes to automated spatial and temporal calibration for frame and line camera systems without the need for GCPs. Moreover, to the best of the authors' knowledge, no existing work has focused on a unified BA engine for aerial triangulation of frame and line camera systems. This article presents a general, fully automated BA with self-calibration, including spatial and temporal system parameters, for frame and line camera systems without the need for GCPs.

III. SYSTEM SPECIFICATIONS AND DATASET DESCRIPTION
Feasibility of the proposed research is validated using datasets collected by two UAV systems: a Dà-Jiāng Innovations (DJI) M200 [25] and a DJI M600 Pro [26]. Both systems relied on the Applanix APX-15 UAV v2 GNSS/INS unit for direct georeferencing, with a predicted positional accuracy of 2-5 cm and heading and roll/pitch accuracies of 0.080° and 0.025°, respectively [27]. The M200 platform carried coaligned FLIR Duo Pro R 640 RGB and thermal frame cameras. The RGB sensor array size was 4000 × 3000, with a pixel size of 1.85 μm, a nominal focal length of 8 mm, and an AFOV of 56° [28]. The uncooled VOx microbolometer thermal sensor array was 640 × 512, with a pixel size of 17 μm, a nominal focal length of 19 mm, and an AFOV of 32°. Fig. 5 shows the FLIR Duo Pro R and APX-15 configuration onboard the M200 UAV and illustrates the coordinate systems for the IMU body and camera frames. The FLIR Duo Pro R utilized a mobile-phone-based app to set camera parameters via Bluetooth. The app provides the ability to set the capture interval for the camera and to start and stop triggering. Event feedback to the APX-15 was provided directly by the FLIR Duo Pro R using the "Frame Sync" option. This option outputs a low-voltage transistor-transistor logic (3.3 V) pulse that was wired directly to the event input of the APX-15.
The DJI M600P platform was equipped with two different line camera sensors: Headwall's visible and near-infrared (VNIR) and shortwave infrared (SWIR) sensors. The VNIR line camera covers 270-273 spectral bands ranging between 398 and 1000 nm with a band width of 2.2 nm. The scan line consists of 640 pixels (samples) with a detector pitch of 7.5 μm [29]. This sensor has a focal length of 8.2 mm, resulting in a field of view of approximately 31° along the scan line. The SWIR line camera covers 166 spectral bands ranging between 900 and 2500 nm with a band width of 9.6 nm. This sensor consists of 384 pixels (samples) along the scan line and has a pixel pitch of 24 μm [30]. The focal length of the SWIR line camera is 24.6 mm, resulting in a field of view of 21° along the scan line.
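The quoted fields of view follow from the array size, detector pitch, and focal length via AFOV = 2·arctan(w / 2f), where w is the physical width of the detector array. A minimal check with the values from this section (the VNIR result lands near 32.6° rather than the quoted ≈31°, a discrepancy plausibly due to rounding or the exact active line length):

```python
import math

def afov_deg(n_pixels, pitch_um, focal_mm):
    """Across-array angular field of view: 2 * atan(array_width / (2 f))."""
    width_mm = n_pixels * pitch_um / 1000.0
    return math.degrees(2.0 * math.atan2(width_mm / 2.0, focal_mm))

thermal = afov_deg(640, 17.0, 19.0)   # FLIR thermal, ~32 degrees
vnir = afov_deg(640, 7.5, 8.2)        # Headwall VNIR scan line, ~32.6 degrees
swir = afov_deg(384, 24.0, 24.6)      # Headwall SWIR scan line, ~21.2 degrees
```

The thermal and SWIR results reproduce the quoted AFOVs of 32° and 21°.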
An image of the DJI M600P equipped with the VNIR and SWIR line cameras, along with the orientation of the sensor and IMU body coordinate systems, is shown in Fig. 6. Table II outlines the nominal lever arm components and boresight angles for all the systems used in this article. The M200 frame camera system and the M600P line camera system each had one collection date over study sites located at Purdue's Agronomy Center for Research and Education. Table III shows the flight parameters and dataset description for each imaging system. The M200 UAV system was flown over an agricultural test field with five checkerboard targets, and the line camera datasets were acquired over a study site with different geomorphic features, i.e., grass, pavement, and crops, with 11 checkerboard targets and several highly reflective and hut-shaped targets [see Fig. 7(c) and (d)]. While the checkerboard targets were used as check points for absolute accuracy assessment in the experimental results, the remaining ground targets were only deployed for a LiDAR calibration experiment, which is not the focus of this article. The centers of the checkerboard targets were surveyed by a Topcon GR-5 GNSS receiver with an accuracy of 2-3 cm. It should be noted that the image coordinates of the check points were manually measured. Some sample images/scenes captured by the frame/line cameras are illustrated in Fig. 7. As shown in this figure, all datasets exhibit repetitive image patterns with significant scale differences. These are even more apparent in the thermal frame and SWIR line datasets, as they have lower spatial resolution and operate in spectral bands that result in low-textured images/scenes. In addition, from Fig. 7, one can note that the RGB frame camera has a much larger AFOV than the thermal frame sensor. Similarly, the AFOV of the VNIR scanner is larger than that of the SWIR line camera. Fig. 8 outlines the flight trajectory and configuration of check points for both frame and line camera datasets.

IV. METHODOLOGY
In this section, the proposed strategies for automated image feature extraction and matching for frame and line cameras onboard UAV remote sensing systems are introduced. The matching strategy used in this article to establish conjugate features among overlapping frame camera images is based on a traditional SfM technique augmented with several layers of matching outlier removal. On the other hand, by taking advantage of partially ortho-rectified images generated for each flight line, a new strategy for automated feature matching for line camera systems is introduced. Last, a general BA procedure for system calibration of frame and line cameras is presented.

A. Automated Feature Extraction and Matching for Frame Camera Systems
In this article, a graphics processing unit (GPU)-based implementation of the SIFT algorithm, known as SiftGPU [31], is used to efficiently extract image features along with their descriptors in a large set of frame camera images. As discussed in Section I, although the SIFT algorithm is designed to be invariant against scaling, in practice this technique results in a high percentage of outliers when dealing with images acquired at significantly different flying heights with repetitive image patterns. In order to increase the reliability of established conjugate SIFT features under such challenging conditions, this article implements four layers of matching outlier removal (see Fig. 9).
1) Forward/backward consistency check: Considering two overlapping images, hereafter denoted as the left and right image, left-to-right (forward) matching is first conducted and an initial set of conjugate features is established. A given pair (p1, p2) is then considered valid only if the same pair is established in the right-to-left (backward) matching. This step removes obvious matching outliers.
2) RANSAC-based ROP estimation: The Nistér five-point algorithm [32] is augmented with a RANSAC [13] framework to simultaneously estimate relative orientation parameters (ROPs) and remove potential matching outliers using epipolar geometry constraints.
3) Incremental outlier removal: To remove further matching outliers arising from repetitive patterns and scale differences in the imagery, this step of the proposed SfM technique implements an incremental EOP recovery process based on the work presented by He et al. [33]. This EOP recovery step starts by establishing a local coordinate system from a selected stereo-pair, and then sequentially augments the remaining images into the final image block. Image augmentation is achieved using the linear rotation averaging and translation averaging techniques introduced by He et al. [33]. The matching outlier removal is conducted within the translation averaging process, where conjugate points that exhibit large back-projection residuals are detected and removed. Interested readers can refer to [33] for more details about the employed incremental EOP recovery strategy.
4) BA-based outlier removal: In the last step of the employed SfM framework, a BA process is incorporated with an iterative procedure for detecting and removing remaining matching outliers. Prior to conducting the BA process, all remaining SIFT features (i.e., those surviving the prior outlier removal strategies) are tracked among all the involved imagery, and their approximate object coordinates are derived using a multi-light-ray intersection procedure. Next, using the SIFT features, their corresponding object coordinates, and camera EOP, a BA process is conducted to minimize the back-projection residuals for all image tie points. At this stage, most SfM pipelines that conduct BA-based outlier removal, such as COLMAP [34], simply remove points with large residuals. Although the huge redundancy and geometric strength of the BA process can be assumed to map matching outliers to the corresponding image coordinate residuals, there might be situations where inlier leverage points (i.e., sparse points in key geometric locations) end up with large residuals. To avoid such situations, this article conducts a chi-squared statistical test, where a standardized test of the residuals, considering their variance-covariance matrix, is used for removing potential outliers. After removing these outliers, the BA is executed again to refine image EOP as well as the reconstructed object points. This procedure is repeated until no points with residuals larger than a predefined threshold remain.
It is worth mentioning that the implemented multiscale image matching strategy for frame camera systems does not exploit any information from the onboard GNSS/INS unit. Thus, this approach is capable of identifying conjugate features even in systems with large deviations of the actual calibration parameters from their nominal values.
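The first of these layers, the forward/backward consistency check, amounts to keeping only pairs that are mutual nearest neighbors in descriptor space. A minimal numpy sketch with toy 2-D descriptors (real SIFT descriptors are 128-D, but the logic is identical):

```python
import numpy as np

def mutual_matches(desc_left, desc_right):
    """Keep only (left, right) pairs that are nearest neighbors in BOTH
    the forward (left->right) and backward (right->left) directions."""
    # Pairwise Euclidean distances between the two descriptor sets
    d = np.linalg.norm(desc_left[:, None, :] - desc_right[None, :, :], axis=2)
    fwd = d.argmin(axis=1)   # forward nearest neighbor of each left feature
    bwd = d.argmin(axis=0)   # backward nearest neighbor of each right feature
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

# Toy descriptors: left rows 0..2 correspond to right rows 2, 0, 1;
# left row 3 has no true counterpart and should be filtered out.
left = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
right = np.array([[0.0, 1.1], [1.0, 1.1], [1.1, 0.0]])
matches = mutual_matches(left, right)
```

Here the unmatched left feature is rejected because its forward nearest neighbor points back to a different left feature.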
Fig. 10 shows sample 3-D point clouds generated from 1) Pix4D Mapper Pro and 2) the implemented multiscale approach for the RGB frame dataset acquired at 20- and 40-m flying heights. As illustrated in Fig. 10, although Pix4D Mapper Pro is able to generate more points in the object space, its 3-D reconstruction accuracy is lower than that of the implemented approach. More specifically, from the perspective view shown in Fig. 10, one can note that using Pix4D Mapper Pro, images captured at different flying heights result in two separate surfaces. This highlights the challenge of finding conjugate features between images with significant scale differences. One would expect even worse results from existing SfM strategies when dealing with multiscale thermal images due to their more challenging characteristics (e.g., lack of uniquely identifiable features). Therefore, we do not compare the performance of those approaches with the proposed multiscale matching strategy.

B. Automated Feature Extraction and Matching for Line Camera Systems
As mentioned earlier, in addition to the challenges caused by scale differences, wavy patterns in the scenes generated by UAV-based line camera systems further complicate the image matching problem. To overcome the feature matching problem in raw scenes with wavy patterns, the proposed framework establishes feature correspondences on partially ortho-rectified scenes of each flight line. An ortho-rectified scene is obtained by correcting the raw scenes for the nonsmooth trajectory as well as perspective and camera distortions, if available, such that the generated scene is correctly scaled and corresponds to a map projection throughout the scene. The proposed strategy for automated generation of tie features among line camera scenes is presented in Fig. 11. As shown in this figure, the proposed approach consists of four main steps: partial ortho-rectification of individual flight lines, window-based SIFT matching, SIFT feature tracking, and conversion of ortho-rectified coordinates to raw scene coordinates. Implementation details of these steps are discussed in the remainder of this subsection.
Generation of partially ortho-rectified scenes aims at removing the wavy pattern of the raw scenes so that corresponding features in different flight lines look similar. The employed ortho-rectification process is depicted in Fig. 12. As illustrated in this figure, in order to appropriately map intensities observed by the camera to their locations in the ortho-rectified scene, a rough elevation model (DSM) of the scene in question is required. Since the perspective distortions in line camera scenes are not severe when compared to those in frame camera imagery, possible errors in the used DSM will not significantly affect the quality of the partially ortho-rectified scene. In the first step, each DSM cell is initially assigned to its horizontally closest scan line; this closest scan line is the correct one only under the following assumptions:
1) The scanner is perfectly vertical.
2) There is no variability in the roll, pitch, or yaw of the scanner's attitude.
3) The flight direction is along the rows or columns of the DSM.
4) The scanner is placed vertically below the perspective center.
5) The scanner is precisely aligned in the direction normal to the flight line.
As a result of any deviation from these assumptions, the DSM cell would not be visible in its closest scan line. To deal with this issue, the scan line corresponding to each DSM cell is determined through an iterative procedure. More specifically, in the next step, scan line EOP and camera IOP are used to project each DSM cell onto the initial scan line using the collinearity equations. One should note that, as the scan line is vertically below the perspective center of the utilized camera, for a DSM cell to be visible in a scan line, the projected point must have an image coordinate component along the flight direction (e.g., the x coordinate) as close as possible to zero. The correct scan line corresponding to a given DSM cell is derived iteratively, where in each iteration, the projected point coordinate along the flight direction is used to determine the scan line for the next iteration. Once the correct scan line is determined, the pixel grey values of the raw scene at the projected point are assigned to the corresponding cell in the ortho-rectified scene. Also in this step, sample and line information, i.e., the derived coordinate across the flight direction (e.g., the y coordinate) and the scan line index, for each cell of the ortho-rectified scene is stored in a look-up table to be used later in the last step of the proposed matching strategy. Fig. 13 shows the partially rectified scenes corresponding to the raw scenes illustrated in Fig. 4. By comparing these two figures, one can observe that the wavy pattern of the raw scenes is not present in the ortho-rectified scenes. This significantly helps the image matching algorithm in identifying conjugate points among overlapping scenes from different flight lines.
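The iterative scan-line search can be illustrated with a strongly simplified one-dimensional setup: a near-nadir line camera flying along X over flat terrain, with a wavy pitch perturbation and small-angle collinearity. All numbers (focal length, flying height, pitch amplitude) are illustrative assumptions; the actual implementation uses the full scan-line EOP, camera IOP, and the DSM.

```python
import numpy as np

f = 8.2e-3                                   # focal length [m] (assumed, VNIR-like)
h = 40.0                                     # flying height above flat ground [m]
n = 200
line_x = np.linspace(0.0, 20.0, n)           # perspective-center X of each scan line [m]
pitch = 0.01 * np.sin(np.arange(n) / 7.0)    # wavy pitch [rad], mimicking turbulence

def x_image(cell_x, k):
    """Along-track image coordinate of a ground point in scan line k
    (small-angle collinearity; the cell is visible where this is ~0)."""
    return f * ((cell_x - line_x[k]) / h - pitch[k])

def find_scan_line(cell_x, max_iter=30):
    """Start from the horizontally closest scan line, then iterate: convert
    the along-track image residual back to a ground offset and jump to the
    scan line closest to the corrected position."""
    k = int(np.argmin(np.abs(line_x - cell_x)))
    for _ in range(max_iter):
        ground_target = line_x[k] + x_image(cell_x, k) * h / f
        k_new = int(np.argmin(np.abs(line_x - ground_target)))
        if k_new == k:
            break
        k = k_new
    return k
```

For a DSM cell at X = 10 m, the closest-line guess is a few scan lines off because of the pitch perturbation; the iteration drives the along-track image coordinate to a small fraction of the detector pitch.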
In the next step of the proposed framework, the SIFT detector and descriptor algorithm is applied to all partially ortho-rectified scenes. The traditional matching strategy establishes conjugate features by comparing each feature descriptor in one scene with all feature descriptors in the other scene. In some applications such as digital agriculture, due to repetitive image patterns, traditional feature matching algorithms often result in an insufficient number of matched features and/or matches with a high percentage of outliers [35]. Reducing the search space in such situations can mitigate some of the matching ambiguity caused by repetitive patterns and scale differences in imagery. As such, several approaches using data structure techniques and/or trajectory information have been introduced to improve matching for frame camera imagery [36], [37]. This article exploits the geo-location information of the partially ortho-rectified scenes to reduce the matching search space in line camera scenes. In other words, for each selected SIFT feature in the first scene, its predicted location in the second scene is estimated. Then, among all SIFT features in the second scene, only those located inside a search window centered at the predicted point location are considered as candidate conjugate features. The search window size can be determined according to the accuracy of the trajectory information and nominal system calibration parameters, as well as the quality of the utilized rough DSM. Finally, to remove obvious matching outliers, a forward/backward consistency check is conducted. Fig. 14 illustrates the abovementioned process for the two partially ortho-rectified scenes shown in Fig. 13. Once conjugate features are established among all stereo-pairs of ortho-rectified scenes, all SIFT matches are tracked among all involved flight lines.
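The window-based candidate selection can be sketched as follows. The scene origins, ground sampling distance (GSD), and window size are illustrative assumptions, and axis conventions (e.g., row direction vs. north) are deliberately glossed over.

```python
import numpy as np

def window_candidates(pt1_px, feats2_px, origin1, origin2, gsd, win_m=30.0):
    """Predict where a feature from ortho scene 1 falls in ortho scene 2 and
    keep only the scene-2 SIFT features inside a metric search window.

    Both scenes are partially ortho-rectified, so pixel coordinates map to
    ground coordinates through the scene origin and GSD (axis conventions
    simplified: no row flip)."""
    ground = origin1 + gsd * pt1_px       # scene-1 pixel -> ground [m]
    pred = (ground - origin2) / gsd       # ground -> predicted scene-2 pixel
    dist_m = np.linalg.norm(feats2_px - pred, axis=1) * gsd
    return np.nonzero(dist_m <= win_m)[0], pred

# Toy example: one feature in scene 1, three candidate SIFT features in
# scene 2; the window keeps only the candidate near the predicted location.
idx, pred = window_candidates(
    np.array([100.0, 200.0]),
    np.array([[1105.0, 595.0], [300.0, 40.0], [1100.0, 1600.0]]),
    origin1=np.array([500.0, 1000.0]),
    origin2=np.array([450.0, 980.0]),
    gsd=0.05)
```

Only the descriptors of the surviving candidates are then compared, which is what suppresses ambiguous matches over repetitive crop rows.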
One should note that the coordinates of the established conjugate features have been determined with respect to the coordinate systems associated with the ortho-rectified scenes, while the triangulation process requires raw scene coordinates. Hence, using the look-up table generated during the ortho-rectification stage, sample and line indices are assigned to each tracked SIFT feature. Fig. 15 shows a sample tracked SIFT feature on the ortho-rectified scenes as well as the corresponding raw scenes. Visual inspection confirms good matching quality in both the ortho-rectified and raw scenes. One should note that, in contrast to the implemented multiscale matching strategy for frame cameras, the proposed matching algorithm for line camera scenes exploits the trajectory information provided by the onboard GNSS/INS unit. Thus, when dealing with line camera systems with significant differences between nominal and actual system calibration parameters, relaxed thresholds, e.g., a large search window size, might need to be selected. The robustness of the proposed multiscale matching strategy for line cameras under significant variations of the system calibration parameters from their nominal values is tested in the results and discussion section of this article.
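The conversion back to raw scene coordinates is a simple indexed lookup into the table stored during ortho-rectification. A toy sketch (array shapes and stored values are illustrative):

```python
import numpy as np

# Toy look-up table built during ortho-rectification: for each cell of the
# ortho-rectified scene it stores (sample, line) of the raw scene, with -1
# marking cells not covered by any scan line.
lut = -np.ones((4, 5, 2), dtype=int)
lut[1, 2] = (310, 57)   # ortho cell (row 1, col 2) <- raw sample 310, line 57
lut[3, 0] = (12, 501)

def ortho_to_raw(row, col):
    """Convert a matched ortho-scene cell back to raw-scene (sample, line)."""
    sample, line = lut[row, col]
    if sample < 0:
        raise ValueError("cell not covered by any scan line")
    return int(sample), int(line)
```

Each tracked SIFT feature is mapped this way, so that the subsequent triangulation operates on raw scan-line observations.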

C. Bundle Block Adjustment
This section presents a general BA procedure with geometric system calibration capabilities for frame and line cameras using automatically derived tie points. In this article, a vector connecting point "b" to point "a" relative to a coordinate system associated with point "b" is represented as r_b^a, and a rotation matrix transforming from coordinate system "a" to coordinate system "b" is represented as R_a^b. The mathematical model for the collinearity principle while considering spatial and rotational system calibration parameters is shown in (1). As mentioned earlier, while both the x and y components of the image coordinates r_i^c(t) in (1) vary with image point location in frame cameras, in line camera systems x is always constant, e.g., x = 0 for systems with the scan line vertically below the camera perspective center. Despite this difference in image coordinates, the overall collinearity equations shown in (1) remain unchanged for line cameras. An illustration of the collinearity equations for frame and line cameras is presented in Fig. 16, where r_I^m denotes the ground coordinates of the object point I and t denotes the camera exposure time. As mentioned earlier, this article aims at deriving both spatial and temporal calibration parameters for frame and line cameras. LaForest et al. [9] introduced the modified collinearity equations for frame camera systems that incorporate the time delay between the mid-exposure and the event marker recorded by the GNSS/INS unit, as shown in (2). More specifically, the position and orientation of the IMU body frame at the actual time of exposure t, i.e., r_b(t)^m and R_b(t)^m, are derived from those at the initial exposure time t_0, i.e., r_b(t_0)^m and R_b(t_0)^m, the instantaneous linear and angular velocities (ṙ_b(t_0)^m, ω̇_b(t_0), φ̇_b(t_0), κ̇_b(t_0)), and the time delay (Δt). The instantaneous linear and angular velocities of the IMU body frame are estimated by computing the variations in position and orientation over a user-defined time interval.
Interested readers can refer to [9] for more detailed information about the modified model.
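Since equations (1) and (2) are referenced but not reproduced in this excerpt, the following is a hedged sketch of their standard form, consistent with the notation defined above; λ_i(t) denotes the scale factor and r_c^b, R_c^b the lever arm and boresight, respectively. The exact expressions should be taken from the article and [9].

```latex
% Collinearity model (sketch of (1)); \lambda_i(t) is the point-dependent scale factor
r^m_I = r^m_{b(t)} + R^m_{b(t)}\, r^b_c + \lambda_i(t)\, R^m_{b(t)} R^b_c\, r^{c(t)}_i
% Time-delay propagation (sketch of (2)): pose at the actual exposure time t
% derived from the pose at t_0, the instantaneous velocities, and \Delta t
r^m_{b(t)} = r^m_{b(t_0)} + \dot{r}^m_{b(t_0)}\,\Delta t, \qquad
R^m_{b(t)} = R\!\left(\omega_{b(t_0)} + \dot{\omega}_{b(t_0)}\Delta t,\;
\phi_{b(t_0)} + \dot{\phi}_{b(t_0)}\Delta t,\;
\kappa_{b(t_0)} + \dot{\kappa}_{b(t_0)}\Delta t\right)
```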
This mathematical model is applicable to both frame and line cameras. A flowchart of the general BA procedure with system calibration capabilities for GNSS/INS-assisted frame and line cameras is shown in Fig. 17. As shown in this figure, the unknowns, i.e., lever arm components, boresight angles, time delay, and ground coordinates of the tie points, are estimated through an iterative procedure in a single bundle block adjustment. In addition, the position and orientation of the GNSS/INS unit at the time of frame/line camera exposure are used as pseudo-observations in the model, and consequently, are adjusted in the bundle block adjustment.
The exposure time of each frame camera image is recorded by the GNSS/INS event marker, whereas in line camera systems the exposure time is recorded for each individual scan line. In the first step of the general BA procedure shown in Fig. 17, the initial exposure time t_0 for each tie point is derived from the recorded event marker (frame cameras) or the timestamp of the corresponding scan line (line cameras). The initial exposure time t_0 is used as the starting time t in the first iteration. Then, the position and orientation of the IMU body frame at time t, as well as the instantaneous linear and angular velocities, are computed through interpolation using the trajectory information (with a 200 Hz data rate) provided by the GNSS/INS unit. In this process, linear interpolation is used for the positional components, whereas the rotational parameters are derived through spherical linear interpolation [38]. Next, the BA implements a least squares solution to determine the best estimates of the unknown parameters. Once an incremental time delay Δt is estimated, the initial exposure time is updated to t_0 + Δt. In other words, the position and orientation of the IMU body frame, as well as the instantaneous linear and angular velocities, at time t_0 + Δt are computed for the next iteration of the BA. The procedure is repeated until the incremental time delay is smaller than a predefined threshold. Finally, the total time delay is derived by summing the incremental time delays estimated in all iterations, and the spatial calibration parameters are taken from the last iteration of the BA process.
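The per-iteration trajectory interpolation step (linear for position, spherical linear for orientation) can be illustrated as follows. This is a minimal sketch, not the authors' implementation; the quaternion convention (w, x, y, z) and the two-epoch interface are assumptions for the example.

```python
import math

def lerp(p0, p1, u):
    # Linear interpolation for the positional components.
    return [a + u * (b - a) for a, b in zip(p0, p1)]

def slerp(q0, q1, u):
    # Spherical linear interpolation between unit quaternions (w, x, y, z).
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        q1, dot = [-c for c in q1], -dot  # take the shorter arc
    theta = math.acos(min(dot, 1.0))
    if theta < 1e-9:
        return lerp(q0, q1, u)  # orientations nearly identical
    s0 = math.sin((1.0 - u) * theta) / math.sin(theta)
    s1 = math.sin(u * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]

def interpolate_pose(t, t0, t1, pos0, pos1, quat0, quat1):
    # Interpolate the 200 Hz trajectory at exposure time t in [t0, t1]:
    # linear interpolation of position, slerp of orientation [38].
    u = (t - t0) / (t1 - t0)
    return lerp(pos0, pos1, u), slerp(quat0, quat1, u)
```

In each BA iteration, the updated exposure time t_0 + Δt would simply be fed back through such an interpolation to refresh the pose and velocity estimates.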
In order to estimate the system parameters with the least amount of control, previous research has derived the optimal/minimal flight configuration by considering how small changes in the system parameters impact the target function of the BA [9]. Since changes in the lever arm component in the Z direction do not impact the precision of the BA, i.e., wrong estimates of this component will not affect the BA target function, it cannot be estimated in the absence of vertical control [39]. Therefore, this parameter is not estimated in this article. For this article, the observations include the image coordinate measurements of the check and tie points. The image coordinate measurements for the check points were manually measured, whereas those for the tie points were automatically derived, as explained in the previous section.

V. RESULTS AND DISCUSSION
In this section, the experimental results for the RGB and thermal frame cameras as well as the VNIR and SWIR line cameras are presented. The collective objectives of the experimental results using these four datasets are as follows:
1) The first objective is to evaluate the ability of the proposed multiscale matching approaches to establish a sufficient number of well-distributed conjugate features among images/scenes acquired from different flying heights. All four datasets are included in testing this objective.
2) The next objective is to compare the estimated system calibration parameters and the absolute accuracy of the reconstruction when using manually measured check points against those obtained using SIFT-based tie points. All four datasets are used to test this objective.
3) The third objective tests the feasibility of the proposed automated calibration strategies for systems whose actual parameters deviate significantly from their nominal values, while still accurately estimating those parameters and the ground coordinates of check points. As mentioned earlier, the performance of the proposed strategy for frame cameras does not depend on the quality of the nominal system calibration parameters. Therefore, this objective is tested only for one of the line camera systems, i.e., the VNIR dataset.
4) The last objective is to examine the potential use of the proposed multiscale matching strategies in applications other than system calibration, e.g., SfM-based applications, where precise 3-D point clouds must be reconstructed from images captured at multiple heights. To test this objective, the relative accuracy of the reconstructed SIFT-based object points is evaluated for the four datasets.

Objective 1-Qualitative and Quantitative Sufficiency of Multiscale Matching Results for Frame and Line Cameras
In this section, multiscale matching results for the four datasets are evaluated in terms of the number of established tie features, the distribution of tie features, and the matching repeatability [see Fig. 7(a) and (b)]. Inspecting the matching repeatability criterion reported in Table IV, it can be observed that the average number of images in which an object point is visible is slightly higher for the RGB dataset than for the thermal one, i.e., 5.0 versus 4.8. This can be attributed to the higher percentage of overlap and side-lap (due to the larger AFOV) between the RGB images (see Table III). As explored before with the help of Fig. 7(c) and (d), scenes generated by the SWIR line camera are more challenging for establishing conjugate features than VNIR line camera scenes. However, as reported by the total number of conjugate points in Table IV, almost the same number of SIFT matches are identified for the two datasets. This can be explained by the implemented window-based image matching strategy for line camera scenes, as it reduces the matching ambiguity, which is more significant in the low-textured SWIR camera scenes. Also, the matching repeatability for the VNIR dataset is slightly higher than for the SWIR dataset due to the increased percentage of side-lap between neighboring VNIR scenes. It is worth mentioning that since two different study sites with different geomorphic features were used for the frame and line camera calibration experiments, no comment can be made on the comparative performance of the proposed multiscale matching strategies for frame and line cameras. However, assuming a similar flight configuration over the same study site, one would expect a smaller set of conjugate points in line camera datasets due to the absence of overlap among successive scan lines as well as the smaller field of view of such cameras. For evaluating the distribution of the established tie points, Fig. 18 illustrates the image/scene locations of all manually measured check points (in red) and automatically derived SIFT points (in blue) in the different images/scenes for the four datasets. One should note that the UAV speed at the 40 m flying height was twice its speed at 20 m (see Table III), and consequently, the number of scan lines generated at the 20 m flying height was almost double that at 40 m. Thus, the distributions of tie points for the line camera datasets are shown separately for the different flying heights [Fig. 18(c) and (d)]. As expected, in comparison with manual measurements of check points, the implemented multiscale strategies led to more points with a better distribution in all cases.
As mentioned earlier, to reduce the correlation among system calibration parameters, it is critical to identify tie points in images acquired from different flying heights. Fig. 19 demonstrates the ability of the proposed matching strategies to produce sufficient matches between the different flying heights. More specifically, this figure shows the percentage of SIFT-based object points established among images acquired at only a single flying height (20 m in blue and 40 m in orange) as well as the percentage of object points identified and matched among images captured at multiple flying heights (20 m and 40 m in green). The tie points corresponding to the green bar in Fig. 19 play the major role in avoiding correlations among the system calibration parameters, as they tie the two flying heights most effectively. As illustrated in Fig. 19, in all cases more than 50% of the SIFT-based object points are derived from both flying heights. Therefore, one can conclude that the proposed multiscale matching strategies for both frame and line camera systems can establish a sufficient number of features among images/scenes acquired from different flying heights.
In addition, as images acquired at the 40 m flying height have more ground coverage than images captured at 20 m, one might expect those images to include more distinctive features and, consequently, result in more SIFT-based conjugate points. However, a closer inspection of the results for the thermal and SWIR cameras in Fig. 19 reveals that for these datasets, the percentage of features established only in images acquired at the 20 m flying height is significantly larger than that at 40 m. This can be attributed to the fact that these cameras, as mentioned earlier, operate in spectral bands that produce images with a very low level of detail, a condition that deteriorates as the spatial resolution of the images decreases. Consequently, in images/scenes captured by thermal/SWIR cameras, fewer distinctive features are likely to be detected at the higher altitude, i.e., 40 m, and therefore, a smaller set of conjugate points can be established among those images.

Objective 2-Reconstruction Accuracy Using the Estimated System Calibration Parameters for Frame and Line Cameras
In this section, using the automatically derived tie points, the system calibration parameters and the absolute accuracy of the resulting reconstruction are estimated and compared with those obtained using manual measurements of check points. The nominal values and the estimated system calibration parameters using check points and SIFT points for the four datasets are presented in Table V. The standard deviation (STD) of each estimated parameter and the derived square root of the a-posteriori variance factor, σ_0, from the BA procedure are also reported. The σ_0 value indicates the quality of fit between the observations and the estimated unknowns as represented by the mathematical model of the collinearity equations. According to Table V, the σ_0 values for all datasets are smaller than 4 pixels, which indicates small image residuals resulting from the back-projection process using the estimated system calibration parameters. Although, as expected, the back-projection errors when using accurate manual measurements of check points are smaller than those resulting from automatically derived tie points, the derived σ_0 values using the proposed strategies are still well within the expected accuracy of SIFT-based conjugate points (for the BA process in this article, the accuracy of manual measurements and SIFT points is taken as 0.5 and 7 pixels, respectively). Looking into the system calibration parameters reported in Table V, comparable estimates are obtained using check points and SIFT points for the lever arm components and boresight angles, with slight improvements in the STDs of the estimated parameters when using SIFT-based tie points due to the high redundancy that these points provide. Also, from the reported time delays, it can be observed that while there is a significant time delay in the frame camera systems (approximately −200 and −270 ms for the RGB and thermal cameras, respectively), the time delay estimated for the line cameras ranges between +5 and +10 ms, which is considered relatively small.
It is worth mentioning that a negative estimated time delay shows that the exposure time is recorded before the actual moment of image capture, whereas a positive time delay is an indication of recording the exposure time after the image is captured.
For an absolute accuracy assessment, the XYZ ground coordinates of the check points are derived by applying the estimated system parameters in a multiray intersection while using the position and orientation of the GNSS/INS unit at the time of exposure. The derived ground coordinates are then compared to the corresponding RTK GNSS measurements. A summary of the mean, STD, and root-mean-squared error (RMSE) of the differences in the X, Y, and Z coordinates of the check points for the four datasets is presented in Table VI. The horizontal and vertical components of the RMSE differences when using check points and SIFT points are no more than 1 and 4 cm, respectively, which is considered negligible given the GSDs of the utilized cameras. The larger vertical component of the RMSE values can be mainly attributed to an inaccurate focal length, resulting in a constant shift, as reflected by the large Z mean values for both cases (check and SIFT points). Overall, one can conclude that the proposed multiscale matching strategies for frame and line cameras lead to system calibration parameters with the same level of absolute accuracy as those obtained using manual measurements of check points.
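The reported statistics can be reproduced with a straightforward computation, shown here as a generic sketch; the sample standard deviation (n − 1 in the denominator) is an assumption, as the article does not state which estimator is used.

```python
import math

def error_stats(diffs):
    # diffs: per-check-point coordinate differences (estimated minus RTK GNSS)
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample STD is taken about the mean; RMSE folds the bias (mean) back in,
    # which is why a constant vertical shift inflates the Z RMSE.
    std = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return mean, std, rmse
```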

Objective 3-Applicability of Proposed Approach for Line Camera System Calibration With Erroneous Nominal Parameters
As presented under Objective 2, depending on the utilized time synchronization technology, a small or large time delay might exist in an imaging system. On the other hand, given the positional and rotational offset between the onboard camera and the GNSS/INS unit, only a small deviation between the actual and nominal mounting parameters is expected (up to a few centimeters for the lever arm components and a few degrees for the boresight angles). Hence, by singling out the time delay among the system calibration parameters, this objective is tested by evaluating the performance of the proposed calibration strategies in handling systems with large time synchronization errors. As reported in the previous section, the large time delays (more than 200 ms) existing in the frame camera systems were successfully estimated using the proposed approach. Since the multiscale matching strategy for frame cameras does not rely on the GNSS/INS trajectory, it can inherently handle systems with large deviations of the actual calibration parameters from their nominal values. On the other hand, the time delays in the line cameras utilized in this article were shown to be quite minimal (no more than 10 ms). In order to evaluate the ability of the proposed automated calibration approach for line cameras to handle systems with a larger time delay, different amounts of time delay are artificially introduced to the VNIR line camera dataset. One should note that since the implemented partial ortho-rectification process uses nominal system calibration parameters, e.g., Δt = 0, introducing an artificial time delay results in distorted partially ortho-rectified scenes. This impedes the automated feature matching between overlapping scenes. Fig. 20 illustrates corresponding zoomed-in regions from partially ortho-rectified scenes generated with different artificial time delays, i.e., zero to 125 ms at 25 ms intervals, for the VNIR dataset captured from the 40 m flying height.
From this figure, one can note that as the amount of the introduced artificial time delay increases, more significant distortions are observed in the partially ortho-rectified scenes.
The number of reconstructed object points, σ_0 from the BA procedure, expected time delay, and estimated system calibration parameters along with their STDs for the VNIR dataset with and without artificial time delay are presented in Table VII. For all experiments, a constant search window size (2 × 2 m), selected according to the maximum displacement of 0.75 m resulting from the maximum flying speed (6 m/s) and the maximum introduced artificial time delay (125 ms), is used for deriving conjugate features. Looking into the number of reconstructed object points in Table VII, it can be seen that introducing larger time delays yields fewer tie points. This is mainly caused by the distortions in the partially ortho-rectified scenes for the experiments with significant time delay, as shown in Fig. 20. According to the σ_0 values reported in Table VII, although the back-projection errors increase (from 2.7 to 5.1 pixels) as larger time delays are introduced, these errors are still within the expected accuracy of the SIFT-based tie points. The time delay estimated for the VNIR dataset without an artificial time delay was 10 ms (shown in Tables V and VII). Comparing the expected and estimated time delays in Table VII, one can note that the deviation between them is small (no more than 6 ms) for all cases except the experiment with the 125 ms artificial time delay. More specifically, when a 125 ms artificial time delay is introduced to the system, there is a 47 ms difference between the estimated and expected time delay values. In addition, in this experiment, the estimated mounting parameters, particularly the boresight angles Δϕ and Δκ, show relatively large deviations from their estimates when no artificial time delay is introduced.
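The window size used in these experiments follows from the displacement bound described above. The sketch below is an illustration of that arithmetic, not a formula from the article; treating the window as a square centered on the predicted location is an assumption.

```python
def min_window_size(max_speed, max_time_delay):
    # Worst-case planimetric shift of a feature caused by an unmodeled
    # time delay is speed * delay; the predicted location may be off by
    # that amount in any direction, so a square window centered at the
    # prediction needs a half-width of at least that shift.
    displacement = max_speed * max_time_delay
    return 2.0 * displacement  # full window side length

# 6 m/s maximum speed and a 125 ms maximum delay give a 0.75 m shift,
# i.e., a side length of at least 1.5 m; the 2 m x 2 m window used in
# the experiments covers this with margin for trajectory and DSM errors.
```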
It is worth mentioning that although the proposed approach does not lead to accurate system parameters when a 125 ms artificial time delay is introduced, the estimated time delay can be used to generate new ortho-rectified scenes, from which a new set of conjugate features with a higher percentage of inliers can be derived. This process can be conducted iteratively until accurate system calibration parameters are estimated, i.e., until no significant change in the estimated system parameters is observed between two successive iterations. One should note that the iterative approach described here differs from the currently implemented procedure (discussed in Section IV) in that the current approach iterates when solving for the time delay without refining the matched features, whereas the proposed strategy for line camera systems with significant time delays establishes a new set of conjugate features in each iteration.
Absolute accuracy results for the experiments with different artificial time delays are presented in Table VIII. As reported in this table, except for the experiment with the 125 ms artificial time delay, which shows a vertical absolute accuracy worse than 11 cm, the overall mean, STD, and RMSE values are very similar to those obtained with a significantly smaller time delay in the system (the experiment without any artificial time delay). Furthermore, by adopting the abovementioned iterative procedure for line camera systems with significant time delays, we would expect to achieve high absolute accuracy in the experiment with the 125 ms artificial time delay as well. It is worth mentioning that, given the current time synchronization technology used in line camera imaging systems, a time delay of such magnitude is rare in practice. Overall, one can conclude that the proposed approach accurately estimates the system calibration parameters and achieves similar absolute accuracy whether the time delay in the system is minimal or significant.

Objective 4-Feasibility of Using the Proposed Multiscale Matching Strategies for 3-D Reconstruction
The last objective of the experimental results is to examine the potential use of the proposed multiscale matching strategies for generating high-precision 3-D point clouds from UAV-based images acquired at multiple flying heights. This test is conducted by examining the relative accuracy of the object space derived from the BA process. Before evaluating the relative accuracy, however, the quality of intersection for the reconstructed object points is examined by measuring the maximum intersection angle, α, between light rays at the location of each object point. In general, the larger the base-to-height ratio, the larger α and the better the intersection geometry. Fig. 21 illustrates histograms of the α angles for SIFT-based object points for the four datasets, with the mean α angles for SIFT and check points represented by green and blue lines, respectively. Also, the minimum, maximum, mean, and STD values of the α angles for the four datasets are reported in Table IX. According to Fig. 21 and Table IX, for all datasets, although the maximum α angles for check points and SIFT points are very similar, the mean α values (ᾱ) are larger for the check points. This shows that although the proposed strategies are able to establish object points whose intersection quality is as high as that of the check points, the majority of SIFT-based tie points are established only among images with a high percentage of overlap, i.e., short baselines. This indicates a relatively low matching repeatability for SIFT-based features (see Table IV), particularly in the frame camera datasets. For example, while each ground target in the thermal dataset was manually measured in an average of 39 images, each SIFT-based object point was established in only five images.
The corresponding values were nine images for check points and four images for SIFT points in the SWIR dataset, indicating that by using the GNSS/INS information, the proposed multiscale matching strategy for line cameras achieved better relative performance in terms of matching repeatability than the implemented matching approach for frame cameras. Also from Table IX, one can note that in the case of check points, the mean α (ᾱ) values for the frame camera datasets are larger than those for the line cameras. This is mainly because there is no overlap between line camera scenes in the flight direction, and thus, only light rays from neighboring flight lines can contribute to the 3-D reconstruction process when using such cameras.
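The intersection-angle criterion can be computed as follows. This is a generic sketch; unit ray directions from each observing camera position toward the object point are assumed as input.

```python
import itertools
import math

def max_intersection_angle(ray_dirs):
    # ray_dirs: unit direction vectors of the light rays that intersect
    # at the reconstructed object point (one ray per observing image).
    def angle_deg(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    # alpha is the maximum pairwise angle; larger base-to-height ratios
    # yield larger alpha and stronger intersection geometry.
    return max(angle_deg(u, v) for u, v in itertools.combinations(ray_dirs, 2))
```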
Having discussed the quality of intersection for the four datasets, the relative accuracy, i.e., precision, of the object points is presented below. The relative accuracy is evaluated by examining the STDs of the check/SIFT point coordinates derived from the BA process. One should note that the planimetric precision of the object points is a function of the flying height (H) and the precision of the image coordinate measurements (σ_xy), whereas the vertical precision is determined by H, σ_xy, and the baseline between images. A summary of the mean STDs, i.e., the relative accuracy measure, of the check points and SIFT-based object points obtained from the BA is reported in Table X. In addition, to better understand the impact of intersection quality (α angle) on the precision of the object space, the relative accuracy of the SIFT points with α angles larger than the corresponding mean value (ᾱ) is also reported in Table X. Looking into the XYZ mean STDs in this table, one can note that for all datasets, the check points show better relative accuracy than the SIFT points, especially in the vertical direction. While the better planimetric precision can be attributed to the higher accuracy of manual measurements compared to automated extraction of SIFT features, the lower vertical relative accuracy of the SIFT points is mainly caused by the weaker intersection geometry, i.e., small α angles. Examining the XY mean STDs reported in Table X, it is worth mentioning that although the planimetric precision of the SIFT points is not as high as that of the check points, it is no more than four times the GSD of the original images at the 40 m flying height. Furthermore, from Table X, it can be seen that removing SIFT points with weaker intersection geometry improves the planimetric and vertical relative accuracy for all datasets. These improvements, which are more pronounced in the vertical direction, show the impact of intersection geometry on the precision of the reconstructed object points.
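The dependence of precision on flying height and baseline noted above follows the standard normal-case photogrammetric error propagation. The formulas below are textbook approximations, not expressions from the article; f denotes the principal distance and B the baseline.

```python
def planimetric_precision(H, f, sigma_xy):
    # Ground precision scales with the image scale factor H / f.
    return (H / f) * sigma_xy

def vertical_precision(H, f, B, sigma_xy):
    # Vertical precision additionally degrades with a small
    # base-to-height ratio B / H, i.e., with short baselines.
    return (H / B) * (H / f) * sigma_xy
```

This makes the observed behavior plain: SIFT points matched only across short baselines (small B) suffer in Z far more than in XY, which is exactly the pattern reported in Table X.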
Based on the relative accuracy evaluation in this section, one can conclude that the proposed multiscale matching strategies have the potential to be used in SfM-based applications and produce point clouds with high planimetric relative accuracy. Regarding the vertical precision of the reconstructed object space, further improvements can be investigated, especially for the proposed approach for frame cameras by using the available GNSS/INS information, to increase the matching repeatability and consequently improve the intersection quality.

VI. CONCLUSION
This research presented a methodology for reliable multiscale automated feature extraction and matching for both frame and line camera systems. The automatically derived tie points were then used in a general GNSS/INS-assisted BA with system self-calibration to produce accurate system parameters. The key motivation for this development is mitigating the limitations of existing UAV system calibration techniques regarding automated estimation of spatial and temporal system calibration parameters for both frame and line cameras. Results showed that the matching approach paired with the automated system calibration presented in this research was capable of producing the same level of absolute accuracy as manually measured tie points for both frame and line camera systems. Additionally, the results indicated that the automated system calibration approach presented in this article can be applied to systems with either significant or minimal time delay discrepancies and still achieve reliable estimates of the system parameters and absolute accuracy. In addition, the relative accuracy analysis showed that the proposed multiscale matching algorithms can also be used in SfM-based applications to produce a 3-D object space with high planimetric and moderate vertical precision. In conclusion, this article presented an automated spatial and temporal calibration approach that produces accurate system calibration parameters without the need for GCPs or manually measured tie points for both frame and line camera systems. Future work will focus on simultaneously integrating frame and line camera images as well as LiDAR data in the feature matching and BA procedures. Other algorithms, such as SURF, will also be explored as feature matching alternatives.