Accuracy Evaluation and Prediction of Single-Image Camera Calibration

This paper proposes an application that statistically predicts the accuracy of single-image geometric camera calibration using given 2D-3D correspondences. Deriving both camera intrinsics and extrinsics from correspondences between a single image and a 3D shape is important for scene analysis when the optical system of the camera is lost, such as in the analysis of traffic accidents. It is unclear whether single-image calibration will succeed in practice, particularly when the number of 2D-3D correspondences is small, even if accurate correspondences can be assigned by manual labor. To this end, we perform a systematic evaluation of camera parameter accuracy using synthetic environments. Based on the statistics observed during the experiments, our application predicts the calibration accuracy from simple variables (e.g., the area in which correspondences can be given). Since the prediction process does not rely on 3D shapes, it provides an estimate of the success of the calibration before time-consuming processes, i.e., 3D scanning and 2D-3D correspondence mapping.


I. INTRODUCTION
Camera calibration of intrinsic and extrinsic parameters is a traditional yet essential problem in computer vision. Given only a single image and a three-dimensional (3D) model of the environment, the problem of finding the camera parameters becomes particularly challenging, even though it is of practical importance.
A major application of single-image camera calibration with known 3D geometry is to estimate the camera parameters of images capturing traffic accidents, which can be important evidence. During traffic accident reconstruction (TAR) [1], investigators often estimate the position and movement of cars and pedestrians from the given evidence images [2], [3], which are often captured by dashboard cameras or smartphones. Since the camera parameters of off-the-shelf cameras are generally unknown, camera calibration from a single evidence image is fundamental for the accurate reconstruction of the target scene [4]. Static scene geometry may be obtained after the incident with a 3D scanner, while it is difficult to reproduce the same optical system as when the image was taken because the camera may be damaged or misaligned.
Geometric calibration methods [5], [6] have been widely used to estimate camera parameters. In particular, methods that do not rely on planar markers (e.g., Tsai's method [5]) can, in principle, estimate both intrinsic and extrinsic parameters using correspondences between an arbitrary 3D model and a two-dimensional (2D) image. Methods tailored for single-image calibration have also been studied [7]. These geometric methods are known to be valid if good correspondences (in terms of both number and quality) are obtained [8], [9]. It is, however, uncertain whether calibration with a single image capturing an urban landscape will be successful, which is often the case for the target scenes of TAR, as shown in Fig. 1. Even if accurate correspondences are given by manual labor, the number of correspondences will be smaller and tends to be biased in the image and in 3D space, compared to using a well-designed rig. An image captured with a dashboard camera (see Fig. 1(a)) contains large areas in which we cannot obtain correspondences (e.g., the sky at infinity, or a vehicle body moving along with the camera). Besides, most of the possible corresponding points are often distributed on three planar surfaces, i.e., the ground and roadside facilities such as buildings. These surfaces are poorly textured, thus limiting the distribution of the possible corresponding points in the image. In the case of overlooking a scene (see Fig. 1(b)), which often occurs in security camera images, correspondences can be found in various parts of the image; however, since the variation of scene depth is small, it may affect the calibration accuracy. These issues can lead to inaccurate or unstable solutions when estimating the position or speed of cars or pedestrians at the time of the accident.
Our goal is to develop an application that statistically predicts the accuracy of single-image camera calibration via comprehensive and systematic experiments. In this study, we use a traditional geometric calibration method. Since instability of the solution is inevitable in single-image cases with relatively few correspondences, we employ a practical technique to obtain a reasonable solution by sampling the initial intrinsics. We evaluate the calibration accuracy by changing various factors (e.g., the number of correspondences and their distribution in the image) while generating 2D-3D correspondences in synthetic environments. Based on the experimental results, we develop an application that predicts the calibration accuracy from an image and simple additional information.
Our application is intended to provide users with a guide to acquiring enough information for scene analysis by estimating the success of the camera calibration for a given set of variables. Given an image, it helps users decide whether or not to engage in time-consuming processes of 3D shape acquisition (e.g., via 3D scanners) and 2D-3D correspondence mapping (e.g., via manual labor). It can also be used to estimate the number of points or the coverage of 3D shapes needed to meet a required accuracy. To assess the reliability of TAR and its admissibility as evidence in court, predicting the accuracy of the camera parameter estimation, as well as its stability and confidence intervals, is quite important. Through experiments in a real-world environment that mimics traffic accidents, we show that the predicted accuracy is well in line with practical scenarios. Our implementation is available at https://github.com/Kikkawa-OPP/CalibPrediction.

A. CONTRIBUTION
We provide a practical analysis of single-image calibration accuracy and its confidence intervals depending on various factors, with an emphasis on an application that predicts the calibration accuracy in practical environments. The prediction process does not require 3D shapes or 2D-3D correspondences; it thus provides an estimate of the success of the calibration before time-consuming processes (i.e., 3D scanning and correspondence mapping).

II. RELATED WORKS
Our goal is to provide a quantitative measure and a prediction tool for camera calibration accuracy through systematic evaluation. Our work is thus closely related to camera calibration and its evaluation.

A. GEOMETRIC CALIBRATION
Geometric calibration of camera intrinsic and extrinsic parameters is a fundamental technique in computer vision [10], [11]. Intrinsic parameters are often estimated using correspondences on known 3D geometry [5] or planes [6]. Several different models for camera intrinsics have been proposed, such as ones using sixth-order radial distortion [12] or tangential distortion [13]. These methods, in principle, optimize both intrinsics and extrinsics, and thus can also be used for extrinsic parameter estimation against a known geometry.
The perspective-n-point (PnP) problem [14], [15], [16] is to estimate the extrinsic parameters given the intrinsics and 2D-3D correspondences. The PnP problem can be extended to estimate (a part of) the intrinsics, such as the focal length [17], [18]. Recent studies use deep learning to estimate extrinsic parameters for the alignment between a camera and a depth sensor [19]. Though slightly different from calibration using 2D-3D correspondences, recent structure-from-motion (SfM) methods targeting unordered image collections often compute camera intrinsics, as well as extrinsics, from large numbers of image correspondences [20], [21].

B. SINGLE-IMAGE FULL CALIBRATION
Even when only the 2D-3D correspondences on a single image are given, traditional methods (e.g., [5]) can still achieve a full calibration (i.e., estimation of both intrinsics and extrinsics). In practice, this task is often done using calibration rigs [22], which enable the detection of plenty of correspondences. There are also methods specialized for single-image calibration using orthogonal planes [23] or lines [24]. Similarly, circles [25], vanishing points [7], [26], and low-rank textures [27] are known to be useful cues for single-image calibration. In augmented reality applications, fiducial markers have been designed for estimating intrinsic parameters [28]. To retain generality, we investigate a general method for geometric calibration, which does not rely on planar scenes or any other special assumptions.
Recently, single-image calibration using deep learning has also been studied. Some recent methods do not require 3D shapes but rely on large numbers of 2D training images [29]. They, however, only compute rough extrinsics (i.e., place recognition) or partial rotation (i.e., only roll and pitch) [29], [30], which makes them difficult to use in TAR applications.

C. EVALUATION OF CALIBRATION ACCURACY
The accuracy evaluation of camera calibration is closely related to our work. Early attempts include the comparison of distortion models [31] and different calibration methods [32], [33]. The influence of measurement noise was also investigated as an important factor for calibration accuracy [34]. Several studies perform task-oriented evaluation, which assesses the influence of calibration error on the accuracy of stereo vision [35], [36] or 3D reconstruction [37], [38], [39]. Also, a recent paper [40] seeks camera calibration options suitable for autonomous driving applications.
Similar to our work, Sun and Cooperstock [9] provide a systematic evaluation of traditional calibration methods. Their experiments were carried out using synthetic 3D models to assess the influence of noise and the number of correspondences. However, that study focused on the use of well-designed rigs (i.e., 3D patterns or 2D checkerboards) captured in multiple images. For single-image calibration of urban scenes, the problem becomes notably more challenging in terms of both the number of correspondences and the distribution of the corresponding points.

III. SINGLE-IMAGE CAMERA CALIBRATION WITH INITIAL PARAMETER SEARCH
Although proposing a calibration method is not our main contribution, we here introduce a practical technique for single-image camera calibration, which is used in our experiment.
Our method estimates both the intrinsic and extrinsic parameters from given 2D-3D correspondences defined between a single image and a 3D point cloud, such as one acquired by a 3D scanner. When performing gradient-based nonlinear minimization such as the Levenberg-Marquardt (LM) method [41], local minima far from the actual solution are likely to be reached if the number of corresponding points is small or if their distribution in the image is biased. It is possible to obtain initial values of the intrinsic parameters by solving a linear system using singular value decomposition (SVD) or other methods [5], [6]. However, both the linear solution minimizing the algebraic distance and the local solution minimizing the re-projection error by nonlinear least squares can be expected to be unstable, especially when only a few correspondences are given.
In this study, we restrict the solution space by sampling the initial values of the parameters that largely affect the re-projection error. We use a grid search over the focal length f = (f_x, f_y) and the second-order radial distortion coefficient k_1, and search for the best intrinsic parameters that minimize the re-projection error using the LM method. During the grid search, we first optimize the remaining parameters while fixing {f, k_1} at the grid point, then optimize all parameters using the obtained solution as the initial guess.

A. IMPLEMENTATION DETAILS
Since the calibration functions implemented in OpenCV are commonly used nowadays, we employ the camera model based on the definitions in OpenCV, which is slightly different from Tsai's model [5] but includes sixth-order radial distortion [12] and tangential distortion [13]. The intrinsic parameters thus consist of the focal length f = (f_x, f_y), the principal point c = (c_x, c_y), and the distortion coefficients, including three radial terms k = (k_1, k_2, k_3) and two tangential terms p = (p_1, p_2). Similar to traditional methods, this study uses the mean square of the re-projection error as the objective function and alternately optimizes the extrinsic and intrinsic parameters by the LM method [41].
For the grid search, we sample the initial focal length f = (f_x, f_y) converted from the vertical field of view (FoV) α as

f_x = f_y = I_h / (2 tan(α/2)),    (1)

where I_h denotes the height of the image. The FoV α is searched in the range [10°, 170°] with an interval of 10°. For the distortion coefficients, we initially set k = (k_1, 0, 0) and p = (0, 0), and sample k_1 in the range [−10, 10] with an interval of 0.5.
The initial value of the principal point c is the image center. Although a rough FoV may be obtained from the specifications of the camera used, the focal length converted from the FoV using Eq. (1) becomes notably different from the actual value when the radial distortion is large; it is therefore useful to search for the focal length using the grid search. Our implementation is based on the calibrateCamera function in OpenCV, as sketched below. With our single-threaded Python implementation, the whole process took 1.9 [sec] on a CPU (Intel i7-8700K, 3.70 GHz) when 50 correspondences were given.
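The following is a minimal sketch of this two-stage grid search using OpenCV's Python API. The calibrateCamera call and the CALIB_* flags are real OpenCV interfaces; the helper itself, its argument layout, and the approximation of the alternating LM refinement by two calibrateCamera calls are illustrative assumptions, not the released implementation.

```python
import cv2
import numpy as np

def grid_search_calibrate(obj_pts, img_pts, img_size):
    """Two-stage grid search over {f, k1} (Sec. III), sketched with
    cv2.calibrateCamera. obj_pts: (N, 3) float32 world points,
    img_pts: (N, 1, 2) float32 pixels, img_size: (width, height)."""
    w, h = img_size
    best = None
    for fov_deg in range(10, 171, 10):                       # alpha grid
        for k1 in np.arange(-10.0, 10.01, 0.5):              # k1 grid
            f = h / (2.0 * np.tan(np.radians(fov_deg) / 2.0))  # Eq. (1)
            K = np.array([[f, 0, w / 2.0],
                          [0, f, h / 2.0],
                          [0, 0, 1]], dtype=np.float64)
            dist = np.array([k1, 0, 0, 0, 0], dtype=np.float64)  # (k1,k2,p1,p2,k3)
            # Stage 1: fix {f, k1} at the grid point, optimize the rest.
            flags = (cv2.CALIB_USE_INTRINSIC_GUESS
                     | cv2.CALIB_FIX_FOCAL_LENGTH | cv2.CALIB_FIX_K1)
            _, K, dist, _, _ = cv2.calibrateCamera(
                [obj_pts], [img_pts], img_size, K, dist, flags=flags)
            # Stage 2: refine all parameters from the stage-1 solution.
            err, K, dist, rvecs, tvecs = cv2.calibrateCamera(
                [obj_pts], [img_pts], img_size, K, dist,
                flags=cv2.CALIB_USE_INTRINSIC_GUESS)
            if best is None or err < best[0]:
                best = (err, K, dist, rvecs[0], tvecs[0])
    return best  # (RMS re-projection error, K, dist, rvec, tvec)
```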

IV. SYSTEMATIC EVALUATION
The heart of this study is to analyze the error factors of single-image geometric camera calibration in order to develop an error prediction application. Previous studies reported that noise and the number of correspondences affect the calibration accuracy, via experiments on synthetic 2D-3D correspondences simulating 3D rigs [9]. We additionally investigate the factors that likely matter when using correspondences on a single urban image, such as the scene geometry and the correspondence distribution in the image.

A. SYNTHETIC ENVIRONMENTS
To assess the influence of the variables (e.g., the number of correspondences) under both ideal and practical scenarios, we use two types of synthetic 3D scenes: random-3D and urban-like scenes, as shown in Fig. 2. We generate these scenes so as to automatically acquire 2D-3D correspondences by randomly selecting 3D points from the scenes and projecting them onto the virtual camera. To obtain reasonable results, the pose of the virtual camera was randomly set each time (translations ranged from 0 m to 100 m).

1) RANDOM-3D SCENE
To analyze in a setting similar to the previous study [9], we prepare random-3D scenes. For this type of scene, we randomly select 3D object points in the view frustum of the virtual camera so as to fulfill the given variables listed in Sec. IV-B. The random-3D scenario simulates an ideal case of 2D-3D correspondence generation, intending to evaluate the overall trend of the influence of the variables on the calibration accuracy, independent of scene geometry.

2) URBAN-LIKE SCENE
We also create urban-like scenes that simulate practical scenarios in TAR, which usually deal with roadside images captured by, e.g., dashboard cameras. As discussed in Sec. I, we assume the scene geometry of the roadside is mostly composed of three orthogonal planes (i.e., the road and buildings). Since road and building surfaces are not perfectly planar, we add random noise to the point locations with a standard deviation of 5 [cm] for the road and 100 [cm] for the buildings; a sketch of this point sampling is given below. In this analysis, we fix the height of the virtual camera at 1.5 [m]. This represents the approximate height of the rear-view mirror, which is where the dashboard camera is located in most cars. We evaluate the calibration accuracy by changing the road width and the height of the buildings.
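A minimal sketch of how such surface points could be generated; the 100 m depth extent along the road and the uniform choice among the three surfaces are our assumptions for illustration, while the noise levels follow the text.

```python
import numpy as np

def urban_like_points(n, road_w, bldg_h, depth_max=100.0, rng=None):
    """Sample candidate 3D object points on the three near-orthogonal
    surfaces of the urban-like scene (road plane plus two walls)."""
    rng = rng or np.random.default_rng()
    pts = np.empty((n, 3))
    for i in range(n):
        z = rng.uniform(0.0, depth_max)               # distance along the road
        surface = rng.integers(3)
        if surface == 0:                              # road plane, y = 0
            x = rng.uniform(-road_w / 2, road_w / 2)
            pts[i] = (x, rng.normal(0.0, 0.05), z)    # 5 cm surface noise
        else:                                         # left or right wall
            x = -road_w / 2 if surface == 1 else road_w / 2
            y = rng.uniform(0.0, bldg_h)
            pts[i] = (x + rng.normal(0.0, 1.0), y, z)  # 100 cm surface noise
    return pts
```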
For both scene types, the intrinsic parameters of the virtual cameras simulate an actual dashboard camera with a resolution of 1920 × 1080. We use a wide-FoV camera with f = (1000, 1010) and k = (−0.3, 0.1, 0.0). We add slight tangential distortion p = (0.02, 0.01) and a shifted principal point c = (1020, 560), which are often caused by windshields. To avoid the effect of the initial guess in the parameter estimation, we randomly translated and rotated the entire scene (i.e., both the 3D points and the camera) before each computation.

B. VARIABLES
For the synthetic experiments, we consider that the scene shape and the quality of the 2D-3D correspondences could influence the calibration accuracy. We therefore systematically evaluate the calibration accuracy by changing these variables. Specifically, based on the existing study of 3D rigs [9], we assume the calibration accuracy relies on the following factors: correspondence distribution, scene geometry, and noise. Table 1 summarizes the ranges and intervals of the variables we assess during the experiment, which are used to construct the statistics of the calibration accuracy. These variables are related to the distribution and quality of the correspondences, as well as the scene characteristics. Given a set of variables, we randomly sample the 3D points from the point cloud that are projected into a given bounding box, as depicted in Fig. 3.
We here denote the sets of corresponding 2D image points and 3D object points as C_2D and C_3D, respectively. We use functions representing the range r, density ρ, average µ, and noise level n of a given set of correspondences. Also, # counts the members of a given set. The variables are defined as follows.

1) BOUNDING BOX SIZE #(B)/#(I)
This variable affects the distribution of the 2D image points. Let I and B be the sets of pixels representing the image and a bounding box that contains the corresponding points C_2D (i.e., we only sample the correspondences inside the given bounding box). The size of the bounding box #(B)/#(I) ∈ [0, 1] is defined as the ratio of the area (i.e., the number of pixels) of the bounding box #(B) to that of the whole image #(I). During the experiment, we randomly generate bounding boxes that match the designated area ratio #(B)/#(I), as sketched below.
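A possible way to sample such a bounding box; the helper name and the random aspect ratio are our assumptions, but the box always matches the designated area ratio exactly and fits inside the image.

```python
import numpy as np

def random_bbox(img_w, img_h, area_ratio, rng=None):
    """Sample an axis-aligned box B with #(B)/#(I) = area_ratio."""
    rng = rng or np.random.default_rng()
    target = area_ratio * img_w * img_h        # desired pixel area #(B)
    w = rng.uniform(target / img_h, img_w)     # any width keeping h <= img_h
    h = target / w
    x0 = rng.uniform(0.0, img_w - w)           # place the box inside the image
    y0 = rng.uniform(0.0, img_h - h)
    return x0, y0, w, h
```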
2) CORRESPONDENCE DENSITY ρ(C_2D)
We control the number of corresponding points in the given bounding box. For simplicity, we define the correspondence density ρ(C_2D) as the number of points per 100 × 100 = 10,000 pixels (when using cameras with 1920 × 1080 resolution).
3) DEPTH DISTRIBUTION µ(d(C_3D)) [m], r(d(C_3D)) [m]
Letting d(C_3D) denote the depth values of the object points C_3D, we randomly sample the object points in the depth range defined as

[µ(d(C_3D)) − r(d(C_3D))/2, µ(d(C_3D)) + r(d(C_3D))/2],

where µ(d(C_3D)) and r(d(C_3D)) denote the representative value and the range of the depth, as illustrated in Fig. 3.
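A minimal sketch of this depth-constrained sampling; the helper name and interface are hypothetical.

```python
import numpy as np

def sample_in_depth_range(pts3d, depths, mu, r, n, rng=None):
    """Pick n object points whose depth lies in [mu - r/2, mu + r/2]."""
    rng = rng or np.random.default_rng()
    candidates = np.flatnonzero(np.abs(depths - mu) <= r / 2.0)
    idx = rng.choice(candidates, size=n, replace=False)
    return pts3d[idx]
```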

4) NOISE ON CORRESPONDENCES n(C_2D) [px], n(C_3D) [cm]
To the given 2D and 3D corresponding points, we add Gaussian noise with standard deviations n(C_2D) and n(C_3D), respectively. While the noise on the 2D image points n(C_2D) simulates errors in feature point detection or manual correspondence assignment, n(C_3D) simulates 3D measurement errors during laser scanning or the fusion of multiple scans. The direction of each noise vector is randomly selected; a sketch is given below.
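One way to realize this noise model, drawing a random unit direction and a Gaussian magnitude. The sigma argument must be given in the same unit as the points (pixels for C_2D, scene units for C_3D); the helper itself is our illustration.

```python
import numpy as np

def perturb(points, sigma, rng=None):
    """Add noise vectors with random directions and Gaussian magnitudes
    of standard deviation sigma. Works for both C_2D and C_3D arrays."""
    rng = rng or np.random.default_rng()
    direction = rng.normal(size=points.shape)
    direction /= np.linalg.norm(direction, axis=-1, keepdims=True)
    magnitude = rng.normal(0.0, sigma, size=(len(points), 1))
    return points + direction * magnitude
```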

5) ROAD WIDTH w [m] AND BUILDING HEIGHT h [m] (FOR URBAN-LIKE SCENES)
During the experiments using urban-like scenes, we also control the scene characteristics. Since the scenes simulate the roadside scenario, we change the road width w and the height h of the roadside buildings.

C. EVALUATION METRICS
We evaluate the accuracy of the intrinsics and extrinsics using well-known measures: the re-projection error, the camera position error, and the orientation error.

1) RE-PROJECTION ERROR (GIVEN CORRESPONDENCES) Re_C2D [px]
We compute the root mean square (RMS) of the re-projection error, Re_C2D, over the given 2D-3D correspondences.

2) RE-PROJECTION ERROR (ENTIRE IMAGE) Re_I [px]
We evaluate the RMS of the re-projection error on equally distributed pixels (i.e., grid points at 10 [px] intervals) over the image, Re_I, not only on the given correspondences. This is computed by back-projecting the pixels to the representative depth of the scene, µ(d(C_3D)), as sketched below.
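A sketch of how Re_I could be computed with OpenCV: back-project the grid pixels to the representative depth under the ground-truth camera, then re-project them with the estimated parameters. undistortPoints, Rodrigues, and projectPoints are standard OpenCV calls; the function layout is our illustration.

```python
import cv2
import numpy as np

def reproj_error_grid(K_gt, dist_gt, rvec_gt, tvec_gt,
                      K_est, dist_est, rvec_est, tvec_est,
                      img_size, depth, step=10):
    """Re_I: RMS error over a pixel grid, back-projected to the
    representative depth mu(d(C_3D)) under the ground-truth camera."""
    w, h = img_size
    u, v = np.meshgrid(np.arange(0, w, step), np.arange(0, h, step))
    px = np.stack([u.ravel(), v.ravel()], axis=1).astype(np.float64)
    # undistort to normalized image coordinates, then lift to depth z
    norm = cv2.undistortPoints(px.reshape(-1, 1, 2), K_gt, dist_gt).reshape(-1, 2)
    pts_cam = np.hstack([norm * depth, np.full((len(norm), 1), depth)])
    # express the points in world coordinates: X_w = R^T (X_cam - t)
    R, _ = cv2.Rodrigues(rvec_gt)
    pts_w = (pts_cam - tvec_gt.reshape(1, 3)) @ R
    # re-project with the estimated parameters and compare
    proj, _ = cv2.projectPoints(pts_w, rvec_est, tvec_est, K_est, dist_est)
    diff = proj.reshape(-1, 2) - px
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```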

3) RE-PROJECTION ERROR (BOUNDING BOX) Re_B [px]
In practical TAR scenarios, e.g., to estimate vehicle position, it is often sufficient to be accurately calibrated in the image region around the vehicle. In this case, accurate re-projection errors may not necessarily be required for the entire image. We, therefore, evaluate the RMS of the re-projection error at grid points inside the bounding box, Re B , which contains the corresponding points.

4) POSITIONAL ERROR E_pos [cm]
To evaluate the accuracy of the extrinsics, we calculate the positional error of the estimated camera, E_pos, computed as sketched below. This measure is useful for predicting the accuracy of the self-localization of vehicles.
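Assuming extrinsics that map world to camera coordinates (X_cam = R X_w + t, translations in meters), E_pos reduces to the distance between the camera centers C = -R^T t; a minimal sketch:

```python
import numpy as np

def positional_error_cm(R_est, t_est, R_gt, t_gt):
    """E_pos: distance between the estimated and ground-truth camera
    centers C = -R^T t, reported in centimeters."""
    c_est = -R_est.T @ np.ravel(t_est)
    c_gt = -R_gt.T @ np.ravel(t_gt)
    return float(np.linalg.norm(c_est - c_gt) * 100.0)  # [m] -> [cm]
```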

5) ORIENTATION ERROR E_ori [deg]
Similar to the positional error, we also evaluate the orientation error of the camera, E_ori. This is computed as the angle between the optical axes of the estimated and ground-truth cameras, as sketched below.
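Under the same extrinsic convention, the optical axis expressed in world coordinates is the third row of R, so E_ori can be computed as:

```python
import numpy as np

def orientation_error_deg(R_est, R_gt):
    """E_ori: angle between the two optical axes. With X_cam = R X_w + t,
    the camera z-axis in world coordinates is the third row of R."""
    cos = np.clip(np.dot(R_est[2], R_gt[2]), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))
```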
D. RESULTS
Figure 4 shows the errors on the correspondence projection (Re_C2D, Re_I, Re_B) and the extrinsics (E_pos, E_ori) while changing the variables for both scene types. The results shown were generated by changing each variable independently while fixing the other parameters at their default values. For a given set of variables, we repeated the estimation 1,000 times using random correspondences, and the figure shows the mean and confidence intervals calculated from these samples. Overall, the results of the two (random-3D and urban-like) scene types share a similar tendency, while the error scales differ due to the different correspondence distributions in 3D space and on the 2D image plane. We next give detailed discussions related to the correspondence distribution, depth variation, noise, and scene geometry.

1) CORRESPONDENCE DISTRIBUTION ON 2D IMAGES
If the bounding box is relatively small (e.g., #(B)/#(I) < 0.6), the estimate outside the bounding box is quite unstable. Meanwhile, even in such challenging cases, the estimate inside the bounding box remains reasonable, as indicated by the re-projection error inside the box Re_B. In such cases, the camera localization accuracies E_pos and E_ori are also acceptable (i.e., E_pos < 5 [cm]), indicating that correspondences given only around the regions of interest can be used for TAR scenarios that use only a local part of the image. Restricting the bounding box of the correspondences is theoretically the same as using a narrow-FoV camera; in our case, #(B)/#(I) = 0.3 corresponds to a vertical FoV of approximately 30°, if they share the same principal point and aspect ratio.

2) DEPTH VARIATIONS
The depth variation affects the accuracy of the extrinsics. A larger depth µ(d(C_3D)) with a smaller range r(d(C_3D)) led to larger camera positional errors E_pos. Only in the urban-like scenes, larger scene depths also led to inaccurate estimates of the camera orientation E_ori and re-projection errors Re_I. The cause is related to the scene characteristics: especially when the road width is narrow, it is difficult to obtain correspondences far from the camera that cover large areas of the image plane. This is similar to the effect of decreasing the bounding box size #(B)/#(I).

3) NOISES
Noise n(C_2D), n(C_3D) also influences the overall accuracy, as reported in [9]. The relationship between the noise levels and the errors was almost linear.

4) SCENE GEOMETRY
The scene geometry slightly influenced the stability of the calibration. Regarding the confidence intervals, the estimation in narrow (e.g., < 5 [m]) or wide (e.g., > 45 [m]) roads sometimes slightly reduced the accuracy in terms of re-projection and positional errors. In such cases, the scene degenerates to an approximately planar structure.

V. PREDICTION OF CALIBRATION ACCURACY
Based on the systematic experiments, we can now develop an application that predicts the calibration accuracy. For TAR-related analyses such as the localization of vehicles, it is necessary to measure the scene geometry (e.g., using laser scanners), which may require traffic restrictions. For practical use cases, therefore, our application leverages users' prior knowledge of the target scene, without acquiring 3D point clouds or 2D-3D correspondences. It can be used for the primary screening of images as possible evidence of an incident (e.g., a traffic accident) to avoid or minimize the burden on investigators and society. Also, demonstrating accuracy with confidence intervals is important to ensure the reliability of evidence in court. In this section, we describe the details of the application software as well as experiments using our software in real-world scenes.

A. APPLICATION DETAILS
As shown in Fig. 6, we assume our application is used in the pre-survey stage of geometric camera calibration. The application estimates the mean and confidence interval of the calibration errors from the scene type (random-3D or urban-like) and the variables used in our experiments (see Tab. 1). This is done simply by running the systematic evaluation on synthetic correspondences that fulfill the given variables, as shown in the previous sections. While the application predicts the success of calibration for given variables, it can also, in principle, be used to estimate the lower or upper bounds of the variables that meet a required accuracy. The application, for example, can be used to predict the minimum number of correspondences or the coverage of 3D scans needed to meet a given accuracy requirement, which contributes to minimizing manual labor and traffic restrictions. While we can use the pre-computed error statistics yielded during the previous experiments, re-computing the statistics for a new camera setting (e.g., for a different resolution or a largely different FoV) is possible in a reasonable time (approximately 10 minutes for 1,000 trials when parallelized on a CPU, Intel i7-8700K, 3.70 GHz, 6 cores, 12 threads). A sketch of this prediction step is given below.

FIGURE 6. A use case of our application. It can be used to predict the calibration accuracy from multiple measurement plans and examine whether TAR can be conducted with reasonable accuracy.

FIGURE 7. Dataset for the real-world experiment. As input data for our accuracy prediction software, we acquire (a) images from dashboard cameras and (b) manual correspondences between 3D models and images. We also measure (c) the ground-truth camera movement acquired by a 3D scanner for evaluation purposes.
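A minimal sketch of this statistics step. Here, simulate_once is a hypothetical callable wrapping the synthetic correspondence generation and the calibration of Sec. III, and the 95% interval width is an assumption for illustration.

```python
import numpy as np

def predict_accuracy(simulate_once, variables, trials=1000, ci=95.0):
    """Summarize synthetic-trial error statistics for a variable set.
    simulate_once(variables) must return a dict of error metrics
    (e.g., Re_B, E_pos) for one random trial."""
    runs = [simulate_once(variables) for _ in range(trials)]
    lo, hi = (100.0 - ci) / 2.0, 100.0 - (100.0 - ci) / 2.0
    summary = {}
    for key in runs[0]:
        values = np.array([r[key] for r in runs])
        summary[key] = {"mean": float(values.mean()),
                        "ci": (float(np.percentile(values, lo)),
                               float(np.percentile(values, hi)))}
    return summary
```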

B. REAL-WORLD EXPERIMENTS 1) DATASET
To validate our application, we experimented with a real-world scene shown in Fig. 7. We captured six images from two dashboard cameras (denoted as A and B in Fig. 7), whose FoVs are similar to the one used in the previous experiments, while changing the positions of the cars equipped with the cameras. The resolution of four of the images was 1280 × 720 pixels, while that of the others was 1920 × 1080 pixels. The captured dataset mimics traffic accidents, where Scene 1 (1-1 and 1-2) and Scene 2 (2-1 and 2-2) simulate frontal and rear-end collisions, as shown in Fig. 7(c). Image IDs (e.g., 1-1:A) correspond to the scene name and camera ID.
To acquire the ground truth for the prediction, a 3D point cloud of the scene was obtained by a laser scanner (Z+F IMAGER 5010C, Zoller+Fröhlich, Wangen im Allgäu, Germany). Since there is no access to an accurate ground truth for the camera pose, we indirectly evaluated the camera localization accuracy E_pos via the amount of car movement, which was measured as the difference of the car positions between two frames (e.g., 1-1 and 1-2) captured by the laser scanner. Figure 7(c) shows the measured vehicle movements. We also evaluated the re-projection error on manually assigned 2D-3D correspondences, Re_C2D, where the number of corresponding points varies from 17 to 35 depending on the scene.

2) VARIABLES
Since the real-world scene was mostly composed of near-planar surfaces, we use the urban-like setting for the prediction. To determine the bounding box B, we roughly set an area containing discriminative points in the given image. Instead of generating a large number of bounding boxes, we use the designated bounding box B for the accuracy prediction. For the correspondence density ρ(C_2D), we count the discriminative points in the given images. For the noise levels n(C_2D) and n(C_3D), we assumed that the error in correspondence assignment by the human annotator was a few pixels/centimeters.
We approximate the variables regarding the scene geometry according to prior knowledge that is easy to measure, as shown in Fig. 8. We set w and h to the actual road width and wall height. The representative depths µ(d(C_3D)) and r(d(C_3D)) are determined from two measures, the distance to the nearest point l and the distance between the nearest and farthest points L, as

µ(d(C_3D)) = l + L/2,  r(d(C_3D)) = L.

We fix l = 3 [m] in our experiment since it is usually realistic to obtain discriminative points around the front end of the vehicle. Since the farthest point depends on the target scene, we select a reasonable point (i.e., one that is semantically discriminative and whose 3D location is easy to measure with the laser scanner) from the given image. We empirically confirmed that this approximation is reasonable, since the actual representative depth calculated from the well-estimated optical center differs from these approximations by only a few meters at most.

3) PREDICTION RESULTS
Table 2 summarizes the predictions by our application as well as the ground-truth calibration errors. The last set of columns indicates the percentile values of the actual errors across 1,000 synthetic samples generated by the application (a smaller percentile means a smaller error). In most cases, the real-world results were within the confidence intervals predicted by the application.

4) COMPARISON OF ERROR DISTRIBUTION
To confirm whether the confidence intervals yielded by our application fit the real-world scenario, we compute the distribution of the positional accuracy of actual 2D-3D correspondences using subsampled correspondence sets.
We evaluate the amount of car movement in the real-world images, each of which originally has more than 15 correspondences, against the ground-truth car movement observed by the 3D scanner. We randomly subsample from 6 to 15 corresponding points (1,000 trials for each) and estimate the car movement based on the camera parameters calibrated with the subsampled correspondences. Since each trial yields a single estimate, we obtain an error distribution over the subsampled correspondence sets; a sketch of this procedure follows. We use the urban-like scene for the error prediction. Figure 9 shows the comparisons between the actual (i.e., subsampled) and predicted error distributions. Our application reasonably reproduces the real-world error distribution.
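A sketch of this subsampling procedure; estimate_movement is a hypothetical callable that calibrates both frames of a scene from the given correspondence subset and returns the estimated car movement.

```python
import numpy as np

def movement_error_distribution(pts2d, pts3d, estimate_movement,
                                gt_move, n_points, trials=1000, rng=None):
    """Collect car-movement errors over random correspondence subsets.
    gt_move and the return values of estimate_movement share one unit."""
    rng = rng or np.random.default_rng()
    errors = np.empty(trials)
    for t in range(trials):
        idx = rng.choice(len(pts2d), size=n_points, replace=False)
        errors[t] = abs(estimate_movement(pts2d[idx], pts3d[idx]) - gt_move)
    return errors  # empirical error distribution for histogram comparison
```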

VI. DISCUSSION
We have introduced an evaluation of single-image camera calibration and a new application that predicts the calibration accuracy for scene analysis in TAR. The accuracy of the re-projection errors and camera localization in a practical real-world scene was reasonably estimated via error simulation using synthetic environments. Since our application does not rely on the 3D shape of the target scene, it can be used for the primary screening of images as possible evidence of an incident, minimizing the burden on investigators (i.e., 3D acquisition and correspondence assignment) and society (e.g., road traffic restrictions).

A. GENERALIZATION ABILITY
While we use shape templates mimicking practical scenarios, i.e., urban-like scenes, it is worth discussing the generalization ability of the error prediction using our model. To assess the generalization ability of our predefined shape templates, we conduct an experiment on different types of synthetic scene geometry that mimic road intersections and bare grounds. For the prediction of the calibration accuracies, we use urban-like scene approximations. Specifically, we set the width and height parameters to those of the main road (w = 10 [m], h = 5 [m]) at the intersection. For the bare-ground scene, we set a large road width (w = 1000 [m]) and zero building height. Figure 10 compares the calibration errors. The calibration accuracy for the intersection scene is slightly better than that for the urban-like scene due to the higher degree of freedom in the correspondence selection. For the bare-ground scene, the urban-like template accurately represents the observed scene geometry. Since the errors are still inside the confidence intervals of the predictions using urban-like geometry, we consider our design choice of using road-mimicking geometry reasonable. Meanwhile, a practical future direction for better prediction is to increase the variation of shape templates for our predictor in addition to the random-3D and urban-like scenes.

B. DISTORTION MODEL SELECTION
A number of distortion models have been proposed. Given a sufficient number of correspondences, it is known that camera models with larger numbers of parameters can achieve more accurate camera calibration [40]. However, through the experiments, we found that complex camera models are often not suitable for our conditions, where the correspondences are only sparsely obtained. Figure 11 shows a comparison between models with different numbers of radial distortion coefficients, where we use three coefficients for the standard model and six for the complex model. We observe fault-like artifacts when increasing the number of distortion parameters. The visualization of re-projection errors (Fig. 11(b)) highlights the artifacts appearing along circumferences.

C. SINGLE-IMAGE RECONSTRUCTION FOR TAR
The estimated camera parameters are essential information for TAR to estimate the position, shape, and behavior of target objects (e.g., cars). As shown in Sec. V, the speed of a car can simply be computed from its positional difference between consecutive images. The prediction can also be used to estimate how many points or how much coverage of 3D scans of the target scene is needed to meet the accuracy requirement of the scene analysis. Although most 3D reconstruction methods for TAR [2], [3], [4] have used multi-view images, single-image 3D reconstruction is another promising application that broadens the scope of criminal investigation, because single-image metrology is fundamentally well studied [42]. We are keen to develop and deploy a whole framework for TAR for actual investigation scenarios, including the error prediction process proposed in this paper.