An Efficient Algorithm to Select Reference Views for Virtual View Synthesis

View synthesis is one of the key techniques for generating immersive media. Virtual view synthesis techniques require a considerable number of input views to provide users with a wide viewing space, as in 360° virtual reality. However, the computational complexity increases, and the synthesized virtual image is blurred, when all input views are used as reference views. We analyzed the algorithm complexity and synthesized image quality according to the number of reference pictures. Based on the results, we propose a systematic algorithm to compose an optimal subset of the reference pictures that reduces the hole area and increases the accuracy of the overlapped pixel data in the virtual view. The algorithm consists of a screening step and an optimal composition step, which involve logical and geometric measurements. The experimental results demonstrate that the proposed method reduces algorithm complexity and improves virtual view quality.


I. INTRODUCTION
As applications using immersive media have been developed, various technologies to synthesize virtual views have been studied to construct virtual media systems, such as the metaverse, digital twins, and augmented reality (AR) or virtual reality (VR) services, supporting three to six degrees of freedom (DoF). Many companies have developed various virtual media systems with a head-mounted display presenting the screen content. The Moving Picture Experts Group (MPEG), an international standardization organization, formed MPEG-I [1], [2] to standardize the codec for volumetric and immersive media [3]-[9]. In immersive media systems, geometric information, such as camera poses and depth maps, is critical to producing virtual views with high quality. When texture and depth data are provided, a virtual view can be synthesized along an arbitrary viewing angle according to the user's viewpoint within a limited virtual space.
The methods to synthesize virtual views are classified into dense light field-based rendering (DLFR) [10], [11] and depth image-based rendering (DIBR) techniques [12]-[24]. In DLFR techniques, multiple views are taken by a set of cameras or micro-lenses arranged with a dense baseline. Optical effects, such as light reflection and transmission, can be restored; thus, these techniques generate realistic, immersive images. However, these methods require professional equipment, such as the plenoptic camera [10]. In addition, to generate content supporting a high DoF, such as 6 DoF, the number of employed cameras increases exponentially, which can be a burden when developing such systems.
In DIBR techniques [12]-[24], a 6 DoF immersive video can be created using general cameras instead of professional equipment, where the quality of the synthesized view predominantly depends on the quality of the depth map. The depth information resulting from DIBR-based techniques may have a variety of holes, which degrade the quality of the synthesized image. Therefore, reducing the area of the hole regions and increasing the quality of the depth information are among the most crucial issues. When depth information is derived, if all neighbor reference views are used to construct a virtual view, using unnecessary and similar reference pictures blurs the synthesized picture. In addition, the computational complexity increases as the number of reference views increases [25]. However, if too few reference pictures are used, the information necessary to construct the virtual view is not provided. Thus, the depth map has wide hole regions that produce distorted images. To derive a depth map that produces high-quality virtual views, we should select the optimal reference views, maximizing the synthesized picture quality.
Some research has been conducted on selecting reference pictures [29]-[33], as explained in detail in Section II. Because these methods have limitations in improving image quality and reducing complexity, we propose an efficient algorithm to optimize the set of reference views based on a cost function considering the hole size and the diversity of the reference pictures used.

This paper is organized as follows. In Section II, we explain the existing view synthesizers and conventional algorithms to select reference views. Section III provides the preliminary analysis for view selection, where we analyze the algorithm complexity and synthesized image quality according to the number of reference views used. These data reveal the tendency of the algorithm performance for various scenarios. Based on these data, we design a cost function to represent the algorithm performance and propose an algorithm to optimize the reference pictures used in Section IV. The simulation results are presented to demonstrate the performance of the proposed algorithm in Section V. Finally, we conclude this paper in Section VI.

Fig. 1 illustrates the problem addressed in this study. It shows N pictures available to synthesize a virtual view. While the location and pose of the virtual camera change, a synthesized picture is constructed using the related reference pictures. If all reference pictures are used to construct a virtual image, the synthesis is computationally expensive, and the synthesized image is blurred due to the misalignment of several overlapped image patches. However, if the virtual view is derived with too few reference pictures, the synthesized picture includes many holes due to a shortage of information. Therefore, it is important to select the optimal set of reference pictures out of all input images. The proposed method refers to only the most suitable views to synthesize a virtual view.

Fig. 2 explains the core technologies to synthesize a virtual view, where DIBR methods derive the depth information from the neighbor reference views and use it to construct a virtual view. The DIBR methods have been adopted in a variety of standard visual systems defined by MPEG, such as three-dimensional (3D) television [26], free-viewpoint television [27], and multiview video coding [28]. The depth image is derived from the selected reference views; thus, selecting the reference views used to construct the depth map among all available reference views is one of the most critical issues. As indicated in the literature, depth maps may include various holes that degrade the quality of the synthesized image because the synthesized image quality predominantly depends on the accuracy of the derived depth data. The next section explains various conventional methods for view synthesis and reference view selection.

[This article has been accepted for publication in IEEE Access (DOI 10.1109/ACCESS.2022.3182401). This is the author's version, which has not been fully edited, and content may change prior to final publication.]

II. RELATED WORK
The DIBR algorithms are classified as single view-based [13]-[16], stereo view-based [17]-[19], and multiview-based syntheses [20]-[24] according to the number of reference views used to construct a target virtual view. In the single view-based category, a single neighbor view is used as a reference view, where the texture image, depth image, and camera parameters of the selected neighbor view are used to construct the target virtual view. Because the information from a single neighbor view is insufficient to generate all depth values in the derived depth image for the target view, the depth image may include a variety of holes that degrade the synthesized picture quality. Various algorithms to remove these holes are proposed in [13]-[16].
In stereo view-based systems, the left and right views of the target virtual view are used as reference views. Because the amount of information used in a stereo view-based system exceeds that of a single view-based algorithm, the synthesized picture quality from stereo view-based DIBR is better than that from a single view-based algorithm. Among the stereo view-based algorithms, the technique proposed by Tanimoto et al. [18] was adopted for the view synthesis reference software (VSRS), which was created by the 3D video group at MPEG. The VSRS constructs the target view using data from one or two reference views. The picture resulting from the VSRS may have some small holes; thus, post-processing to conceal those holes is needed. Zhu et al. [19] proposed a novel algorithm to fill the holes, in which occluded, unoccluded, and invisible backgrounds are identified.
In the category of multiview-based algorithms, Li et al. [20] proposed a view synthesis framework that exploits the data of multiple reference views to construct a target virtual view. They defined complementary views that mutually cooperate to reduce the hole sizes in the generated virtual view. In 2018, the MPEG-I visual group published reference synthesizers, including the versatile view synthesizer [21], the reference view synthesizer (RVS) [22], and the view weighting synthesizer [23]. The tools in [21]-[23] construct a virtual view using multiple reference views.
As the number of the reference views increases in these three categories, the computational complexity also increases. As for the synthesized picture quality, when the number of reference views is too small, the synthesized picture may have holes that degrade the virtual view quality. However, when the number of reference views is too large, some parts of the virtual view are constructed from overlapped patches warped from multiple reference views. The overlapped patches can be misaligned, resulting in blurred parts of the synthesized image. Thus, the number of reference views should be optimized to improve the image quality while considering the algorithm complexity.
Various techniques [29]-[33] to select the set of used reference views have been studied. Maugey et al. [29], [30] proposed a novel reference view selection algorithm for multiview video coding, where coding efficiency and transmission speed are considered in the cost function for optimal selection. However, these techniques are constrained in improving the performance of virtual view synthesizers because they do not consider the variation of the synthesized image quality according to the reference view composition. In [31], the virtual view synthesis (VSVS) algorithm was proposed to select reference views; it selects the two reference views nearest to the virtual view. Constraining the number of reference views to two simplifies the algorithm, but the resulting picture may have various-sized holes.
In [32], a 3D photorealistic environment simulator (PreSim) was proposed to select the reference views, where the reference views nearest to the virtual view are selected, up to 10 pictures, and some inappropriate reference views are excluded. The algorithm assumes that the baselines of all input cameras are sufficiently dense. As we observed in preliminary tests, the algorithm usually selects 10 reference views, although all 10 are unnecessary to increase the synthesized picture quality. With 10 reference views, the algorithm efficiency decreases with respect to visual quality and computational complexity.
In [33], we proposed an algorithm to compose a set of reference views, where simple conditions were checked to determine whether each reference view should be used. In that work, depth data were not used, and the overlapped regions warped from the reference views were estimated heuristically without a cost function. The simulation results in [33] demonstrated that the algorithm, although very simple, has some constraints in improving the quality of the synthesized image. In this paper, to overcome those problems, we estimate the hole size generated by stitching the pictures warped from the neighbor reference views. In addition, the redundancy among these reference views for synthesizing the virtual view is predicted based on the geometric relationship between them. The set of reference views is optimally selected by minimizing a cost function based on the estimated hole size, predicted redundancy, and mutual supplementation between reference views.

III. PRELIMINARY ANALYSIS FOR VIEW SELECTION
In this section, we examine the performance tendency of a conventional synthesis algorithm concerning algorithm complexity and virtual view quality according to the number of reference views. The RVS [22] is used as the synthesis algorithm in these tests because it includes a function to set the number of reference views and performs excellently on natural scene and computer graphics data. Based on the analysis resulting from these preliminary tests, we design the algorithm to select the reference views. Fig. 3 illustrates the four sets ('umbrella,' 'chair,' 'checkerboard,' and 'C908') of test images used in these tests. The resolutions of the umbrella, chair, checkerboard, and C908 sets are 1920 × 1080, 1280 × 720, 3000 × 4000, and 3000 × 4000, respectively. Fig. 3(d) presents a panoramic image stitched from multiple input pictures.

A. COMPLEXITY ANALYSIS OF THE VIEW SYNTHESIS
The computational complexity of the view synthesis is modeled as

T = αNR,   (1)

where T and α are the computational complexity and a proportional constant, respectively. The number of reference views is denoted by N, and R is the resolution of each reference picture.
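As an illustrative sketch of this linear model (the constant α and the timing below are hypothetical, not measured values), the predicted synthesis time scales with both the number of reference views and the resolution:

```python
# Sketch of the linear complexity model T = alpha * N * R.
# alpha is calibrated from one hypothetical timing measurement.

def estimate_synthesis_time(alpha: float, num_views: int, resolution: int) -> float:
    """Estimated synthesis cost under the linear model T = alpha * N * R."""
    return alpha * num_views * resolution

# Hypothetical calibration: 4 views at 1920x1080 took 2.0 seconds.
measured_time = 2.0
alpha = measured_time / (4 * 1920 * 1080)

# Under this model, doubling the number of views doubles the predicted time.
t4 = estimate_synthesis_time(alpha, 4, 1920 * 1080)
t8 = estimate_synthesis_time(alpha, 8, 1920 * 1080)
```
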

B. ANALYSIS FOR HOLE SIZES IN THE SYNTHESIZED IMAGE
The holes resulting from the view synthesis algorithm are classified as 'uncovered' and 'disoccluded' holes. An uncovered hole is generated when the field of view of the virtual view is mismatched with that of the reference view. When some part of the virtual view corresponds to a hidden area of the reference view, that part results in a disoccluded hole. Fig. 6 presents the ratio between the sizes of the holes and the image in the synthesized image as the number of reference views varies. In this experiment, the hole size rapidly decreases as the number of reference views increases while the number is small. However, the ratios saturate above a threshold number of reference views, implying that too many reference views are not needed to reduce the area of holes.

C. ANALYSIS OF THE SYNTHESIZED IMAGE QUALITY
Fig. 7 depicts the relationship between the synthesized virtual view quality and the number of reference views. Starting from a small number, as the number of reference views increases, the peak signal-to-noise ratio (PSNR) of the synthesized image increases because using more reference views provides more data to construct the virtual view. However, the quality saturates when the number exceeds a specific value, and if the number becomes too large, the PSNR decreases because many patches warped from multiple reference views are misaligned, resulting in blurred regions in the virtual view.

D. SUMMARY OF PRELIMINARY ANALYSIS
As observed from the preliminary analysis, increasing the number of reference views affects the algorithm complexity, hole size, and virtual view quality with different tendencies. Thus, the optimal set of reference views should be selected by jointly maximizing the quality and minimizing the complexity.
To solve this optimization problem, we design a cost function comprising various components related to the virtual view quality.

IV. PROPOSED ALGORITHM TO SELECT REFERENCE VIEWS

A. ESTIMATION OF OVERLAP REGION
Initially, the four corner points of the i-th reference picture are warped onto the virtual view. Each corner point (u, v) is back-projected into the camera coordinate system as

x = (u − c_x)z/f_x,  y = (v − c_y)z/f_y,   (2)

where the focal lengths f_x and f_y and the principal point (c_x, c_y) are intrinsic camera parameters given as input data, and z is the depth value at (u, v).

After applying (2), (x, y, z) is rotated and translated to the corresponding coordinate (x', y', z') in the virtual camera coordinate system as

[x', y', z']^T = R_{i→v}[x, y, z]^T + t_{i→v},   (3)

where R_{i→v} and t_{i→v} are the rotation matrix and translation vector from the coordinate system of the i-th camera to that of the virtual camera, respectively. In addition, R_{i→v} and t_{i→v} are calculated as

R_{i→v} = R_v^T R_i,  t_{i→v} = R_v^T(t_i − t_v),   (4)

where R_i and t_i, and R_v and t_v are the rotation matrices and translation vectors of the i-th and virtual views, respectively. The superscript T indicates the transpose of the matrix. Then, (x', y', z') from (3) is reprojected onto the image coordinate system of the virtual view using

u' = f_x x'/z' + c_x,  v' = f_y y'/z' + c_y.   (5)

In Fig. 8, if any warped corner is out of the virtual view range, it is mapped to the closest point p'' within the virtual view. When the warped corner is inside the virtual view, p'' equals the warped corner p'.
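The warping chain in (2)-(5) can be sketched in a few lines of Python. The intrinsic parameters and the pixel below are illustrative values, and the helper names are ours, not from the reference software:

```python
# Hedged sketch of the warping chain: back-project a pixel with its depth,
# move it into the virtual camera's frame, and re-project it.

def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)] for r in range(3)]

def matvec(A, v):
    """3x3 matrix times 3-vector."""
    return [sum(A[r][k] * v[k] for k in range(3)) for r in range(3)]

def transpose(A):
    return [[A[c][r] for c in range(3)] for r in range(3)]

def backproject(u, v, z, fx, fy, cx, cy):
    """Pixel (u, v) with depth z -> camera coordinates (as in Eq. (2))."""
    return [(u - cx) * z / fx, (v - cy) * z / fy, z]

def to_virtual(p, R_i, t_i, R_v, t_v):
    """i-th camera frame -> virtual camera frame (as in Eqs. (3)-(4))."""
    R_iv = matmul(transpose(R_v), R_i)
    t_iv = matvec(transpose(R_v), [a - b for a, b in zip(t_i, t_v)])
    q = matvec(R_iv, p)
    return [a + b for a, b in zip(q, t_iv)]

def project(p, fx, fy, cx, cy):
    """3-D point -> virtual image plane (as in Eq. (5))."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy

# Round trip with identical camera poses: the pixel must map back onto itself.
fx = fy = 1000.0
cx, cy = 960.0, 540.0
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t0 = [0.0, 0.0, 0.0]
p_cam = backproject(100.0, 200.0, 2.5, fx, fy, cx, cy)
u2, v2 = project(to_virtual(p_cam, I3, t0, I3, t0), fx, fy, cx, cy)
```

With identical input and virtual poses, the rotation collapses to the identity and the warp is an exact round trip, which makes a convenient sanity check.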
The area A_{OL,i} of the region OL_i overlapped by the warped i-th reference picture is calculated using Heron's formula, splitting the warped quadrilateral into two triangles. When the warped quadrilateral is flipped and twisted, the overlap estimate is invalid. Thus, if the corresponding condition in (9) is satisfied, the i-th reference picture is not considered in the synthesis.
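A minimal sketch of the overlap-area computation with Heron's formula; the flip/twist check of (9) is omitted, and the corner coordinates are illustrative:

```python
import math

def triangle_area(a, b, c):
    """Area of a triangle from its vertices via Heron's formula."""
    x, y, z = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    s = (x + y + z) / 2
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(s * (s - x) * (s - y) * (s - z), 0.0))

def quad_area(corners):
    """Area of a quadrilateral overlap region, split into two triangles."""
    p0, p1, p2, p3 = corners
    return triangle_area(p0, p1, p2) + triangle_area(p0, p2, p3)

# Unit square: the area should be exactly 1.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
area = quad_area(square)
```
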

B. ESTIMATION FOR DISOCCLUSION
In this section, we analyze a scenario generating a disocclusion and design a scheme to measure the disocclusion quantity. Fig. 9 displays the scenario generating a disocclusion, where a blue column stands in front of a large gray box. Some points on the gray box cannot be captured by the i-th camera because the blue column blocks them. In this scenario, the reference picture captured by the i-th camera may produce a disocclusion in the synthesized image. In Fig. 9, we assume that the virtual and i-th cameras observe the same point P, where O_v and O_i are the centers of the virtual and i-th cameras, respectively. We draw two rays from the centers of the two cameras through the point P toward the rear object. The points Q_v and Q_i are the intersections between the rays and the rear object. In addition, g_i is the vector from Q_i to Q_v, whose magnitude |g_i| is highly correlated with the hole size in the synthesized picture. Additionally, z_far is the farthest depth of the camera and is given as input metadata, and z_P is the depth of P in the coordinate system of the virtual camera. In Fig. 9, O_i can be represented by its position relative to the center O_v of the virtual camera (i.e., O_i and O_v are represented as (x_i, y_i, z_i)^T and (0, 0, 0)^T, respectively). The position of O_i is calculated using the extrinsic parameters of the input and virtual cameras. From the geometric relations in Fig. 9, the vectors from P to Q_v and from P to Q_i, and the vector g_i, are derived in turn. To analyze the relation between g_i and the hole size in the virtual image, Fig. 10 presents the top view of Fig. 9, where g_i is mapped to the vector h_i in the virtual image plane.

C. ESTIMATION OF REDUNDANCY
In Fig. 11(a), if OL_j is completely included in OL_i (i.e., OL_j ⊂ OL_i), the j-th reference picture P_j may be redundant for P_i.
This redundancy can be checked using the four equations (17)-(20). In Fig. 11, if OL_i includes some holes, OL_j has wider holes because the difference between the shooting angles of the virtual and j-th cameras is larger than that between the virtual and i-th cameras. This difference can be checked using (21) and (22). In the algorithm, if all conditions in (17)-(22) are satisfied, the j-th reference view is excluded from the set of reference pictures.
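A rough sketch of the containment part of this redundancy test; the full conditions (17)-(22) also involve the shooting angles, which are omitted here, and the polygon coordinates are illustrative and assumed convex in counter-clockwise order:

```python
def inside_convex(point, polygon):
    """True if `point` lies inside a convex polygon given in CCW order."""
    n = len(polygon)
    for k in range(n):
        ax, ay = polygon[k]
        bx, by = polygon[(k + 1) % n]
        # Cross product of edge (a->b) with (a->point); negative means the
        # point is on the right of the edge, i.e., outside a CCW polygon.
        if (bx - ax) * (point[1] - ay) - (by - ay) * (point[0] - ax) < 0:
            return False
    return True

def maybe_redundant(quad_j, quad_i):
    """The j-th view may be redundant for the i-th if OL_j lies inside OL_i."""
    return all(inside_convex(p, quad_i) for p in quad_j)

outer = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
inner = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
```
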

D. PROPOSED ALGORITHM FOR REFERENCE VIEW SELECTION
When N reference views are given to synthesize a virtual view, as in Fig. 2, each reference picture is checked regarding whether it is excluded from the set of reference images based on the criteria of (9), (15)-(16), and (17)-(22). After screening, the remaining reference pictures are included in the universal set U of candidate reference images. In this section, we propose an algorithm to construct the optimal set S* from the universal set U, where S* ⊂ U. The set S* is optimized based on a cost function considering the estimated hole size, the predicted redundancy, and the mutual supplementation between reference views. The notations composing the cost function are summarized in Table I. The cost function in the proposed algorithm is represented as

F_T(S_k) = F_H(S_k) + λ_D F_D(S_k) + λ_V F_V(S_k),   (23)

where the subscripts T, H, D, and V of F_T, F_H, F_D, and F_V correspond to "Total," "Hole," "Direction," and "Variance," respectively. S_k denotes a specific subset of the reference pictures, and the number of pictures in S_k is k.
In (23), F_H(S_k) is an estimator of the average area of the total holes generated using the reference pictures in S_k. To calculate F_H(S_k), the average areas of the uncovered and disoccluded holes are estimated in (25) and (26), respectively. The geometric meanings of the vectors used in (25) and (26) are shown in Figs. 9 and 10.
In the scenario in Fig. 2, if reference pictures on the left and right sides of the virtual view are used, they can complementarily remove the disocclusion holes in the synthesized picture. To verify this case, we calculate the absolute value of the sum of the vectors h_i for all views in S_k; a smaller absolute value results in a smaller average hole area. Likewise, the average hole area decreases as the absolute value of the sum of the camera center positions O_i for all reference views in S_k decreases. The geometric meanings of h_i and O_i are shown in Figs. 9 and 10. The terms related to these two absolute values are calculated in (27) and (28), respectively, and together they comprise the direction-related term F_D(S_k) given in (23) and defined in (29). The direction term F_D(S_k) is related to the relative directions and positions of the reference pictures used to construct a virtual view; minimizing F_D(S_k) reduces the hole size in the virtual view.

In (23), F_V(S_k) represents the uniformity of the positions of the reference pictures. If most reference views are located in a specific area, some would be redundant for reducing the hole area. To consider this uniformity, we compute in (30) the direction angle θ_i of the vector h_i for each view i in S_k. Based on (30), we calculate in (31) the absolute difference Δθ_i between θ_i and the angle θ_j of the view j that has the smallest absolute angular difference from view i in S_k. After that, we calculate the variance of Δθ_i over all views in S_k as

F_V(S_k) = (1/k) Σ_i (Δθ_i − 2π/k)²,   (32)

where 2π/k represents the average value of Δθ_i in an ideal scenario in which the reference pictures are located uniformly at angular intervals of 2π/k.
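The uniformity term can be sketched as follows; the angle wrap-around at 2π is ignored for brevity, and the function name is ours:

```python
import math

def variance_term(angles):
    """Uniformity cost: variance of the nearest-neighbour angle gaps
    around the ideal uniform spacing 2*pi/k (a sketch of Eq. (32))."""
    k = len(angles)
    gaps = []
    for i, a in enumerate(angles):
        # Smallest absolute angular difference to any other selected view.
        gaps.append(min(abs(a - b) for j, b in enumerate(angles) if j != i))
    ideal = 2 * math.pi / k
    return sum((g - ideal) ** 2 for g in gaps) / k

# Four views uniformly spread around the circle give (near) zero cost;
# four clustered views give a large cost.
v_uniform = variance_term([0.0, math.pi / 2, math.pi, 3 * math.pi / 2])
v_clustered = variance_term([0.0, 0.1, 0.2, 0.3])
```
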

In (23), we evaluate only F_H(S_k) when a single reference view is included in the subset (i.e., k = 1) because the statistics F_D(S_k) and F_V(S_k) cannot be calculated for a single reference view. Based on the cost function of (23), we derive the optimal reference view set S* that minimizes the total cost F_T(S_k). Calculating the cost functions for all constructible subsets S_k, 1 ≤ k ≤ M, is highly complex. Therefore, we optimize the set based on the greedy algorithm [34], which determines the optimal solution by considering the best available choice at every iteration.
We determine a set S_k minimizing the cost function for each specific value of k and increase k at the end of each iteration. Initially, the working subset, its best candidate, and S* are set to empty, the initial cost is set to the positive numerical limit, and k is set to 1. The proposed algorithm is presented as pseudocode in Table II.
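The greedy loop described above can be sketched as follows, with a toy cost function standing in for the paper's (23):

```python
def greedy_select(universe, cost):
    """Greedy construction of a reference-view subset: at each size k, add
    the single view that yields the lowest cost, and keep the best subset
    seen over all sizes. `cost` maps a frozenset of views to a float."""
    current = frozenset()
    best_set, best_cost = current, float("inf")
    while len(current) < len(universe):
        # Best single addition at this iteration (the greedy choice).
        candidate = min((current | {v} for v in universe - current), key=cost)
        if cost(candidate) < best_cost:
            best_set, best_cost = candidate, cost(candidate)
        current = candidate
    return best_set

# Toy cost: prefer subsets whose ids sum close to 7, with few elements.
views = {1, 2, 3, 4, 5}
cost_fn = lambda s: abs(sum(s) - 7) + 0.1 * len(s)
chosen = greedy_select(views, cost_fn)
```

Because each iteration only extends the previous subset by one view, the loop evaluates O(M²) candidate subsets instead of all 2^M, which is the complexity saving the text refers to.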

E. OPTIMIZATION OF COEFFICIENTS
When the total cost function F_T(S_k) in (23) is calculated, the coefficients λ_D and λ_V assign differential weight factors to F_H(S_k), F_D(S_k), and F_V(S_k). To optimize these factors, we analyzed the effects of F_H(S_k), F_D(S_k), and F_V(S_k) on the synthesized picture quality through empirical tests. Fig. 12 illustrates the relationship between the PSNR of the synthesized picture and the values of each term of (23), where 16 test image sets were used. Fig. 12 indicates linear correlations in all tests. Moreover, the PSNR decreases as the values of F_H(S_k), F_D(S_k), and F_V(S_k) increase, implying that the terms in (23) efficiently represent the cost of creating the synthesized picture.
Considering the linear dependency in Fig. 12, the coefficients λ_D and λ_V are set according to the fitted slopes.
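One plausible way to derive such weights from the linear fits is to equalize the slopes of the PSNR against each term; the measurements below are hypothetical, and the exact rule in the paper may differ:

```python
def slope(xs, ys):
    """Least-squares slope of y against x (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical measurements: PSNR versus the hole and direction terms.
f_h = [0.1, 0.2, 0.3, 0.4]
f_d = [1.0, 2.0, 3.0, 4.0]
psnr_h = [34.0, 33.0, 32.0, 31.0]   # slope -10 w.r.t. the hole term
psnr_d = [34.0, 33.5, 33.0, 32.5]   # slope -0.5 w.r.t. the direction term

# Weight the direction term so that a unit change affects the PSNR as
# much as a unit change in the hole term does.
lambda_d = slope(f_h, psnr_h) / slope(f_d, psnr_d)
```
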

F. CONTRIBUTION
This paper proposes a systematic algorithm to compose an optimal subset of the reference pictures to reduce the hole area and increase the accuracy of the overlapped pixel data in the virtual picture. The process consists of (step 1) screening the redundant reference pictures and (step 2) composing the optimal subset from the remaining reference pictures. These steps are conducted based on logical and geometric measurements.
In the screening step, all reference pictures are tested according to (A) the area of the overlap region, (B) disocclusion, and (C) redundancy; these terms are measured with the geometric metrics explained in Sections IV-A, IV-B, and IV-C, respectively.
After screening, it is challenging to compose the optimal subset out of the remaining M reference pictures because the number of reference pictures in the optimal subset can range from 1 to M; Table II summarizes this procedure. Using only the selected reference pictures, instead of all reference pictures, significantly reduces the computational complexity because decreasing the number of used reference pictures reduces the number of executions of very complex modules, such as depth information generation, warping, blending, and post-processing. Thus, the proposed algorithm is effective in reducing computational complexity.
On the other hand, as for the quality of the synthesized picture, it is obvious that the degradation of the virtual view dominantly depends on the hole size. The terms in (15), (16), and (24), which are related to the hole size, are used to check whether the reference picture is appropriate to construct the virtual view minimizing the hole size. We can expect that the hole size is reduced in the synthesized picture because the screening and composing of the optimal subset are based on those equations.
Additionally, the terms in (17)- (22), (29), and (32) are related to the mutual compensation of the reference pictures. The compensation increases the quality of the synthesized picture. Therefore, we can expect that the proposed algorithm is effective in increasing the quality of the virtual view.

V. EXPERIMENTAL RESULTS

A. EXPERIMENTAL SETUP
We compared the proposed method with the latest conventional methods, such as VSVS [31], PreSim [32], and a no-selector scheme, to demonstrate the performance of the proposed algorithm. When the no-selector scheme is used in the synthesis, all reference views are employed to create a virtual view. We implemented VSVS and PreSim in C++. The proposed algorithm and the conventional methods are used as the reference view selection module within the entire virtual view synthesis system. In this simulation, RVS was used as the view synthesis system because it is a state-of-the-art virtual view synthesizer and part of the MPEG-I reference software [22]. The software for RVS was implemented using C++ and the OpenGL Shading Language; therefore, the programs were executed on a graphics processing unit. All experiments were executed on a computer equipped with an AMD Ryzen processor at 3.40 GHz, 32 GB RAM, and an NVIDIA GeForce RTX 2070 SUPER.
The algorithms are tested with the picture sets shown in Figs. 3, 13, and 14. The authors took the pictures in Figs. 3 and 13, which show the umbrella, chair, checkerboard, C908, checkerboard2, sink, animal, cabinet, wallpaper, and wallpaper2 image sets. These picture sets were taken using the built-in camera of a Xiaomi Redmi Note 8T smartphone. The camera resolution is 3000 × 4000. The baseline between adjacent cameras is approximately 15 cm to 1 m. All datasets were captured casually with arbitrary movement. The information on these testing sets is summarized in Table III. The pictures in Fig. 14 are MPEG-I test sequences used under the common test conditions during the standardization of the immersive video codec; they depict the barn, breakfast, breaktime, frog, Magritte, and mirror scenes. The information on these testing sets is summarized in Table IV.

B. SUBJECTIVE QUALITY OF SYNTHESIZED IMAGES
In Fig. 15, we compare the visual quality of the virtual view pictures synthesized by RVS incorporating the reference view selection methods: no-selector, VSVS, PreSim, and the proposed algorithm, from the left-most to the right-most column. This figure presents the results for the umbrella, checkerboard, frog, sink, and animal sets from the top to the bottom rows. The important regions to compare are indicated with red rectangles.
In the top row, the left rectangle regions of the umbrellas synthesized by no-selector and PreSim are stained because these methods construct the virtual view using inappropriate reference pictures. The right rectangle region of the umbrella constructed by the proposed algorithm has sharper edges than those of the other methods.
In the second row, VSVS produces stretched parts resulting from hole filling, which severely degrades the visual quality. Comparing the images synthesized from the frog data set in the third row, the face of the man in the pictures from no-selector, VSVS, and PreSim is severely blurred, whereas the proposed algorithm provides clear pixel values. In the fourth and fifth rows, VSVS reveals blurred and stretched regions because the method selects only two reference pictures, and the lack of data to construct the virtual view results in the stretched regions. Fig. 16 shows enlarged views of the red rectangle regions of Fig. 15, highlighting the visual differences between the pictures in Fig. 15. Based on the results in Figs. 15 and 16, the proposed algorithm outperforms the other conventional methods concerning the visual quality of the synthesized image.

C. OBJECTIVE QUALITY OF SYNTHESIZED IMAGES
We calculate the PSNR and structural similarity index measure (SSIM) of the synthesized pictures to evaluate the objective performance of the algorithms. The PSNR and SSIM were measured between the synthesized and ground-truth images. The values averaged over all views are summarized in Tables V to VIII, where the best and second-best values are shown in red and blue, respectively.
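The PSNR computation used in this evaluation is standard; a minimal sketch follows (SSIM, which involves local window statistics, is omitted, and the pixel values are illustrative):

```python
import math

def psnr(ref, test, peak=255.0):
    """PSNR in dB between a ground-truth image and a synthesized image,
    both given as flat sequences of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Tiny illustrative example: MSE = (0 + 64 + 25 + 36) / 4 = 31.25.
gt = [0, 128, 255, 64]
syn = [0, 120, 250, 70]
score = psnr(gt, syn)
```
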
In Table V, the experiments are conducted with the test image sets of Figs. 3 and 13, which were taken by the authors along arbitrary directions at irregular positions. In this table, the proposed algorithm has the best performance for all test image sets except the umbrella set. For the umbrella set, because the test pictures were taken within a narrow view angle with a wide baseline, the best performance was obtained when many reference pictures were used; therefore, the PSNR increases as the number of reference views increases.
The PSNRs for the MPEG-I test data sets are summarized in Table VI, where the proposed algorithm has the best or second-best performance for all data sets. Among the test data sets, the pictures in the frog set were captured by an elaborately arranged multi-camera setup. Thus, the depth information for the frog data is accurate and provides high-quality data for constructing the virtual view. In this case, using two (left and right) reference pictures efficiently synthesizes the virtual view, so VSVS has the best PSNR for the frog data.
The SSIM values in Tables VII and VIII indicate the structural quality of the synthesized pictures. As observed in these tables, the proposed algorithm achieves the highest or second-highest SSIM for all test image sets. Based on the results in Tables V to VIII, the proposed algorithm objectively outperforms the other methods.

D. COMPUTATIONAL COMPLEXITIES
To compare the computational complexities of the algorithms, we measured the CPU time consumed by each, as summarized in Tables IX and X. Table IX lists the results for the test image sets in Figs. 3 and 13, and Table X summarizes the times for the MPEG-I test sets. The time is reported in sec/frame, averaged over all pictures in each test data set. As observed in these tables, VSVS is the fastest algorithm for all test image sets except the frog and mirror sets because VSVS selects only the two nearest reference views among all pictures in each set. PreSim selects up to 10 reference pictures, excluding the inappropriate reference views; thus, its time is longer than that of VSVS. In Tables IX and X, the no-selector algorithm requires the longest time because it uses all reference pictures to synthesize a virtual view. Although the proposed algorithm requires the second-shortest time for most data sets, its times are only slightly longer than those of VSVS. These results imply that the proposed algorithm is computationally simple compared with the conventional methods.
We also counted the number of reference views used to synthesize a virtual view by no-selector, VSVS, PreSim, and the proposed algorithm in Tables XI and XII. As observed in these tables, no-selector uses all reference pictures in each set, VSVS selects the two nearest reference pictures, and PreSim selects up to 10 reference pictures. Compared with the other methods, the proposed algorithm uses only three or four reference views.
The results in Tables IX, X, XI, and XII indicate that the proposed algorithm is one of the simplest techniques. It selects the most necessary reference pictures by minimizing the cost function in (23). In addition, using only the selected reference views decreases the algorithm complexity.
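The iterative composing step described above can be sketched as a greedy loop that grows the reference subset one view at a time while the cost decreases. This is a hypothetical illustration only: the actual cost function (23) combines hole area, complementarity, and positional uniformity of the reference views, so `cost` below is a placeholder callable, and `max_refs` reflects the three-to-four views reported in Tables XI and XII.

```python
def compose_reference_subset(candidates, cost, max_refs=4):
    """Greedily grow the reference subset: at each iteration, add the
    candidate view that minimizes the cost of the enlarged subset, and
    stop once adding another view no longer reduces the cost.
    `cost` stands in for the paper's cost function (23)."""
    subset = []
    remaining = list(candidates)
    while remaining and len(subset) < max_refs:
        best = min(remaining, key=lambda v: cost(subset + [v]))
        # stop if the best candidate no longer improves the cost
        if subset and cost(subset + [best]) >= cost(subset):
            break
        subset.append(best)
        remaining.remove(best)
    return subset
```

With a toy cost such as `lambda s: (5 - sum(s)) ** 2`, the loop first picks the view that brings the subset closest to the target and stops as soon as adding more views would increase the cost.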

VI. CONCLUSION
The performance of a virtual view synthesizer predominantly depends on the set of reference views. This study analyzed the relationship between the number of reference images and the synthesized virtual view quality. Based on this analysis, we proposed a systematic algorithm to compose the optimal set of reference views. The proposed algorithm consists of screening and composing steps for the reference pictures. In the screening step, the area of the overlap region, disocclusion, and redundancy were defined as geometric metrics, which were used to exclude reference pictures that were ineffective in constructing the virtual view. After the screening step, the optimal subset of the remaining reference pictures was composed by an iterative method, in which the number of reference pictures in the considered subset increased as the iteration was repeated. While composing the subset, the hole area, the complementary relationship of the reference pictures, and the uniformity of the reference picture positions were considered based on the geometric measures. A key challenge is choosing an appropriate subset of the given reference pictures; this work resolved the problem using geometric measures and a systematic procedure. The simulation results demonstrate that the proposed algorithm effectively reduces the hole size and blurring artifacts compared with conventional algorithms.