A High-Quality VR Calibration and Real-Time Stitching Framework Using Preprocessed Features

Virtual Reality (VR) contents include $360^{\circ }\times 180^{\circ }$ seamless panoramic videos, stitched from multiple overlapping video streams. Many commercial VR devices use a two-camera rig to capture VR content. These devices suffer from increased radial distortion effects along the seams of the stitching boundary. Moreover, a fixed number of cameras in the camera rig makes the VR system non-scalable. Since the VR experience is directly related to the quality of VR content, it is desirable to create a VR framework that is scalable in terms of the number of cameras attached to the camera rig and has better geometric and photometric quality. In this paper, we propose an end-to-end VR system for stitching full spherical content. The VR system is composed of camera rig calibration and stitching modules. The calibration module performs a geometric alignment of the camera rig. The stitching module transforms texture from the camera or video stream into the VR stream using lookup tables (LUTs) and blend masks (BMs). In this work, our main contribution is the improvement of stitching quality. First, we propose a feature preprocessing method that filters out inconsistent, error-prone features. Second, we propose a geometric alignment method that outperforms state-of-the-art VR stitching solutions. We tested our system on diverse image sets and obtained state-of-the-art geometric alignment. Moreover, we achieved real-time stitching of camera and video streams at up to 120 fps at 4K resolution. After stitching, we encode VR content for IP multicasting.


I. INTRODUCTION
The problem of stitching and rendering a panoramic video, from multiple overlapping camera frames, is a well-known and widely sought-after research topic. Panoramic videos offer an immersive experience for users across a wide range of commercially available devices. The availability of these devices helps in real-world applications of human-computer interaction. However, these devices lack quality in terms of stitching alignment, content resolution, and rendering speed. Moreover, commercially available devices for VR content creation and consumption either have a fixed number of cameras [1], usually connected to a pre-aligned camera rig, or limited display options. We refer to such devices as
non-scalable or limited-scalability devices. It is commercially desirable to design a scalable VR content creation and consumption system that can be reconfigured with a variable number of cameras and different types of local or network-based [2], [3] display devices [4].

A. IMMERSIVE CONTENT CREATION
The process of stitching a panoramic canvas from multiple overlapping images is highly dependent on scene texture, camera quality, and the type of lenses. Typically, panorama stitching requires a camera to be rotated around its axis in a panning movement such that there is an overlap between consecutive frames. Variations in scene texture, lighting conditions, overlap area, and rotation axis create a challenging environment for stitching frameworks. These variations become even more challenging for multi-camera image acquisition that spans a full 360° panning movement. Stitching an immersive panorama demands higher system complexity and economic costs. System complexity can be reduced by using wide-angle lenses, with additional constraints for handling their non-linearity. Several commercially available platforms can stitch panoramas under dynamic conditions and unconstrained environments.

B. IMMERSIVE CONTENT CONSUMPTION
A recent surge in demand for immersive content has resulted in the commercial availability of various kinds of immersive content consumption devices. These devices range from single-user head-mounted VR systems [5] to wall-mounted group immersion display systems. Rendering high-resolution content on these devices requires high computational complexity and economic cost. Furthermore, the different categories of immersive display devices demand a content creation system that can work across all of them [6], [7]. Therefore, the design of a consumer-grade and economically viable real-time 360° stitching system involves increasing system performance while reducing cost and complexity.

C. CONTRIBUTIONS
In this paper, we present a cost-effective and scalable VR system for stitching immersive content in real-time. The stitching system has a calibration phase followed by a real-time stitching phase. First, we present an algorithm for system calibration that uses a feature preprocessing stage to clean and trim feature matches. After feature preprocessing, an iterative bundle adjustment strategy is used for the alignment of multiple overlapping cameras. Second, we provide a real-time stitching module that delivers immersive content on multiple devices at 4K resolution. We achieve real-time stitching at 60 fps from the live camera feed and 120 fps from videos stored on a disk. Our framework can be used locally or over the network, which makes it ideal for remote panorama stitching and parallel processing.

II. RELATED WORK
Early panorama stitching methods used semi-automatic algorithms that required user input for stitching mosaics. Commercially available software was able to stitch panoramas with the help of the user's input. The semi-automatic stitching process was replaced by fully automated algorithms [8]-[10] with the use of scale-invariant features [8], [11]. Improvements in the reliability of fully automated stitching systems paved the way for very large-scale multi-camera, multi-display panoramic stitching [12] and rendering systems [13]-[18]. Further advancements in software and hardware resulted in portable devices such as smartphones and handheld cameras that can create very high-quality panoramic images in real-time [19]. A typical fully automated stitching algorithm [10], [13], [20]-[23] uses rectilinear lenses to capture multiple overlapping images and estimates the homographic relationship between them. The homography between these images is used to warp each image into its relative position on a planar, cylindrical, or spherical canvas, with seams blended in such a way that the resultant image appears to be a seamless mosaic with a field-of-view (FoV) larger than each camera's FoV.
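The homography estimation at the heart of such stitchers can be sketched with a minimal Direct Linear Transform (DLT); production systems wrap this in RANSAC and use normalized coordinates, both omitted here for brevity:

```python
import numpy as np

def estimate_homography(pts_a, pts_b):
    """Direct Linear Transform: find H such that pts_b ~ H @ [x, y, 1]^T."""
    A = []
    for (x, y), (u, v) in zip(pts_a, pts_b):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography vector is the right null vector of A (last row of V^T).
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]          # fix the projective scale

def apply_homography(H, pts):
    """Warp 2D points with H (a per-pixel version fills the warped canvas)."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:]
```

With at least four non-degenerate correspondences the estimate is exact for noise-free data; real feature matches require the robust verification discussed below.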

A. SPHERICAL PANORAMA STITCHING
Rectilinear lenses have a limited FoV; therefore, multiple cameras with overlapping FoV are required for very wide area coverage [24]. Recently, the inclusion of virtual and augmented reality in most commercially available multimedia devices has increased the demand for content created with wide and ultra-wide-angle lenses such as fisheye lenses. Stitching images and videos from such lenses is particularly helpful in various fields. For example, people use navigation services for commuting from one place to another; maps generated by a wide-angle lens mounted on a car or a drone and driven throughout a city are particularly helpful. In another case, satellite imagery of different planets and constellations can be stitched together and viewed in an immersive and interactive environment. Panoramic videos of live sports matches [25], stitched from wide-angle cameras, are also becoming mainstream. Wide-angle lenses capture curvilinear images that can be used for stitching a full spherical panorama with a FoV of 360°×180° [26]-[29]. Although the wide FoV of these lenses reduces the number of required cameras, the radial distortion degrades the quality of distortion-corrected images [30]. A smaller number of cameras is desirable since it is economical from a commercial perspective. Therefore, the challenge of mitigating lens distortion while stitching and transmitting a high-quality panoramic stream for VR applications is actively explored by researchers [31]-[33]. There are several commercially available, hardware-based, full spherical stitching solutions [34]-[38]. Most of these solutions employ two cameras with fisheye lenses that work in either 180° or 360° mode. These solutions are not scalable in terms of camera and display configurations. Furthermore, the seam quality at the overlap area is poor because only two cameras are used, which increases distortion.
Moreover, these solutions are very expensive. Software-based state-of-the-art solutions for stitching a full spherical panorama, such as [39] and [40], are also commercially available. However, a majority of these solutions are not real-time, and the stitched video exhibits poor frame rates. Lee et al. [41] proposed a real-time high-resolution 360° video foveated stitching method; however, its average frame rate for 4K stitching is very low.

B. PANORAMA STITCHING WITH BUNDLE ADJUSTMENT
A simple way of stitching multiple images, given geometrically matched features, is by finding a homographic relationship.
A homographic relationship between a pair of matching images represents a geometric transformation from one image to another, such that one of the images is warped onto the other to create a single unified view. However, the homographic relationship for a sequence of matching images tends to accumulate geometric errors. In the case of a full spherical panorama, the homography estimation method creates a gap between the first and last camera images [10]. The gap can be removed by applying constraints and using a camera parameter refinement method. Moreover, homography is a linear relationship between two matching images, which cannot be used to estimate the relationship between two images with nonlinear distortions, such as fisheye images. Bundle adjustment [9], [31], [42] is the method of choice for such problems. It uses the Levenberg-Marquardt (LM) optimization algorithm [43] to jointly estimate the camera intrinsic and extrinsic parameters of all images in the form of a bundle. The parameter estimation is performed using a cost function that measures the 2D or 3D distance between matched features; minimizing it simultaneously estimates the camera parameters. The cost function is usually sensitive to noise; therefore, a data filtering method [44] is essential for an accurate solution. Bundle adjustment can be performed on all images at once, called global bundle adjustment, or iteratively on pairs of images, called iterative or local bundle adjustment.
In this paper, we stitch multiple video streams, captured from nonlinear sensors, into a 360°×180° panoramic video using LM optimization. The pipeline starts with the extraction of SIFT [8] features, followed by matching these features across the other cameras' frames using K-Nearest Neighbors [45], [46]. Geometrically consistent feature matches are obtained using Random Sample Consensus (RANSAC) [47]. The matched features are further processed by our proposed feature preprocessing method. The processed feature matches are mapped onto a sphere, and the distance between them is minimized along with the joint estimation of camera parameters using LM optimization. After estimating the camera parameters, we create LUTs for real-time stitching.
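The matching stage of this pipeline can be sketched as follows; for illustration we use brute-force nearest neighbours on synthetic descriptors rather than the k-d tree of [45], and SIFT extraction and RANSAC verification are omitted:

```python
import numpy as np

def knn_ratio_matches(desc_a, desc_b, ratio=0.8):
    """Match descriptors with Lowe's ratio test: keep a candidate only if
    its nearest neighbour is clearly closer than the second nearest."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        delta = dists[j1] / max(dists[j2], 1e-12)  # distance ratio delta_l
        if delta < ratio:
            matches.append((i, int(j1), float(delta)))
    return matches
```

An ambiguous descriptor, equally close to two candidates, gets a ratio near 1 and is rejected, which is exactly the behaviour the later trimming stage relies on when it ranks matches by their distance ratio.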
The organization of this paper is as follows. The proposed method is discussed in section III. Section III (A) includes a detailed discussion on the calibration module whereas section III (B) includes a detailed discussion on real-time stitching with LUTs and BMs. Section IV contains experimental results. We present the conclusion of our work in section V.

III. PROPOSED METHOD
We propose a VR system by modeling a framework in two phases, i.e., a calibration phase and a stitching phase. In the first phase, the task of system calibration is achieved by estimating the geometric alignment of multiple cameras. We provide six video streams as input to our proposed framework. The video streams are either directly transmitted from cameras at 60 fps or loaded from a storage disk at over 120 fps. The storage disk is an NVMe SSD, which provides read/write speeds sufficient for handling 4K videos at over 120 fps. Each video stream is captured using a fisheye lens, and its k th frame is represented by C_FEi(k), where i = 1, ..., N indexes the cameras, N is the number of cameras, and k is the frame number. We use the geometric alignment parameters to create LUTs [13], [48]. In the second phase, we use these LUTs for mapping texture quads from the camera domain directly onto the panorama domain. The mapping process creates a seamless presentation of the scene on the panoramic canvas by applying linear blending with BMs. We denote the k th panorama frame by P(k). The stitched panoramic stream is encoded and transmitted to multiple display devices. The flow of the proposed framework is shown in Fig. 1.

A. CALIBRATION PHASE
In the calibration phase, the camera frames C_FEi(k) are transformed from fisheye to equirectangular images C_i(k). The camera correspondences are evaluated by extracting scale-invariant features [8] from the equirectangular images.
The features from one image are matched across all other images [45]. The features extracted from an equirectangular image are described by F_i(k) = {(x_h, d_h)}, where h indexes the extracted features, x_h ∈ R² is the feature location, and d_h is the SIFT descriptor of the h th feature in the k th frame of the i th camera C_i(k). Features from each equirectangular image are matched with those of the other cameras to get overlapping camera pairs CP using k-nearest-neighbors matching [45]. The matched pairs of camera frames are geometrically verified with RANSAC [47]. We define camera pairs by CP = {(a, b, δ_l(k))}, where δ_l(k) is the ratio of distances from the l th feature in image a to its first and second best matched features in image b of a pair, obtained from k-nearest-neighbor matching.
For each overlapping camera pair, the matched features may be dispersed in the form of non-uniform clusters. Since we are using LM optimization [43] for geometric alignment, which is very sensitive to noisy data, an incorrectly matched feature or cluster of features may pull the final solution into an incorrect local minimum. Moreover, a large number of matched features makes bundle adjustment slower. We propose a feature preprocessing method for handling inconsistent, error-prone features before estimating the geometric alignment. The feature preprocessing module consists of a grid-based trimming method for removing redundant clusters of matched features and an outlier filtering method for removing error-prone features.

1) GRID TRIMMING
This section describes the process of trimming redundant feature matches in an overlap area. Consider the first frame from two overlapping cameras, C_a(1) and C_b(1), shown in Fig. 2(a, b). Assume that there are L feature matches in the overlapping area of the two images, shown in Fig. 2(a). We create a region-of-interest (RoI) in the overlap area, shown in Fig. 2(b, c). The dimensions of the region are RoI(x, y, width, height), such that all features reside inside this region. We then create a grid of size S × S in the RoI, where S ∈ {3, 6, 12}, as shown in Fig. 3(b, c, d). For each matched feature F_l(1)a, where l = 1, 2, . . . , L, we locate its grid box by scaling its x and y coordinates to the grid dimensions and assign the matched feature to that box. If multiple matched features are assigned to a box, the trimming process applies a selection criterion (γ) such that only one matched feature is selected per grid box.
We propose two criteria for matched feature selection in a grid box. In the first selection criterion (γ1), the matched feature pair (F_l(1)a, F_l(1)b, δ_l(k)) with the lowest distance ratio between the first and second-best matched features (δ_l(k)) is selected, while the rest of the matched features in the grid box are removed. In the second selection criterion (γ2), a matched feature pair (F_l(1)a, F_l(1)b, δ_l(k)) is chosen at random from all matched features in the grid box. The trimming process is repeated for all camera pairs, and the trimmed features are represented by M_T, where L_T is the number of trimmed feature matches between camera a and camera b of a pair. Furthermore, we use a distribution score θ_score, as in (1), to estimate the quality of the distribution of trimmed feature matches along the grid.
where L_min is a limiting threshold for the minimum number of trimmed feature matches, S is the grid size, and L_T is the number of trimmed features. A higher value of θ_score indicates that the trimmed feature matches are dispersed more uniformly across the grid boxes, whereas a negative value indicates that the grid size or the number of trimmed feature matches is too small. The number of trimmed feature matches must be greater than the limiting threshold L_min = 6 to get a positive distribution score.
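A minimal sketch of the trimming step with the γ1 criterion might look as follows; the flat feature tuples and RoI layout are simplified stand-ins for the structures described above:

```python
def grid_trim(feats, roi, S=6, criterion="distance_ratio"):
    """Keep at most one matched feature per grid box.

    feats: list of (x, y, delta) tuples inside the RoI, where delta is the
           first-to-second-best distance ratio of the match.
    roi:   (x0, y0, width, height) of the overlap region of interest.
    """
    x0, y0, w, h = roi
    boxes = {}
    for x, y, delta in feats:
        # Scale coordinates to grid indices (clamped to the last box).
        gx = min(int((x - x0) / w * S), S - 1)
        gy = min(int((y - y0) / h * S), S - 1)
        key = (gx, gy)
        if key not in boxes:
            boxes[key] = (x, y, delta)
        elif criterion == "distance_ratio" and delta < boxes[key][2]:
            boxes[key] = (x, y, delta)   # gamma_1: keep the lowest ratio
        # gamma_2 (random selection) would pick uniformly instead.
    return list(boxes.values())
```

At most S² matches survive per overlap, which both caps the bundle-adjustment cost and spreads the surviving matches across the grid.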

2) OUTLIER FILTERING
The process of trimming redundant features is followed by outlier filtering. Outlier filtering removes loosely matched features, which create a drift in the iterative bundle adjustment process and degrade stitching quality. The process of outlier filtering is described in this section. Consider the first frame from a pair of overlapping cameras, C_a(1) and C_b(1). Assume that there are L_T trimmed feature matches, represented by M_T, between camera a and camera b of a pair. We align the feature matches M_T on a sphere and represent these points by X_f = {X_l,i ∈ R³ | l = 1, .., L_T, i ∈ {a, b}}. The spherical distance between X_l,a and X_l,b is δ_l, as in (2).
Assuming that the spherical distances between trimmed feature matches are normally distributed, we use an outlier filtering threshold τ_f, as in (3), for outlier removal.
where n is the number of standard deviations. The process of outlier removal is repeated for each overlapping camera pair, followed by a collective outlier filtering of all camera pairs to take the group geometry into account. From (3), the first, second, and third standard deviations contain 68%, 95%, and 99.7% of the total number of data points, in accordance with the 3-sigma rule [49]. Therefore, filtering with n = 1, 2, or 3 is highly likely to remove outliers without much loss of inliers. After filtering outliers, the matched inliers between cameras a and b are represented by M_F, where L_F is the number of filtered inliers.
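Under the normality assumption, the filtering step can be sketched as below; the threshold form τ_f = μ + nσ is an assumption here, since (3) defines the exact expression:

```python
import numpy as np

def sigma_filter(distances, n=1):
    """Keep matches whose spherical distance lies within n standard
    deviations of the mean (assumed threshold tau_f = mu + n * sigma)."""
    mu, sigma = distances.mean(), distances.std()
    tau_f = mu + n * sigma
    return distances <= tau_f   # boolean inlier mask
```

A single gross mismatch inflates both the mean and the standard deviation, yet still lands far beyond τ_f, so it is removed while the bulk of consistent matches survives.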

3) GEOMETRIC ALIGNMENT
Geometric alignment is the process of estimating camera intrinsic and extrinsic parameters. The intrinsic parameters are the focal length and the principal point. The extrinsic parameters are the camera orientations in the form of rotation matrices. If the camera lenses induce radial distortion, the radial correction parameters are also estimated in the geometric alignment process. In a typical geometric alignment process, a view-graph tree of overlapping cameras is created with the help of scale-invariant feature matches. The view-graph tree contains a camera image at each vertex. If two or more cameras have an overlapping scene and matching features, we connect their vertices with edges. We assign each edge a weight equal to the number of matching features between the cameras of the connected vertices. We use LM optimization for the estimation of camera parameters. In LM optimization, a residual function that relates the matching features to the camera parameters is minimized. We follow an approach similar to [50], where we use a local optimization between individual vertex pairs connected with edges, followed by a global optimization of all vertices in the view-graph tree. To minimize the residual function, we first map matched scale-invariant features between connected vertices onto a sphere. We use vertical lines along with scale-invariant features to keep the geometric alignment upright along the z-axis. Second, we use the objective function in (4) for residual minimization, which minimizes the distance of matched features between camera a and camera b.
where δ_l, as in (2), is the spherical distance between a pair of matching features x_l,a(k) and x_l,b(k), L_F is the number of preprocessed features, and X_l,a(k) and X_l,b(k) are the feature vectors mapped on a sphere, corresponding to the matching features x_l,a(k) and x_l,b(k) from cameras a and b. To minimize the distance between pairs of matching features, we map each pair onto a sphere with the spherical transformation function F(x_l,(a,b)(k), β), where β represents the camera parameters, and denote the transformed feature points by X_l,(a,b)(k). Assume that the l th feature from camera b is mapped on the sphere at point X_l,b(k) with transformation F(x_l,b(k), β_b). The transformation function F(x_l,a(k), β_a) transforms the l th feature from camera a to a point X_l,a(k) in the spherical domain using camera parameters β_a. We use (4) as the cost function in LM optimization and minimize the residual error e. The residual error is jointly minimized for all pairs of matching features in the form of a bundle by varying the camera parameters. The camera parameters β = {β_i | i = 1, .., N and a, b ∈ i} are classified into two categories, i.e., intrinsic and extrinsic parameters. We assume that the intrinsic parameters, which include the focal length, principal point, and radial distortion [51], are shared by all cameras. The extrinsic parameters, the per-camera rotations, are assumed to be unique to each camera. The iterative alignment then proceeds as follows:
a. Build the view-graph tree and select the edge with the maximum weight as the initial camera pair.
b. Map the feature matches of the initial camera pair onto a sphere and use (4) to jointly minimize the residual error while estimating the camera parameters.
c. Sort the vertices connected to the vertices of the initial camera pair from step b. Select the connected vertex with the maximum weight. Map the feature matches of the selected camera onto a sphere and use (4) to jointly minimize the residual error while estimating the parameters of the newly selected camera. Repeat this process for all vertices, one by one.
d. Use the local geometric alignments from steps b and c as a starting point and perform a global geometric alignment of the cameras from all vertices, using (4) to jointly minimize the residual error while refining the camera parameters of all cameras in the view-graph tree.
The pseudo-code for the calibration phase is given in Algorithm 1.
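The spirit of the LM refinement in (4) can be illustrated with a toy one-parameter version: features from camera b lie on the unit sphere, camera a's features differ by an unknown yaw, and a Gauss-Newton loop (the backbone of LM) recovers it. The real system optimizes focal length, principal point, distortion, and per-camera rotations instead of a single angle:

```python
import numpy as np

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def residuals(theta, Xa, Xb):
    """Per-coordinate residuals between rotated camera-a points and their
    matches from camera b (a stand-in for the bundle residual e in (4))."""
    return (Xa @ rot_z(theta).T - Xb).ravel()

def align_yaw(Xa, Xb, theta0=0.0, iters=25, eps=1e-7):
    """Gauss-Newton on the single yaw parameter with a numeric Jacobian."""
    theta = theta0
    for _ in range(iters):
        r = residuals(theta, Xa, Xb)
        J = (residuals(theta + eps, Xa, Xb) - r) / eps
        theta -= (J @ r) / max(J @ J, 1e-12)   # normal-equation step
    return theta
```

LM differs from this plain Gauss-Newton loop only in damping the normal-equation step, which is what makes the full bundle robust when the residual surface is far from quadratic.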

B. STITCHING PHASE
After the geometric alignment of the cameras, the camera contents are rendered on a two-dimensional equirectangular canvas. We create an equirectangular canvas with a 360:180 aspect ratio, which contains 4K×2K pixels. As in (5), pixels can be mapped directly.
where a pixel p^c_i(r, c), located at the r th row and c th column of the i th camera, is mapped to the equirectangular canvas. The function T_i^{C→P} is the transformation of a pixel from the camera frame to the equirectangular panoramic canvas. It is a composition of the transformation from the camera to the sphere followed by the transformation from the sphere to the equirectangular domain. For simplicity, we write T_i^{C→P}(p^c_i(r, c)) = T^{sp→eq}(T^{C→sp}(p^c_i(r, c))) as a single transformation from the camera to the panorama canvas. Direct mapping of camera content is not desirable because of forward-mapping holes and computational complexity.
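The sphere-to-equirectangular half of this transform (T^{sp→eq}) can be sketched as below; the camera-to-sphere half depends on the fisheye model and the calibration parameters β, so it is omitted:

```python
import numpy as np

def sphere_to_equirect(v, width=4096, height=2048):
    """Map a unit direction vector to (row, col) on a 2:1 equirectangular
    canvas: longitude spans the columns, latitude spans the rows."""
    x, y, z = v
    lon = np.arctan2(x, z)                # in [-pi, pi]
    lat = np.arcsin(np.clip(y, -1, 1))    # in [-pi/2, pi/2]
    col = (lon / (2 * np.pi) + 0.5) * (width - 1)
    row = (0.5 - lat / np.pi) * (height - 1)
    return row, col
```

The forward direction lands at the center of the canvas and the zenith on its top row, which matches the 360°×180° layout of the panorama.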

1) LOOKUP TABLES AND BLEND MASKS
Algorithm 1 Calibration Phase in VR Framework
Input: Equirectangular frames C_i(k) obtained from the camera rig in an unknown order, where i is the camera number and k is the frame number, such that i = 1 to 6 and k = 1. (The camera frames are transformed from the fisheye to the equirectangular domain.)
Output: Camera calibration parameters β_i, LUTs T_i, and BMs B_i.
Variables: F_i(k) are the SIFT features in the k th frame of the i th camera; M_ab(k) are the matched features obtained from k-d tree based nearest neighbor matching; δ_l(k) is the distance ratio of the first and second best matched l th features in cameras a and b; S is the size of the grid used for trimming; γ is the selection criterion in the trimming process; n is the outlier filtering parameter; X_(k) are the matched features mapped on a sphere; and V is the view-graph.
1 function Preprocess(M_ab(k), δ_l(k)):

After the estimation of calibration parameters β, we use a LUT approach for mapping quadrilaterals of camera contents on a panoramic canvas [13], [48]. For this purpose, we use OpenGL routines and achieve real-time rendering speeds of up to 120 fps for 4K×2K pixels. A LUT contains four coordinate correspondences per quadrilateral between the camera and the panorama canvas. The number of quadrilaterals per camera frame, along the horizontal or vertical axis, is equal to the greatest common divisor of the width and height of the camera frame, denoted by Q = GCD(W, H), whereas the width and height of each quadrilateral will be q_w = W/Q and q_h = H/Q, where W and H are the width and height of the camera frames. We use inverse mapping from the panoramic canvas to each camera frame to avoid forward-mapping holes. Assume that the top left corner of each quadrilateral in the panoramic canvas is denoted by P^p_{i,q}, where q ∈ {1, .., Q²}. Therefore, for a quadrilateral with its top left corner at P^p_{i,1}, the top right, bottom left, and bottom right corners will be P^p_{i,2}, P^p_{i,1+q}, and P^p_{i,2+q}.
We use the inverse of calibration parameters β for each camera to transform these corner points to each camera frame, as in (6).
where P^p_{i,q} is the top left corner of the q th quadrilateral in the panoramic canvas. We save this mapping information in the form of LUTs, one per camera, denoted by T = {T_i | i = 1, .., N}. Camera frame texture mapped onto the panoramic canvas with LUTs lacks spatial coherence; the incoherence is caused by abrupt texture changes in the overlap areas. We create a smooth fading effect across the overlap-area texture by applying a BM to each transformed camera frame. The BMs are two-dimensional intensity layers having the same size as the panoramic canvas, denoted by B = {B_i | i = 1, .., N}. The seamless version of the stitched panorama is obtained by multiplying the BMs with the geometrically transformed camera frames and summing them up on the canvas, as in (7).
where the k th frame from the i th camera is denoted by C_FEi(k) and is transformed by LUT T_i and BM B_i.
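Two pieces of this machinery are easy to sketch: the quad layout derived from GCD(W, H), and the blend-and-sum of (7). The full HD portrait frame dimensions are assumed from the experimental setup:

```python
import numpy as np
from math import gcd

def quad_grid(W=1080, H=1920):
    """Q x Q quads per camera frame, each q_w x q_h pixels."""
    Q = gcd(W, H)
    return Q, W // Q, H // Q   # quads per axis, quad width, quad height

def blend(warped_frames, blend_masks):
    """Eq. (7): weight each LUT-warped frame by its blend mask and sum."""
    return sum(f * m for f, m in zip(warped_frames, blend_masks))
```

With masks that sum to one everywhere on the canvas, the summation yields a constant overall gain, so the seam fades linearly across the overlap instead of switching abruptly between cameras.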

2) CONTENT RENDERING
The LUTs and BMs are created only once, using the first frames, at the initialization of the VR calibration and stitching system. The rest of the frames are rendered directly on the canvas. We get a frame rate of 60 or over 120 panorama frames per second, depending on the input source: camera feed or storage disk. The frames are shared via the network with multiple display devices. For network sharing, we encode the panoramic frames with H.264. The transmitted stream can be consumed across different media, such as YouTube, a VR headset, or custom-designed displays. In the next section, we demonstrate the superiority of our calibration and stitching system.

IV. EXPERIMENTS
We test our system with six video streams. The video streams are either acquired directly from a camera rig, shown in Fig. 4(a, b), at 60 fps or loaded from a storage disk at 120 fps. Each video stream is captured with a camera having a fisheye lens with a vertical FoV of 180° and a horizontal FoV of ∼75°. The resolution of each camera frame is full HD, and the frames are captured in portrait mode. We capture frames with an overlap of approximately 10%-25% with neighboring cameras. These frames are stitched into a full spherical panorama with horizontal and vertical FoVs of 360° and 180°, respectively. The calibration system is equipped with a quad-core processor at 3.3 GHz-3.9 GHz and 64 GB of DDR4 RAM. The calibration code is written in C++. For our VR calibration and stitching system, we address the following research questions:
• How to improve stitching quality with similar or better calibration time?
• How to design an efficient end-to-end framework for stitching VR content?
To answer these questions, we use 14 sets of camera frames captured at distinct places. These image sets have diverse illumination, texture, and scene variations, providing a challenging environment for the calibration and stitching process. We demonstrate improvements in the calibration phase by first trimming unwanted clusters of feature matches and then filtering the outlying error-prone feature matches. We create 1680 trials per camera pair, of which 672 trials are used to assess grid trimming and 1008 trials are used to evaluate the outlier filtering method.

A. IMPROVEMENTS WITH FEATURE PREPROCESSING
In an image set with six images, a single image has an average of 3004 SIFT features and 194 matched features per pair. The process of finding the alignment parameters for a camera pair takes 174 seconds, with an average error of 3.24 pixels. We improve the alignment quality of the stitched panorama by using the grid-based trimming and outlier filtering method. For this purpose, we first perform grid trimming at three different grid sizes, i.e., 3, 6, and 12, and assign matched features to the corresponding grid boxes.
We select one pair of matched features per grid box with the matched feature selection methods, i.e., the distance ratio criterion (γ1) and the random selection criterion (γ2). Fig. 5(a) compares the two criteria and shows that the distance ratio criterion γ1 outperforms the random selection criterion γ2 in terms of average error. Therefore, we select the γ1 criterion for grid trimming.
In Fig. 5(b), we compare the distribution score θ_score, as in (1), for grid sizes 3, 6, and 12. The θ_score for a grid of size 3 occasionally drifts to a negative value, whereas the grid of size 6 yields the highest score, ahead of the grid of size 12. The average distribution scores for grids 3, 6, and 12 are 0.00, 0.28, and 0.22, respectively. Therefore, we select grid size 6 for our framework. To find outliers among the trimmed feature pairs, we geometrically align the feature matches on a sphere. We assume that, for all matched features in overlapping camera pairs, the distances between corresponding features in the spherical domain are normally distributed. The outliers are detected and removed by calculating the mean and standard deviation of the distances between matched features. We use a maximum filtering threshold τ_f, as in (3), for valid inlier selection. We have tested the calibration system with n = 1, 2, and 3, where n is the number of standard deviations. Filtering outliers with n = 1 removes ∼32% of the matched features, which causes a significant decrease in projection error. However, this decrease alone might not signal an improvement in alignment quality, since the projection error does not guarantee that the matched features are dispersed uniformly. To select the value of n, we therefore also use the structural similarity measure SSIM [52]. The SSIM index ranges from 0 to 1, where a higher value indicates better stitching quality. In Table 1, we compare the projection error, SSIM [52], calibration time, and number of inliers, obtained by calibrating the 14 image sets for n = 1, 2, and 3 and grid sizes 3, 6, and 12 with the distance ratio selection criterion γ1.
We get the highest SSIM score for n = 1 and grid size 6. The projection error corresponding to the highest SSIM score is 2.26 pixels. Although 2.26 pixels is the second-best average projection error, we select this configuration for our framework because it yields the best SSIM. With the selected configuration, we obtain a 30.25% improvement in stitching quality compared to the panoramas obtained without any trimming and filtering. In Fig. 6, we compare the calibration performance of our framework with and without grid trimming and outlier filtering. The number of inliers after outlier filtering with n = 1, 2, and 3 and grid trimming with size 6 is shown in Fig. 6(a). Fig. 6(b) and 6(c) compare the projection error and calibration time.
In Table 2, we evaluate the effect of the proposed feature preprocessing method on our VR system. First, we use all matched features to estimate the geometric alignment of the 14 image sets; the average projection error is 3.24 pixels. Second, we use our proposed feature preprocessing method to process the feature matches. The geometric alignment obtained from the preprocessed feature matches is significantly improved, with an average projection error of 2.26 pixels.
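For reference, the quality measure used above can be sketched with a simplified single-window version of the SSIM index; the published index [52] averages this quantity over a sliding Gaussian window:

```python
import numpy as np

def global_ssim(a, b, L=255.0):
    """Single-window SSIM over whole images (a simplification of [52])."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # standard stability constants
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / \
           ((mu_a ** 2 + mu_b ** 2 + C1) * (va + vb + C2))
```

Identical images score exactly 1, and any structural disagreement pulls the covariance term down, which is why the index discriminates alignment quality where raw projection error does not.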

B. COMPARISON WITH COMMERCIAL SOLUTIONS
In Table 3, we compare our framework with state-of-the-art VR stitching solutions, i.e., the Pro 11.28 version of [39] (trial version) and version 2019.2.0 of [40]. We use the average projection error as the measure of quality for this comparison, where a lower value indicates better alignment. We use our framework with n = 1 and grid size 6, the configuration corresponding to the highest SSIM score, to obtain the average projection error for the 14 image sets. We stitch the same image sets with [39] and [40] and report the projection errors in Table 3. The highest average projection error, 5.94 pixels, is obtained from [40]. In contrast, [39] produced relatively good alignment for all image sets, with an average projection error of 2.34 pixels. Our framework, with an average projection error of 2.26 pixels, obtains a 3.42% improvement over [39]. A visual comparison of the panoramas stitched with our framework, [39], and [40] is shown in Fig. 7. The individual camera frames of image set 8 are shown in Fig. 7(a). The panoramas stitched with our framework, [39], and [40] are shown in Fig. 7(b), 7(c), and 7(d). A visual comparison of all 14 image sets for our framework, [39], and [40] is provided in the supplementary material.

V. CONCLUSION
In this paper, we have proposed a framework for stitching VR content. The proposed system performs system calibration and stitching of 4K content, followed by real-time rendering across a multitude of devices. We have tested the proposed system on a comprehensive set of images with challenging texture, illumination, and background variations. The proposed system achieves state-of-the-art stitching quality compared to commercial solutions, with an average projection error of 2.26 pixels. The stitched panoramic stream is rendered via multicasting at up to 120 fps. The calibration phase of the proposed framework is relatively slow in obtaining the initial camera rig alignment. This limitation can be addressed by porting it to a GPU in the future.