Cascaded and Generalizable Neural Radiance Fields for Fast View Synthesis

We present CG-NeRF, a cascaded and generalizable neural radiance field method for view synthesis. Recent generalizable view synthesis methods can render high-quality novel views from a set of nearby input views. However, their rendering speed is still slow due to the uniform point sampling nature of neural radiance fields. Existing scene-specific methods can train and render novel views efficiently but cannot generalize to unseen data. Our approach addresses the problems of fast and generalizable view synthesis with two novel modules: a coarse radiance field predictor and a convolutional-based neural renderer. This architecture infers consistent scene geometry from the implicit neural fields and renders new views efficiently using a single GPU. We first train CG-NeRF on multiple 3D scenes of the DTU dataset, and the network can produce high-quality and accurate novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations, while still maintaining the high-speed rendering of the pre-trained model. Experimental results show that CG-NeRF outperforms state-of-the-art generalizable neural rendering methods on various synthetic and real datasets.


INTRODUCTION
Novel view synthesis (NVS) is a long-standing task in computer vision and computer graphics that has applications in free-viewpoint video, telepresence, and mixed reality [1]. In novel view synthesis, visual content is captured from a set of sparse reference views and synthesized for an unseen target view. The problem is challenging since the mapping between views depends on the 3D geometry of the scene and the camera poses between the views. Moreover, NVS requires not only the propagation of information between the views but also the hallucination of details in the target view that are not visible in the reference images due to occlusions or a limited field of view.
Early NVS methods produced target views by interpolating in ray [2] or pixel space [3]. They were followed by works that leveraged geometric constraints such as epipolar consistency [4] for depth-aware warping of the input views. These interpolation-based methods suffered from artifacts arising from occlusions and inaccurate geometry. Later works tried to patch the artifacts by propagating depth values to similar pixels [5] or by soft 3D reconstruction [6]. However, these approaches cannot leverage depth information to refine the synthesized images or deal with the unavoidable issues of temporal inconsistency. Recently, Neural Radiance Fields (NeRF) significantly impacted NVS research by implicitly representing the 3D structure of the scene and rendering photorealistic novel images. There are two main drawbacks of NeRF [7]: i) the requirement to train from scratch for every new scene separately and ii) slow rendering speed. Moreover, the per-scene optimization of NeRF is lengthy and requires densely captured images for each scene.

• Phong Nguyen-Ha, Lam Huynh, and Janne Heikkilä are with the Center of Machine Vision and Signal Analysis, University of Oulu, Finland (e-mail: phong.nguyen@oulu.fi; lam.huynh@oulu.fi; janne.heikkila@oulu.fi).
• Esa Rahtu is with the Faculty of Information Technology and Communication Sciences, Tampere University, Finland (e-mail: esa.rahtu@tuni.fi).
• Jiri Matas is with the Center for Machine Perception, Department of Cybernetics, Czech Technical University in Prague (e-mail: matas@fel.cvut.cz).
Recent approaches [8], [9], [10], [11], [12], [13] address the former issue by training a NeRF model that generalizes to unseen scenes. The standard strategy is to condition the NeRF renderer with features extracted from source images of nearby views. Despite the ability of these models to generalize to new scenes, the rendering speed is a bottleneck, and they cannot render novel views at an interactive rate. Chen et al. [10] decode multi-view input features into the volume densities and radiance colors of high-resolution target images using time-consuming 3D convolution and Multi-Layer Perceptron (MLP) networks. Rendering such images requires querying the model with millions of input 3D points, so it is non-trivial to render an entire novel view in a single forward pass. Recent scene-specific NeRF-based methods can render photorealistic novel images in real time and require less than an hour of training. Despite such impressive results, these methods often rely on either a differentiable explicit voxel representation [14], [15], [16] or a multi-resolution hash table [17] to store the neural scene representation. Therefore, these methods require a completely new per-scene optimization step to render novel views of unseen data.
This work addresses the above issues by proposing a novel and efficient view synthesis pipeline that renders the entire view in a single forward pass during training and testing. Inspired by recently proposed work [18], we adopt a coarse-to-fine RGB and depth rendering scheme to speed up the rendering process. Similar to MVSNeRF [10], we infer a low-resolution 3D volume from a few unstructured multi-view input images using a shallow yet efficient attention-based network. We found that synthesizing a low-resolution novel view using NeRF is fast and efficient due to the reduced number of sampled 3D points. Moreover, the volume rendering of NeRF provides low-resolution radiance features and depth maps at the novel viewpoints. Instead of using the time-consuming coarse-to-fine rendering approach of [7], [8], we use the inferred depth maps to produce near-depth features of the target viewpoint and then fuse them with the radiance features as inputs to a convolutional-based neural renderer. We then train both networks to estimate high-resolution target images from low-resolution radiance features. Rendering the entire novel view also allows us to use a perceptual loss [19] or adversarial training [20], enhancing the overall quality of the generated images. We also include a regularization loss to ensure the predicted final images are consistent with the coarse novel views estimated by the coarse radiance field predictor.

Fig. 1. CG-NeRF is an efficient sparse view synthesis network that estimates coarse neural radiance fields of the target viewpoints at a low resolution and then renders the entire novel images efficiently via a convolutional-based neural renderer. Previous works [10], [21] do not possess such a renderer but rely on a deep fully connected network to estimate the high-resolution novel images pixel by pixel. Therefore, multiple forward passes are required to render the pixels of novel views, and this process often takes minutes to finish. In contrast, CG-NeRF renders the entire novel view more efficiently in a single forward pass and also requires less fine-tuning time than previous works to achieve state-of-the-art results.
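The speed-up from predicting radiance fields at a quarter resolution can be illustrated with a back-of-the-envelope query count. This is only a sketch: H, W, and K are taken from the implementation details later in the paper, and the 64 + 128 samples per pixel are NeRF's default hierarchical sampling budget.

```python
# Approximate number of MLP queries per rendered frame.
H, W, K = 512, 640, 64                     # target resolution and depth planes

nerf_queries = H * W * (64 + 128)          # NeRF: per-pixel coarse + fine samples
cgnerf_queries = (H // 4) * (W // 4) * K   # coarse radiance fields at H/4 x W/4

print(nerf_queries // cgnerf_queries)      # -> 48 (roughly 48x fewer MLP queries)
```

The remaining work of producing the full-resolution image is then handled by the convolutional renderer, which amortizes computation across pixels instead of querying an MLP per sample.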
Our trained CG-NeRF model renders plausible results for target poses close to the input viewpoints. However, the performance degrades when we extrapolate the targets further from the nearby source views. Previous works [10], [14], [15], [16] learn a hybrid implicit-explicit representation of radiance fields using a denser set of input images that cover more views of a single scene. Using a similar approach, fine-tuning the pre-trained CG-NeRF model for 10-15 minutes produces state-of-the-art results compared to those of scene-specific approaches [13], [16]. Neither the pre-trained nor the fine-tuned CG-NeRF model requires an explicit data structure; both rely on a few selected reference views closest to the targets. As can be seen in Fig. 1, we also observe clear improvements in visual quality between the novel views generated by the CG-NeRF models and the other view synthesis methods [8], [10], [21]. Note that our method does not rely on depth supervision [18] to improve the quality of the synthesized images.
CG-NeRF shows strong generalizability, rendering realistic images at novel viewpoints via a lightweight view synthesis network. When briefly further optimized with additional images, CG-NeRF outperforms both recently proposed generalizable view synthesis methods [10], [13], [21] and per-scene optimized models [7], [16]. The main contributions of this work are:
• An efficient sparse view synthesis network that employs a coarse radiance field predictor and a neural renderer to predict novel images approximately two orders of magnitude faster than NeRF [7] and its variants [10], [13], [21].
• A scene-specific model that requires only 10-15 minutes of fine-tuning of the pre-trained model using more images. In addition, CG-NeRF does not require additional depth supervision.
We will publicly release the source code and neural network models upon publication of the paper.

RELATED WORKS
In the following, we discuss different generalizable view synthesis methods that use a set of sparse input views. We refer the reader to [25], [26] for a more extensive review.

Novel view synthesis. Early works based on deep learning often use a Plane Sweep Volume (PSV) [27]. Each input image is projected onto successive virtual planes of the target camera to form a PSV. Kalantari et al. [28] calculate the per-plane mean and standard deviation of the PSV to estimate the disparity map and render the target view. Extreme View Synthesis (EVS) [29] builds upon DeepMVS [30] to estimate a depth probability volume for each input view that is then warped and fused into the target view. A similar coarse-to-fine scheme has been proposed by Nguyen et al. [18], but the method relies on depth supervision for view synthesis. Rather than estimating the depth maps of the source images, we train CG-NeRF to predict the depth map at the target view via volumetric rendering. The inferred depth is then used to produce high-resolution appearance features, which are later rendered as novel views.

Multi-layered representation. A significant number of works [31], [32], [33], [34] on view synthesis represent the 3D scene by Multiple Plane Images (MPIs). Each MPI consists of multiple RGB-α planes, where each plane corresponds to a certain depth. The target view is generated by alpha composition [35] in back-to-front order. Zhou et al. [31] introduce a deep convolutional neural network to predict MPIs that reconstruct the target views for the stereo magnification task. Local Light Field Fusion (LLFF) [23] introduces a practical high-fidelity view synthesis model that blends neighboring MPIs into the target view. The input to MPI-based methods is also PSVs; however, those PSVs are constructed for a fixed range of depth values. The proposed CG-NeRF leverages coarse geometry and gathers near-surface features to enrich fine PSVs.

Voxel grid. Grid-based representations are similar to the MPI representation but are based on a dense uniform grid of voxels. This representation has been used as the basis for neural rendering techniques to model object appearance. Neural Volumes [36] is an approach for learning dynamic volumetric representations from multi-view data. The main limitation of grid-based methods is their cubic memory footprint: the sparser the scene, the more voxels are empty, which wastes model capacity and limits output resolution. A recent work by Sun et al. [16] represents a 3D scene using low-resolution density and feature voxel grids for scene geometry and appearance. This method is fast to fit and produces high-quality novel views comparable with our fine-tuned CG-NeRF model on a single scene. Instead of optimizing such a voxel-grid representation [15], [16], [36], we propose a memory-efficient architecture that encodes multi-view input features into a single volume and infers the coarse geometry and appearance features of the entire target view in a single forward pass.

Point clouds. Recent works [37], [38], [39], [40] on view synthesis have also employed point-based representations to model 3D scene appearance. A drawback of point-based representations is that there may be holes between points after projection to screen space. Aliev et al. [38] train a neural network to learn feature vectors that describe 3D points in a scene. These learned features are then projected onto the target view and fed to a rendering network to produce the final novel image. A recent work by Xu et al. [13] proposes a point-based radiance field representation that efficiently renders novel views within 15 minutes of training for each new scene. However, this method requires ground-truth depths to train a multi-view depth estimation network. In contrast, we only leverage photometric losses between the generated and ground-truth novel views to train our model. Experimental results show that CG-NeRF produces temporally consistent depths and novel views across multiple target viewpoints without relying on 3D supervision.

Neural radiance fields. The current state-of-the-art method, Neural Radiance Fields (NeRF) by Mildenhall et al. [7], represents the plenoptic function by a multi-layer perceptron that can be queried using classical volume rendering to produce novel images. NeRF has to be evaluated at many sample points along each camera ray, which makes rendering a full image extremely slow. Despite the high quality of the synthesized novel images, NeRF also requires per-scene training. Recent volumetric approaches [8], [9], [10], [11], [12], [13] address the generalization issue of NeRF by incorporating a latent vector extracted from reference views. These methods show generalizability on selected testing scenes, but they share the slow rendering property of NeRF [7]. There are recent approaches that address the slow rendering of NeRF by sampling a chunk of rays in a local patch and applying ConvNets for post-processing such as enhancement or super-resolution. Despite their impressive results, those methods [41], [42] focus on rendering high-resolution images via per-scene optimization. In this work, we propose a generalizable view synthesis network that speeds up volume rendering with convolutional layers to estimate the target views efficiently on a single GPU. The view synthesis results can also be enhanced by fine-tuning the obtained model on a single scene without any additional components.

PROPOSED METHOD
This section describes the architecture of CG-NeRF in detail. It consists of two modules: a coarse radiance field predictor (Section 3.1) that produces the geometry and appearance of the scene at a lower resolution, and a convolutional-based neural renderer (Section 3.2) that combines both coarse and refined features to produce the final target image at the original size. In addition, we discuss the loss functions used to train the generalizable CG-NeRF model and then fine-tune it on a single scene (Section 3.3).

Coarse radiance field predictor
Our approach to inferring coarse radiance fields is orthogonal to many recent works [8], [10], [11], [21] on generalized view synthesis. Despite impressive results, these methods cannot achieve fast view synthesis. Since each pixel is rendered independently, millions of queried 3D points must pass through the deep networks. This scheme is expensive because the number of queries is much larger than the total number of pixels rendered.
The main difference between CG-NeRF and the above methods is that we infer radiance fields at a lower resolution to reduce the number of queried inputs and speed up the rendering process. By doing so, we can also obtain the geometry and appearance features of the entire target view in a single forward pass. Our method circumvents the slow rendering of NeRF [7], which splits all queried 3D points into multiple chunks and renders each chunk as a small image patch; NeRF therefore requires multiple forward passes to render the patches of a novel view. This patch-based rendering strategy also prevents NeRF and its variants [8], [10], [11], [21] from training with GAN or perceptual losses [20], [44], since stochastic pixels are generated during training. In contrast, we can train CG-NeRF with these losses between the ground-truth images and the estimated novel views at both low and high resolution (see Section 3.3).

Feature extraction. We first describe our pipeline (see Fig. 2), which takes N input images and their poses. Each input image I_n is first fed to a Feature Pyramid Network [45] to extract F^c_n ∈ R^{H/4 × W/4 × C} and F^f_n ∈ R^{H × W × C/4}, which are coarse and fine 2D image features, respectively. Note that F^f_n has the same height H and width W as the original input, so we can later use it for coarse-to-fine synthesis. Since we do not know the scene geometry at the coarse level, we uniformly sample K virtual depth planes and leverage the coarse features of each input view to build a cost volume at the target viewpoint. The features F^c_n are warped onto the multiple hypothesis depth planes via bilinear sampling [10], [18]. The warped features are then concatenated to construct a per-view coarse volume V_n ∈ R^{H/4 × W/4 × K × C}.

Multi-view attention learning. Each volume V_n contains multiple-plane features of the target view, so a spatial reasoning architecture is required to aggregate the N volumes before the neural rendering step. Previous works [21], [46] use vanilla Transformers, which results in slow inference of the novel views due to the heavy computation of multi-head attention [47]. MVS-based methods [10], [18] opt for mean- and variance-based volumes that a 3D UNet can later process to infer a unified scene encoding volume. However, a 3D UNet is limited by its small receptive field compared to attention-based architectures. In this work, we combine the best of both approaches by using a single MobileViT block [43], a more memory-efficient variant of the Transformer. We compute the mean and variance between the N volumes V_n and concatenate them into a statistic volume, which is then passed to a single MobileViT block. This block learns long-range dependencies via multi-head attention [47] between non-overlapping patches of the N volumes. We configure the input and output channels of the MobileViT block to produce a unified volume V, which has the same spatial dimensions as V_n. By learning to attend to the extracted multi-view features, the inferred volume encodes both scene geometry and appearance, which can later be processed into volume densities and view-dependent features for view synthesis.

Coarse radiance fields. Using the unified coarse volume V, our method learns an MLP network M (see Fig.
3 (left)) to regress the volume density σ_k ∈ R^1 and appearance feature f_k ∈ R^C of a 3D point x_k ∈ R^3 with viewing direction d_k ∈ R^3. Specifically, each 3D point x_k is the intersection between a ray shot from the target camera and a virtual depth plane. We obtain the feature V(x_k) of a sampled point x_k via trilinear interpolation [48]. The coarse radiance fields are then computed as

(σ_k, f_k) = M(V(x_k), γ(x_k), γ(d_k)),

where γ is the positional encoding function [7]. In this work, we estimate the per-point features f_k, which can later be used by the neural renderer (described in Section 3.2). We design the model M as a shallow MLP network so that training and inference are faster than in its NeRF counterparts. Instead of directly concatenating the positionally embedded γ(x_k) and γ(d_k) to the image features, we use two separate fully connected layers to project both embeddings into a latent space before combining them with the interpolated feature V(x_k). Adding these two extra layers does not increase the inference time but further improves the learning capacity of the model. In addition, our proposed architecture runs more efficiently than its NeRF counterparts since it inherits the massive speedup of fully-fused connected layers [17] by treating the entire network as a single GPU kernel.
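The shallow predictor M can be sketched in plain PyTorch as below. This is a stand-in for the fully-fused tiny-cuda-nn implementation; the layer widths and the positional-encoding sizes (L = 10 for positions, L = 4 for directions, NeRF's defaults) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CoarseRadianceMLP(nn.Module):
    """Sketch of the shallow MLP M: interpolated volume feature V(x_k) plus
    positionally encoded position/direction -> density sigma_k and feature f_k."""

    def __init__(self, feat_dim=64, pe_x_dim=63, pe_d_dim=27, hidden=64):
        super().__init__()
        # Two extra projection layers for the positional embeddings,
        # instead of concatenating them directly to the volume feature.
        self.proj_x = nn.Linear(pe_x_dim, hidden)
        self.proj_d = nn.Linear(pe_d_dim, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + feat_dim),           # [sigma_k | f_k]
        )

    def forward(self, v_xk, pe_x, pe_d):
        h = torch.cat([v_xk, self.proj_x(pe_x), self.proj_d(pe_d)], dim=-1)
        out = self.mlp(h)
        sigma = torch.relu(out[..., :1])               # non-negative density
        return sigma, out[..., 1:]                     # (sigma_k, f_k)
```

Projecting the embeddings through their own linear layers keeps the concatenated input narrow, which is part of why the fused-kernel version stays fast.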
We also adopt the volume rendering approach of [7] to render novel depths and features via differentiable ray marching. Specifically, we compute the per-ray feature F_r by accumulating f_k and σ_k of the K sampled 3D points along a ray r as

F_r = Σ_{k=1}^{K} τ_k (1 − exp(−σ_k Δ_k)) f_k,   τ_k = exp(−Σ_{j=1}^{k−1} σ_j Δ_j),

where τ_k is the accumulated volume transmittance from the ray origin to the point x_k and Δ_k is the distance between adjacent sampling points. Gathering all ray features F_r of the target camera, we obtain the coarse feature maps F ∈ R^{H/4 × W/4 × C} of the novel view. We can also obtain temporally consistent depth maps D ∈ R^{H/4 × W/4} as a by-product of the volume rendering by computing the weighted sum of the estimated densities and the depth values of x_k. Instead of learning to predict the novel depth maps at the original resolution, we up-sample the coarse depth maps D and use them to combine the high-resolution input features F^f_n via a feature fusion step.
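The accumulation above can be sketched as standard NeRF-style volume rendering applied to feature vectors instead of RGB colors (tensor shapes are our own convention for illustration):

```python
import torch

def render_ray_features(sigma, feats, deltas, depths):
    """Accumulate per-point features f_k and densities sigma_k along rays.
    sigma: (R, K), feats: (R, K, C), deltas: (R, K), depths: (R, K)."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                      # per-point opacity
    # Transmittance tau_k = exp(-sum_{j<k} sigma_j * delta_j), shifted by one.
    trans = torch.exp(-torch.cumsum(sigma * deltas, dim=-1))
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                                       # (R, K)
    F = (weights.unsqueeze(-1) * feats).sum(dim=1)                # per-ray feature
    D = (weights * depths).sum(dim=1)                             # expected depth
    return F, D
```

The same weights produce both the coarse feature map F and the depth map D, which is why the depth comes out as a by-product of rendering.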

Neural renderer
Feature fusion. Once we have a coarse estimate of the scene geometry and appearance, we use an auto-encoder network to render the final target images. It is challenging to generate high-resolution novel views from coarse features alone. Therefore, we leverage the depth-plane resampling technique of [18] to obtain near-depth features F′, which have the same spatial resolution as the target views. For each ray shot from the target camera, we sample J points around the predicted depth value of the given ray. We back-project those points into each input camera and obtain their corresponding features by bilinearly interpolating the extracted high-resolution input features F^f_n. Each pixel of F′ ∈ R^{H × W × JC/4} is a weighted sum of the multi-view warped features, where the weights are defined as the inverse depths of the points x_k at each input viewpoint. As the coarse radiance field predictor improves at predicting depth maps, so does the feature fusion. Please refer to [18] for more details.

UNet renderer. To render the novel views at high resolution, we up-sample the coarse features F to match the spatial resolution of the original targets and concatenate them with F′ before feeding them to a fully convolutional UNet neural renderer, which contains three up- and down-sampling convolutional layers. Instead of using skip connections between the encoder and decoder, we employ several residual blocks with Fast Fourier Convolutions [49], [50], [51]. These blocks have an image-wide receptive field that effectively combines the multi-scale features F and F′ to render the high-resolution novel images I^+_RGB. Without the skip connections, the UNet model is also smaller and more efficient due to the reduced number of parameters.
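The depth-guided resampling step can be sketched as follows. The camera conventions (3×3 pinhole intrinsics, 4×4 world-to-camera extrinsics) and the ±delta depth window are assumptions for illustration; the actual method samples J points per ray and fuses all source views with inverse-depth weights, whereas this sketch gathers features from a single source view.

```python
import torch
import torch.nn.functional as F_nn

def near_depth_features(feat_in, depth_up, K_src, E_src, K_tgt, E_tgt,
                        J=4, delta=0.05):
    """For each target pixel, sample J depths around the up-sampled predicted
    depth, back-project the 3D points into a source view, and gather that
    view's fine features via bilinear sampling. Returns (J*C, H, W)."""
    H, W = depth_up.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], -1).float()   # (H, W, 3)
    gathered = []
    for off in torch.linspace(-delta, delta, J):
        d = (depth_up + off).clamp(min=1e-3)
        # Un-project target pixels to 3D, then project into the source camera.
        cam = (torch.linalg.inv(K_tgt) @ pix.reshape(-1, 3).T) * d.reshape(1, -1)
        world = torch.linalg.inv(E_tgt) @ torch.cat(
            [cam, torch.ones(1, cam.shape[1])], dim=0)
        src = K_src @ (E_src @ world)[:3]
        uv = (src[:2] / src[2:3].clamp(min=1e-6)).T.reshape(H, W, 2)
        grid = 2 * uv / torch.tensor([W - 1, H - 1]) - 1           # to [-1, 1]
        gathered.append(
            F_nn.grid_sample(feat_in[None], grid[None], align_corners=True)[0])
    return torch.cat(gathered, dim=0)
```

With identical source and target cameras, the gathered features reduce to the source features themselves, which is a convenient sanity check for the projection math.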

Loss functions
We train both the coarse radiance field predictor and the auto-encoder network using a fine reconstruction loss L_fine, which is a combination of an L1 loss and a perceptual loss [19] between I^+_RGB and the ground-truth images. A 1 × 1 convolutional layer can also be added to transform the coarse features F into low-resolution RGB color images I_RGB (see Fig. 3 (right)). Outputting an intermediate coarse novel view allows us to train CG-NeRF with an L1 reconstruction loss term L_coarse between I_RGB and the down-scaled ground truths. This loss term also helps to regularize the coarse radiance field learning without relying on depth supervision [18]. To make sure that I_RGB and I^+_RGB are visually consistent with each other, we follow the dual discriminator setup of [20] and add a hinge GAN loss L_GAN [44]. Instead of discriminating over three-channel images, we perform the adversarial training on six-channel real and fake images: the fake image I_RGB is up-sampled before being concatenated with I^+_RGB, and the ground-truth image is down-sampled before being concatenated with the original image. The total photometric loss to train CG-NeRF is computed as

L_total = L_fine + λ L_coarse + L_GAN.

Fig. 4. Qualitative comparisons of view synthesis on the two testing sets of the DTU dataset [22]. Our CG-NeRF recovers texture details and geometrical structures more accurately than other methods. Within 15 minutes of fine-tuning, CG-NeRF outperforms state-of-the-art methods such as MVSNeRF [10] and IBRNet [21], which require more time to optimize using the same amount of data.
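A minimal sketch of this objective, assuming the three terms combine as L_fine + λ·L_coarse + L_GAN (the exact weighting is an assumption; λ = 0.5 follows the implementation details, and the perceptual network is left pluggable):

```python
import torch
import torch.nn.functional as F

def total_loss(I_hr, I_lr, gt_hr, d_logits_fake, lam=0.5, perceptual=None):
    """Generator-side photometric objective: fine L1 (+ optional perceptual),
    coarse L1 against down-scaled ground truth, and a hinge GAN term."""
    gt_lr = F.interpolate(gt_hr, size=I_lr.shape[-2:], mode="bilinear",
                          align_corners=False)
    l_fine = F.l1_loss(I_hr, gt_hr) + (perceptual(I_hr, gt_hr) if perceptual else 0.0)
    l_coarse = F.l1_loss(I_lr, gt_lr)        # regularizes the coarse radiance field
    l_gan = -d_logits_fake.mean()            # hinge GAN loss, generator side
    return l_fine + lam * l_coarse + l_gan

def six_channel_pair(I_lr, I_hr):
    """Fake input to the dual discriminator: up-sampled coarse prediction
    concatenated with the high-resolution prediction (6 channels)."""
    up = F.interpolate(I_lr, size=I_hr.shape[-2:], mode="bilinear",
                       align_corners=False)
    return torch.cat([up, I_hr], dim=1)
```

The real six-channel pair is built analogously from the down-sampled and original ground-truth images.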

EXPERIMENTS
In this section, we evaluate CG-NeRF and compare the generated novel views to those produced by the state-of-the-art methods.

Implementations
Model details. The models were trained with the Adam optimizer using a learning rate of 0.004 for the discriminator and 0.001 for both the coarse radiance field predictor and the neural renderer, with momentum parameters (0, 0.9). We set λ = 0.5, C = 64, K = 64, J = 4, N = 3, W = 640, H = 512, and the MobileViT block had 4 attention heads for fast inference. We used the PyTorch extension of the tiny-cuda-nn [52] library for the fully-fused connected layers. The other modules of CG-NeRF are implemented in native PyTorch. We first trained the model from scratch on multiple scenes for 16 hours using four V100 GPUs with a batch size of 16. We then perform a 10-15 minute fine-tuning on a single V100 GPU and achieve state-of-the-art results for per-scene optimization. We also fine-tuned the model on a consumer-grade RTX 2080 Ti GPU for 15 minutes and observed no significant differences in view synthesis quality.
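The optimizer setup above can be sketched as follows; the Linear modules are placeholders standing in for the actual CG-NeRF sub-networks.

```python
import torch

predictor = torch.nn.Linear(8, 8)       # stub: coarse radiance field predictor
renderer = torch.nn.Linear(8, 3)        # stub: neural renderer
discriminator = torch.nn.Linear(8, 1)   # stub: dual discriminator

# Generator side: predictor + renderer share one Adam optimizer at lr 0.001.
opt_g = torch.optim.Adam(
    list(predictor.parameters()) + list(renderer.parameters()),
    lr=1e-3, betas=(0.0, 0.9),          # momentum parameters (0, 0.9)
)
# Discriminator uses a 4x larger learning rate, as in the implementation details.
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-3, betas=(0.0, 0.9))
```

The asymmetric learning rates follow the two-time-scale update rule commonly used to stabilize GAN training.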
View selection. We follow the view selection method of [53] to choose the top 10 closest source images for each target image. We first run a standard structure-from-motion method [54] on each training scene to estimate a coarse depth map for each image in the dataset. We then project the pixels of the novel views with valid depth values into each input image using the known camera intrinsics and extrinsics. For each novel view, we select the top 10 closest input images that have the most valid projections. Since our method is bounded by GPU memory, we then randomly sample N out of the 10 closest views. At each training step, the number of input views N fed to the network is further uniformly sampled between 3 and 5.
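The ranking step can be sketched as below. The (K, E, (H, W)) per-camera convention and the function name are illustrative assumptions; the idea is simply to count how many of the target view's valid-depth 3D points project inside each candidate source image.

```python
import numpy as np

def select_source_views(pts_world, src_cams, top_k=10):
    """Rank candidate source cameras by the number of target 3D points
    (pts_world: 3 x M, world frame) that project inside their image bounds."""
    scores = []
    for K, E, (H, W) in src_cams:
        homog = np.concatenate([pts_world, np.ones((1, pts_world.shape[1]))], axis=0)
        cam = (E @ homog)[:3]                   # points in the camera frame
        proj = K @ cam
        z = np.clip(proj[2], 1e-6, None)
        u, v = proj[0] / z, proj[1] / z
        # A projection is valid if the point is in front of the camera
        # and lands inside the image.
        valid = (cam[2] > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        scores.append(int(valid.sum()))
    return np.argsort(scores)[::-1][:top_k]     # best source views first
```

During training, N of these top-ranked views are then sampled at random to fit within GPU memory.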
However, it is non-trivial to perform view extrapolation when the set of N sparse nearby views covers only a small area of the scene. We address this issue by increasing N during the fine-tuning stage. We apply the same view selection strategy to the other generalizable view synthesis baselines [10], [21] and report the comparison results in the following sections. We also discuss how increasing N affects the training performance in Table 5.

Experiments
Datasets. We train CG-NeRF on the DTU [22] dataset to learn a generalizable network. DTU is an MVS dataset consisting of more than 100 scenes scanned under 7 different lighting conditions from 49 camera positions. From the 49 camera poses, we selected 10 as targets for view synthesis and used the rest for source image selection. We evaluate the performance of our pretrained model on the testing set of the DTU dataset. To further compare CG-NeRF with state-of-the-art methods, we test it on the Synthetic-NeRF [7], Forward-Facing [23], and Tanks & Temples [24] datasets, which have different scenes and view distributions from our training set. Each scene includes 12 to 62 images, and 1/8 of these images are held out for testing.
Baselines. In this evaluation, we compare CG-NeRF to both generalizable view synthesis methods and purely per-scene optimized methods. The former approaches [8], [10], [21] predict new views with and without per-scene optimization. We use their provided code, train them on the DTU dataset [22], and then fine-tune for each testing scene for fair comparisons. Furthermore, we compare our method with recent scene-specific synthesis methods such as Instant-NGP [17], PointNeRF [13], and DVGO [16]. We use their public code to train scene-specific models and compare the generated novel views with those produced by our method both qualitatively and quantitatively. For a fair comparison, we use a system of four V100 GPUs to train and test all baselines.

Metrics. We report PSNR, SSIM, and perceptual similarity (LPIPS) [55] for CG-NeRF and other state-of-the-art methods. We summarize the quantitative and qualitative results in Table 1, Fig. 4, and Fig. 5 using samples from four different datasets [7], [22], [23], [24]. Please see the supplementary video for more qualitative results.

Testing on the seen dataset. We first evaluate CG-NeRF on the testing set of the DTU dataset. Since our models are trained on the training set of the same dataset, we observe that both the pretrained CG-NeRF and the per-scene optimized CG-NeRF† can reconstruct accurate novel views of the unseen testing scenes, as can be seen in the second and third columns of Fig. 4. Moreover, they outperform other state-of-the-art methods both quantitatively and qualitatively. The direct inference networks of IBRNet [21] and MVSNeRF [10] are not able to produce faithful textures of the windows and the reflection at the tip of the nose of the toy character. Both baselines tend to predict blurry results and fail to retrieve fine details, as can be seen in the zoomed insets of Fig. 4.
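Of the three reported metrics, PSNR can be sketched in a few lines; SSIM and LPIPS require their reference implementations (e.g. the lpips package) and are omitted here.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, a prediction that is uniformly off by 0.1 from the ground truth on a [0, 1] scale gives a PSNR of 20 dB.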
Since these methods predict each pixel of the high-resolution novel views from a low-resolution feature volume, their MLP networks have to solve two difficult tasks: image synthesis and super-resolution. Instead, we tackle these two tasks with two different networks: one to estimate coarse radiance features and another to refine them using adversarial training and a convolution-based neural renderer. As can be seen in Fig. 4, the rendering results of IBRNet* and MVSNeRF† are sharper when both models are optimized for each testing scene. Note that they both required approximately 24 minutes up to one hour to achieve sharp results, yet still produce less accurate novel views than ours, which are fine-tuned in 15 minutes for all testing scenes.

Testing on the unseen dataset. To further test the generalizability of our approach on unseen data, we conduct experiments on three synthetic and real datasets. As can be seen in the second column of Fig. 5, CG-NeRF produces plausible results for all testing novel views, which are very different from the training DTU images. Despite not seeing those testing images, the CG-NeRF model shows competitive results with a variant of MVSNeRF† that is trained for the same amount of time as ours. Given almost one hour of fine-tuning, IBRNet* improves its results significantly but is still not able to estimate as accurate novel views as ours. Optimized for a quarter of an hour, CG-NeRF† produces cleaner and more photo-realistic novel views than those produced by MVSNeRF† and IBRNet* on these unseen synthetic and real datasets. Despite being trained on the testing scenes, both baseline methods are not able to render such fine details because they only use an L2 color loss between stochastically rendered and ground-truth pixels. On the ship scene of the Synthetic-NeRF dataset, our optimized CG-NeRF† model renders the thin structure of the sky-sail, which is not visible in the novel views generated by the other methods. High-frequency details
can be seen in the generated images of our method on the real flower and trex scenes of the Forward-Facing dataset. Moreover, CG-NeRF† estimates clearer text on the door in the truck scene, the second example from the Tanks & Temples dataset. Although our pretrained CG-NeRF model is not able to retrieve such fine details, the rendering results can be vastly improved thanks to the hinge GAN loss L_GAN applied during training. Note that it is not straightforward for the other baselines to perform adversarial training due to the limited resolution of their generated images. This is not a problem for our method because we can efficiently render the entire novel view without the out-of-memory issues that are unavoidable for the other baselines.
We also compare our method with the state-of-the-art scene-specific view synthesis model PointNeRF [13]. Since this method is not designed for generalizable view synthesis, it performs worse than CG-NeRF. However, we observe a significant gain in the performance of the fine-tuned PointNeRF† model across the testing data. As can be seen in the last column of Fig. 5, PointNeRF† can render fine details, but the generated novel views are still not as accurate as ours in all testing scenes. In the challenging ficus scene of the Synthetic-NeRF dataset, our method renders the leaves sharper than PointNeRF†, which renders high-resolution novel views from a point cloud where the feature of each point is interpolated from low-resolution image features. Despite using a memory-expensive point-cloud representation, PointNeRF† is still not able to render high-quality details in the novel views, especially when zooming closely into the content of the generated images.

Rendering speed. In this section, we compare the rendering speed of the full CG-NeRF model and other view synthesis methods. In general, our method not only produces better novel views but also renders them faster than previous works. Both pixelNeRF [8] and IBRNet [21] take more than half a minute to render a single novel image because they use a time-consuming MLP-based architecture for multi-view aggregation and inherit the slow rendering of NeRF. Moreover, they also take several hours to train and still perform worse than our approach.
The point-based approach PointNeRF [13] improves the speed by rendering novel views directly from its hybrid implicit-explicit volume representation. However, the method is still slow and unable to render novel views at an interactive rate. As can be seen in Table 2, our CG-NeRF model fine-tuned for 5-15 minutes not only outperforms the state-of-the-art scene-specific fast view synthesis methods [16], [17] but also renders novel views at least 3 times faster than them. We found that rendering entire novel views using the proposed fully-fused MLP and convolution-based neural renderer is faster than sequentially rendering individual pixels using the deep MLP model of NeRF and its variants.
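To illustrate why rendering whole images at once helps, the following NumPy sketch (with hypothetical layer sizes, not our actual implementation) shows that a single batched pass through a small MLP produces exactly the same output as thousands of per-pixel calls, while avoiding the per-call overhead that dominates sequential pixel rendering:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-layer MLP weights (64 hidden units, as in our coarse predictor).
W1 = rng.standard_normal((32, 64)); b1 = rng.standard_normal(64)
W2 = rng.standard_normal((64, 3));  b2 = rng.standard_normal(3)

def mlp(x):
    """Map per-point features to RGB with a small ReLU MLP."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

feats = rng.standard_normal((4096, 32))  # features for 4096 pixels

# NeRF-style: many small per-pixel calls.
per_pixel = np.stack([mlp(feats[i]) for i in range(feats.shape[0])])

# Ours: a single batched pass over the whole image.
batched = mlp(feats)

assert np.allclose(per_pixel, batched)  # identical output, one call instead of 4096
```

The outputs match element-wise; the speedup in practice comes from replacing many kernel launches with one large, well-parallelized matrix multiply.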

Architecture design
Table 4 and Fig. 6 summarize the quantitative and qualitative results of CG-NeRF with different architectural choices using the test set of the Forward-Facing dataset [23]. We first define a "Random rays" variant of CG-NeRF that estimates stochastically sampled pixels [10], [21] of the high-resolution novel views during training. Independently rendering each pixel leads to visible artifacts and blurriness in the predicted novel views. The rendering results are better if we estimate the entire novel views using a convolution-based neural renderer. However, this model still does not produce plausible target views, as they contain incorrect geometry and poorly rendered specular areas. By regularizing our model with a coarse reconstruction loss L coarse , we address the above issues and observe vastly improved novel views. Finally, we found that adding a hinge GAN loss L GAN and the dual discriminator of [20] helps us achieve state-of-the-art results, as can be seen in the last column of Fig. 6. We provide more comparison results in the supplementary videos.
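For reference, the hinge GAN loss follows the standard formulation; a minimal NumPy sketch with toy logits (not tied to our actual discriminator architecture) is:

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: push real logits above +1 and fake logits below -1."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))

def g_hinge_loss(d_fake):
    """Generator hinge loss: raise the discriminator's score on generated images."""
    return -np.mean(d_fake)

# Toy logits: a perfectly separated batch yields zero discriminator loss.
d_real = np.array([2.0, 1.5, 3.0])
d_fake = np.array([-2.0, -1.2, -4.0])
print(d_hinge_loss(d_real, d_fake))  # 0.0
print(g_hinge_loss(d_fake))          # 2.4
```

In the full model this term is combined with the coarse reconstruction loss L coarse, and the dual discriminator of [20] scores both the low- and high-resolution outputs.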

Rendering large 4K novel views
Since we use a convolution-based neural renderer to obtain the high-resolution images, CG-NeRF can accept very large 4K input images of the Forward-Facing dataset [23] and generate new views. In Table 3, we compare our approach with other variants of NeRF at both 800x800 and 4K resolution. We first conduct an experiment to test whether our neural renderer can improve existing view synthesis methods such as PointNeRF [13] and Instant-NGP [17] at the standard 800x800 test resolution. Methods with the ‡ symbol were trained to produce smaller novel views that are later up-sampled to the original size using our neural renderer. Given the same training time, both baseline methods produce almost identical novel views, but the rendering time is significantly improved. This further highlights the usefulness of our neural renderer for recently proposed methods, as it can be easily plugged into existing systems. We also found that directly optimizing both PointNeRF [13] and Instant-NGP [17] on 4K images requires more than 20GB of GPU memory and a longer training time to obtain good synthesis results. Therefore, we apply the same up-scaling strategy to reduce their memory footprint. In contrast, our method only requires approximately 5GB of GPU memory to synthesize high-quality 4K novel views. Instead of training the CG-NeRF ‡ model from scratch, we fine-tune our generalized CG-NeRF model to output 4K images. Experimental results in Table 3 show that optimizing a convolution-based neural renderer improves the synthesis quality of the novel views at both the standard 800x800 and very high 4K resolutions. Although it takes 2.5 seconds to render a single 4K novel image, our method is still much faster than the other baselines.

Spatial-temporal consistency
As shown in the supplementary material, both our generalized and fine-tuned CG-NeRF models render photo-realistic, high-quality novel views with multi-view consistency using the learned encoder-decoder structure. Similar results have been observed in recent 3D generative NeRF-based methods [20], [56], which produce high-resolution, 3D-consistent novel views from low-resolution 2D feature maps. In this work, we follow the design of the recently proposed EG3D [20], which uses a dual discriminator to enforce consistent results between the high- and low-resolution outputs. The up-sampling neural renderer of [20] is similar to our proposed encoder-decoder network, but we condition the synthesis process on a set of sparse input views.
As can be seen in Fig. 2, we also use aggregated warped features as input to the neural renderer. These warped features are consistent with the coarse radiance field features since we leverage the predicted coarse depth maps to perform feature fusion, as described in Section 3.2. Therefore, we add a regularization loss L coarse to train the coarse radiance fields predictor. We found that using this simple yet effective loss function not only boosts the quality of the low- and high-resolution novel views but also improves the temporal consistency between consecutive novel views. Without L coarse , the predicted novel views contain significant artifacts near the boundaries and are also not temporally consistent due to the independent rendering at each novel viewpoint by the 2D U-Net renderer (see Fig. 7). By forcing the network to estimate accurate down-sampled novel views, our method learns to produce consistent features at the higher resolution. We also tried the improved designs of [56] in our pipeline and observed similar results.
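The depth-guided warping that underlies this feature fusion can be sketched as follows. This is a simplified NumPy illustration with nearest-neighbor sampling and hypothetical camera parameters, not our actual (bilinear, GPU-batched) implementation:

```python
import numpy as np

def warp_features(src_feat, depth_tgt, K, R, t):
    """Warp source-view features into the target view using the predicted target depth.

    src_feat:  (H, W, C) feature map from a source view.
    depth_tgt: (H, W) coarse depth map predicted for the target view.
    K: (3, 3) shared intrinsics; R, t: source-camera pose relative to the target.
    Nearest-neighbor gather; pixels projecting out of bounds receive zeros.
    """
    H, W, C = src_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)  # (H, W, 3)
    # Back-project each target pixel to 3D, then project into the source camera.
    pts = (np.linalg.inv(K) @ pix.reshape(-1, 3).T) * depth_tgt.reshape(1, -1)
    proj = K @ (R @ pts + t[:, None])
    u = np.round(proj[0] / proj[2]).astype(int).reshape(H, W)
    v = np.round(proj[1] / proj[2]).astype(int).reshape(H, W)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros_like(src_feat)
    out[valid] = src_feat[v[valid], u[valid]]
    return out

# Sanity check: an identity pose with unit depth must return the source features.
K, R, t = np.eye(3), np.eye(3), np.zeros(3)
feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
assert np.allclose(warp_features(feat, np.ones((2, 3)), K, R, t), feat)
```

Because the warp is driven by the predicted coarse depth, inaccuracies in the depth maps directly produce misaligned features, which is why regularizing the coarse predictor with L coarse also improves the fused inputs to the renderer.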

Number of input views
In Table 5, we evaluate the performance of our method with an increasing number of source images using the Tanks & Temples [24] dataset. We report both SSIM and LPIPS metrics with up to 10 source images. We observe that CG-NeRF performs best with 7 input views, after which the results get worse. When reference and target poses are far from each other, inaccurately regressed depth maps lead to less accurate novel views. Therefore, having views close to the target view with fewer self-occlusions is essential for synthesizing novel views. If it is hard to gather views around the target view, adding more views whose viewing frustums overlap with the target view is also beneficial.

CONCLUSION
We presented CG-NeRF, a new method that addresses the challenging problem of novel view synthesis from a sparse and unstructured set of input images. Thanks to its coarse neural radiance fields predictor and convolution-based neural renderer, CG-NeRF can produce all pixels of the target view without relying on any additional explicit data structure. Moreover, it enables highly efficient per-scene optimization that takes only 10-15 minutes, leading to rendering quality comparable to, and even surpassing, recent state-of-the-art methods that require several hours of training.

Fig. 2 .
Fig. 2. Our proposed CG-NeRF comprises several parts: (i) a memory-efficient MobileViT architecture [43] that fuses multiple low-resolution plane-sweep volumes of the target viewpoint into a single unified volume V, (ii) a coarse radiance fields predictor that estimates the target depth and features at low resolution, and (iii) an auto-encoder network that renders novel views at the original resolution. Our method is lightweight and infers novel views fast.

Fig. 3 .
Fig. 3. Our proposed MLP model (left) uses several fully connected (FC) layers with 64 neurons each to estimate the coarse radiance fields of each sampled 3D point x k . We also remove the skip connections of the neural renderer (right) and add nine residual blocks that utilize Fast Fourier Convolution (FFC) [49] layers for generating the fine prediction I + RGB of the novel views. A 1x1 convolution layer can also be added to estimate a coarse prediction I RGB and regularize the training process. Finally, we use the dual discriminator of [20] to make sure that I RGB and I + RGB are visually consistent with each other.

Fig. 5 .
Fig. 5. Qualitative comparisons between CG-NeRF and state-of-the-art methods on unseen data. Each row shows estimated novel views on three different datasets: Synthetic-NeRF [7], Forward-Facing [23], and Tanks & Temples [24]. Without any fine-tuning, our pre-trained model produces plausible results on unseen data, and the results are vastly improved after optimizing on a single scene. In general, our method produces more accurate textures and thin-structure objects than the other baselines.

Fig. 6 .
Fig. 6. Qualitative ablation study. Comparison of the ground truth with novel views predicted by CG-NeRF with random- and coarse-ray training, with the coarse reconstruction loss Lcoarse, and with the full model with the adversarial loss L GAN enabled. The full model not only predicts novel views more accurately than the other baselines but also renders them efficiently. Without L GAN , we observe less realistic novel images compared to the ground truths.

Fig. 7 .
Fig. 7.A generated sequence of consecutive novel views produced by CG-NeRF with and without the coarse reconstruction loss Lcoarse from the Room and Fern scenes of the LLFF [23] dataset.

TABLE 1
Quantitative comparison on large-scale datasets of synthetic and real images. Methods with the † and * symbols are optimized per scene for 15 and 60 minutes, respectively.

TABLE 3
Quantitative comparisons of rendering 800x800 and 4K novel images between CG-NeRF and other view synthesis methods on the Forward-Facing [23] dataset. Methods with ‡ are optimized with the proposed neural renderer.

TABLE 4
CG-NeRF architecture ablation study. Reconstruction accuracy of view synthesis on the Forward-Facing scenes [23].

TABLE 5
The impact of the number of reference images, measured in terms of reconstruction accuracy on the Tanks & Temples [24] dataset.