Multi-NeuS: 3D Head Portraits from Single Image with Neural Implicit Functions

We present an approach for the reconstruction of textured 3D meshes of human heads from one or few views. Since such few-shot reconstruction is underconstrained, it requires prior knowledge, which is hard to impose on traditional 3D reconstruction algorithms. In this work, we rely on a recently introduced 3D representation, neural implicit functions, which, being based on neural networks, can naturally learn priors about human heads from data and are directly convertible to a textured mesh. Namely, we extend NeuS, a state-of-the-art neural implicit function formulation, to represent multiple objects of a class (human heads in our case) simultaneously. The underlying neural net architecture is designed to learn the commonalities among these objects and to generalize to unseen ones. Our model is trained on just a hundred smartphone videos and does not require any scanned 3D data. Afterwards, the model can fit novel heads in the few-shot or one-shot modes with good results.


I. INTRODUCTION
We consider the task of 3D portraiture, i.e. automatic acquisition of 3D models of human heads that capture both the geometry and the texture. This automation avoids the costly and time-consuming manual creation of such models. While there are a number of approaches to modeling 2D head appearance [1]-[3], here we consider 3D head modeling, an important task that finds applications in the filmmaking, AR, VR, XR, and gaming industries. While a number of learning-based methods for this task have been suggested [4], [5], most of them require 3D scans or synthetic data for learning. Here, we propose an alternative approach that learns to model human head shape and appearance directly from a collection of RGB videos.
Our approach is based on a recent class of methods that use implicit representations for shape and appearance, such as the recently introduced NeuS method [6] and closely related approaches introduced in parallel with NeuS [7]-[9]. We introduce a new and simple way to fit such models to individual videos while sharing a subset of parameters, resulting in the approach that we call Multi-NeuS.
We show that sharing the parameters across training videos facilitates knowledge transfer to new individuals unseen during training. As a result, Multi-NeuS achieves noteworthy data efficiency (it can learn a generic human head model from the videos of as few as 103 individuals) and is fast to train (it takes only 24 hours on a single V100 GPU). After training, Multi-NeuS can create convincing textured 3D head meshes from as little as a single photograph (Figure 1). Within this work, we investigate two parameter sharing patterns and sharing-related regularizations that can be used within Multi-NeuS: (1) sharing the parameters of layer subsets, and (2) a more sophisticated low-rank regularization on non-shared parameters. We assess the effect of the sharing setting on the quality of the results.
Overall, in the experiments we show that our system is capable of creating high-quality 3D portraits from few photographs, and reasonably good portraits from single in-the-wild photographs. More generally, our approach proposes a new way to do transfer learning within implicit shape and appearance modeling frameworks, and we hope that our findings will boost future meta-learning research involving implicit functions. To sum up, our contributions are:
• We introduce a new type of 3D neural implicit architecture that can efficiently fit to many objects of the same class simultaneously and recover their surfaces, given sets of multi-view photos.
• We devise a meta-learning pipeline for the above model that enables it to reconstruct the textured 3D surface of an unseen object from one or few images.
• We demonstrate that our system can be applied to single-view reconstruction of 3D full head portraits, producing convincing 3D meshes from in-the-wild images after being trained on just a hundred short smartphone videos.

II. RELATED WORK
A. NEURAL IMPLICIT 3D RECONSTRUCTION
Neural implicit functions have attracted a lot of attention recently, notably as a flexible approach to represent 3D scenes with neural networks. Contrary to traditional explicit 3D representations such as meshes, they are not limited to a fixed resolution or topology, and, most importantly to us, can naturally employ the power of modern neural network methods. Neural radiance fields (NeRF) [10] and its extensions (e.g. [11]) model density and emitted radiance with neural nets which are trained via backpropagating through volumetric ray casting. Although NeRF achieves impressive results in novel view synthesis, it is not designed for reconstructing geometry: meshes directly obtained from NeRF density functions are often full of artifacts [7].
A more "geometry-friendly" implicit approach is to model the object surface as a zero-level set of an implicitly defined occupancy [12] or signed distance function (SDF) [13], [14], which goes back to the classical works on level set reconstruction [15]. The isosurface can then be easily converted to a mesh via marching cubes [16]. To train such models without 3D supervision, several authors have done inverse rendering by modeling color or radiance similar to NeRF and then applying some kind of ray marching [6]-[9], [17], [18]. From these single-scene multi-view methods, we pick NeuS [6] as the base of our multi-scene few-view method due to its simplicity and code availability. We revisit NeuS in more detail in Section III-A.

B. META-LEARNING NEURAL IMPLICITS
The meta-learning paradigm addresses (among other things) the few-shot problem, in which a network must achieve good performance given only a few training examples. The most common approaches are optimization-based [19], [20] and learn the best weight initialization. For a deeper review of meta-learning, we refer the reader to [21]. Regarding the application of meta-learning to neural implicits, MetaSDF [22] exploits this idea to learn the initialization of the SDF network, while the work [23] applies meta-learning to a wider variety of signal types. Our work concentrates on human head representation and uses network layers shared across different tasks (i.e. different people's identities).

C. FEW- OR SINGLE-VIEW HEAD RECONSTRUCTION
Historically, directly fitting statistical 3D Morphable Models (3DMMs) to an image has been a popular method to recover the 3D head shape [24]-[27], but 3DMMs are limited to coarse shape estimation, requiring separate steps for reconstructing e.g. wrinkles [25] or hair [28]. In addition, 3DMMs are constructed from 3D scans, which might be hard to obtain for many classes. Other more descriptive and flexible 3D representations include depth maps [29]-[31], regular meshes [32], [33], and volumetric grids [34], although many of these approaches still rely on 3DMMs in their intermediate steps.
Two rare examples of completely model-free methods that also reconstruct hair [31], [32] are self-supervised GANs [35] that learn from unlabeled collections of images. However, actual fitting to unseen images (e.g. GAN inversion) was not demonstrated. More information on face/head reconstruction before the advent of 3D neural implicit methods is available in recent comprehensive surveys [36], [37].
Recently, several works have successfully applied neural implicit representations to the head reconstruction task, but most of them either do not reconstruct geometry directly (e.g. because of the ill-suited NeRF representation) or require complex datasets. Portrait-NeRF [38] is an early attempt at meta-learning a single-view NeRF; only slight viewpoint changes have been demonstrated for this method. The i3DMM method [39] introduced the first 3DMM to include hair. This method is based on SDFs and is constructed from about 2000 3D scans of 64 people. H3D-Net [4] meta-learns a high-quality SDF representation of full static heads and supports reconstruction from as few as three posed images (though feeding just one image is also possible). The method is trained on a private dataset of 10,000 structured-light 3D scans. HeadNeRF [40] yields controllable NeRF portraits conditioned on latent 3DMM vectors (identity, expression, albedo, and illumination). It is a fully supervised approach, and the authors were able to train it in reasonable time thanks to a strategy that improves rendering performance [41]. The authors of EG3D [42] went even further and trained a StyleGAN2 [35] to yield volume-renderable 3D heads with very little supervision (a similar idea was proposed simultaneously in VolumeGAN [43]). Like HeadNeRF, EG3D can fit an arbitrary head photo by optimizing the latent vector(s). Moreover, their paper demonstrates extracting meshes of convincing quality. Still, this method is computationally 80× more expensive to train than Multi-NeuS, and it does not reconstruct parts of the head that are further from the face due to the lack of dedicated background modeling.

III. METHOD
A. RECAP: NEUS RECONSTRUCTION
As our method builds upon NeuS [6], we start with a review of this method. NeuS is a modification of NeRF [10] for non-transparent objects. It models the object surface directly, thus allowing 3D surface reconstruction from images using differentiable neural rendering. Specifically, the object surface in NeuS is represented as the zero-level set of a signed distance function, $\{x \in \mathbb{R}^3 \mid \mathrm{SDF}(x) = 0\}$, where SDF is defined as the signed distance to the object surface and is modeled by a neural network. In addition, RGB radiance at any 3D point is modeled by another neural net, and density is modeled as a bell-shaped function of SDF that attains its maximum at zero, i.e. at the object surface. More specifically (see Figure 2), the SDF network is a simple multi-layer perceptron (MLP) with 8 hidden layers of 256 neurons and softplus activations (β = 100), and the radiance network is an MLP with 4 hidden layers of 256 neurons and ReLU activations. The former network takes a 3D coordinate and outputs an SDF value and a latent vector. Meanwhile, the latter network takes this latent vector, the 3D coordinate, the camera view direction, and the gradient of the SDF, and outputs the RGB radiance value. Positional encodings [10] are applied to 3D coordinates (6 dimensions) and view directions (4 dimensions).
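As a minimal sketch of the bell-shaped density described above, the snippet below uses the logistic density, i.e. the derivative of a sigmoid with sharpness `s`, which is the form given in the NeuS paper [6]. The function names and the sample value of `s` are illustrative assumptions and are not taken from the authors' code.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_density(sdf: float, s: float) -> float:
    """Bell-shaped density: derivative of sigmoid(s * sdf) with respect to sdf."""
    phi = sigmoid(s * sdf)
    return s * phi * (1.0 - phi)


# Illustrative sharpness (assumption); a larger s gives a narrower bell.
s = 10.0
peak = logistic_density(0.0, s)         # value on the surface, where SDF = 0
off_surface = logistic_density(0.3, s)  # value 0.3 units away from the surface
```

The density is symmetric in the SDF value and peaks at `s / 4` on the surface, so increasing `s` concentrates the density around the zero-level set.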
The radiance and density of points sampled along the rays corresponding to pixels of input images are used to run differentiable volume rendering [10] that integrates the samples along the ray and outputs its RGB color. The optimization algorithm forces the RGB results of ray integration to be similar to the corresponding known pixel intensities by progressively tuning the weights of the neural networks. The loss function to optimize is a simple pixelwise mean squared error combined with an eikonal regularization term that encourages ∥∇SDF(x)∥ = 1. After convergence, it is possible to obtain the object mesh via marching cubes [16] over SDF(x), as well as to synthesize novel views by volume rendering or any ray marching algorithm, such as sphere tracing.
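The rendering and loss described above can be sketched for a single ray as follows. The discrete opacity follows the formula from the NeuS paper [6]; everything else (function names, the toy ray, the sharpness `s = 50`, and the eikonal weight `0.1`) is an illustrative assumption rather than the authors' implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def render_ray(sdf_values, colors, s, eps=1e-12):
    """Composite colors along one ray.

    sdf_values: SDF at n + 1 consecutive ray points, ordered near to far.
    colors: n RGB tuples, one per section between consecutive points.
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for i, color in enumerate(colors):
        phi_near = sigmoid(s * sdf_values[i])
        phi_far = sigmoid(s * sdf_values[i + 1])
        # Discrete opacity of the section, clamped to be non-negative.
        alpha = max((phi_near - phi_far) / (phi_near + eps), 0.0)
        weight = transmittance * alpha
        weights.append(weight)
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, weights


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def eikonal_penalty(gradients):
    """Mean squared deviation of SDF gradient norms from 1."""
    return sum((math.hypot(*g) - 1.0) ** 2 for g in gradients) / len(gradients)


# Toy ray that crosses the surface halfway (SDF goes from +0.5 to -0.5).
sdf_values = [0.5 - 0.1 * i for i in range(11)]
colors = [(1.0, 0.0, 0.0)] * 10
rgb, weights = render_ray(sdf_values, colors, s=50.0)

# Total loss for this ray; 0.1 is an illustrative eikonal weight.
toy_gradients = [(1.0, 0.0, 0.0), (0.0, 0.6, 0.8)]
loss = mse(rgb, (1.0, 0.0, 0.0)) + 0.1 * eikonal_penalty(toy_gradients)
```

In this toy case almost all of the rendering weight falls on the two sections adjacent to the zero crossing, which is the behavior that lets the rendered color be attributed to the surface.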
The multi-view captures may include distant background, which is difficult to represent by the above neural nets. Therefore, the object of interest is considered to be within a unit sphere, and everything outside of that sphere is modeled by a separate dedicated NeRF with a special parametrization of coordinates [44]. To optimize this NeRF along with NeuS, extra ray points are sampled outside of the unit sphere. A sufficiently large dataset lets this tandem disentangle the background from the central object automatically, without mask supervision.
NeuS achieves excellent results when applied to sets containing dozens of images. Our goal is to create a NeuS-based system that can perform reconstruction given a single image or very few images. This scenario is too under-constrained for the original NeuS and will result in poor convergence. To alleviate this, we narrow down the class of potential scenes to human heads and pre-train our model on a dataset of multiple people, while facilitating knowledge transfer to unseen people as discussed below.

B. MULTI-NEUS
Our solution, called Multi-NeuS, is depicted in Figure 2. We upgrade NeuS so that it can fit to N scenes simultaneously. Our high-level idea is simple. We create N copies of scene-specific NeuS instances that share some of the layers, while keeping other layers unshared (scene-specific). We then fit these N instances to the scenes simultaneously, while optionally imposing additional structural regularization on scene-specific layers.
Naturally, we expect that during such fitting shared layers will tend to model features useful to represent any object, while scene-specific layers combine, refine and augment the output of shared layers to model a specific object. For instance, a shared layer might model rough basic human head shapes, while the following (scene-specific) layer may learn the weights with which to combine those shapes, like in linear blend skinning models.
We experiment with two architectures for scene-specific layers that are described below in Section III-C. As shown in Figure 2, we use scene-specific layers in the first halves of the SDF network and the radiance network, while sharing all other layers. This choice is evaluated in Section IV-D.
Unlike NeuS, Multi-NeuS learns N independent scene-specific instances of background NeRFs. Also, we do not model view-dependent effects in our architecture, effectively assuming that human heads do not produce specular reflections. We find that on our dataset (which is captured in scattered light), this does not hurt validation performance but significantly reduces overfitting in few-shot mode, especially when generalizing to new lighting.
FIGURE 2.
Architecture of Multi-NeuS, a 3D neural implicit function that can represent multiple objects of a class simultaneously (boxes depict fully connected layers and their output dimensionalities; γ is the positional encoding function). Since some layers (blue) are shared between all scenes, they can learn class priors to then transfer knowledge to novel scenes of the same class, enabling few-shot reconstruction. The model is trained via volumetric rendering and a simple pixelwise loss, just like NeuS [6], but on a dataset of multiple scenes. Afterwards, when fitting to an unseen object, scene-specific layers (yellow, Section III-C) are fitted first, and finally all layers are fine-tuned together.
FIGURE 3.
The two architectures of scene-specific layers explored in our paper, independent (a) and low-rank (b). They are fully connected layers whose weights and biases $w^{(i)}$ depend on the scene index $i$. An independent layer learns individual weights and biases for each of the N scenes, while a low-rank layer learns r copies and then linearly combines them with each scene's own learnable coefficients.

C. SCENE-SPECIFIC LAYERS
We use the scene (person) index $i \in \{1, \dots, N\}$ to enumerate scene-specific layer instances. Thus, by considering different instances within scene-specific layers, the same network architecture models every object in the dataset. In this work, we experiment with two architecture choices for scene-specific layers (Figure 3), which we term independent and low-rank. They are described below.

Independent layers (Figure 3, a). This is a straightforward implementation where the scene-specific layer has a dedicated set of weights and biases $w_1, \dots, w_N$ for each scene. This architecture has large representational power but has significant drawbacks. First, during meta-learning, each $w_i$ receives infrequent updates. If a training minibatch includes pixels from only a few ($m \ll N$) scenes, then the sub-layers corresponding to all other scenes do not receive any weight updates. Alternatively, a minibatch can be composed of random pixels from the entire dataset ($m \approx N$). In this case, however, the gradients of $w_i$ become too noisy, coming from just a few ($\approx \frac{N}{m}$) pixels, again leading to slow or poor convergence. In practice, batching together pixels from many scenes is inefficient, as it requires running $m \approx N$ layers in each forward pass, so in our experiments we set $m = 1$, i.e. we sample all pixels of a minibatch from just one scene. We therefore use the Adam [45] optimizer but update moment statistics for a scene-specific layer only when the corresponding scene participates in the forward pass (known as "sparse/lazy Adam").

Another related problem with independent layers is overfitting due to the excessive number of parameters. This often leads to poor generalization to new subjects. Our second architecture below is designed to alleviate this by a built-in regularization.

Low-rank layers (Figure 3, b). In this scheme, the scene-specific layer's weights and biases $w^{(i)} \in \mathbb{R}^p$ are not learnt directly. Instead, they are computed as a linear combination of $r$ basis vectors $b_1, \dots, b_r$:

$$w^{(i)} = \sum_{j=1}^{r} c_{ij} b_j,$$

where $r$ is the layer's rank. We learn both the basis vectors $b_j \in \mathbb{R}^p$ and the linear combination coefficients $c_{ij} \in \mathbb{R}$, where $i \in \{1, \dots, N\}$ and $j \in \{1, \dots, r\}$. Thus, each scene-specific layer learns a single set of $r$ basis vectors for the entire training dataset containing multiple scenes, and these vectors are recombined with different weights to model different scenes; a separate set of $r$ coefficients is therefore learned for each of the $N$ scenes. Such low-rank factorization decreases the number of parameters significantly (by a factor of several hundred in our experiments), reducing overfitting.
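To make the low-rank scheme concrete, here is a minimal sketch of how a scene's flattened weights could be assembled from a shared basis and that scene's coefficients, together with a count of the parameters each scene owns. The function names, the toy numbers, and the 256-to-256 layer with rank 150 are hypothetical choices for illustration only.

```python
def scene_weights(basis, coeffs_for_scene):
    """Return w = sum_j c_j * b_j for one scene (all vectors are flat lists)."""
    num_params = len(basis[0])
    return [
        sum(c * b[k] for c, b in zip(coeffs_for_scene, basis))
        for k in range(num_params)
    ]


def per_scene_parameter_counts(in_dim, out_dim, rank):
    """Learnable parameters owned by a single scene in each layer type."""
    independent = in_dim * out_dim + out_dim  # full weights and biases
    low_rank = rank                           # only the mixing coefficients
    return independent, low_rank


# Tiny example: r = 2 basis vectors with p = 3 parameters each.
basis = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
coeffs = {0: [0.5, 2.0], 1: [1.0, 0.0]}  # one coefficient row per scene
w0 = scene_weights(basis, coeffs[0])
w1 = scene_weights(basis, coeffs[1])

# Hypothetical 256 -> 256 layer with rank 150.
independent, low_rank = per_scene_parameter_counts(256, 256, 150)
```

The basis itself (r vectors of p parameters) is shared by all scenes, so it is not counted among the per-scene parameters here; only the r coefficients are specific to a scene.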

D. TRAINING
Multi-NeuS is applied in two stages: meta-learning and fitting (see Figure 4).
In the initial meta-learning stage, we pre-train the whole architecture using the same volumetric rendering procedure as in NeuS (Section III-A) but on a dataset of multi-view RGB images of N scenes rather than a single scene. At every optimization step, a minibatch of camera rays (or, equivalently, image pixels) is sampled uniformly from eight random images of one random scene. Eventually, Multi-NeuS estimates the 3D shape and texture of every scene (subject) in the dataset.
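The sampling rule above (one random scene, eight random images from it, rays drawn from those images) can be sketched as follows. The dataset layout and function name are hypothetical; the batch size of 512 rays is the value reported in Section IV-E. The sketch assumes that all images of a scene have the same resolution, so that choosing an image uniformly and then a pixel uniformly is uniform over the pooled pixels.

```python
import random


def sample_minibatch(dataset, rng, num_images=8, num_rays=512):
    """Sample rays from a few random images of one random scene.

    dataset maps a scene id to a list of (image id, height, width) tuples.
    Returns the scene id and a list of (image id, row, column) rays.
    """
    scene_id = rng.choice(sorted(dataset))
    scene_images = dataset[scene_id]
    images = rng.sample(scene_images, min(num_images, len(scene_images)))
    rays = []
    for _ in range(num_rays):
        image_id, height, width = rng.choice(images)
        rays.append((image_id, rng.randrange(height), rng.randrange(width)))
    return scene_id, rays


# Toy dataset (hypothetical sizes): 3 scenes with 20 images of 64 x 48 pixels.
dataset = {
    scene: [(f"scene{scene}_img{i}", 64, 48) for i in range(20)]
    for scene in range(3)
}
scene_id, rays = sample_minibatch(dataset, random.Random(0))
```

Because every ray in the minibatch comes from the same scene, only that scene's instance of each scene-specific layer takes part in the forward pass.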
After meta-learning, we can fit to new scenes starting from the pre-trained initialization. This fitting stage is thus conducted to estimate the 3D shape and the texture of a novel unseen object. To represent that object, we add the new (N + 1)-st scene to the model, that is, the (N + 1)-st set of scene-specific layers, initialized as described below. This time, we are given images of the new subject (can be as few as one or two), their estimated camera parameters, and their background segmentation masks.
The fitting process is performed in two steps: we first retrain the scene-specific weights and then fine-tune all weights to the new scene. The first step begins with initializing scene-specific weights. We do it by simply averaging these weights over the N scenes so that the (N + 1)-st representation in Multi-NeuS essentially represents "the average object in the dataset" as learned by the scene-specific layers. That is, for the independent layer we set $w_{N+1} = \frac{1}{N} \sum_{i=1}^{N} w_i$, and for the low-rank layer $c_{N+1} = \frac{1}{N} \sum_{i=1}^{N} c_i$. Note that in the low-rank layer the basis weights $b_1, \dots, b_r$ are not scene-specific but are in fact shared by all scenes. Therefore, we do not optimize or reset them in the first step of the fitting process. After optimizing the newly initialized scene-specific weights in the first step, in the second step we "unfreeze" the shared weights and optimize all weights while using a smaller learning rate.
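A minimal sketch of the initialization and the two-step schedule described above is given below. The averaging implements the stated formula directly; the parameter group names are hypothetical labels, and the learning rates are the values reported in Section IV-E.

```python
def average_init(per_scene_params):
    """Element-wise mean of N scene-specific parameter vectors."""
    n = len(per_scene_params)
    return [sum(column) / n for column in zip(*per_scene_params)]


def fitting_schedule():
    """Two fitting steps; the group names are hypothetical labels."""
    return [
        {"step": 1, "trainable": ["scene_specific"], "lr": 4e-4},
        {"step": 2, "trainable": ["scene_specific", "shared"], "lr": 6e-5},
    ]


# Toy example: coefficient vectors of N = 3 training scenes with r = 2.
coeffs = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
c_new = average_init(coeffs)  # initialization for the (N + 1)-st scene
schedule = fitting_schedule()
```

For low-rank layers the "shared" group would also contain the basis vectors, since they are common to all scenes and stay frozen during the first step.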
The optimization during the fitting stage is performed in the same way as in the meta-learning stage, with two notable differences. Firstly, instead of using a dedicated background NeRF [6], we explicitly estimate background masks and optimize an additional loss that forces the SDF isosurface to match these masks. The loss used in this case is the binary cross-entropy between the accumulated density over a ray and the foreground mask value (1 if object, 0 if background). This is needed since we found that background separation in the original NeuS works unsatisfactorily in the few-shot regime.
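The mask loss can be sketched as a standard binary cross-entropy over rays. Here `accumulated` is assumed to hold, for each ray, the sum of rendering weights along that ray (a value in [0, 1]); the clamping constant `eps` and the function name are illustrative assumptions.

```python
import math


def mask_bce(accumulated, mask, eps=1e-6):
    """Mean binary cross-entropy between per-ray opacity and foreground mask."""
    total = 0.0
    for a, m in zip(accumulated, mask):
        a = min(max(a, eps), 1.0 - eps)  # keep the logarithms finite
        total += -(m * math.log(a) + (1.0 - m) * math.log(1.0 - a))
    return total / len(mask)


# Two rays: the first hits the head (mask 1), the second hits background (mask 0).
uncertain = mask_bce([0.5, 0.5], [1, 0])
good = mask_bce([1.0, 0.0], [1, 0])
bad = mask_bce([0.0, 1.0], [1, 0])
```

The loss is close to zero when rays through foreground pixels are fully absorbed by the surface and rays through background pixels pass without hitting it.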
The second modification is fine-tuning the camera parameters. This is needed because camera estimates can be inaccurate, especially for in-the-wild images from the Internet. To compensate for that, we backpropagate the losses into the camera parameters and optimize them alongside the neural networks with a 10× smaller learning rate.
Please refer to Section IV-E for additional details, including the implementation details and the hyperparameters.

IV. EXPERIMENTS
A. DATASETS
Our training (meta-learning) dataset is a subset of SmartPortraits [46]. It consists of 107 short (≈ 25 seconds) smartphone videos of still people with neutral pose and facial expression. Four of these (two female and two male subjects) serve as the validation set. In each video, the distance to the head (≈ 1.5 m) and the elevation are roughly constant, while the azimuth travels within ±45°. From each video, we remove frames with flash and randomly pick about 77 frames from the rest, shrinking the entire dataset to 8256 images. We obtain camera parameters by running the COLMAP structure-from-motion software [47] on these images. Finally, these images are loosely cropped to head and shoulders using a face detector. Note that we do not use any motion or depth information in our system.
Because Multi-NeuS takes in the absolute 3D coordinates, all scenes are aligned against each other to minimize the relative difference between objects. This helps our network not to spend capacity on modeling translations and scaling, and thus to fit the training set more easily. We accomplish approximate alignment as follows. For each scene, we detect six prominent facial landmarks in images [48]. We then triangulate the 2D landmarks to get their 3D coordinates. We choose the first scene of SmartPortraits as the reference. For all other scenes we compute an optimal similarity transform T [49] that aligns two sets of points: the triangulated 3D landmarks and the reference ones. This is achieved by finding the translation, rotation, and scaling that minimize the root-mean-square deviation of the point pairs. Finally, the transform T is applied to all camera poses of the current scene. We estimate and apply such a similarity transform not only for SmartPortraits but for every scene of every dataset used in this work.
We also validate on the H3DS dataset [4], which consists of ten individuals. For each individual, the dataset offers a full head 3D scan (mesh) along with 60 to 70 photos spanning 360° taken with varying lighting, and camera parameters for these photos.
Besides, we provide qualitative results on several paintings and in-the-wild photos found on the Web. To that end, we estimate camera parameters for a single photo as follows. We detect the same six landmarks as above, but this time obtain their approximate 3D coordinates in the orthographic camera coordinate system ([48] provides them directly). We assume that these coordinates are 3D world coordinates, and that the image was taken with a telephoto lens with a vertical field of view of ≈ 10°. These assumptions allow us to roughly recover the camera pose in world coordinates, namely via an algorithm for the Perspective-n-Point (PnP) problem [50].
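The field-of-view assumption above fixes the focal length that a PnP solver needs. Under a pinhole model, a vertical field of view θ and an image height H in pixels give a focal length f = H / (2 tan(θ / 2)). The sketch below evaluates this relation; the image height of 1024 pixels and the function names are hypothetical.

```python
import math


def focal_length_pixels(image_height, vertical_fov_deg):
    """Pinhole focal length in pixels from the vertical field of view."""
    half_angle = math.radians(vertical_fov_deg) / 2.0
    return 0.5 * image_height / math.tan(half_angle)


def vertical_fov_degrees(image_height, focal_px):
    """Inverse relation: vertical field of view from the focal length."""
    return math.degrees(2.0 * math.atan(0.5 * image_height / focal_px))


# Hypothetical 1024-pixel-high photo with the assumed 10-degree field of view.
focal = focal_length_pixels(1024, 10.0)
recovered_fov = vertical_fov_degrees(1024, focal)
```

A narrow field of view yields a focal length several times larger than the image height, which corresponds to the near-orthographic projection implied by the landmark coordinates.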
When fitting to any unseen pictures, we estimate background masks using an off-the-shelf model [51] and manually refine them.

B. SINGLE-VIEW GEOMETRY RECONSTRUCTION
By providing ground truth 3D scans, H3DS permits a quantitative comparison of geometry reconstruction, so we use it to compare against H3D-Net [4], which was tailored for this dataset. Although H3D-Net was demonstrated to reconstruct from three or more views, it can fit to a single view as well, and it has an advantage on the H3DS dataset since this model was trained on a large dataset (10,000 scenes) from the same distribution.

FIGURE 5.
Single-view mesh reconstruction on the first four scenes of the H3DS dataset (columns: Input, H3D-Net, Ours, GT). H3D-Net [4], a method related to ours, was designed for three-view reconstruction but can also be evaluated in the one-shot mode. The H3D-Net system was trained on 10,000 3D scans from the same distribution as these test examples. Our method is trained on a hundred smartphone videos and still matches the quality of H3D-Net, while demonstrating a somewhat smaller identity gap and a less pronounced regression-to-mean effect.
The target metrics, as in [4], are unidirectional Chamfer distances in millimeters from the predicted mesh to the ground truth, computed after rigid alignment via ICP [52]. One metric is the distance computed over facial area only, and the other one is computed over the entire ground truth mesh of a head.
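As a rough illustration of the unidirectional metric, the sketch below computes the mean distance from each predicted point to its nearest ground truth point. It operates on point samples rather than mesh surfaces, uses a brute-force nearest-neighbor search, and omits the ICP alignment; averaging the distances is also an assumption about the exact definition used in [4].

```python
import math


def unidirectional_chamfer(source_points, target_points):
    """Mean distance from each source point to its nearest target point."""
    return sum(
        min(math.dist(p, q) for q in target_points) for p in source_points
    ) / len(source_points)


# Toy point sets (units would be millimeters in the setting above).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (5.0, 5.0, 5.0)]

pred_to_gt = unidirectional_chamfer(predicted, ground_truth)
gt_to_pred = unidirectional_chamfer(ground_truth, predicted)
```

The two directions generally differ: measuring only from the prediction to the ground truth does not penalize ground truth regions that the prediction leaves uncovered.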
We compute 1-view metrics by reconstructing from left, right (azimuth ≈ 45°), and frontal views. We do not apply our method in the few-shot setting on H3DS because images in this dataset are taken with varying lighting and exposure, lacking the multi-view consistency required for Multi-NeuS.
We compare our best model (low-rank architecture, r = 1000; evaluated in Section IV-C) with H3D-Net in Table 1 and Figure 5. Our method practically matches H3D-Net in reconstruction accuracy while learning from a different dataset that has 100× fewer identities and does not require 3D scanning. Furthermore, rendered samples suggest that H3D-Net samples look very similar to each other, especially outside of the face region (the so-called regression-to-mean effect), while our model predicts more "personalized" shapes.

TABLE 1.
Lower is better; "F/L/R" stand for "frontal/left/right". See Section IV-B for details.

FIGURE 6.
Quality of novel view reconstruction depending on scene-specific layer type (Section III-C), measured on our validation subset of SmartPortraits. Lower rank metamodels have fewer degrees of freedom and underfit during the first step of fitting; higher rank models fit better and provide a more convenient initialization for the second step of fitting (fine-tuning of all weights).
To demonstrate additional single-view geometry reconstruction, we show several reconstructions of in-the-wild photographs and paintings in Figure 7 and in the supplementary video.

C. EFFECT OF NUMBER OF VIEWS AND LAYER TYPE
Although our primary aim is to reconstruct heads given just one image, our method naturally benefits from additional views. We demonstrate this on the validation subset of SmartPortraits. Similarly to H3DS, we restrict the input to three views: left, right (with azimuths around ±45°), and frontal. Since 3D ground truth is not available in this case, we render two additional control views (±20°) and compute masked PSNR against the ground truth images corresponding to these two views (control images). These are in turn averaged over four validation scenes.
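A minimal sketch of a masked PSNR is given below, assuming pixel values in [0, 1] and images stored as flat lists; the function name and toy values are illustrative. Only pixels inside the mask contribute to the mean squared error.

```python
import math


def masked_psnr(pred, target, mask, max_value=1.0):
    """PSNR in dB over pixels where mask is truthy; images are flat lists."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    mean_squared_error = sum(errors) / len(errors)
    return 10.0 * math.log10(max_value ** 2 / mean_squared_error)


# Three pixels; the last one lies outside the mask and is ignored.
pred = [0.5, 0.5, 0.9]
target = [0.6, 0.4, 0.0]
mask = [1, 1, 0]
psnr = masked_psnr(pred, target, mask)
```

In the protocol above, such a value would be computed for each control view and then averaged over the control views and validation scenes. The sketch does not handle the degenerate case of a zero error inside the mask.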
We observe that during fitting to a novel person, optimizing camera parameters provides additional degrees of freedom. This often leads to the person's shape in Multi-NeuS drifting away from its "canonical" position (Section IV-A) and inflating the validation error, even when the reconstruction is good. Moreover, the two control cameras might be estimated inaccurately during data pre-processing. To address this, before reporting PSNR against control images, we refine the control cameras' poses and focal lengths by optimizing for PSNR.
Figure 6 compares how faithfully the novel views are reconstructed depending on the number of input views (one, two, or three) and depending on the layer type (independent or low-rank). In the case of a single input view, the metric is averaged over reconstructions from left, right, and frontal views. In the two-view case, we take the left and right views as input.
Consider low-rank models. Clearly, at the first step of fitting (Figure 6, left), when only the linear combination coefficients $c_{N+1}$ and camera parameters are optimized, the models underfit in the case of low ranks. When the rank is very high (2000), the models start to overfit since the number of parameters becomes excessive. This is additionally illustrated in Figure 8.
In all cases, the second step of fitting (Figure 6, right), where all parameters are fine-tuned, is necessary because low-rank coefficients have few degrees of freedom. Multi-NeuS without the second step thus underfits the input views. However, in the few-shot setting, optimizing the full network can lead to severe overfitting. So the primary goal of the first fitting step is to provide a good initialization for this second step. According to the diagram, models with a reasonably high rank (e.g. 1000) provide the best initializations, but this comes at the cost of overfitting during fine-tuning, which may even decrease the overall PSNR by distorting the unseen areas of the resulting shape. This is probably because high-rank scene-specific layers gain too much representational power and the shared layers are not forced as hard to learn universal features, thus hampering generalization. An alternative interpretation is that, since we spend the same number of fine-tuning iterations regardless of rank, early stopping might alleviate some overfitting issues.
Another obvious observation is that with more views the effect of overfitting decreases, and the advantage of higher-rank models becomes less pronounced (e.g. a rank-50 model already does well for the three-view reconstruction). In addition, varying the number of views allows us to assess the capacity of scene-specific layers.
The model with independent scene-specific layers does not generalize well because of its excessive capacity. Although it demonstrates a larger PSNR than the 50- and 150-rank models, it does so because its scene-specific layers are ordinary linear layers, which can fit the training view really well, while low-rank models underfit. At the same time, the unseen parts in the validation views already get distorted in the first step, and this is why the second step (fine-tuning) does not improve the score in this case.
Finally, to show the necessity of shared architectures and meta-learning, we compare to a simple baseline (Figure 6, far right) where a vanilla NeuS (without view directions) is trained on a scene from SmartPortraits and is then fine-tuned (transfer-learned) in a few-shot scenario to the target scene. This is essentially equivalent to Multi-NeuS with N = 1, i.e. with a pre-training dataset of 1 scene. The score for this baseline was computed by transferring from 4 different SmartPortraits training scenes (2 male, 2 female) and averaging the metric. Although NeuS typically fits better to a single scene than Multi-NeuS to any of its meta-learning scenes, its few-shot generalization ability is clearly lower compared to any version of Multi-NeuS.

FIGURE 8.
The independent architecture with overparametrized scene-specific layers overfits already at this fitting stage. The low-rank variants become better at fitting these hidden parts with increasing ranks and model the texture better, but at some point (r = 2000 in this case) get too many degrees of freedom and start to overfit. r = 1000 provides optimal reconstruction in this case (and on average).

D. WHICH LAYERS TO MAKE SCENE-SPECIFIC
In this subsection, we evaluate the choice to put scene-specific layers into the first halves of the SDF network and the radiance network of vanilla NeuS. Some possible choices are listed in Table 2 and are evaluated for low-rank shared layers with r = 1000. Putting scene-specific layers into the radiance network only results in a constant SDF in the first stage of fitting, which leads to very low metrics. Results for other sharing patterns are harder to interpret, but arguably good performance requires a sufficient number of scene-specific layers (having too few of them is detrimental). Furthermore, at least some of these layers should be among the early processing layers.

[Table caption] Novel view reconstruction quality depending on the choice of layers to replace with their scene-specific variants. We test the performance on the validation scenes from SmartPortraits and report PSNR values (in dB) averaged across holdout views. Boxes depict sequential fully-connected layers: as in NeuS, there are 9 layers that predict SDF, followed by 5 layers that predict radiance. ■ means scene-specific layer, □ means shared, i.e. vanilla linear layer. Layer type is low-rank, r = 1000.
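The PSNR values reported for Table 2 are averaged over holdout views. A minimal sketch of such a computation is given below; it uses the standard PSNR definition and assumes pixel intensities in [0, 1] stored as flat sequences. The function names `psnr` and `mean_psnr` are illustrative and are not taken from the authors' code.

```python
import math


def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel intensities.

    Assumption: intensities lie in [0, max_value].
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(views):
    """Average the per-view PSNR over an iterable of (pred, target) pairs."""
    scores = [psnr(pred, target) for pred, target in views]
    return sum(scores) / len(scores)
```

Note that this sketch averages the per-view PSNR values rather than computing one PSNR from a pooled MSE; the two generally differ.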

E. IMPLEMENTATION DETAILS AND HYPERPARAMETERS
In both meta-learning and fitting, we use minibatches of 512 rays. In our experiments, there are 610,000 optimization iterations during meta-learning, 12,000 iterations in the first stage of fine-tuning and 13,000 in the second stage of fine-tuning. Pre-training (meta-learning) Multi-NeuS takes around 24 hours and fitting it to a novel subject takes about an hour on a single NVIDIA V100 GPU. The learning rate is 1.8 · 10⁻⁴ in meta-learning, 4 · 10⁻⁴ in the first step of fitting and 6 · 10⁻⁵ in the second step. We multiply the learning rate by 0.316 every time the loss stops decreasing (known as the "reduce-on-plateau" schedule). All other hyperparameters, including the number of ray sampling steps, eikonal loss weight, and weight initialization (including that in the fitting stage), are kept the same as in NeuS [6].
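The reduce-on-plateau schedule can be sketched in a few lines of plain Python. The multiplication factor 0.316 and the initial learning rate come from the text; the `patience` and `min_delta` parameters, as well as the class name, are assumptions introduced for illustration, since the text does not specify how a plateau is detected.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when the loss stops decreasing.

    Assumption: a plateau is declared after more than `patience` consecutive
    steps without an improvement of at least `min_delta` over the best loss.
    """

    def __init__(self, lr, factor=0.316, patience=10, min_delta=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        """Register a new loss value and return the (possibly reduced) rate."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                self.lr *= self.factor
                self.bad_steps = 0
        return self.lr


# Usage: meta-learning starts at 1.8e-4 (from the text); patience is hypothetical.
scheduler = ReduceOnPlateau(lr=1.8e-4, patience=2)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
    current_lr = scheduler.step(loss)
```

Since 0.316 is close to the square root of 0.1, two consecutive reductions lower the learning rate by roughly one order of magnitude.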
We optimize camera parameters similarly to [53]. Specifically, we (1) multiply the initial camera rotation matrix by an optimizable update matrix parametrized using the so(3) Lie algebra, (2) add an optimizable residual to the translation parameters, and (3) multiply the focal length by an optimizable scalar.
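The three camera updates listed above can be illustrated with the following sketch. It is not the authors' implementation: the so(3) element is assumed to be an axis-angle vector mapped to a rotation by Rodrigues' formula, the update matrix is assumed to be applied on the left of the initial rotation (the text does not state the order), and all function and parameter names are hypothetical. In practice `w`, `dt` and `s` would be optimized by gradient descent in an autodiff framework; here they are plain numbers.

```python
import math


def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def so3_exp(w):
    """Rodrigues' formula: axis-angle vector w in so(3) -> 3x3 rotation matrix."""
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    # Skew-symmetric matrix of w (not normalized).
    K = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]
    K2 = matmul3(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # limits of sin(t)/t and (1 - cos t)/t^2 as t -> 0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def refine_camera(R0, t0, f0, w, dt, s):
    """Apply the three updates to initial rotation R0, translation t0, focal f0."""
    R = matmul3(so3_exp(w), R0)            # (1) assumption: left-multiplied update
    t = [ti + di for ti, di in zip(t0, dt)]  # (2) residual added to translation
    f = f0 * s                             # (3) focal length scaled by a scalar
    return R, t, f
```

With `w = (0, 0, 0)`, `dt = (0, 0, 0)` and `s = 1` the function returns the initial camera unchanged, which is a natural initialization for the optimizable parameters.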

V. DISCUSSION
We have presented Multi-NeuS, an approach for one- and few-shot 3D head portrait reconstruction. The approach can reconstruct head portraits in the form of a surface mesh and texture. To enable the few-shot capability, we propose and validate a very simple idea of taking a scene-specific deep architecture (NeuS) and fitting it to multiple scenes, while sharing some parameters across scenes. We show that despite its simplicity, this idea is sufficient to accomplish knowledge transfer from the training scenes to previously unseen test scenes. We believe that this general idea might be applicable beyond head portrait reconstruction to other classes (e.g. full-body reconstruction) and architectures (e.g. different NeRF types).
Our approach has certain limitations. Many of them are due to a rather constrained training dataset. First, there are only 103 training sequences. Although the generalization ability of Multi-NeuS seems very good for such a small dataset, there is still low diversity of hair styles, adornments, and skin types. In addition, SmartPortraits only exhibits neutral facial expressions, though in practice Multi-NeuS still seems to reconstruct smiles reasonably well. Moreover, the camera in the dataset only travels at most ±45° around the head and therefore does not capture the back. As a result, our model always fails to reconstruct the back because it has never "seen" it in training (Figure 7, bottom right; Figure 9). Thus, an obvious remedy to improve the quality is to expand our training set.
Our models might benefit greatly from further improvements and simplifications of the underlying architecture. While the first step of fitting often provides a good initialization for occluded regions, the second step sometimes worsens these regions (Figure 8; Figure 10). This could be addressed with ad-hoc inpainting procedures that exploit class-specific symmetries, or with more principled extensions of our method such as learned gradient descent [54], [55]. However, the fundamental problem might be hidden deeper in the network architecture. This is additionally highlighted by the fact that Multi-NeuS struggles to fit training samples with the same accuracy as NeuS. A promising direction for future investigation is therefore how to reduce the model complexity even further (e.g. by using small learnable latent dictionaries) and to allow for very large datasets and better generalization. Finally, the models produced with our approach come without rigging capability, and in the future it would be interesting to extend our framework to address this.

VI. ACKNOWLEDGMENT
The authors acknowledge the use of computational resources of the Skoltech supercomputer Zhores [56] for obtaining the results presented in this paper.