Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild (Invited Paper)

We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least approximately, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.


INTRODUCTION
The ability to understand and reconstruct the content of images in 3D is of great importance in many computer vision applications. Yet, when it comes to learning categories of visual objects, for instance to detect and segment them, most approaches model them as 2D patterns [1], with no obvious understanding of their 3D structure. Thus, in this paper we consider the problem of learning categories of 3D deformable objects. Furthermore, we do so under two challenging conditions. The first condition is that no 2D or 3D ground-truth information (such as keypoints, segmentation, depth maps, or prior knowledge of a 3D model) is available. Learning without external supervision removes the need for collecting image annotations, which is often a major obstacle to deploying deep learning to new applications. The second condition is that learning can only use an unconstrained collection of single-view images; in particular, it does not use multiple views of the same object instance. Learning from single-view images is useful because in many applications we only have a source of independent still images to work with (for example, images obtained from an Internet search engine).
In more detail, we introduce a new learning algorithm that takes as input a collection of single-view images of a deformable object category and produces as output a deep network that can estimate the 3D shape of any object instance given a single image of it (Fig. 1). The algorithm is based on an autoencoder that internally decomposes the image into albedo, depth, illumination and viewpoint, without direct supervision for any of these factors. In general, decomposing images into these four factors is ill-posed. We thus seek a minimal set of assumptions that makes the problem solvable. To this end, we note that many object categories are symmetric (e.g. almost all animals and many handcrafted objects). If an object is perfectly symmetric, mirroring any image of it results in a second virtual view of the object. Furthermore, if point correspondences between the image and its mirrored version can be established, then the 3D shape of the object can be recovered using any of a number of standard multi-view 3D reconstruction approaches [2], [3], [4], [5], [6]. Motivated by this, we seek to leverage symmetry as a cue to constrain this decomposition task.
While symmetry is a powerful cue, using it in practice is far from trivial. First, even if symmetry allows us to obtain a pair of virtual views of an object, reconstruction still requires establishing point correspondences between them, which can be difficult to do in an unsupervised manner. For instance, the appearance of symmetric points may still differ substantially due to asymmetric illumination. Second, specific object instances are in practice never fully symmetric, neither in shape nor appearance. Shape is non-symmetric due to variations in pose or other details (e.g. the hair style or expression of a human face), and albedo can also be non-symmetric (e.g. asymmetries in the texture of a cat's fur).
We address these issues in two ways. First, we explicitly account for the effect of illumination in the reconstruction pipeline by decomposing the appearance into albedo and shading. In this manner, the model learns to explain asymmetries in the object appearance resulting from illumination, allowing it to better understand how pairs of symmetric views of the object correspond. Moreover, since shading provides information on the surface normals and thus the 3D shape, decomposing it allows the model to explicitly use this information to constrain 3D shapes. Second, we augment the model to reason about potential lack of symmetry in the objects. To do this, the model predicts, along with the factors listed above, a dense map explaining the probability that a given pixel has a symmetric counterpart in the image.
We combine these elements in an end-to-end learning formulation, where all components, including the symmetry probability map, are learned from raw RGB images only. As a further contribution, we show that, rather than enforcing the symmetry by adding further terms to the learning objective, we can instead do so indirectly. The latter is obtained by randomly mirroring the internal representation of the object, thus encouraging the autoencoder to generate a symmetric view of the object. The advantage of this approach is that it avoids the need to introduce and thus tune additional terms in the learning objective.
We test our method on several datasets, including human faces, cat faces and synthetic cars. We provide a thorough ablation study and extensive analyses using a synthetic face dataset with the necessary 3D ground truth. On real images, we achieve higher-fidelity reconstruction results compared to other methods [7], [8] that do not rely on 2D or 3D ground-truth information, nor prior knowledge of a 3D model of the instance or class. In addition, our method outperforms a recent state-of-the-art method [9] that uses keypoint supervision for 3D reconstruction on real faces, while our method uses no external supervision at all. As a by-product, our method also learns intrinsic image decomposition without any external supervision. Finally, we demonstrate that our trained model generalizes to non-natural images, such as paintings and cartoon drawings, as well as video frames, without any fine-tuning.
This article is an extension and archival version of our previous work [10]. In this article, we expand the literature review, provide additional technical details, and include additional experiments and discussions that reveal the important insights of the proposed algorithm, including how it works, how it may fail, and how it compares to prominent model-based methods on 3D reconstruction benchmarks. The code and pretrained models are available at https://github.com/elliottwu/unsup3d.

RELATED WORK
In order to assess our contribution in relation to the vast literature on image-based 3D reconstruction, it is important to consider three aspects of each approach: which information is used, which assumptions are made, and what the output is. Below and in Table 1 we compare our contribution to prior works based on these factors.

Fig. 1. Unsupervised learning of 3D deformable objects from in-the-wild images. Left: Training uses only single views of the object category with no additional supervision at all (i.e. no ground-truth 3D information, multiple views, or any prior model of the object). Right: Once trained, our model reconstructs the 3D pose, shape, albedo and illumination of a deformable object instance from a single image with excellent fidelity.
Our method uses single-view images of an object category as training data, assumes that the objects belong to a specific class (e.g. human faces) which is weakly symmetric, and outputs a monocular predictor capable of decomposing any image of the category into shape, albedo, illumination, viewpoint and symmetry probability.

Structure From Motion
Traditional methods such as Structure from Motion (SfM) [35] can reconstruct the 3D structure of individual rigid scenes given as input multiple views of each scene and 2D keypoint matches between the views. This can be extended in two ways. First, monocular reconstruction methods can perform dense 3D reconstruction from a single image without 2D keypoints [36], [37], [38]. However, they require multiple views [38] or videos of rigid scenes for training [36]. Second, Non-Rigid SfM (NRSfM) approaches [39], [40] can learn to reconstruct deformable objects by allowing 3D points to deform in a limited manner between views, but require supervision in terms of annotated 2D keypoints for both training and testing. Hence, neither family of SfM approaches can learn to reconstruct deformable objects from raw pixels of a single view.

Shape From X
Many other monocular cues have been used as alternatives or supplements to SfM for recovering shape from images, such as shading [41], [42], silhouettes [43], texture [44] and symmetry [2], [3]. In particular, our work is inspired by shape from symmetry and shape from shading. Shape from symmetry [2], [3], [4], [5] reconstructs symmetric objects from a single image by using the mirrored image as a virtual second view, provided that symmetric correspondences are available. [5] also shows that it is possible to detect symmetries and correspondences using descriptors. Shape from shading [41], [42] assumes a shading model such as Lambertian reflectance, and reconstructs the surface by exploiting the non-uniform illumination.
Only recently have authors attempted to learn the geometry of object categories from raw, monocular views only. Thewlis et al. [63], [64] use equivariance to learn dense landmarks, which recovers the 2D geometry of the objects. DAE [65] learns to predict a deformation field by heavily constraining an autoencoder with a small bottleneck embedding; this is lifted to 3D in [7], where, in post-processing, the reconstruction is further decomposed into albedo and shading, obtaining an output similar to ours.
Adversarial learning has been proposed as a way of hallucinating new views of an object. Some of these methods start from 3D representations [12], [13], [14], [32], [66]. Kato et al. [24] train a discriminator on raw images but use viewpoint as additional supervision. HoloGAN [31] only uses raw images but does not obtain an explicit 3D reconstruction. Szabo et al. [8] use adversarial training to reconstruct 3D meshes of the object, but do not assess their results quantitatively. Henzler et al. [27] also learn from raw images, but only experiment with images that contain the object on a white background, which is akin to supervision with 2D silhouettes. In Section 4.4, we compare to [7], [8] and demonstrate superior reconstruction results with much higher fidelity.
Since our model generates images from an internal 3D representation, one essential component is a differentiable renderer. However, with a traditional rendering pipeline, gradients across occlusions and boundaries are not defined. Several soft relaxations have thus been proposed [67], [68], [69]. Here, we use a PyTorch implementation of [68].

METHOD
Our learning algorithm, illustrated in Fig. 2, takes as input a collection of independent images of objects of a certain category, such as human or cat faces. It then produces as output a model F that, given any new image, recovers the object's 3D shape, albedo, illumination and viewpoint.
As the algorithm has only raw images to learn from, the learning objective is reconstructive: namely, the model is trained so that the combination of the four factors gives back the input image. This results in an autoencoding pipeline where the factors have, due to the way they are combined to generate an image, a specific photo-geometric interpretation.
Due to the lack of 2D or 3D supervision and of a 3D prior on the possible shapes of the objects, this reconstruction problem is ill-posed. In order to address this issue, we use the fact that many object categories are bilaterally symmetric, which provides a strong geometric cue to remove the most severe reconstruction ambiguities. In practice, the appearance of specific object instances is never exactly symmetric due to deformations of the 3D shape and asymmetric details in the shape itself as well as in the illumination and albedo. We take two measures to account for these asymmetries. First, we explicitly model asymmetric illumination. Second, our model also estimates, for each pixel in the input image, a confidence score that expresses the probability of the pixel having a symmetric counterpart in the image (denoted as confidence maps $\sigma$ and $\sigma'$ in Fig. 2).
The following sections describe how this is done, looking first at the photo-geometric autoencoder (Section 3.1), then at how symmetries are modelled (Section 3.2), followed by details of the image formation (Section 3.3) and the optional use of a perceptual loss (Section 3.4).

Photo-Geometric Autoencoding
An image $I$ is a function $\Omega \to \mathbb{R}^3$ defined on a grid $\Omega = \{0, \dots, W-1\} \times \{0, \dots, H-1\}$, or, equivalently, a tensor in $\mathbb{R}^{3 \times W \times H}$. We assume that the image is roughly centered on an instance of the object of interest. The goal is to learn a function $F$, implemented as a neural network, that maps the image $I$ to four factors $(d, a, w, l)$ comprising a depth map $d : \Omega \to \mathbb{R}_+$, an albedo image $a : \Omega \to \mathbb{R}^3$, a global light direction $l \in \mathbb{S}^2$, and a viewpoint $w \in \mathbb{R}^6$ so that the image can be reconstructed from them.
The image $I$ is reconstructed from the four factors in two steps, lighting $\Lambda$ and reprojection $\Pi$, as follows: $\hat{I} = \Pi(\Lambda(a, d, l), d, w)$ (1). The lighting function $\Lambda$ generates a version of the object based on the depth map $d$, the light direction $l$ and the albedo $a$ as seen from a canonical viewpoint $w = 0$. The viewpoint $w$ represents the transformation between the canonical view and the viewpoint of the actual input image $I$. Then, the reprojection function $\Pi$ simulates the effect of a viewpoint change and generates the image $\hat{I}$ given the canonical depth $d$ and the shaded canonical image $\Lambda(a, d, l)$.
Learning uses a reconstruction loss which encourages $I \approx \hat{I}$ (Section 3.2).
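To make the pipeline concrete, the following is a minimal sketch of the autoencoding step in Eq. (1); the network and rendering functions (depth_net, albedo_net, view_net, light_net, shade, reproject) are hypothetical placeholders for the components described above, not our exact implementation.

```python
import torch

def reconstruct(image, depth_net, albedo_net, view_net, light_net, shade, reproject):
    # Factor the image and recompose it, as in Eq. (1).
    d = depth_net(image)        # canonical depth map,  B x 1 x H x W
    a = albedo_net(image)       # canonical albedo,     B x 3 x H x W
    w = view_net(image)         # viewpoint, B x 6 (3 rotation angles, 3 translations)
    l = light_net(image)        # lighting parameters (k_s, k_d, l_x, l_y)
    J = shade(a, d, l)          # canonical shaded image Lambda(a, d, l)
    I_hat = reproject(J, d, w)  # warp to the input viewpoint: Pi(J, d, w)
    return I_hat, (d, a, w, l)
```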

Discussion
The effect of lighting could be incorporated in the albedo a by interpreting the latter as a texture rather than as the object's albedo. However, there are two good reasons to avoid this. First, the albedo a is often symmetric even if the illumination causes the corresponding appearance to look asymmetric. Separating them allows us to more effectively incorporate the symmetry constraint described below. Second, shading provides an additional cue on the underlying 3D shape [70], [71]. In particular, unlike the recent work of [65], where a shading map is predicted independently from the shape, our model computes the shading based on the predicted depth, so that shading and shape mutually constrain each other.

Probably Symmetric Objects
Leveraging symmetry for 3D reconstruction requires identifying symmetric object points in an image. Here we do so implicitly, assuming that depth and albedo, which are reconstructed in a canonical frame, are symmetric about a fixed vertical plane. An important beneficial side effect of this choice is that it helps the model discover a 'canonical view' for the object, which is important for reconstruction [40].
To do this, we consider the operator that flips a map $a \in \mathbb{R}^{C \times W \times H}$ along the horizontal axis (the choice of axis is arbitrary as long as it is fixed): $[\mathrm{flip}\,a]_{c,u,v} = a_{c,W-1-u,v}$. We then require $d \approx \mathrm{flip}\,d'$ and $a \approx \mathrm{flip}\,a'$. While these constraints could be enforced by adding corresponding loss terms to the learning objective, they would be difficult to balance. Instead, we achieve the same effect indirectly, by obtaining a second reconstruction $\hat{I}'$ from the flipped depth and albedo: $\hat{I}' = \Pi(\Lambda(a', d', l), d', w)$, where $a' = \mathrm{flip}\,a$ and $d' = \mathrm{flip}\,d$ (2). Then, we consider two reconstruction losses encouraging $I \approx \hat{I}$ and $I \approx \hat{I}'$. Since the two losses are commensurate, they are easy to balance and train jointly. Most importantly, this approach allows us to easily reason about symmetry probabilistically, as explained next.
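A minimal sketch of the flip operator and of the second, flipped reconstruction of Eq. (2); shade and reproject are the hypothetical placeholders from the sketch above.

```python
import torch

def flip(x):
    # [flip a]_{c,u,v} = a_{c, W-1-u, v}: mirror a B x C x H x W tensor along
    # its width (the horizontal axis of the canonical maps).
    return torch.flip(x, dims=[3])

def reconstruct_flipped(a, d, l, w, shade, reproject):
    # Second reconstruction I_hat' of Eq. (2), reusing the hypothetical
    # `shade` / `reproject` functions from the sketch above.
    a_flip, d_flip = flip(a), flip(d)
    return reproject(shade(a_flip, d_flip, l), d_flip, w)
```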
The source image $I$ and the reconstruction $\hat{I}$ are compared via the loss $\mathcal{L}(\hat{I}, I, \sigma) = -\frac{1}{|\Omega|} \sum_{uv \in \Omega} \ln \frac{1}{\sqrt{2}\,\sigma_{uv}} \exp\left(-\frac{\sqrt{2}\,\ell_{1,uv}}{\sigma_{uv}}\right)$ (3), where $\ell_{1,uv} = |\hat{I}_{uv} - I_{uv}|$ is the $L_1$ distance between the intensity of pixels at location $uv$, and $\sigma \in \mathbb{R}^{W \times H}_+$ is a confidence map, also estimated by the network $F$ from the image $I$, which expresses the aleatoric uncertainty of the model. The loss can be interpreted as the negative log-likelihood of a factorized Laplacian distribution on the reconstruction residuals. Optimizing the likelihood causes the model to self-calibrate, learning a meaningful confidence map [72].
Modelling uncertainty is generally useful, but in our case it is particularly important when we consider the "symmetric" reconstruction $\hat{I}'$, for which we use the same loss $\mathcal{L}(\hat{I}', I, \sigma')$. Crucially, we use the network to estimate, also from the same input image $I$, a second confidence map $\sigma'$. This confidence map allows the model to learn which portions of the input image might not be symmetric. For instance, in some cases hair on a human face is not symmetric, as shown in Fig. 2, and $\sigma'$ can assign a higher reconstruction uncertainty to the hair region, where the symmetry assumption is not satisfied. Note that this depends on the specific instance under consideration, and is learned by the model itself.
Overall, the learning objective is given by the combination of the two reconstruction errors: $\mathcal{E}(F; I) = \mathcal{L}(\hat{I}, I, \sigma) + \lambda_f \mathcal{L}(\hat{I}', I, \sigma')$ (4), where $\lambda_f$ weights the reconstruction obtained from the flipped maps.
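The two confidence-weighted reconstruction terms can be sketched as follows; the default value of lambda_f below is an illustrative assumption rather than the value used in our experiments.

```python
import math
import torch

def laplacian_nll(I_hat, I, sigma):
    # Per-pixel negative log-likelihood of a factorized Laplacian (Eq. (3)):
    # ln(sqrt(2) * sigma_uv) + sqrt(2) * l1_uv / sigma_uv, averaged over pixels.
    l1 = (I_hat - I).abs().mean(dim=1, keepdim=True)  # residual, averaged over channels
    return (math.sqrt(2.0) * l1 / sigma + torch.log(math.sqrt(2.0) * sigma)).mean()

def objective(I, I_hat, I_hat_flip, sigma, sigma_flip, lambda_f=0.5):
    # Combined objective of Eq. (4), weighting the flipped reconstruction term.
    return laplacian_nll(I_hat, I, sigma) + lambda_f * laplacian_nll(I_hat_flip, I, sigma_flip)
```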

Image Formation Model
We now describe the functions $\Pi$ and $\Lambda$ in Eq. (1) in more detail. The image is formed by a camera looking at a 3D object. If we denote with $P = (P_x, P_y, P_z) \in \mathbb{R}^3$ a 3D point expressed in the reference frame of the camera, this is mapped to pixel $p = (u, v, 1)$ by the following projection: $p \propto KP$, $K = \begin{pmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{pmatrix}$, with $c_u = \frac{W-1}{2}$, $c_v = \frac{H-1}{2}$ and $f = \frac{W-1}{2 \tan(\theta_{\mathrm{FOV}}/2)}$ (5). This model assumes a perspective camera with field of view (FOV) $\theta_{\mathrm{FOV}}$. We assume a nominal distance of the object from the camera of about 1m. Given that the images are cropped around a particular object, we assume a relatively narrow FOV of $\theta_{\mathrm{FOV}} \approx 10°$. The depth map $d : \Omega \to \mathbb{R}_+$ associates a depth value $d_{uv}$ to each pixel $(u, v) \in \Omega$ in the canonical view. By inverting the camera model (5), we find that this corresponds to the 3D point $P = d_{uv} \cdot K^{-1} p$. The viewpoint $w \in \mathbb{R}^6$ represents a Euclidean transformation $(R, T) \in SE(3)$, where $w_{1:3}$ and $w_{4:6}$ are rotation angles and translations along the $x$, $y$ and $z$ axes respectively.
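A minimal sketch of the pinhole camera of Eq. (5) and of the back-projection $P = d_{uv} \cdot K^{-1} p$; tensor shapes and helper names are illustrative.

```python
import math
import torch

def intrinsics(W, H, fov_deg=10.0):
    # Pinhole camera matrix with a centred principal point and a narrow FOV, as in Eq. (5).
    f = (W - 1) / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return torch.tensor([[f, 0.0, (W - 1) / 2.0],
                         [0.0, f, (H - 1) / 2.0],
                         [0.0, 0.0, 1.0]])

def unproject(depth, K):
    # Back-project every pixel (u, v) to P = d_uv * K^{-1} (u, v, 1)^T.
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    p = torch.stack([u, v, torch.ones_like(u)], dim=-1)    # H x W x 3 homogeneous pixels
    return depth.unsqueeze(-1) * (p @ torch.inverse(K).T)  # H x W x 3 camera-frame points
```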
The map $(R, T)$ transforms 3D points from the canonical view to the actual view. Thus a pixel $(u, v)$ in the canonical view is mapped to the pixel $(u', v')$ in the actual view by the warping function $\eta_{d,w} : (u, v) \mapsto (u', v')$ given by $p' \propto K (d_{uv} \cdot R K^{-1} p + T)$ (6), where $p' = (u', v', 1)$. Finally, the reprojection function $\Pi$ takes as input the depth $d$ and the viewpoint change $w$ and applies the resulting warp to the canonical image $J$ to obtain the actual image $\hat{I}$ via $\hat{I}_{u'v'} = J_{uv}$, $(u, v) = \eta_{d,w}^{-1}(u', v')$ (7). Note that this requires computing the inverse of the warp $\eta_{d,w}$, which is detailed in Section 3.5.
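The warp of Eq. (6) can then be sketched as follows, reusing the back-projected points from the sketch above; shapes and names are again illustrative.

```python
import torch

def warp_to_view(P_canon, R, T, K):
    # Sketch of eta_{d,w} in Eq. (6): project canonical 3D points into the actual view.
    # P_canon: H x W x 3 points (e.g. from `unproject`), R: 3 x 3, T: (3,), K: intrinsics.
    P_view = P_canon @ R.T + T                   # rigid transform (R, T) derived from w
    p_hom = P_view @ K.T                         # p' ~ K (R P + T)
    return p_hom[..., :2] / p_hom[..., 2:3].clamp(min=1e-7)   # pixel coordinates (u', v')
```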
The canonical image $J = \Lambda(a, d, l)$ is in turn generated as a combination of albedo, normal map and light direction. To do so, given the depth map $d$, we derive the normal map $n : \Omega \to \mathbb{S}^2$ by associating to each pixel $(u, v)$ a vector normal to the underlying 3D surface. In order to find this vector, we compute the vectors $t^u_{uv}$ and $t^v_{uv}$ tangent to the surface along the $u$ and $v$ directions. For example, the first one is $t^u_{uv} = d_{u+1,v} \cdot K^{-1}(p + e_x) - d_{u-1,v} \cdot K^{-1}(p - e_x)$, where $p$ is defined above and $e_x = (1, 0, 0)$. Then, the normal is obtained by taking the vector product $n_{uv} \propto t^u_{uv} \times t^v_{uv}$. The normal $n_{uv}$ is multiplied by the light direction $l$ to obtain a value for the directional illumination and the latter is added to the ambient light. Finally, the result is multiplied by the albedo to obtain the illuminated texture, as follows: $J_{uv} = (k_s + k_d \max\{0, \langle l, n_{uv} \rangle\}) \cdot a_{uv}$ (8). Here $k_s$ and $k_d$ are the scalar coefficients weighting the ambient and diffuse terms, and are predicted by the model with range between 0 and 1 via rescaling a tanh output. The light direction $l = (l_x, l_y, 1)^T / (l_x^2 + l_y^2 + 1)^{1/2}$ is modeled as a spherical sector by predicting $l_x$ and $l_y$ with tanh.
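A sketch of the lighting function $\Lambda$: normals are obtained from finite differences of the back-projected points and combined with the ambient and diffuse terms of Eq. (8). The padding and normalization details below are illustrative choices, not necessarily those of our implementation.

```python
import torch
import torch.nn.functional as F

def shade(albedo, P, l_xy, k_s, k_d):
    # albedo: H x W x 3; P: H x W x 3 back-projected points (e.g. from `unproject`);
    # l_xy: (l_x, l_y); k_s / k_d: ambient / diffuse coefficients in [0, 1].
    t_u = P[:, 2:, :] - P[:, :-2, :]                  # tangent along u (width)
    t_v = P[2:, :, :] - P[:-2, :, :]                  # tangent along v (height)
    n = torch.cross(t_u[1:-1], t_v[:, 1:-1], dim=-1)  # n_uv ~ t_u x t_v on interior pixels
    n = F.normalize(n, dim=-1)
    n = F.pad(n.permute(2, 0, 1).unsqueeze(0), (1, 1, 1, 1),
              mode='replicate')[0].permute(1, 2, 0)   # pad back to H x W x 3
    l = torch.tensor([l_xy[0], l_xy[1], 1.0])
    l = l / l.norm()                                  # light direction on the unit sphere
    diffuse = (n @ l).clamp(min=0.0)                  # max{0, <l, n_uv>}
    return (k_s + k_d * diffuse).unsqueeze(-1) * albedo   # J_uv of Eq. (8)
```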

Perceptual Loss
The $L_1$ loss function in Eq. (3) is sensitive to small geometric imperfections and tends to result in blurry reconstructions. We add a perceptual loss term to mitigate this problem. The $k$th layer of an off-the-shelf image encoder $e$ (VGG16 in our case [73]) predicts a representation $e^{(k)}(I) \in \mathbb{R}^{C_k \times W_k \times H_k}$, where $\Omega_k = \{0, \dots, W_k-1\} \times \{0, \dots, H_k-1\}$ is the corresponding spatial domain. Note that this feature encoder does not have to be trained with supervised tasks. Self-supervised encoders can be equally effective, as shown in Table 3.
Similar to Eq. (3), assuming a Gaussian distribution, the perceptual loss is given by $\mathcal{L}_p(\hat{I}, I, \sigma^{(k)}) = -\frac{1}{|\Omega_k|} \sum_{uv \in \Omega_k} \ln \frac{1}{\sqrt{2\pi}\,\sigma^{(k)}_{uv}} \exp\left(-\frac{(\ell^{(k)}_{uv})^2}{2(\sigma^{(k)}_{uv})^2}\right)$ (9), where $\ell^{(k)}_{uv} = |e^{(k)}_{uv}(\hat{I}) - e^{(k)}_{uv}(I)|$ for each pixel index $uv$ in the $k$th layer. We also compute the loss for $\hat{I}'$ using $\sigma'^{(k)}$. $\sigma^{(k)}$ and $\sigma'^{(k)}$ are additional confidence maps predicted by our model. In practice, we found it is sufficient for our purpose to use the features from only one layer (relu3_3) of VGG16. We therefore shorten the notation of the perceptual loss to $\mathcal{L}_p$. With this, the loss function $\mathcal{L}$ in Eq. (4) is replaced by $\mathcal{L} + \lambda_p \mathcal{L}_p$ with $\lambda_p = 1$.
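A sketch of the confidence-weighted perceptual loss of Eq. (9) using torchvision's VGG16; the layer slicing index used to reach relu3_3 is an assumption about torchvision's layer ordering.

```python
import math
import torch
import torchvision

# Feature extractor up to relu3_3 of an ImageNet-pretrained VGG16 (slicing index assumed).
vgg_relu3_3 = torchvision.models.vgg16(weights='IMAGENET1K_V1').features[:16].eval()
for p in vgg_relu3_3.parameters():
    p.requires_grad_(False)

def perceptual_loss(I_hat, I, sigma_k):
    # Gaussian NLL on feature residuals; sigma_k is the confidence map predicted
    # at the feature resolution of the chosen layer.
    l = (vgg_relu3_3(I_hat) - vgg_relu3_3(I)).abs().mean(dim=1, keepdim=True)
    return (l ** 2 / (2 * sigma_k ** 2) + torch.log(math.sqrt(2 * math.pi) * sigma_k)).mean()
```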

Differentiable Rendering Layer
As noted in Section 3.3, the reprojection function $\Pi$ warps the canonical image $J$ to generate the actual image $\hat{I}$. In CNNs, image warping is usually regarded as a simple operation that can be implemented efficiently using a bilinear resampling layer [74]. However, this is true only if we can easily send pixels $(u', v')$ in the warped image $\hat{I}$ back to pixels $(u, v)$ in the source image $J$, a process also known as backward warping. Unfortunately, in our case the function $\eta_{d,w}$ obtained by Eq. (6) sends pixels in the opposite direction.
Implementing a forward warping layer is surprisingly delicate. One way of approaching the problem is to regard this task as a special case of rendering a textured mesh. The Neural Mesh Renderer (NMR) of [68] is a differentiable renderer of this type. In our case, the mesh has one vertex per pixel and each group of 2 × 2 adjacent pixels is tessellated by two triangles. Empirically, we found the quality of the texture gradients of NMR to be poor in this case, likely caused by the noisy depth map d and the high-frequency content of the texture image J.
We solve the problem as follows. First, we use NMR to warp only the depth map d, obtaining a version $\bar{d}$ of the depth map as seen from the input viewpoint. This has two advantages: first, backpropagation through NMR is faster; second, the depth gradients are more stable than color gradients, probably also due to the comparatively smooth nature of the depth map d compared to the texture image J.
Given the warped depth map $\bar{d}$, we then use the inverse of Eq. (6) to find the warp field from the observed viewpoint to the canonical viewpoint, and bilinearly resample the canonical image J to obtain the reconstruction (i.e. using backward warping).
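The backward warping step can be sketched with a bilinear resampling layer as follows; grid_uv stands for the (hypothetical) per-pixel sampling locations produced by inverting Eq. (6).

```python
import torch
import torch.nn.functional as F

def backward_warp(J, grid_uv):
    # Bilinearly resample the canonical image J (B x 3 x H x W) at the pixel
    # locations grid_uv (B x H x W x 2, in pixel units) assigned to each output pixel.
    B, C, H, W = J.shape
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack([grid_uv[..., 0] / (W - 1) * 2 - 1,
                        grid_uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
    return F.grid_sample(J, grid, mode='bilinear', align_corners=True)
```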

EXPERIMENTS
In this section, we first describe the experimental setup and implementation details, and then present qualitative results on three object categories (human faces, cat faces and synthetic cars), followed by extensive ablation studies and analyses. We also report comparisons with several state-of-the-art methods, both qualitatively and quantitatively. Finally, we provide a discussion of the limitations of our method.

Datasets
We test our method on three human face datasets: CelebA [75], 3DFAW [76], [77], [78], [79] and BFM [11]. CelebA is a large-scale human face dataset, consisting of over 200k images of real human faces in the wild annotated with bounding boxes. 3DFAW contains 23k images with 66 3D keypoint annotations, which we use to evaluate our 3D predictions in Section 4.4. We roughly crop the images around the head region using MTCNN [80] and use the official train/val/test splits. BFM (Basel Face Model) is a synthetic face model, which we use to assess the quality of the 3D reconstructions (since the in-the-wild datasets lack ground truth). We follow the protocol of [19] to generate a dataset, sampling shapes, poses, textures, and illumination randomly. We use images from the SUN Database [81] as backgrounds and save ground-truth depth maps for evaluation.
We also test our method on cat faces and synthetic cars. We use two cat datasets [82], [83]. The first one has 10k cat images with nine keypoint annotations, and the second one is a collection of dog and cat images, containing 1.2k cat images with bounding box annotations. We combine the two datasets and crop the images around the cat heads. For cars, we render 35k images of synthetic cars from ShapeNet [84] with random viewpoints and illumination. We randomly split the images by 8:1:1 into train, validation and test sets.

Metrics
Since the scale of 3D reconstruction from projective cameras is inherently ambiguous [35], we discount it in the evaluation. Specifically, given the depth map $d$ predicted by our model in the canonical view, we warp it to a depth map $\bar{d}$ in the actual view using the predicted viewpoint and compare the latter to the ground-truth depth map $d^*$ using the scale-invariant depth error (SIDE) [85], $E_{\mathrm{SIDE}}(\bar{d}, d^*) = \left(\frac{1}{WH}\sum_{uv}\Delta_{uv}^2 - \left(\frac{1}{WH}\sum_{uv}\Delta_{uv}\right)^2\right)^{1/2}$ (10), where $\Delta_{uv} = \log \bar{d}_{uv} - \log d^*_{uv}$. We compare only valid depth pixels and erode the foreground mask by one pixel to discount rendering artefacts at object boundaries. Additionally, we report the mean angle deviation (MAD) between normals computed from the ground-truth depth and from the predicted depth, measuring how well the surface is captured.
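For reference, the two metrics can be sketched as follows; tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def side(d_pred, d_gt, mask):
    # Scale-invariant depth error (Eq. (10)) over valid foreground pixels.
    # d_pred, d_gt: H x W depth maps; mask: H x W boolean foreground mask.
    delta = (d_pred.log() - d_gt.log())[mask]
    return (delta.pow(2).mean() - delta.mean().pow(2)).sqrt()

def mad(n_pred, n_gt, mask):
    # Mean angle deviation (in degrees) between normal maps (3 x H x W each).
    cos = F.cosine_similarity(n_pred, n_gt, dim=0).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos[mask])).mean()
```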

Implementation Details
The function $(d, a, w, l, \sigma) = F(I)$ that predicts depth, albedo, viewpoint, lighting, and confidence maps from the image I is implemented using individual neural networks. The depth and albedo are generated by encoder-decoder networks, while viewpoint and lighting are regressed using simple encoder networks. The encoder-decoders do not use skip connections because input and output images are not spatially aligned (since the output is in the canonical viewpoint). All four confidence maps are predicted using the same network, at different decoding layers for the photometric and perceptual losses, since these are computed at different resolutions. The final activation function is tanh for depth, albedo, viewpoint and lighting, and softplus for the confidence maps. The depth prediction is centered on the mean before tanh, as the global distance is estimated as part of the viewpoint. We do not use any special initialization for the predictions, except that two border pixels of the depth maps on both the left and the right are clamped at a maximal depth to avoid boundary issues.
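A sketch of the output activations described above; the rescaling constants and output ranges are illustrative assumptions, not the exact values used in our implementation.

```python
import torch
import torch.nn.functional as F

def postprocess(raw_depth, raw_albedo, raw_view, raw_light, raw_conf):
    # Illustrative output activations for the prediction heads.
    depth = torch.tanh(raw_depth - raw_depth.mean(dim=(2, 3), keepdim=True))  # centred, then tanh
    albedo = torch.tanh(raw_albedo) * 0.5 + 0.5                               # map to [0, 1]
    view = torch.tanh(raw_view)                                               # rotation angles and translations
    ks_kd = torch.tanh(raw_light[:, :2]) * 0.5 + 0.5                          # ambient / diffuse weights in [0, 1]
    l_xy = torch.tanh(raw_light[:, 2:])                                       # light direction components
    conf = F.softplus(raw_conf) + 1e-6                                        # strictly positive confidence maps
    return depth, albedo, view, (ks_kd, l_xy), conf
```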
We train using Adam over batches of 64 input images, resized to 64 × 64 pixels. The size of the output depth and albedo is also 64 × 64. We train for approximately 50k iterations. For visualization, depth maps are upsampled to 256 × 256. We include more details in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3076536.

Reconstruction Results
In Fig. 3 we show reconstruction results for human faces from CelebA and 3DFAW, cat faces from [82], [83] and synthetic cars from ShapeNet. The 3D shapes are recovered with high fidelity. The reconstructed 3D faces, for instance, contain fine details of the nose, eyes and mouth even in the presence of extreme facial expressions.

Generalization to Paintings
To further test generalization, we applied our model trained on the CelebA dataset to a number of paintings and cartoon drawings of faces collected from [86] and the Internet. As shown in Fig. 4, our method still works well even though it has never seen such images during training. It is worth noting that since the model is trained using real face images, the reconstructions seem to also be more "realistic" faces reflecting the prior learned during training.

Relighting
A by-product of our reconstruction framework is that it learns to disentangle albedo and shading from a single image, without any external supervision at all. This is possible by leveraging the symmetry assumption on the albedo as well as the categorical prior imposed by the training set.
Decomposing the albedo map enables realistic graphics editing applications, such as re-rendering the object under different lighting conditions, as illustrated in Fig. 5.

Inference on Video Frames
We can also apply our trained model to video sequences frame-by-frame. To demonstrate this, we take speech clips from VoxCeleb [87] and crop the faces using MTCNN [80] (we use the implementation from https://github.com/timesler/facenet-pytorch). We then feed the crops to our model to produce a 3D reconstruction of the faces and render them from novel viewpoints, as shown in Fig. 6. Note that our model does not use videos for training, yet it produces temporally consistent and accurate reconstruction results by simply processing the frames independently.

Symmetry and Asymmetry Detection
Since our model predicts a canonical view of the objects that is symmetric about the vertical center-line of the image, we can easily visualize the symmetry plane, which is otherwise nontrivial to detect from in-the-wild images. In Fig. 7, we warp the center-line of the canonical image to the predicted input viewpoint. Our method can detect symmetry planes accurately despite the presence of asymmetric texture and lighting effects. We also overlay the predicted confidence map $\sigma'$ onto the image, confirming that the model assigns low confidence to asymmetric regions in a sample-specific way.

Table 2 uses the BFM dataset to compare the depth reconstruction quality obtained by our method, a fully-supervised baseline and two other baselines. The supervised baseline is a version of our model trained to regress the ground-truth depth maps using an $L_1$ loss. The trivial baseline predicts a constant uniform depth map, which provides a performance lower bound. The third baseline is a constant depth map obtained by averaging all ground-truth depth maps in the test set. Our method largely outperforms the two constant baselines and approaches the results of supervised training.

Ablation
To understand the influence of the individual parts of the model, we remove them one at a time and evaluate the performance of the ablated model in Table 3 and Fig. 8.
In the table, row (1) shows the performance of the full model (the same as in Table 2). Row (2) does not flip the albedo. Thus, the albedo is not encouraged to be symmetric in the canonical space, and the model fails to canonicalize the viewpoint of the object and to use cues from symmetry to recover shape. The performance is as low as that of the trivial baseline in Table 2. Row (3) does not flip the depth, with a similar effect to row (2). In addition, we had to add an L2 smoothness loss on the depth maps during training; otherwise, the model tends to produce noisy depth maps without the symmetry constraint, which leads to heavy occlusions and breaks the training.
Row (4) predicts a shading map instead of computing it from depth and light direction. This also harms performance significantly because shading cannot be used as a cue to recover shape. Moreover, the training often collapses after a few epochs as the model produces spikes in the depth maps that also result in large occlusion. We therefore report the results of the latest epoch prior to collapse.
Row (5) switches off the perceptual loss, which leads to degraded image quality and hence degraded reconstruction results. Row (6) replaces the ImageNet-pretrained image encoder used in the perceptual loss with one trained through a self-supervised task [88], which shows no difference in performance.
Finally, row (7) switches off the confidence maps, using a fixed, uniform value for the confidence; this reduces losses (3) and (9) to the basic $L_1$ and $L_2$ losses, respectively. The accuracy does not drop significantly, as faces in BFM are highly symmetric (e.g. they do not have hair), but its variance increases. To better understand the effect of the confidence maps, we specifically evaluate on partially asymmetric faces using perturbations.

Asymmetric Perturbation
In order to demonstrate that our uncertainty modelling allows the model to handle asymmetry, we add asymmetric perturbations to BFM. Specifically, we generate random rectangular color patches covering 20 to 50 percent of the image size and blend them onto the images with α-values ranging from 0.5 to 1, as shown in Fig. 9. We then train our model with and without confidence maps on these perturbed images, and report the results in Table 4. Without the confidence maps, the model always predicts a symmetric albedo and geometry reconstruction often fails. With our confidence estimates, the model is able to reconstruct the asymmetric faces correctly, with very little loss in accuracy compared to the unperturbed case.
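A sketch of the perturbation used in this experiment; interpreting the 20 to 50 percent range as applying to each side of the patch is an assumption for illustration.

```python
import torch

def add_asymmetric_patch(image):
    # Blend a random rectangular colour patch onto a C x H x W image with an
    # alpha sampled in [0.5, 1], roughly following the perturbation above.
    C, H, W = image.shape
    ph = int(H * (0.2 + 0.3 * torch.rand(1).item()))
    pw = int(W * (0.2 + 0.3 * torch.rand(1).item()))
    y = torch.randint(0, H - ph + 1, (1,)).item()
    x = torch.randint(0, W - pw + 1, (1,)).item()
    alpha = 0.5 + 0.5 * torch.rand(1).item()
    colour = torch.rand(C, 1, 1)
    out = image.clone()
    region = out[:, y:y + ph, x:x + pw]
    out[:, y:y + ph, x:x + pw] = (1 - alpha) * region + alpha * colour
    return out
```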

Training Only on Frontal Faces
Our full training data consists of single-view images of many instances, each captured from a different viewpoint, which essentially compose a large "multi-view" image set, although these are "multi-views" of different instances with different texture and shape. Nonetheless, it would be interesting to know how much this "multi-view" signal contributes to the learning, compared to other cues, such as symmetry and shading.
In order to understand this, we generate another synthetic face dataset consisting of only frontal faces with random texture and shape variations, and train a model on it. We compare the performance of this model to our full model trained on the original dataset of images with various viewpoints in Table 5 and Fig. 10. The model trained on only frontal faces is indeed able to learn the 3D shape of frontal faces, despite producing artifacts and achieving a lower reconstruction accuracy than the full model. This suggests that the symmetry and shading constraints can still provide powerful signals for learning shape, even without viewpoint variation in the training set. However, this model fails to generalize to input faces from other viewpoints.

Training With Fewer Images
As the symmetry assumption and shading seem to provide strong signals for learning the shape, another interesting question to ask is: does the model still need to be trained on a large image collection? To answer this question, we train the model on different numbers of training images, ranging from a single image to the entire training set of 155k images, and compare the results in Fig. 11. When training with 1 image or 100 images, we add an L2 smoothness loss on the depth maps, as the training otherwise collapses due to noisy depth maps.
As shown in Fig. 11, when trained on only 1 image, the model still seems able to pick up some shading and symmetry cues to recover the 3D shape. However, these cues alone cannot provide enough constraints for this heavily ill-posed 2D-to-3D task. Therefore, although the image reconstruction loss is low, the underlying 3D shape is poorly reconstructed. The model only starts to learn reasonable 3D faces when trained on 1000 or more images, which suggests that a sufficiently large image collection is critical for the model to learn a 3D shape prior of the object category.

Mixing Categories
In order to understand whether the model learns different priors for different categories, we further conduct experiments on cross-category inference as well as multi-category training. Fig. 12 shows some examples. We first feed images of human faces to a model trained on cat images and also the other way around. Unsurprisingly, the models trained on a single category learn shape priors specific to that particular category, and tend to reconstruct shapes of the training category, even if the input images depict a different category.
We further consider training the model on a mixture of images from two object categories, which turns out to still be capable of reconstructing both categories, with quality similar to that of the models trained individually on each category. This observation shows promise for learning a general model independent of object categories in the future.

Comparison With the State of the Art
As shown in Table 1, most reconstruction methods in the literature require either image annotations, prior 3D models or both. When these assumptions are dropped, the task becomes considerably harder, and there is little prior work that is directly comparable. Of these, [33] only uses synthetic, texture-less objects from ShapeNet, [8] reconstructs in-the-wild faces but does not report any quantitative results, and [7] reports quantitative results only on keypoint regression, but not on the 3D reconstruction quality. We were not able to obtain code or trained models from [7], [8] for a direct quantitative comparison and thus compare qualitatively.

Qualitative Comparison
In order to establish a side-by-side comparison, we cropped the figures reported in the papers [7], [8] and compare our results with theirs (Fig. 13). Our method produces higher quality reconstructions than both methods, with fine details of the facial expression. The difference is especially noticeable in the recovery of 3D shape for [7], and the shape generation in [8]. Note that [8] uses an unconditional GAN that generates high resolution 3D faces from random noise, and cannot recover 3D shapes from images. The input images for [8] in Fig. 13 were generated by their GAN.

3D Keypoint Depth Evaluation
Next, we compare to the DepthNet model of [9]. This method predicts depth for selected facial keypoints, but uses 2D keypoint annotations as input, a much easier setup than the one we consider here. Still, we compare the quality of the reconstruction of these sparse points obtained by DepthNet and our method. We also compare to the baselines MOFA [90] and AIGN [89] reported in [9]. For a fair comparison, we use their public code, which computes the depth correlation score (between 0 and 66) on the frontal faces. We use the 2D keypoint locations to sample our predicted depth and then evaluate the same metric. The set of test images from 3DFAW and the preprocessing are identical to [9]. Since 3DFAW is a small dataset with limited variation, we also report results with CelebA pre-training. In Table 6 we report the results from their paper and the slightly improved results we obtained from their publicly-available implementation. The paper also evaluates a supervised model using a GAN discriminator trained with ground-truth depth information. While our method does not use any supervision, it still outperforms DepthNet and reaches close-to-supervised performance.

Fig. 11. Training with fewer images. We show a qualitative comparison of the models trained with different numbers of images, which confirms the necessity of training on a sufficiently large image collection.
Fig. 12. Mixing categories. When trained on one single category, the model learns a prior specific to that particular category, whereas when trained on two categories, it is able to reconstruct both categories well.
Fig. 13. Qualitative comparison to SOTA. Compared to [7], [8], our method recovers higher quality shapes.
We also evaluate our method on two benchmarks that provide ground-truth 3D face scans. The first benchmark, by Feng et al. [91], provides a test set consisting of 133 ground-truth 3D scans and 2,000 test images, including 656 high-quality (HQ) images captured in a controlled environment and 1,344 low-quality (LQ) images extracted from videos. The second, the NoW benchmark [21], provides a test set of 1,702 images of 80 subjects and one ground-truth 3D scan per subject. These images exhibit a greater variety of facial expression, occlusion, and lighting than the Feng et al. benchmark.
However, it is important to highlight that these benchmarks are designed specifically for evaluating 3DMM-based face reconstruction methods, and inherently put model-free approaches at a disadvantage. In both benchmarks, only 3D scans of neutral faces are available, and these are used as ground truth for input images spanning different viewpoints and facial expressions and possibly containing occlusion. This gives 3DMM-based methods an advantage over our method, since their output is always constrained to a face model regardless of input variety, whereas our method produces instance-specific reconstructions with different expressions, which are not captured in the ground-truth scans. Our main intention with this evaluation is to establish a fair, quantitative point of reference for future model-free methods, since qualitative comparisons are often subjective and synthetic benchmarks are limited in terms of generalization to real data.
For both datasets, we detect faces and crop the images using MTCNN [80] and obtain 3D mesh reconstructions from the depth maps predicted by our model trained on CelebA. We then use the same evaluation protocol in both benchmarks [21], [91], which aligns the predicted meshes with the ground-truth meshes using a rigid transformation based on 7 pre-defined keypoints and computes the scan-to-mesh distances. We obtain these keypoints on our predicted meshes by applying a facial keypoint detector [92] to the reconstructed canonical images. The average keypoints are used when the keypoint detector fails.
We report the statistics of the distances and compare them with other methods in Tables 7 and 8. Although our model-free unsupervised method does not perform as well as the model-based methods on these benchmarks, it is significantly better than a flat-shape baseline, as shown in Table 7.
Since the NoW dataset provides attributes for the images, we select a subset of the test set that contains 91 frontal neutral faces, which better match the ground-truth scans, and include the results in Table 8. The results on this subset further reduce the gap to the model-based methods.

Limitations
While our unsupervised method is robust in many challenging scenarios (e.g., extreme facial expressions, drawings), we do observe limitations, as shown in Fig. 14. First and foremost, our model relies on the assumption that the object category has a weakly symmetric 3D shape as well as a weakly symmetric albedo. Extending the key insights of this work, including leveraging category priors, other forms of symmetry, and shape from shading in a learning framework, to general objects will require future work.
In this work, we represent shape using a depth map in the canonical (symmetric) viewpoint, which cannot describe the full 3D shape in 360 degrees. Thus, the reconstructed shapes often lack details on the sides. This is particularly evident for the cars, as illustrated in Fig. 14a. One would need to consider using other 3D representations to capture full 3D objects from 360 degrees.
Our model also tends to ignore occluders (Fig. 14b), since the training set does not contain many examples with occlusion. Disentangling dark textures and shading is often difficult. Therefore, the model fails to accurately reconstruct sunglasses (Fig. 14c) and may produce bumpy surfaces when the texture is noisy (Fig. 14d). During training, we assume a simple Lambertian shading model, ignoring shadows and specularity, which leads to inaccurate reconstructions under extreme lighting conditions (Fig. 14e) or highly non-Lambertian surfaces. The reconstruction quality is also lower for extreme poses (Fig. 14f), partly due to poor supervisory signal from the reconstruction loss of side images. This may be improved by imposing constraints from accurate reconstructions of frontal poses.

CONCLUSION
We have presented a method that can learn a 3D model of a deformable object category from an unconstrained collection of single-view images of the object category. The model is able to obtain high-fidelity monocular 3D reconstructions of individual object instances. It is trained with a reconstruction loss and no supervision, resembling an autoencoder. We have shown that symmetry and illumination are strong cues for shape and help the model to converge to a meaningful reconstruction. Our model outperforms a current state-of-the-art 3D reconstruction method that uses 2D keypoint supervision. As for future work, the model currently represents 3D shape from a canonical viewpoint using a depth map, which is sufficient for objects such as faces that have a roughly convex shape and a natural canonical viewpoint. For more complex objects, it may be possible to extend the model to use either multiple canonical views or a different 3D representation, such as a mesh or a voxel map.

Shangzhe Wu received the bachelor's degree in computer science from the Hong Kong University of Science and Technology, where he worked with Chi-Keung Tang and Yu-Wing Tai on image translation. He is currently working toward the DPhil degree with the Visual Geometry Group, University of Oxford, supervised by Andrea Vedaldi. His research focuses on unsupervised 3D understanding. He was the recipient of the Best Paper Award at CVPR 2020.
Christian Rupprecht received the PhD degree from the Technical University of Munich, Germany, advised by Nassir Navab and Gregory D. Hager (JHU). He is currently a postdoctoral researcher with the Visual Geometry Group, University of Oxford. For six months, he was with Chris Pal, Mila Institute, Montreal, working on AI safety. His research interests include self-supervised and minimally supervised learning for computer vision.
Andrea Vedaldi is currently a professor of computer vision and machine learning with the University of Oxford, where he has been co-leading Visual Geometry Group since 2012. He is also a research scientist with Facebook AI Research, London. He has authored or coauthored more than 130 peer-reviewed publications in the top machine vision and artificial intelligence conferences and journals. His research interests include unsupervised learning of representations and geometry in computer vision. He was the recipient of the Mark Everingham Prize for selfless contributions to the computer vision community, the Open Source Software Award by the ACM, and the Best Paper Award from the Conference on Computer Vision and Pattern Recognition.