Face Hallucination Using Cascaded Super-Resolution and Identity Priors

arXiv:1805.10938v2 [cs.CV] 11 Feb 2019

In this paper we address the problem of hallucinating high-resolution facial images from low-resolution inputs at high magnification factors. We approach this task with convolutional neural networks (CNNs) and propose a novel (deep) face hallucination model that incorporates identity priors into the learning procedure. The model consists of two main parts: i) a cascaded super-resolution network that upscales the low-resolution facial images, and ii) an ensemble of face recognition models that act as identity priors for the super-resolution network during training. Different from most competing super-resolution techniques that rely on a single model for upscaling (even with large magnification factors), our network uses a cascade of multiple SR models that progressively upscale the low-resolution images in steps of $2\times$. This characteristic allows us to apply supervision signals (target appearances) at different resolutions and incorporate identity constraints at multiple scales. The proposed C-SRIP model (Cascaded Super Resolution with Identity Priors) is able to upscale (tiny) low-resolution images captured in unconstrained conditions and produce visually convincing results for diverse low-resolution inputs. We rigorously evaluate the proposed model on the Labeled Faces in the Wild (LFW), Helen and CelebA datasets and report superior performance compared to the existing state-of-the-art.


Introduction
Face hallucination represents a domain-specific super-resolution (SR) problem where the goal is to recover high-resolution (HR) face images from low-resolution (LR) inputs [1]. It has important applications in image enhancement, compression and face recognition [2], but also surveillance and security [3,4].
Similar to other single-image super-resolution tasks, face hallucination is inherently ill-posed. Given a fixed image-degradation model, every LR facial image can be shown to have many possible HR counterparts. Thus, the solution space for SR problems is extremely large and existing solutions commonly try to produce plausible reconstructions by "hallucinating" high-frequency information based on the provided LR evidence. While significant progress has been made in recent years in the area of super-resolution and face hallucination [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], super-resolving arbitrary facial images, especially at high magnification factors, is still an open and challenging problem, mainly due to:
- The ill-posed nature of the face hallucination problem, where the solution space is known to grow exponentially with an increase in the desired magnification factor [20]. Even with strong reconstruction constraints it is exceptionally difficult to find good solutions and devise methods that work well under a broad range of conditions. Even for domain-specific SR problems, such as face hallucination, where the solution space is constrained by facial appearances, there is still an overwhelming number of possible solutions.
- The difficulty of learning and integrating strong priors into the face hallucination models that sufficiently constrain the solution space beyond solely the visual quality of the reconstructions. Most of the existing priors utilized for super-resolution relate to specific image characteristics, such as gradient distribution [21], total variation [22], smoothness [23] and the like, and hence focus on the perceptual quality of the super-resolved results. If discernibility of the semantic content is the goal of the SR procedure, such priors may not be the optimal choice, as they are not sufficiently task-oriented.
The outlined limitations are most evident for challenging face hallucination problems where tiny low-resolution images (e.g., 24×24 pixels) of arbitrary characteristics need to be super-resolved at high magnification factors (e.g., 8×). In this paper, we try to address some of these limitations with a new hallucination model built around deep convolutional neural networks (CNNs). Our model, called C-SRIP, uses a Cascade of simple Super-Resolution models (referred to as SR modules hereafter) for image upscaling and Identity Priors in the form of pretrained recognition networks as constraints for the training procedure. The SR modules super-resolve the LR input images in magnification increments of 2× and, consequently, allow for intermediate supervision at every scale. This intermediate supervision confines the explosion of the solution-space size and contributes towards more accurate hallucination results. To preserve identity-related features in the SR images, we incorporate pretrained recognition models into the training procedure, which act as identity constraints for the face hallucination problem. The recognition models are trained to respond only to the hallucinated high-frequency parts of the SR images and ensure that the added facial details are not only plausible, but as close to the true details as possible. Due to the availability of intermediate SR results, we incorporate the identity constraints at multiple scales in the C-SRIP model. Additionally, we introduce a novel loss function derived from the structural similarity index (SSIM, [24]) that provides a stronger error signal for model training than the loss functions commonly used in this area.
Overall, we make three main contributions in this paper:
1. We propose a new CNN-based face hallucination model, C-SRIP, that integrates identity priors at multiple scales into the training procedure of a super-resolution network. To the best of our knowledge, this is the first attempt to exploit multi-scale identity information to constrain the solution space of deep-learning based SR models.
2. We introduce a cascaded SR network architecture that super-resolves images in magnification steps of 2× and offers a convenient and transparent way of incorporating supervision signals at multiple scales. Once trained, the SR network is able to hallucinate tiny unaligned 24 × 24 pixel LR images at magnification factors of 8× and produce realistic and visually convincing hallucination results as illustrated in Fig. 1.
3. We formulate a novel differentiable loss function for SR models based on the concept of structural similarity (SSIM). The novel loss drives our SR model towards solutions of higher perceived quality, as it relates to a measure designed explicitly with the goal of modeling human image-quality perception.

Related work
In this section we discuss recent research on super-resolution and face hallucination with the goal of providing the necessary context for our work. For a more comprehensive coverage the reader is referred to the existing surveys on super-resolution and face hallucination, e.g., [25,26,27,28].
Super-resolution: Recent solutions to the problem of single-image super-resolution (SR) are dominated by learning-based methods that use pairs of corresponding HR and LR images to train machine learning models capable of predicting HR outputs given LR evidence [5,6,7,8,9,10]. The learning procedures used with these models typically aim to minimize an objective function that quantifies the error between the ground truth HR images and the SR predictions. Common objectives in this area include the mean-squared-error (MSE), the mean-absolute-error (MAE) and other related error metrics. Our SR model follows the outlined learning paradigm, but different from existing SR methods, exploits a novel objective related to structural similarity (SSIM, [29]), which better models human image perception than simple pixel-based metrics, such as MSE or MAE.
Our C-SRIP model is based on convolutional neural networks (CNNs) and in this sense is related to recent SR models that exploit CNNs for image upscaling, e.g., [9,6,11,12,14,15,16,17,18,19]. A common aspect of these models is that they super-resolve images in a single step and, while capable of producing impressive SR results, rely only on LR-HR image pairs for training. Our model, on the other hand, upscales the LR inputs in a cascaded manner and allows for supervision signals and constraints to be incorporated at multiple scales during training.
Recent CNN-based SR models, e.g., [6,12], exploit contemporary network architectures such as ResNets [30] and Generative Adversarial Networks (GANs, [31]). These models are closely related to our work, as we also make heavy use of residual connections and incorporate a generative and a discriminative network in our model. While we do not rely on GANs per se, our model does include a discriminative (classification) model that constrains the solution space of the generative SR network. However, our discriminative model is pre-trained and then frozen, and is not optimized in alternation with the generator, which greatly improves training stability and still results in realistic SR outputs.
Our work can also be seen as an extreme case of the perceptual-loss ($\ell_p$) image transformation model from [11], which relies on comparisons of high-level features extracted from a pretrained secondary network as the learning objective for SR, instead of comparisons at the pixel level. Our model follows a similar idea, but uses identity (information at the highest possible semantic level) to constrain the solution space of the generative SR network. Thus, instead of network features, our model considers the outputs of a pretrained network during training.
Face hallucination and identity constraints: Because the solution space of face hallucination models is typically constrained to a set of plausible facial appearances, remarkable performance has been achieved with hallucination models at much higher magnification factors than for general single-image SR tasks [32]. Similarly to other vision problems, the research is moving increasingly towards deep learning and considerable improvements have been achieved recently with CNN-based models, such as [32,33,34,35,36,37,38,39,40]. We contribute to this body of work in this paper with a novel deep face hallucination model. While the SR network of our model is general and applicable to arbitrary input images, we infuse domain-specific knowledge into the model through face recognition models.
It needs to be noted that using identity information as a prior (or constraint) for SR models has been examined before [41,42]. Henning-Yeomans et al. [43], for example, formulated a joint optimization approach that maximized super-resolution and face recognition performance simultaneously. This approach is conceptually similar to our work, but our approach is more general in the sense that it can be applied with any differentiable classification model. The approach from [43] is focused only on linear feature extraction techniques, e.g., PCA [44].
Recent CNN-based face hallucination methods [32] have included secondary networks as constraints, which are trained jointly with the SR network. We found this to decrease training stability, so we instead use separately trained recognition and SR networks, where the former acts as a constraint for the latter.

Proposed method
Our C-SRIP face hallucination model consists of two main components: i) a generative SR network for image upscaling, built around a powerful cascaded residual architecture, and ii) an ensemble of face recognition models that serve as identity priors for the C-SRIP model (see Fig. 2). In the following sections we describe all components of C-SRIP in detail and elaborate on the training procedure used to learn the model parameters.

The cascaded SR network
The generative part of our C-SRIP model is a 53-layer deep convolutional neural network (CNN) that takes an LR facial image as input and super-resolves it at a magnification factor of 8×. The network progressively upscales the images using a cascaded series of so-called SR modules. Each module upscales the image by a factor of 2×, which makes it possible to apply a loss function on the intermediate SR results and ensures better control of the training procedure in comparison to competing solutions that exploit supervision only at the final scale. The cascaded architecture allows us to solve a series of easier and better conditioned problems using repeated bottom-up inference with top-down supervision instead of one complex problem with an overwhelming number of possible solutions. We design our SR network around a fully-convolutional architecture that relies heavily on residual blocks [30] for all processing within one SR module and sub-pixel convolutions [45] for image upscaling. Our design choices are motivated by the success of fully-convolutional CNN models in various vision problems [30,46,47] and the state-of-the-art performance ensured by the sub-pixel convolutions in prior SR work [45,12]. Similarly to [12], the residual blocks of the SR modules consist of two convolution-batch-norm-activation sub-blocks, followed by a post-activation element-wise sum. We ensure a constant memory footprint of all SR modules by decreasing the number of filters in the convolutional layers by a factor of 2 with every upscaling step. This maximizes the capacity of the network and balances the computational complexity across the SR modules. To upscale the feature maps at the output of each SR module, we rely on the sub-pixel convolution layers proposed in [45]. These layers increase the spatial dimensions of the feature maps by reshuffling and aggregating pixels from multiple LR feature maps and, thus, for every upscaling step of 2× reduce the number of available feature maps by a factor of 4. We counteract this effect by doubling the number of filters in the convolutional layer preceding the sub-pixel convolutions and, consequently, ensure that the capacity of the SR modules is not compromised due to the upscaling. After reaching the target resolution, the feature maps are passed through one last residual block and a convolutional layer with 3 output channels that produces the final 8× super-resolved RGB image.
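The channel-to-space rearrangement performed by a sub-pixel convolution layer can be illustrated with a minimal NumPy sketch. It covers only the reshuffling step (the preceding convolution is omitted), and the function name `pixel_shuffle` and the channel ordering are assumptions chosen for illustration rather than details taken from the model:

```python
import numpy as np

def pixel_shuffle(features: np.ndarray, r: int = 2) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) array into a (C, H*r, W*r) array."""
    c_in, h, w = features.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    # Split the channel axis into (C, r, r), then interleave the two
    # r-sized axes with the spatial axes.
    x = features.reshape(c_out, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, r, W, r)
    return x.reshape(c_out, h * r, w * r)

# 1024 feature maps of 24x24 pixels become 256 maps of 48x48 pixels.
lr_maps = np.random.rand(1024, 24, 24)
hr_maps = pixel_shuffle(lr_maps, r=2)
print(hr_maps.shape)  # (256, 48, 48)
```

Each 2× step trades a factor of 4 in channels for a factor of 2 in each spatial dimension, which is why the number of filters is doubled before the reshuffling.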
The network branches off after each SR module to allow for intermediate top-down supervision during training. Each branch applies a series of large-filter convolutions to produce intermediate SR results at different scales (i.e., 2× and 4× the initial scale) that are incorporated into the loss functions discussed in Section 3.3. However, these branches are not used at test time. The entire architecture of our network is illustrated in detail in Fig. 2.

The identity prior
Using prior information to constrain the solution space of SR models during training is a key mechanism in the area of super-resolution [48,22,23,49,50,51,21]. The main motivation for incorporating priors into SR models is to provide a source of additional information for the learning procedure that complements the commonly used reconstruction-oriented objectives and contributes towards sharper and more accurate SR results.
An exceptionally strong prior in this context (also used in our model) is identity. Because identity information relates to the semantic content (i.e., who is in the image) and not the perceptual quality (i.e., how visually convincing is the image) of the SR images, it represents a natural choice for constraining the solution space of SR models. In fact, it seems intuitive to think about SR from both i) an image-enhancement and ii) a content-preservation perspective and to incorporate both views into the SR model for optimal results. While the image-enhancement perspective is covered in our model by a reconstruction-based loss (discussed in Section 3.3), the content-preservation aspect is addressed through an ensemble of CNN-based face recognition models that ensure that identity information is not altered during upscaling.
For C-SRIP we associate each recognition model with one of the SR modules and use it as an identity prior for the corresponding SR output, as illustrated in Fig. 2. Since each SR module can be shown to add only high-frequency details to the input images (see Fig. 3 left), we pretrain all recognition models to respond only to the hallucinated details and ignore the low-resolution content that is shared by the input and SR images (see Fig. 3 right). By focusing exclusively on the added details, we are able to directly link the recognition models to the desired SR outputs and penalize the results in case they alter the facial identity. This mechanism allows us to learn the parameters of the SR network by considering an identity-dependent loss in the overall learning objective. While in principle any differentiable recognition model could be used as the identity prior for our face hallucination model, we select SqueezeNet models for this work [52]. The main reason for our choice is the lightweight architecture of SqueezeNet, which does not impose significant runtime slowdowns due to its relatively small memory and FLOPS footprint.

Training details and SSIM loss
We train the C-SRIP model in two stages. In the first stage, we learn the parameters of the SqueezeNet models for all three SR outputs. In the second stage, we freeze the weights of the recognition models and train the SR network with a combined loss. The details of both stages are presented next.
Recognition-model training. Next to LR and HR image pairs, we also require two intermediate reference images between the lowest and the highest resolution to learn the parameters of the recognition models and SR modules. To this end, we apply a simple degradation model on the available HR images $x_i^{hr}$ and generate $N$ image quadruplets for training, i.e., $\{x_i^{lr}, x_i^{2\times}, x_i^{4\times}, x_i^{hr}\}_{i=1}^{N}$, where $x_i^{lr}$ represents the LR input image, $x_i^{2\times}$ and $x_i^{4\times}$ stand for the intermediate SR results at 2× and 4× magnification factors, respectively, and the HR image $x_i^{hr}$ corresponds to the ground truth for the magnification factor of 8×. Our degradation model uses Gaussian blurring followed by image decimation for down-sampling and produces training data as shown on the left side of Fig. 4.
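The blur-then-decimate degradation can be sketched in NumPy as follows. This is a minimal illustration for single-channel images, not the pipeline used for the experiments: the helper names (`gaussian_blur`, `degrade`, `make_quadruplet`) are hypothetical, and the blur width `sigma = factor / 2` is an assumption, since the text does not state the kernel parameters of the degradation model.

```python
import numpy as np

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of a 2-D image, using reflect padding."""
    radius = max(1, int(np.ceil(3 * sigma)))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img.astype(np.float64), radius, mode="reflect")
    # Filter along rows, then along columns; "valid" undoes the padding.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def degrade(hr: np.ndarray, factor: int) -> np.ndarray:
    """Blur, then keep every `factor`-th pixel (decimation)."""
    sigma = factor / 2.0  # assumed blur width, not given in the text
    return gaussian_blur(hr, sigma)[::factor, ::factor]

def make_quadruplet(hr: np.ndarray) -> dict:
    """Build the LR input and the two intermediate references from one HR image."""
    return {"lr": degrade(hr, 8), "2x": degrade(hr, 4), "4x": degrade(hr, 2), "hr": hr}

hr = np.random.default_rng(0).uniform(0, 255, size=(192, 192))
quad = make_quadruplet(hr)
print({name: image.shape for name, image in quad.items()})
# {'lr': (24, 24), '2x': (48, 48), '4x': (96, 96), 'hr': (192, 192)}
```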
To train the recognition models, we construct residual images that reflect the facial details that need to be learned by the SR modules. The residual images, shown on the right side of Fig. 4, are computed by smoothing the ground truth images with a Gaussian kernel $g$ and subtracting the smoothed image from the original, i.e., $\Delta x_i^j = x_i^j - g * x_i^j$, for $j \in \{2\times, 4\times, hr\}$, where values of $\sigma_{2\times} = 1/3$, $\sigma_{4\times} = 1$ and $\sigma_{8\times} = 7/3$ are used with images at 2×, 4×, and 8× the LR image size, respectively. We train the SqueezeNet models based on the generated residual images using categorical cross-entropy $L_{CE}$:

$$L_{CE}(\theta_{SN}, \Delta x) = -\sum_{k=1}^{K} p_{\Delta x}(k) \log \hat{p}_{\Delta x}(k), \qquad (1)$$

where $p_{\Delta x}$ denotes the ground truth class probability distribution of the residual image $\Delta x$ (i.e., $p_{\Delta x} \in \{0, 1\}^K$ is a class-encoded one-hot vector), $\hat{p}_{\Delta x} \in \mathbb{R}^K$ stands for the output probability distribution produced by SqueezeNet's softmax layer based on $\Delta x$, $K$ stands for the number of classes in the training data and $\theta_{SN}$ represents the parameters of the network. We learn the parameters of all three recognition models through backpropagation by minimizing the $L_{CE}$ loss over the training dataset, i.e., $\hat{\theta}_{SN}^{j} = \arg\min_{\theta_{SN}^{j}} E_{\Delta x^j}\left[L_{CE}(\theta_{SN}^{j}, \Delta x^j)\right]$. The result of this first training stage are three SqueezeNet face recognition models $\hat{\theta}_{SN}^{2\times}$, $\hat{\theta}_{SN}^{4\times}$, $\hat{\theta}_{SN}^{hr}$, one for each image resolution, that respond only to the hallucinated facial details and serve as identity constraints for the SR network.
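As a small worked example of the two ingredients above, the following NumPy sketch computes a residual image and evaluates the categorical cross-entropy for a one-hot target. The recognition model itself is replaced by a hypothetical stand-in (all-zero logits), and the helper names are assumptions made for illustration; only the σ values are taken from the text.

```python
import numpy as np

SIGMAS = {"2x": 1 / 3, "4x": 1.0, "hr": 7 / 3}  # sigma values quoted in the text

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of a 2-D image, using reflect padding."""
    radius = max(1, int(np.ceil(3 * sigma)))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img.astype(np.float64), radius, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def residual_image(x: np.ndarray, scale: str) -> np.ndarray:
    """High-frequency residual: the image minus its Gaussian-smoothed version."""
    return x - gaussian_blur(x, SIGMAS[scale])

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

def cross_entropy(p_true: np.ndarray, p_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Categorical cross-entropy between a one-hot target and a prediction."""
    return float(-np.sum(p_true * np.log(p_pred + eps)))

x = np.random.default_rng(0).uniform(0, 255, size=(48, 48))
delta = residual_image(x, "2x")

logits = np.zeros(4)                      # stand-in for a classifier output on `delta`
one_hot = np.array([0.0, 0.0, 1.0, 0.0])  # ground truth identity among K = 4 classes
loss = cross_entropy(one_hot, softmax(logits))
print(round(loss, 4))  # 1.3863, i.e. ln(4) for a uniform prediction
```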
SR network training. Standard reconstruction-oriented loss functions used for learning SR models, such as MSE or MAE, are known to produce overly smooth and often blurry SR results [12]. We therefore design a new loss function for our SR network around the structural similarity index (SSIM, [29]), and integrate it directly into our learning algorithm. Specifically, we use our SSIM approximation as a loss function for the C-SRIP hallucination model.
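Given a ground truth image $x$ and the corresponding SR network prediction $\hat{x} = f_{\theta_{SR}}(x)$, we compute the SSIM-based loss $L_{SSIM}(\theta_{SR}, x)$ as an expectation over a dissimilarity map derived from $\widehat{SSIM}(x, \hat{x})$, where the SR network $f$ is parametrized by $\theta_{SR}$, $E_x[\cdot]$ stands for the expectation operator over the spatial coordinates and $\widehat{SSIM}(x, \hat{x})$ is a spatial similarity map between $x$ and $\hat{x}$ defined as:

$$\widehat{SSIM}(x, \hat{x}) = \frac{(2\mu_x \odot \mu_{\hat{x}} + C_1) \odot (2\sigma_{x\hat{x}} + C_2)}{(\mu_x \odot \mu_x + \mu_{\hat{x}} \odot \mu_{\hat{x}} + C_1) \odot (\sigma_x^2 + \sigma_{\hat{x}}^2 + C_2)},$$

where $\mu_x = g * x$, $\sigma_x^2 = g * (x \odot x) - \mu_x \odot \mu_x$ and $\sigma_{x\hat{x}} = g * (x \odot \hat{x}) - \mu_x \odot \mu_{\hat{x}}$ (with $\mu_{\hat{x}}$ and $\sigma_{\hat{x}}^2$ defined analogously) and the division is applied element-wise. In the above equations, $*$ denotes the convolution operator, $\odot$ denotes the Hadamard product, and the open parameters, $g$, $C_1$ and $C_2$, are defined as per the SSIM reference implementation provided by the authors of [24], i.e., $g$ is an 11 × 11 Gaussian kernel with $\sigma = 1.5$ and $C_1 \approx 6.55$, $C_2 \approx 58.98$.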
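A minimal NumPy sketch of an SSIM similarity map and a loss built on top of it is given below. The kernel size, σ and the constants follow the values quoted above; everything else is an assumption made for illustration: the map uses the standard SSIM definition, borders are handled with reflect padding, and the loss is taken to be the spatial mean of `1 - SSIM`, which may differ from the exact scaling used for C-SRIP.

```python
import numpy as np

C1, C2 = 6.55, 58.98  # constants quoted in the text

def gauss_filter(img: np.ndarray, size: int = 11, sigma: float = 1.5) -> np.ndarray:
    """Filter a 2-D image with a separable size x size Gaussian (reflect padding)."""
    t = np.arange(size) - size // 2
    kernel = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img.astype(np.float64), size // 2, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def ssim_map(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-pixel SSIM between two images of equal size (standard definition)."""
    mu_x, mu_y = gauss_filter(x), gauss_filter(y)
    var_x = gauss_filter(x * x) - mu_x * mu_x
    var_y = gauss_filter(y * y) - mu_y * mu_y
    cov_xy = gauss_filter(x * y) - mu_x * mu_y
    numerator = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    denominator = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return numerator / denominator

def ssim_loss(x: np.ndarray, y: np.ndarray) -> float:
    """Assumed loss form: average dissimilarity over all spatial positions."""
    return float(np.mean(1.0 - ssim_map(x, y)))

rng = np.random.default_rng(0)
target = rng.uniform(0, 255, size=(48, 48))
prediction = np.clip(target + rng.normal(0, 20, size=target.shape), 0, 255)
print(ssim_loss(target, target), ssim_loss(target, prediction))
```

Because every operation is a convolution or an element-wise product, the same computation can be expressed with differentiable layers in a deep learning framework.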
In Fig. 5, we present error maps generated when comparing images of different resolutions with the ground truth based on squared differences (center) and our $\widehat{SSIM}$ approximation (right). The examples show that the SSIM approximation results in error maps that are less sparse compared to the squared differences used with MSE-based losses, which, as we discuss in the experimental section, results in better training characteristics.
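Based on the pretrained SqueezeNet models and the loss introduced above, we define the overall loss of our C-SRIP face hallucination model as a combination of the SSIM-based loss and the cross-entropy loss of the recognition models, controlled by the parameter α (Eq. (4)).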
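A combined objective of this kind can be sketched as a weighted sum over the three output scales. The form below is an assumption based on the surrounding definitions (one reconstruction term and one identity term per scale, balanced by a single weight α) and should not be read as a verbatim statement of the model's objective; $\Delta\hat{x}^{j}$ denotes the residual image of the SR prediction at scale $j$, a symbol introduced here for illustration.

```latex
L(\theta_{SR}) = \sum_{j \in \{2\times,\, 4\times,\, hr\}}
  \Big[ L_{SSIM}\big(\theta_{SR}, x^{j}\big)
        + \alpha \, L_{CE}\big(\hat{\theta}^{j}_{SN}, \Delta\hat{x}^{j}\big) \Big]
```

Under this reading, the recognition-model parameters $\hat{\theta}^{j}_{SN}$ stay fixed and gradients flow only into $\theta_{SR}$.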

Implementational details
Recognition models. All three SqueezeNet models are implemented in accordance with the so-called complex SqueezeNet architecture from [52]. The models consist of 9 fire modules with intermediate shortcut connections, followed by a global average pooling layer and a softmax classifier on top. We train the first recognition model to classify residual images at 2× the initial LR scale, i.e., 48 × 48 pixels, the second to classify images at 4× the initial scale, i.e., 96 × 96 pixels, and the last for recognition of residual images of 192 × 192 pixels in size. To learn the model parameters we use backpropagation and the Adam [53] minibatch gradient descent algorithm, with a batch size of 128 and an initial learning rate of $10^{-4}$. The learning rate is multiplied by a factor of 1/3 every 20 epochs. To avoid over-fitting, we resort to data augmentation in the form of random horizontal flipping and random crops. We employ an early stopping criterion based on accuracy improvements on the validation set. If no improvements are observed over 10 consecutive training epochs we stop the learning procedure and assume the recognition model has converged.
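The learning-rate schedule and the stopping rule described above can be sketched in plain Python. The names `recognition_lr` and `EarlyStopping` are hypothetical, and the sketch assumes a 0-based epoch counter with the first reduction taking effect at epoch 20:

```python
def recognition_lr(epoch: int, base_lr: float = 1e-4) -> float:
    """Learning rate for a 0-based epoch index: multiplied by 1/3 every 20 epochs."""
    return base_lr * (1.0 / 3.0) ** (epoch // 20)

class EarlyStopping:
    """Stop once validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=10)
history = [0.10, 0.20, 0.25] + [0.25] * 10  # accuracy plateaus after epoch 2
stopped_at = next(i for i, acc in enumerate(history) if stopper.step(acc))
print(stopped_at)  # 12
```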
The SR network. The SR network consists of three SR modules that are preceded by a convolutional layer with 512 large-scale filters of size 9 × 9 pixels. The SR modules are implemented with p = 7 residual blocks that contain 512 filters in the first SR module, 256 filters in the second SR module, and 128 filters in the last SR module, as shown in Fig. 2. We set the number of filters for the final convolutional layer of the SR modules to 1024 for the first, 512 for the second and 256 for the third module. All filters are of size 3 × 3 pixels. For the activations, we use Leaky Rectified Linear Units (LReLU). The last residual block of the SR network has 128 filters of 3 × 3 pixels in size. Before generating SR results at the output of the network and in the off-branches, a convolutional layer with three 9 × 9 filters is used, followed by a clipping layer to ensure that the SR RGB images are within the valid intensity range of [0, 255].
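The filter counts above follow a simple pattern that can be traced with a few lines of Python. The sketch only does the bookkeeping (no convolutions are computed), the function name `trace_sr_modules` is hypothetical, and the number of maps after each reshuffling is derived from the factor-of-4 channel reduction of a 2× sub-pixel convolution:

```python
def trace_sr_modules(lr_size: int = 24, first_filters: int = 512, n_modules: int = 3):
    """Return (module, residual filters, pre-shuffle filters, maps after shuffle, output size)."""
    rows = []
    size, filters = lr_size, first_filters
    for module in range(1, n_modules + 1):
        pre_shuffle = 2 * filters        # filters doubled before the sub-pixel convolution
        after_shuffle = pre_shuffle // 4  # a 2x reshuffle divides the channel count by 4
        size *= 2                         # spatial size doubles
        rows.append((module, filters, pre_shuffle, after_shuffle, size))
        filters //= 2                     # next module uses half as many filters
    return rows

for row in trace_sr_modules():
    print(row)
# (1, 512, 1024, 256, 48)
# (2, 256, 512, 128, 96)
# (3, 128, 256, 64, 192)
```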
We train the SR network based on the objective in Eq. (4) that considers the novel SSIM-based loss as well as the recognition performance of the SqueezeNet models. We keep the parameters of the recognition models fixed and learn only the parameters of the SR network of C-SRIP with a value of α = 0.001. We again use backpropagation and the Adam [53] minibatch gradient descent algorithm for training. Due to the large memory footprint of the SR network and the face recognition models, we use a relatively small batch size of 8. We set the initial learning rate to $\frac{10}{3} \times 10^{-3}$ and multiply it by 1/3 at the end of epochs 10, 25, 50 and 80. We use a combined early stopping criterion that assumes the model has converged if both SSIM and MSE show no improvements over 10 epochs.
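The milestone-based schedule and the combined stopping rule can be sketched as follows. The sketch assumes a 1-based epoch counter, reads the initial learning rate as 10/3 × 10⁻³, and interprets "no improvements" as neither metric beating its earlier best within the last 10 epochs; the function names are hypothetical.

```python
from bisect import bisect_right

MILESTONES = (10, 25, 50, 80)  # epochs after which the rate is reduced
BASE_LR = 10.0 / 3.0 * 1e-3    # assumed reading of the initial learning rate

def sr_lr(epoch: int) -> float:
    """Learning rate for a 1-based epoch; reductions apply after a milestone epoch ends."""
    n_reductions = bisect_right(MILESTONES, epoch - 1)
    return BASE_LR * (1.0 / 3.0) ** n_reductions

def should_stop(ssim_history, mse_history, patience: int = 10) -> bool:
    """True if neither validation SSIM nor MSE improved during the last `patience` epochs."""
    if len(ssim_history) <= patience:
        return False
    best_ssim_before = max(ssim_history[:-patience])
    best_mse_before = min(mse_history[:-patience])
    ssim_stalled = max(ssim_history[-patience:]) <= best_ssim_before
    mse_stalled = min(mse_history[-patience:]) >= best_mse_before
    return ssim_stalled and mse_stalled

print([round(sr_lr(e), 6) for e in (1, 10, 11, 26, 51, 81)])
```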

Datasets and model training
We select two datasets for our experiments. To train the C-SRIP model we use the CASIA WebFace dataset [54], which features 494,414 images of 10,575 identities (i.e., N = 494,414; K = 10,575). The CASIA WebFace images are blurred and sub-sampled to produce the necessary image quadruplets for training and employed for learning the parameters of the recognition models and the SR network (see Fig. 4 for an illustration of the training-data generation process). For testing, we use the Labeled Faces in the Wild (LFW) [55] dataset with 13,233 facial images of 5,749 subjects. The two datasets are selected for the experiments because they feature images of variable quality captured in unconstrained conditions and thus represent a significant challenge for SR models. More importantly, they are designed to contain zero overlap in terms of identity, which is paramount to ensure a fair and unbiased evaluation of the C-SRIP model.
For SqueezeNet training we randomly sample identities from CASIA WebFace and utilize 90% of the images for training and 10% for validation. The recognition models converge to rank-one recognition rates of 0.5138 (0.2974†) with 48 × 48 px images, 0.7215 (0.4266†) with 96 × 96 px images and 0.8569 (0.5713†) with 192 × 192 px residual images on the training (†validation) data. As expected, the performance decreases with a decreasing size of the residual images and is adversely affected by the lack of low-frequency information during training (see, e.g., [56] for the expected performance of SqueezeNet for face recognition). Nevertheless, the models contribute towards accurate and visually convincing SR results, as evidenced by the results in the next sections. Since we also need identity information when learning the parameters of the SR network of C-SRIP, we again use the 90%/10% data split per identity for training and validation. With this setup we train the SR network on 494,414 CASIA WebFace images.
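A per-identity 90%/10% split can be sketched with the Python standard library as shown below. The function name `split_per_identity`, the fixed seed and the rounding rule (at least one training image per identity) are assumptions made for illustration:

```python
import random
from collections import defaultdict

def split_per_identity(labels, train_fraction: float = 0.9, seed: int = 0):
    """Split sample indices into train/validation lists separately for every identity."""
    by_identity = defaultdict(list)
    for index, identity in enumerate(labels):
        by_identity[identity].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for identity in sorted(by_identity):
        indices = by_identity[identity]
        rng.shuffle(indices)
        # Keep at least one training image even for very small identities.
        n_train = max(1, round(train_fraction * len(indices)))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:])
    return train, val

labels = ["id_a"] * 10 + ["id_b"] * 20  # toy identity labels
train_idx, val_idx = split_per_identity(labels)
print(len(train_idx), len(val_idx))  # 27 3
```

Splitting within each identity keeps every class present in both subsets, which the closed-set cross-entropy training of the recognition models requires.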
We train all models on a workstation with two Nvidia GTX Titan Xp GPUs. On this hardware, the SqueezeNet training takes 1, 2, and 5 days, respectively, for the 2×, 4× and 8× scale models. The training of the SR network with the identity constraints included takes around 8 days. Once trained, the SR network is capable of processing images at an average speed of 15 ms/image on GPU in batch mode, or 30 ms/image in real-time (i.e., single-sample batch) mode.

Comparison to the state-of-the-art
We compare our C-SRIP model with 6 state-of-the-art SR and face hallucination models, i.e.: the Naive Bayes Super-Resolution Forest (NBSRF) from [10], the Super-Resolution Convolutional Neural Network (SRCNN) from [9], the Very Deep Super Resolution Network (VDSR) from [6], the perceptual-loss based SR model ($\ell_p$) from [11], the Super-Resolution Generative Adversarial Network (SRGAN) from [12], and the Ultra Resolving Discriminative Generative Network (URDGN) from [32]. We train all models with the same data as C-SRIP and use open-source implementations of the authors (where available) for a fair comparison. For $\ell_p$ we use features from the fire2, fire3 and fire4 layers of SqueezeNet for the learning criterion. We include results for bicubic interpolation as a baseline.
Qualitative comparison. A few sample SR images are presented in Fig. 6. We see that with magnification factors of 8×, interpolation methods are insufficient and result in the loss of facial details. Furthermore, general SR models, such as NBSRF, SRCNN and VDSR, fail to provide substantial improvements and are seen to amplify noise present in the LR images. These models fail to make use of the available facial context due to their relatively small receptive fields. The SRGAN, URDGN and $\ell_p$ models improve on this by including secondary networks as constraints during SR training. $\ell_p$ is consistently the best-performing competing model in our comparison, only slightly behind C-SRIP. However, we notice it often adds high-frequency noise when trying to minimize the perceptual loss of the convolutional maps of the secondary network. We speculate the reason our model is not susceptible to these errors is the global cross-entropy loss of the secondary networks as opposed to the local convolutional features exploited by $\ell_p$.
Quantitative comparison. We report average peak-signal-to-noise-ratio (PSNR) and structural similarity (SSIM) scores computed over the LFW images for all tested models in Table 1. C-SRIP results in the best overall performance in terms of PSNR and SSIM, followed by $\ell_p$ and URDGN. While providing reasonably convincing visual results, SRGAN produces only an average PSNR score and the lowest SSIM score among all tested models. This result is expected and is observed regularly in the literature [12] with GAN-based SR methods. NBSRF, SRCNN and VDSR improve upon the bicubic baseline in terms of performance metrics, but are less competitive in comparison to the three top performers of our experiments.
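For reference, PSNR is computed from the mean squared error between a ground truth image and its reconstruction. The NumPy sketch below assumes a peak intensity of 255, in line with the [0, 255] range of the SR outputs; the function name `psnr` is chosen for illustration:

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((8, 8))
print(round(psnr(reference, reference + 1.0), 2))  # 48.13 (MSE = 1)
print(psnr(reference, reference + 255.0))          # 0.0 (MSE = 255^2)
```

Since PSNR decreases monotonically as MSE grows, a model trained to minimize MSE is also directly trained to maximize PSNR.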
The summary statistics in Table 1 show only a partial picture of the performance of the tested models. To get better insight into the performance we present Cumulative Score (PSNR and SSIM) Distribution (CSD) curves of the experiments in Fig. 8. Since SR models are increasingly focusing on learning-based techniques, which are expected to perform inconsistently across images of different characteristics, CSD curves provide a reasonable way of visualizing this performance variability. From the presented curves we see that all tested methods vary significantly in PSNR and SSIM scores across the LFW dataset, with a large fraction of images producing sub-average performance scores. The $\ell_p$ and the proposed C-SRIP models are superior to other models and very close in terms of the PSNR-based CSD curve. However, the difference becomes significantly larger with the SSIM-based CSD curve, where C-SRIP is the top performer.
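A cumulative score distribution can be computed from per-image scores in a few lines of NumPy. The sketch assumes the convention that the curve reports, for each threshold, the fraction of test images whose score is less than or equal to that threshold; the convention used for the figure may differ, and the function name is hypothetical.

```python
import numpy as np

def cumulative_score_distribution(scores, thresholds) -> np.ndarray:
    """Fraction of images whose score is <= each threshold."""
    ordered = np.sort(np.asarray(scores, dtype=np.float64))
    counts = np.searchsorted(ordered, thresholds, side="right")
    return counts / ordered.size

per_image_psnr = [20.0, 25.0, 30.0, 35.0]  # toy per-image scores
curve = cumulative_score_distribution(per_image_psnr, [19.0, 25.0, 40.0])
print(curve)  # [0.  0.5 1. ]
```

Under this convention, a curve that lies further to the right indicates a model that reaches higher scores on a larger share of the images.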

Ablation study
We perform an ablation study with the goal of assessing the contribution of the individual components of our proposed C-SRIP model. Towards this end, we train a series of model variants and evaluate their performance on the LFW dataset. The results of the ablation study in Table 4 and the corresponding sample images in Fig. 9 show that each added component improves performance. The only decrease we see is when we switch from the MSE loss to the SSIM-based loss, which slightly lowers the average PSNR score, but results in a higher SSIM score. This result is expected, as PSNR is a monotonically decreasing function of MSE and, thus, SR models optimizing for MSE typically achieve higher PSNR values than models using other loss functions. Nevertheless, we observe much better training characteristics with the SSIM loss, since the models converged faster and achieved significantly better SSIM and MSE scores on the training and validation data than the MSE-based models. Among the evaluated components, we see the biggest increase in the PSNR and SSIM scores with the multi-scale identity supervision. This addition also results in the biggest visual improvement of the SR images as seen in Fig. 9.

Limitations of C-SRIP
To evaluate the weaknesses of the proposed C-SRIP model, we examine a few example images that result in the worst SR results according to the SSIM score in Fig. 10. We identify a few potential reasons for the poor SR performance, i.e.:
- High-frequency details. Images 10a, 10b and 10d contain a great amount of high-frequency details (background, hair). Our SR network is guided by the face-recognition models that focus on the face and ignore other regions.
- Significant occlusion. In images 10a and 10f, the face is partially occluded by a foreground object. The occlusion changes the global facial appearance, which adversely affects the reconstruction capabilities of C-SRIP.
- Significant pose variations. In image 10e, the subject's face is partially obscured due to the profile pose. Few samples in our training dataset feature profile poses, which deteriorates performance on this type of facial images.
- Low-quality HR image. Image 10c has a significant amount of noise, which is reduced during down-sampling and cannot be reconstructed.

Conclusion
We have presented a novel CNN-based model for identity-preserving face hallucination from very low-resolution images (i.e., 24×24 pixels) at high magnification factors. We have shown that the proposed model improves SR results on face images, compared to both existing general super-resolution and face hallucination models. In terms of future work, we see the possibility of adapting our model to other modalities, e.g., to video sequences via recurrent attention models.

A Appendix
In this section we present some additional results to further highlight the merits of our C-SRIP model. Similarly to the main paper, we use images from the LFW dataset [55] (down-sampled by smoothing the original HR images followed by sub-sampling) as our test data. All inputs to the C-SRIP model are of size 24×24 pixels.
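The degradation procedure mentioned above (smoothing followed by sub-sampling) can be sketched as follows. The exact smoothing kernel is not specified in this section, so the separable Gaussian, the choice sigma = factor / 2, the border handling (edge replication) and all function names below are assumptions made purely for illustration.

```python
import math


def gaussian_1d(sigma, radius):
    """Normalized 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(img, kernel):
    """Filter every row with the 1-D kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [[sum(kernel[j + radius] * row[min(max(c + j, 0), width - 1)]
                 for j in range(-radius, radius + 1))
             for c in range(width)]
            for row in img]


def transpose(img):
    return [list(col) for col in zip(*img)]


def degrade(hr, factor=8, sigma=None):
    """Smooth an HR image (list of lists) and keep every `factor`-th pixel."""
    sigma = factor / 2.0 if sigma is None else sigma  # assumed, not from the paper
    kernel = gaussian_1d(sigma, radius=int(3 * sigma))
    # Separable filtering: rows first, then columns via a transpose.
    blurred = transpose(smooth_rows(transpose(smooth_rows(hr, kernel)), kernel))
    return [row[::factor] for row in blurred[::factor]]


# A 192 x 192 image reduced by a factor of 8 gives a 24 x 24 input.
hr = [[(r + c) / 382.0 for c in range(192)] for r in range(192)]
lr = degrade(hr)
print(len(lr), len(lr[0]))
```

Smoothing before sub-sampling suppresses high-frequency content that would otherwise alias, which is also why noise in an HR image largely disappears from the LR input.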
We observe that our model achieves realistic SR results even for small magnification factors. That is, even when the images are upscaled to a (still modest) size of 48 × 48 or 96 × 96 pixels, the hallucinated images preserve the identity of the subjects reasonably well, despite the limited performance of the SqueezeNet models at these scales and, consequently, the relatively weak identity constraint applied during training. It needs to be noted that none of the presented subjects has been included in our training data.
[...] generated images. The result of applying such post-processing steps is significantly sharper and crisper SR images. However, in terms of summary statistics (i.e., average SSIM and PSNR scores) these are not competitive with the results reported in the main part of the paper, as the sharpening operation deteriorates (quantitatively measured) performance. In Fig. 13 and Fig. 14 we also include results for some examples that were already presented in the main part of the paper to facilitate implicit comparisons with competing methods. Interestingly, after the post-processing some of the SR images appear sharper than the original HR targets. This can be partially explained by the presence of noise in the target images that is not present in the SR reconstructions and by the higher image contrast after enhancement, which contributes towards the perception of higher-quality images.
A.3 Quantitative results on the impact of the SSIM loss
Next, we present some (additional) quantitative results related to the proposed SSIM loss. Our SSIM formulation uses convolutions with a discrete Gaussian kernel (see Eq. (3)) to approximate the local averages used with the original SSIM and is, therefore, easily implementable using standard deep learning frameworks. As emphasized in the main part of the paper, using the proposed SSIM-based loss results in significantly better training characteristics in terms of faster convergence and higher PSNR and SSIM scores on the training data, as shown in Table 3. Here, the results are presented for the simplest architecture from the ablation study (Section 4.3), where i) the images are processed through a series of 21 residual blocks, ii) all three upscaling layers are placed at the end of the SR network, and iii) supervision is applied only at the output of the model. The proposed SSIM-based loss ensures significantly better performance scores during training. Even though the MSE-based loss is directly tied to the PSNR score, our SSIM-based loss results in a higher average PSNR score on the training data, which suggests that a better optimum was found by the backpropagation-based learning procedure. On the test data the proposed loss still improves on the SSIM score, but offers no improvement in terms of the PSNR value, as shown in Table 4. This is already highlighted in the ablation study of the main part of the paper.
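To illustrate how an SSIM-style loss can be built from Gaussian convolutions, the following is a minimal sketch in plain Python. It is not the exact formulation used for C-SRIP: the 5 × 5 window, sigma = 1.5, the stabilizing constants (the common choices 0.01^2 and 0.03^2 for images in [0, 1]), the "valid" border handling and all function names are assumptions. The loss is taken as one minus the mean of the local SSIM map, in line with the error map 1 − ŜSIM(x, xg) described for Fig. 5.

```python
import math


def gaussian_kernel(size=5, sigma=1.5):
    """Normalized 2-D Gaussian kernel as a list of lists."""
    half = size // 2
    k = [[math.exp(-((i - half) ** 2 + (j - half) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def blur(img, kernel):
    """'Valid' 2-D convolution; the kernel is symmetric, so no flip is needed."""
    n = len(kernel)
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[r + i][c + j]
                 for i in range(n) for j in range(n))
             for c in range(w - n + 1)]
            for r in range(h - n + 1)]


def multiply(a, b):
    """Element-wise product of two images."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def ssim_map(x, y, kernel, c1=0.01 ** 2, c2=0.03 ** 2):
    """Local SSIM values, with Gaussian-weighted means and (co)variances."""
    mu_x, mu_y = blur(x, kernel), blur(y, kernel)
    e_xx = blur(multiply(x, x), kernel)
    e_yy = blur(multiply(y, y), kernel)
    e_xy = blur(multiply(x, y), kernel)
    out = []
    for r in range(len(mu_x)):
        row = []
        for c in range(len(mu_x[0])):
            mx, my = mu_x[r][c], mu_y[r][c]
            var_x = e_xx[r][c] - mx * mx
            var_y = e_yy[r][c] - my * my
            cov = e_xy[r][c] - mx * my
            num = (2 * mx * my + c1) * (2 * cov + c2)
            den = (mx * mx + my * my + c1) * (var_x + var_y + c2)
            row.append(num / den)
        out.append(row)
    return out


def ssim_loss(x, y, kernel):
    """One minus the mean local SSIM; close to 0 for identical images."""
    values = [v for row in ssim_map(x, y, kernel) for v in row]
    return 1.0 - sum(values) / len(values)


kernel = gaussian_kernel()
ramp = [[(r * 8 + c) / 63.0 for c in range(8)] for r in range(8)]
flat = [[0.5] * 8 for _ in range(8)]
print(ssim_loss(ramp, ramp, kernel), ssim_loss(ramp, flat, kernel))
```

Since every step is a convolution or an element-wise operation, the same computation maps directly onto the convolution primitives of a deep learning framework and is differentiable with respect to the reconstructed image.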

Fig. 2. Illustration of the proposed C-SRIP model. The model consists of a generative SR network and an ensemble of face recognition models that serve as identity priors during training. The figure shows all architectural details (best viewed electronically).

Fig. 3. Each SR module adds fine facial details during upscaling (left). The recognition models are pre-trained to respond to these details only (right) and can therefore be used as identity constraints when learning the parameters of the SR network.

Fig. 4. We generate training images for the SR network at four different spatial resolutions (left). For recognition-model training we compute residual images that correspond to the facial details that are hallucinated by the SR modules (right).

Fig. 5. Error maps generated by the squared-error used by the MSE loss and the proposed ŜSIM function (the error map between x and the ground truth xg is defined as 1 − ŜSIM (x, xg)).The figure represents degraded images (left), the corresponding squared-error maps (center) and the error maps generated by ŜSIM (right).

Fig. 6. Qualitative comparison of state-of-the-art SR models on sample images from the LFW dataset. The first column shows the input 24 × 24 pixel LR image (upscaled with nearest neighbor interpolation). Best viewed zoomed in.

Fig. 7. Qualitative comparison of the evaluated SR models on sample images from the LFW dataset with highlighted image details.Best viewed electronically.

Fig. 8. Cumulative Score Distribution (CSD) curves for the PSNR (left) and SSIM (right) scores over the LFW dataset. Curves further to the right are better.

1. Baseline: A baseline SR model without the cascaded SR modules. The model consists of 21 residual blocks, similarly to our C-SRIP model, but the three sub-pixel convolution layers for upscaling are all at the end of the model. The model is trained using the standard MSE loss.
2. B+SSIM: Same as above, but trained with the proposed SSIM-based loss.
3. C+SSIM: Our cascaded SR model, trained with the proposed SSIM-based loss, but without the identity prior networks and without multi-scale supervision, i.e., the loss function is only applied at the output of the model.
4. C+SSIM+M: Our cascaded SR model, trained with multi-scale supervision and the proposed SSIM-based loss function, but without the identity priors.
5. C-SRIP: The C-SRIP model with multi-scale SSIM and identity supervision.
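The sub-pixel convolution layers referred to in the model descriptions above upscale by rearranging r^2 feature maps of size H × W into a single map of size rH × rW. The sketch below shows only this rearrangement step (without the preceding learned convolution) and the resolutions reached by three successive 2× steps from a 24 × 24 input. The function name `pixel_shuffle` and the channel ordering (the same convention as commonly used depth-to-space operators) are assumptions for illustration.

```python
def pixel_shuffle(channels, r=2):
    """Rearrange r*r feature maps (each H x W) into one rH x rW map.

    Channel index i*r + j fills output positions (y*r + i, x*r + j).
    """
    assert len(channels) == r * r, "expected r*r input feature maps"
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for idx, ch in enumerate(channels):
        dy, dx = divmod(idx, r)
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = ch[y][x]
    return out


# Four 1 x 1 feature maps become a single 2 x 2 map.
print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]]))  # [[1, 2], [3, 4]]

# Spatial sizes after each of the three 2x upscaling steps.
sizes = [24]
for _ in range(3):
    sizes.append(sizes[-1] * 2)
print(sizes)  # [24, 48, 96, 192]
```

In the Baseline and B+SSIM variants all three rearrangements happen at the end of the network, whereas the cascaded variants interleave them with the residual blocks, which exposes the intermediate 48 × 48 and 96 × 96 outputs to supervision.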

Fig. 10. Examples of poor SR results obtained with the C-SRIP model according to the SSIM value. The four columns of each image correspond to (from left to right): the input LR image, bicubic interpolation, C-SRIP and the target HR image.

Fig. 11. Illustration of the intermediate results generated with the SR modules of the C-SRIP model. The top row shows the output at a magnification factor of 2× and the bottom row shows the output at 4×. We again use the kXnXsX notation introduced in [12] to denote convolutional layers with n filters of size k × k, applied with stride s.

Fig. 12. Qualitative results for the intermediate scales generated by our C-SRIP model. The columns correspond to (from left to right): the 24 × 24 input image, bicubic interpolation, our results and the ground truth (GT) at either 48 × 48 or 96 × 96 px.

Fig. 13. Qualitative results for SR outputs post-processed with a standard image enhancement technique (i.e., with a sharpening filter). For each 24 × 24 LR input image (on the far left of each quadruplet) the following columns correspond to (from left to right): C-SRIP, C-SRIP with image enhancement, and the target HR image. Best viewed in high resolution.

Fig. 14.Qualitative results for SR outputs post-processed with a standard image enhancement technique (i.e., with a sharpening filter) with highlighted image details.For each 24 × 24 LR input image (on the far left of each quadruplet) the following columns correspond to (from left to right): C-SRIP, C-SRIP with image enhancement and the target HR image.Best viewed in high resolution.

Table 1. Averaged PSNR and SSIM scores for the tested SR models computed over the LFW dataset. The highest PSNR and SSIM values are achieved by C-SRIP.

Table 2. Ablation study on the LFW dataset. The table shows the impact of different model components on the average PSNR and SSIM scores.

Table 3. PSNR and SSIM scores obtained on the training data with the MSE- and SSIM-based losses.

Table 4. Comparison of the PSNR and SSIM scores on the test data obtained with the MSE- and SSIM-based losses.