Continuous Facial Motion Deblurring

We introduce a novel framework for continuous facial motion deblurring that restores the continuous sharp moments latent in a single motion-blurred face image via a moment control factor. Although a motion-blurred image is the accumulated signal of continuous sharp moments during the exposure time, most existing single-image deblurring approaches aim to restore a fixed number of frames using multiple networks and training stages. To address this problem, we propose a continuous facial motion deblurring network based on GAN (CFMD-GAN), a novel framework for restoring the continuous moments latent in a single motion-blurred face image with a single network and a single training stage. To stabilize network training, we train the generator to restore continuous moments in the order determined by our facial motion-based reordering (FMR) process, which utilizes domain-specific knowledge of the face. Moreover, we propose an auxiliary regressor that helps our generator produce more accurate images by estimating the continuous sharp moments. Furthermore, we introduce a control-adaptive (ContAda) block that performs spatially deformable convolution and channel-wise attention as functions of the control factor. Extensive experiments on the 300VW dataset demonstrate that the proposed framework generates various numbers of continuous output frames by varying the moment control factor. Compared with recent single-to-single image deblurring networks trained on the same 300VW training set, the proposed method shows superior performance in restoring the central sharp frame in terms of perceptual metrics, including LPIPS, FID, and ArcFace identity distance. The proposed method also outperforms an existing single-to-video deblurring method in both qualitative and quantitative comparisons.


Introduction
Facial motion deblurring for a single image is a specific but critical branch of image deblurring, aimed at restoring a sharp image latent in a motion-blurred face image. Besides being visually unpleasant, blurry face images also degrade the performance of many face-related computer vision tasks such as face detection [62,73,87], face recognition [14,75], facial emotion recognition [80,91], and facial medical image segmentation [63]. Therefore, face deblurring has received much attention in computer vision and image processing.
Recently, deep neural networks (DNNs) have become widespread in image restoration [15,18,41,88]. In particular, remarkable success has been achieved in single-image face deblurring [11,12,36,46,67,69,81]. Most of these methods recover only a single sharp image from a motion-blurred facial image. However, a motion-blurred image is the integration of continuous sharp moments during the exposure time [18,30]. Thus, recovering such aggregated sharp moments from a blurred image can be considered the ideal goal of single-image deblurring.
(Fig. 1 caption: "GT" denotes the ground-truth sharp frames in the 300VW dataset [65]; "# Fr" in parentheses denotes the number of frames; the results in (e) and (f) are outputs of the same network. By adjusting the control factor value, our single network can restore any number of sharp moments from a given blurry face image. This figure contains videos that are best viewed using Adobe Reader.)

Several methods [1,34,59,86] have been proposed to restore sharp sequences from a blurry image. However, most of these methods have several drawbacks. First, the temporal ordering problem is extremely challenging because it is difficult to uniquely define the temporal order of the motion of an object in a blurry image [1,34,59]. For this reason, most existing methods fail to extract the accurate temporal order, and this temporal ambiguity of the underlying motion in blurry images remains an unsolved issue [1]. Second, as shown in Fig. 1, most existing models aim to restore only a fixed number of frames, owing to their architectural design or training strategies. Jin et al. [34] proposed a cascaded architecture consisting of four deblurring networks. As depicted in Fig. 1a, each network is assigned to restore neighboring frames using the outputs of the previous networks. Thus, this method requires a number of networks proportional to the number of output frames to be extracted. Purohit et al. [59] proposed using a recurrent neural network (RNN) so that various numbers of frames can be handled without architectural changes (Fig. 1b). They first extracted the middle frame using a pretrained deblurring network and then extracted nine frames using an RNN. However, their model is fixed to restore the entire sequence with nine frames, which is the predefined number of RNN iterations in the training phase. Argaw et al.
[1] proposed a single-encoder, multiple-decoder architecture trained in a single training step. However, as shown in Fig. 1c, this architecture requires as many decoders as output frames. Recently, Zhang et al. [86] showed promising results by restoring 42 frames from a blurry image. They trained three generative adversarial networks (GANs) by repeating reblurring and deblurring processes (Fig. 1d). However, they restore a fixed number of frames and require multiple training steps. To address the problems described above, as shown in Fig. 1e, we propose a facial motion-based reordering (FMR) process and a continuous facial motion deblurring network based on GAN (CFMD-GAN), a novel framework for restoring the continuous moments latent in a single motion-blurred face image with a single training stage.
To alleviate the difficulty of resolving temporal ambiguity, we estimate reordered frames instead of frames in the original temporal order. To this end, we apply a facial motion-based reordering (FMR) process, which reorders the frames in the dataset based on the position of the left eye in the face (e.g., from the top-left to the bottom-right position) [72]. This reordering helps stabilize network training.
On the other hand, we introduce CFMD-GAN, which restores sharp moments by varying the continuous moment control factor to estimate frames under a continuous scenario. This approach is primarily inspired by conditional GANs (cGANs) [4,49,51,54,85], which are effective for training generators to synthesize diverse and realistic data conditioned on interpretable information, such as class labels. In our case, a single-image deblurring network serves as the generator, and the conditional information for sharp image generation is the moment control factor. However, we found two main challenges in effectively incorporating cGANs into a single-image deblurring framework. First, most existing cGANs are primarily developed for image synthesis conditioned on discrete labels (e.g., class labels) [16]. In contrast, we aim to restore output images conditioned on a continuous control factor. Unlike most cGANs [20,37,38,54], which use an auxiliary classifier for discrete class labels, we propose an auxiliary regressor to estimate the continuous control factor. This allows the proposed deblurring network to learn image deblurring as a function of the continuous control factor. Second, an effective network module is required to incorporate the control factor into the deblurring network. Most existing single-image deblurring approaches directly learn image-to-image mapping functions without a control factor. Recently, DNN-based controllable image restoration models [5,27,40] have been extensively studied. Generally, these methods use a channel-wise attention module as a function of the control factor to resolve Gaussian blur and noise in static scenes. However, spatially variant blurs in dynamic scenes must also be considered. To this end, we present a control-adaptive (ContAda) block to effectively incorporate a control factor into recent deblurring architectures.
The proposed block learns the modulation weights using a spatially deformable convolution and channel-wise attention as functions of the control factor.
Extensive experiments show that the proposed CFMD-GAN restores continuous sharp moments latent in a blurry face image using a single network and a single training process. Fig. 2 exemplifies our results and compares our method with a previous method [34].
The main contributions of this study are summarized as follows.
• We introduce the FMR process to stabilize the network training. It allows the network to utilize rich and accurate information of the ground-truth frames corresponding to the control factor during training.
• We propose a CFMD-GAN for continuous facial motion deblurring that restores continuous sharp frames latent in a single motion-blurred face image via a moment control factor.
• We present a ContAda block to learn the feature modulation weights of the deblurring network using spatially deformable convolution and channel-wise attention as functions of the control factor.

Related Works
In this section, we briefly review recent single image deblurring methods and conditional GANs, which are closely related to the present work.

Single Image Deblurring
Traditionally, the motion-blur process is formulated as the accumulation of continuous sharp moments that occur during exposure [18,32]. Mimicking this, large-scale deblurring datasets [18,53,68,70] have been constructed by synthesizing blurry images through averaging consecutive sharp frames. By leveraging such datasets, DNN-based methods have become widespread for single-image deblurring. In the following, we introduce existing DNN-based single-image deblurring methods in three categories.
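The averaging-based blur synthesis described above can be sketched in a few lines. This is a minimal sketch, assuming a linear (identity) camera response and grayscale images stored as nested lists; the function name is hypothetical.

```python
# Sketch: synthesizing a motion-blurred image by averaging consecutive
# sharp frames, as done when building deblurring datasets. A linear
# (identity) camera response function is assumed for simplicity.

def synthesize_blur(sharp_frames):
    """Average N sharp frames (H x W grayscale, nested lists) into one blurry image."""
    n = len(sharp_frames)
    h, w = len(sharp_frames[0]), len(sharp_frames[0][0])
    blur = [[0.0] * w for _ in range(h)]
    for frame in sharp_frames:
        for y in range(h):
            for x in range(w):
                blur[y][x] += frame[y][x] / n
    return blur

# Two toy 1x2 "frames": a bright pixel moving one position to the right.
frames = [[[1.0, 0.0]], [[0.0, 1.0]]]
blurry = synthesize_blur(frames)  # -> [[0.5, 0.5]]
```

Averaging destroys the per-frame order, which is exactly the temporal ambiguity discussed later: reversing the frame list produces the same blurry result.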

Single-to-Single, General Deblurring
Single-to-single image deblurring aims to restore a single sharp image when a blurry image of a general scene is given. Earlier studies [6,19,71] estimated the blur kernel using DNNs and obtained the resulting image using deconvolution methods. Chakrabarti et al. [6] proposed a network that predicts the complex Fourier coefficients of a deconvolution filter and applies the predicted filter to the input patch. Sun et al. [71] proposed a deep learning approach that estimates motion-blur kernels from local patches using a Markov random field model. Gong et al. [19] developed a DNN to predict the motion flow from blurred images, which was used to recover deblurred images. Without estimating the deconvolution kernel, Nah et al. [18] utilized a coarse-to-fine network to directly restore a sharp image using their synthesized large-scale dynamic-scene blur dataset. Following the success of [18], variants of coarse-to-fine networks have been proposed, such as multi-recurrent networks [56,74], multi-patch networks [84], and efficient multi-scale networks [10]. Concretely, Tao et al. [74] designed a scale-recurrent network that shares network parameters across scales. Zhang et al. [84] cascaded a multi-patch network to restore sharp images based on different patches. In addition, Cho et al. [10] reduced computational costs by utilizing a U-Net [61]-based architecture that exploits multi-scale features of the input image and outputs.

Single-to-Single, Face Deblurring
Face deblurring is a domain-specific task of single-image deblurring that aims to obtain a sharp face from a blurry face image. Most existing methods utilize strong prior knowledge of the face, such as reference faces [23,55], face landmarks [11,12], face sketches [47], multi-task embedding [69], 3D face models [60], facial parsing maps [46,67,81], and deep feature priors [36]. Specifically, Shen et al. [67] proposed estimating the facial parsing map from the blurry face and then utilizing it to restore the sharp image. To avoid side effects caused by incorrect parsing maps, Yasarla et al. [81] utilized an uncertainty-based multi-stream architecture. Lee et al. [46] proposed restoring the face progressively from large components, such as the skin, to small components, such as the eyes and nose. More recently, Jung et al. [36] utilized the rich information of feature maps extracted from a deep neural network pre-trained on faces.
However, all single-to-single deblurring methods, including the general and facial image domains, focus on restoring only one of the many moments accumulated in the blurred image. Unlike these methods, the proposed method restores various numbers of moments from a blurred image.

Single-to-Video, General Deblurring
Instead of restoring a single output image, single-to-video deblurring aims to predict multiple sharp frames from a single blurred image. The pioneering work of Jin et al. [34] used a sequentially cascaded architecture consisting of multiple networks trained with a corresponding number of training steps. In their method, each network is assigned to predict pre-specified frames among all sharp frames. Thus, this method requires changing the number of networks according to the desired number of output frames and training them from scratch. Purohit et al. [59] proposed a recurrent neural network (RNN)-based method trained in two stages. In the first stage, they trained a video autoencoder to learn motion and frame generation from sharp frames. This addresses the problem of the network size scaling with the number of output frames. However, the model still has to be retrained whenever the number of output frames changes. The method proposed by Zhang et al. [86] was one of the first attempts to restore continuous frames. Their method extracts a total of 42 sharp frames from a blurry image by cascading three GANs trained in three stages. However, this approach is limited to restoring a fixed number of frames. Instead of training the entire model in multiple stages, Argaw et al. [1] proposed a single framework that can be trained in an end-to-end manner. They proposed a feature transformer network consisting of a single encoder and multiple decoders, where each decoder is specified to output a specific frame. Thus, this method still requires changing the number of decoders when the number of output frames changes.
In short, existing studies are inherently limited to restoring only a fixed number of frames, owing to their rigid architectural designs or training strategies. In contrast, the proposed method differs in that 1) it restores continuous sharp frames beyond a fixed number, 2) it uses a single deblurring network with a single training step, and 3) it can be trained in an end-to-end manner.

Conditional Generative Adversarial Networks
Generative Adversarial Networks (GANs) [22] are among the most widely used frameworks in image generation and have been extensively studied over the past few years. Conditional GANs (cGANs) [49] are variants of GANs that synthesize realistic and diverse images using conditional information, such as class labels. Depending on how the framework incorporates the data and class labels, most cGANs can be categorized into classifier-based cGANs [20,37,38,54] and projection-based cGANs [4,25,50,51]. Classifier-based cGANs utilize conditional information (class labels) by training an additional classifier alongside the standard GAN discriminator. Meanwhile, projection-based cGANs use a projection discriminator that takes an inner product between the embedded class labels and the feature vector extracted from the data.
The proposed method draws inspiration from these cGAN approaches. To the best of our knowledge, this is the first attempt to apply continuous conditional information to the deblurring task.

Preliminaries
Generative Adversarial Networks (GANs) [22] are a well-established framework for mimicking the probability distribution of real data by playing a min-max game between a generator G and a discriminator D. Whereas G learns to fool D by generating realistic samples, D learns to classify whether given samples are true data (real) or generated data (fake). Their objective V(G, D) is formulated as follows:

min_G max_D V(G, D) = E_{x∼p(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))],   (1)

where p(x) denotes the real data distribution, and p(z) denotes a pre-defined distribution, e.g., a Gaussian distribution.
A key property of GANs is that a well-trained G successfully captures the data manifold even if there are missing data in the training set [17,21,43]. Conditional GANs (cGANs) [49,51,54] are an extended GAN framework developed for conditional image synthesis. Given pairs of images x and class labels c sampled from the joint distribution of the real dataset, (x, c) ∼ p(x, c), the goal of G is to learn class-conditional image synthesis by utilizing c as an additional input alongside z. Let p_G(x|c) denote the generative distribution specified by G(z, c), and let p_G(x, c) := p_G(x|c)p(c). The objective of generic cGANs [49], V_cGAN(G, D), minimizes the Jensen-Shannon divergence (JSD) between p(x, c) and p_G(x, c):

V_cGAN(G, D) = E_{(x,c)∼p(x,c)}[log D(x, c)] + E_{c∼p(c), z∼p(z)}[log(1 − D(G(z, c), c))].   (2)

As one of the most representative classifier-based cGANs, AC-GAN [54] introduces an auxiliary classifier Q to provide feedback on the class-conditional image synthesis of G. In AC-GAN, D and Q share all weights of the feature extractor except for the final output layer. Let p_Q(c|x) denote the conditional distribution induced by the classifier Q. Then, their loss V_AC-GAN(G, Q, D) can be expressed as follows:

V_AC-GAN(G, Q, D) = E_{x∼p(x)}[log D(x)] + E_{c∼p(c), z∼p(z)}[log(1 − D(G(z, c)))] + λ_c ( E_{(x,c)∼p(x,c)}[log p_Q(c|x)]_(a) + E_{c∼p(c), z∼p(z)}[log p_Q(c|G(z, c))]_(b) ),   (3)

where λ_c is the balancing weight between the GAN loss and the auxiliary classification losses. In Eq. (3), the first two terms are similar to the original GAN loss (Eq. (1)), where D serves as a binary classifier that distinguishes real from fake samples. Terms (a) and (b) are the auxiliary classification losses that enable Q to determine the class labels of the input samples. Through this auxiliary classifier, AC-GAN achieves class-conditional image synthesis.
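As a toy illustration of the AC-GAN objective's structure, the following sketch evaluates the discriminator-side loss for a single sample pair. The probability values are hypothetical stand-ins for D and Q outputs, and the function name is ours.

```python
import math

# Sketch of the AC-GAN discriminator-side objective: a standard real/fake
# GAN loss plus an auxiliary classification term weighted by lambda_c.
# All probabilities below are toy stand-ins for D and Q outputs.

def acgan_d_loss(d_real, d_fake, q_real_correct, q_fake_correct, lambda_c=1.0):
    # Negative log-likelihood form of the real/fake GAN terms
    gan_term = -math.log(d_real) - math.log(1.0 - d_fake)
    # Terms (a) and (b): log-likelihood of the correct class under Q,
    # evaluated on a real sample and on a generated sample
    aux_term = -math.log(q_real_correct) - math.log(q_fake_correct)
    return gan_term + lambda_c * aux_term

loss = acgan_d_loss(d_real=0.9, d_fake=0.1, q_real_correct=0.8, q_fake_correct=0.7)
```

The same auxiliary structure carries over to the proposed method, with the classifier Q replaced by a regressor over the continuous control factor.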

Proposed Method
In this section, we first introduce the facial motion-based reordering (FMR) process, which mitigates the temporal ambiguity problem by utilizing human face information (Sec. 4.1). Next, we provide a detailed explanation of the key components of the proposed CFMD-GAN, which recovers the continuous moments latent in a blurry face image via a moment control factor (Sec. 4.2). Lastly, we introduce the training objectives of the proposed model (Sec. 4.3).

Facial Motion-based Reordering
One of the main challenges in restoring multiple images from a single blurred image is resolving the temporal (sequence) ambiguity of sharp moments. A motion-blurred image is the averaged result of a continuous sharp sequence during the exposure time [18,32]. As averaging destroys the information of the temporal order [1,34,86], reconstructing the original sequence of sharp moments is non-trivial. For example, suppose a blurry facial image and its corresponding original sharp sequence are given, as shown in Fig. 4. The problem is that the same blurry image can be obtained even if the face moves in a reversed or shuffled order during the exposure time. Owing to this ill-posed nature of the temporal ambiguity, finding the underlying sequence of a blurry image remains an unsolved issue [1]. In this regard, previous studies [1,34,59] have found that temporal ambiguity causes unstable network training because it is difficult to uniquely define the temporal sequence of object movements.
To alleviate this, we leverage information about the human face to apply effective yet strong constraints. In a recent study on face landmark detection, Sun et al. [72] proposed defining the intensity of facial motion as the movement of the left eye per unit time. Inspired by this, we devise a facial motion-based reordering (FMR) process that enables the network to restore sharp face images in a generalized order based on the position of the left eye.
Specifically, as depicted in Fig. 4, FMR is a motion-based reordering process applied to the ground-truth (GT) sequences of a training dataset consisting of a single facial motion per video clip. Let S_t be a time-ordered set of GT frames sampled from a high-frame-rate facial video:

S_t = { s[i] }_{i=1}^{N},

where i denotes the frame index and N the total number of frames. Then, a blurry image b ∈ R^{H×W×3} can be approximated by averaging these GT frames as follows:

b ≈ g( (1/N) Σ_{i=1}^{N} s[i] ),

where g(·) denotes the camera response function [18]. We rearrange the frames s[i] according to the position of the left eye (x, y) and assign each reordered frame the value u = (i−1)/N. Then, we can denote the reordered set S_r as follows:

S_r = { s(u) | u ∈ {0, 1/N, …, (N−1)/N} }.

Note that the real number u becomes the moment control factor in the proposed framework. In this study, the network learns to restore frames in the facial motion-based order of S_r. It should be noted that this reordered sequence does not necessarily match the temporal sequence; instead, the proposed framework restores all possible sharp moments latent in a blurry facial image. The FMR process gives the frames in S_r a regularity of facial motion, which helps stabilize network training. The effects of the FMR are analyzed in Sec. 5.
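The reordering step above can be sketched as follows. This is a minimal sketch under stated assumptions: the eye coordinates are hypothetical landmark-detector outputs, sorting by (y, x) is one concrete reading-order choice, and the function name is ours.

```python
# Sketch of the facial motion-based reordering (FMR) idea: frames are
# sorted by left-eye position (top-left to bottom-right reading order)
# rather than kept in temporal order, and the i-th reordered frame
# (1-based) is assigned the control factor u = (i - 1) / N.

def fmr_reorder(frames_with_eye_xy):
    """frames_with_eye_xy: list of (frame_id, (eye_x, eye_y)) tuples."""
    # Sort by y first, then x: top-left positions come first.
    ordered = sorted(frames_with_eye_xy, key=lambda f: (f[1][1], f[1][0]))
    n = len(ordered)
    return [(frame_id, i / n) for i, (frame_id, _) in enumerate(ordered)]

# Three frames whose left eye moves from bottom-right back to top-left:
frames = [("f1", (30, 40)), ("f2", (20, 25)), ("f3", (10, 10))]
reordered = fmr_reorder(frames)
# -> [("f3", 0.0), ("f2", 1/3), ("f1", 2/3)]
```

Regardless of the true temporal direction of the motion, the reordered sequence is always monotone in eye position, which is the regularity that stabilizes training.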

Continuous Facial Motion Deblurring GAN
Inspired by the success of AC-GAN [54], the proposed continuous facial motion deblurring framework CFMD-GAN consists of a generator G and a discriminator D with an auxiliary regressor Q. An overview of CFMD-GAN is depicted in Fig. 3. Given a blurry face image and a control factor, G acts as a deblurring network that performs conditional image restoration. Unlike most single-image deblurring methods, which recover only a single deblurred image from a single blurry image, the proposed G restores a deblurred image conditioned on the control factor. That is, G predicts the continuous sharp moments latent in a blurry image as the value of the control factor changes. To achieve this, D learns 1) to decide whether images are real or fake [64] and 2) to regress the control factor at the additional output layer Q.

Overall Pipeline of Generator
Given a blurry face image b ∈ R^{H×W×3} and a moment control factor u ∈ [0, 1] as the condition, G generates a restored face image ŝ(u) ∈ R^{H×W×3}, which is defined as

ŝ(u) = G(b, u).

Specifically, the proposed G comprises two parts: a mapping network G_M and a deblurring network G_R. First, G_M translates the moment control factor u ∈ [0, 1] into the feature control factor u_f ∈ R^{H×W×64}. Second, G_R incorporates u_f with features extracted from b and then outputs the final deblurred face image ŝ(u). In the proposed deblurring network, we design a ContAda block so that G can focus on important spatial locations and channels of the features extracted from b according to u_f.

Mapping Network. In recent GAN studies [26,39,66,92], an additional mapping network has been shown to provide more disentangled semantics for the generator than directly using input codes. Inspired by this, we set up the mapping network G_M that outputs the feature control factor u_f ∈ R^{H×W×64} from the given moment control factor u ∈ [0, 1] as u_f = G_M(u).
As shown in Fig. 5, G_M first expands u into a 2-dimensional matrix u_2D ∈ R^{H×W}, where each position is filled with u. Then, G_M outputs u_f from u_2D through several convolutional layers. Similar to [39], we design G_M with eight layers, each of which includes a 1×1 convolution and a leaky ReLU [48].

Deblurring Network. As mentioned earlier, the deblurring network G_R generates a restored image ŝ(u) ∈ R^{H×W×3} from the blurry face image b ∈ R^{H×W×3} and the feature control factor u_f ∈ R^{H×W×64}, as

ŝ(u) = G_R(b, u_f).

In this work, we employ the high-level structure of MIMO-UNet [10], which has exhibited impressive performance in single-image deblurring. Specifically, as shown in Fig. 5, MIMO-UNet is based on an encoder-decoder architecture and comprises three encoder blocks (EB_1, EB_2, and EB_3) and three decoder blocks (DB_1, DB_2, and DB_3). Each encoder and decoder block contains eight modified residual blocks [74]. Unlike the original MIMO-UNet, the network developed in this study can focus on important spatial positions and channels of the feature map depending on the control factor by replacing the residual blocks with the proposed ContAda blocks. Note that SCM, FAM, and AFF are modules from the original MIMO-UNet, denoting the shallow convolutional module, feature attention module, and asymmetric feature fusion module, respectively. Details of each module, including the high-level architecture, can be found in [10]. In the following section, we discuss the proposed control-adaptive (ContAda) block.
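The scalar-to-map expansion that begins G_M can be sketched in a few lines; the 1×1-convolution layers that follow are omitted, and the function name is hypothetical.

```python
# Sketch of the first step of the mapping network G_M: the scalar moment
# control factor u is broadcast into an H x W matrix u_2D before being
# passed through the 1x1-convolution layers (omitted here).

def expand_control_factor(u, height, width):
    assert 0.0 <= u <= 1.0, "moment control factor lies in [0, 1]"
    return [[u] * width for _ in range(height)]

u_2d = expand_control_factor(0.5, height=2, width=3)
# -> [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
```

Broadcasting the scalar to a full map lets the subsequent convolutions produce a spatially varying feature control factor rather than a single global code.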

Control-Adaptive Block
There is a major challenge in applying the building blocks (e.g., variants of residual blocks [28]) widely used in single-image deblurring networks to the proposed continuous facial motion deblurring. Standard convolution-based layers have an inherent drawback in modeling geometric transformations, which stems from the fact that a convolutional unit samples the input feature map at fixed spatial locations [13,93,94]. To alleviate this, deformable convolution [13,93] has exhibited promising results in object detection by learning offsets of the convolution grid to dynamically adjust the receptive field. Inspired by this, several motion deblurring studies [58,76,82] applied a deformable convolution module to handle the complex and varied latent movements in a given blurred image [58,76]. However, these methods are still inadequate for our task because they cannot focus on adaptive positions of the feature maps depending on the control factor. To this end, as shown in Fig. 6, we propose a Control-Adaptive (ContAda) block that comprises a control-adaptive deformable convolution (CADC) module and a control-adaptive channel-attention (CACA) module. Let F_in ∈ R^{H_n×W_n×C_n} denote an input feature map of the ContAda block extracted from the input blurred image b ∈ R^{H×W×3}. Here, H_n, W_n, and C_n represent the height, width, and number of channels in the n-th encoder/decoder block, respectively. The ContAda block starts with a 3×3 convolutional layer and a LeakyReLU to extract the initial feature map F_o ∈ R^{H_n×W_n×C_n}. Meanwhile, the feature control factor u_f ∈ R^{H×W×64}, the output of the mapping network G_M, is reshaped to u_f^{(n)} ∈ R^{H_n×W_n×C_n} using bilinear interpolation and a 1×1 convolutional layer. Then, u_f^{(n)} is concatenated with F_o along the channel dimension and reshaped into F_u ∈ R^{H_n×W_n×C_n} by a 1×1 convolutional layer. F_u is used as the input feature for both the CADC and CACA modules.
In the following, we introduce the CADC and CACA modules in turn.
The Control-Adaptive Deformable Convolution (CADC) module is based on deformable convolution [13,93], which enhances the network's ability to model spatial variations. Unlike [13,93], where deformable offsets and attention weights are determined solely by internal information of the input image features, the proposed CADC learns the offsets and attention weights from the combined features of the control factor and the image. Let K denote the number of sampling locations of a convolutional kernel, and let w_k and p_k denote the weight and pre-specified offset for the k-th location, respectively. For example, a 3×3 convolutional kernel with dilation 1 has nine sampling locations (K = 9) and p_k ∈ {(−1, −1), (−1, 0), …, (1, 1)}. Let F_u(p) and F_dc(p) denote the features at location p of the input feature map F_u and the output feature map F_dc, respectively. Accordingly, the proposed CADC can be formulated as

F_dc(p) = Σ_{k=1}^{K} w_k · F_u(p + p_k + ∆p_k) · ∆m_k,

where ∆p_k and ∆m_k denote the learned offset and attention weight scalar for the k-th location, respectively. As shown in Fig. 6, ∆p_k and ∆m_k are determined by separate convolutional layers. The output of the sampling-offsets branch has 2K channels, corresponding to {∆p_k}_{k=1}^{K}. The output of the attention-weights branch has K channels, corresponding to {∆m_k}_{k=1}^{K}, and each ∆m_k is constrained to the range [0, 1] by a sigmoid function. Following [93], the initial values of ∆p_k and ∆m_k are set to 0 and 0.5, respectively.
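The modulated deformable sampling that CADC performs can be sketched at a single output location. This is a minimal sketch assuming integer offsets; real deformable convolution bilinearly interpolates fractional offsets, and the function name is ours.

```python
# Sketch of modulated deformable sampling: each kernel tap k reads the
# input at p + p_k + dp_k and is scaled by a modulation weight dm_k in
# [0, 1]. Integer offsets and in-bounds sampling are assumed here.

def cadc_sample(feat, p, weights, base_offsets, delta_p, delta_m):
    """Compute F_dc(p) = sum_k w_k * F_u(p + p_k + dp_k) * dm_k."""
    py, px = p
    out = 0.0
    for w_k, (oy, ox), (dy, dx), m_k in zip(weights, base_offsets, delta_p, delta_m):
        out += w_k * feat[py + oy + dy][px + ox + dx] * m_k
    return out

feat = [[1.0, 2.0], [3.0, 4.0]]
# A 1-tap "kernel" at p=(0,0) with learned offset (1,1) and modulation 0.5:
val = cadc_sample(feat, (0, 0), [1.0], [(0, 0)], [(1, 1)], [0.5])  # -> 2.0
```

Because the offsets and modulation weights are predicted from F_u, which mixes image features with the control factor, the sampling pattern changes as u changes.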
The Control-Adaptive Channel Attention (CACA) module is mainly motivated by [8,31,90], which benefit from applying a channel-wise attention mechanism to convolutional layers. In short, both CADC and CACA can be considered attention functions of two variables: features extracted from the blurry image and features derived from the control factor. They are complementary in that CADC performs spatial attention to select important geometric properties of the features, whereas CACA focuses on significant semantic and contextual attributes [8,90]. Given F_u, as shown in Fig. 6, global average pooling is applied to transform channel-wise information into channel descriptors, following [90]. Subsequently, we obtain channel-wise attention weights from two 1×1 convolutional layers and a sigmoid function. The learned attention weights are multiplied element-wise with F_dc, the output of the CADC module.
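The CACA attention path can be sketched as follows, a minimal sketch with the two 1×1 convolutional layers omitted; the function names are ours.

```python
import math

# Sketch of the CACA attention path: global average pooling per channel,
# then a sigmoid to squash the descriptor into (0, 1), then channel-wise
# rescaling of the CADC output. The two 1x1 conv layers are omitted.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def caca_rescale(f_dc):
    """f_dc: list of channels, each an H x W nested list."""
    out = []
    for ch in f_dc:
        pooled = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        a = sigmoid(pooled)  # channel attention weight in (0, 1)
        out.append([[a * v for v in row] for row in ch])
    return out

f_dc = [[[0.0, 0.0]], [[2.0, 2.0]]]
rescaled = caca_rescale(f_dc)
# channel 0: pooled 0 -> weight 0.5; channel 1: pooled 2 -> weight sigmoid(2)
```

The pooling collapses spatial detail, so CACA complements the spatial selectivity of CADC with a purely channel-wise reweighting.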

Discriminator
As shown in Fig. 3, the proposed discriminator D is based on the U-Net-structured discriminator [64] with an auxiliary regressor. In our framework, G receives as inputs a blurred face image b and a control factor u, and outputs an image ŝ(u) = G(b, u). Following [33], the discriminator D takes as inputs a blurred face image and the corresponding sharp face image. Here, the face image is either a real sharp image s(u) drawn from the training dataset or a restored image ŝ(u) from G. Then, D provides three types of outputs, from the encoder output layer D_enc, the decoder output layer D_dec, and the auxiliary regression layer Q.
Following [64], D_enc determines whether the global input context is real or fake. Similarly, the outputs of D_dec classify whether the local context of the input is real or fake. In addition, the proposed Q provides a regression estimate of the control factor. Instead of predicting a single scalar value of u, our Q outputs û_2D ∈ R^{H×W} and is trained to estimate the ground-truth control-factor map u_2D ∈ R^{H×W}.

Model Objectives
Following [22], D and G are optimized alternately using loss functions, which are described as follows.

Discriminator Loss
To estimate the global and per-pixel real/fake probabilities, the encoder loss L_{D_enc} and decoder loss L_{D_dec} are formulated as follows:

L_{D_enc} = −E[log D_enc(s(u))] − E[log(1 − D_enc(G(b, u)))],

L_{D_dec} = −E[ Σ_{i,j} log [D_dec(s(u))]_(i,j) ] − E[ Σ_{i,j} log(1 − [D_dec(G(b, u))]_(i,j)) ],

where [D_dec(·)]_(i,j) represents the decision of the discriminator decoder at pixel coordinate (i, j).
To ensure that the restored image corresponds to an accurate moment of the blurry image, the auxiliary regression loss L_Q is defined by

L_Q = E[ ‖Q(s(u)) − u_2D‖²₂ ].

The total loss of D is formulated as the sum of the above objectives:

L_D = L_{D_enc} + L_{D_dec} + λ_Q L_Q,

where λ_Q denotes a weight parameter, empirically set to 0.05.
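The discriminator-side loss assembly above can be sketched numerically. This is a toy sketch: a mean squared error over the output map is assumed for L_Q, and the encoder/decoder loss values are placeholders.

```python
# Sketch of the auxiliary regression loss L_Q and the total discriminator
# loss. Q outputs a 2D map regressed toward the ground-truth map u_2D,
# which is uniformly filled with the control factor u; a mean squared
# error over the map is assumed here.

def aux_regression_loss(q_map, u):
    h, w = len(q_map), len(q_map[0])
    return sum((q_map[y][x] - u) ** 2 for y in range(h) for x in range(w)) / (h * w)

def total_d_loss(l_enc, l_dec, l_q, lambda_q=0.05):
    # L_D = L_Denc + L_Ddec + lambda_Q * L_Q, with lambda_Q = 0.05
    return l_enc + l_dec + lambda_q * l_q

l_q = aux_regression_loss([[0.4, 0.6]], u=0.5)  # -> 0.01
loss = total_d_loss(0.4, 0.6, l_q)
```

The small λ_Q keeps the regression term from dominating the real/fake objective while still anchoring D's features to the control factor.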

Generator Loss
Auxiliary Regression Loss. To accurately restore the output image conditioned on the control factor, an auxiliary regression loss L_ar is optimized as follows:

L_ar = E[ ‖Q(G(b, u)) − u_2D‖²₂ ].

Adversarial Loss. We use the U-Net discriminator to ensure that the generated image is indistinguishable from real data in both global and local contexts. The adversarial loss L_adv is formulated as follows:

L_adv = −E[ log D_enc(G(b, u)) ] − E[ Σ_{i,j} log [D_dec(G(b, u))]_(i,j) ].
Pixel-wise Loss. To restore accurate pixel intensities, following [83], we employ the Charbonnier loss [7] to minimize the pixel-wise distance between a ground-truth moment and the restored image:

L_pix = Σ_n (1 / (H_n W_n)) √( ‖ŝ_n(u) − s_n(u)‖² + ε² ),

where n denotes the index of the multi-scale level, and H_n and W_n represent the height and width of the output image at the n-th level, respectively. Following [83], ε is set to 10⁻³.

Perceptual Loss. Furthermore, we use a perceptual loss to obtain perceptually satisfactory images. Similar to [35], LPIPS [89] is employed as the perceptual loss:

L_per = Σ_{m=1}^{M} (1 / (H_m W_m)) Σ_{i,j} ‖ ω_m ⊙ ( φ_m(ŝ(u))_(i,j) − φ_m(s(u))_(i,j) ) ‖²₂,

where φ_m(·) is the feature extractor at the m-th layer, ω_m denotes a learned vector used to measure the LPIPS score, and the total score is averaged over M layers.
Overall, the total loss of G combines the aforementioned loss functions:

L_G = λ_ar L_ar + λ_adv L_adv + λ_pix L_pix + λ_per L_per,

where λ_ar, λ_adv, λ_pix, and λ_per denote the balancing weights, empirically set to 0.05, 0.1, 1, and 0.01, respectively.
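The Charbonnier term and the weighted combination of the generator losses can be sketched as follows, using the ε and λ values stated above. This is a toy sketch: flat pixel lists stand in for images, and a single scale is shown.

```python
import math

# Sketch of the single-scale Charbonnier pixel loss with eps = 1e-3, and
# the weighted combination of the generator losses with the lambda
# values from the paper. Flat pixel lists stand in for images.

def charbonnier(pred, gt, eps=1e-3):
    n = len(pred)
    return sum(math.sqrt((p - g) ** 2 + eps ** 2) for p, g in zip(pred, gt)) / n

def total_g_loss(l_ar, l_adv, l_pix, l_per,
                 lam_ar=0.05, lam_adv=0.1, lam_pix=1.0, lam_per=0.01):
    return lam_ar * l_ar + lam_adv * l_adv + lam_pix * l_pix + lam_per * l_per

l_pix = charbonnier([0.5, 0.5], [0.5, 0.6])
```

The ε term makes the loss smooth near zero error, behaving like L2 for small residuals and like L1 for large ones.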

Dataset
We use the 300VW dataset [65], which consists of a large number of high-quality facial videos recorded in the wild. Each video has a duration of about one minute at 25-30 fps. Following the face deblurring study by Ren et al. [60], the training and test sets are extracted from 83 and 9 videos, respectively. Each blurry image is synthesized by averaging varying numbers (5-13) of consecutive sharp frames, as in recent motion deblurring studies [18,60]. Thus, the test set consists of a total of 13,058 blurred images and 116,188 sharp frames. The numbers of test images are detailed in Table 1.
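The blur synthesis described above reduces to frame averaging. A minimal sketch (the function name is illustrative; real pipelines typically also apply an inverse camera response function, which is omitted here):

```python
import numpy as np

def synthesize_blur(frames):
    """Average 5-13 consecutive sharp frames to synthesize one
    motion-blurred image, as in the 300VW setup."""
    assert 5 <= len(frames) <= 13, "paper uses 5-13 consecutive frames"
    stack = np.stack([f.astype(np.float64) for f in frames])
    return stack.mean(axis=0)
```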

Implementation Details
The proposed framework is implemented in PyTorch [57] and trained on NVIDIA TITAN RTX GPUs. We train our networks using the Adam optimizer [42] with β_1 = 0.9 and β_2 = 0.999. The initial learning rate is set to 1 × 10^-4 and is decayed exponentially by a factor of 0.99 every epoch. For data augmentation, we randomly scale the image by a factor of 1.0 to 1.5 and then randomly crop it to a spatial size of 256 × 256 × 3. During training, we set the batch size to 8 and train our model for 200 epochs.
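The learning-rate schedule above corresponds to an exponential decay (in PyTorch, `ExponentialLR` with gamma=0.99). A framework-free sketch of the resulting schedule:

```python
def lr_at_epoch(epoch, base_lr=1e-4, gamma=0.99):
    # Exponential decay: the learning rate is multiplied by 0.99 after
    # every epoch, starting from 1e-4.
    return base_lr * gamma ** epoch
```

Over the full 200-epoch run this shrinks the learning rate to roughly 0.99^200 ≈ 0.134 of its initial value.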

Evaluation Metrics
For quantitative evaluation, we measure PSNR and SSIM [79], which are traditionally used for image quality assessment. We also report two learning-based perceptual quality metrics, FID [29] and LPIPS [89]. Moreover, we employ the ArcFace [14] model to measure the facial identity distance between the ground truth (GT) and the resulting image, as in [77].
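Of these metrics, PSNR has the simplest closed form and is worth stating precisely, since it drives the fidelity comparisons in the tables below. A standard NumPy implementation (for 8-bit images, data_range = 255):

```python
import numpy as np

def psnr(pred, gt, data_range=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```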

Comparisons with the state-of-the-arts
To the best of our knowledge, the proposed method is the first attempt at single-to-video face deblurring. Hence, we conduct extensive and faithful comparisons with state-of-the-art methods in single image deblurring. Specifically, the proposed CFMD-GAN is compared with single-to-single (s2s) general deblurring (i.e., Nah et al. [18], SRN [74], DMPHN [84], MIMO [10]), s2s face deblurring (i.e., Shen et al. [67], UMSN [81], MSPL [46]), and single-to-video (s2v) general deblurring (i.e., Jin et al. [34]) methods. To facilitate fair comparisons, we retrain the existing methods using the same training dataset used in this study. The retrained models are marked with asterisks (*). All experiments are performed using the official codes provided by the authors.

Single-to-Single General Deblurring
In this comparison, we evaluate the performance of center-frame prediction, as most s2s general methods are designed to restore the center frame. For the proposed method, the control factor is set to c = 0.5 to obtain the center-frame results. Table 2 reports the comparisons of s2s general deblurring methods. Despite the significant performance improvements of the retrained DMPHN* and MIMO* over the original DMPHN and MIMO, our CFMD-GAN shows the best results in LPIPS, FID, and ArcFace distance, and the second best in PSNR and SSIM.

Figure 7: Qualitative comparisons of single-to-single general deblurring methods (Input, DMPHN* [84], MIMO* [10], CFMD-GAN (ours), Ground truth). Zoom in for the best view.
As investigated in recent GAN-based restoration studies [2,9,24,35,45,77,78], PSNR and SSIM may be lower because the GAN-based model tends to generate fake yet realistic details and textures [24]. This effect of GANs can be clearly observed in the visual comparisons in Fig. 7. Compared with other methods, the proposed CFMD-GAN restores more realistic textures and finer details of facial components, such as the eyes, nose, and eyelids. Based on these results, we can confirm that the proposed model can predict a more accurate center frame than the other methods.

Single-to-Single Face Deblurring
Most existing s2s face deblurring methods [46,67,81] are developed to remove spatially-uniform blurs, whereas our training and test datasets contain spatially-variant blurs. Besides, their models only handle input images of 128×128×3. For these reasons, we downsample our dataset to 128×128×3 and use it to retrain UMSN [81], MSPL [46], and our model (termed CFMD-GAN_128). The retrained models, UMSN* and MSPL*, are trained to predict the center frame, similar to the s2s general deblurring approaches. Note that we do not retrain Shen et al. [67] because their training code has not been released. Table 3 and Fig. 8 provide the quantitative and qualitative comparisons of the s2s face deblurring methods, respectively. In this experiment, the proposed method achieves significantly better performance in SSIM, LPIPS, FID, and ArcFace than the existing face deblurring methods, and the second best in PSNR. As shown in Fig. 8, Shen et al. [67] fails to restore plausible results because their model is not trained to remove spatially-variant blurs. Although the retrained models (UMSN* and MSPL*) show improved performance, they are still inferior to CFMD-GAN.

Single-to-Video General Deblurring
For s2v general deblurring, we compare our method with Jin et al. [34], which officially released their test model. Since this method is strictly fixed to extract seven sequential frames from a single blurry image, we compare the results only for blurry images averaged from seven sharp frames. None of the s2v deblurring methods [1,34,59,86] have released their training codes; [34] is the only work that provides the test code.

Figure 8: Qualitative comparisons of single-to-single face deblurring methods (Input, Shen et al. [67], UMSN* [81], MSPL-GAN* [46], CFMD-GAN_128 (ours), Ground truth).

Table 4 reports quantitative comparisons with Jin et al. [34] and detailed results of our model according to the number of GT frames. The model of Jin et al. [34] is limited to predicting only a fixed number of frames once it has been trained. In contrast, it is worth noting that the proposed single model can predict various numbers of output frames without additional network changes or training processes. Visual comparisons in Fig. 9 show this difference.
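The claim that one trained model can emit any number of frames follows from the continuous control factor: requesting N frames amounts to sampling N values of c in [0, 1] and running the generator once per value. A sketch of this sampling (the function name is illustrative):

```python
def control_factors(num_frames):
    """Evenly spaced control factors in [0, 1]; c = 0.5 recovers the
    center frame, and any num_frames can be requested from one model."""
    if num_frames == 1:
        return [0.5]
    return [i / (num_frames - 1) for i in range(num_frames)]
```

For example, `control_factors(7)` spans the exposure from the first moment (c = 0) to the last (c = 1), with the center frame at c = 0.5.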

Evaluation on Other Test Datasets
Since our model is trained and evaluated on synthetically blurred images from the 300VW dataset [65], we verify how it performs on other motion-blur benchmark datasets, such as REDS [52] and Lai et al. [44]. The REDS dataset is generated from 120 fps videos, synthesizing blurry frames by merging consecutive frames. The Lai dataset contains real-blur images for which GT images do not exist. We manually crop the facial regions of images in the REDS validation set and the Lai dataset.

Figure 10: Qualitative results of the proposed CFMD-GAN (Input, CFMD-GAN 11 frames) on the REDS dataset [52] (1st row) and the Lai dataset [44] (2nd and 3rd rows). This figure contains videos that are best viewed using Adobe Reader.

Fig. 10 shows that our method restores satisfactory images for recent benchmark deblurring datasets. In the 1st row of Fig. 10, we can see that our method produces not only a sharp face but also the background that was occluded by the face in the previous frame. For the real-blurred images in the 2nd row of Fig. 10, our model restores plausible results containing consecutive frames. Our framework can provide all the sharp moments that the user wants from a single motion-blurred face image.

Ablation Study
In Table 5, we evaluate the impact of the proposed ContAda block, which consists of ContAda deformable convolution (CADC) and ContAda channel attention (CACA). With the CADC module, the proposed method can focus on spatially important sampling points of the feature maps according to the control factor. Notably, using only the CACA module improves the average PSNR by about 0.5 dB compared to using only the CADC module, demonstrating that channel attention plays the more important role in the proposed model. More importantly, using both CADC and CACA achieves the best results, indicating that both spatial and channel-wise modulations are required for continuous facial motion deblurring. Furthermore, we conduct an ablation study to investigate the contribution of FMR to network training. The 3rd row in Table 5 indicates that without FMR, i.e., when the model learns the original temporal order, the performance drops drastically.
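To make the channel-wise modulation concrete, a minimal NumPy sketch of control-conditioned channel attention in the spirit of CACA is given below. This is not the paper's ContAda block: it assumes a squeeze-and-excitation-style design in which globally pooled features are concatenated with the control factor c and passed through a tiny two-layer MLP to produce per-channel sigmoid gates; all names and shapes are hypothetical.

```python
import numpy as np

def contada_channel_attention(feat, c, w1, w2):
    """Channel attention conditioned on the control factor c.
    feat: (C, H, W) feature map; w1: (hidden, C + 1); w2: (C, hidden).
    Hypothetical sketch of a squeeze-excite gate that also sees c."""
    pooled = feat.mean(axis=(1, 2))                  # global average pool, (C,)
    z = np.concatenate([pooled, [c]])                # condition on the control factor
    h = np.maximum(w1 @ z, 0.0)                      # ReLU hidden layer
    gates = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # sigmoid gates in (0, 1), (C,)
    return feat * gates[:, None, None]               # channel-wise re-weighting
```

Because c enters the gate computation, the same feature map is re-weighted differently for each requested moment, which is what lets a single network serve the whole continuum of outputs.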

Conclusion
In this study, we introduce CFMD-GAN, a novel framework for continuous facial motion deblurring with a single network and a single training process. We apply facial motion-based reordering (FMR) to mitigate the difficulty of temporal ordering by utilizing domain-specific facial information, which ensures a stable learning process. We devise an auxiliary regressor to learn continuous motion deblurring by integrating the concept of conditional GANs into a single image deblurring framework. In addition, we propose a control-adaptive (ContAda) block that focuses on deformable locations and important channels according to the control factor. In our extensive experiments, we demonstrate that the proposed method outperforms state-of-the-art methods in facial image deblurring. The proposed framework can provide the continuous sharp moments that users want to obtain from a single motion-blurred facial image. Since the proposed method restores facial motion in the order given by FMR, it may be limited in predicting the accurate temporal order of the facial motion. Nevertheless, we believe that the proposed method will serve as a basis for future studies on continuous facial motion deblurring. In addition, incorporating various facial priors to further improve restoration quality is a promising direction for future research.