Concurrent Video Denoising and Deblurring for Dynamic Scenes

Dynamic scene video deblurring is a challenging task due to the spatially variant blur inflicted by independently moving objects and camera shake. Recent deep learning works bypass the ill-posedness of explicitly deriving the blur kernel by learning pixel-to-pixel mappings, commonly enhanced by larger region awareness. Yet this scenario, while difficult, remains simplified because noise is neglected, although it is omnipresent in a wide spectrum of video processing applications. Despite its relevance, the problem of concurrent noise and dynamic blur has not yet been addressed in the deep learning literature. To this end, we analyze existing state-of-the-art deblurring methods and expose their limitations in handling non-uniform blur under strong noise conditions. Thereafter, we present the first work to date that addresses blur- and noise-free frame recovery by casting the restoration problem into a multi-task learning framework. Our contribution is threefold: a) We propose R2-D4, a multi-scale encoder architecture attached to two cascaded decoders performing the restoration task in two steps. b) We design multi-scale residual dense modules, bolstered by our modulated efficient channel attention, to enhance the encoder representations and to augment deformable convolutions to capture longer-range, object-specific context that assists blur kernel estimation under strong noise. c) We perform extensive experiments and evaluate state-of-the-art approaches on a publicly available dataset under different noise levels. The proposed method performs favorably under all noise levels while retaining a reasonably low computational and memory footprint.


I. INTRODUCTION
Videos aim at faithfully reflecting the motion in dynamic scenes, but concurrent motion blur and noise can severely obscure scene perception. Vision sensors are reaching new and complex environments, ranging from medicine, marine exploration, and robotics to night vision. However, hardware typically bears application-specific limitations and poses challenges for video enhancement. Improving visual outputs finds applications in visualization environments where the user can assess the scene more accurately and react. Moreover, enhanced video processing facilitates downstream computer vision tasks and improves performance in general video understanding. Although algorithms should address hardware limitations and account for adversarial physical phenomena by enhancing the video output, satisfying the objectives of a real-world application is a demanding task in practice.
Proper calibration of the sensor requires adjustment of the exposure time. While a longer exposure time increases the number of photons and thus allows the sensor to capture scenes with less noise, it increases the risk of motion blur when the camera shakes and objects move. Conversely, a short exposure time causes noise. Numerous methods have been proposed to address the deblurring task, ranging from spatially invariant [1]-[6] to spatially variant blur [7]-[12]. Meanwhile, many approaches have been proposed for denoising with remarkable results [13]-[16]. However, deeply learnt, dynamic scene video denoising and deblurring have been addressed only as independent tasks. Severe noise has been recently addressed in video enhancement, but only for static scenes, that is, without motion blur, for example in the context of low-light imaging [17]. The problem of spatially variant motion blur, related to independently moving objects in the presence of noise, has not yet been addressed in the deep learning literature. Not only is it an intrinsically challenging problem, but relevant research is also limited by the difficulty of constructing such labeled datasets [17], [18]. For instance, [12] used a beam splitter to construct real blurry-sharp frames, whereas [18] emulated motion by manually moving objects and used a fixed tripod to capture multiple frames of the same scene before generating real noisy-clean pairs of frames by averaging the noisy instantiations of the same scene. In this study, we rely on the realistic blurry dataset of [12] and the realistic Poisson-Gaussian noise model [18], [19].
The problem at hand raises questions. Should different models be tailored to individually address denoising and deblurring tasks? How robust are deep video deblurring methods with increasing noise levels? To answer our questions, we first developed a deep learning system for video deblurring under strong noise. We demonstrate that the sequential utilization of off-the-shelf state-of-the-art video denoising and deblurring algorithms is ineffective. The former oversmooths the output since it is not constrained to retain the blur kernel. Moreover, such a configuration would be suboptimal because individual methods require individual feature extraction modules, while motion estimation and local frame features between the two tasks are essentially shareable.
Contributions: To address the aforementioned limitations, we propose R2-D4, the first-to-date deeply learned network that leverages the feature-sharing potential of multi-task learning (MTL) to increase model efficiency and jointly address dynamic video denoising and deblurring. Our main contributions are summarized as follows:
R2-D4: We propose R2-D4, a novel, MTL-inspired, cascaded convolutional architecture utilizing two decoders to denoise and deblur input frames in stages. R2-D4 employs a tailored feature alignment module that leverages deformable convolutions at the feature level.
MS-RDM: We propose multiscale residual dense modules to learn coarse-to-fine, dense representations, enhanced by MECA, a novel extension of the efficient channel attention module [20], to further modulate deformable convolutions and increase restoration performance while retaining the number of FLOPs.
Experiments: We extensively benchmark existing deblurring approaches under different levels of noise on a real, publicly available dataset and show that state-of-the-art deblurring networks bear noise-removing capacity, yet R2-D4 performs consistently better.

II. RELATED WORK
A. DEBLURRING IN THE PRESENCE OF NOISE
Image deblurring in the presence of noise is a fundamental, widely studied subject. Traditionally, a convolution model has been employed for spatially invariant blur and additive Gaussian noise [21] or more realistic Poisson-Gaussian noise [19]. To solve the corresponding ill-posed inverse problem, some studies have resorted to variational methods that incorporate prior information on the unknown clean image, such as promoting sparsity [22], or some prior learned from data [1]-[4]. The other strategy combines the advantages of deep neural networks and variational approaches by linking each layer of a deep network to one iteration of a baseline iterative algorithm and learning the algorithm hyperparameters from data using deep unfolding methods [5], [6], [23]. Although the aforementioned methods produce very good results, they are typically limited to simplistic scenarios, namely spatially invariant blur kernels. Moreover, they hardly scale to videos because of their relatively high computational complexity. More realistic scenarios with both noise and unknown spatially variant blur have not been extensively addressed in the literature. Existing optimization-based methods require knowledge of the noise level and rely on relatively simple priors assuming a piecewise constant change of the blur kernel in space [24], [25] or deal with a simplified blur model [26], [27]. The more general space-variant blur model is, however, well studied in the context of deep learning, albeit without noise, as reviewed next.

B. DEEP VIDEO DEBLURRING
Deep video deblurring methods rely on the expressiveness of stacked convolution filters to deal with dynamic scenes, where blur is due to both camera shakes and independently moving objects. Su et al. [28] proposed an encoder-decoder architecture to align the input frames via its intrinsic multiscale property and showed that warping the input frames with optical flow introduces negligible performance gains but significant warping artifacts. Similarly, Zhou et al. [10] performed implicit frame alignment on the feature level by learning alignment kernels to overcome inaccurate optical flow estimation via a recurrent design. Wang et al. [29] proposed EDVR, a general-purpose reconstruction network for video restoration tasks, including deblurring, denoising, and super-resolution. The authors performed feature-level alignment with a multi-scale cascaded module comprising deformable convolutions, before spatio-temporal attentive fusion, followed by a reconstruction module to restore the corrupted input. Zhong et al. [12] introduced a recurrent network that extracts the features frame-wise and pre-processes them with a spatio-temporal attention module that emphasizes the important ones to be passed to the reconstruction decoder that generates the output image. Pan et al. [30] proposed a cascaded algorithm that relies on optical flow estimation to restore the latent frame. However, despite the number of successful studies on dynamic video deblurring, no study has yet addressed this task in the presence of noise.

C. CHANNEL ATTENTION
Channel attention mechanisms have become a prevalent building block in vision since they enable enhanced channel-wise feature learning by highlighting informative features and suppressing irrelevant ones at low cost. Hu et al. [31] performed channel-wise global average pooling and employed linear projections with fully connected layers to first reduce and then redeem the channel dimensionality; the features were then rescaled using the learned weights. Similarly, Woo et al. [32] augmented the linear projections with both global average and max pooling. More recently, Wang et al. [20] argued that dimensionality reduction impairs channel relationships and performed the projection directly on the input channel dimension with efficient 1D convolutions. Motivated by the efficiency of channel attention modules, we incorporate them into our work by extending ECA to further enrich the attentive representations while retaining its efficiency.

D. MULTI-TASK LEARNING
Multi-task learning (MTL) constitutes the paradigm where different tasks are learnt simultaneously [33], typically through hard parameter sharing. MTL is an especially desirable setup under a synergistic task configuration. With minor gradient conflicts, it increases model efficiency by restraining the computational budget and leverages the underlying data structure more effectively by utilizing joint signals from different labels [34]. With regard to the problem at hand, both denoising and deblurring benefit from end-to-end, accurate alignment. Inspired by the success of cascaded restoration in stages [29], [30], [35], we cast the restoration problem into an MTL framework that shares features between the cascaded decoders.

E. DATASETS
Many video deblurring datasets [8], [28], [36] have been introduced to facilitate research in the field. Earlier works [8], [28] utilized high-fps cameras to approximate spatially variant blur via frame averaging over a temporal window. Despite their wide adoption, their use comes at the expense of consistent artifact generation incurred by excessive frame averaging to increase blur. As a result, the frames from [8] exhibit ghosting artifacts. In contrast, [28] used a smaller temporal window at the expense of adequate blur generation. Most recently, the beam splitter dataset (BSD) [12] has been constructed using cameras with different exposure times that record the same scene through a beam splitter. The authors introduced three different exposure configurations, yielding datasets of three different blur levels. However, these datasets capture only outdoor scenes, and obtaining sharp and blurry pairs of frames for low-light scenes remains unaddressed. For instance, the recently published ARID [37] is a low-light dataset that motivates the proposed problem, exhibiting both noise and blur, but lacks the respective paired clean frames.

III. PROBLEM FORMULATION
Let x ∈ R^Q be a vector of observations related to an original signal y ∈ [0, +∞)^N through the model

x = α z(y/α) + w,   (1)

where α ∈ (0, +∞) is a scaling parameter, z(y) = (z_i(y))_{1≤i≤Q} and w = (w_i)_{1≤i≤Q} are the realizations of mutually independent random vectors Z(y) = (Z_i(y))_{1≤i≤Q} and W = (W_i)_{1≤i≤Q} with independent components. It is further assumed that, for every i ∈ {1, ..., Q}, Z_i(y) ∼ P([Hy]_i) and W_i ∼ N(0, σ^2), where P, N denote the Poisson and Gaussian distributions respectively, σ ∈ (0, +∞) is the standard deviation of the Gaussian noise component, and H ∈ [0, +∞)^{Q×N} is a matrix modeling the degradation process, i.e., a heterogeneous motion blur kernel map with a different blur kernel for each pixel in y. Let h^i represent the kernel from H that operates on a region of the image centered at location i, such that

k_i = [Hy]_i = Σ_j h^i_j y_{i+j}.   (2)

Thus, for each i, we have P(k_i) = P(Σ_j h^i_j y_{i+j}). In the context of deep learning, the original video signal can be recovered by some network F with parameters Θ. Hence, given T consecutive, corrupted frames (x_t)_{1≤t≤T}, the optimal set of parameters Θ is derived by minimizing the criterion

Θ̂ = argmin_Θ L(F((x_t)_{1≤t≤T}; Θ), y),   (3)

where L denotes some quality measure function, e.g., the ℓ2 squared norm or the ℓ1 norm. More recently, perceptually motivated strategies [8], [10], [38] have been considered to restore realistic image structures by augmenting the optimization criterion with either GAN-based [39] adversarial training [8], [38] or perceptual loss terms [40].
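As a concrete illustration, the Poisson-Gaussian degradation above can be simulated with a short NumPy sketch. Two simplifications are assumptions made for brevity: the spatially variant H is replaced by a single uniform 3×3 blur, and the paper's PyTorch setting is replaced by plain NumPy.

```python
import numpy as np

def box_blur(y):
    """Stand-in for Hy: a uniform 3x3 blur (the paper's H is spatially variant)."""
    yp = np.pad(y, 1, mode="edge")
    acc = np.zeros_like(y, dtype=float)
    for di in range(3):
        for dj in range(3):
            acc += yp[di:di + y.shape[0], dj:dj + y.shape[1]]
    return acc / 9.0

def degrade(y, alpha, sigma, rng):
    """x = alpha * z(y/alpha) + w, with z ~ Poisson(Hy/alpha) and w ~ N(0, sigma^2)."""
    hy = box_blur(y)
    z = rng.poisson(hy / alpha)           # scaled Poisson shot noise
    w = rng.normal(0.0, sigma, y.shape)   # additive Gaussian noise
    return alpha * z + w

rng = np.random.default_rng(0)
clean = np.full((128, 128), 100.0)
noisy = degrade(clean, alpha=1.9, sigma=2.0, rng=rng)  # the "dark" setting
```

Since E[x] = Hy, averaging many such realizations recovers the blurred signal, which is the rationale behind the frame-averaging strategy used to build noisy-clean pairs in [18].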

IV. PROPOSED METHOD
Given N consecutive corrupted frames x_{[t−N:t]} and N−1 previously restored frames ŷ_{[t−N:t−1]}, our method obtains ŷ_t via a cascaded, two-stage restoration. The proposed R2-D4 network consists of a shared, dense, deformable (D2) feature alignment module, followed by a convolutional feature fusion and two decoders performing denoising and deblurring sequentially (D2) to restore the frames via a two-stage (R2) cascaded process, as illustrated in Fig. 1. The shared D2 module processes the current frame x_t and the previous frames {x_{t−N}, ŷ_{t−N}} to extract features at each time step. Subsequently, the asymmetric offsets are estimated to align the neighboring frame features with the reference frame features. Thereafter, the aligned features are fused before the two decoders leverage the shared features to denoise and deblur the current frame sequentially in a cascaded manner.
The D2 alignment module, described in Sec. IV-B1, employs modulated deformable convolutions [41] to align frames at the feature level and avoids estimating optical flow, which is harder under strong noise; common issues arising from optical flow include computational inefficiency and the generation of motion artifacts. Feature alignment is further improved via our multiscale residual dense modules (MS-RDMs), described in Sec. IV-A2, which leverage dilated convolutions to capture a longer-range context. MS-RDBs essentially serve as a pre-processing step before the deformable convolutions, aggregating features with increased effective receptive fields and thus mitigating the known deformable offset estimation issue [29], [42]. MS-RDBs are further enhanced with our modulated efficient channel attention blocks, as explained in Sec. IV-A1.
The R2 two-stage cascaded restoration process utilizes two decoders to denoise and deblur corrupted frames sequentially under an efficient MTL framework. Accurate feature alignment benefits both denoising and deblurring. Moreover, the features are expanded channel-wise upon fusion, increasing the model capacity at the lowest resolution to accommodate both tasks sufficiently. Finally, the two-stage cascaded process has been shown to yield increased performance on many restoration tasks and is therefore integrated into R2-D4 through the two decoders under the proposed feature-sharing scheme. As illustrated in Fig. 1, additional residual connections from x_t to the first-stage output k̂_t, and from the latter to the second-stage output ŷ_t, are used to facilitate the training.

A. PROPOSED BLOCKS
1) Modulated Efficient Channel Attention
Self-supervised channel attention blocks have become ubiquitous since they highlight informative and suppress non-relevant features. Wang et al. [20] proposed an efficient 1D convolution (ECA) on globally averaged input channels to determine the attentive weights, as illustrated in the top half of Fig. 3. Formally, ECA is denoted as

ECA(f) = σ(C_{c,1×k}(GAP(f))),   (4)

where GAP is the global average pooling operation, σ is the sigmoid function, and C_{c,1×k} is a 1D convolutional operation with c output channels and a kernel of size k, the latter typically determined adaptively as a function of the number of input channels. Assuming a feature cube f ∈ R^{H×W×C}, the channel-wise attention weights are then derived as

f̃_c = ECA(f).   (5)

Then, f̃_c is multiplied by the input features f to obtain the attended features f̂.
Despite the success of channel attention modules, they are often difficult to optimize and converge to uniform distributions of the channel weights. To alleviate such issues and facilitate the gradient flow during the backward pass, we propose to complement the globally averaged features with max-pooled features, as in CBAM [32], under the efficient 1D convolution configuration of [20]. The modulated efficient channel attention module, termed MECA, is illustrated in the bottom half of Fig. 3. In contrast to ECA, we perform both global average and max pooling (MP) on the features f channel-wise to obtain f_c, and we denote the concatenation of the GAP and MP channels as MGAP. By adopting the notation in Eq. 4, MECA is defined as

MECA(f) = σ(C_{c,1×k}(MGAP(f))).   (6)

The attention weights are derived similarly to Eq. 5 and multiplied by the input features to obtain the attended features. Notably, MECA retains the efficiency of the 1D convolution in capturing local cross-channel interactions but learns an essentially more effective projection, utilizing two channels of informative cues instead of solely the globally averaged ones. MECA is an easy-to-plug module that can be integrated into standard architectures for any vision task.
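To make the mechanism concrete, the NumPy sketch below implements the MECA computation under stated assumptions: the kernel size and the (here untrained) weights are placeholders, and the 1D convolution over the two-channel MGAP descriptor is realized as a sum of two per-descriptor 1D convolutions, which is an equivalent formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_same(x, w):
    """1D convolution across the channel axis with zero 'same' padding (odd kernel)."""
    k = len(w)
    xp = np.pad(x, k // 2)
    return np.array([np.dot(xp[i:i + k], w) for i in range(len(x))])

def meca(f, w_avg, w_max):
    """f: (C, H, W). Attention weights from avg- and max-pooled channel descriptors."""
    gap = f.mean(axis=(1, 2))  # GAP: global average pooling, shape (C,)
    mp = f.max(axis=(1, 2))    # MP: global max pooling, shape (C,)
    # A 1D conv over the 2-channel MGAP input equals the sum of two 1D convs.
    a = sigmoid(conv1d_same(gap, w_avg) + conv1d_same(mp, w_max))
    return f * a[:, None, None]  # rescale each channel by its attention weight
```

With zero weights the sigmoid yields 0.5 for every channel, which illustrates why a poorly optimized attention block collapses to a uniform channel weighting, the failure mode MECA is designed to mitigate.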

2) Multiscale Residual Dense Module
Residual blocks [43] have been a popular choice [10], [16], [38], [44] in image and video restoration. More recently, residual dense blocks (RDBs) [45] exploited dense connections between layers to extract richer hierarchical features while instantiating a contiguous memory (CM) mechanism to further enhance the learned representations. RDBs typically consist of l convolutional kernels and a 'growth factor' hyperparameter g. As shown in Fig. 2b, each layer receives the feature maps from the previous stage, convolves them with a 3 × 3 kernel that yields g additional channels, and concatenates them with the previous ones before passing them to the next layer. Each block is then followed by a 1 × 1 convolution to aggregate the signal and stabilize the training before the residual summation. Formally, a single RDB with 3 layers can be denoted as

f_1 = R(C_{g,3×3}(f_0)),
f_2 = R(C_{g,3×3}(CAT_{c+g}(f_0, f_1))),
f_3 = R(C_{g,3×3}(CAT_{c+2g}(f_0, f_1, f_2))),
RDB(f_0) = f_0 + C_{c,1×1}(CAT_{c+3g}(f_0, f_1, f_2, f_3)),

where C_{c,k×k}, R and CAT_c are the k × k convolution operation, the activation function, and the concatenation function, respectively. The subscript c denotes the number of output channels after each convolution and concatenation. Stacking b such residual dense blocks gives rise to RDB cells [12], where the output of each block is sequentially processed by the next block. For clarity, we term them residual dense modules (RDMs). In RDMs, all subsequent RDB outputs are concatenated and fed into another 1 × 1 convolution before the residual summation at the module level.
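The dense connectivity and channel bookkeeping of an RDB can be sketched as follows. For brevity, this NumPy sketch replaces the 3×3 kernels with 1×1 (per-pixel) projections, so it captures the concatenation, growth, and residual structure rather than the spatial filtering; the shapes and weights are illustrative placeholders.

```python
import numpy as np

def rdb_forward(f0, layer_weights, w_fuse):
    """Residual dense block. f0: (C, H, W); layer_weights[i]: (g, C + i*g)."""
    feats = f0
    for w in layer_weights:
        new = np.maximum(0.0, np.einsum("oc,chw->ohw", w, feats))  # 'conv' + ReLU
        feats = np.concatenate([feats, new], axis=0)               # dense concat
    fused = np.einsum("oc,chw->ohw", w_fuse, feats)  # 1x1 aggregation back to C
    return f0 + fused                                # local residual summation

C, g, H, W = 4, 2, 8, 8
rng = np.random.default_rng(0)
weights = [rng.normal(size=(g, C + i * g)) * 0.1 for i in range(3)]
w_fuse = rng.normal(size=(C, C + 3 * g)) * 0.1
out = rdb_forward(rng.random((C, H, W)), weights, w_fuse)
```

Note how each layer's input width grows by g channels, which is exactly the bookkeeping that makes the final 1 × 1 aggregation necessary before the residual summation.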
In this work, we introduce multiscale residual dense modules (MS-RDMs) to efficiently increase the effective receptive field by spatially augmenting the hierarchical features in a coarse-to-fine manner. As illustrated in Fig. 2, MS-RDMs are designed via an MS-RDB that captures a hierarchically coarser context via kernel dilation, followed by a simple non-dilated RDB to complement the hierarchical features with fine details. Within the MS-RDB, the layers are progressively enhanced with larger dilation rates to hierarchically capture a longer-range context. As depicted in Fig. 2a, the MS-RDB block is defined analogously to the RDB, with C_{c,k×k,d} denoting the convolution dilated with a rate of d that grows from layer to layer. Upon concatenation of the coarse and fine block features, and before the 1 × 1 convolutional aggregation, we perform channel-wise attention via the proposed MECA_k. The resultant RDM_MS is defined analogously at the module level. The proposed MS-RDM reformulation enlarges the effective receptive field, which in turn renders the CM mechanism spatially more aware. The coarse-to-fine hierarchical features mine spatially aware representations and serve as a pre-processing step for deformable offset estimation.
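The gain in effective receptive field from progressive dilation can be quantified with a short helper; the dilation rates (1, 2, 4) below are illustrative placeholders, not the exact configuration of the paper.

```python
def receptive_field(kernel_sizes, dilations):
    """Effective receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # each layer widens the field by (k-1)*d pixels
    return rf

plain = receptive_field([3, 3, 3], [1, 1, 1])    # three plain 3x3 convs -> 7
dilated = receptive_field([3, 3, 3], [1, 2, 4])  # progressively dilated -> 15
```

The dilated stack more than doubles the receptive field at the same FLOPs, which is the efficiency argument behind the MS-RDB design.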

B. RESTORATION EN CASCADE
1) Dense Deformable Alignment
At each time step t, the network receives the current frame x_t and the previous corrupted and restored frames {x_{t−N}, ŷ_{t−N}}. Leveraging previously restored frames encourages temporal coherence by reducing flickering and has been shown to yield improved performance [10]. At each time step, the respective features are computed with a feature extraction block composed of strided convolutions C_{c,k,s}, where C_{c,k,s} denotes a k × k convolution with a stride of s and c output channels, and the multiscale residual dense module RDM_MS^g with a growth factor g. As illustrated in Fig. 1, R2-D4 contains two sets of weights: one for the current frame x_t and one for each past pair {x_{t−N}, ŷ_{t−N}}, where N is set to 2. Weight sharing for the past frame features increases the training efficiency and accelerates inference by reusing f_{t−2} at each time step.
The current and previous frames are then aligned using deformable convolutional layers. A deformable module enables the modeling of geometric transformations through asymmetric kernels, so that the output features can capture object-specific context that assists blur kernel estimation. Leveraging deformable convolutions under the proposed scheme has three advantages. First, it discards the necessity for error-prone and computationally expensive optical flow estimation. Second, it performs alignment on the deeper feature levels instead of the image level; this has been shown to improve performance [10], [29] because the layers prior to the deformable modules encode features that are tailored to the alignment. Third, estimating the deformation offsets on the coarse-to-fine features extracted from the MS-RDMs assists the modulated offset estimation and improves performance.
Each modulated deformable layer consists of two convolutions. The first layer learns the offset displacements and the modulating scalars that determine the amplitude of the output features. The second layer employs the modulated offsets and learns the filter weights, as in an ordinary convolution. The deformable convolution is parameterized by C_{3k²,k×k,s}, the k × k convolutional kernel with a stride of s that estimates the 2k² offsets and the respective k² modulation scalars from the concatenated c frame features, and by C^D_{c,k×k,s}, the actual deformable convolution with c output channels. Correspondingly, the aligned features are obtained by applying the deformable convolution to the neighboring frame features with the estimated offsets and modulation scalars. The fusion of the aligned features is then performed with simple RDBs; dilation rates are not employed for fusion because a spatially wider context does not strengthen the feature representations at smaller scales. Because the number of past frames is N = 2, the fused output constitutes the shared features passed to the two decoders.
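A minimal single-channel NumPy sketch of the sampling performed at one modulated deformable kernel location is given below. The offsets and modulation mask would normally come from the learned offset convolution; here they are plain arrays supplied by the caller, which is an assumption made to keep the sketch self-contained.

```python
import numpy as np

def bilinear_sample(f, ys, xs):
    """Bilinearly sample the 2D map f at float coordinates (ys, xs), clamped to borders."""
    H, W = f.shape
    ys, xs = np.clip(ys, 0, H - 1), np.clip(xs, 0, W - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = ys - y0, xs - x0
    return ((1 - wy) * (1 - wx) * f[y0, x0] + (1 - wy) * wx * f[y0, x1]
            + wy * (1 - wx) * f[y1, x0] + wy * wx * f[y1, x1])

def deform_sample(f, offsets, mod):
    """Sample every location at its learned offset and scale by the modulation mask."""
    H, W = f.shape
    gy, gx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    return mod * bilinear_sample(f, gy + offsets[..., 0], gx + offsets[..., 1])
```

With zero offsets and a unit modulation mask this degenerates to the identity, while spatially varying offsets let each output location gather context from object-specific positions, which is the property exploited for alignment.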

2) Cascaded Decoders
The decoders share identical architectures. They are optimized to upsample the shared features and yield denoised and deblurred outputs (D2) sequentially. As shown in [46], transposed convolutions often generate checkerboard artifacts. To overcome this problem during feature upsampling, many studies have resorted to bilinear upsampling followed by convolution [35], [47]. Although we confirm that bilinear upsampling eliminates the artifacts, it leads to a loss of spatial information. Therefore, we resort to convolutional channel-wise expansion followed by pixel shuffling [48], denoted PS, to reduce gridding artifacts and preserve spatial details; each decoder interleaves such PS upsampling layers with convolutions. Assuming two such instantiations for denoising and deblurring, D_den and D_deb, the intermediate denoised frame k̂_t and the restored output frame ŷ_t are produced sequentially. We utilize skip connections from the encoder to the decoders to preserve spatial information and facilitate training, as is common in UNet-based [49] methods. Instead of concatenating the encoding channels with both decoders, we restructure the gradient flow by dissecting the former, say f ∈ R^{H×W×C}, into two groups f_den, f_deb ∈ R^{H×W×C/2}, each specialized for its decoder's task, as illustrated in Fig. 1. Likewise, f_den and f_deb receive task-specific gradients in addition to the shared gradients. As a result, f_den focuses on the global noise distribution, whereas f_deb is specialized in recovering the blur-free frame.
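The sub-pixel upsampling step can be reproduced in a few lines. The NumPy sketch below rearranges channels in the standard pixel-shuffling fashion of [48], for a single image without the batch dimension.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r^2, H, W) -> (C, H*r, W*r) sub-pixel upsampling."""
    c_r2, H, W = x.shape
    C = c_r2 // (r * r)
    x = x.reshape(C, r, r, H, W)    # split the channel dim into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (C, H, r, W, r)
    return x.reshape(C, H * r, W * r)

x = np.arange(16, dtype=float).reshape(4, 2, 2)  # C*r^2 = 4 channels, r = 2
y = pixel_shuffle(x, 2)                          # one 4x4 output channel
```

Because every output pixel is copied from exactly one learned channel value, no spatial information is discarded, in contrast to bilinear upsampling.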

V. EXPERIMENTS
In this section, we present the experiments that (i) compare the performance of R2-D4 with state-of-the-art video deblurring methods and investigate their robustness at different noise levels, (ii) assess the impact of the proposed blocks on the R2-D4 architecture, and (iii) assess the impact of the MTL configuration. All experiments use the "3ms24ms" version of BSD, which has the strongest level of blur. The evaluation protocol contains 60 training (30K pairs), 20 validation (10K pairs) and 20 test (15K pairs) sequences (Fig. 4). In the model considered in Eq. (1), the corrupted data z(y/α) are normalized back to the common range by multiplying with α. The generated shot noise distribution is typical of bright, dark, and low-light images for α equal to 0.5, 1.9, and 7.1, respectively.

A. LOSS FUNCTIONS
The R2-D4 parameters are derived by optimizing Eq. (3), where L is a weighted sum of ℓ2 squared norms over the intermediate denoised output k̂_t and the final restored output ŷ_t, complemented by a perceptual term. The definition of L_perceptual is adopted from [40], where φ_VGG denotes the VGG-19 features [50] extracted from the 3rd layer and C_φ, H_φ, W_φ denote the corresponding feature dimensions. The scalar values C, H, and W refer to the image channels, height, and width, respectively, and the weights are set to λ1 = 0.6 and λ2 = 0.01.
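The overall objective can be sketched as below. The exact placement of the weights λ1 and λ2 across the stage losses is an assumption of this sketch (the paper's equations specify it), and `phi` is a toy stand-in for the frozen VGG-19 feature extractor of [40].

```python
import numpy as np

def restoration_loss(k_hat, k, y_hat, y, phi, lam1=0.6, lam2=0.01):
    """Weighted sum of squared-l2 stage losses plus a perceptual term (sketch)."""
    l_den = np.mean((k_hat - k) ** 2)             # first-stage (denoised) output
    l_deb = np.mean((y_hat - y) ** 2)             # second-stage (deblurred) output
    l_perc = np.mean((phi(y_hat) - phi(y)) ** 2)  # feature-space discrepancy
    return l_den + lam1 * l_deb + lam2 * l_perc

phi = lambda img: img[::2, ::2]  # hypothetical stand-in for VGG-19 features
zero = restoration_loss(np.ones((4, 4)), np.ones((4, 4)),
                        np.ones((4, 4)), np.ones((4, 4)), phi)
```

When predictions match the targets, every term vanishes; the small λ2 keeps the perceptual term a regularizer rather than the dominant objective.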

B. METHODS
First, we examine the performance of a naive system in which two methods operate sequentially: FastDVDnet [13] trained for denoising, followed by STFAN [10] for deblurring. Second, we compare R2-D4 with state-of-the-art models: STFAN [10], ESTRNN [12] with 15 blocks and only past frames, and CDVD-TSP [30]. In our ablation study, we investigate the effectiveness of our MTL setup by comparing it to R2-D3, which uses only a single decoder. Subsequently, we verify the impact of the proposed blocks on our feature alignment module with R2-D4−, defined as R2-D4 with: (i) MECA substituted with ECA; (ii) MS-RDB modules substituted with simple RDB modules, thus retaining the number of GFLOPs. Next, we reduce the number of channels in the decoders and the fusion module, thereby obtaining the reduced but more computationally efficient "small" and "medium" R2-D4 variants.

C. SETUP
Our experiments are performed with PyTorch on an Nvidia Tesla V100 for 250 epochs. Adam [51] is used as the optimizer with a learning rate of 1.5 × 10^−4 decayed to 10^−6 via the cosine annealing strategy [52]. The networks are trained with sequences of 30 frames and a batch size of 1. The frames are randomly augmented with horizontal and vertical flips. Experiments for the state-of-the-art methods follow the official, publicly available implementations.
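The learning-rate schedule can be written as a one-line function; this is the standard cosine annealing form of [52] with the endpoint values stated above, assuming a single annealing cycle without warm restarts.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1.5e-4, lr_min=1e-6):
    """Decay the learning rate from lr_max to lr_min over total_steps."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

The rate starts at lr_max, decays slowly at first, fastest mid-training, and flattens out at lr_min, which tends to stabilize the final epochs.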

D. RESULTS
The naive approach is not trained end-to-end and thus oversmooths the input frames, achieving a PSNR of 28.40 and an SSIM of 0.850 in the severe noise setting. The results of the end-to-end methods are listed in Table 1. Interestingly, our experiments show that deblurring methods bear some noise-removal capacity, although R2-D4 performs better than STFAN and ESTRNN in both PSNR and SSIM. Moreover, it performs higher in PSNR and on par in SSIM with the computationally expensive, cascaded version of CDVD-TSP(2), which performs two passes over the corrupted frames and uses five input frames. As shown in Table 1, the performance increases over the compared methods across all levels of noise. The second decoder and the proposed blocks clearly contribute to the performance gains, increasing the mean PSNR by 0.19 dB and 0.15 dB compared to R2-D3 and R2-D4−, respectively. Last, Fig. 5 shows that while the small R2-D4 variant has 30% fewer GFLOPs than ESTRNN, it performs better than both STFAN and ESTRNN.

Fig. 6. Qualitative results. The frames are normalized to the same range. In the zoomed areas, red and green rectangles highlight artifacts and more accurate reconstructions, respectively. The first, second, third, and fourth rows were generated with severe, severe, moderate, and low noise, respectively. Column (b) demonstrates the varying illumination conditions under which the noise was generated, whereas column (c) shows patches normalized to the common image scale.

R2-D4 benefits from accurate feature alignment under strong noise and recovers fine-grained frame details (see Fig. 6). One can observe that STFAN often fails to align features, producing hallucinations, as seen in the gas tube (top row) and in the fence (middle row). For the same examples, ESTRNN tends to oversmooth the output. CDVD-TSP performs better but tends to yield piecewise constant artifacts despite its larger complexity, which is visible in the fence example. R2-D4 performs implicit feature alignment and dynamically adapts the offsets over time, as illustrated in Fig. 7.

Fig. 7. (a) Corrupted frame x_{t−2}; (b) corrupted frame x_t; (c) restored frame ŷ_t; (d) offsets at t−2; (e) offsets at t.

The top row of Fig. 7 illustrates the scenario of independently moving objects, whereas the bottom row depicts the uniform motion caused by camera movement. The offset variance is higher for the former; R2-D4 mines the spatio-temporal boundaries and aggregates the object-specific context. The spatial responses for the second case show a smaller variance, as the learned offsets exhibit similar directions.

VI. CONCLUSION
In this paper, we study dynamic scene video deblurring under strong noise. Although such acquisition settings arise frequently in practice, the problem is challenging and new in the deep learning literature. We demonstrate that state-of-the-art deblurring methods have some denoising capacity, but the proposed R2-D4 method outperforms them owing to an MTL-inspired, cascaded yet efficient architecture enhanced with MS-RDM modules. Future research aims to bridge the gap between synthetically generated and real datasets with raw video sequences of dynamic scenes with natural noise.
EFKLIDIS KATSAROS received the B.Sc. degree in Mathematics, from the Aristotle University of Thessaloniki, Greece in 2016 and the M.Sc. degree in Data Science: Statistical Science, cum laude, from Leiden University, the Netherlands, in 2019. His main research interests lie in computer vision and machine learning. Since 2020, he has been a researcher at the Gdańsk University of Technology, Poland, where he simultaneously pursues a Ph.D. degree in multi-task learning for computer vision in medical applications at the Department of Biomedical Engineering.
PIOTR K. OSTROWSKI received the B.Eng. degree in control engineering and robotics in 2017, and the M.Sc. degree in Computer Science in 2018 from the Gdańsk University of Technology. He is a Ph.D. candidate at the Gdańsk University of Technology. His research interests include machine learning and computer vision with an emphasis on efficient video processing.
DANIEL WĘSIERSKI is an assistant professor at the Gdańsk University of Technology. He received the M.Sc. degree from the Gdańsk University of Technology, Poland, in 2007 for his thesis on stereo-vision robot gripping systems that he developed at ThyssenKrupp in Bremen, Germany. He then developed vision and robot systems for testing car control units at Volkswagen R&D in Wolfsburg, Germany. He received his Ph.D. degree from Télécom SudParis, France, in 2013 for developing visual tracking algorithms of flexible and articulated objects. He is a recipient of the best paper awards at MICCAI CARE workshop in 2015 and 2017 for vision-based tracking algorithms of surgical instruments. He has led application-oriented national R&D projects and has participated in two European projects. He has extensive experience in computer vision, image processing, and teaching machines on uncertain data with a focus on medical applications, including minimally invasive surgery, dentistry, and electroencephalography. His basic research interests include machine learning under uncertainty, denoising, optimization, and model-based vision.