Parameter Reduction of Kernel-Based Video Frame Interpolation Methods Using Multiple Encoders

Video frame interpolation synthesises a new frame from existing frames. Several approaches have been devised to handle this core computer vision problem. Kernel-based approaches use an encoder-decoder architecture to extract features from the inputs and generate weights for a local separable convolution operation which is used to warp the input frames. The warped inputs are then combined to obtain the final interpolated frame. The ease of implementation of such an approach and favourable performance have enabled it to become a popular method in the field of interpolation. One downside, however, is that the encoder-decoder feature extractor is large and uses a lot of parameters. We propose a Multi-Encoder Method for Parameter Reduction (MEMPR) that can significantly reduce parameters by up to 85% whilst maintaining a similar level of performance. This is achieved by leveraging multiple encoders to focus on different aspects of the input. The approach can also be used to improve the performance of kernel-based models in a parameter-effective manner. To encourage the adoption of such an approach in potential future kernel-based methods, the approach is designed to be modular, intuitive and easy to implement. It is implemented on some of the most impactful kernel-based works such as SepConvNet, AdaCoFNet and EDSC. Extensive experiments on datasets with varying ranges of motion highlight the effectiveness of the MEMPR approach and its generalisability to different convolutional backbones and kernel-based operators.


I. INTRODUCTION
Video frame interpolation involves the synthesis of a new intermediate frame from a set of given inputs. It is a popular computer vision task that has many applications in areas such as video coding, slow-motion generation, broadcast and gaming.

Manuscript received 29
Issa Khalifeh is with the Multimedia and Vision Group, School of Electrical and Electronic Engineering, Queen Mary University of London, E1 4NS London, U.K., and also with BBC Research and Development, W12 7TQ London, U.K. (e-mail: i.khalifeh@qmul.ac.uk).
Luka Murn is with BBC Research and Development, W12 7TQ London, U.K., and also with the Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, D09 V209 Ireland (e-mail: luka.murn@bbc.co.uk).
Ebroul Izquierdo is with the Multimedia and Vision Group, School of Electrical and Electronic Engineering, Queen Mary University of London, E1 4NS London, U.K. (e-mail: ebroul.izquierdo@qmul.ac.uk).
This article has supplementary material provided by the authors and color versions of one or more figures available at https://doi.org/10.1109/JETCAS.2024.3395418.
Digital Object Identifier 10.1109/JETCAS.2024.3395418

Fig. 1. A comparison of the different models that can be obtained using our proposed MEMPR multi-encoder approach on the Vimeo90K test set. With our proposed approach, parameters can be reduced significantly and a smaller kernel size can be used in many cases, resulting in memory reductions. Presented are the different performances obtainable for SepConvNet when the MEMPR multi-encoder approach is applied. The general structure XE-YLevel-KSZ corresponds to X Encoders - Y Levels - Kernel Size Z.
Classical interpolation methods often rely on an optical flow estimate of the motion between the input frames and use this information to warp the input frames. Usually, the interpolated frame is obtained through combining the warped inputs and applying post-processing. Interpolation algorithms can fail when presented with occlusions, brightness changes and large motion. Research has generally focused on improving the robustness of interpolation methods to these failure cases.
Many deep learning-based methods also adopt a similar pipeline. Optical flow is first estimated using a Convolutional Neural Network (CNN), which is then used to warp the frames. Many optical flow-based methods utilise a synthesis block to combine the warped representations and obtain the final interpolated frame. Others estimate an occlusion mask and fuse the warped representations accordingly.
Another area of frame interpolation does not involve the explicit synthesis of motion information. CAIN [1] applies pixel shuffle [2] to mix the spatial and channel information of the inputs and then applies a series of Residual Blocks (ResBlocks) [3] with channel attention. A popular approach is the use of spatially adaptive convolutional kernels (AdaConv), first proposed by Niklaus [4]. A convolutional backbone, typically following a UNet [5] architecture, extracts features from the input frames. These features are then used to generate a set of kernels which act as weights for a convolution-based warping operation that convolves image patches with the predicted kernels.
Subsequent work builds upon the AdaConv approach by reducing memory usage in SepConvNet [6]. Further works increase the range of motion covered by the spatially adaptive kernels by introducing deformable convolutions in Adaptive Collaboration of Flows (AdaCoF) [7], Deformable Separable Convolution (DSepConv) [8] and Enhanced Deformable Separable Convolution (EDSC) [9].
However, one key consideration is that the parameter usage of such methods remains high, which can cause issues relating to model storage. Acknowledging the large number of parameters typically present in such networks, EDSC significantly reduces parameter count through replacing many of the convolutional blocks in the backbone with Heterogeneous Convolutions [10].
In this work, we adopt a different parameter-reduction approach based on our workshop paper [11] that leaves the building blocks of the backbone intact to ensure ease of implementation.The proposed method is termed Multi-Encoder Method for Parameter Reduction (MEMPR).
The MEMPR approach is implemented on SepConvNet [6], AdaCoFNet [7] and EDSC [9]. Applying it to these different kernel-based approaches demonstrates the transferability and generalisability of the proposed parameter reduction approach to methods with different convolutional backbones and different kernel-based operations. Our proposed work is orthogonal to other streams of research such as pruning and quantisation, as well as to parameter reduction approaches which reduce the complexity of the convolutional operation.
The parameter count of the SepConvNet, AdaCoFNet and EDSC methods is first reduced by removing the deeper parameter-heavy blocks in the encoder and decoder, which account for the majority of the parameters in the network. As the removal of deeper features can result in a drop in performance, this is compensated for by the introduction of multiple shallow encoders. The encoder features are combined at each level and passed as a skip connection to the decoder. The approach is inspired by the field of deep image compositing [12], which uses two encoders to fuse the foreground and background images.
Instead of using a deeper encoder and decoder, which have significantly more parameters, MEMPR uses multiple shallow encoders to extract features from the inputs at each corresponding level. A shallow decoder is then used to decode the combined encoder features. This is the main novelty of the proposed modular approach: it leverages the ability of separate convolutional encoders to learn different characteristics of the input. This is discussed further in Section V-A4. Multiple encoders can, therefore, be used to compensate for the lack of deeper features. Another novelty of the proposed work is the combination of extracted features at the decoder level as opposed to the encoder level. This enables more independent representations of the input to be learned by each encoder. We also show that the proposed methodology is not model-specific and is generalisable to other kernel-based architectures.
The proposed approach is designed to be intuitive and easy to implement to enable its adoption in other fields of work. The removal of redundant features using the proposed methodology could also aid model pruning techniques: as the model size is smaller, there are likely to be fewer layers that could be pruned, potentially saving time. Models using our proposed MEMPR approach can exceed the performance of the baseline methods whilst significantly reducing parameter usage, as depicted in Fig. 1.
Datasets with different ranges of motion are considered for evaluation in order to understand the impact and effectiveness of MEMPR. The performance of MEMPR models can exceed that of the reference baselines on these sets with fewer parameters. A MEMPR model with a smaller kernel size can also be used to obtain performance higher than that of an original deeper model implemented with a larger kernel size, which reduces memory usage requirements.
The contributions of this work are as follows:
• A novel, modular multi-encoder approach is used to leverage the ability of shallow encoders to learn different features from the input and compensate for the removal of deeper layers.
• We achieve significant parameter reduction of up to 85% for SepConvNet and AdaCoFNet methods and more than 50% for EDSC using our easy-to-implement approach.
• In addition to parameter reduction, a multi-encoder approach with a smaller kernel size can be used instead of a model with a larger kernel size, thereby reducing memory usage.
• The approach is generalisable to three different kernel-based operations and two different backbones, with scope to be transferable to other backbone architectures.
The paper is organised as follows. Section II presents the related work in the video frame interpolation field. In Section III, a background on the SepConvNet, AdaCoFNet and EDSC methods is first presented in Section III-A and an overview of our proposed MEMPR method is described in Section III-B. The training and evaluation procedures are described in Sections IV-A and IV-B respectively. Ablations are conducted in Section V and the conclusion is presented in Section VII.

II. RELATED WORK
Before the introduction of deep learning, approaches based on flow estimation [13] and phase-based interpolation [14] were popular. Although such methods could work well, the proliferation of higher-resolution sequences meant that the handling of large motion and occlusions could be an issue.

A. Optical Flow-Based Interpolation
Many deep learning methods adopt the general framework of traditional optical flow-based interpolation methods. Generally, optical flow-based methods first estimate the motion between the input frames and then adopt the linear motion assumption to scale the flow to obtain the bilateral flows at time t. The input frames are then warped according to the bilateral flows. Typically, further processing is then applied to the warped input frames by adopting a synthesis network. Some approaches adopt backwards warping [15], [16] and others adopt forward warping [17], [18], [19]. Generally, the linear motion assumption is followed. Recent work looks at asymmetric motion prediction [20] and using a quadratic motion assumption [21], which could allow for the improved handling of complex motions.
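As a concrete illustration of the linear motion assumption described above, the forward flow between the inputs can be scaled to produce the bilateral flows at time t. This is a minimal NumPy sketch; the function name is ours and bilinear warping itself is omitted for brevity.

```python
import numpy as np

# Hedged sketch: under the linear motion assumption, the flow from frame 0
# to frame 1 is scaled to obtain the bilateral flows at time t, which are
# then used to warp the input frames towards the intermediate instant.
def bilateral_flows(flow_0_to_1, t=0.5):
    """flow_0_to_1: (H, W, 2) optical flow from frame 0 to frame 1."""
    flow_t_to_0 = -t * flow_0_to_1          # flow from time t back to frame 0
    flow_t_to_1 = (1.0 - t) * flow_0_to_1   # flow from time t on to frame 1
    return flow_t_to_0, flow_t_to_1
```

With t = 0.5 the two bilateral flows are equal and opposite halves of the estimated motion, which is exactly where the linear assumption breaks down for accelerating objects, motivating the quadratic and asymmetric variants [20], [21].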

B. Flow-Free Video Frame Interpolation
Flow-free methods do not explicitly impose a linear motion constraint, which could allow for the better synthesis of complex scenes. Reference [22] was the first to propose a direct synthesis UNet-based architecture. However, the approach struggled with occlusions and undesirable artefacts. Adaptive convolutions for frame interpolation (AdaConv) [4] instead proposes the prediction of 2D spatially adaptive kernels for each input frame. A UNet architecture predicts the weights for the kernel-based operators, which are used in the adaptive convolutional operation applied to the inputs. Although this approach produced more favourable results than [22], its memory utilisation was high and the authors adopted a shift-and-stitch approach to interpolate higher resolution sequences.
SepConvNet [6] reduces this memory burden by predicting two one-dimensional kernels to convolve each input frame. However, the network utilises a kernel size of 51 × 51, which is inefficient if the range of motion within the sequence is small. Adaptive Collaboration of Flows (AdaCoF) [7] proposes introducing deformable offsets to the kernel-based interpolation operator and fusing the convolved frames using an occlusion mask. Similarly, Deformable Separable Convolution (DSepConv) [8] and Enhanced Deformable Separable Convolution (EDSC) [9] propose the adoption of a deformable convolution operation. EDSC uses the same deformable convolutional operation as DSepConv but predicts an image residual and reduces parameter usage through the adoption of Heterogeneous Convolution (HetConv) [10], which replaces each standard 3 × 3 convolution, in which all filters use 3 × 3 kernels, with a layer where a 1/P fraction of the kernels are 3 × 3 and the remaining (1 − 1/P) fraction are 1 × 1. Shi et al. [23] propose the Video Frame Interpolation Transformer (VFIT), which integrates a multi-scale encoder using a modified Swin transformer [24] to extract features from the input in the 3D domain, together with a 3D decoder. VFIT uses the AdaCoF operation as the local spatially adaptive convolution. However, the runtime and memory usage of such an approach are detrimental to real-life applications.
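The parameter saving from the HetConv substitution is simple to quantify. The short sketch below counts the weights of a standard 3 × 3 convolutional layer against a HetConv layer in which each filter applies 3 × 3 kernels to a 1/P fraction of its input channels and 1 × 1 kernels to the rest; the function names and the 64-channel example are ours, not from the cited works.

```python
# Weight count of a standard 3x3 convolution: every output filter has a
# 3x3 kernel for every input channel.
def conv3x3_params(c_in, c_out):
    return c_out * c_in * 3 * 3

# Weight count of a HetConv layer [10]: per filter, c_in/P input channels
# get 3x3 kernels and the remaining channels get 1x1 kernels.
def hetconv_params(c_in, c_out, p):
    k3 = c_in // p        # channels convolved with 3x3 kernels
    k1 = c_in - k3        # channels convolved with 1x1 kernels
    return c_out * (k3 * 9 + k1 * 1)

standard = conv3x3_params(64, 64)   # 36864 weights
het = hetconv_params(64, 64, 4)     # 64 * (16*9 + 48*1) = 12288 weights
print(standard, het, round(het / standard, 3))
```

For a 64-to-64-channel layer with P = 4, the HetConv layer uses one third of the weights of the standard layer, which is the kind of saving EDSC exploits across its backbone.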
Choi et al. [1] propose a flow-free method, CAIN, which applies pixel shuffle [2] to the inputs followed by residual groups with integrated channel attention.
TTVFI [25] introduces a trajectory-aware transformer for frame interpolation. However, despite the introduction of a transformer mechanism, performance can still suffer from the common interpolation failure points. Recently, [26] extracts multi-scale features and aligns them in a coarse-to-fine manner. The output is then fused using a predicted mask and passed to a reconstruction module. Reference [26] also proposes a texture consistency loss which relaxes the constraint that the output has to be synthesised exactly at the intermediate time t.

C. Reducing Network Complexity in Interpolation
Many methods adopt different approaches to reduce the complexity of models. CDFI [27] first applies sparsity pruning to AdaCoFNet to reduce the number of parameters. A contextual pyramid is introduced, and features are warped using the AdaCoF operation and fused using a GridNet architecture [28]. Although the number of parameters was reduced relative to AdaCoFNet, the memory utilisation and runtime of CDFI are higher.
EDENVFI [29] proposes a transformer-based interpolation architecture that achieves improved runtime and memory usage relative to VFIT. The network uses a shallow 2-level (32, 64 channel) pyramid vision transformer (PVT) encoder, a 3-level (32, 64, 128 channel) convolutional encoder and a shallow 2-level (64, 128 channel) convolutional decoder as the feature extractor. Transformer encoder and convolutional encoder features are combined at the relevant levels and passed to the decoder, which generates the kernels for the EDSC operation.
IFRNet [15] and UPRNet [16] achieve a low-complexity network with favourable runtime through network design.
Some approaches highlight that simple sequences with small motion could be interpolated using a simpler network. CAIN-SD [30] proposes reducing the complexity of CAIN by adopting a model with variable scale and depth depending on the complexity of the input sequences. Reference [31] proposes an approach which selects RIFE [32] or VFIFormer [33] depending on the difficulty of the sequences. Others combine network design with bespoke methods, such as fLDR-VFI [34], which applies a tuned principal component analysis approach to reduce the dimensionality of the inputs in the first block, reducing memory usage relative to XVFI [35].
Other lines of work, such as ResNeXt [36], have explored the concept of a multi-branch architecture for enhancing classification model performance. To the best of our knowledge, our work is the first to adopt the concept of multiple encoders to achieve the goal of parameter reduction for kernel-based frame interpolation. ResNeXt does not make use of a decoder architecture and thus does not tackle the impact of passing different combined learned features as a skip connection, nor the use of combined encoder features for parameter reduction. Additionally, the concept behind our proposed work, leveraging multiple shallow encoder features decoded by a shallow decoder to compensate for deeper, more intrinsic features, is not explored by ResNeXt.
Our approach, the Multi-Encoder Method for Parameter Reduction (MEMPR), is orthogonal to many of these lines of research, with the potential to apply the findings of some of these approaches to MEMPR. MEMPR is an easy-to-implement approach which can achieve significant parameter reductions of up to 85% with minimal changes in performance on the test sets.

A. Background
Given two input frames I_0 and I_1, the intermediate frame I_0.5 is required to be synthesised. To obtain the intermediate frame, kernel-based interpolation methods generally apply a learned spatially adaptive convolutional operation τ to obtain the convolved input frame I′_i at time instance i, where i ∈ {0, 1}. The weights for this operation are obtained using an encoder-decoder architecture which extracts features φ_64 from the input frames. These extracted features φ_64 are used to generate the kernels which act as weights for the separable convolution operation.
1) SepConvNet: One of the simplest approaches is that adopted by SepConvNet, which uses horizontal and vertical kernels K_v and K_h for each input frame and convolves the patches P of the input with these kernels using a separable convolutional operation.
In SepConvNet, each convolved frame is obtained through the separable convolution operation τ represented in Equation 2.
The final interpolated frame can be obtained by combining the convolved frames.
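The separable operation above can be sketched directly. The NumPy snippet below is illustrative only: at each output pixel, the per-pixel vertical and horizontal 1D kernels form a rank-1 2D kernel, which is applied to the local patch of the padded input; the function name and edge padding are our assumptions, not SepConvNet's exact implementation.

```python
import numpy as np

# Hedged sketch of SepConvNet-style separable adaptive convolution:
# the outer product of the per-pixel 1D kernels k_v and k_h gives a
# rank-1 2D kernel that is applied to the local patch around each pixel.
def sepconv_warp(frame, kv, kh):
    """frame: (H, W) image; kv, kh: (H, W, n) per-pixel 1D kernels."""
    h, w = frame.shape
    n = kv.shape[-1]
    r = n // 2
    padded = np.pad(frame, r, mode="edge")
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + n, x:x + n]        # n x n local patch
            kernel = np.outer(kv[y, x], kh[y, x])   # rank-1 2D kernel
            out[y, x] = np.sum(kernel * patch)
    return out
```

If both 1D kernels are one-hot, the operation copies a (possibly shifted) pixel, which is how the kernels implicitly encode motion; smooth kernels additionally blend neighbouring pixels. Predicting two length-n vectors per pixel instead of a full n × n kernel is precisely SepConvNet's memory saving over AdaConv.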
2) AdaCoF: AdaCoF estimates horizontal and vertical weights W as well as deformable offsets (α, β) and an occlusion mask M. The deformable offsets can have a different initial starting point, set by the dilation values dk and dl.
where i is the time step, i ∈ {0, 1}. The final output frame can be obtained using the generated occlusion mask M.
3) EDSC: EDSC also uses deformable convolutions in its kernel-based approach; however, the implementation differs. A 3-channel bias is also synthesised, which acts as an image residual. For each patch P, the horizontal and vertical offsets are applied. The patch is then scaled using the modulation mask m to obtain the resampled patch P′.
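The AdaCoF sampling and occlusion-aware fusion described above can be sketched as follows. This is a simplified, non-differentiable NumPy illustration under our own assumptions: offsets are rounded to nearest-neighbour samples and clamped at image borders, whereas the actual operator uses bilinear sampling so that gradients can flow through the offsets.

```python
import numpy as np

# Hedged sketch of the AdaCoF operator: each output pixel is a weighted
# sum of F x F input samples taken at dilated positions perturbed by the
# learned offsets (alpha, beta).
def adacof_warp(frame, weights, alpha, beta, d=1):
    """frame: (H, W); weights, alpha, beta: (H, W, F, F)."""
    h, w = frame.shape
    f = weights.shape[-1]
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k in range(f):
                for l in range(f):
                    # dilated grid position plus learned per-pixel offset
                    sy = int(round(y + d * k + alpha[y, x, k, l]))
                    sx = int(round(x + d * l + beta[y, x, k, l]))
                    sy = min(max(sy, 0), h - 1)  # clamp to image bounds
                    sx = min(max(sx, 0), w - 1)
                    acc += weights[y, x, k, l] * frame[sy, sx]
            out[y, x] = acc
    return out

# Occlusion-aware blending of the two warped frames with mask M in [0, 1].
def fuse(warped0, warped1, mask):
    return mask * warped0 + (1.0 - mask) * warped1
```

With F = 1 and zero offsets the operator reduces to the identity, while non-zero offsets let even a small kernel reference distant pixels, which is how AdaCoF extends the motion range of fixed-window kernels.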
The convolved input at each time step i can be obtained by convolving the horizontal and vertical kernels (K_h^i and K_v^i) with the resampled patch, as in Equation 8, where ∗ represents the convolution operation.
The final interpolated frame is obtained by combining the convolved frames with the generated image bias b, as in Equation 9. The bias functions as an image residual and is obtained from the kernel sub-network.

B. Proposed Multi-Encoder Approach
An overview of our proposed approach is presented in Fig. 2. The model depth of the UNet feature extractor of the kernel-based interpolation methods is first reduced. This is achieved by removing the deeper layers of the encoder and decoder, which contribute the majority of the feature extractor parameters. The removal of deeper layers, which can contribute to the extraction of more intricate features, can result in a drop in performance. To mitigate this drop whilst still maintaining a low parameter count, multiple shallow encoders are introduced in our parameter reduction approach, MEMPR. The inspiration behind the use of multiple encoders lies in the field of deep image compositing [12], in which the foreground and background images are fused through a dual-encoder architecture.
Our approach adopts multiple encoders and utilises them for the purpose of parameter reduction. The rationale behind adopting a multi-encoder architecture is that each encoder can learn specific features from the input which could improve the synthesis of the interpolated frame. Convolutional architectures often contain redundancies, as evidenced by the popularity of the pruning research field [37], [38], [39]. As such, the removal of deeper features and the introduction of shallow encoders could enable better use of shallow, less parameter-intensive layers to learn characteristics from the inputs.
For each encoder n with output channels f, the extracted features θ_f^n are combined to obtain the fused encoder features. The feature combination operation at each feature level is represented in Equation 10.
The fused features are then combined with the decoder features γ_f of the same channel dimensions using a skip connection. The combined features ϕ_f are obtained as in Equation 11.
For the final, coarsest pooled features ρ at the deepest encoder feature level, a similar approach to Equation 10 is adopted. The coarsest pooled features are combined to obtain the direct decoder input D, as in Equation 12. D is input to the first decoder block.
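The skip-connection fusion described around Equations 10-12 can be sketched in a few lines. This NumPy sketch makes two explicit assumptions of ours: encoder features at a level are combined by element-wise summation, and the combined result is concatenated channel-wise with the decoder features; the original equations may use a different combination.

```python
import numpy as np

# Hedged sketch of the MEMPR multi-encoder skip connection.
# Assumption: encoder features at the same pyramid level are summed
# element-wise (Equation 10 analogue).
def combine_encoder_features(encoder_feats):
    """encoder_feats: list over N encoders of (C, H, W) feature maps."""
    return np.sum(encoder_feats, axis=0)

# Assumption: the combined encoder features are concatenated channel-wise
# with the decoder features of the same channel dimension (Equation 11
# analogue) before being passed to the next decoder block.
def skip_connection(encoder_feats, decoder_feat):
    combined = combine_encoder_features(encoder_feats)
    return np.concatenate([combined, decoder_feat], axis=0)

feats = [np.full((32, 8, 8), 2.0) for _ in range(3)]  # three 32-ch encoders
dec = np.zeros((32, 8, 8))
print(skip_connection(feats, dec).shape)  # (64, 8, 8)
```

Note that summation keeps the channel count, and therefore the decoder width, independent of the number of encoders, which is what allows encoders to be added without growing the decoder's parameters.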
The approach is designed for ease of implementation and is applied to AdaCoFNet, SepConvNet and EDSC, which use different kernel-based operators. The approach is applied at different model depths to show its ability to be deployed for parameter reduction purposes as well as to potentially improve overall performance.

IV. EXPERIMENTAL SETUP

A. Training
All models are trained for 100 epochs with an initial learning rate of 0.001. The learning rate decays every 20 epochs by a factor of 0.5. The AdaMax optimiser is used. The Vimeo90K triplets set [40] is used for training. Different from our workshop paper [11], the model is trained on the 'train' split of Vimeo90K and not the entire Vimeo90K set. This follows the commonly used training procedures of many other works [15], [18], [26], [41]. The 448 × 256 images from the train set are randomly cropped to 256 × 256 and then augmented through horizontal and vertical flipping and changing the order of frames. The L1 loss computed between the network output and the ground truth is used for training, L1 = ||I_0.5 − I_GT||_1, where I_0.5 is the interpolated output from the network and I_GT is the ground truth frame.
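The schedule and loss above are straightforward to express; the sketch below implements the stated step decay (factor 0.5 every 20 epochs from 0.001) and the L1 training loss. Function names are ours.

```python
import numpy as np

# Step learning-rate schedule stated in the text: base 1e-3, halved
# every 20 epochs over the 100-epoch training run.
def learning_rate(epoch, base=1e-3, decay=0.5, step=20):
    return base * decay ** (epoch // step)

# L1 loss between the interpolated output and the ground-truth frame.
def l1_loss(pred, gt):
    return np.mean(np.abs(pred - gt))
```

For example, the learning rate is 0.001 for epochs 0-19, 0.0005 for epochs 20-39, and so on down to 6.25e-5 in the final 20 epochs.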
Each model is trained five times to ensure that any outliers in performance are detected. The model that achieves the closest performance to the median on Middlebury [42], Vimeo90K [26], DAVIS [43] and UCF101 [44] is selected. This model is selected by computing the Max-Normalised Mean Distance (MNMD) for each set and finding the model with the smallest MNMD, where λ_i is the PSNR (dB) value of each test run, µ is the mean and MAX is the maximum PSNR value across all runs on the set.
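One plausible reading of the MNMD selection rule, consistent with the symbols defined above, is sketched below: for each run, the per-set distance of its PSNR λ from the per-set mean µ is normalised by the per-set maximum PSNR and averaged over the test sets. The exact formula is not reproduced in this text, so this aggregation is our assumption.

```python
import numpy as np

# Hedged sketch of Max-Normalised Mean Distance (MNMD) model selection.
# psnr: (runs, sets) array of PSNR (dB) values, one row per training run.
def mnmd(psnr, run):
    mu = psnr.mean(axis=0)   # per-set mean across the five runs
    mx = psnr.max(axis=0)    # per-set maximum across the five runs (MAX)
    return np.mean(np.abs(psnr[run] - mu) / mx)

# Pick the run whose per-set PSNRs sit closest to the mean behaviour.
def select_run(psnr):
    return int(np.argmin([mnmd(psnr, r) for r in range(psnr.shape[0])]))
```

The intent of the selection is to report a typical run rather than a lucky one; a run that happens to score highest on one set but poorly on others is penalised on every set where it strays from the mean.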
Following other works [18], [19], [35], [41], which test on sequences with large motion, we also evaluate on the Xiph-2K (2048 × 1080 resolution) and Xiph-1K (1024 × 540) sets which consist of sequences containing large motion.Following the same procedure used to obtain Xiph-2K and 1K in [41], we create two other Xiph sets, Xiph-720P (1280 × 720) and 360P (640×360).This is to ensure that the impact of different motion magnitudes and frame resolutions can be adequately evaluated.As the sequences are identical but the resolution and range of motion varies, identifying the main causes in the difference in performance are likely to be clearer.As such, the mean range of motion covered in our evaluation on Xiph is from 3.80 to 12.43 pixels.The different ranges of motion for each sequence examined in this work can be seen in Table I.The mean motion magnitude values are intended to provide a guidance in terms of the nature of the sequences tested.The mean motion magnitudes are computed using a similar procedure to [46], however, we use PWC-Net 1 [47] as our motion estimator.

V. ABLATION STUDIES
A. Application of MEMPR Multi-Encoder Approach to AdaCoFNet

1) Impact of Multi-Encoder Approach With Different Kernel Sizes: The AdaCoFNet paper proposes two models: AdaCoFNet with a kernel size of 5 and AdaCoFNet+ with a kernel size of 11. In this section, we train multi-encoder 3-level models with kernel sizes (KS) 3, 5, 7, 9 and 11. The 3-level model is selected to be trained with different kernel sizes as it can achieve the largest reduction in parameters. In the KS5 configuration, it can achieve similar performance to the original, as observed in Table II. For comparison, the original 5-level model is also trained at kernel sizes 3, 5, 7, 9 and 11. Training with different kernel sizes allows for the evaluation of the effectiveness and impact of the multi-encoder approach with receptive fields of different sizes.

¹Implementation used is from https://github.com/sniklaus/pytorch-pwc
Generally, increasing the number of encoders results in an improvement in performance for all models, as evidenced in Table II. For a six-encoder 3-level kernel size 5 (KS5) model, performance is 0.448dB, 0.280dB, 0.180dB and 0.073dB higher on the Middlebury, Vimeo90K, UCF101 and DAVIS sets relative to the one-encoder 3-level model.
Compared to the original KS5 AdaCoFNet model, the six-encoder 3-level KS5 model uses 80.12% fewer parameters and active memory usage is 6.80% lower, whilst performance is 0.326dB, 0.178dB and 0.045dB higher on Middlebury, Vimeo90K and UCF101. Performance on DAVIS is equivalent. Despite the significant reduction in parameters, the runtime of such a model increases with the number of encoders used: the six-encoder model has a runtime that is 81.80% higher. As such, a model with fewer encoders could be used, such as the three-encoder model, which achieves higher or equivalent performance relative to the original on Middlebury, Vimeo90K and UCF101. This comes at a cost of reduced performance on DAVIS, where performance is 0.070dB lower than the original model. Parameters are 86.77% fewer, whilst runtime is 25.38% higher.
Increasing the kernel size generally leads to an improvement in performance for all models, with the original KS11 (AdaCoFNet+) model performing 0.418dB, 0.281dB and 0.149dB higher on the Middlebury, Vimeo90K and DAVIS sets relative to the corresponding AdaCoFNet (original KS5) model. Performance on UCF101 is not impacted significantly by the increase in kernel size due to the small motions observed in the set, which means that a smaller kernel size can adequately reference the pixels necessary for synthesising the intermediate frame.
For the multi-encoder architecture, a similar trend is observed on Middlebury, Vimeo90K and DAVIS, with an increase in the number of encoders leading to similar gains in performance across the different kernel size configurations. For example, relative to the one-encoder KS11 model, performance is 0.442dB, 0.316dB and 0.164dB higher on Middlebury, Vimeo90K and DAVIS with the six-encoder model. Limited changes in performance are seen on UCF101.
On the Xiph sets, the removal of the deeper 256 and 512-channel encoder and decoder blocks in the one-encoder 3-level configuration resulted in large drops in performance on Xiph-2K, 720P and 1K of 0.618dB, 0.185dB and 0.119dB in the KS5 configuration. Large drops in performance were also observed across different kernel sizes.
For Xiph-360P, limited changes in performance were observed, suggesting that the smaller motions present in this set can be handled well by a shallow UNet architecture. The introduction of a six-encoder MEMPR approach results in increases across all sets, with 0.459dB, 0.263dB, 0.245dB and 0.214dB increases on Xiph-2K, 720P, 1K and 360P relative to the one-encoder 3-level model. Relative to the original KS5 model, performance is 0.078dB, 0.127dB and 0.188dB higher with the six-encoder KS5 model on Xiph-720P, 1K and 360P.
Relative to the one-encoder 3-level model, increases in performance on the Xiph-2K, 720P, 1K and 360P sets were observed with the introduction of a multi-encoder architecture. Increases in kernel size also resulted in performance improvements.
In the KS3 configuration, equivalent performance relative to the original KS3 model on Xiph-2K was achieved. However, the original model generally achieved higher performance on Xiph-2K. This indicates that the multi-encoder approach cannot fully handle the range of motion present in Xiph-2K due to the inherent limitations of a shallow architecture.
With other kernel sizes, the multi-encoder approach achieved higher performance on Xiph-720P, 1K and 360P relative to the original model of the same kernel size, or even models with larger kernel sizes. On sets with a mean motion magnitude of 7.81 or less (refer to the Xiph mean motion magnitudes in Table I), a 3-level multi-encoder architecture is likely to be effective and a larger kernel size might benefit performance. The selection of a multi-encoder approach is therefore dependent on the intended level of performance, number of parameters, memory usage and runtime.
With the increase in kernel size, the impact of the multi-encoder approach on the large motions in Xiph-2K becomes more limited and in some cases, such as the KS11 configuration, there is negligible impact, despite increases being observed on the Xiph-720P, 1K and 360P sets.
2) Using the MEMPR Approach for Memory Usage Reduction on AdaCoFNet: One aspect to consider is that although performance improvements can be observed with a larger kernel size, memory usage increases significantly. As such, rather than adopting a model with a larger kernel size, a multi-encoder model with a smaller kernel size can achieve similar performance with significantly less memory usage.

TABLE II THE PERFORMANCE OF THE MEMPR MULTI-ENCODER APPROACH APPLIED TO THE ADACOFNET METHOD IN THE 3-LEVEL CONFIGURATION WITH DIFFERENT KERNEL SIZES (KS). THE PROPOSED MEMPR METHOD CAN ACHIEVE PERFORMANCE THAT MATCHES THE ORIGINAL MODEL THAT USES A HIGHER KERNEL SIZE ON MANY SETS. THE MEMPR APPROACH CAN THEREFORE BE APPLIED FOR PARAMETER AND MEMORY REDUCTION USES IN MANY CASES DEPENDING ON THE APPLICATION AND THE TARGET RANGE OF MOTION WITHIN THE SEQUENCE. PERFORMANCE IS MEASURED USING PSNR (DB). P. (M) REPRESENTS THE NUMBER OF PARAMETERS, IN MILLIONS, OF THE MODELS
The four-encoder KS5 model can achieve performance that is similar to a one-encoder KS9 model on Middlebury, Vimeo90K, UCF101 and DAVIS whilst runtime is 17.82% lower and memory usage is 52.89% lower. Similarly, a six-encoder KS5 model can achieve performance similar to a two-encoder KS9 model with similar reductions in runtime and memory usage.
Relative to the original KS9 model, a five-encoder KS7 model can achieve higher performance on Middlebury and Vimeo90K whilst the number of parameters is 81.93% lower and memory usage is 32.79% lower. The performance of this model relative to other comparable models is presented in Table IV. Runtime, however, is slightly higher. These experiments demonstrate that the multi-encoder model can compensate for the use of a larger kernel size and the memory costs associated with it. Similarly, the same five-encoder KS7 model can also achieve higher performance on these sets relative to the original KS11 model with 52.14% less memory. Alternatively, a four-encoder 3-level KS9 model could also be used.
One aspect to consider is that, despite performance being higher or similar on many sets, performance on Xiph-2K is lower than that of the original model with larger kernel sizes. The multi-encoder approach can therefore be deployed for parameter reduction as well as memory reduction purposes if a certain level of performance is desired. If equivalent or higher performance relative to an original model with a larger kernel size is required, applying the MEMPR approach to a 4-level architecture with different kernel sizes is likely to achieve this requirement of reduced memory usage and improved performance.
3) Impact of Model Depth on Performance: To investigate the impact of the proposed multi-encoder approach at different model depths, 3, 4 and 5-level multi-encoder MEMPR models are trained in the KS5 configuration. The one-encoder 4-level model (32, 64, 128, 256 channel blocks) achieves performance that is 0.157dB, 0.080dB and 0.113dB higher on the Middlebury, Vimeo90K and DAVIS sets relative to the one-encoder 3-level model. Relative to the original model, the one-encoder 4-level model achieves similar performance on Middlebury and Vimeo90K. However, there is a 0.060dB difference in performance on DAVIS, suggesting that a deeper UNet can improve the handling of more complex motions. At all model depths, performance on UCF101 was similar, which further highlights that a shallow architecture is sufficient for the motions present in this set. With the introduction of the multi-encoder approach, performance differences become less apparent. A three-encoder 3-level model can achieve 0.181dB and 0.105dB higher performance on Middlebury and Vimeo90K relative to a one-encoder 4-level model, despite using 51.10% fewer parameters and slightly less memory. Equivalent performance is achieved on DAVIS and UCF101.
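The parameter trade-off between one deeper encoder and several shallow encoders can be sketched with simple convolution-layer arithmetic. The sketch below assumes 3 × 3 convolutions, three convolutions per block and a 6-channel input (two concatenated RGB frames); the channel widths per level (32, 64, 128, 256) follow the configurations described above, while the decoder and kernel-estimation subnetworks, which also contribute to the totals reported in the tables, are omitted for brevity.

```python
def conv_params(c_in, c_out, k=3):
    # weights + biases of a single k x k convolution layer
    return c_in * c_out * k * k + c_out

def block_params(c_in, c_out):
    # one block = 3x(3 x 3 Conv + ReLU); ReLU has no parameters
    return conv_params(c_in, c_out) + 2 * conv_params(c_out, c_out)

def encoder_params(channels, c_in=6):
    # chain the blocks: each level widens the channel dimension
    total = 0
    for c_out in channels:
        total += block_params(c_in, c_out)
        c_in = c_out
    return total

one_encoder_4level = encoder_params([32, 64, 128, 256])
three_encoders_3level = 3 * encoder_params([32, 64, 128])

print(one_encoder_4level)      # parameters of a single 4-level encoder
print(three_encoders_3level)   # parameters of three shallow 3-level encoders
```

Because parameter cost grows quadratically with channel width, dropping the widest blocks saves far more than duplicating shallow encoders costs; the exact percentages reported in the tables additionally reflect the decoder side of each network.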
Adopting a multi-encoder approach in the five-encoder 4-level model results in increases in performance of 0.408dB, 0.246dB and 0.203dB on Middlebury, Vimeo90K and DAVIS relative to the one-encoder 4-level model. The relative performance increases are similar to those observed with the introduction of a multi-encoder 3-level MEMPR model. Relative to the best-performing six-encoder MEMPR 3-level model, performance is 0.113dB, 0.048dB and 0.136dB higher on Middlebury, Vimeo90K and DAVIS. The higher performance on DAVIS further suggests that a deeper base network can handle the motions in this set better.
As highlighted before, the original 5-level model is over-parameterised. The application of a multi-encoder approach to an already over-parameterised model can still result in improvements in performance, as observed in Table III. However, the peak performance on Middlebury, Vimeo90K and DAVIS is similar to that of the best-performing multi-encoder 4-level model, despite the increased number of parameters and memory usage of such an approach. The relative increase from the original model to the best performance on each set was also similar.
On the Xiph-2K set, limited changes in performance are observed with the 4 and 5-level configurations. The introduction of a multi-encoder approach does not lead to significant changes on this set. This is similar to what was observed with the larger kernel sizes, which could indicate that the effectiveness of a multi-encoder approach is influenced by the size of the receptive field and the ability of the base model to capture long-range correlations. As the 4 and 5-level models can handle the motion in Xiph-2K better than a one-encoder 3-level model, the impact of the multi-encoder approach is more limited. For the Xiph-720P, 1K and 360P sets, performance increases are observed. However, peak performance on these sets is generally very close to that observed with the best-performing multi-encoder 3-level model. This indicates that model depth does not significantly impact multi-encoder performance on these sets, as enough complementary shallow features can be extracted with a 3-level model. It also reveals a characteristic of the MEMPR approach: performance trends are generally consistent when the motion is more local, with a mean magnitude of 7.81 pixels or less.

4) Visualising the Features of MEMPR-AdaCoFNet:
The improvements in performance reflected in Table III are validated by considering the feature map visualisations in Fig. 3. Using the two-encoder 4-level MEMPR-AdaCoFNet model as an example, the encoder features at each convolutional block (Conv Block) are visualised. It can be observed that, regardless of channel depth, Encoder 1 focuses on different features compared to Encoder 2. In the case of the sample sequence, Encoder 1 focuses on the vertical components within the sequence, whereas Encoder 2 attends to different structures. The combined feature map can therefore take into account features extracted from both encoders. The different representations learned by each encoder highlight that the features learned by the encoders can be complementary.
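Such per-encoder visualisations can be produced with a simple channel-mean reduction, a common (though here assumed) way to collapse a feature tensor into a single attention-style map before applying a colormap:

```python
import numpy as np

def feature_to_map(features):
    """Collapse a (C, H, W) feature tensor into a normalised (H, W) map.

    Averaging over channels and min-max normalising to [0, 1] gives a
    single map that a colormap (blue/green = high attention, red = low)
    can render, as in Fig. 3. This reduction is an illustrative
    assumption; other reductions (max, per-channel grids) are also common.
    """
    fmap = features.mean(axis=0)
    lo, hi = fmap.min(), fmap.max()
    return (fmap - lo) / (hi - lo + 1e-8)

# hypothetical activations from two encoders at the same conv block
enc1 = np.random.rand(64, 32, 32)
enc2 = np.random.rand(64, 32, 32)
combined = feature_to_map(np.concatenate([enc1, enc2], axis=0))
```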

B. Application of MEMPR Multi-Encoder Approach to SepConvNet
1) Impact of Multi-Encoder Approach With Different Kernel Sizes and Model Depths: In SepConvNet, a kernel size of 51 is the default. The removal of the 256 and 512-channel blocks led to drops of 0.125dB, 0.0790dB, 0.038dB and 0.077dB on the Middlebury, Vimeo90K, DAVIS and UCF101 sets. Similar to what was observed with AdaCoFNet, a multi-encoder MEMPR approach can result in performance improvements. However, the rate of improvement depends on the selected model and kernel size. The four-encoder 3-level model achieved the best results, with performance that is 0.269dB, 0.153dB, 0.088dB and 0.078dB higher on Middlebury, Vimeo90K, DAVIS and UCF101. Relative to the original model, performance is 0.144dB, 0.0740dB and 0.0500dB higher on Middlebury, Vimeo90K and UCF101 whilst 85.19% fewer parameters are used. Performance on DAVIS is equivalent. Active memory usage is also 6.59% lower. However, runtime is impacted, with 42.46% more time needed to synthesise a 1280 × 720 frame.
Reducing the kernel size to 41, 31 and 21 impacts the performance of the one-encoder 3-level model, as can be observed in Fig. 4. However, upon the adoption of a multi-encoder MEMPR approach, performance improvements are observed. A six-encoder MEMPR-SepConvNet KS21 model can achieve the best performance on Vimeo90K, 0.177dB higher than the original model. Performance on Middlebury and UCF101 is 0.194dB and 0.105dB higher. DAVIS performance is 0.025dB lower, indicating that the reduction of kernel size could impact the modelling of dynamic motions. Relative to the original KS51 model, memory usage is 38.90% lower and parameters are 81.46% fewer. However, runtime is 47.63% higher.
With the reduction in kernel size, compared to the KS51 configuration, the relative increases in performance on Middlebury, Vimeo90K and DAVIS are higher, suggesting that a multi-encoder MEMPR approach works best when the receptive field is not much larger than necessary.As the test sequences contain mean motions that are less than 30 pixels (refer to Table I), it is likely that a larger kernel size would result in the referencing of irrelevant pixels which could be detrimental to the synthesis of the final frame.
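The local separable convolution that these kernel sizes parameterise can be sketched as follows. For each output pixel, the network predicts a vertical and a horizontal 1D kernel of length K, whose outer product is applied to the K × K neighbourhood of the (padded) input frame. The single-channel simplification and variable names are illustrative, not SepConvNet's actual implementation:

```python
import numpy as np

def separable_warp(frame, kv, kh):
    """Warp a single-channel frame with per-pixel separable kernels.

    frame: (H + K - 1, W + K - 1) array, already padded by K // 2.
    kv, kh: (H, W, K) vertical and horizontal kernels per output pixel.
    """
    H, W, K = kv.shape
    out = np.empty((H, W), dtype=frame.dtype)
    for y in range(H):
        for x in range(W):
            patch = frame[y:y + K, x:x + K]
            # outer-product kernel: kv (column) x kh (row) applied to patch
            out[y, x] = kv[y, x] @ patch @ kh[y, x]
    return out

# Sanity check: one-hot kernels centred at K // 2 reproduce the input.
H, W, K = 8, 8, 5
original = np.arange(H * W, dtype=float).reshape(H, W)
frame = np.pad(original, K // 2)
kv = np.zeros((H, W, K)); kv[..., K // 2] = 1.0
kh = np.zeros((H, W, K)); kh[..., K // 2] = 1.0
identity = separable_warp(frame, kv, kh)
```

Shifting the one-hot position by d pixels reproduces a displacement of d, which is why the kernel size bounds the largest motion the operator can compensate.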
Relative to models with a larger kernel size, reducing the kernel size to 11 results in significant drops in performance. Compared to the original SepConvNet KS11 model, a five-encoder 3-level KS11 model can achieve performance that is 0.202dB, 0.145dB and 0.100dB higher on Middlebury, Vimeo90K and UCF101. Performance on DAVIS is 0.076dB lower. This indicates that, despite the improvements relative to a one-encoder 3-level KS11 model, the multi-encoder approach ultimately does not fully compensate for a much smaller receptive field. UCF101 performance is actually higher than that of the original KS51 model. This is likely due to the motion magnitudes in this set being within the receptive field of the network, so the multi-encoder approach results in further improvements in performance.
Similar to AdaCoFNet, a 4-level one-encoder SepConvNet model achieves equivalent or even slightly higher performance than the original model.
The impact of implementing a multi-encoder approach in the 4-level configuration with different kernel sizes is similar to the 3-level configuration. Generally, the KS21 and KS31 models achieve the best balance between runtime, performance and memory usage.
2) Using the MEMPR Approach for Memory Usage Reduction on SepConvNet: Applying a multi-encoder MEMPR approach to a 5-level model results in performance improvements. However, similar to the 5-level AdaCoFNet model, the peak performance of a multi-encoder MEMPR model can be matched with a shallower MEMPR model. In this case, the performance of a three-encoder 5-level MEMPR SepConvNet model is similar to that of a five-encoder 4-level MEMPR SepConvNet KS31 model, which uses fewer parameters and less memory.
For Xiph-2K, in the 3-level configuration, a kernel size of 31 achieves the best performance. A kernel size of 51 leads to a drop in performance on this set, suggesting that a larger kernel size can be detrimental if it is much larger than the motions present in the set. The adoption of a multi-encoder model results in increases in performance for all kernel sizes. However, it is only in the KS31 and KS41 configurations that the performance of the shallow model matches that of the original deeper 5-level model. This is similar to what was observed with the multi-encoder MEMPR approach using AdaCoFNet, where performance was close to the original in the KS3 and KS5 configurations. This suggests that a deeper model could benefit more from the larger receptive field arising from a larger kernel size.
In all kernel size configurations, performance on Xiph-720P, 1K and 360P exceeds the original model. In contrast to Xiph-2K, no drops in performance are observed with larger kernel sizes. However, there are no further increases in performance beyond KS21 for Xiph-720P and 1K. The only exception to this is the Xiph-360P set, which contains a mean motion magnitude of 3.80 pixels. This is likely because the multi-encoder approach works well with local motions and the size of the receptive field of the model is sufficient to adequately model such a sequence. In the multi-encoder SepConvNet 4-level configurations, performance on Xiph-2K is equivalent to or exceeds the original in the KS21, KS31, KS41 and KS51 configurations. Performance on Xiph-2K with the four-encoder 4-level KS31 model is 0.225dB higher than the original KS51 model. The multi-encoder model in the KS31 configuration achieves the highest performance on Xiph-2K, suggesting that careful selection of kernel size and model depth, together with the application of our proposed multi-encoder approach, can maximise performance improvements.
Relative to the multi-encoder 3-level model, the peak performance on Xiph-720P and 1K is higher by a large margin.
In comparison, peak performance on Xiph-360P is equivalent. This suggests that the handling of local motion is enhanced by the multi-encoder approach, whereas the synthesis of frames with larger motions can be further enhanced by increased model depth.
In the 5-level configuration, which was only trained in the KS51 configuration, a trend similar to the application of the multi-encoder architecture in a 5-level AdaCoFNet network was observed. The results of this configuration are presented in Table VII. The peak performance of a multi-encoder network on Middlebury, Vimeo90K and UCF101 was equivalent to that of a multi-encoder 4-level model. On DAVIS, however, the performance of a three-encoder 5-level model was 0.0590dB higher than the best-performing five-encoder 4-level KS51 model.
Improved performance was also seen on Xiph-2K with the three-encoder 5-level model. This could be because a multi-encoder approach combined with a deeper network can better synthesise motions with a mean of more than 12 pixels. The best performance on Xiph-720P, 1K and 360P with a 5-level MEMPR SepConvNet model was equivalent to that of the best-performing MEMPR 4-level model. The best Xiph-1K and 360P performance was almost the same for the 3, 4 and 5-level multi-encoder models. This further highlights that the choice of how to adopt a multi-encoder architecture depends on model depth. The similar performance is further highlighted in Table VIII.
A 3-level MEMPR multi-encoder approach can achieve equivalent or higher performance than the original KS51 model on almost all sets.This, however, is dependent on the range of motions considered and the desired level of performance.For improved performance on sequences with large motion, a 4-level MEMPR multi-encoder model is preferred.
Whilst the MEMPR approach is proposed for parameter reduction, it can also be used to provide performance increases in a parameter-effective manner. This can be observed with the AdaCoFNet and SepConvNet 4-level models when the MEMPR approach is applied. Such networks are likely to be more effective when evaluated on data with a larger motion range.

C. Application of MEMPR Multi-Encoder Approach to EDSC
EDSC uses a HetConv-based [10] backbone, which is applied for the purpose of parameter reduction. The MEMPR multi-encoder approach is applied to show that further parameter reduction is indeed possible for such a network. In the same manner as the previous approaches, the multi-encoder approach is applied in the 3, 4 and 5-level configurations. A kernel size of 11 is used for the experiments. Performance gains are observed when the multi-encoder approach is applied to the 3-level model, with performance being higher than or matching the original model on Vimeo90K and UCF101. However, on DAVIS, which contains large motion, and on Middlebury, performance does not match the original model. This is consistent with the previous experiments on AdaCoFNet and SepConvNet, which indicated that the multi-encoder approach is constrained by the ability of the shallow base architecture to handle different motion ranges. Similar to AdaCoFNet and SepConvNet, a 4-level model achieves equivalent or better performance than the original 5-level model. In the case of EDSC, a 4-level model achieves equivalent performance on all sets. In such a scenario, the multi-encoder approach can be used to maximise the performance of the model at a lower parameter cost. The results of adopting such an approach can be observed in Table IX. As a deeper 5-level model performs similarly to a 4-level model on these sets, a multi-encoder approach can be deployed for further performance gains.
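HetConv reduces parameters by giving each filter a mix of 3 × 3 and 1 × 1 kernels across the input channels. A commonly used approximation, and the one sketched below (it is not EDSC's exact module), decomposes this into a grouped 3 × 3 convolution plus a pointwise 1 × 1 convolution; with part ratio p = 4 and 64 channels, the parameter count drops from 36,864 to 13,312 relative to a full 3 × 3 convolution:

```python
import torch
import torch.nn as nn

class HetConvApprox(nn.Module):
    """Grouped 3x3 + pointwise 1x1 decomposition (a HetConv approximation)."""

    def __init__(self, c_in, c_out, p=4):
        super().__init__()
        # each 3x3 kernel sees only c_in / p input channels (grouped conv)
        self.gwc = nn.Conv2d(c_in, c_out, 3, padding=1, groups=p, bias=False)
        # each 1x1 kernel sees all input channels (pointwise conv)
        self.pwc = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.gwc(x) + self.pwc(x)

het = HetConvApprox(64, 64, p=4)
full = nn.Conv2d(64, 64, 3, padding=1, bias=False)
het_params = sum(w.numel() for w in het.parameters())    # 13312
full_params = sum(w.numel() for w in full.parameters())  # 36864
y = het(torch.randn(1, 64, 16, 16))
```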
Adding one 4-level encoder and combining the features at each relevant feature level results in 0.193dB and 0.171dB increases on Middlebury and Vimeo90K relative to the original model. Performance on DAVIS is slightly higher (0.0460dB). Compared to the original EDSC, a 50.39% reduction in parameters is achieved. This model also provides the best balance between runtime and performance, with minimal cost to runtime.
If higher performance is desired, a three-encoder 4-level EDSC model achieves performance that is 0.306dB, 0.329dB, 0.0630dB and 0.128dB higher on Middlebury, Vimeo90K, DAVIS and UCF101 relative to the original EDSC model, despite a 41.79% reduction in parameters.
Further increases in performance can be observed with a six-encoder 4-level model on Middlebury (0.393dB relative to the original) and DAVIS (0.164dB); however, parameter usage is only reduced by 16.20% relative to the original.
Implementing a model with rotations, where a four-encoder model is used and the input to each encoder is rotated by 90 degrees, generally does not benefit performance. This was observed with most AdaCoFNet, SepConvNet and EDSC configurations. Although the rationale behind such an approach was to leverage the spatial invariance of convolutions, the improved performance seen in the workshop paper was likely due to random seed initialisation. As all experiments in this paper are conducted five times with fixed Torch, NumPy, CuPy and cuDNN seeds, the relationship between the multi-encoder approach and rotations could be adequately investigated and outliers in performance detected.
Similar to AdaCoFNet and SepConvNet, the adoption of a multi-encoder approach in the 5-level configuration can lead to performance improvements. However, the peak performance on the test sets is similar to that observed in the 4-level configuration, despite the increased number of parameters in a 5-level model.
Improvements to Xiph performance are observed in the 3, 4 and 5-level multi-encoder configurations. In the 3-level configuration, on Xiph-2K, improvements in performance can be observed with the MEMPR multi-encoder approach.
However, it is still lower than the original model. The reason behind the lower performance on Xiph-2K is likely similar to the performance difference observed with DAVIS: the shallow encoder cannot adequately model very large motions. On Xiph-720P, 1K and 360P, equivalent or higher performance can be observed.
In the one-encoder 4-level configuration, higher performance on Xiph-720P was observed relative to the original model, whereas equivalent performance was seen on the Xiph-2K, 1K and 360P sets. As observed in Table IX, the application of the MEMPR multi-encoder approach results in improvements on the Xiph-2K, 1K, 720P and 360P sets. As such, the multi-encoder approach can be used to improve performance on these sequences with different ranges of motions without an increase in parameters relative to the original EDSC model.
In the 5-level model, although a multi-encoder approach can lead to performance improvements relative to the original model, there are models where a drop on Xiph-2K is observed. This is likely due to the network overfitting on the training data and generalising less well. Another contributing factor is that the 5-level network to which the MEMPR approach is applied is already over-parameterised. Despite the performance improvements, the 4-level multi-encoder model can achieve higher or equivalent performance on all Xiph sets with significantly fewer parameters.
A possible reason as to why the multi-encoder approach is effective can be seen in the weights plot in Fig. 5. Although the distribution of weights is similar for each encoder layer, the higher prevalence of specific weights in some encoders could indicate that the encoder is more specialised. This was also the case with AdaCoFNet.
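The step-histogram comparison of Fig. 5 can be reproduced by binning each encoder's weights over a shared range; the weight tensors below are random stand-ins for the trained weights:

```python
import numpy as np

def step_histograms(weight_sets, bins=50):
    """Histogram several weight tensors over one shared bin range.

    Using common bin edges is what makes per-encoder frequency
    differences visible even when the value ranges are similar.
    """
    flat = [np.asarray(w).ravel() for w in weight_sets]
    lo = min(f.min() for f in flat)
    hi = max(f.max() for f in flat)
    edges = np.linspace(lo, hi, bins + 1)
    return [np.histogram(f, bins=edges)[0] for f in flat], edges

rng = np.random.default_rng(0)
# stand-ins for the final 3x3 conv weights of two encoders' second blocks
w1 = rng.normal(0.0, 0.05, size=(64, 64, 3, 3))
w2 = rng.normal(0.0, 0.08, size=(64, 64, 3, 3))
counts, edges = step_histograms([w1, w2])
```

Each count array can then be drawn with matplotlib's `plt.stairs(counts[i], edges)` to obtain the step style used in Fig. 5.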

VI. COMPARISON TO STATE-OF-THE-ART MODELS
For comparisons to state-of-the-art models, the MEMPR models that best balance performance with runtime relative to the original AdaCoFNet, SepConvNet and EDSC models are selected. In [7], two models are defined: AdaCoFNet, with a kernel size of 5, and AdaCoFNet+, which has a kernel size of 11. The two-encoder 4-level MEMPR AdaCoFNet model is selected as a comparable baseline to AdaCoFNet, and the five-encoder 3-level KS7 AdaCoFNet model is selected as a comparable baseline to AdaCoFNet+. For SepConvNet, the four-encoder 3-level KS31 model is chosen as a comparable baseline to the original SepConvNet (KS51) model defined in [6]. For EDSC-KS11, the three-encoder 4-level KS11 model is chosen. The selected models are referred to as MEMPR-AdaCoFNet, MEMPR-AdaCoFNet+, MEMPR-SepConvNet and MEMPR-EDSC-KS11 in Table X.
Relative to seminal kernel-based methods, the proposed MEMPR methods can achieve similar or better performance than the reference baselines. For other kernel-based methods such as CDFI [27], a shallow two-encoder MEMPR architecture could be used instead of the compressed AdaCoFNet as the building block to which other components are applied. The proposed MEMPR approach can also be applied to better-performing models such as SepConv++ [41], as the same separable convolution operator and a UNet-based convolutional backbone are used. Additionally, MEMPR could be applied to newer kernel-based methods such as MVFI-Net [48], which uses a backbone architecture for kernel-based interpolation similar to SepConvNet and AdaCoFNet. When compared to direct synthesis methods and the recent diffusion-based method LDMVFI [49], kernel-based approaches such as MEMPR-SepConvNet and MEMPR-EDSC perform favourably. However, relative to more recent flow-based networks, further developments in kernel-based methods are needed to achieve state-of-the-art performance in two-frame interpolation. Our proposed MEMPR approach can influence design choices and help ensure that kernel-based networks become more parameter-efficient.
Flow and kernel-based methods have inherent advantages and disadvantages. As such, both have been integrated in DAIN [50]. More recently, flow and kernel-based approaches have been integrated into the four-frame ST-MFNet model [45], which can handle non-linear motions and textured sequences more effectively than comparable four-frame interpolation models that generally follow only one paradigm. Additionally, the deformable separable convolution operation used in EDSC has also been used in recent research involving latent diffusion models for frame interpolation in LDMVFI [49]. This highlights the continued importance of kernel-based approaches.

VII. CONCLUSION
In conclusion, the adoption of the MEMPR approach can result in significant reductions in parameters without costs in performance. MEMPR is easy to implement and generalises to two different convolutional backbones and three networks with different kernel-based operators. Experiments have shown that the MEMPR approach can be used for parameter reduction, but also, in certain configurations, to achieve a level of performance similar to a model with a larger kernel size. The approach can also be deployed for memory reduction in some cases by applying the multi-encoder model to a model with a smaller kernel size to obtain the performance of a larger kernel size model. The proposed MEMPR approach has been applied to seminal works in kernel-based video frame interpolation. As such, it is intended to provide a framework for parameter reduction for existing methods as well as any future convolutional kernel-based network. One consideration with the MEMPR approach is runtime, which has to be balanced with the desired level of performance, the intended degree of parameter reduction and the range of motions the target model would be applied to. Beyond careful selection of a model in accordance with these parameters, applying dimensionality reduction in the first stage of the network, similar to EDENVFI [29] or fLDR [34], could enable faster runtime. Investigating the applicability of this approach to a transformer backbone could potentially enable this method to be integrated into kernel-based approaches such as VFIT [23] and EDENVFI, which have a transformer backbone as part of the model architecture.

Fig. 2. The multi-encoder approach removes the parameter-heavy encoder and decoder blocks and integrates multiple shallow encoders to compensate for the loss of deeper heavy features. Complementary features can be learned that can improve the synthesis of the final interpolated frame. Depicted is the generalised multi-encoder MEMPR method applied in a 3-level (32, 64, 128 channel) configuration. For SepConvNet and AdaCoFNet, the convolutional block consists of 3x(3 × 3 Conv + ReLU) and the Upsample block consists of Bilinear upsampling + Conv + ReLU. For EDSC, the first convolutional block consists of (3 × 3 Conv + ReLU + 2x(HetConv + ReLU)). The subsequent convolutional blocks consist of 3x(HetConv + ReLU) and the Upsampling block consists of Bilinear upsampling + HetConv + ReLU. The general architecture of the original SepConvNet, AdaCoFNet and EDSC models is presented in the top right corner. The kernel-based synthesis operations for SepConvNet, AdaCoFNet and EDSC differ; for full information on their differences, please refer to the original works of the authors.
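The 3-level multi-encoder layout described in this caption can be sketched in PyTorch as below. The block definition follows the caption (3x(3 × 3 Conv + ReLU) per level, as in the SepConvNet/AdaCoFNet variant); the pooling choice and the summation used to combine encoder features are illustrative assumptions, as the figure does not fix how MEMPR fuses the per-encoder features:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x(3x3 Conv + ReLU), as described for SepConvNet and AdaCoFNet
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class ShallowEncoder(nn.Module):
    def __init__(self, c_in=6, channels=(32, 64, 128)):
        super().__init__()
        blocks = []
        for c_out in channels:
            blocks.append(conv_block(c_in, c_out))
            c_in = c_out
        self.blocks = nn.ModuleList(blocks)
        self.pool = nn.AvgPool2d(2)  # downsampling operator is an assumption

    def forward(self, x):
        feats = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            feats.append(x)
            if i < len(self.blocks) - 1:
                x = self.pool(x)
        return feats  # one feature map per level

class MultiEncoder(nn.Module):
    def __init__(self, n_encoders=3):
        super().__init__()
        self.encoders = nn.ModuleList(
            ShallowEncoder() for _ in range(n_encoders))

    def forward(self, x):
        per_encoder = [enc(x) for enc in self.encoders]
        # combine the encoders' features level by level (summation assumed)
        return [torch.stack(level).sum(dim=0) for level in zip(*per_encoder)]

frames = torch.randn(1, 6, 32, 32)  # two concatenated RGB input frames
levels = MultiEncoder(3)(frames)
```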

Fig. 3. Feature visualisations from the two-encoder 4-level MEMPR-AdaCoFNet model using a sequence from the DAVIS set. The outputs of the second, third and fourth convolutional blocks are presented for each constituent encoder (Encoder 1 and Encoder 2) and the combined feature map. Blue and green show areas of attention, while red shows areas of no attention.

Fig. 4. The performance of the multi-encoder MEMPR architecture applied to SepConvNet in the 3-level configuration with different kernel sizes. The performance of the original SepConvNet 5-level model, which we aim to match or exceed, is presented in the graphs and referred to as Original-KS51.

Fig. 5. The weights of each encoder in the MEMPR multi-encoder approach applied to (a) AdaCoFNet, (b) SepConvNet and (c) EDSC, presented in a step histogram using a four-encoder 4-level model as an example. For AdaCoFNet and SepConvNet, the weights of the final convolution of the second block are plotted. For EDSC, which uses a HetConv module, the weights from the 3 × 3 Conv of the final HetConv block in the second block are taken. Although the range of weight values is similar, the frequency of certain weights is more pronounced in some encoders than in others. This can potentially ensure that different features are learned.
December 2023; revised 8 March 2024; accepted 20 April 2024. Date of publication 30 April 2024; date of current version 27 June 2024. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) in collaboration with the British Broadcasting Corporation (BBC) through the Industrial Cooperative Awards in Science and Engineering (iCASE) under Grant 2246814. This article was recommended by Guest Editor F. Zhang. (Corresponding author: Issa Khalifeh.)

TABLE I THE MEAN MOTION MAGNITUDES AND THE INPUT RESOLUTION OF THE SEQUENCES OF ALL TEST SETS USED IN THIS WORK. DIFFERENT TEST SETS ARE USED FOR THE EVALUATION OF OUR WORK TO INVESTIGATE THE EFFECTIVENESS OF OUR PROPOSED MEMPR MULTI-ENCODER APPROACH ON DIFFERENT RANGES OF MOTION AND DIFFERENT INPUT RESOLUTIONS

TABLE III PSNR (DB) PERFORMANCE OF THE MULTI-ENCODER MEMPR APPROACH APPLIED TO ADACOFNET WITH DIFFERENT MODEL DEPTHS USING A KERNEL SIZE 5 CONFIGURATION. RUNTIME AND MEMORY USAGE NEEDED TO SYNTHESISE A 1920 × 1080 FRAME ARE PRESENTED. ACTIVE MEMORY USAGE IS PRESENTED AND CAN BE OBTAINED USING torch.cuda.memory_summary(). P. (M) REPRESENTS THE NUMBER OF PARAMETERS, IN MILLIONS, OF THE MODELS

TABLE V THE PERFORMANCE OF THE MEMPR MULTI-ENCODER APPROACH APPLIED TO THE SEPCONVNET 4-LEVEL MODEL WITH DIFFERENT KERNEL SIZES. THE PERFORMANCE IS MEASURED IN PSNR (DB). THE MODELS HIGHLIGHTED IN BOLD CAN ACHIEVE SIMILAR OR HIGHER PERFORMANCE THAN THE ORIGINAL SEPCONVNET-KS51 MODEL WITH REDUCED PARAMETERS AND MEMORY USAGE. THE MODELS THAT ARE UNDERLINED ALSO ACHIEVE A RUNTIME CLOSE TO THE ORIGINAL MODEL. P. (M) REPRESENTS THE NUMBER OF PARAMETERS, IN MILLIONS, OF THE MODELS

TABLE VI SELECTION OF DIFFERENT MULTI-ENCODER MODELS (APPLIED TO SEPCONVNET) THAT CAN ACHIEVE EQUIVALENT OR HIGHER PERFORMANCE THAN THE ORIGINAL MODEL ON MIDDLEBURY, VIMEO90K, UCF101 AND DAVIS WITH FEWER PARAMETERS AND LESS ACTIVE MEMORY USAGE. THE GENERAL STRUCTURE XE-YLEVEL-KSZ CORRESPONDS TO X ENCODERS - Y LEVELS - KERNEL SIZE Z. PERFORMANCE IS MEASURED USING PSNR (DB). P. (M) REPRESENTS THE NUMBER OF PARAMETERS, IN MILLIONS, OF THE MODELS

TABLE VII PSNR (DB) PERFORMANCE OF THE MULTI-ENCODER MEMPR APPROACH APPLIED TO SEPCONVNET WITH DIFFERENT MODEL DEPTHS IN THE KERNEL SIZE 51 (KS51) CONFIGURATION. RUNTIME AND MEMORY USAGE NEEDED TO SYNTHESISE A 1280 × 720 FRAME ARE PRESENTED. NOTE THAT A 1 ENCODER - 5 LEVEL MODEL REFERS TO THE ORIGINAL MODEL. P. (M) REPRESENTS THE NUMBER OF PARAMETERS, IN MILLIONS, OF THE MODELS

TABLE VIII THE LACK OF A NEED FOR A DEEPER 5-LEVEL MULTI-ENCODER MODEL IS DISPLAYED BY THE PERFORMANCE OF A 4-LEVEL MULTI-ENCODER MODEL USING FEWER PARAMETERS AND LESS MEMORY

TABLE IX THE PERFORMANCE OF A MULTI-ENCODER APPROACH WHEN APPLIED TO AN EDSC 4-LEVEL KS11 ARCHITECTURE. PERFORMANCE IS MEASURED IN PSNR (DB). RUNTIME AND MEMORY USAGE ARE EVALUATED USING 1280 × 720 INPUTS. P. (M) REPRESENTS THE NUMBER OF PARAMETERS, MEASURED IN MILLIONS, OF THE MODELS
The removal of the 256 and 512-channel encoder-decoder blocks leads to a drop in performance of 0.403dB, 0.202dB, 0.0700dB and 0.269dB on Middlebury, Vimeo90K, and DAVIS relative to the original EDSC model.

TABLE X THE PERFORMANCE OF DIFFERENT STATE-OF-THE-ART (SOTA) METHODS ON MIDDLEBURY, VIMEO90K, UCF101, DAVIS, XIPH-2K, 720P, 1K AND 360P. PERFORMANCE IS MEASURED USING PSNR (DB). DUE TO THERE BEING NO PUBLIC IMPLEMENTATION OF THE FULL MVFI-NET MODEL, THE RESULTS FOR MVFI-NET ARE TAKEN DIRECTLY FROM [48]. † IS USED TO HIGHLIGHT THIS. FOR ADACOFNET, SEPCONVNET AND EDSC, THE MODELS TRAINED ACCORDING TO OUR PROCEDURE ARE PRESENTED. ACTIVE MEMORY USAGE IS PRESENTED AND CAN BE OBTAINED USING torch.cuda.memory_summary(). RUNTIME AND MEMORY USAGE ARE COMPUTED ON 1280 × 720 INPUTS