Efficient Decoder Reduction for a Variety of Encoder-Decoder Problems

Encoder-decoder networks have become the standard solution for a variety of segmentation tasks. Many of these approaches use a symmetrical design in which the encoder and the decoder have approximately the same computational complexity. However, symmetry in encoder-decoder networks is not necessarily optimal. This work proposes an elegant and generic method to reduce the decoder complexity in encoder-decoder networks by scaling the number of feature channels in the decoder. The popular U-Net is used as an example of how to adapt existing symmetrical models. The effect of the decoder size is investigated on three data sets of varying complexity: the ISIC, Cityscapes, and SUN RGB-D data sets. We show that reducing the number of decoder channels yields no statistically significant difference in results, while providing a decoder that requires up to 99% fewer FLOPs (approximately 90% fewer FLOPs is attainable for all investigated problems). In addition, results show that the number of parameters in the decoders of two models that already have small decoders can be reduced further, depending on the problem. The proposed solution is simple and can easily be implemented in other encoder-decoder models. Empirical results also show that a reduction in parameters may even improve performance, likely because fewer parameters reduce overfitting effects.


I. INTRODUCTION
A. PROBLEM STATEMENT
Encoder-decoder networks are a prevalent solution to a variety of segmentation problems in many computer vision application domains, such as road-scene segmentation [1] and aerial image segmentation [2]. Likewise, encoder-decoder solutions are the standard for segmentation problems in the medical imaging domain, e.g., skin-lesion segmentation [3], organ segmentation [4], and region-of-interest selection [5]. In nearly all of these approaches, the encoder-decoder networks are designed to be symmetrical (U-Net [6], SegNet [7], FusionNet [8]). However, symmetrical properties of encoder-decoder networks are not necessarily optimal. Especially in problems where a limited amount of data is available for training, or where data acquisition is inherently difficult, a lower number of parameters is often preferable. Furthermore, in time-sensitive problems, such as road-scene segmentation [9] and endoscopic video [10], real-time execution imposes additional constraints on the network architecture. Because fewer parameters reduce the risk of overfitting for problems with limited data and speed up inference when time constraints apply, a lower number of parameters is highly preferable (when accuracy is comparable). Furthermore, mobile applications have an increasing demand for small networks [11]. To this end, we propose an adaptation to the popular U-Net architecture [6], in which the decoder size is tunable with a reduction parameter during the initialization of the model. The network decoder can then be made arbitrarily small, depending on the complexity of the task and the amount of data available. The proposed adaptation is modular and not limited to U-Net; any encoder-decoder network can benefit from the presented solution.

B. RELATED WORK
Convolutional Neural Networks (CNNs) are currently widely used for many Artificial Intelligence (AI) applications, including computer vision [12], language modeling [13], and graphics rendering [14]. CNNs have been an incredibly successful solution, as they often deliver state-of-the-art accuracy on many AI tasks. However, this often comes at the cost of high computational complexity. For example, a recent study by Chen et al. [15] uses a model with close to 400 million parameters, trained on 128 TPUs, to achieve state-of-the-art results. In practice, this amount of compute is not attainable in most cases. Additionally, models of this size are impractical for mobile applications. For this reason, techniques that enable efficient processing of CNNs without sacrificing performance are crucial to the wide deployment of these types of models. Several solutions have been proposed to address this problem.
Courbariaux et al. [16] propose a method to train neural networks with binary weights and activations. This approach greatly reduces the memory consumption of the model, especially during inference. In a similar vein, work by Ma et al. reduces the precision of operations by moving from floating-point to fixed-point arithmetic [17]. This approach significantly decreases the model complexity with very little loss in performance.
In addition to reducing the size of each operation or operand (weight/activation), there is also a significant amount of research on methods to reduce the number of operations and the model size. Work by Chen et al. [18] looked into exploiting the sparsity of activations by using run-length coding. This approach works particularly well due to the nature of the ReLU activation function, which results in matrices with many zeros. An entirely different approach decreases the model size by pruning unimportant weights of overparametrized models after training [19]. In this approach, weights with a small absolute magnitude are pruned and the model is fine-tuned to maintain accuracy; approximately 50% of the network weights can be discarded without loss of performance.
Finally, the number of weights and activations can also be reduced by improving the existing network architectures. For example, one N × N convolution can be decomposed into two 1-D convolutions: one 1 × N and one N × 1 convolution [20]. Furthermore, 1 × 1 convolutional layers can be used to reduce the number of channels in the output feature map of a given layer. Such a layer, often called a 'bottleneck', reduces the number of filters in the subsequent layers, as demonstrated in work by Lin et al. [21].
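To make these two techniques concrete, the following minimal PyTorch sketch (our own illustration, not code from the cited works; all channel sizes are arbitrary) compares the parameter counts of a full 3 × 3 convolution, its 1-D decomposition, and a 1 × 1 bottleneck variant.

```python
import torch.nn as nn

# Full 3x3 convolution: 256 * 256 * 9 = 589,824 weights (plus biases).
full = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Decomposition into 1x3 and 3x1 convolutions: 2 * 256 * 256 * 3 weights.
decomposed = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(256, 256, kernel_size=(3, 1), padding=(1, 0)),
)

# 1x1 'bottleneck': squeeze to 64 channels, apply the 3x3 there, expand back.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 256, kernel_size=1),
)

for name, module in [("full", full), ("decomposed", decomposed), ("bottleneck", bottleneck)]:
    print(name, sum(p.numel() for p in module.parameters()))
```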
The work presented in this article also deals with reducing the number of parameters by optimizing the network architecture. More specifically, the discrepancy between the input and output complexity of most encoder-decoder problems is exploited by training network architectures with substantially smaller decoders relative to the encoder. The remainder of this section covers other existing approaches that employ non-symmetric encoder-decoder models.
Methods exploiting asymmetrical encoder-decoder architectures have also been explored in the past. The simplest approach to an encoder-decoder network is to apply a bilinear upsampling operation to the output activations of the last convolutional layer of a regular encoder network. This approach efficiently maps these activations to an N-ary mask of the original input resolution [22]. While this type of decoder is extremely efficient, the naive approach is generally not successful in recovering object segmentation details.
In work by Paszke et al. [23], an asymmetric design (ENet) was proposed, motivated by the idea that the role of the decoder is to upsample the output of the encoder and only fine-tune the details afterwards. However, ENet does not employ the widely used skip connections introduced in the U-Net paper, which have been shown to increase segmentation performance. Additionally, the effect of different decoder sizes is left unexplored. In contrast to the work of Paszke et al., the architecture proposed in this article enables the use of skip connections between the encoder and the decoder. Additionally, a more extensive evaluation of the effect of the decoder size is performed in this work.
Chen et al. [24] propose a model in which the decoder is significantly smaller than the encoder. In their model, the encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features from the encoder backbone, which have the same spatial resolution. A 1 × 1 convolution is applied to the low-level feature activations to reduce the number of feature channels. However, the design choices for their decoder are determined on a single data set (the PASCAL VOC 2012 semantic segmentation benchmark [25]); hence, the effect of the decoder size on different tasks with variable output complexity remains unknown.
In prior work [26], our group has performed the first experiments on this topic using the ISIC data set [27]. In this work, we propose a simple method to reduce the decoder complexity for any convolutional encoder-decoder model. Additionally, we show that existing models with efficient decoders can be improved further depending on the complexity of the problem. Finally, additional experiments and analyses are performed in this work on two additional data sets from different domains and with varying complexity.
In this article, we address the above-mentioned shortcomings of prior work. The contributions of this work are as follows. 1) A flexible methodology is introduced that facilitates a decoder that is scalable with the complexity of the problem. 2) The proposed method is simple and can be applied to any convolutional encoder-decoder model. 3) The proposed method is evaluated on three publicly available benchmark data sets with varying degrees of task complexity to show that the approach is generic. We report similar results on a binary segmentation problem, a multi-class segmentation problem, and a monocular depth-estimation problem. 4) In addition to showing improved efficiency in an adapted (symmetric) U-Net model, two other state-of-the-art models with efficient decoders are improved further by the proposed approach, depending on the problem.

II. METHODS
This section is structured as follows. First, a detailed description of a standard symmetrical U-Net-inspired encoder-decoder network is given in Section II-A, together with a simple method to reduce the decoder complexity. Second, Sections II-B and II-C discuss the changes made to ENet and Deeplab to incorporate the proposed approach. Third, the complexity in terms of Floating-Point Operations (FLOPs) and the number of parameters of all models is presented in Section II-D. Fourth, Section II-E lists the parameter settings employed for all experiments. Lastly, the employed statistical test is motivated in Section II-F.

A. ReDeNet ARCHITECTURE
Encoder:
The encoder used in all experiments is a fully convolutional ResNet-like architecture. Each green block (residual module) in Figure 1 consists of two convolutional layers with ReLU activation functions. A residual skip connection is used, as proposed by He et al. in the original ResNet paper [28]. Batch normalization [29] is applied after each convolutional layer. After each residual module, a max-pooling operation with kernel size 2 and stride 2 reduces the feature-map resolution by a factor of 2 at each level. The first residual module employs 32 filters, and the number of convolutional filters is doubled after each pooling operation, as is standard for nearly all classification architectures. The encoder contains no special or advanced features; all components are standard and commonly used in the literature.
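A minimal PyTorch sketch of one such residual module is given below. This is our own reading of the description above, not the authors' code; the 1 × 1 projection on the identity path (used when the channel count changes) and the placement of the final ReLU are assumptions.

```python
import torch.nn as nn

class ResidualModule(nn.Module):
    """One encoder level: two 3x3 convolutions with batch normalization,
    ReLU activations, and a residual skip connection (He et al. [28])."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the residual addition matches channel counts.
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

# Encoder levels: 32 filters at the top, doubled after each 2x2/stride-2 max-pool.
encoder_channels = [32, 64, 128, 256, 512]
```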

Decoder:
The proposed decoder design is mostly symmetrical to the encoder, with two crucial differences. First, the number of filters per residual module is scalable with a reduction parameter r. In this work, models with r ∈ {1, 2, 4, 8, 16, 32} are evaluated. For r = 1, the encoder and decoder are entirely symmetrical with respect to the number of convolutional layers, activation maps, normalization layers, and activation functions. As the counterpart of max-pooling, transposed convolutions with a filter size of 2 × 2 are used to upsample the feature maps to a higher resolution. For the other configurations, the number of learnable convolution parameters is reduced by a factor r. It is important to note that, when reducing the decoder size, output predictions are poor for problems where the number of output channels is larger than the number of channels in the penultimate layer. For example, the Cityscapes data set requires 19 output channels. With r = 4, the final layer before the output has only 8 channels. As a result, the network cannot properly make predictions for each class, only for a subset of them. To address this problem, the number of channels in each decoder level has to be at least equal to the number of output classes. The number of channels per decoder residual module is then given by

C_dec = max(C_enc / r, M),

where C_enc and C_dec are the number of channels in the encoder and decoder, respectively, and M is the total number of classes. Second, a 1 × 1 convolution is added in the skip connections to reduce the number of encoder feature-map channels prior to the addition with the upsampled decoder feature map. This is necessary because residual addition requires tensors with the same dimensions. In the remainder of this article, models that use the proposed changes are referred to as Reduced Decoder Networks (ReDeNet). Figure 1 depicts a schematic overview of the employed ReDeNet model, and a sketch of one decoder level is given below.
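The sketch below illustrates the channel rule C_dec = max(C_enc / r, M) and one decoder level with the 2 × 2 transposed convolution and the 1 × 1 squeeze on the skip connection. It reuses the ResidualModule from the encoder sketch; integer division as the rounding choice and addition (rather than concatenation) of the skip features follow our reading of the text.

```python
import torch.nn as nn

def decoder_channels(encoder_channels, r, num_classes):
    """C_dec = max(C_enc / r, M) per decoder level (integer division assumed)."""
    return [max(c // r, num_classes) for c in encoder_channels]

class DecoderLevel(nn.Module):
    """Upsample, squeeze the encoder skip features with a 1x1 convolution so the
    residual addition matches dimensions, then apply a (reduced) residual module."""

    def __init__(self, in_ch, out_ch, skip_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.squeeze = nn.Conv2d(skip_ch, out_ch, kernel_size=1)
        self.block = ResidualModule(out_ch, out_ch)  # from the encoder sketch

    def forward(self, x, skip):
        return self.block(self.up(x) + self.squeeze(skip))

# Cityscapes example (M = 19): levels never drop below the number of classes.
print(decoder_channels([512, 256, 128, 64, 32], r=4, num_classes=19))
# -> [128, 64, 32, 19, 19]
```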

B. ENet BASELINE
As discussed in Section I-B, the ENet architecture [23] features a small decoder compared to the encoder. The authors motivated this decision by their view that the role of the decoder is merely to upsample the feature space and fine-tune the details. Instead of skip connections, ENet uses max-pooling indices from the encoder to improve the decoder performance. In contrast, in our example model ReDeNet and in the Deeplab architecture discussed in the following section, skip connections from low-level encoder features to corresponding layers in the decoder are used. The ENet architecture was adapted to allow an even smaller decoder. After upsampling using the max-pooling indices, the number of channels is reduced by a factor of r with a 1 × 1 convolution. Subsequently, a series of convolutions is performed and, finally, a second 1 × 1 convolution is applied to increase the number of channels to the original size. This is necessary in order to match the dimensions of the max-pooling indices in the next level.
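The following sketch shows this squeeze-process-expand pattern in isolation (our own illustration; the actual ENet bottleneck blocks differ in internal details such as PReLU activations and asymmetric or dilated convolutions).

```python
import torch.nn as nn

class ReducedDecoderStage(nn.Module):
    """After max-unpooling: squeeze the channels by a factor r with a 1x1 convolution,
    run the convolutions at the reduced width, then expand back with a second 1x1
    convolution so the tensor again matches the pooling indices of the next level."""

    def __init__(self, ch, r):
        super().__init__()
        mid = max(ch // r, 1)
        self.squeeze = nn.Conv2d(ch, mid, kernel_size=1)
        self.convs = nn.Sequential(
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.expand = nn.Conv2d(mid, ch, kernel_size=1)

    def forward(self, x):  # x: feature map after upsampling with max-pooling indices
        return self.expand(self.convs(self.squeeze(x)))
```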

C. DEEPLAB BASELINE
Other models with U-Net-based skip connections, such as Deeplab v3 [24], have been very successful in a large variety of segmentation problems. The introduction of skip connections has led to improved performance on many segmentation tasks [30], [31]. More specifically, the skip connections allow models to gradually recover the spatial information lost in the encoding branch due to, e.g., pooling operations or strided convolutions. In this work, an implementation of the Deeplab-V3+ model is adapted. The Deeplab decoder differs from earlier decoders with skip connections in that only a single skip connection is applied in the middle 'level' of the network, instead of after every 'level'. As such, the decoder is very small and consists of only a single (non-residual) convolutional block with three convolutional layers. Ablation results from the Deeplab paper [24] indicated that 256 filters for each of the convolutional layers in the decoder led to the best results. In this work, we evaluate even smaller variants of this model using 256/r filters with r ∈ {1, 2, 4, 8, 16, 32}, as sketched below. A ResNet-152 backbone was employed for all Deeplab experiments.
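The sketch below approximates this reduced Deeplab-style decoder head. The 48-channel projection of the low-level features and the bilinear upsampling follow the DeepLabv3+ design [24]; the exact layer configuration here is our assumption, with the head width set to 256/r.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReducedDeeplabDecoder(nn.Module):
    """DeepLabv3+-style decoder head with 256/r filters in its convolutional layers."""

    def __init__(self, aspp_ch, low_ch, num_classes, r=1):
        super().__init__()
        width = max(256 // r, num_classes)
        self.low_proj = nn.Conv2d(low_ch, 48, kernel_size=1)  # squeeze low-level features
        self.head = nn.Sequential(
            nn.Conv2d(aspp_ch + 48, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, num_classes, kernel_size=1),
        )

    def forward(self, x, low):
        # Upsample the encoder output to the low-level resolution, concatenate, predict.
        x = F.interpolate(x, size=low.shape[-2:], mode="bilinear", align_corners=False)
        return self.head(torch.cat([x, self.low_proj(low)], dim=1))
```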

D. MODEL COMPLEXITY
The decoder complexity of the evaluated models is shown in Table 1. Since the encoder is exactly the same for each respective model variant, all values pertain to the decoder only.
In the column headers of this table, FLOPs denotes the required number of floating-point operations, while Params indicates the number of learnable parameters in the decoder. To generate the values in Table 1, a Titan-XP GPU and a single input tensor with a resolution of 256 × 256 × 3 were used. Finally, the Baseline Percentage (BP) is computed as the percentage of decoder FLOPs and learnable decoder parameters relative to the baseline (r = 1).
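As an illustration, parameter counts can be read directly from a module, and FLOPs can be obtained with a profiling library; the sketch below uses fvcore as one possible counter (the paper does not state which tooling was used, and the module and input shape are only examples).

```python
import torch
from fvcore.nn import FlopCountAnalysis  # one possible FLOP counter

def decoder_cost(decoder, example_input):
    """Return (learnable parameters, FLOPs) for a decoder and an example input."""
    params = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
    flops = FlopCountAnalysis(decoder, example_input).total()
    return params, flops

# Example: the earlier ReducedDecoderStage fed with a hypothetical bottleneck feature map.
stage = ReducedDecoderStage(ch=512, r=8)
print(decoder_cost(stage, torch.zeros(1, 512, 16, 16)))
```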

E. TRAINING DETAILS
The algorithm is trained using Adam with AMSGrad and a weight decay of 10⁻⁵. A cyclic cosine learning-rate scheduler [32] is used to control the learning rate. Additionally, batch normalization is used to regularize the network, and the model is further regularized with data augmentation. Images are randomly rotated with θ ∈ {0, 90, 180, 270} degrees and randomly flipped along the x- and y-axes with probability 0.5 for the ISIC data set (Section III-A). For the other two data sets (Sections III-B and III-C), no rotation is used and images are only flipped along the y-axis. Additionally, random permutations are made to the color, contrast, and brightness of the images. Furthermore, images are randomly sheared by up to 8 degrees and randomly translated by up to 10% of the image width. A sketch of this training configuration is given below. All experiments are performed on a desktop PC with the following specifications: Xeon CPU E5-1650 v4 @3.60 GHz, 32-GB RAM, Titan-XP 12-GB GPU.

FIGURE 2. Four examples of images from the ISIC data set with corresponding ground truth. Among the considered problems, the skin-lesion segmentation problem has the lowest complexity, as the output mask is binary and the lesion is almost always located at the center of the image. However, lesion boundaries are not always clearly defined.
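The sketch below summarizes the training configuration of Section II-E in PyTorch. The base learning rate, jitter magnitudes, and restart period are assumptions, as they are not specified above; note also that, for segmentation, the geometric transforms must be applied identically to image and label, which this compact torchvision example glosses over.

```python
import torch
import torchvision.transforms as T

model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder for the full network

# Adam with AMSGrad, weight decay 1e-5, and a cyclic cosine schedule (SGDR-style [32]).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5, amsgrad=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)  # T_0 assumed

# ISIC augmentation: right-angle rotations, flips, photometric jitter, shear, translation.
isic_augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomChoice([T.RandomRotation(degrees=(a, a)) for a in (0, 90, 180, 270)]),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # magnitudes assumed
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), shear=8),      # <=10% shift, <=8 deg shear
])
```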

F. STATISTICAL TEST
In order to determine whether increasing the reduction parameter r has any significant effect on the performance of the model, a statistical significance test is performed with respect to the baseline models with r = 1. Since the test set is the same for all experiments per data set, a paired comparison is required. Additionally, the outputs of the model cannot be assumed to be Gaussian. For these reasons, the Wilcoxon signed-rank test [33], a non-parametric paired statistical hypothesis test, is used to compare the models.
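As an illustration, the test reduces to a single SciPy call on paired per-image scores; the score values below are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-image Dice scores: baseline (r = 1) vs. a reduced-decoder model.
scores_baseline = np.array([0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89])
scores_reduced  = np.array([0.90, 0.89, 0.92, 0.86, 0.91, 0.86, 0.92, 0.88])

stat, p = wilcoxon(scores_baseline, scores_reduced)  # paired, non-parametric
print(f"W = {stat:.1f}, p = {p:.3f}")  # e.g., flag significance at p < 0.01
```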

III. EXPERIMENTAL SETUP
In this work, reduced decoders are evaluated in various settings on three data sets with varying complexity. The three different problems are employed to show that simpler decoders perform just as well as standard-sized decoders on simple problems, whereas they do not yield equivalent results when the problem is too difficult. First, a simple binary segmentation problem is discussed (ISIC). Second, our method is evaluated on a multi-class semantic segmentation problem (Cityscapes). Finally, results on monocular depth estimation (SUN RGB-D) are discussed, which is the most difficult of the three problems.

A. ISIC EXPERIMENTS
Data: For the first set of experiments, the data was extracted from the ISIC 2017: Skin Lesion Analysis Towards Melanoma Detection grand challenge data sets [27]. The data set consists of 2,750 RGB dermoscopic images with spatial resolutions ranging from 576 × 768 pixels to 6,748 × 4,499 pixels. All images have a corresponding binary annotation, indicating the loci of the lesions. Of these images, 2,000 are reserved for training, 150 for validation, and 600 for testing. All images have been resized to 256 × 256 pixels for computational efficiency. Four examples of images with corresponding ground truths from the ISIC data set are shown in Figure 2. This first problem is relatively simple: the ground truth is binary and the lesions are almost always centered in the image. However, it should be noted that shapes and textures can vary widely for different lesions and the borders of the lesions are not always clearly defined.
Experiment details: The number of output channels of the last residual module was set to unity for all ISIC experiments. All models are trained using the soft-Dice loss (L_Dice, given in Equation (1)), where the real value ŷ_i ∈ [0, 1] is the i-th output of the last network layer passed through a sigmoid non-linearity and y_i ∈ {0, 1} is the corresponding binary label. A smoothing parameter s = 1, proposed by Manning et al. [34], is added to the loss function for regularization purposes. The soft-Dice loss is defined by:

L_Dice = 1 − (2 Σ_i ŷ_i y_i + s) / (Σ_i ŷ_i + Σ_i y_i + s).   (1)

In Table 2, the Dice metric is reported, which is calculated as Dice = 1 − L_Dice. A Dice score of unity indicates a perfect prediction, while low values indicate poor performance. In addition to Dice, the Intersection over Union (IoU) is also reported, calculated by applying:

IoU = |ŷ_t ∩ y| / |ŷ_t ∪ y|,   (2)

where ŷ_t is the output of the network ŷ thresholded with t = 0.5 to generate a binary prediction mask. Finally, the result of the significance test (Wilcoxon) is also reported, as described in Section II-F. The test is performed using the Dice metric.
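A minimal implementation of this loss, following our reading of Equation (1), is sketched below.

```python
import torch

def soft_dice_loss(logits, targets, s=1.0):
    """Soft-Dice loss with smoothing term s (Equation (1)), averaged over the batch."""
    probs = torch.sigmoid(logits).flatten(start_dim=1)   # y_hat in [0, 1]
    targets = targets.flatten(start_dim=1)               # y in {0, 1}
    intersection = (probs * targets).sum(dim=1)
    dice = (2.0 * intersection + s) / (probs.sum(dim=1) + targets.sum(dim=1) + s)
    return (1.0 - dice).mean()

# Dice metric as reported in Table 2: dice = 1 - soft_dice_loss(logits, targets)
```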

B. CITYSCAPES EXPERIMENTS
Data: For the second set of experiments, the Cityscapes data set [35] is employed. The used part of the data set consists of 5,000 high-resolution images (2,048 × 1,024 pixels) of urban street scenes. Each image has a corresponding finely labeled segmentation with 30 classes, of which 19 are object classes that are assessed in the official Cityscapes evaluation script. The remaining 11 classes are labeled as 'void' classes, as they are either too rare (e.g., caravan) or cannot be assessed (e.g., background clutter). Figure 3 shows four representative images of the Cityscapes data set, including the corresponding ground-truth segmentations. The data was collected in 50 cities over several months (spring, summer, fall), and the images contain varying scene layouts and backgrounds, in addition to a large number of dynamic objects.

TABLE 2. Results on the ISIC data set. Dice and IoU are reported; the statistical test was performed on the IoU scores and compared to the models with no decoder reduction. An asterisk indicates that, while a statistical difference was found, the decoder reduction led to an increase in performance.

FIGURE 3. Four examples of images from the Cityscapes data set with corresponding ground truth. This problem is more complex than skin-lesion segmentation, since the output contains 19 object classes instead of two. However, the object boundaries are more clearly defined for these images.
Of this set, 2,975 images are reserved for training, 500 for validation, and 1,525 for testing. Labels for the test set are not publicly available, so 1,064 images are separated from the training set to create a test set for our purpose. This results in 1,911 images for training, 500 for validation, and 1,064 for testing. The images are split based on the city where each image was taken, such that images from the same city appear in only a single set. Images are resized to 256 × 128 pixels for computational efficiency, which deviates from the other experiments, in order to maintain the wide aspect ratio of the original images. The Cityscapes semantic segmentation problem is more complex than the ISIC segmentation challenge described in Section III-A. Instead of a binary problem, this data set contains fine-grained annotations of 30 classes, of which 19 are object classes. However, the segmentation boundaries are clearly defined (98% agreement between assessor ground truths over 30 images [35]).

Experiment details: The number of channels of the last residual module was set to 19 (one for each evaluation class in the official evaluation script [35]) for all Cityscapes experiments. The cross-entropy is used as a loss function, where pixels of void classes are ignored during training. Additionally, since some of the classes occur much more frequently than others, a class weight is added to give more importance to minority classes. The class weight is calculated as follows:

w_class = 1 / ln(γ + p_class),   (3)

where p_class is the prior probability that a pixel belongs to a particular class (calculated from the training set). In contrast to the more commonly used inverse class-probability weighting, these weights are bounded as the probability approaches 0. The additional parameter γ is set to 1.02 (restricting the class weights to the interval [1, 50]), as in the ENet paper [23]. The final loss function for the Cityscapes experiments is then defined as:

L_ce = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} w_c y_{i,c} ln(ŷ_{i,c}),   (4)

where w_c denotes the weight corresponding to class c, ŷ_{i,c} the softmax output for pixel i and class c, y_{i,c} the corresponding one-hot label, M the total number of classes, and N the total number of pixels of a single channel in the output prediction. Pixels that have a 'void' label are not taken into account, neither during training nor during testing. For the evaluation of the different models, three metrics are reported. The classes in the Cityscapes data set are subclasses of predefined categories (e.g., car, truck, and bus are subclasses of the vehicle category). The class IoU (IoU_c) and category IoU (IoU_cat) are defined as the mean IoU per class and per category, respectively. In addition, the cross-entropy loss, as defined in Equation (4), is reported as well. Finally, the statistical significance test was calculated using L_ce. A sketch of the class-weighting scheme of Equation (3) is given below.
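The following sketch is our own illustration of Equation (3); in practice, the pixel counts would be accumulated over the training labels.

```python
import numpy as np

def enet_class_weights(pixel_counts, gamma=1.02):
    """w_class = 1 / ln(gamma + p_class), Equation (3). For gamma = 1.02, the
    weights are bounded to roughly [1, 50] as p_class ranges over (0, 1]."""
    p = np.asarray(pixel_counts, dtype=np.float64)
    p = p / p.sum()                 # prior probability per class (from the training set)
    return 1.0 / np.log(gamma + p)

# Example with three hypothetical classes of very different frequencies:
print(enet_class_weights([5_000_000, 300_000, 4_000]))  # rare classes get large weights
```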

C. SUN RGB-D EXPERIMENTS
Data:
The SUN RGB-D data set is employed for the third set of experiments [36]. In this work, all images of indoor scenes with corresponding depth maps (the D component) are used for monocular depth prediction. Of this subset, 3,784 images were captured using Kinect v2 and 1,159 images using RealSense. Additionally, 1,449 images were included from the NYU Depth V2 data set [37]. The data set also contains 554 manually selected realistic scene images from the Berkeley B3DO data set [38], both captured by Kinect v1. Finally, 3,389 manually selected distinguished frames without significant motion blur from the SUN3D videos [39] are included, captured by Xtion. In total, the data set contains 10,335 RGB-D images. From this set, 4,228 images are reserved for training, 1,057 for validation, and 5,500 for testing. All images have been resized to 256 × 256 pixels for computational efficiency. Four example images from the SUN RGB-D data set are shown in Figure 4. This final problem can be considered the most difficult, since it is essentially a per-pixel regression problem and the ground truth is incomplete in cases where the depth sensor cannot obtain reliable measurements. Moreover, monocular depth estimation is an ill-posed problem, because a single 2-D image may be produced by a large (and potentially infinite) set of distinct 3-D scenes.

Experiment details: The number of output channels of the last residual module was set to unity and was followed by a sigmoid function for all models, instead of a softmax function, since this problem deals with regression instead of classification. For the depth prediction, the predicted and ground-truth logarithmic depth maps D and D̂ are used to calculate the loss. Defining d_i = D_i − D̂_i as the difference depth map and n as the number of pixels, the Scale-Invariant Loss (SIL) [40] is then defined by:

L_SIL = (1/n) Σ_i d_i² − (1/n²) (Σ_i d_i)².   (5)

In contrast to the standard L2 distance, used as an error term here (first term of Equation (5)), the SIL helps to measure the relationships between points in the scene, irrespective of the absolute global scale. In Table 4, the SIL metric is reported in addition to the absolute relative error (E_rel), which is defined by:

E_rel = (1/n) Σ_i |D_i − D̂_i| / D_i.   (6)

The absolute relative error is a commonly used metric to evaluate depth-estimation models.

IV. RESULTS
Table 1 shows that increasing r leads to drastically smaller decoders, which require as little as 0.45% of the number of floating-point operations and occupy more than 100 times less memory, compared to the baseline. The decoders with larger values of the reduction parameter r are extremely efficient and allow for substantially faster inference times.

A. COMPLEXITY COMPARISONS
The biggest parameter reduction is logically obtained when downsizing the symmetric U-Net-type model (ReDeNet). The ENet model benefits less from the proposed technique: only approximately 85% of the learnable parameters can be pruned, compared to more than 99% for the symmetric model. The ENet model requires expansion of the number of channels in order to use the max-pooling indices from the corresponding encoder layer. This limitation accounts for the difference in the attainable reduction.

B. ISIC
The results from the ISIC experiments are shown in Table 2.
Between the r = 1 baselines and the other models with reduced decoder sizes, no statistically significant difference is measured, except for the ReDeNet and ENet models with a reduction of 32 (p < 0.01). In addition to not showing a statistical difference compared to the baseline, the number of trainable decoder parameters and the inference time decrease drastically for higher values of r (the ReDeNet r=32 model has less than 1% of the trainable decoder parameters and less than 0.5% of the decoder FLOPs compared to ReDeNet r=1). Since no statistical difference was found between the baseline models and the models with a reduction of up to r = 16, it is evident that the decoder size can be drastically reduced for this problem without decreasing performance. In other, similar problems where less data is available (which is common for many medical imaging data sets), using a smaller decoder could also potentially help to prevent overfitting. Only when reducing the decoder size with r = 32 can a small, statistically significant performance drop be observed. Figure 5 shows four examples of skin lesions with corresponding ground truth and network predictions generated with the ReDeNet models. From these examples, it is clear that there is no substantial visual difference between the different ReDeNet models and the baseline. In the first and last example, some degradation is present for r = 32, although this is not unique to the smallest decoder.

C. CITYSCAPES
Results of the Cityscapes experiments are depicted in Table 3.
The differences in performance in terms of IoU_c are not very large, and IoU_cat in particular varies very little between models.
According to the Wilcoxon signed-rank test, no statistical difference is observable for ReDeNet models with reductions up to r = 8. This indicates that the decoder can likely be made at least 90% faster (Table 1) without a statistical loss of performance. The same observation applies to the ENet results; however, it is notable that, overall, the ENet segmentation scores are significantly lower than those of the ReDeNet and Deeplab models. This may be attributed to the lack of skip connections, which play an important role in more difficult segmentation problems such as semantic scene segmentation. For the Deeplab models, reducing the decoder further results in a significant performance decrease. The number of decoder channels of the base model was optimized on another scene-segmentation data set in the original paper [24], which might explain why reducing the decoder further leads to a decrease in performance for higher values of r. Figure 6 shows four examples of Cityscapes images with corresponding ground truth and network predictions generated by the ReDeNet models. The network predictions are relatively consistent, and there does not seem to be a significant visual difference between the models. All models perform well on large objects such as roads, cars, and the sky. Smaller objects and objects that are further away cause the most errors in all models. This is unsurprising, considering that the images were downscaled from 2,048 × 1,024 pixels to 256 × 128 pixels.

TABLE 3. Results on the Cityscapes data set. L_ce, IoU_c, and IoU_cat are reported; the statistical test was performed on the L_ce scores and compared to the model with no decoder reduction. An asterisk indicates that, while a statistical difference was found, the decoder reduction led to an increase in performance.

D. SUN RGB-D
TABLE 4. Results on the SUN RGB-D data set. SIL and E_rel are reported; the statistical test was performed on the SIL scores and compared to the model with no decoder reduction. An asterisk indicates that, while a statistical difference was found, the decoder reduction led to an increase in performance.

The results from the SUN RGB-D experiments are presented in Table 4. A statistical difference was measured between ReDeNet r=1 and the other models, except for the ReDeNet r=2 model. For ENet and Deeplab, a statistical decrease was observed for r ≤ 16. This indicates that, in most cases, models with larger decoders perform better for monocular depth estimation. This is in line with expectations, since monocular depth estimation is much more complicated than, e.g., binary segmentation (Section III-A). It is also interesting to note that Deeplab outperforms the other models in terms of the Scale-Invariant Loss (SIL) and the relative error E_rel. Visually, however, there is no major difference. For all three model types, the visual results only appear significantly worse when the decoder is substantially smaller (Figure 7). The predicted depth maps start to degrade in quality for r = 8 and are completely meaningless for r = 32. These visual discrepancies are not reflected to the same extent in the SIL and E_rel metrics, which indicates that better metrics for monocular depth estimation are desirable.

V. DISCUSSION
The size of the decoder in encoder-decoder networks is an arbitrary choice in many state-of-the-art machine learning solutions. However, the results from the experiments performed in this work clearly show that, in many cases, the complexity of the decoder can be greatly reduced. For example, in the experiments with a binary classification problem (Section IV-B), the number of parameters in the decoder can be reduced by a factor of 100 before an observable statistical difference occurs. Similar results were obtained for multi-class segmentation in the Cityscapes experiments (Section IV-C), where the decoder could be made 90% faster without loss in performance. On the other hand, in more complex problems such as monocular depth estimation (Section IV-D), a very large decrease in the size of the decoder severely affects the visual quality of the output predictions. Even in these cases, the complexity of the decoder can still be reduced considerably before the visual quality visibly declines. From the results of the experiments, it is concluded that the optimal size of the decoder depends on the problem. However, in all cases, the number of decoder parameters can be reduced significantly compared to the full-sized baselines without considerable loss of performance. It is also noteworthy that a network with a reduced decoder, e.g., ReDeNet r=2, often outperforms the full-sized baselines in the ISIC experiments, as well as in the Cityscapes experiments. This phenomenon may occur because the reduced decoder size helps prevent overfitting due to the reduction in learnable parameters. For future work, it would be very helpful to have an indication of the necessary decoder size without having to determine this hyperparameter empirically by trial and error. Methods to estimate problem complexity could be investigated to facilitate this. Additionally, it would be interesting to evaluate whether a comparable analysis of sequence-to-sequence models leads to similar results. In sequence-to-sequence models, the output representation is often of similar complexity to the input, e.g., in text translation. Decreasing the decoder complexity in these situations may not lead to better results. In other problems, such as action recognition in video, the output sequence is considerably less complex (a list of integers) than the input sequence (video data). However, approaches to action recognition in video sequences generally already have significantly smaller decoders. Lastly, the hypothesis that smaller decoders reduce overfitting on small data sets can be investigated in future work, to explain the notable performance increase observed while reducing decoder complexity in the ISIC and Cityscapes experiments.

VI. CONCLUSION
In this work, we have proposed a simple method to reduce the complexity of decoders in encoder-decoder networks. We show that the symmetrical properties of state-of-the-art encoder-decoder models are sub-optimal and that equal performance can be attained with substantially smaller decoder designs. These changes allow for a significant reduction in training as well as inference time in a variety of tasks. The method is generic in nature and is applicable to any encoder-decoder architecture that uses convolutional layers. This is verified with an extensive evaluation on three vastly different data sets of variable complexity, where we have found that a reduction in decoder channels shows no statistical decrease in performance. At the same time, the method enables a decoder requiring up to 99% fewer FLOPs (approximately 90% fewer FLOPs is attainable for all investigated problems). The proposed solution is a simple method and can easily be implemented in other encoder-decoder models. In this work, the most commonly used encoder-decoder model (U-Net) is used as an example of how to incorporate this technique. In addition, results show that architectures that already claim to have an efficient decoder (ENet and Deeplab) can still benefit from an additional reduction in decoder parameters, depending on the problem complexity. Empirical results also show that a reduction in parameters may even lead to slightly improved performance, which likely occurs because fewer parameters reduce overfitting effects.