Hybrid Skip: A Biologically Inspired Skip Connection for the UNet Architecture

In this work we introduce a biologically inspired long-range skip connection for the UNet architecture that relies on the perceptual illusion of hybrid images, i.e., images that simultaneously encode two images. The fusion of early encoder features with deeper decoder ones allows UNet models to produce finer-grained dense predictions. While proven in segmentation tasks, these benefits are down-weighted in dense regression tasks, as long-range skip connections additionally result in texture transfer artifacts. For depth estimation in particular, this hurts smoothness and introduces false positive edges, which are detrimental to the task given the piece-wise smooth nature of depth maps. The proposed HybridSkip connections achieve an improved balance in the trade-off between edge preservation and the minimization of the texture transfer artifacts that hurt smoothness. This is accomplished through a proper, balanced exchange of information between the high frequency encoder features and the low frequency decoder features.


Introduction
Skip connections, specifically the bypassing of convolutional layer blocks within a convolutional neural network (CNN) architecture, are a core building block of modern data-driven models [59]. Residual blocks [15,16] use short-range skip connections with identity mappings and residual functions to improve information propagation in both the forward and backward passes. They are the basic building block of ResNets, one of the most popular and best performing CNN backbones.
At the same time, UNet [47] is another autoencoder CNN architecture that relies on long-range skip connections, forwarding early encoder features to their corresponding resolution features on the decoder's side. Different from residual skip connections, UNet concatenates the encoder and decoder features, allowing the network to implicitly learn their fusion through the decoder's convolutional layers. However, this is a challenging problem, as there exists a semantic gap between the encoder features and the corresponding decoder ones, which stems from the higher level concepts and semantic information that are progressively encoded into CNNs. Despite this challenge, UNet remains a dominant architecture, especially for semantic segmentation, surpassing fully convolutional networks (FCN) [32], mainly because it offers higher boundary preservation performance.

Figure 1. Long-range skip connections are instrumental to the popular UNet architecture but are also challenged by the semantic gap between the encoder E and decoder D features. While they allow UNets to capture high resolution details, this is not always beneficial to dense regression tasks that need to overcome texture transfer and also preserve smoothness. We introduce a biologically inspired skip connection that balances the effect of the high frequency encoder features and the dominant structural information carried by the decoder ones. From left to right, the bottom row visualizes animations of the encoder and decoder feature maps before and after the hybrid skip connection from a trained dense depth regression model (should they not play automatically, please consider viewing them with the specific versions of Adobe Acrobat Reader that support animated images; clicking and holding pauses playback).
Consequently, various works have focused on overcoming this encoder-decoder semantic gap in UNet's long-range skip connections. Straightforward approaches add learnable operations to lessen the gap, with MultiResUNet [21] relying on residual blocks. More involved approaches utilize gating-based spatial attention [38] to attend to the encoder features in a localized manner, or semantic embedding branches and global convolutions [63]. The search for an appropriate encoder-decoder skip connection led to the use of neural architecture search (NAS) [56], which identified the squeeze-and-excite operation [18] as the most prominent candidate mapping function.
Notably, all these works have applied their proposed skip connections to semantic segmentation, the downstream task that UNet was initially designed for. Yet recently, UNet-like architectures are increasingly being used in depth estimation as well [6,12-14,33,44,65], as the long-range skip connections offer higher boundary preservation performance. However, the latter is a dense regression task, in contrast to the former, which is a dense classification task. For depth estimation, the model's parameters encode a continuous function approximation, whereas for semantic segmentation the model focuses on learning a high-dimensional decision surface. The core difference lies in the nature of depth maps, which are piecewise smooth functions [20]; compared to semantic segmentation, the smoothness property needs to also hold for the predicted output, whereas for segmentation, boundary preservation is the only secondary trait of importance. Consequently, for a regression task like depth estimation, skip connections usually result in texture transfer artifacts which hurt smoothness and introduce false positive boundaries.
In this work, we design a biologically-inspired skip connection based on the way humans process visual input [5,29], specifically the decomposition into different spatial frequencies that happens early on in the visual pathway. Higher spatial frequencies become imperceptible with farther viewpoints, with the reverse holding for closer viewpoints. The human visual system assimilates higher spatial frequencies into lower ones as viewing distance increases, a mechanism that has been exploited by prior work to generate illusions [39]. We exploit this mechanism as well, to facilitate the exchange of information between the encoder and decoder features, taking into account their higher and lower frequency nature respectively resulting from the autoencoder's inductive bias.
More specifically, we contribute the following:
• We design a lightweight and plug-and-play hybrid feature skip connection for the UNet architecture. It performs a blending-based information exchange between the higher and lower level feature maps partaking in a long-range skip connection, prior to their fusion.
• We experimentally demonstrate the efficacy of various skip connections in a dense regression task, while taking into account their performance differentials on secondary traits as well; namely boundary preservation and smoothness.
• We demonstrate that our proposed skip connection strikes a better balance at boosting direct depth, boundary and smoothness performance, compared to other state-of-the-art skip connections.

Related Work
The UNet CNN architecture [47] was initially introduced for semantic segmentation and was the first architecture to include long-range skip connections, forwarding information from the encoder to the decoder via feature fusion. The skip connection improves detail preservation by propagating the early encoder features near the prediction features, boosting semantic segmentation accuracy by allowing for thinner structure segmentation, rendering UNet the standard architecture for this task. More information about the UNet architecture and the importance of the skip connection can be found in various surveys about U-shaped network architectures [31,42].
Nonetheless, its strength also presents one of its main weaknesses. While skip connections propagate details near the prediction layers, facilitating more detailed dense signal reconstructions, they are not necessarily optimal in their pure identity mapping form. The reason is that the raw fusion of early encoder and late decoder information is hindered by their semantic gap. CNNs typically extract high spatial-frequency details (e.g. edges, texture, lines) in the early stages, while at the deeper layers the network produces category-specific feature representations [3,15].

Figure 2. The hybrid images [39] human vision based illusion that encodes a dual image. From left to right: i) first and ii) second image, iii) low pass filtered first image, iv) high pass filtered second image, and v) the hybrid image, whose interpretation changes with viewing distance (from the second to the first image, by zooming in and out of the document respectively). The rightmost animated image shows the blending of the low and high pass images using an interpolated blending factor from 0.1 to 0.9 that mathematically simulates the physical viewing distance change. (The figure contains animated images; should they not play automatically, please consider viewing them with the specific versions of Adobe Acrobat Reader that support animated images.)

Among the techniques designed to address this semantic gap, ExFuse [63] used a complex skip connection, replacing the identity mapping with a cascade comprising a semantic embedding branch and a global convolution module. Results on both an FCN and a UNet demonstrated its efficacy in improving semantic segmentation performance. Approaching the same problem from another perspective, Attention UNet [38] introduced a novel attention gate as the skip connection. Each skip connection softly attends to the incoming encoder features using a gating signal.
Initially, additive attention between the projections of the gating signal and encoder features is used to generate an attention grid after aggregating and projecting the result. This is then resampled and used to reduce or preserve the importance of the encoder features in a localised manner. Results in medical segmentation showcased an improvement over vanilla UNet. The concept was similarly applied to the UNet++ architecture, resulting in Attention UNet++ [28], which adapts the gating signal to the nesting levels and shorter skip connections.
More recently, in MultiResUNet [21] the identity mapping skip connection was replaced by a series of residual blocks that aim at alleviating the semantic gap between the encoder and decoder features. Taking into account that earlier encoder features suffer from a larger semantic gap, more blocks were used for the earlier features than for those closer to the bottleneck. Apart from an improvement in dense and boundary segmentation, the residual skip connections also exhibited robustness to noise. In a similar fashion, MAPUNet [58], inspired by UNet++ [66] and UNet 3+ [19], exploited multi-scale feature fusion and supervision for monocular depth estimation. Moreover, a UNet++ variant with residual blocks and dense gated convolution based attention [60] was used for monocular depth estimation using sparse depth measurements [64]. Finally, NasUNet [56] employed neural architecture search to look for an efficient and effective UNet architecture, a finding shared by [48] as well. Their search identified the Squeeze-and-Excite operation [18] as the most dominant replacement for the standard (identity) skip connection. Apart from the identity mapping, the search also included traditional and dilated [7] convolutions, as well as depthwise separable convolutions [50]. Evaluation on different medical segmentation datasets showed performance increases at a fraction of the parameters and at reduced memory cost.
Evidently, all the aforementioned works focused on segmentation tasks, while works using UNet's skip connections in reconstruction or regression tasks rely on the vanilla UNet. Dense regression tasks impose more stringent requirements than segmentation tasks, as the predicted signals need to exhibit richer properties. A notable example is depth images, which need to preserve edges and their magnitude, while also varying smoothly in areas where no significant discontinuities manifest [20]. Compared to previous works, we focus on UNet networks used for regression and holistically assess the efficacy of these advanced skip connections [21,38,56,63] in improving performance and preserving properties like boundaries and smoothness. Further, we propose a biological vision inspired skip connection based on scale space theory that simultaneously better preserves the output signal's secondary properties.

Approach
Our work focuses solely on the long-range skip connections found in the UNet architecture, and specifically on the fusion of features coming from different depths of the model. Encoder features are learned earlier (shallower) and at higher resolutions, while decoder features are learned later (deeper) and at lower resolutions than the corresponding encoder ones that they will be fused with. Our inspiration stems from Hybrid Images [39]. We briefly introduce them in Section 3.1, followed by our proposed Hybrid Skip connection in Section 3.2.

Hybrid Images
Hybrid images H are dual images that jointly encode two different images, A and B, but only one is largely perceived. Their interpretation changes with viewing distance, creating a smooth optical illusion which has been used to study patients [27], for face identification [35], to create two-layer QR codes [61], or even for recreational art. They are generated by blending two different spatial resolution images:

H = f_l(A) + f_h(B),    (1)

where f_l and f_h are a low-pass and a high-pass filter respectively. Essentially, image A is heavily blurred, making it visible from farther distances, while image B is composed of edges, which are only visible from close up. Figure 2 shows the resulting illusion and intermediate representations.
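As a concrete sketch, the construction of Eq. (1) can be reproduced with a separable Gaussian low-pass filter and its residual as the high-pass; the kernel size and σ below are illustrative choices, not the ones used in [39]:

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    """Discrete 1-D Gaussian kernel, normalized to unit sum."""
    x = np.arange(size) - size // 2
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def low_pass(img, size=9, sigma=2.0):
    """Separable Gaussian blur: filter along rows, then columns."""
    k = gaussian_kernel(size, sigma)
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, blurred)

def high_pass(img, size=9, sigma=2.0):
    """High-pass as the residual of the low-pass."""
    return img - low_pass(img, size, sigma)

def hybrid_image(a, b, size=9, sigma=2.0):
    """Eq. (1): H = f_l(A) + f_h(B)."""
    return low_pass(a, size, sigma) + high_pass(b, size, sigma)
```

Blurring A keeps only its coarse structure (perceived from afar), while the residual of B keeps only its edges (perceived up close), which is what makes the blend a dual image.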

Hybrid Skip Connection
The UNet architecture's success relies on the long-range skip connection [42,59] that fuses early encoder features E ∈ R^{F×H×W} with late decoder features D ∈ R^{F×H×W}. The typical UNet fusion scheme is a learnable fusion using a convolutional layer that receives as input the concatenation of the encoder and decoder features:

F = H_i([s(E); D]),    (2)

where H_i(·) denotes the convolution function of the i-th layer and, without loss of generality, s(·) denotes the encoder features' skip function, which for the typical UNet is the identity mapping. It is this multi-scale propagation of earlier encoder features to the late decoder layers that allows UNet architectures to capture finer details. Yet, there are challenges associated with this fusion scheme, namely the semantic gap between E and D, as well as the different spatial frequencies of these two feature maps. Earlier CNN blocks capture lower level features like lines and edges, while later CNN blocks capture higher level features and concepts, a fact that renders their straightforward fusion inefficient. Further, earlier encoder features are captured at higher resolutions and contain higher spatial frequencies, while later decoder features contain lower spatial frequencies and are typically upsampled at the skip connection fusion step. Bilinear interpolation of a lower resolution feature map results in low spatial frequencies [9].
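The vanilla fusion, a convolution over the concatenation [s(E); D], can be sketched as follows; shapes are hypothetical, and a 1×1 convolution stands in for H_i, since larger kernels act the same way per output location:

```python
import numpy as np

def identity_skip(e):
    """The vanilla UNet skip function s(.): an identity mapping."""
    return e

def unet_fuse(e, d, weight, bias):
    """Convolutional fusion over the channel-wise concatenation [s(E); D].
    e, d: (F, H, W) features; weight: (F_out, 2F); bias: (F_out,).
    A 1x1 convolution is just a per-pixel linear map over channels."""
    x = np.concatenate([identity_skip(e), d], axis=0)               # (2F, H, W)
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]
```

The fused map mixes the high-frequency encoder channels directly with the upsampled decoder ones, which is exactly where texture transfer can enter the prediction.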
Hybrid images, as represented by Eq. (1), blend together two images of different spatial frequencies, toggling the perception of one or the other via how the human visual system's perception changes with viewing distance. The latter mechanism can be generalized to alpha blending:

H = α f_a(A) + (1 − α) f_b(B),    (3)

where f_a and f_b are two filters converting A and B into different frequency images. The blending coefficient α controls the viewing distance and, therefore, converts the dual image to a distinctly perceived representation. Figure 2 shows the transition from one image to the other as α is interpolated.

Considering the skip connection fusing the semantically and spectrally different feature maps E and D, we rely on the following hybrid feature functions:

E_hybrid = ε ⊙ E + (1 − ε) ⊙ f_h(D),    (4)
D_hybrid = δ ⊙ D + (1 − δ) ⊙ f_l(E),    (5)

where δ, ε ∈ R^F are two alpha blending vectors and ⊙ denotes channel-wise scaling. These are combined to form the hybrid skip connection's fusion function:

F_hybrid = H_i([E_hybrid; D_hybrid]).    (6)

Compared to most other non-identity skip connections [21,38,56,63], the hybrid skip connection presented in Eq. (4), (5) and (6) facilitates a bidirectional information exchange between the encoder E and decoder D features, whereas the aforementioned skip connections only focus on bridging the semantic gap between E and D by increasing the semantic information carried by the encoder features E.

Analyzing HybridSkip. There are multiple ways in which F_hybrid can be analyzed. From an attention perspective, it can be considered a mix of heterogeneous feature boosting [53] using soft attention [24] on the respective features. The decoder features attend to the encoder ones, and vice versa, boosting specific features depending on the blending factors. While traditional channel attention simply scales entire feature maps (e.g. the squeeze-and-excite skip connection in [56]) and grid based attention only focuses on spatial feature selection (i.e. [38]), our hybrid approach is distinctly different, although it combines these two concepts.
The channel-attended encoder (respectively decoder) features that boost the corresponding channel-attended decoder (encoder) features are directly tied to the spatial information already learned by those features.
From a spectral processing point of view, it can be considered a selective alignment or focusing of the spatial frequencies of the blended feature maps. Considering that the early encoder features E contain higher frequencies than the upsampled late decoder features D, the second term in Eq. (4) and (5) is essentially a band-pass filtered feature map, as low/high frequency inputs are passed through a high/low frequency filter. Therefore, both terms blend inputs from a frequency spectrum lying in the middle of the two original feature maps, which occupy opposite ends of the spatial frequency range.
Considering the semantic gap, it is apparent that the hybrid skip connection closes the gap in a symmetric fashion by using both inputs to derive the features to be fused. In contrast to most approaches, it does not seek to close the gap by aligning the encoder features to the decoder ones (e.g. as in [21,63]), but by appropriately blending them. As ε decreases, the structural edges derived from the decoder features become more dominant in the fused encoder features, accentuating these edges compared to those encoded in E. Similarly, as δ decreases, the smoothed detailed edges encoded in the encoder features progressively add texture to the decoder features. With appropriate blending factors, both directions tend to reduce texture transfer and preserve the edges that matter, leading to a balancing effect between the smoothness and boundary preservation properties of the resulting fused feature maps, and eventually of the predicted signal. Notably, the process is distinct for each feature map; with δ and ε being learnable parameters of the model, it encodes a dual representation of these features and learns which one is more appropriate during training.

Results
Experimental Setup. For our analysis we use a dense regression task, namely depth estimation, which requires balancing both boundary preservation and smoothness of the predicted depth maps, apart from its direct depth estimation performance. To fully exploit rich depth maps that include both smooth regions and many foreground to background depth discontinuities, we use an omnidirectional image benchmark [1]. It includes spherical panoramas that capture entire indoor scenes, containing many flat surfaces (ceilings, floors, tables, etc.), as well as a plurality of foreground objects given their omnidirectional field of view, resulting in rich piece-wise smooth depth maps. Similar to most works on spherical depth estimation [6,10,12,41,67], we evaluate depths up to 10m and use standard metrics for depth estimation, as well as for boundary preservation [17,26] and surface orientation [55].
Implementation Details. Our implementation is based on moai [37], which uses PyTorch 1.8 [40], PyTorch Lightning 1.0.7 [11] and Kornia 0.4.1 [45]. For all experiments we use the same UNet architecture and supervision scheme used in Pano3D [1], fixing the learning rate (0.0002), optimizer (Adam [25] with default parameters), batch size (4) and random number generator seed. Thus, only the skip connection varies from experiment to experiment. We use the Pano3D low resolution (512 × 256) Matterport3D (M3D) train and test splits for all experiments and apply no data augmentation, training for 60 epochs. For the low pass and high pass filters f_l and f_h, we use a discrete isotropic Gaussian and a discrete isotropic Laplacian filter respectively.
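The two discrete kernels mentioned here can be written down explicitly. The Gaussian σ default below is a common size-dependent heuristic and an assumption on our part; the paper's exact σ (Kornia's choice) is not restated in the text:

```python
import numpy as np

def gaussian_kernel2d(size=9, sigma=None):
    """Discrete isotropic 2-D Gaussian kernel, normalized to unit sum.
    The default sigma follows a common size-dependent heuristic (assumption)."""
    if sigma is None:
        sigma = 0.3 * ((size - 1) * 0.5 - 1) + 0.8
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def laplacian_kernel2d():
    """The standard 3x3 discrete isotropic Laplacian (4-neighbourhood)."""
    return np.array([[0.0,  1.0, 0.0],
                     [1.0, -4.0, 1.0],
                     [0.0,  1.0, 0.0]])
```

The Gaussian sums to one (it preserves the mean of what it smooths), while the Laplacian sums to zero (it responds only to local intensity changes), which is what makes them low and high pass respectively.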

Hybrid Skip Connection Analysis
In this section we seek to understand the proper design of the hybrid skip connection. Our analysis focuses on one hand on the kernel size K of the low and high pass filters f_l and f_h respectively, and on the other hand on the choice of the encoder and decoder blending factors ε and δ respectively. Regarding the latter, one approach would be to use constant blending factors, explicitly controlling the information exchange between the two feature maps E and D.
This way, the encoder and decoder blending factors would be ε = ε1 and δ = δ1, with 1 denoting a vector of ones of length F, corresponding to the feature maps of each skip connection. Another approach would be to consider the blending factors as parameters of the model and jointly optimize them with the convolutional UNet parameters. This would allow the model to adapt the blending factors to each separate feature instead. In this case, the blending factors are given by ε = σ(ε̂) and δ = σ(δ̂), with the hat symbols denoting the model's parameters, and σ being the sigmoid function constraining the blending factors to lie in the [0, 1] range. When using learnable blending factors, the parameters ε̂ and δ̂ are initialized using a zero mean and unit variance normal distribution N(0, 1).
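The two parameterizations can be sketched as follows, with F a hypothetical channel count and the 0.75/0.25 constants purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F = 64  # channels of one skip connection (illustrative)

# Constant blending: one scalar per side of the skip, broadcast over channels.
eps_const = 0.75 * np.ones(F)
delta_const = 0.25 * np.ones(F)

# Learnable blending: unconstrained per-channel parameters, initialized from
# N(0, 1) and squashed with a sigmoid so that eps and delta lie in (0, 1).
rng = np.random.default_rng(42)
eps_hat, delta_hat = rng.standard_normal(F), rng.standard_normal(F)
eps, delta = sigmoid(eps_hat), sigmoid(delta_hat)
```

In a PyTorch model, ε̂ and δ̂ would be registered as nn.Parameter tensors so that the optimizer updates them jointly with the convolutional weights.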
To perform an aggregated analysis across many metrics of different performance traits, namely direct depth, boundary preservation and smoothness, we use indicators derived from the metrics used in [1], which aggregate accuracy and error metrics. The plots in Figure 3 present the results across different kernel sizes and blending factors. For the former we experiment with K = {3, 5, 7, 9} and for the latter, apart from the learnable blending factors, we also use the following explicit blendings {0.25, (0.25, 0.75), 0.5, (0.75, 0.25), 0.75}, with the tuples referring to (ε, δ) combinations. Two trends are observed: first, an increasing kernel size provides consistent performance gains; second, the learnable blending factors are a consistently good performer across different kernel sizes. Consequently, we use the K = 9 kernel size with learnable blending factors as our baseline hybrid skip connection UNet model. From an interpretation perspective, analysing the learnable blending factors offers insight into how the hybrid skip connections behave. We illustrate the distribution of the blending factor coefficients of the K = 9 model across its 5 skip connections in Figure 4. We observe an interesting and reasonable trend: the deeper layers focus on the structure offered by the incoming encoder features and their low-pass outputs (the encoder features in this case are not early encoder features), while as we progress towards the layers closer to the output, the blending factors indicate that the focus shifts to the predicted signal and its dominant edges, suppressing the encoder features that result in texture transfer.

Comparison with other Skip Connections
We additionally compare the performance of the proposed hybrid skip connection to other approaches used for long range skip connections. More specifically, we present results for a straightforward convolutional (Conv) skip connection stacking k 3 × 3 convolution layers, and the stacked residual unit skip connection [21] (Residual), where k_i units are stacked, with i ∈ {1, ..., 5} indicating the i-th encoder-decoder layers. Apart from the stacked approaches, we also compare against the Attention UNet [38] skip connection (Attention), and the NAS-identified [56] Squeeze-and-Excite [18] (SqEx) skip connection. Finally, we adapt the ExFuse [63] skip connection to the UNet architecture, using the decoder features as the high-level feature map fed to its semantic embedding branch.

Figure 5. (a) Conv, (b) Residual [21], (c) ExFuse [63], (d) Attention [38], (e) SqEx [18,56].

Tables 1 and 2 present the results of each skip connection on the M3D test set for the direct depth estimation metrics, as well as the boundary and smoothness preservation ones respectively. Evidently, the hybrid skip connection (learnable blending factors, K = 9) outperforms the other skip connection approaches for dense regression in two aspects. First, it offers the largest gain in terms of improving direct depth estimation performance. Second, it additionally offers the most balanced performance increase compared to a vanilla UNet [47] across the secondary, competing, performance traits. Finally, it does so at a reduced extra parameter cost (last column in Table 2). While SqEx (Residual) offers an important performance boost for preserving the depth map's smoothness (boundaries), it does so at the expense of preserving boundaries (smoothness). The second most balanced approach is ExFuse, which manages to offer reasonable performance gains across all performance traits.
These comparisons can be more easily discerned in Figure 6, which illustrates radar plots across the normalized accuracy (a) and error (e) indicators for all performance axes. Figure 7 presents qualitative results of our K = 9 hybrid skip model, compared to the SqEx and Residual models. The latter are the better performing models in terms of surface and boundary preservation respectively, but clearly showcase the difficulty in achieving a balance between these two traits, as their improved performance on one translates to a reduced performance on the other. In contrast, the hybrid skip model strikes a better balance in preserving both traits.

Hybrid Skip Ablation
We additionally perform an ablation study of the three functional components that jointly form the HybridSkip connection. First, we examine a scenario where only the learnable blending of the encoder and decoder features is introduced, without any filtering. Then we also conduct two experiments where only a single filter is applied, either only to the encoder features (low pass) or only to the decoder ones (high pass), respectively denoted as F_low = H_i([f_l(E); D]) and F_high = H_i([E; f_h(D)]). The results for the two larger kernels (i.e. K = 7, K = 9) are presented in Tables 3 and 4, where the former includes the metrics related to direct depth estimation performance and the latter includes the metrics related to the secondary traits, depth smoothness and boundary preservation.
While each functional component in isolation may improve performance along a single axis or metric, it is evident that their combination leads to the most balanced performance boost. Interestingly, we observe that the preservation of structural edges is easy to achieve, but at the expense of smoothness or direct performance, something that is better mitigated when all components co-exist as a HybridSkip connection. However, the discrepancy in boundary preservation between K = 7 and K = 9, with the smaller kernel showing improved accuracy, indicates that the kernel parameters should be tuned on a per-dataset basis.

Other Architectures
Finally, we examine the behavior of other UNet architectures and established models for 360° depth estimation. For the former we use UNet++ [66] variants, plain and equipped with the SqEx skip connections, the latter having been found to be the most balanced alternative in the skip comparison experiments in Section 4.2. For the latter, we employ the state-of-the-art BiFuse [54] and HoHoNet [51] models. All experiments are done using the same training scheme and our rich supervision, as presented in the previous experiments, essentially only switching the architecture for each different experiment, even for the BiFuse and HoHoNet models, for a fairer comparison. Tables 5 and 6 present the direct and secondary metrics respectively, including the baseline UNet and our proposed vanilla UNet variant with HybridSkip connections. While HoHoNet, a model specialized for the 360° domain, produces high quality depth estimates, followed by our model, its behaviour with respect to preserving discontinuities and smoothness is largely degraded, showcasing worse performance even compared to the vanilla UNet. On the other hand, BiFuse largely favours smoothness over boundary preservation, whereas both UNet++ variants naturally show improved boundary preservation. As also seen in the experiments comparing different skip connections, the SqEx UNet++ balances the two secondary traits better, offering good results for smoothness as well, compared to the pure UNet++ architecture, overcoming the deficits of skip connections. Nonetheless, its direct depth estimation performance is still at similar levels to UNet++, and inferior to the better performing 360° depth estimation models. Overall, our hybrid skip connection vanilla UNet model offers the most balanced performance across direct and secondary trait metrics, as illustrated in Figure 8, with only the SqEx UNet++ model coming close in terms of balanced performance.

Conclusion
In this work we have designed a hybrid skip connection for the UNet architecture, which relies on long range skip connections that fuse features with a large semantic and spectral gap. The simultaneous blending and spatial nature of the hybrid skip connection allows for a balanced performance boost across all performance traits for depth estimation, a dense regression task, with minimal parameter overhead. These results indicate that it may be worth exploring the hybrid image concept in the various existing UNet modifications [19,36,43,52,66], or even in CNN architectures without long range skip connections, with a recent report [4] providing interesting evidence about their interplay with CNNs. Potential explorations may include short range skip connections (e.g. residual units), or integration within basic CNN building blocks (e.g. squeeze-and-excite operations or Octave Convolutions [8]). One limitation is the design of the filters themselves, which currently operate in the spatial domain and whose parameters remain fixed during training. While larger kernel sizes may provide a more balanced performance improvement, as illustrated in Figure 3, each output signal's distribution may be more tuned to specific kernel parameters (e.g. K = 7 showing better boundary preservation in Table 4). Spectral or learnable filtering may allow models to better adapt to the task and data at hand. Further, experimenting with adaptive blending would open up dynamic dual feature representations, instead of the fixed blending factors that statically choose representations at the end of the model's training. It also remains to be seen whether these balanced skip connections also boost performance in other downstream tasks like segmentation.

Figure 8. Same scheme as Figure 6, with larger numbers inside the parenthesis indicating the area covered by each different approach; larger areas indicate more balanced performance across all traits.