Parallel Recurrent Module with Inter-layer Attention for Capturing Long-range Feature Relationships

Capturing long-range feature relationships is becoming a central issue for convolutional neural networks (CNNs). In particular, several recent end-to-end trainable attention modules have attempted to model spatial-channel relationships within a given layer. In this work, we focus instead on modeling relationships among visual information captured in different layers and propose a novel module, referred to as a Parallel Recurrent Module with Inter-layer Attention (PI module). The PI module exhibits several unique characteristics, including the ability to memorize information from earlier layers and to ameliorate gradient vanishing, both of which are issues not addressed by existing attention modules. Furthermore, thanks to its easy-to-adopt structure and negligible computational overhead, the module successfully extends not only to CNNs on regular grids but also to graph convolution networks, and even to other attention modules. We demonstrate through extensive experiments that the PI module is cost-efficient yet effectively provides additional performance gains on multiple benchmarks in classification, detection, and segmentation tasks in the image domain and a segmentation task in the point cloud domain.


I. INTRODUCTION
Convolutional neural networks (CNNs) have become the standard for solving vision tasks. Although recent work on the transformer architecture [1] and its variants [2], [3] has triggered a transition from CNNs to transformer-based networks, such models lack the inductive bias of locality that CNNs possess and require larger datasets to acquire it. Therefore, auxiliary attention modules that support CNNs are still actively developed [4]-[6]. Several architectural design aspects of CNNs, such as depth, width, cardinality, and spatial-channel relationships, have been investigated for successful feature learning in vision. In particular, studies that use end-to-end trainable attention modules to model spatial-channel relationships inside a convolution block are on the rise [7]-[10]. However, existing attention modules, referred to here as intra-layer attention modules, only model the inter-dependencies of neurons inside a single layer. Therefore, they cannot structurally model relationships between neurons in different layers.
To encode inter-layer relationships into the model, we focus on an essential property of CNNs: the feed-forward structure. As CNNs model long-range interactions through the progressive propagation of visual signals by iterative local aggregations, we are motivated to offer a new interpretation of CNNs: concatenating spatially-pooled features in the depth direction forms a sequence of visual features arranged in increasing order of the receptive field size.
Inspired by the missing aspect in current attention modules and the structural characteristic of CNNs, we propose a novel attention module called the 'Parallel Recurrent Module with Inter-layer Attention (PI module).' The goal of this work is to achieve the explicit modeling of inter-layer relationships in CNNs. To do this, we design two concepts: a parallel recurrent module (PRM) and an inter-layer attention mechanism. Briefly, the PRM is a small parallel memory stream external to the main CNN that carries attention information along with the propagation of the CNN, and the inter-layer attention mechanism serves to calibrate the network based on the layer-wise importance distribution. Furthermore, when incorporating our module into recent advances in spatial-channel attention [7], [8], the PI module can be recognized as a generalized attention module that considers the relationships among neurons both inside and outside of a layer.
We thoroughly investigate the impact of each module component and verify the high compatibility of the PI module with Graph Convolution Networks (GCNs) and even with other attention modules. For a fair evaluation, we evaluate our module on image classification, object detection, image semantic segmentation, and point cloud segmentation tasks on multiple benchmarks: CIFAR [11], ImageNet-1K [12], Microsoft COCO [13], Pascal VOC [14], and S3DIS [15].
In general, the PI module alone significantly boosts backbone networks, achieving results comparable to the respective state-of-the-art while incurring a negligible increase in computational complexity (GFLOPs, parameters, and inference time). The PI module also improves state-of-the-art attention modules when integrated with them.
The contributions of this work can be summarized as follows:
• We propose a novel attention module that resolves a unique problem that other attention modules do not handle: the inter-layer relationship.
• We demonstrate the performance-boosting effects and generality of the PI module through extensive experiments on multiple benchmarks.
• We show the scalability of the PI module by introducing a structural derivative, 'PI-Intra,' demonstrating the compatibility of the proposed concepts not only with CNNs but also with existing attention modules.
• We provide various visualizations to aid the understanding of the PI module.

II. RELATED WORKS
End-to-end Trainable Attention Module. The emergence of the transformer architecture [1] pushed the computer vision literature to widely incorporate the attention mechanism. Consequently, vision-oriented transformer-based approaches including ViT [2] and AANet [3] were proposed, leading to a transition from traditional CNNs to transformer-based architectures. While the complete transition of the architecture significantly increased performance on visual tasks, ViT and its family require larger datasets to train properly compared to CNNs, since they lack inductive bias. A few approaches [7], [8] introduce attention-based modules to support the main network, but the lightness of the module was not a consideration there. We opt to design a simple, lightweight attention-based module that fosters the power of the network. An attention mechanism can be interpreted as a methodology by which to bias the allocation of available neurons to the most informative components of input signals [7]. One recent application of attention mechanisms in the area of computer vision is the integration of end-to-end trainable attention modules into deep CNNs. Attention modules can be divided into two categories: spatial attention modules and channel-wise attention modules. Spatial attention modules learn 2-D spatial masks on features for the purpose of regulating the activations of neurons [16], [17], while channel-wise attention modules learn 1-D channel-wise attention vectors and then utilize them to refine features [7], [10]. Furthermore, several architectural designs that inherit the virtues of both spatial and channel-wise attention have also appeared [8]-[10], [18], [19].
However, existing spatial and channel-wise attention modules, referred to as intra-layer attention modules, only handle inter-dependencies inside a single layer. In contrast, our module aims to model relationships between layers.
RNNs in Computer Vision. In the field of computer vision, Recurrent Neural Networks (RNNs) are used either to (1) model tasks that require sequential decisions or to (2) model non-sequential tasks (e.g., single-image classification) as sequential procedures. Numerous vision studies, including visual question answering [20], [21], image captioning [22], multi-object classification [23], [24], video description [25], and text recognition [26], [27], belong to the first type. They share common ground in how they utilize RNNs: they extract visual features from images using a pre-trained CNN and then utilize an RNN to model the sequential procedure of each task. Among them, the text recognition applications [26], [27] utilize a bidirectional RNN to capture text dependencies in both directions. Studies involving the second type include those that model spatial relationships in features [28] and those that exploit the RNN itself as a methodology by which to understand images [29], [30].
The PI module is similar to previous works [29], [30] in that RNN units carry visual signals. However, our module does not make task predictions directly but rather aids the representation learning of CNNs by acting as an informative attention source. In terms of RNNs, what most distinguishes the PI module from other studies is that it recurs over the depth dimension of networks, not the spatial dimensions [28], [31]. Thereby, it models inter-layer relationships while leaving the parallelized implementation of CNNs unaltered.

III. METHODOLOGY
The PI module is a powerful but lightweight unit that models inter-layer relationships, incurring a negligible increase in the model parameters and computational cost. Figure 1 illustrates the concept of the PI module. When applied to CNNs, it memorizes the feature hierarchy of earlier layers and regulates the activation of the currently processing layer in the form of self-attention.
We argue that the PI module can be a new practical choice for network engineering in that it (1) exhibits high extendibility, adapting to generic architectures including CNNs, GCNs, and even other attention modules, and (2) aids the main network in both the forward and backward pass. Given its ease of use, as the PI module is cost-efficient and provides an orthogonal improvement to existing attention modules, it becomes a more favorable option in practice.

FIGURE 1: A concept figure of the network with the PI module. As the main network propagates, the recurrent network summarizes the visual statistics captured by convolution (f_l), updates an internal memory unit (r_l), and regulates f_l with a layer importance value inferred from r_l, layer by layer.
The PI module is an integration of two main concepts: the Parallel Recurrent Module and the inter-layer attention mechanism. Below, we explain each component and its benefits and then describe the workflow of the PI module.

A. PARALLEL RECURRENT MODULE
The Parallel Recurrent Module (PRM) is a small separate memory stream, propagated along the depth direction of the main network, that carries information from earlier layers.
Here, we compare the attention-inferring operation of recent self-attention modules in vision with that of the PRM:

α_i = h(f(x_i)),  (1)
r_i = g(r_{i-1}, f(x_i)),  α_i = h(r_i),  (2)

where Eq. (1) and Eq. (2) correspond to existing self-attention modules and the PRM, respectively. Here, x and r denote convolution features and RNN memory units, f(·) is a convolution, h(·) is an arbitrary self-attention operation, g(·, ·) is the RNN update function, and i indexes layers. As implied by Eq. (2), the PRM is an extension of self-attention that employs internal memory units to explicitly use the previous features x_1, ..., x_{i−1}. According to this concept, by recurrently memorizing co-propagating CNN features, we can extend the source of self-attention to the inter-layer level. The PRM can be used with any attention function h, but in this paper, we focus on using the proposed inter-layer attention mechanism. In the following paragraphs, we describe how the PRM assists the feature learning of CNNs.
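The contrast between Eq. (1) and Eq. (2) can be sketched in a few lines of PyTorch. The shapes, the GRU cell standing in for g(·, ·), and the linear-plus-sigmoid heads standing in for h(·) are illustrative assumptions, not the paper's exact design:

```python
import torch

# Toy contrast between intra-layer self-attention (Eq. 1) and the PRM (Eq. 2).
feat = torch.randn(8, 64)           # f(x): pooled convolution feature for one layer
mem = torch.zeros(8, 32)            # r: recurrent memory carried across layers

# Eq. (1): attention inferred from the current feature alone.
h = torch.nn.Sequential(torch.nn.Linear(64, 1), torch.nn.Sigmoid())
alpha_intra = h(feat)               # depends only on x_i

# Eq. (2): the memory first absorbs the current feature, then attention is
# inferred from the memory, which implicitly encodes x_1, ..., x_{i-1}.
g = torch.nn.GRUCell(64, 32)        # stand-in for the RNN update function g(., .)
mem = g(feat, mem)
h2 = torch.nn.Sequential(torch.nn.Linear(32, 1), torch.nn.Sigmoid())
alpha_inter = h2(mem)               # depends on the whole layer history
```

The only structural change from Eq. (1) to Eq. (2) is the memory update inserted before the attention head.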
Memorizing Feature Hierarchy across Layers. CNNs model long-range interactions through the progressive propagation of visual signals via iterative local feature aggregation.
Focusing on the feed-forward structure of CNNs, we hypothesize that the spatially pooled features obtained through this iterative process can be viewed as a sequence of visual features ranging from small receptive fields to global ones. To this end, we are motivated to encode that sequence in an RNN following the propagation of the CNN.
As RNNs carry an inductive bias for locality in the sequence via their Markovian structure [32], the PRM encodes the causalities of layers. For example, the dependency of a zebra's torso captured in the third layer of the CNN on the zebra's stripes captured in the first layer can be embedded in the PRM. This is enabled by the fact that the PRM recurs over the depth direction of the main network. Moreover, it is important to note that the PRM does not degrade the parallelizability of CNNs; it differs from previous works [28], [31] that encode spatial information by recurring over the spatial dimensions of the feature volume using an RNN.
Providing a Light-weight Information Highway. In practice, memorization in the PRM is effectively strengthened by the use of an LSTM, as this enables the regulation of the information of visual features captured at earlier convolutional layers using the forget gate. In this sense, the concept of gating the information flow with an LSTM is reminiscent of the Highway Network (HN) [33], which models the transformation between input and output by adopting non-linear transformations denoted the 'transform gate' and 'carry gate'. However, the PI module is a more practical choice, deviating from the HN in two ways: layer-wise connections and a computational advantage.
First, the PRM operates as an extra path for information flows. During a forward pass, the PRM provides crucial inductive bias, whereby the attention values (transform gate in an HN) are inferred using the memory units. Interestingly, the extra path provided by the PRM is also effective during a backward pass, ameliorating the vanishing gradient problem. This is the characteristic that distinguishes the PRM from the HN and also from existing intra-layer attention modules.
Second, the proposed PRM is more efficient than Highway Networks, which suffer from the computational costs associated with what is termed 'expressive gating.' This refers to the use of identically sized convolution filters for the transform gate and the plain layer so as to successfully train Highway Networks. While larger kernels for the transform gate result in sufficiently expressive transformations, increasing the kernel size increases the computational cost. In contrast, the PRM does not require large and expensive filters for gating, as it operates on spatially pooled 1-D features. In general, the use of 1-D globally pooled features and weight sharing makes the PRM significantly lighter in terms of computational complexity. Please refer to Table 1 in Section IV and Appendices D, E, and F. Furthermore, the weight-sharing scheme can also be viewed as a type of regularization that supports generalization.
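A back-of-the-envelope parameter count illustrates the gap; the channel count, kernel size, memory size M, and reduction ratio r below are assumed values for illustration only, not the paper's configuration:

```python
# Rough parameter-count comparison between a convolutional transform gate
# (Highway-style "expressive gating") and small FC gating on 1-D pooled features.
C, k = 256, 3                        # channels and kernel size of the gated layer
conv_gate = C * C * k * k            # a 3x3 conv gate matching the plain layer

M, r = 128, 4                        # memory size and reduction ratio (assumed)
fc_gate = M * (M // r) + (M // r)    # W1: (M/r) x M, then W2: 1 x (M/r)

ratio = conv_gate / fc_gate          # how many times heavier the conv gate is
```

With these numbers the convolutional gate is over a hundred times heavier, before even counting the fact that the FC gate can be shared across layers.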
In short, the PRM is a new type of learned information highway. Importantly, because the PRM is an architectural unit that only requires a multi-layer structure to be applied, it is highly extendable to arbitrary networks.

B. INTER-LAYER ATTENTION MECHANISM
The inter-layer attention mechanism serves to regulate features by multiplying the output of a convolutional transformation with the correspondingly inferred layer importance value (a scalar). By calibrating the network using a layer-wise importance distribution, the inter-layer attention mechanism explicitly models the inter-dependencies between neurons of different layers. In the PI module, we apply the inter-layer attention mechanism between the feature from the CNN and the layer attention value inferred from the internal memory units. By introducing the inter-layer attention mechanism, we empower the CNN to regulate layer activations selectively, emphasizing layers that catch salient visual information and suppressing layers that acquire redundant visual information. This differs from intra-layer attention mechanisms [7], [8], which calibrate a particular layer using an inferred spatial-channel importance distribution. Figure 2 (Top) depicts the process of the inter-layer attention mechanism along the feed-forward structure of the CNN.
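Mechanically, the calibration is a single scalar broadcast per layer; a minimal sketch with assumed shapes (the attention values here are hand-picked, not inferred):

```python
import torch

# One conv feature f_l per sample, shape (batch, channels, height, width).
f = torch.randn(2, 64, 8, 8)

# Inferred layer importance alpha_l in [0, 1], one scalar per sample.
alpha = torch.tensor([[0.9], [0.2]])

# Calibrate the entire layer activation with a single scalar (broadcast multiply).
f_star = f * alpha.view(2, 1, 1, 1)
```

Because the importance value is a scalar per layer (not a spatial-channel map), the extra multiply costs essentially nothing.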

C. WORKFLOW
The PI module recurs over L processing steps, connected immediately after arbitrary convolutional transformations. Here, the value of L depends on how the PRM connects to the CNN. A graphical representation of the PI module is illustrated in Figure 2 (Top), and the following equations represent the workflow:

f_l = Conv(f*_{l−1}),  (3)
s_l = φ_l(f_l),  (4)
r_l = g(r_{l−1}, s_l),  (5)
α_l = ϕ_l(r_l),  (6)
f*_l = α_l * f_l.  (7)

At each time step l, the PI module obtains an RNN input s_l ∈ R^N by passing a feature conveyed from the CNN, f_l ∈ R^{H_l×W_l×C_l}, to a summarizing function φ_l, and then updates the memory units r_l ∈ R^M using the previous r_{l−1} and s_l. From the internal memory r_l, the layer attention value α_l ∈ [0, 1] is inferred using an inter-layer inferring function ϕ_l, after which α_l serves as a scalar coefficient for calibrating the feature f_l with a Hadamard product *. Lastly, the scaled feature f*_l is converted by the subsequent convolutional transformation Conv and fed back to the RNN's next time step.
The PI module maintains this recurrence until CNN propagation ends, updating the RNN internal memory by utilizing the summarized layer statistic as the RNN input at each time step (convolutional transformation) and inferring the inter-layer attention value from the updated RNN internal memory. For model learning, the parameters of the PI module are jointly optimized with the CNN parameters. Detailed designs of the functions in Eqs. (4) and (6) are discussed below.
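The workflow above can be sketched as a PyTorch module. The memory size, fixed RNN input size, reduction ratio, and the use of adaptive max-pooling for channel resizing are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIModule(nn.Module):
    """Sketch of the PI module workflow: summarize each conv feature,
    update a shared LSTM memory, infer a scalar layer attention value,
    and rescale the feature (hyper-parameters are assumed)."""

    def __init__(self, num_layers, n_in=64, mem=128, r=4):
        super().__init__()
        self.n_in = n_in
        self.lstm = nn.LSTMCell(n_in, mem)          # shared RNN update g(., .)
        # A separate inferring function per layer, as sharing it degrades accuracy.
        self.infer = nn.ModuleList(
            nn.Sequential(nn.Linear(mem, mem // r), nn.ReLU(),
                          nn.Linear(mem // r, 1), nn.Sigmoid())
            for _ in range(num_layers))

    def summarize(self, f):
        # GAP over H x W, then resize channels to the fixed RNN input size.
        s = F.adaptive_avg_pool2d(f, 1).flatten(1)                    # (B, C_l)
        return F.adaptive_max_pool1d(s.unsqueeze(1), self.n_in).squeeze(1)

    def forward(self, feats):
        """feats: list of L conv features f_l, each (B, C_l, H_l, W_l)."""
        B = feats[0].size(0)
        h = feats[0].new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        out = []
        for l, f in enumerate(feats):
            s = self.summarize(f)                    # s_l
            h, c = self.lstm(s, (h, c))              # r_l from r_{l-1} and s_l
            alpha = self.infer[l](h)                 # alpha_l in [0, 1]
            out.append(f * alpha.view(B, 1, 1, 1))   # f*_l = alpha_l * f_l
        return out
```

In a real network the rescaled feature of step l would feed the next convolution, so this loop would be interleaved with the backbone rather than run on a precomputed feature list.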
Feature Summarization Function. φ_l refers to the combination of a feature-summarizing operation and a subsequent down-sampling (or up-sampling) operation. A parametric study led to the decision to adopt the global average pooling (GAP) operation; please refer to Appendix C. After feature summarization, the channel size must be adaptively adjusted, as RNNs require a fixed input size. To acquire fixed-size inputs, we opt for max-pooling or interpolation, as these include no learnable parameters.
Inter-layer Attention Inferring Function. ϕ_l consists of two fully-connected layers with weights W¹_l ∈ R^{(M/r)×M} and W²_l ∈ R^{1×(M/r)}, where δ and σ denote the ReLU [34] and sigmoid operations, respectively, and the reduction ratio r is the compression ratio of the first fully-connected layer. The inter-layer attention inferring function can be viewed as an arbitrary function of s_l and r_{l−1}, and we use a multi-layer perceptron with two hidden layers to approximate it. Note that we found in early experiments that sharing the inter-layer attention inferring function across layers causes a slight degradation in performance; consequently, we assign a separate function to each layer. This incurs only a minor increase in the number of parameters.

In the following, we extend the PI module to architectural units beyond plain CNNs to demonstrate that it is a practical choice for general application.
PI Module with Residual Architectures. It is intuitive to connect the PI module to non-residual architectures. However, when applying the PI module to residual architectures, the following must be clarified: the locations of the connections between the PI module and the CNN, and the logic behind the selection of those locations. Among the possible candidates, we opt to link the PI module to the residuals, which is the same strategy that intra-layer attention modules [7], [8] adopt. The inter-layer attention mechanism combined with residual architectures solves the preceding problem in an elegant way: it adjusts the residuals by the inferred layer importance.
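A sketch of the residual attachment, assuming a generic two-convolution block; `attention_fn` is a placeholder interface for the PI module's α_l inference, not an API from the paper:

```python
import torch
import torch.nn as nn

class PIResidualBlock(nn.Module):
    """Sketch: the inferred layer attention scales only the residual branch,
    mirroring how intra-layer modules such as SE attach to residual blocks."""

    def __init__(self, channels, attention_fn):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.attention_fn = attention_fn   # returns alpha in [0, 1], shape (B, 1)

    def forward(self, x):
        res = self.body(x)                           # residual branch f_l
        alpha = self.attention_fn(res)               # inferred layer importance
        return x + res * alpha.view(-1, 1, 1, 1)     # identity + scaled residual
```

Note that the identity path is untouched: with α_l = 0 the block reduces to the identity, which is what makes scaling the residual (rather than the block output) the natural attachment point.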
PI Module with Graph Convolution Networks. We broaden our scope in demonstrating the generality of the PI module, going beyond architectures defined on regular 2-D grids to graph convolution networks in the irregular domain. Note that several intra-layer attention modules that handle complex spatial statistics [31], [35] do not apply to irregular point cloud or graph domains because of their permutation variance.
PI Module with Intra-layer Attention Modules. The lightness of the PI module leads us to integrate it with intra-layer attention modules as well as with backbone CNNs; we thus link the PI module after the features calibrated by the intra-layer attention mechanism. Figure 2 (Bottom) illustrates the graphical representation of PI-Intra.

IV. EXPERIMENTS
In this section, to demonstrate the effectiveness and high extendibility of the PI module, we report extensive experimental results on various tasks and datasets, including CIFAR-10/100 and ImageNet-1K for image classification, Microsoft COCO for object detection, Pascal VOC for image segmentation, and S3DIS for point cloud segmentation. For a thorough evaluation of the PI module, we also reproduce the experiments on the backbone networks and on those with SE [7] and CBAM [8], except for a few asterisked instances borrowed from previous papers. To ensure a fair comparison, we apply the same optimization and augmentation scheme to all compared model instances; our implementations mostly follow the official GitHub repositories of previous studies [8], [36], [37], and all details are available in the Supplementary Material.
We validate the PI module on the image classification task using the ImageNet-1K 2012 dataset [12]. To verify the validity of our designs, extensive experiments are performed under a variety of conditions: architectures, datasets, and depths.
Model Comparison. To demonstrate the extent of the additional performance gains and the computational cost of the PI and PI-Intra modules compared to recent attention modules, Table 1 provides a comparison with the outcomes of several recent attention studies. Here, for PI-Intra, an SE block is adopted, denoted 'PI-SE' in the table.
Performance Gains. The PI module significantly improves the performance of backbone architectures, showing performance similar to the SE block. Furthermore, notably, PI-SE-ResNet50 demonstrates better performance than the PI module and SE block alone, showing an additional reduction of 0.41% in the top-1 error compared to SE-ResNet50. Considering that GE [10] and α²-Net [9] present marginal improvements over the SE block with more parameters (+5.6M and +7.2M, respectively; cf. PI: +0.2M, PI-SE: +2.8M), this margin is quite significant. The successful integration of 'inter-intra attention' provides evidence that the PI module operates orthogonally to other intra-layer attention modules.
Computational Advantages. Importantly, the PI module incurs only negligible computational overhead. In practice, except for SE, recent attention modules sacrifice their lightness to ensure better performance. In terms of inference time, most attention networks, except for PI and SE, are 2-3× slower than the original ResNet. Additionally, the PI module adds only a ≈1% increase in model parameters and a negligible increase in GFLOPs, distinctly surpassing the others on each metric.
On the other hand, PI-SE surpasses GE and α²-Net on all metrics and provides performance gains very similar to CBAM while being nearly 2× faster.
In short, this comparison supports our contention that the PI module uses its parameters efficiently and exhibits high extendibility to a wide range of applications due to its lightness.
ResNet and MobileNet Backbone. Tables 2 and 5 summarize our investigation of PI and PI-Intra for ResNet50 and MobileNet [40]. We observe that PI networks record similar performance with smaller computational costs. Also, in this case, the integration of both, PI-Intra, provides extra performance gains.

Network Depth Analysis. We conduct experiments to verify that the PI module improves performance robustly across network depths. Furthermore, we also evaluate on CIFAR to verify the generality of the PI module on small-scale datasets and architectures. Tables 3 and 4 summarize the results.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
Notably, for the CIFAR experiments, we find that PI networks distinctly surpass ResNets in performance, and most of them also outperform SE networks; PI-ResNet44 outperforms ResNet110 with approximately one-third of the model parameters. For the ImageNet-1K experiments, we verify that the PI module improves backbones to a level similar to SE networks while being lighter.

A. IMAGE SEMANTIC SEGMENTATION
To verify the generality of the PI module, we conduct image semantic segmentation experiments using the Pascal VOC 2012 [14] dataset. For the backbone networks, we use (PI-)ResNet50 with an output stride of 16. Table 6 shows that the PI module significantly boosts the performance of the ResNet50-16s instance (+7.7% mIoU).

B. POINT CLOUD SEGMENTATION
To demonstrate the extendibility of the PI module to GCNs and the point cloud domain, we conduct segmentation experiments using the S3DIS [15] dataset. For the backbone networks, we opt for DeepGCNs [36]. To extend the PI module to point cloud networks, we replace the global average pooling over the spatial dimensions with pooling over the point dimension. Table 7 shows that the PI module effectively boosts the performance of ResGCN instances: +1.7% mIoU for PI-ResGCN-7 and +0.9% mIoU for PI-ResGCN-28. Figure 3 presents qualitative results.
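Sketching the swap, assuming point features laid out as (batch, channels, points); the layout and sizes are assumptions for illustration:

```python
import torch

# On regular grids, phi_l averages over H x W; for point clouds the same
# summary is obtained by averaging over the point dimension instead.
f = torch.randn(2, 64, 1024)       # per-point GCN features: (B, C, N points)
s = f.mean(dim=2)                  # summarize over points -> (B, C)
```

Because mean pooling is permutation-invariant over points, the summary is well-defined on unordered point sets, which is exactly the property that spatial-mask modules lack.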

A. DISENTANGLING THE PI MODULE
To investigate the PI module fully, we disentangle the impacts of the PRM and the inter-layer attention mechanism. Furthermore, we analyze how the co-propagating recurrent network affects the feature learning of CNNs.
Ablating the PRM from the PI Module. For a better understanding of the PI module, we isolate the impact of the PRM by ablating it. We replace the RNN update function (Eq. (5)) with a linear projection followed by a tanh non-linearity (i.e., no inter-layer connection). Table 8 indicates that performance is degraded by a distinguishable margin after removing the PRM from the PI module. This finding supports the validity of using recurrent networks for the PRM.
PI Module in the Backward Pass. We claim that the LSTM adopted for the PI module eases the feature learning of CNNs by providing an extra information highway through which gradients can be directly backpropagated. To investigate the extent to which the PRM ameliorates vanishing gradients, we compare the respective gradients of ResNet50 (R50) and PI-ResNet50 (PI50) during training. Figure 4 describes the behavior of the mean absolute gradient value at the input of each network layer. At the beginning of training (Epoch 0), we observe that PI50 does not yet preserve the gradient flow compared to R50. This is expected, since we deliberately relieve the PI module of the need for a careful parameter initialization scheme, a known issue with Highway Networks [42]. Notably, we do not exploit any gradient-preserving strategies such as a negative bias initialization for the LSTM forget gate or the initialization of the inter-layer attention value to 1.
As the training progresses (Epoch 1), the mean absolute gradients of PI50 become larger than those of R50 at most of the layers, thereby inducing faster learning [43]. More interestingly, as training continues to progress (Epochs 2 and 100), we find that the PI module distributes the gradients of the last layer, leaking them from the backbone CNN to the module and providing them to earlier layers. In this sense, the PI module demonstrably aids the feature learning of the CNN during the backward pass.

Figure 5 presents the mean layer attention curves across the layers of PI networks. We find that investigating the layer attention curves enables a deeper understanding of residual architectures, as described below.
An Alternative for Interpreting Residual Architectures. The layer attention values are learned such that they are high in the early layers and relatively low in the intermediate layers. We claim that this observation offers another means of gaining a deeper understanding of residual architectures. The intuition behind residual learning is to let the layers learn perturbations around an identity function; prior work supports this intuition by displaying the standard deviation of batch normalization layer outputs, contending that a smaller standard deviation represents a lower 'response strength' of the residual function.
Compared to this method, we note that our plots provide similar but more straightforward information. Given the observed decrease in layer attention values from the early layers to the intermediate layers, we can support the intuition of residual learning: the deeper layers learn less important perturbations. In short, by allocating available neurons to more informative components, in accordance with the essence of the attention mechanism, the PI module supports recent investigations that interpret residual architectures.
Increasing Variance of Layer Attention Values with Depth. The variance of the layer attention values gradually increases along the depth direction of the network until the last residual block is reached, and the layers in the last block have the largest layer attention values, close to 1. Following the generic interpretation of CNNs, i.e., that earlier and deeper layers learn simple and complex features, respectively [44], it is not surprising that PI networks allocate large layer attention values to the last layers of CNNs, which are semantically salient, and that the layer attention values vary drastically in deep layers that contain abstract visual information.

C. GRAD-CAM VISUALIZATION
We hypothesized that PI networks learn better representations by considering repeated appearances of visual information. To examine this hypothesis and to understand semantically how PI networks allocate neurons, we apply Grad-CAM [45] to intermediate layers (the output of each residual block). Figure 6 depicts the Grad-CAM visualization results on the intermediate layers and the corresponding normalized layer attention curves. Interpreting the attention curves, we can gain several insights into how the parameters of the PI module are jointly learned with the CNN parameters.
Raising the activation of class-specific layers. PI networks learn to enhance the layer attention values of layers that capture class-specific visual features for each target class. The layers marked with a red border indicate those that achieve the highest layer attention value. For example, the first layer, which captures simple patterns or small parts of the racer class, achieves the highest value, and the fifth layer, focusing on catching the head of the ostrich class, also obtains the highest value. According to these observations, we find that PI networks emphasize the layers that catch class-specific visual information for each target class.
Learning class-agnostic features in deeper layers. Deep layers of PI networks tend to learn class-agnostic features. The layers marked with a yellow border show a typical example of focusing less on class-specific visual information in images. Taking the ostrich class as an example, the second-to-last layer of PI networks allocates neurons to learn class-agnostic visual information, such as the background. Compared to the ResNet, which allocates class-specific neurons up through the deep layers, these observations are quite striking. Our findings are also convincing because they agree well with a recent analysis of the class selectivity [46] of Gather-Excite networks [10]: intra-layer attention networks have lower class selectivity in deep layers compared to the backbone network. Please refer to Appendix J for the same analysis on PI-ResNet.

VI. CONCLUSION
In this work, we propose a novel attention module that is lightweight and highly compatible with general CNNs, graph CNNs, and even other existing attention modules. The PI module sheds light on self-attention studies in vision by focusing on inter-layer relationships, an architectural aspect unexamined by other studies. We demonstrate the effectiveness and generality of the PI module through extensive analysis, experiments, and visualization. We expect that our perspective of modeling inter-layer relationships will be helpful to a wide range of computer vision studies.