1. Introduction
The local receptive fields (RFs) of neurons in the primary visual cortex (V1) of cats [14] have inspired the construction of Convolutional Neural Networks (CNNs) [26] in the last century, and it continues to inspire mordern CNN structure construction. For instance, it is well-known that in the visual cortex, the RF sizes of neurons in the same area (e.g., V1 region) are different, which enables the neurons to collect multi-scale spatial information in the same processing stage. This mechanism has been widely adopted in recent Convolutional Neural Networks (CNNs). A typical example is InceptionNets [42], [15], [43], [41], in which a simple concatenation is designed to aggregate multi-scale information from, e.g., 3×3, 5×5, 7×7 convolutional kernels inside the "inception" building block.