Super Neurons

Self-Organized Operational Neural Networks (Self-ONNs) have recently been proposed as new-generation neural network models with nonlinear learning units, i.e., the generative neurons that yield an elegant level of diversity; however, like their predecessors, conventional Convolutional Neural Networks (CNNs), they still share a common drawback: localized (fixed) kernel operations. This severely limits the receptive field and the information flow between layers and thus brings the necessity for deep and complex models. It is highly desirable to improve the receptive field size without increasing the kernel dimensions. This requires a significant upgrade over the generative neurons to achieve "non-localized kernel operations" for each connection between consecutive layers. In this article, we present superior (generative) neuron models (or super neurons in short) that allow random or learnable kernel shifts and thus can increase the receptive field size of each connection. The kernel localization process differs between the two super-neuron models. The first model assumes randomly localized kernels within a range, and the second one learns (optimizes) the kernel locations during training. An extensive set of comparative evaluations against conventional and deformable convolutional neurons, along with the generative neurons, demonstrates that super neurons can empower Self-ONNs to achieve superior learning and generalization capability with a minimal computational complexity burden. The PyTorch implementation of Self-ONNs with super neurons is publicly shared.


I. INTRODUCTION
Generalized Operational Perceptrons (GOPs) [1]-[5] have been proposed as an advanced model of biological neurons with varying nonlinear synaptic connections. Thanks to such a diverse neuron model, GOPs have achieved a superior learning capability on many challenging problems, surpassing conventional Multi-Layer Perceptrons (MLPs) and even Extreme Learning Machines (ELMs) by a significant performance gap [1]-[5]. Following the GOPs' main philosophy, Operational Neural Networks (ONNs) [6]-[9] have significantly outperformed CNNs and achieved a notable learning performance even on those problems where CNNs entirely fail. Yet, ONNs have the following limitations: 1) strict dependence on the operators in the operator set library, and 2) the need for a prior search for the best operator set for each layer/neuron, which can be highly time-consuming. Self-organized ONNs (Self-ONNs) [10]-[21] have recently been proposed to address these drawbacks with the generative neuron model, which can optimize the nodal operator of each kernel element. Such a capability indeed yields an ultimate neuron heterogeneity that is far superior to what conventional ONNs can offer. Generative neurons can, therefore, replace the traditional "weight optimization" of convolutional neurons with a "nodal function optimization" process. However, their kernels are still "localized" or static; hence, each neuron's receptive field size is determined by its kernel size, and this severely limits the amount of information acquired from the previous layer. Obviously, using a larger kernel size may be a solution; however, it not only creates an increasing complexity issue, it is also not feasible to determine the optimal kernel size for each connection of the neuron. The aim, therefore, should be to improve the receptive field of each kernel connection by allowing each kernel location to vary while keeping the kernel size the same. Moreover, it would be even more beneficial to "learn", i.e., optimize, each kernel location for each connection to the feature maps in the previous layer.
The most prominent approach proposed so far to improve the receptive field size is deformable CNNs [22], [23]. However, the improvements over regular convolutions were limited or simply none, because the kernels of each layer have to be deformed in the same way. Therefore, deformable convolution is rather a "relocation" operation over each kernel element than a true enlargement of the receptive field. Furthermore, deformable convolutions increase the network complexity (number of parameters) and, especially, the memory overhead significantly. This is why deformable convolutions are usually used in only one or a few layers of a (deep) CNN.
To address the aforementioned limitations and drawbacks, the novel and significant contributions of this study can be summarized as follows:
• To accomplish the aim of improving the receptive field size with varying kernel locations, and even optimizing each kernel location, this study proposes a superior generative neuron model (super neurons in short) with non-localized kernel operations for Self-ONNs.
• This study proposes two super neuron models, each of which has a different kernel localization process: i) random localization within a bias range set for each layer, and ii) BP-optimized locations of each kernel.
• Particularly in the latter model, "what" operator should be used and "where" it should be located are simultaneously optimized during BP training. This can be more advantageous for particular problems where certain optimal kernel locations may exist, or where some kernel location topology (or distribution) may be more desirable.
• This study reveals the pros and cons of both super neuron models when compared against the generative and convolutional neurons over several challenging problems.
• This study presents a "Proof-of-Concept" experiment where a single (hidden) super neuron suffices to learn and regress any shifted image from its original. Such a regression can otherwise be performed only by deep CNN models with a high number of neurons.
• Finally, an extensive set of experiments reveals that Self-ONNs with super neurons can outperform equivalent or significantly deeper CNNs in many challenging problems.
The rest of the paper is organized as follows: Section II briefly presents Self-ONNs with generative neurons, while the details of BP training are presented in Appendix A. Section III presents the two super neuron models with non-localized kernel operations in detail and formulates the forward-propagation (FP) and back-propagation (BP). Comparative evaluations among Self-ONNs with generative and super neurons and CNNs over challenging problems are presented in both Section IV and Appendix C. The computational complexity analysis of these networks for both FP and BP is also presented in Section IV. Finally, Section V concludes the paper and suggests topics for future research.

II. SELF-ORGANIZED OPERATIONAL NEURAL NETWORKS
In biological neurons, during the learning process, the neurochemical characteristics and connection strengths of the synaptic connections are altered, giving rise to new connections and modifying the existing ones. Inspired by this, the generative neuron model of Self-ONNs is formed, where each kernel can have a distinct nonlinear nodal operator that is generated (optimized) during training without any restrictions. As a result, each kernel element of each generative neuron can "customize" its nodal operator to maximize the learning performance. To exemplify this, the nodal operators of the 3x3 kernels of the convolutional (CNN), operational (ONN), and generative (Self-ONN) neurons are illustrated in Figure 1. Both convolutional and operational neurons have static (fixed) nodal operators (linear and harmonic, respectively), while the generative neuron can have an arbitrary nodal function, Ψ, (possibly including standard functions such as linear and harmonic) for each kernel element of each connection. As illustrated in Figure 1 (middle), for conventional ONNs the input map of the i-th neuron at layer l+1, x_i^{l+1}, is composed as follows:

$$x_i^{l+1} = b_i^{l+1} + \sum_{k=1}^{N_l} \operatorname{oper2D}\!\left(w_{ik}^{l+1},\, y_k^{l},\, \text{'NoZeroPad'}\right)$$

$$x_i^{l+1}(m,n)\Big|_{(0,0)}^{(M-1,N-1)} = b_i^{l+1} + \sum_{k=1}^{N_l} P_i^{l+1}\!\Big[\Psi_i^{l+1}\big(y_k^{l}(m,n),\, w_{ik}^{l+1}(0,0)\big),\, \ldots,\, \Psi_i^{l+1}\big(y_k^{l}(m+r,n+t),\, w_{ik}^{l+1}(r,t)\big),\, \ldots\Big] \qquad (1)$$

where y_k^{l} are the final output maps of the previous-layer neurons operated with the corresponding kernels, w_ik^{l+1}, through a particular nodal function, Ψ_i^{l+1}, such as linear (multiplication), sinusoid, exponential, Gaussian, chirp, Hermitian, etc. A close look at Eq. (1) reveals that when the pool operator is a summation, i.e., P_i^{l+1} = Σ, and the nodal operator is the linear function, Ψ_i^{l+1}(y_k^{l}(m+r,n+t), w_ik^{l+1}(r,t)) = y_k^{l}(m+r,n+t) × w_ik^{l+1}(r,t), for all neurons, then the resulting homogeneous ONN is identical to a CNN. Hence, ONNs are a superset of CNNs just as GOPs are a superset of Multi-Layer Perceptrons (MLPs). Self-ONNs differ from ONNs mainly in that the nodal operator of each kernel element is not drawn from a fixed operator set; instead, it is approximated by a Q-th order Maclaurin polynomial whose coefficients form the kernel, i.e.,

$$\Psi_{ik}^{l+1}\big(y_k^{l}(m+r,n+t),\, w_{ik}^{l+1}(r,t)\big) = \sum_{q=1}^{Q} w_{ik}^{l+1}(r,t,q)\, \big(y_k^{l}(m+r,n+t)\big)^{q} \qquad (2)$$

where w_ik^{l+1}(r,t,q) is the q-th coefficient of the Q-th order polynomial. During BP training, each w_ik^{l+1}(r,t,q) is optimized for the learning problem at hand. Thanks to this ability, there is no need for an operator search in Self-ONNs, and arbitrary nodal operators can be customized by the training process, as illustrated in Figure 1 (right). This results in enhanced flexibility and diversity over an ONN neuron, where a single standard nodal operator function has to be used for all kernels, each connected to an output map of a neuron in the previous layer.
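As a compact summary (a hedged restatement of Eqs. (1)-(2) in the notation of Table 1, not an additional result of the paper), when the pool operator is the summation, P_i^{l+1} = Σ, the input map of a generative neuron can be written as a sum of ordinary 2D convolutions over the element-wise powers of the previous-layer outputs:

$$x_i^{l+1}(m,n) = b_i^{l+1} + \sum_{k=1}^{N_l}\sum_{q=1}^{Q}\sum_{r,t} w_{ik}^{l+1}(r,t,q)\,\big(y_k^{l}(m+r,n+t)\big)^{q} = b_i^{l+1} + \sum_{k=1}^{N_l}\sum_{q=1}^{Q} \operatorname{conv2D}\!\big((y_k^{l})^{q},\, w_{ik}^{l+1}\langle q\rangle\big)(m,n)$$

where w_ik^{l+1}⟨q⟩ denotes the 2D kernel formed by the q-th coefficients of all kernel elements. This convolutional view is consistent with the equivalence illustrated later in Figure 10 and underlies the complexity analysis in Section IV.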
Table 1: Formula abbreviations and descriptions.

Q: The order of the Maclaurin polynomial.
Kx × Ky: The size of a kernel.
w_ik^{l+1}(r, t): Q-dimensional array of the kernel element (r, t) from the i-th neuron in layer l+1 to the k-th neuron in layer l.
w_ik^{l+1}(r, t, q): The q-th element of w_ik^{l+1}(r, t).
[α_k^i, β_k^i] ∈ ℤ[±Γ]: The integer bias pair in the x- and y-directions within the range limit, Γ, for the i-th neuron in the current layer connected to the k-th neuron in the previous layer.
(α_k^i, β_k^i) ∈ ℝ[±Γ]: The real-valued bias pair in the x- and y-directions within the range limit, Γ, for the i-th neuron in the current layer connected to the k-th neuron in the previous layer.

III. SUPER NEURONS WITH NON-LOCALIZED KERNEL OPERATIONS
The starting point of this study is the generative neuron model of Self-ONNs [10]. As in its predecessors, ONNs and CNNs, each kernel connection of a generative neuron to the previous-layer output maps is localized, i.e., for a pixel located at (m, n) in a neuron at the current layer, all kernels are located (centered) at the same location over the previous-layer output maps. Figure 2 (top) illustrates this with 3 × 3 kernels, where a pixel of the i-th neuron in layer l+1, x_i^{l+1}(m, n), is computed using the 9 pixels of the previous-layer output maps, y_k^{l}(m+r, n+t) for ∀r, t ∈ [-1, 1] and ∀k ∈ [1, N_l], operated with the kernels centered at the same location, (m, n), where N_l is the number of neurons in the previous layer, l. This creates an obvious limitation, since such a static kernel is blind to the neighboring pixels outside of the kernel boundaries, ∀r, t ∉ [-1, 1], which may have the potential to contribute to the input pixel and hence should not be excluded. This study provides two feasible solutions by proposing two super neuron models with non-localized kernel operations, as illustrated at the bottom of the figure. We define two additional parameters, (α_k^i, β_k^i), as the spatial bias, that is, the shift of the kernel from the pixel location, (m, n), in the x- and y-directions. The spatial bias is, therefore, defined for each kernel in both super-neuron models. In Figure 2, for the first model, the bottom-left illustration shows the shifted kernel locations for the i-th neuron input map at layer l+1 connected to the k-th output neuron at layer l, and the bias values are [α_k^i, β_k^i] ∈ ℤ[±Γ], where the maximum range is determined by the hyperparameter, Γ = 4 pixels. Therefore, all 3 × 3 kernels are randomly located within a bias range of [-4, 4], and thus all pixels within the region of 11x11 pixels can contribute. In the illustration, different colored kernels belong to different connections, and their corresponding bias values within the 11x11 region (the outer, red-dashed square) are randomly set in advance. For instance, the bias for the 1st connection (black) is α_1^i = 4, β_1^i = 3 pixels, whereas for the 3rd connection (red) it is α_3^i = 0, β_3^i = 0. Finally, for the second model, the illustration at the bottom-right shows the kernel locations shifted by the real-valued bias, (α_k^i, β_k^i) ∈ ℝ[±Γ]. For this super-neuron model, the bias is iteratively optimized during BP training along with the other network parameters. At the end of the training, the bias will converge to a (local) optimum point, so the bottom-right illustration only shows instantaneous localizations of the kernels at a particular BP iteration. To formulate a non-localized kernel for the i-th neuron in layer l+1, connected to the k-th neuron in layer l with biases α_k^i and β_k^i in the x- and y-directions, respectively, we instead shift the output feature map in the previous layer in the opposite direction. For this purpose, let ỹ_k^{l}(m, n) = y_k^{l}(m + α_k^i, n + β_k^i) denote the shifted output map, so that applying the localized formulation over ỹ_k^{l} yields the non-localized kernel operation expressed in Eq. (3). For the generative neurons of Self-ONNs, recall that Ψ is the composite nodal function, which is the Q-th order Maclaurin series. Over the 1D array of kernel elements, w_ik^{l+1}(r, t), Ψ is expressed in Eq. (4), where the DC bias term, w_ik^{l+1}(r, t, 0), is omitted. Therefore, each generative neuron has a 3D kernel matrix where the q-th coefficient of the kernel element (r, t) is represented by w_ik^{l+1}(r, t, q). In the next sub-sections, we formulate the forward-propagation (FP) for the two super-neuron models, each of which performs non-localized kernel operations; the former with random bias and the latter with learnable (real-valued) bias optimized through BP.
A. FP for Non-localized Kernel Operations by Random Bias
Eq. (5) defines the shifted output maps, ỹ_k^{l}, where l = 0 for the input layer. Then, using Eq. (3), each input map in the next hidden layer, x_i^{l+1}, ∀i ∈ [1, N_{l+1}], can be computed. Passing the input map first through the activation operator, f(), and then the pooling (if up- or down-sampling is performed in that layer), the output map, y_i^{l+1}, is created. Once again, using Eq. (5), the shifted output map is created, and the FP proceeds to the next layers. To accommodate such shifts, the boundaries of each output map are zero-padded by Γ zeros. To speed up both FP and BP, the q-th powers of the shifted outputs, (ỹ_k^{l})^q, can be computed only once (during FP) and stored to be used repeatedly during BP. On the other hand, except for the output maps of the super neurons in the output layer, there is no need to store the original outputs, y_k^{l}, along with their powers, since they are only temporarily needed for the non-localized kernel operations.
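A minimal sketch of the integer spatial-bias shift described above is given below. This is not the released FastONN implementation; the helper function and the 60x60 map size are illustrative, and the per-connection biases are drawn once in advance within ±Γ, as in the first super-neuron model.

import torch
import torch.nn.functional as F

def shift_map(y, alpha, beta):
    """Shift a feature map y (N, C, H, W) by an integer spatial bias
    (alpha, beta) with zero-padding, i.e. y_shifted(m, n) = y(m + alpha, n + beta)."""
    n, c, h, w = y.shape
    pad = max(abs(alpha), abs(beta), 1)
    y_pad = F.pad(y, (pad, pad, pad, pad))          # zero-pad all borders
    # crop back to (H, W) starting from the shifted origin
    return y_pad[:, :, pad + alpha: pad + alpha + h,
                       pad + beta:  pad + beta  + w]

# Example: random integer biases within +/- Gamma for one connection.
gamma = 4
torch.manual_seed(0)
y_prev = torch.randn(1, 1, 60, 60)                  # one previous-layer output map
alpha, beta = torch.randint(-gamma, gamma + 1, (2,)).tolist()
y_shifted = shift_map(y_prev, alpha, beta)          # feeds the 3x3 kernel of this connection
print(alpha, beta, y_shifted.shape)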

B. FP for Non-localized Kernel Operations by the BP-optimized Bias
The BP optimization of each pair of bias shifts in the x- and y-directions requires that (α_k^i, β_k^i) ∈ ℝ[±Γ], i.e., that the spatial bias is real-valued, and hence the shifted output map must be computed by interpolation between the integer grid locations. For simplicity and speed, bilinear interpolation is used, as expressed in Eq. (6). Hence, Eq. (3) can now be modified for FP with the shifted map location, ỹ_k^{l}(m+r, n+t), by the fractional bias, as in Eq. (7).
where a_k^i = α_k^i − ⌊α_k^i⌋ and b_k^i = β_k^i − ⌊β_k^i⌋ are the fractional parts of the bias. Note that Eq. (7) is identical to Eq. (1), the one for generative neurons, except that the shifted (interpolated) map, ỹ_k^{l}, is now used, which differs from the original output map over which the bilinear interpolation is performed. This is, in fact, equivalent to a low-pass filtering operation over the actual output maps. The BP formulations of the two super neuron models are covered in Appendices B and C, respectively.
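The fractional shift can be realized as in the following sketch, which bilinearly mixes the four surrounding integer shifts (an Eq. (6)-style interpolation, assuming zero-padding at the borders; the helper names are illustrative, not from the released code).

import math
import torch
import torch.nn.functional as F

def int_shift(y, dx, dy):
    """Integer shift with zero-padding: returns y(m + dx, n + dy)."""
    pad = max(abs(dx), abs(dy), 1)
    yp = F.pad(y, (pad, pad, pad, pad))
    return yp[:, :, pad + dx: pad + dx + y.shape[2],
                    pad + dy: pad + dy + y.shape[3]]

def shift_map_bilinear(y, alpha, beta):
    """Shift y (N, C, H, W) by a real-valued bias (alpha, beta) by bilinearly
    interpolating between the four surrounding integer shifts."""
    ax, ay = math.floor(alpha), math.floor(beta)
    fa, fb = alpha - ax, beta - ay                  # fractional parts in [0, 1)
    return ((1 - fa) * (1 - fb) * int_shift(y, ax,     ay)     +
            fa       * (1 - fb) * int_shift(y, ax + 1, ay)     +
            (1 - fa) * fb       * int_shift(y, ax,     ay + 1) +
            fa       * fb       * int_shift(y, ax + 1, ay + 1))

y = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
print(shift_map_bilinear(y, 1.25, -0.5)[0, 0])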

IV. RESULTS
In this section, we test Self-ONNs with super neurons against both deep and shallow CNN models over three challenging applications. In the next subsection, we perform real-world image denoising experiments over the SIDD Medium benchmark dataset [63] for comparative evaluations against the deep (17-layer) residual CNN (DnCNN) [2] and DnONN [1]. For the comparisons with shallow models, the two super neuron models proposed in this study are then evaluated against the generative neurons and the conventional and deformable [22], [23] convolutional neurons over the following challenging problems: 1) motion and spatial deblurring, and 2) face segmentation. Finally, to validate the super neurons' ability to learn the true shift, a "Proof-of-Concept" experiment using a Self-ONN with only one hidden super neuron is presented in Appendix D.

A. Real-World Denoising
In the real-world denoising experiments, we utilize the SIDD Medium training dataset [63], which consists of 320 high-resolution images. We use the same cropping strategy as adopted in [9] to extract 160k training patches. For testing, the SIDD validation dataset is used, which consists of 1280 noisy-clean image pairs. In all the experiments, the training-to-validation ratio is set to 9:1. For BP training, we use the ADAM optimizer with the maximum learning rate set to 10^-3. All the networks were trained for 100 epochs, and the model state that maximized the validation-set performance was chosen for evaluation. Model architectures were defined using the FastONN [8] and PyTorch [64] libraries. All experiments were performed on either an NVIDIA Tesla V100 or an NVIDIA TITAN RTX GPU.
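The training setup described above can be reproduced roughly as follows. This is a hedged sketch, not the released FastONN training script; the model, train_loader, and val_loader objects are placeholders, and validation PSNR maximization is approximated here by minimizing the validation MSE.

import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, epochs=100, device="cuda"):
    """Adam with a maximum learning rate of 1e-3; keep the model state that
    performs best on the validation set over 100 epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for noisy, clean in train_loader:
            noisy, clean = noisy.to(device), clean.to(device)
            loss = F.mse_loss(model(noisy), clean)
            opt.zero_grad(); loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(F.mse_loss(model(n.to(device)), c.to(device)).item()
                      for n, c in val_loader) / len(val_loader)
        if val < best_val:  # keep the best-performing model state
            best_val = val
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model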
The quantitative results for the real-world denoising problem in terms of average PSNR levels are presented in Table 2, and the visual results on the SIDD validation dataset are shown in Figure 4. To test the effect of hyper-parameter variations on the performance of Super-ONN models, we trained five Super-ONN variants:
1) Super-ONN (Q=3): 3-layer Self-ONN with super neurons and tanh activation functions.
3) Super-ONN (Q=3) LR-IN: 3-layer Self-ONN with super neurons and ReLU activation functions followed by instance normalization.
4) Super-ONN (Q=2) Residual LR-IN: 3-layer Self-ONN with super neurons, ReLU activation functions followed by instance normalization, and a residual input-output connection.
5) Super-ONN (Q=2) Reflection ReLU: 3-layer Self-ONN with super neurons, ReLU activation functions, and reflection padding at the image borders.
The results in Table 2 clearly show that all Super-ONN models significantly outperform both the DnCNN and Dn-SelfONN models regardless of their model variations. In particular, the performance gap over DnCNN exceeds 1.7 dB in PSNR, despite DnCNN having more than 5 times more layers and neurons. This demonstrates the superior learning capability of the super neurons over both generative and convolutional neurons.
Qualitatively, the superior denoising performance of Self-ONNs with super neurons (Super-ONNs) is once again visible in all output images shown in Figure 4. Super-ONNs not only achieve sharper edge and texture restoration but also recover the smooth regions better than the other networks.

B. Results with Shallow Models
To test the true learning capability of the super neurons, especially against generative neurons, we further apply the following severe restrictions and harsh conditions: i) very low resolution: 60x60 pixels; ii) compact/shallow models with only 2 hidden layers and fewer than 25 neurons: In×12×12×Out; iii) scarce training data: only 10% of the dataset is used for training and the rest for testing (10-fold cross-validation); and iv) limited kernel size (3x3 kernels except for the 2nd problem). For all problems, restriction (ii) is relaxed for the conventional and deformable CNNs, which have the configuration In×48×48×Out and are hence labeled "CNN×4" and "DefCNN×4". With 4 times more neurons than the Self-ONNs (In×12×12×Out), such an unfair comparative evaluation is intended to show the true learning capability of the super neurons. For Self-ONNs, Q = 3, 5, and 7 at the 1st hidden, 2nd hidden, and output layers, respectively. Moreover, the first hidden layer applies sub-sampling by ssx = ssy = 2, and the second one applies up-sampling by usx = usy = 2. For each regression problem, we use the Signal-to-Noise Ratio (SNR) evaluation metric, defined as the ratio of the signal power to the noise power, i.e., SNR = 10 log10(P_signal / P_noise). For the image transformation problem, we performed 10 experiments, each with 4 images transformed into another 4. For deblurring and denoising, the benchmark datasets are partitioned into train (10%) and test (90%) sets for 10-fold cross-validation. For each fold, all networks are trained using Stochastic Gradient Descent (SGD) with a fixed learning parameter, as presented in Table 4. Finally, 5 BP runs are performed, and the network model that achieves the minimum loss (MSE) during these runs is used for evaluation (tested over the rest of the dataset).
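The SNR metric used for the regression problems can be computed as in the minimal sketch below, where the target image is taken as the signal and the residual between the network output and the target as the noise.

import numpy as np

def snr_db(target, output):
    """SNR = 10*log10(signal power / noise power), with the residual as noise."""
    signal_power = np.mean(np.square(target))
    noise_power = np.mean(np.square(target - output))
    return 10.0 * np.log10(signal_power / noise_power)

target = np.random.rand(60, 60)
output = target + 0.01 * np.random.randn(60, 60)
print(f"{snr_db(target, output):.2f} dB")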

1) Deblurring
Image deblurring [24]-[36] can broadly be categorized as kernel-based estimation [24]-[30] or end-to-end restoration [31]-[36]. Deep CNNs have been used for each category, but in this study we evaluate the networks in an "end-to-end" configuration, which does not estimate the blurring kernel; rather, the blurred image is directly transformed into the restored (deblurred) image. We expect super neurons with non-localized kernel operations to achieve superior performance because image deblurring usually requires a large receptive field for enhanced global knowledge [37], while conventional CNNs (and Self-ONNs) can provide only local knowledge limited by the size of their filters. We consider two blurring problems: • Disc(R) blurring: a circular averaging filter (pillbox) within a square matrix of size 2R + 1.

• Motion(𝛬, 𝛩) blurring: the linear motion of a camera
where Λ specifies the length of the motion and Θ specifies the angle of motion in degrees in the counter-clockwise direction. In both problems, we aim to evaluate the learning capability of the super neurons in Self-ONNs against the generative, conventional, and especially deformable convolutional neuron models under harsh conditions; for this reason, along with the aforementioned restrictions, we apply severe blurring with the following parameter settings: R = 5, Λ = 11, and Θ = π/4 (i.e., 45°). Disc(R) basically applies an averaging over 11 × 11 pixels, and Motion(Λ, Θ) approximates a linear camera motion of 11 pixels diagonally. Both blurring artifacts can sometimes cause such severe image degradation that it becomes difficult or even infeasible to comprehend the content of the image (e.g., see Figure 6). Finally, the random bias ranges for the super neurons, where (α_k^i, β_k^i) ∈ ℤ[±Γ], are set as Γ = {4, 4, 2} for the 1st, 2nd, and output layers, respectively. With this setting, for instance, the 1st-layer super neurons have an improved receptive field size of 11x11 pixels, which is significantly larger than the original kernel size of 3x3 pixels.
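The two blur kernels can be generated as in the sketch below; the pillbox and linear-motion point-spread functions only approximate MATLAB's fspecial('disk', R) and fspecial('motion', Λ, Θ), which the paper does not name explicitly, so the exact kernels used in the experiments may differ slightly.

import numpy as np

def disc_kernel(radius=5):
    """Circular averaging (pillbox) filter inside a (2R+1) x (2R+1) square."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (xx ** 2 + yy ** 2 <= radius ** 2).astype(float)
    return k / k.sum()

def motion_kernel(length=11, angle_deg=45.0):
    """Linear-motion blur of a given length and angle (counter-clockwise)."""
    k = np.zeros((length, length))
    c = length // 2
    theta = np.deg2rad(angle_deg)
    for s in np.linspace(-(length - 1) / 2, (length - 1) / 2, length):
        x = int(round(c + s * np.cos(theta)))
        y = int(round(c - s * np.sin(theta)))   # image rows grow downward
        k[y, x] = 1.0
    return k / k.sum()

print(disc_kernel(5).shape, motion_kernel(11, 45.0).sum())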
Figure 5 shows the PSNR and SSIM plots of the best Disc-5 (top) and Motion (bottom) deblurring results per fold over the test partitions. The Self-ONNs with generative neurons (no bias) are labeled '0-0-0', those with super neurons having random bias within Γ = {4, 4, 2} are labeled '4-4-2', and those with BP-optimized bias are labeled 'Opt'. The conventional CNN and the two versions of the deformable CNNs are labeled 'CNN×4', 'DefCNN×4 v1', and 'DefCNN×4 v2', respectively. The average PSNR and SSIM scores are presented at the end of each corresponding plot.
In both problems, Self-ONNs achieve significantly higher PSNR (around 1 dB) and SSIM (> 4%) levels compared to the three CNN models with four times more neurons. Although both deformable CNNs achieve slightly higher PSNR and SSIM scores than the conventional CNNs in the majority of the folds, they fail to achieve a higher average performance due to the lowest scores obtained in two folds, indicating a robustness issue. Finally, in both problems, the Self-ONNs with super neurons achieve more than 0.6 dB higher PSNR scores on average compared to the Self-ONNs with generative neurons.
For a visual evaluation, Figures 6 and 7 show a set of Disc-5 and Motion-blurred (input) images, the target image, and the corresponding outputs of CNN×4 and the Self-ONNs (with no, random, and BP-optimized spatial biases) from the test partition. We skipped the outputs of both deformable CNNs since they have a very similar or occasionally worse visual quality than the conventional CNN×4 model. The superior deblurring performance of the Self-ONNs with super neurons is visible in all outputs.

2) Face Segmentation
Deep CNNs have often been used in face and object segmentation tasks [52]-[61]. In this study, we use the benchmark FDDB face detection dataset [44], which contains 2000 images with one or more human faces in each image. As per the aforementioned restriction, all images are down-sampled to 60x60 pixels, and at this very low resolution, pixel-accurate face segmentation becomes an even more challenging task.
As before, the random bias ranges for the super neurons, where (α_k^i, β_k^i) ∈ ℤ[±Γ], are set as Γ = {4, 4, 2} for the 1st, 2nd, and output layers, respectively; thus, for instance, the 1st-layer super neurons have an improved receptive field size of 11x11 pixels, which is significantly larger than the original kernel size of 3x3 pixels. Figure 8 shows the F1 plots of the best (in training) face segmentation results per fold over the test set. The average test F1 scores achieved by the three Self-ONNs are 80.95% (no bias), 83.83% (random bias), and 84.22% (BP-optimized bias), respectively, whilst CNN×4 has an F1 score of 75.94%. In both train and test partitions and in all folds, Self-ONNs achieve significantly higher performance than the CNNs, despite having four times fewer neurons. In particular, the average performance gap between CNNs and Self-ONNs with super neurons widens to around 8% and 5.6% in the train and test partitions, respectively. Finally, the Self-ONNs with super neurons achieve more than 3% (train) and 4% (test) higher scores on average than the corresponding Self-ONNs with generative neurons.

For a visual evaluation, Figure 9 shows some typical original input images (first column), their (target) ground-truth face maps (last column), and the corresponding outputs of CNN×4 and the three Self-ONNs (with no, random, and BP-optimized spatial bias) from the test partition. Clearly, the best face segmentation results belong to the Self-ONNs with super neurons, while CNNs suffer from severe false-positive regions. The super neurons with BP-optimized spatial bias yield the overall best results with minimal false positives.

C. Computational Complexity Analysis
In this section, the computational complexity of the proposed Self-ONNs with super neurons is analyzed with respect to parameter-equivalent Self-ONNs with generative neurons and the three CNN models with 4 times more neurons. As assumed in this study, when the pool operator is the summation, P_i^l = Σ, the FP of a Self-ONN with super neurons, Eq. (3), can be expressed as in Eq. (8), where Ψ is the (Taylor series) nodal operator function and w_ik^l(r, t) is a Q-dimensional array for the kernel element (r, t). Using the q-th order 2D kernel, w_ik^l⟨q⟩ (q = 1..Q), which is composed of the kernel elements w_ik^{l+1}(r, t, q), Eq. (8) can be simplified as Eq. (9). Such a 2D convolutional representation of a generative neuron's input map formation is illustrated in Figure 10. It is straightforward to see that this indeed resembles a multi-output and multi-kernel convolutional neuron. When the shifted powers of the output maps, (ỹ_k^{l-1})^q, for q = 1, ..., Q, are computed for all hidden neurons in the network, Eq. (9) simply turns into (Q × N_{l-1}) independent 2D convolutions. As in conventional CNNs, this can be implemented in a parallel manner and hence takes roughly a similar inference time. We can thus conclude that, in a parallelized implementation, a Self-ONN and a CNN with the same configuration have a similar computational complexity. For both super neuron models, the number of parameters, N_P, in the Self-ONNs can be expressed as in Eq. (10). For each network model, Table 3 presents the number of network parameters, N_P, and the memory overhead, which is the additional memory needed during the FP besides the network parameters and the I/O buffers for feature maps. Besides having 4 times more neurons, it is apparent from the table that all CNN×4 models have around 2.3 to 6 times more parameters and more than 2.2 times higher computational complexity than the Self-ONNs with super neurons configured with non-localized kernel operations by random spatial bias. While having a similar computational complexity, the only overhead cost of super neurons over the generative neurons is about 1.22 times more parameters due to the spatial bias elements. This is true for both models, randomized and BP-optimized bias; however, super neurons with BP-optimized bias have around 1.1 times higher computational complexity than the models with no bias (generative neurons) and random bias. This is due to the bilinear interpolation performed to compute the shifted output maps.
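A hedged sketch of the Eq. (9)-style input-map formation is given below: given the already-shifted (per-connection) output maps and their precomputed powers, the super neuron reduces to Q × N_l independent 2D convolutions, which is what makes its cost comparable to a CNN in a parallelized implementation. The function name, tensor shapes, and the use of torch.nn.functional.conv2d are illustrative assumptions, not the released FastONN code.

import torch
import torch.nn.functional as F

def super_neuron_input_map(shifted_outputs, kernels, bias):
    """shifted_outputs: list of N_l tensors (1, 1, H, W), already shifted per connection.
    kernels: tensor (N_l, Q, Kx, Ky) of Maclaurin coefficients per connection.
    Returns x_i^{l+1} = b + sum_k sum_q conv2d((y_k)^q, w_ik<q>)."""
    n_l, q_order = kernels.shape[:2]
    x = bias
    for k in range(n_l):
        # powers (y_k)^q can be computed once during FP and reused during BP
        powers = torch.cat([shifted_outputs[k] ** (q + 1) for q in range(q_order)], dim=1)
        w = kernels[k].unsqueeze(1)                       # (Q, 1, Kx, Ky)
        x = x + F.conv2d(powers, w, padding="same", groups=q_order).sum(dim=1, keepdim=True)
    return x

ys = [torch.randn(1, 1, 60, 60) for _ in range(3)]        # N_l = 3 shifted output maps
w = torch.randn(3, 3, 3, 3)                                # Q = 3, 3x3 kernels
print(super_neuron_input_map(ys, w, bias=0.0).shape)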
For the deformable CNN×4 models, v1 and v2, the memory overhead, M_l^+, can be expressed as M_l^+ = c × B × G × K_x × K_y × W_l × H_l elements, where c is a constant (c = 2 for v1 and c = 3 for v2), B is the batch size, G is the group size, and W_l and H_l are the width and height of the input feature map of layer l. The memory overhead can, therefore, become infeasibly large, especially for deep networks with practical settings. As an example, for a single layer with 256×256-pixel feature maps, 3×3 kernels, and B = G = 8, the v1 and v2 versions of deformable CNNs require around 302 MB and 452 MB of extra memory, respectively.
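The quoted figures can be verified with the formula above (a quick check assuming 4-byte floating-point storage):

$$M_{v1} = 2 \times 8 \times 8 \times (3\times 3) \times 256 \times 256 \times 4\ \text{bytes} \approx 302\ \text{MB}, \qquad M_{v2} = 3 \times 8 \times 8 \times (3\times 3) \times 256 \times 256 \times 4\ \text{bytes} \approx 452\ \text{MB}.$$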

V. CONCLUSIONS
The ancient neuron model from the 1950s [38] has been used by MLPs ever since and was later shared by their popular derivative, the conventional CNNs. As a linear model, it can only perform linear transformations with "localized" kernels, making CNNs entirely homogeneous, with a static neuron model in terms of both transformation and localization. This study is inspired by the well-known proverb, "doing the right thing at the right place and the right time". Self-ONNs with the generative neuron model can do the "right thing" by customizing each nodal operator on the fly; the generative neurons can create the best possible operator for the kernel of each connection during BP training. However, generative neurons can neither locate the "right place" for their kernels nor enhance their receptive field, which is bounded by the kernel size. To overcome this, the proposed super neurons can be jointly optimized to do the right transformation at the right (kernel) location of the right connection to maximize the learning performance. This study has proposed two models for super neurons: randomized and BP-optimized kernel localization for each connection. Both models improve the size of the receptive fields, but only the latter can seek the right (kernel) location of each connection. However, we observe that the underlying problem may not require the "right" location; in this case, both approaches are expected to perform equally well, or the former approach can even work slightly better than the latter, because it can optimize each nodal operator of each kernel during the entire BP run without altering the location. In contrast, the latter approach jointly optimizes both the nodal operator and the location (spatial bias) during the BP run, which is a significantly harder task because the optimization of the nodal operator cannot be finalized while the kernel keeps moving in each BP iteration. In other words, the optimal nodal operator will obviously be different for different kernel locations, and until the location (the spatial bias) has converged, the nodal operator optimization cannot be finalized.
Both models of super neurons are evaluated against the conventional and deformable convolutional neurons of CNNs and the generative neurons of Self-ONNs. First, shallow Self-ONNs with super neurons (Super-ONNs in short) were tested against the deep DnCNN and DnONN models on a real-world denoising problem. Despite being significantly shallower models with few neurons, Super-ONNs outperformed both deep models. Then, to reveal the true learning capabilities of super neurons, we purposefully selected challenging learning tasks and applied harsh learning conditions and restrictions such as scarce training data, shallow configurations with few neurons, and minimal kernel size. Even though 4 times more learning units (neurons) were used for all CNN models in the comparative evaluations, the results clearly show that Self-ONNs with super neurons achieve a superior learning and generalization capability thanks to the improved receptive field size they provide. The computational complexity analysis reveals that an elegant computational efficiency is also achieved in terms of network parameters and memory overhead. In most problems, a notable performance gap is observed over the conventional Self-ONNs with generative neurons without any significant computational burden.
We foresee that a further performance boost can be expected for the Self-ONNs with super neurons with the following improvements:
• instead of fixing the two hyperparameters, Q and Γ, to naïve values, optimizing each parameter per layer,
• adopting a better optimization scheme for training, e.g., SGD with momentum [39], AdaGrad [40], RMSProp [41], Adam [42] and its variants [43], all of which should be adapted for Super-ONNs to function properly,
• and implementing other kernel operations such as scaling and rotation.
These will be the topics of our future research. The optimized PyTorch implementations of Self-ONNs and Super-ONNs are publicly shared in [62].

A. Training by Back Propagation for Self-ONNs with Generative Neurons
For Self-ONNs, the contribution of each pixel of the M × N output map, y_k^l(m, n), to the next-layer input map, x_i^{l+1}(m, n), can be expressed as in Eq. (12). Using the chain rule, the delta error of the output pixel, y_k^l(m, n), can therefore be expressed as in Eq. (13) in the generic form of the pool, P_i^{l+1}, and the composite nodal operator function, Ψ, of each operational neuron i ∈ [1, .., N_{l+1}] in the next layer. With the derivative terms in Eq. (13) defined accordingly, it simplifies to Eq. (14). Note further that Δy_k^l, ∇_Ψ P_i^{l+1}, and ∇_y Ψ have the same size, M × N, while the next-layer delta error, Δ_i^{l+1}, has the size (M − K_x + 1) × (N − K_y + 1). Therefore, to enable the variable 2D convolution in this equation, the delta error, Δ_i^{l+1}, is padded with zeros at all four boundaries (K_x − 1 zeros on the left and right, K_y − 1 zeros on the bottom and top). Thus, ∇_y Ψ(m, n, r, t) can simply be expressed as in Eq. (15), and the back-propagated delta error becomes

$$\Delta y_k^{l}(m,n)\Big|_{(0,0)}^{(M-1,N-1)} = \sum_{i=1}^{N_{l+1}} \left( \sum_{r}\sum_{t} \Delta_i^{l+1}(m-r,\,n-t)\; \nabla_{\Psi}P_i^{l+1}(m-r,\,n-t,\,r,\,t)\; \nabla_{y}\Psi_{ki}^{l+1}(m-r,\,n-t,\,r,\,t) \right) \qquad (16)$$

When there is down-sampling by factors ssx and ssy, the back-propagated delta error should first be up-sampled to compute the delta error of the neuron. Let the zero-order up-sampled map be uy_k^l = up_{ssx,ssy}(y_k^l); then Eq. (16) can be modified accordingly. As for the kernel sensitivities, the q-th element of the array, w_ik^{l+1}(r, t), contributes to all the pixels of x_i^{l+1}(m, n). By using the chain rule of partial derivatives, the weight sensitivities can be expressed as in Eq. (21). In Eq. (21), there is no need to register a 4D matrix for ∇_w Ψ = (y_k^l(m + r, n + t))^q, since it can be computed directly from the outputs of the neurons. Moreover, when the pool operator is the sum, ∇_Ψ P_i^{l+1}(m, n, r, t) = 1 and Eq. (21) simplifies to Eq. (22), where ∂E/∂w_ik^{l+1}⟨q⟩ is the q-th 2D sensitivity kernel, which contains the updates (SGD sensitivities) for the weights of the q-th order outputs in the Maclaurin polynomial. Finally, the bias sensitivity expressed in Eq. (23) is the same for ONNs and CNNs, since the bias is the common additive term for all.
Let w_ik^{l+1}⟨q⟩ be the q-th 2D sub-kernel, where q = 1..Q, composed of the kernel elements w_ik^{l+1}(r, t, q). During each BP iteration, t, the kernel parameters (weights), w_ik^{l+1}⟨q⟩(t), and biases, b_i^l(t), of each neuron in the Self-ONN are updated until a stopping criterion is met. Let ε(t) be the learning factor at iteration t. One can then express the update for the weight kernel and bias of each neuron, i, at layer l as shown below. The resulting pseudo-code for BP is presented in Alg. 1.
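A hedged restatement of the SGD update referenced above (the exact equation numbering is kept as in the text):

$$w_{ik}^{l+1}\langle q\rangle(t+1) = w_{ik}^{l+1}\langle q\rangle(t) - \varepsilon(t)\,\frac{\partial E}{\partial w_{ik}^{l+1}\langle q\rangle}, \qquad b_{i}^{l}(t+1) = b_{i}^{l}(t) - \varepsilon(t)\,\frac{\partial E}{\partial b_{i}^{l}}.$$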

Algorithm 1: BP training for Self-ONNs with generative neurons
Input: Self-ONN, the train dataset, and the learning factor, ε. Output: Self-ONN* = BP(Self-ONN, train dataset, ε)
1) Initialize the network parameters randomly (i.e., ~U(-a, a)).
2) UNTIL a stopping criterion is reached, ITERATE:
   a. For each mini-batch in the train dataset, DO:
      i. FP: Forward-propagate from the input layer to the output layer to find the q-th order outputs, (y_k^l)^q, and the required derivatives and sensitivities for BP, such as f'(x_k^l), ∇_y Ψ_ki^{l+1}, ∇_Ψ P_i^{l+1}, and ∇_w Ψ_ki^{l+1} of each neuron, k, at each layer, l.
      ii. BP: Compute the delta error at the output layer and then, using Eqs. (14) and (16), back-propagate the error to the first hidden layer to compute the delta error, Δy_k^l, of each neuron, k, at each layer, l.
      iii. PP: Find the bias and weight sensitivities using Eqs. (22) and (23), respectively.
      iv. Update: Update the weights and biases with the (accumulated) sensitivities found in the previous step, scaled with the learning factor, ε, as in Eq. (49).
3) Return Self-ONN*

B. BP for Non-localized Kernel Operations by Random Bias
In conventional BP, starting from the output (operational) layer, the error is back-propagated to the 1st hidden layer. For the sake of simplicity, for an image I in the training dataset, suppose that the error (loss) function is the L2 loss, i.e., the Mean-Square-Error (MSE) function, E(I), computed over the pixels, p, of the image, where T is the target output and y_1^L is the predicted output, as expressed in Eq. (25). The delta error in the output layer of the input map can then be expressed as in Eq. (26).
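A hedged reconstruction of the loss and output-layer delta error referenced above (the normalization by the number of pixels, P, is an assumption; the paper's exact Eqs. (25)-(26) may omit it):

$$E(I) = \frac{1}{P}\sum_{p=1}^{P}\bigl(y_1^{L}(p) - T(p)\bigr)^{2}, \qquad \Delta_1^{L}(p) = \frac{\partial E}{\partial y_1^{L}(p)} = \frac{2}{P}\bigl(y_1^{L}(p) - T(p)\bigr).$$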
Then Eq. (33) can be updated accordingly. As for the computation of the sensitivities for the kernel parameters and the bias, ∂E/∂b_i^{l+1}, Eq. (3) indicates that the q-th element of the array, w_ik^{l+1}(r, t), contributes to all the pixels of x_i^{l+1}(m, n). Once again, by using the chain rule of partial derivatives, the sensitivities for the kernel parameters can be expressed as in Eq. (36), and for the bias sensitivity, the chain rule yields Eq. (38).

C. BP for Non-localized Kernel Operations by the BP-optimized Bias
Recall that Eq. (6) allows us to compute the derivatives of the output map w.r.t. the individual bias elements, as expressed in Eq. (41). These derivatives will be needed in the BP formulation covered in this section. The delta error in the output layer of the input map is the same as in Eq. (25). With ỹ_k^l(m, n) = y_k^l(m + α_k^i, n + β_k^i), Eq. (28) can be simplified as in Eq. (42), and with P_i^{l+1} = Σ, it yields Eq. (43), where ∇_ỹ P_i^{l+1}(m, n, r, t) = ∇_Ψ P_i^{l+1}(m, n, r, t) × ∇_ỹ Ψ(m, n, r, t) = ∇_ỹ Ψ(m, n, r, t), and ∇_ỹ Ψ(m, n, r, t) can be directly computed as in Eq. (44). Finally, the delta error of ỹ_k^l (from its contribution to x_i^{l+1} alone) can then be computed. Basically, in these equations, we are using the grid of ỹ_k^l(m, n), not the original grid of y_k^l. However, we need to compute the individual Δy_k^l from the Δỹ_k^l of each connection in the next layer so that we can accumulate them into the overall delta error for y_k^l. To accomplish this, as in the earlier approach with random (integer) bias, the overall delta error for the output map, Δy_k^l, will be computed as the accumulation of the back-shifted individual delta errors, Δỹ_k^l, computed for each connection, with m' = m + ⌊α_k^i⌋ and n' = n + ⌊β_k^i⌋. Since the bias elements are not integers, we should now use reverse interpolation to first compute Δy_k^l(m + ⌊α_k^i⌋, n + ⌊β_k^i⌋), as illustrated in Figure 11. Once again using bilinear interpolation, Δy_k^l(m', n') can be computed as expressed in Eq. (45). As in the random-bias approach, the overall delta error for the output map, Δy_k^l, is computed as the accumulation of the back-shifted individual delta errors using Eq. (40). Once on the integer grid, it is straightforward to compute Δy_k^l using Eq. (32). After the (overall) Δy_k^l is computed, the delta error, Δ_k^l, can be computed using Eq. (33) (or Eq. (34) or (35) in case down- or up-sampling is performed), and hence the back-propagation of the (delta) error from layer l+1 to the k-th neuron at layer l is completed. Once the back-propagation of the delta errors is completed, the weight and bias sensitivities can be computed using Eqs. (37) and (38). Similarly, it is straightforward to show that the spatial bias sensitivity, ∂E/∂β_k^i, can be expressed as in Eq. (48), where ∇_β ỹ(m, n) = ∂ỹ_k^l(m, n)/∂β_k^i, as expressed in Eq. (41).
It is interesting to see that both spatial bias sensitivities depend on the cross-correlation of two distinct gradients: the delta error of the shifted (interpolated) output map and its direct derivative w.r.t. the corresponding bias element. This means that, during BP iterations, the ongoing gradient descent operation, e.g., Stochastic Gradient Descent (SGD), will keep updating the kernel location until either the correlation between these two gradients vanishes (e.g., they become uncorrelated) or the (magnitude of the) delta errors diminishes at the final stages of BP (e.g., convergence of the gradient descent). In other words, the local optimal location of a particular kernel of a particular connection, if it exists for the problem at hand, will be reached when either of these conditions is satisfied (i.e., when Δα_k^i, Δβ_k^i ≈ 0). During each BP iteration, t, the kernel parameters, w_ik^{l+1}⟨q⟩(t), the biases, b_i^l(t), and the (spatial) bias pairs, α_k^i(t), β_k^i(t), of each super neuron in the Self-ONN are updated until a stopping criterion is met. Let ε(t) and γ(t) be the learning factors at iteration t for the weights and the spatial bias pairs, respectively. One can express the SGD update for the kernel parameters, bias, and kernel location of each super neuron, i, at layer l as in Eq. (49). The parameters of a Self-ONN for BP training via SGD are presented in Table 4.
2) UNTIL a stopping criterion is reached, ITERATE (t = 1:maxIter):
   a. For each batch in the train dataset, DO:
      i. Init: Assign the next item, I_p, directly as the output map(s) of the input-layer neurons and, using Eq. (6), create the shifted output map(s) along with their powers, (ỹ_k^0)^q, ∀q ∈ [1, Q_1], where Q_1 is the polynomial order of the super neurons in the 1st hidden layer.
      ii. FP: From the previous-layer (shifted) output maps, compute each input map in the 1st hidden layer, x_i^1, ∀i ∈ [1, N_1], using Eq. (7), then the native output maps, y_i^1, and finally the shifted output maps along with their powers, (ỹ_i^1)^q.
      iii. FP: Then compute the required derivatives and sensitivities for each hidden layer, such as f'(x_k^l), ∇_ỹ Ψ_ki^l, and ∇_w Ψ_ki^l of each neuron, i, at each layer, l (∇_Ψ P_i^l = 1).
      iv. FP: Repeat (ii) until the output layer is reached. Compute the output map(s), y_1^L(I_p), of the neurons in the output layer and then compute the MSE and the delta error, Δ_1^L, using Eqs. (25) and (26), respectively.
      v. BP: For each hidden neuron at the last hidden layer, compute the delta error for the shifted output map using Eq. (39) and then, using Eq. (45), perform reverse interpolation (and shift) to compute the delta error of the actual output map for each connection to the next layer.
      vi. BP: Using Eq. (32), compute the overall delta error for the output map, Δy_k^l, as the accumulation of the back-shifted individual delta errors.
      vii. BP: Finally, using Eq. (33) (or Eq. (34) or (35) in case down- or up-sampling is performed), compute the delta error at this level, Δ_k^l.
      viii. PP: Compute the sensitivities for the kernel parameters, bias, and spatial bias pair using Eqs. (37), (38), (47), and (48), respectively.
      ix. Update: Update the kernel parameters, bias, and kernel location of each super neuron in the network with the (accumulated) sensitivities found in step (viii), scaled with the current learning factors, ε(t) and γ(t), using Eq. (49).

3) Return Self-ONN*
To initiate BP training by SGD over a dataset, a Self-ONN is first configured according to the network parameters, i.e., the number of layers (L) and hidden neurons (N_l), the kernel size (K_x, K_y), the pooling type, and the (polynomial) order for each layer/neuron are set in advance. Let Self-ONN(0) be the initially configured network ready for BP training. In the pseudo-code for BP training presented in Alg. 1, five consecutive stages in an iterative loop are visible: 1) BP initialization (Step 1); 2) forward propagation (FP) of each image in the batch, where the native and shifted (interpolated) output maps, the derivatives, and the output MSE and delta error are computed (Step 2.a, i-iv); 3) back-propagation (BP) of the delta error from the output layer to the first hidden layer (Step 2.a, v-vii); 4) post-processing (PP), where the kernel parameter and bias sensitivities and the sensitivities of the spatial bias pair are computed (Step 2.a, viii); and 5) Update: when all images in the batch are processed, the kernel, bias, and kernel location of each super neuron in the network are updated, and this is repeated for the other batches and iterations. The pseudo-code in Alg. 1 can also be used for a Self-ONN with super neurons configured with non-localized kernel operations by random spatial bias, provided the following steps are modified accordingly. First, the initialization of the bias elements in Step 1.b should be integer-valued, i.e., α_k^i(0) = ⌊U(-Γ, +Γ)⌋ and β_k^i(0) = ⌊U(-Γ, +Γ)⌋ for ∀i ∈ [1, N_{l+1}], ∀k ∈ [1, N_l]. Then, since the spatial bias elements are now integers, Eq. (3) can be used instead of Eq. (7) for FP. Steps 2.a.iv and 2.a.vii are identical for both approaches. The main difference in BP is Step 2.a.v, where Eq. (31) should be used instead of Eq. (39) for the delta error computed for the connection to the i-th neuron at layer l+1, and there is no need for reverse interpolation; hence, Eq. (45) is simply omitted. Obviously, for post-processing (PP) at Step 2.a.viii and the update at Step 2.a.ix, Eqs. (47), (48), and (49) are omitted as well, since there is no gradient computation for the spatial bias pair, α_k^i and β_k^i, as they are fixed as integers during Step 1.

D. Proof of Concept
To validate the super neurons' ability to learn the true shift using BP optimization of the spatial bias pair, a Self-ONN with one hidden layer and a single neuron is trained over a toy problem where the network aims to learn to regress (transform) an input image into an output image, which is the shifted version of the input image by (α, β) ∈ ℤ[-Γ, Γ], i.e., y_0^2(m, n) = y_0^0(m + α, n + β). With this setup, we can validate whether the super neurons with non-localized kernels are able to learn the true shift collectively during BP training, and if so, whether the Self-ONN is able to regress the target (shifted) image perfectly. Figure 12 illustrates this over a sample image where the output image is the shifted version of the input image by (α = 6, β = -7) pixels. In this ideal regression case, the cumulative bias shift of the two super neurons in the x- and y-directions is indeed equal to the target shift, i.e., ∑(α_0^0, β_0^0) = (6, -7), while the learned 1st-order kernels are impulses, i.e., w_00^1(r, t) = w_00^2(r, t) = δ(r, t). Since this is a validation experiment where the convergence of the cumulative bias is compared against the actual shift, we keep Q = 1 to avoid higher-order (nonlinear) operations and thus achieve a perfect reconstruction by linear convolution.
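The toy regression pair described above can be generated as in the sketch below. The helper is hypothetical and the 60x60 size and zero-padding at the borders are assumptions; the shift values (α = 6, β = -7) follow the Figure 12 example.

import numpy as np

def shifted_target(img, alpha=6, beta=-7):
    """Create the target y(m, n) = img(m + alpha, n + beta) with zero-padding,
    so a single super neuron only has to learn the spatial bias and an impulse kernel."""
    h, w = img.shape
    pad = max(abs(alpha), abs(beta))
    padded = np.pad(img, pad)
    return padded[pad + alpha: pad + alpha + h, pad + beta: pad + beta + w]

img = np.random.rand(60, 60)
target = shifted_target(img)            # input/target pair for the toy regression
print(img.shape, target.shape)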
Over 40 input images randomly selected from the Pascal dataset, we created the target images with random shifts within ±Γ pixels. Figure 13 shows four examples of this verification experiment, where the input image and the output and target images are shown in the first and the last two columns, respectively. The 2nd and 3rd columns show bar plots of the kernels, and the 4th column shows the plots of the cumulative bias elements (hidden and output super neurons) at each BP iteration as a blue point. The cumulative shift, ∑(α_0^0, β_0^0), and the target shift, (α, β), are shown with red circles on the plot. The spatial bias pair is initially set to (α_0^0, β_0^0) = (0, 0). The BP iterations are stopped when the regression SNR reaches 35 dB. In all experiments, including the four shown in the figure, the cumulative bias converged to the close vicinity of the actual shift, and we observed that offsets such as (0,1), (1,0), or (1,1) pixels are accommodated by the 2x2 kernels with shifted impulses. This is also visible in the figure where the offset is (1,1) pixels. In the experiments shown in the first and third rows, the kernel functions in the 1st and 2nd (output) layers are w_00^1(r, t) ≅ δ(r-1, t-1) and w_00^2(r, t) ≅ δ(r, t), while in the fourth row they are w_00^1(r, t) ≅ δ(r, t-1) and w_00^2(r, t) ≅ δ(r-1, t). Since the early-stopping criterion is set to SNR = 35 dB, the kernels only approximate the (shifted) impulses. A common observation in all experiments is that the spatial bias elements usually converged during the early stages of BP, i.e., within around 20-50 iterations, while the optimization of the kernels was initiated afterwards. In brief, this "Proof of Concept" demonstration shows a unique capability of the super neurons in a regression problem: with only a single hidden neuron, the network can perfectly regress, from an arbitrary input image, the output image that is the shifted version of the input. Such an image transformation is not possible for conventional CNNs, or even Self-ONNs with generative neurons, unless the effective receptive field is expanded by using sufficiently deep and complex networks.

Figure 1 :
Figure 1: An illustration of the nodal operations in the kernels of the i th CNN (left), ONN (middle), and Self-ONN (right) neurons at layer l+1 [10].

Figure 2 :
Figure 2: Localized (top) vs. non-localized kernel operations (bottom) to create the pixel, x_i^{l+1}(m, n), from the output maps of the previous-layer neurons. At the bottom right, randomly localized (uniformly distributed) kernels within a spatial bias range of ±Γ are shown. At the bottom left, the BP-optimized locations of each kernel during a BP epoch, with bias gradients (Δα_k^i, Δβ_k^i) (yellow vectors), are illustrated.

Figure 4 :
Figure 4: Some sample images with real-world noise (left), their zoomed sections (2nd column), and the corresponding outputs of the DnCNN (3rd column), DnONN (4th column), and Super-ONN (right) from the validation set of the SIDD dataset.

Figure 5 :
Figure 5: Best PSNR and SSIM scores for each Disc-5 (top) and Motion (bottom) deblurring fold achieved by the corresponding Self-ONNs (with no, random and BP-optimized spatial biases) and the three CNN×4 configurations over the test set.

Figure 6 :
Figure 6: Some typical original (target) and Disc-5 blurred (input) images and the corresponding outputs of the CNN×4 and the three Self-ONNs (with no, random and BP-optimized spatial bias) from the test partition.

Figure 7 :
Figure 7: Some typical original (target) and Motion blurred (input) images and the corresponding outputs of the CNN×4 and the three Self-ONNs (with no, random and BP-optimized spatial bias) from the test partition.

Figure 8 :
Figure 8: Best F1 scores for each face segmentation fold achieved by the corresponding Self-ONNs (with no, random, and BP-optimized spatial biases) and CNN×4.

Figure 9 :
Figure 9: Some typical original input images, their (target) ground-truth face maps, and the corresponding outputs of the CNN×4 and the three Self-ONNs (with no, random, and BP-optimized spatial bias) from the test partition.

Figure 10 :
Figure 10: The illustration of a Self-ONN neuron equivalent to Figure 1 (right) when the pool operator is the summation, P_i^l = Σ, and the activation function is tanh.
Figure 11 :
Figure 11: Reverse interpolation of the delta error for non-localized kernel operations by the BP-optimized bias.

Figure 12 :
Figure 12: A sample Self-ONN with a single (hidden) super neuron over the toy problem. The perfect regression of the target is illustrated (SNR = ∞) for the ideal case.

Figure 13 :
Figure 13: Four "Proof of Concept" verification experiments where the target images are created with random shifts are shown at each row.The 2 nd and 3 rd columns show bar plots of the kernels and the 4 th column shows the plots of the cumulative bias elements (hidden and output super neurons) in each BP iteration with a blue point.The cumulative, ∑(   ,    ), and target shifts (, ) are shown with the red circles on the plot.The BP is stopped at the iteration when the SNR is reached to 35dB.

Table 1 presents the formula abbreviations and mathematical symbols used in this article [10]. Back-Propagation (BP) training for Self-ONNs is briefly formulated in Appendix A, and further details can be obtained from [10].
The details of BP training are covered in Appendices B and C.

Table 3: Comparison of the total number of multiply-accumulate operations for the networks used in this study.