Two-argument activation functions learn soft XOR operations like cortical neurons

Neurons in the brain are complex machines with distinct functional compartments that interact nonlinearly. In contrast, neurons in artificial neural networks abstract away this complexity, typically down to a scalar activation function of a weighted sum of inputs. Here we emulate more biologically realistic neurons by learning canonical activation functions with two input arguments, analogous to basal and apical dendrites. We use a network-in-network architecture where each neuron is modeled as a multilayer perceptron with two inputs and a single output. This inner perceptron is shared by all units in the outer network. Remarkably, the resultant nonlinearities often produce soft XOR functions, consistent with recent experimental observations about interactions between inputs in human cortical neurons. When hyperparameters are optimized, networks with these nonlinearities learn faster and perform better than conventional ReLU nonlinearities with matched parameter counts, and they are more robust to natural and adversarial perturbations.


Introduction
Neurons in the brain are not simply linear filters followed by a half-wave rectification, and exhibit properties like divisive normalization (Heeger, 1992; Carandini and Heeger, 2012), coincidence detection (Larkum et al., 1999; Branco et al., 2010), and history dependence (Barlow et al., 1961; Rieke and Warland, 1999). Instead of fixed canonical nonlinear activation functions such as sigmoid, tanh, and ReLU, other nonlinearities may be both more realistic and more useful (Poirazi et al., 2003; Beniaguev et al., 2021; Jones and Kording, 2021). We are particularly interested in multivariate nonlinearities like $f(w_1 x, w_2 x, \ldots)$, where the arguments could correspond to inputs that arise, for example, from multiple distinct pathways such as feedforward, lateral, or feedback connections, or from different dendritic compartments. Such multi-argument nonlinearities could allow one feature to modulate the processing of the others.
Recent work showed that a single dendritic compartment of a single neuron can compute the exclusive-or (XOR) operation (Gidon et al., 2020). The fact that an artificial neuron could not compute this basic computational operation discredited neural networks for decades (Minsky and Papert, 1969). Although XOR can be computed by networks of neurons, the finding that even single neurons can too highlights the possibility that individual neurons may be much more sophisticated than is often assumed in machine learning. Many single-argument nonlinearities permit universal computation, but the right nonlinearity could allow faster learning and better generalization, both for the brain and for artificial networks.
To investigate this, we parameterize the nonlinear input-output transformation flexibly by an "inner" neural network, which becomes a "subroutine" called from the conventional "outer" network made of many of these complex neurons, with parameters that are shared across all layers and all nodes of a given cell type (Figure 1). We evaluate fully-connected and convolutional feedforward networks on image classification tasks given a diverse set of random initial conditions. We focus especially on two-argument nonlinearities learned from MNIST and CIFAR-10 datasets.

Related work
Numerous recent studies have focused on developing novel activation functions, building on the simplicity and reliability of ReLU (Hahnloser et al., 2000; Nair and Hinton, 2010). These studies can be distinguished by the type of learning algorithm used for optimizing the activation function and the size of the search space. Many recent modifications such as PReLU (He et al., 2015), ELU (Clevert et al., 2015), SELU (Klambauer et al., 2017), and GELU (Hendrycks and Gimpel, 2016) provide single-argument activation functions with a small number of parameters that are mostly fixed (or tuned through hyperparameter optimization). However, such hand-designed functional forms result in restricted expressivity. Swish (Ramachandran et al., 2017) is noteworthy in this respect, because its activation function is discovered by a combination of exhaustive search and reinforcement learning. The search space in this case is based on a set of predetermined one- and two-argument functions, so this approach can span a broader class of nonlinearities than past work, although it is limited by the specific basis set and the combination rules chosen.
More closely related to our work, the network-in-network architecture proposes to replace groups of simple ReLUs with a fully connected network (Lin et al., 2013). This activation function allows arbitrary dimensional inputs and outputs; thus it is essentially the most general and expressive nonlinear function. However, our work is primarily motivated by neurons in the brain, which can be formalized as multi-input and single-output nonlinear units. As in network-in-network, we parameterize the nonlinear many-to-one transformation by a fully-connected multi-layer network to examine the learned spatial activation function without sacrificing its representational power.
The multi-argument nonlinear transformation is also a canonical operation subsumed under emerging network architectures such as graph neural networks (GNNs) (Scarselli et al., 2008; Li et al., 2015; Kipf and Welling, 2016; Hamilton et al., 2017) and transformers (Vaswani et al., 2017; Jaegle et al., 2021). As conceptual extensions from scalar to vector-valued inputs, the message functions in GNNs are multi-input nonlinearities while the scaled dot-product attention in transformers can be viewed as a three-input argument nonlinearity. Although these architectures evaluate performance benefits of specific multi-argument activations, to the best of our knowledge, ours is the first study to characterize the emergent properties of multivariate nonlinear activation functions and their connection to the neuronal nonlinearities in the brain.

Model structure
To define our multi-argument nonlinearity, we introduce the concepts of inner network and outer network. The inner network aims to learn an arbitrary multivariate nonlinear function $f(x_1, \ldots, x_n)$ with n inputs and a single output. This will replace the regular scalar activation functions like ReLU. The outer network refers to the rest of the model architecture aside from the activation function. Our framework, composed of two disjoint networks, is flexible and general since diverse neural architectures can be used as outer networks, such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), ResNets, etc. On the other hand, for the inner network, we use MLPs that have two hidden layers with 64 units followed by ReLU nonlinearities. The MLP is shared across all layers, analogous to the fixed canonical nonlinear activation functions commonly used in feedforward deep neural networks. When we test a CNN-based outer network, we use 1 × 1 convolutions instead of MLPs for the inner network to make the model fully convolutional, but the inner network is otherwise essentially the same as the two-layer MLP. In this framework, the 1 × 1 conv implies that the inputs to the inner network are channel-wise features, which is similar to the idea of mixing channel information per location in the recent MLP mixer architecture (Tolstikhin et al., 2021). Figure 2 summarizes how the inner network is incorporated into the outer network.
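The relationship between the two networks can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the helper names (`init_dense`, `inner_net`, `outer_layer`), the Gaussian weight initialization, and the linear output layer of the inner network are assumptions, and layer normalization and dropout are omitted. Each outer unit computes `n_args` different weighted sums of its inputs and passes them through the one shared inner network.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, W, b):
    # W holds one row of weights per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_dense(n_in, n_out, rng):
    # Simple scaled Gaussian initialization (an assumption for this sketch).
    scale = (1.0 / n_in) ** 0.5
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def init_inner(n_args, rng, width=64):
    # Two hidden layers of 64 units, single scalar output.
    return [init_dense(n_args, width, rng),
            init_dense(width, width, rng),
            init_dense(width, 1, rng)]

def inner_net(args, params):
    # Shared multi-argument activation f(x_1, ..., x_n).
    h = args
    for W, b in params[:-1]:
        h = relu(dense(h, W, b))
    W, b = params[-1]
    return dense(h, W, b)[0]

def outer_layer(x, layer, inner_params, n_args):
    # The outer layer produces n_args pre-activations per unit;
    # every unit then calls the same inner network.
    W, b = layer
    pre = dense(x, W, b)
    n_units = len(pre) // n_args
    return [inner_net(pre[u * n_args:(u + 1) * n_args], inner_params)
            for u in range(n_units)]
```

Because `inner_params` is passed to every unit and every layer, the number of inner-network parameters does not depend on the width or depth of the outer network.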

Training procedure
Pretraining (session I)
We first generate a random activation function and then use supervised learning to pretrain our inner network to match it (Figure 3a). The motivation for this inner network pretraining stage is that common initialization methods (Glorot and Bengio, 2010; He et al., 2016) do not generate spatial activations that are "random" enough to study the changes in functions over time. To start with a sufficiently complex initial nonlinearity, we create a piecewise constant random output sampled uniformly from [−1, 1] over a 5 × 5 grid of unit squares tiling the input space. We blur this by a 2D Gaussian kernel (σ = 3 units) to define a random smooth activation map. This function serves as the target for the inner network to match (Figure 3a). Example activation functions after pretraining are shown in Figure 4b. This produces our initialized inner network, whose parameters are transferred to the next phase of training.
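One way to construct such a target is sketched below. The grid placement (centered on the origin, covering [−2.5, 2.5] in each argument) and the treatment of the region outside the grid as zero are assumptions, as are the function names. Since the map is piecewise constant, its Gaussian blur can be evaluated in closed form with the error function: along each axis, the kernel centered at the query point assigns each unit cell the Gaussian mass falling inside it.

```python
import math
import random

def make_grid(rng, size=5):
    # Piecewise-constant values sampled uniformly from [-1, 1].
    return [[rng.uniform(-1.0, 1.0) for _ in range(size)] for _ in range(size)]

def _gauss_cdf(t, mean, sigma):
    return 0.5 * (1.0 + math.erf((t - mean) / (sigma * math.sqrt(2.0))))

def _cell_weight(x, lo, sigma):
    # Mass that a Gaussian centered at x places on the unit cell [lo, lo + 1].
    return _gauss_cdf(lo + 1.0, x, sigma) - _gauss_cdf(lo, x, sigma)

def target(x1, x2, grid, sigma=3.0):
    # Blurred activation map evaluated at (x1, x2); zero outside the grid.
    size = len(grid)
    origin = -size / 2.0
    w1 = [_cell_weight(x1, origin + i, sigma) for i in range(size)]
    w2 = [_cell_weight(x2, origin + j, sigma) for j in range(size)]
    return sum(grid[i][j] * w1[i] * w2[j]
               for i in range(size) for j in range(size))
```

Pretraining would then regress the inner network onto `target` at sampled input points, for example with a squared-error loss.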

Training inner and outer networks (session II)
Next we merge the pretrained inner network with the outer network via parameter sharing (Figure 3b) and apply this general network-in-network architecture to the task of image classification. In this session, both networks are trained simultaneously so that the entire network is made to learn over what might be analogous to an evolutionary timescale on which nonlinear cell properties emerge. As our baseline outer networks, we use (1) MLPs that have three hidden layers with 64 units or (2) CNNs that have four convolutional layers with [60, 120, 120, 120] kernels of size 3 × 3 and a stride of 1, using 2 × 2 max-pooling with a stride of 2. Aside from the MLPs or convolutional layers, the outer network uses other standard architectural components: layer normalization (Ba et al., 2016) (placed before inner networks) and dropout (Srivastava et al., 2014) (placed after each hidden/convolutional layer; p = 0.5). Our models are trained on the MNIST and CIFAR-10 datasets using ADAM (Kingma and Ba, 2014) with a learning rate of 0.001 until the validation error saturates; early stopping is used with a window size of 20. We freeze the learned nonlinearity $f_{\text{inner-net}}(\cdot)$ at the time of saturation or at a maximum epoch. Examples of learned nonlinearities are shown in Figure 4c.
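The stopping rule can be sketched as follows. The text does not spell out how the window is applied, so this sketch assumes one common reading: training stops once the validation error has failed to improve for 20 consecutive epochs. The class name and interface are hypothetical.

```python
class EarlyStopping:
    """Stop when validation error has not improved for `window` epochs."""

    def __init__(self, window=20):
        self.window = window
        self.best = float("inf")
        self.epochs_since_best = 0

    def update(self, val_error):
        # Returns True when training should stop.
        if val_error < self.best:
            self.best = val_error
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.window
```

In session II, the inner network would be frozen at the epoch where `update` first returns True, or at the maximum epoch if it never does.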
To obtain some intuition about the learned 2-arg input nonlinearities, we first collect the values of every input to the nonlinearities (i.e. to the inner networks) over all test data at inference time. For display, we compute the pre-activation input distribution (Figure 4a), and show the nonlinearities over the region enclosing 99% of the input distribution (Figure 4b-c).

Training outer network for fixed inner network (session III)
Having learned multi-argument nonlinear activation functions, we now fix these inner networks and retrain the outer network to use them on new task data. We take $f_{\text{inner-net}}(\cdot)$ with its parameters as trained in session II, freeze the inner network, and then re-initialize the outer network. In this session, only the outer network is trained, as in typical training of a deep neural network with a canonical activation function (Figure 3c). The training curves in this stage are not qualitatively different from what we observed in session II (Figure 4d), indicating that most of the learning over long time intervals (epochs) is attributable to the change of parameters in the outer network. In other words, the learning of the multi-argument nonlinear activation function may be complete at an early stage, with the rest of learning dedicated to solving the classification tasks.
We thus look for evidence of structural stability of the inner network in early development by plotting the learned nonlinearities every epoch in session II. We find that the two-argument activation functions generally mature into typical two-dimensional spatial patterns within 1-5 epochs (Figure 5), suggesting that the overall spatial structure of the activation function emerges quite rapidly from pressures that arise early in the learning process.

Comparing to other nonlinearities
To provide context for the performance of our proposed approach, we compare against a single-argument nonlinearity. For fair comparison, we train the baseline models, whose architectures are depicted in Figure 6, just as we train our outer networks. The baseline models all involve the same MLP or CNN architecture, i.e. they use the same type and number of outer network layers as our proposed model.
When comparing different architectures we take care to use comparable numbers of learnable parameters in the classification tasks by systematically adjusting the number of hidden units or feature maps in each layer. Specifically, an MLP-based outer network with n-arg input nonlinearities (Figure 6a) contains $x(n h_1 + 1) + \sum_{\ell=1}^{L-1} n h_\ell h_{\ell+1} + h_L y + (65n + 4288)$ parameters, where $x$, $y$, and $h_\ell$ are the dimension of the input, the dimension of the output, and the number of units in hidden layer $\ell$, respectively. The last term represents the number of inner network parameters; this is independent of the input and output dimensions as well as the number of hidden layers $L$, so it does not increase the model complexity (due to parameter sharing). In contrast, the second term $\sum_{\ell=1}^{L-1} n h_\ell h_{\ell+1}$ dominates the parameter count, so our baseline model (Figure 6b) has $L$ layers, each comprising $\sqrt{n}\,h + \beta$ hidden units, where $\beta$ is a constant chosen to approximate the parameter count of the proposed model. This way of matching parameter counts in the MLP-based outer network applies also to CNN-based models, by setting $h_\ell$ to be the number of feature maps in convolutional layer $\ell$ instead of the number of hidden units.
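The parameter-count expression above can be evaluated directly, as in the sketch below. The function names are hypothetical, and the example dimensions in the usage note (a 784-dimensional input, 10 outputs, three hidden layers of 64 units, n = 2) are illustrative choices rather than figures reported here. The value of β is left as a free argument because its defining relation is not reproduced in this text.

```python
import math

def proposed_param_count(x, y, hidden, n):
    # x(n h_1 + 1) + sum_l n h_l h_{l+1} + h_L y + (65 n + 4288)
    first = x * (n * hidden[0] + 1)
    middle = sum(n * a * b for a, b in zip(hidden[:-1], hidden[1:]))
    last = hidden[-1] * y
    inner = 65 * n + 4288  # shared inner-network parameters
    return first + middle + last + inner

def baseline_width(h, n, beta=0):
    # Width of each ReLU baseline layer: sqrt(n) * h + beta, rounded.
    return round(math.sqrt(n) * h + beta)
```

For instance, `proposed_param_count(784, 10, [64, 64, 64], 2)` evaluates the expression for a two-argument MLP, and `baseline_width(64, 2)` gives the matching baseline width before any β correction.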
Figure 4d compares training performance of the two-input argument nonlinearity to networks using a ReLU or single-argument nonlinearity. We repeat the training of the nonlinearities on MNIST and CIFAR-10 4 times, which produces 4 different samples of model performance. We average the results across 4 samples and find that the models with learned activation functions achieve an overall strong performance (Figure 4d). Notably, Figure 4d suggests that our proposed network learns faster than the ReLU network and achieves better asymptotic performance, providing evidence for a better inductive bias in the network due to the learned multi-argument nonlinearities.

Explicit polynomial nonlinearities
The results outlined in the previous section focus on the predictive performance of multivariate nonlinear functions. We next turn our attention to the analysis of the structure learned by our multi-argument nonlinearities. We repeat four different trials of the learning experiment and collect samples of two-argument activation functions trained on MNIST and CIFAR-10, within MLP and CNN outer networks. Figure 7a-d (left columns) demonstrates that learned two-argument nonlinearities are reliably shaped like quadratic functions, varying by shifts and/or rotations. We therefore fit an algebraic quadratic functional form to the learned inner-network nonlinearities and find that each learned nonlinearity and its best-fit quadratic have extremely similar structure. This is the case even though the spatial patterns have different rotations (Figures 7a-d).
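A least-squares quadratic fit of this kind can be sketched as follows. The coefficient layout is an assumption: the quadratic is written as c1·x1² + c2·x2² + c3·x1·x2 + c4·x1 + c5·x2 + c6, where the roles of c1, c2, and c3 are chosen to agree with the curvature expression c1·c2 − c3²/4 used in the analysis that follows, while c4 to c6 and all function names are hypothetical. The fit solves the normal equations with plain Gaussian elimination, which is adequate for six coefficients.

```python
def features(x1, x2):
    # Monomials of a general two-variable quadratic.
    return [x1 * x1, x2 * x2, x1 * x2, x1, x2, 1.0]

def solve(A, b):
    # Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def fit_quadratic(points, values):
    # Returns [c1, ..., c6] minimizing the squared error over the samples.
    rows = [features(x1, x2) for x1, x2 in points]
    k = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve(AtA, Aty)
```

In practice `points` would be samples from the region of input space that the network actually visits, and `values` the inner-network outputs at those points.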
We next validate the specificity of the observed inner network output responses. It is clear by eye that the learned nonlinearities are substantially different from those produced by random functions (Figure 4b-c). However, this regular pattern of learned nonlinearities might also be obtainable by popular network initialization methods, such as Xavier weight initialization. To differentiate between these two possibilities, we compare the learned nonlinearities with inner nets initialized with Xavier random initialization (Glorot and Bengio, 2010) (Figure 7e). We find that the Xavier random initial activations, although not as "random" as those we generated ourselves (Figure 4b), are far from the regular quadratic patterns observed in the learned nonlinearities (Figure 7e). They instead evolve to display such smooth quadratic patterns during training (Figure 5b), suggesting that the quadratic structures we observe are not captured by standard weight initialization schemes, but are favored by the optimization process instead.
To test whether the learned quadratic functions have statistically significant sub-structure (for example, hyperbolic vs. elliptical or negative vs. positive curvature), we computed the curvature implied by the quadratic form above, $c_1 c_2 - c_3^2/4$ (Figure 7f-g). The convolutional architecture learned nonlinearities with negative curvatures for both tasks, a total of 78% of 48 trials (p = 0.007 according to a binomial null distribution with even odds of either curvature). This indicates a multiplicative interaction between the input features, and is consistent with a gating interaction or soft XOR. In contrast, the multilayer perceptron architecture produced more positive curvatures, but these were not statistically significant (p = 0.06 by the same test).
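The curvature measure and the binomial null test can be sketched as below. The function names are hypothetical, and the two-sided p-value is computed by doubling the larger-side tail, which is valid here because a binomial distribution with probability 1/2 is symmetric. Whether the reported p-values are one- or two-sided is not stated, so this is one plausible reading rather than the exact procedure.

```python
from math import comb

def curvature(c1, c2, c3):
    # Negative values indicate a hyperbolic (saddle, XOR-like) quadratic,
    # positive values an elliptical one.
    return c1 * c2 - c3 ** 2 / 4.0

def binomial_two_sided_p(k, n):
    # P-value for observing a split at least as uneven as k of n
    # under Binomial(n, 1/2).
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

As a check on the sign convention, the pure product x1·x2 has c1 = c2 = 0 and c3 = 1, so `curvature(0, 0, 1)` is negative, while a bowl such as x1² + x2² gives a positive value.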

Spectral Analysis
To further compare the structure of learned nonlinearities to the structure of Xavier-initialized ones, we also performed a spectral analysis on both. We computed spectra using basis functions $\phi(x)$ appropriate for the symmetry and boundary conditions of the nonlinearities: we used Hermite-Bessel functions (Victor et al., 2006) for the 2-argument functions, and solid harmonics for the 3-argument functions. We only evaluated the power in regions of the input space that were explored by the distribution $p(x)$ of their actual inputs. The power $P_\ell$ was computed accordingly, where $\ell$ is the analog of spatial frequency for these basis functions and $m$ is analogous to spatial phase. Figure 8 shows that the learned multi-argument nonlinearities have more higher-order structure than the Xavier-initialized ones. Randomly initialized networks favor strong dipole structure with $\ell = 1$. In contrast, the power spectra of learned nonlinearities are consistent with an underlying quadrupole structure, which has its strongest frequency content at $\ell = 2$. A soft XOR can be described by $f(x_1, x_2) = x_1 x_2$ or its rotations, which produces positive outputs in two opposite quadrants and therefore creates a quadrupole moment with negative curvature.
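The link between soft XOR and power at frequency 2 can be illustrated with a much simpler decomposition than the one used in this analysis. The sketch below swaps in an ordinary angular Fourier transform on a circle in place of the Hermite-Bessel basis and ignores the input distribution p(x), so it is only a toy illustration; the function name and defaults are hypothetical. On a circle of radius r, x1·x2 equals (r²/2)·sin 2θ, which places all of its power at angular frequency 2, whereas a linear function such as x1 = r·cos θ places it at frequency 1.

```python
import cmath
import math

def angular_power(f, ell, radius=1.0, n_angles=64):
    # Squared magnitude of the Fourier coefficient of f at angular
    # frequency ell, sampled on a circle of the given radius.
    total = 0j
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        value = f(radius * math.cos(theta), radius * math.sin(theta))
        total += value * cmath.exp(-1j * ell * theta)
    return abs(total / n_angles) ** 2
```

Comparing `angular_power(lambda a, b: a * b, 2)` with `angular_power(lambda a, b: a * b, 1)` shows the quadrupole-like signature of the product, and the reverse pattern appears for `lambda a, b: a`.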

Generalization
We now consider out-of-distribution generalization performance of the models for image classification with multi-argument nonlinear functions. In particular, we test whether these activation functions make the learned representations more robust against common image corruptions and adversarial perturbations. We quantify the robustness of the models against common corruptions and perturbations using the recently introduced CIFAR-10-C benchmark (Hendrycks and Dietterich, 2019) and parameter-free AutoAttack (Croce and Hein, 2020b).
Robustness against common image corruptions
CIFAR-10-C was designed to measure the robustness of classifiers against common image corruptions and contains 15 different corruption types applied to each CIFAR-10 validation image at 5 different severity levels.
The robustness performance on CIFAR-10-C is measured by the corruption error (CE). As seen in Figure 9, two-input argument nonlinearities significantly improve the robustness over the ReLU baseline model (mCE = 91.3%). Note that mCE scores lower than 100 indicate more success at generalizing to the corrupted distribution than the reference model. Moreover, the observed relative mCE (= 99.5%, which is less than 100) shows that the accuracy decline of the proposed model in the presence of corruptions is on average less than that of the network with ReLU. The results suggest that these corruption robustness improvements are attributable not only to the accuracy improvement on clean images, but also to the learnable multivariate nonlinearity producing representations that are more robust to natural corruptions than those of ReLU.
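The corruption metrics can be sketched as follows, using the definitions from the corruption benchmark of Hendrycks and Dietterich (2019): for each corruption type, the model's error summed over severity levels is divided by the reference model's summed error, and the relative variant first subtracts each model's clean error. Treating the ReLU network as the reference model is an assumption based on the surrounding text, and the function names are hypothetical.

```python
def corruption_error(model_errs, ref_errs):
    # CE for one corruption type; each list holds one error rate
    # per severity level.
    return sum(model_errs) / sum(ref_errs)

def relative_corruption_error(model_errs, model_clean, ref_errs, ref_clean):
    # Relative CE: degradation from clean error, normalized by the
    # reference model's degradation.
    num = sum(e - model_clean for e in model_errs)
    den = sum(e - ref_clean for e in ref_errs)
    return num / den

def mean_corruption_error(ce_values):
    # mCE (or relative mCE): average over corruption types.
    return sum(ce_values) / len(ce_values)
```

A value below 1 (below 100 when expressed as a percentage) means the model is more robust than the reference for that metric.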

Adversarial robustness
We next consider both black-box and white-box attacks to measure the robustness of the model against adversarial perturbations. We use the recently introduced AutoAttack (Croce and Hein, 2020b) combining two parameter-free versions of Projected Gradient Descent (PGD) (Madry et al

Discussion
The neurons in biological neural networks are much more intricate machines than the units they inspired in machine learning. Neural networks in machine learning, by contrast, have been dominated by scalar activation functions. At the same time, it is widely acknowledged that different design choices here can lead to different inductive biases, and architectures with new neural elements are proposed frequently. These elements are usually based on guesses or intuition. Interestingly, one of the most influential elements has been a multiplicative gating nonlinearity, seen in LSTMs (Hochreiter and Schmidhuber, 1997), GRUs (Chung et al., 2014), and transformers (Vaswani et al., 2017). Our experiments demonstrated that gating-like functions emerge automatically from learned multi-argument nonlinear activation functions, as the soft XOR can be interpreted as an operation that selects one dimension of its input and modulates or gates it by another input dimension. These learned functions have properties resembling dendritic interactions in biological neurons (Gidon et al., 2020). Networks endowed with these functions learn faster and are more robust.
Although these learnable nonlinearities add some complexity to a network, overall these extra inner network parameters are few in number since they are shared across all neurons in the outer network. Moreover, using algebraic polynomial approximations to the learned nonlinearities, as in section 4.3, can reduce both the number of parameters and the memory requirements of the inner networks in practical applications.
Nontrivial computations in a multilayer network require some sort of nonlinearity, since otherwise the whole network merely performs one linear transformation. The simplest nonlinearity is quadratic, whether the quadratic has negative curvature like a soft XOR, or a positive curvature like coincidence detection. It is interesting that even when allowing for more input arguments, the resultant learned nonlinearities still favor low-order quadratic functions (Figure 8b-d). This could be explained by an implicit bias toward smooth functions (Williams et al., 2019; Sahs et al., 2020) while still bending the input space to provide useful computations. Perhaps the learned nonlinearities are as random as possible while fulfilling these minimal conditions. It will be interesting to test this hypothesis by examining the transformations of multiple cell types, or those produced by higher-dimensional functions like network-in-network (Lin et al., 2013), and to see whether different tasks incentivize different computations.
Our study demonstrates that flexible multi-argument activation functions converge to reliable and interpretable patterns and provide computational benefits. However, our study has important limitations that should be addressed in future work. The performance benefits should be evaluated in more architectures and tasks, and at larger scales. There might be synergistic benefits from additional features like skip connections or global modulation. Some of the additional complexity afforded by multi-argument activation functions might be more useful when used in richer architectures, including those with recurrence, dedicated input types (e.g. distinct feedforward, feedback, and lateral interaction arguments), multiple cell types (Douglas and Martin, 1991; Shepherd, 2004), and more intricate dendritic substructures (Poirazi and Mel, 2001; Poirazi et al., 2003). Such biologically-inspired additions to neural network architectures could provide inductive biases closer to the inductive biases in biological brains (Sinz et al., 2019; Litwin-Kumar and Turaga, 2019).

Figure 1: Multi-argument nonlinearities in artificial neurons. Schematic of architecture including a multi-argument nonlinear activation function (purple triangles). These functions' two arguments are different linear weighted sums of features, and may correspond to distinct inputs such as apical and basal dendrites.

Figure 2: Overview of the proposed model structures. (a) Scalar nonlinear activation function ReLU (top) and MLP-based outer network with ReLU nonlinearities (bottom). (b) n-arg input MLP-based inner network (top; n = 2 in this figure) and the MLP-based outer network that replaces ReLU with the inner network above (bottom). The activation functions are marked by red boxes; everything drawn in black outside the red boxes belongs to the outer network. (c) 1 × 1 conv-based inner network (top) merged into conv-based outer network (bottom). The inner network takes inputs from different feature maps; thus the conv-based outer network requires slice and concatenation operations along the depth dimension before and after the inner network. The model schematics assume a two-input argument nonlinearity.

Figure 3: Training procedure. (a) Pretraining. Schematic of two-input argument inner network (green) trained to predict a smoothed random initial activation map (bottom). (b) Simultaneously training inner (red) and outer (black) networks. (c) Retraining outer network (black) with frozen inner networks (gray).

Figure 4: Learned nonlinearities learn tasks faster. Examples of (a) input distribution, (b) pretrained random initial nonlinearities, and (c) learned two-argument activation functions trained on two different data sets, CIFAR-10 and MNIST, within two different architecture types, a convolutional network and a multi-layer perceptron. Colors indicate the output of the activation function, masked to the best-trained part of the input distribution, i.e. for the 99% of input values that are most common. White bands show the crossing point between positive (blue) and negative (red) outputs. (d) Average test accuracy (solid line) ±1 SD (shaded region; n = 4 samples) of the 2-arg activation model (red) and the baselines (blue: ReLU, green: 1-arg activation) in session II (200 epochs) and session III (400 epochs). Networks with these two-argument nonlinearities learn faster than others.
If two-argument nonlinearities learned what is essentially a one-argument structure, we would see parallel bands of constant color. Instead, notably, all the examples show nontrivial two-dimensional structure, reflecting interactions between the two input arguments (see Section 4.3).

Figure 5: Evolution of learned two-argument activation functions. (a) Snapshot of random initial and learned nonlinear activation functions across development. (b) The same evolution of nonlinearity when it is Xavier-initialized.

Figure 6: Baseline architecture for parameter counts. (a) MLP-based outer network that has L hidden layers with h units (green) along with n-arg input nonlinearities (red). (b) Baseline model architecture with ReLU composed of L hidden layers with $\sqrt{n}\,h + \beta$ units (blue) in each layer.

Figure 7: Gating operations emerge naturally from learnable multi-argument nonlinear structures. (a-d) Left: Examples of learned multi-argument activation functions trained on CIFAR-10 and MNIST, within two different architecture types, CNN and MLP. Each row is a different repetition of the learning experiment. All examples show nontrivial two-dimensional structure, reflecting interactions between two input arguments. The majority show a (potentially rotated) white X shape, indicating a multiplicative interaction between the input features, and consistent with a gating interaction or soft XOR. (a-d) Right: The best-fit quadratics of the corresponding left nonlinearities. (e) Random activation functions generated from Xavier weight initialization. (f) Cumulative Distribution Function (CDF) of nonlinearity curvature. (g) Fraction of nonlinearities with negative (XOR-like) curvature. Even a set of random functions may by chance have nonzero average curvature. The CONV architectures show deviations that are outside of the 95% Confidence Interval (CI) of the null distribution (binomial distribution with probability of 1/2 for positive or negative curvature, for 24 trials).

Figure 8: Spectral Analysis. Nonlinearities for various architectures and tasks for (a) two-argument and (b) three-argument inner networks. (c-d) Power spectra for these learned functions (black curves) reveal larger power at $\ell = 2$ than spectra for Xavier-initialized inner networks (red), consistent with stronger quadrupolar structure. For the two-argument case, we used 64 learned functions and 24 randomly initialized functions. For the three-argument case, we used 8 learned functions for each. Example basis functions are shown beneath the horizontal axis to illustrate the spatial structure quantified by the frequency number.

Figure 9: Robustness of two-argument nonlinearities against common image corruptions. Corruption error (CE; bars), mCE (black solid line), and relative mCE (black dashed line) of different corruptions on CIFAR-10-C and Conv-based outer networks. The mCE is the mean corruption error of the corruptions in Noise, Blur, Weather, and Digital categories. Models are trained only on clean CIFAR-10 images.

Table 1: Robustness of adversarial defenses by AutoAttack. Numbers indicate average classification accuracy from 4 trials. Classifiers ({MLP, Conv} outer-net × {2-arg, 1-arg, ReLU} inner-net) are trained for $\ell_\infty$-robustness. For each classifier we report the accuracy on the robustness test, at the $\epsilon$ specified in the table, on the whole test set obtained by the ensemble AutoAttack. This method counts an attack successful when at least one of the four attacks finds an adversarial example (worst case evaluation).