Representation Learning in Sensory Cortex: A Theory

We review and apply a computational theory based on the hypothesis that the main function of the feedforward path of the ventral stream in visual cortex is the encoding of invariant representations of images. A key justification of the theory is provided by a result linking invariant representations to small sample complexity for image recognition - that is, invariant representations allow learning from very few labeled examples. The theory characterizes how an algorithm that can be implemented by a set of ''simple'' and ''complex'' cells - a ''Hubel Wiesel module'' - provides invariant and selective representations. The invariance can be learned in an unsupervised way from observed transformations. Our results show that an invariant representation implies several properties of the ventral stream organization, including the emergence of Gabor receptive fields and specialized areas. The theory requires two stages of processing: the first, consisting of retinotopic visual areas such as V1, V2 and V4 with generic neuronal tuning, leads to representations that are invariant to translation and scaling; the second, consisting of modules in IT (Inferior Temporal cortex), with class- and object-specific tuning, provides a representation for recognition with approximate invariance to class-specific transformations, such as pose (of a body, of a face) and expression. In summary, our theory is that the ventral stream's main function is to implement the unsupervised learning of ''good'' representations that reduce the sample complexity of the final supervised learning stage.

in different layers the motif of simple and complex cells in V1, have led to a series of quantitative models from Fukushima [36] and Riesenhuber and Poggio [106] HMAX (Hierarchical architecture with MAX pooling) to more recent architectures based on contrastive [20], [139] or slow features [109] learning. These models are increasingly faithful to biological architecture constraints and are able to mimic properties of cells in different visual areas while achieving human-like recognition performance under restricted conditions. Starting from the architectures in [45], [112], and [138], deep learning convolutional networks, which are hierarchical but otherwise do not respect the ventral stream architecture and physiology, have been trained with very large labeled datasets. The resulting models' neuron populations accurately mimic the object recognition performance of the macaque visual cortex (e.g., [17], [59], [132], [133], and [139]). However, the nature of the computations carried out in the ventral stream is not fully explained by such models that, despite being simulated on a computer, remain rather opaque. In other papers (in particular [6], [7], [10], and [102]) we have developed a mathematics of invariance that can be applied to the ventral stream.

Invariance and equivariance

• We prove that Gabor functions are the optimal templates for maximizing simultaneous invariance to translation and scale (section IIC). Hierarchies of HW modules retain their properties, while alleviating the problem of clutter in the recognition of wholes and parts (sections IIE, IIF).
• We show that the same HW modules at high levels in the hierarchy are able to compute representations which are approximately invariant to a much broader range of transformations (e.g., 3D expression of a face, pose of a body, and viewpoint). They do so by using templates, reflected in neurons' tuning, that are highly specific for each object class (sections IID, IIE).

• We describe (section III) how neuronal circuits may implement the operation required by the HW algorithm. We specifically discuss new models of simple and complex cells in V1 (sections IIIA, IIIB). We also introduce plausible biophysical mechanisms for tuning, pooling, and learning the wiring based on Hebbian-like unsupervised learning (sections IIIC, IIID, IIIE).

• We explain (section IV) how the final IT stage computes class-specific representations that are quasi-invariant to non-generic transformations. We also discuss the modular organization of anterior IT in terms of the theory; in particular, proposing an explanation of the architecture and of some puzzling properties of the face patches system.

We conclude the paper with a discussion of predictions to be tested and open problems (section V).

For this paper, we will use the following conceptual framework for primate vision:
• The first 100 ms of vision in the ventral stream are mostly feedforward. The main computational goal is to generate a number of image representations, each one invariant or approximately invariant to some transformations experienced during development and at maturity, such as scaling, translation, and pose changes. The representations are used to answer basic questions about image type and what may be in it.

• The answers will often have low confidence, requiring an additional ''verification/prediction step'', which may require a sequence of shifts of gaze and attentional changes. This step may rely on generative models and probabilistic inference and/or on top-down visual routines following memory access. Routines that can be synthesized on demand as a function of the visual task are needed to go beyond object classification.

We consider only the feedforward architecture of the ventral stream and its computational function. To help the reader to more easily understand the mathematics of this section, we give here an overview of the network of visual areas that we propose for computing invariant representations in feedforward visual recognition. There are two main stages: the first one computes a representation that is invariant to affine transformations, followed by a second stage that computes approximate invariance to object-specific, non-group transformations.

[Figure 1 caption: good performance (black line) was obtained from a single training image from each rectified class, using a linear classifier operating on pixels, whereas training from the unrectified training set yields chance performance.]

In other words, the sample complexity of the problem becomes much lower with a rectified (and therefore invariant) representation ([7], [110]). This implies that in many cases, recognition (both identification, e.g., of a specific car relative to other cars, as well as categorization, e.g., distinguishing between cars and airplanes) would be much easier (only a small number of training examples would be needed to achieve a given level of performance) if the object images were rectified with respect to all transformations, or equivalently, if the image representations were invariant.
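The effect of rectification on sample complexity can be sketched numerically. The toy below is our illustration, not the experiment of Figure 1: two ''classes'' are fixed random 1D patterns observed under unknown cyclic shifts, the classifier is one-nearest-neighbor with a single labeled example per class, and the ''rectified'' representation is the Fourier magnitude, which is exactly shift-invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # signal length

# Two "object classes": fixed random patterns, observed under unknown shifts.
class_a = rng.standard_normal(D)
class_b = rng.standard_normal(D)

def fft_mag(x):
    # Magnitude of the Fourier transform: an exactly shift-invariant
    # "rectified" representation standing in for the oracle of Figure 1.
    return np.abs(np.fft.fft(x))

# One labeled training example per class (sample complexity = 1).
train = {"a": class_a, "b": class_b}

def classify(x, rep):
    # Nearest neighbor in the chosen representation.
    return min(train, key=lambda c: np.linalg.norm(rep(x) - rep(train[c])))

shifts = range(1, D)
raw = lambda v: v
raw_acc = np.mean([classify(np.roll(class_a, s), raw) == "a" for s in shifts] +
                  [classify(np.roll(class_b, s), raw) == "b" for s in shifts])
inv_acc = np.mean([classify(np.roll(class_a, s), fft_mag) == "a" for s in shifts] +
                  [classify(np.roll(class_b, s), fft_mag) == "b" for s in shifts])
print(raw_acc, inv_acc)  # raw pixels near chance; invariant representation perfect
```

With the invariant representation a single example per class suffices, while the raw-pixel classifier stays near chance, mirroring the black versus gray curves described for Figure 1.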
The case of identification is obvious since the difficulty in recognizing exactly the same object, e.g., an individual face, is due solely to transformations. Figure 1 provides suggestive evidence from a classification task, showing that if an oracle factors out all transformations in images of many different cars and airplanes, providing ''rectified'' images with respect to viewpoint, illumination, position and scale, the problem of categorizing cars vs airplanes becomes easy: it can be done accurately with very few labeled examples. In this case, good performance was obtained from a single training image of each class, using a simple classifier. In other words, the sample complexity of the problem seems to be very low. A proof of the conjecture for the special case of translation is provided in [7] for images defined on a grid of pixels, with the main results restated below.

The following HW algorithm is biologically plausible, as we will discuss in further detail in sections II and III, where we argue that it may be implemented in cortex by a HW module.

The module consists of a set of KH complex cells with the same receptive field, each pooling the output of a set of simple cells whose sets of synaptic weights correspond to one of the K ''templates'' of the algorithm and its transformations (which are also called templates) and whose output is filtered by a sigmoid function with threshold h, h = 1, · · · , H.

1) Store the transformed templates g_i t (i = 0, · · · , |G|) observed over a time interval (thus the stored sequence is g_0 t, g_1 t, · · · , g_|G| t for template t; for template t_k the corresponding sequence of transformed templates is denoted G_k).

2) Repeat for each of the K templates.
[Figure 2 caption: the values (eq. 1 in the main text) correspond to the histogram, where k = 1 denotes the template ''green blackboard'', h the bins of the histogram, and the transformations are from the rotation group.]

Crucially, mechanisms capable of computing invariant representations under affine transformations can be learned (and maintained) in an unsupervised, automatic way just by storing sets of transformed templates which are unrelated to the object to be represented. In particular, the templates could be random patterns.
3) The signature is the set of K cumulative histograms, that is, the set of

μ_k^h (I) = (1/|G|) Σ_{i=1}^{|G|} σ( ⟨I, g_i t_k⟩ + hΔ )

where I is an image, σ is a threshold function, Δ > 0 is the width of a bin in the histogram, and h = 1, · · · , H is the index of the bins of the histogram.

The algorithm consists of two parts: the first is unsupervised learning of transformations by storing transformed templates, which are ''images''. This can be thought of as an ''only once'' stage, possibly done during development of the visual system. The second part is the actual computation of invariant signatures during visual perception.
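As a concrete (and much simplified) sketch, the two parts of the HW algorithm can be written down for the cyclic shift group, with dot products standing in for simple cells and threshold units computing the cumulative histogram standing in for complex cells; the sizes and thresholds below are arbitrary illustrative choices, not values from the theory.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, H = 32, 4, 8  # image size, number of templates, histogram bins

# Part 1 (unsupervised, done "only once"): store each template under all
# transformations of the group -- here cyclic shifts, so |G| = D.
templates = rng.standard_normal((K, D))
stored = np.stack([[np.roll(t, g) for g in range(D)] for t in templates])  # (K, |G|, D)

def signature(image):
    # Part 2: for each template, the cumulative histogram over the group
    # of the dot products <image, g t_k> (simple cells), one bin per
    # threshold unit (complex cells).
    thresholds = np.linspace(-2.0, 2.0, H)
    sig = np.empty((K, H))
    for k in range(K):
        dots = stored[k] @ image
        sig[k] = [(dots <= th).mean() for th in thresholds]
    return sig

I = rng.standard_normal(D)
I /= np.linalg.norm(I)
# Invariance: shifting the image only permutes the set {<I, g t_k>},
# so the histogram -- and hence the signature -- is unchanged.
print(np.allclose(signature(I), signature(np.roll(I, 5))))  # True
```

The invariance holds for any templates, including random ones, because transforming the image merely permutes the pooled set of simple-cell responses.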

This is the algorithm used throughout the paper. The guarantees we can provide depend on the type of transformations. The main questions are a) whether the signature is invariant under the same types of transformations that were observed in the first stage and b) whether it is selective, e.g., can it distinguish between N different objects. A summary of the main results of [5], [6], [7], [8], and [10] is that the HW algorithm is invariant and selective (for K in the order of log N) if the transformations form a group. In this case, any set of randomly chosen templates will work for the first stage. Given that we are interested in transformations from a 2D image to a 2D image, the natural choice is the affine group consisting of translations, rotations in the image plane, scaling (possibly non-isotropic), and their compositions. The HW algorithm can learn with exact invariance and a desired selectivity level in the case of the affine group or its subgroups. In the case of 3D ''images'' consisting of voxels with x, y, z coordinates, rotations in 3D are also a group that in principle can be dealt with, achieving exact invariance.

Moments of the distribution, such as the average or the variance (energy model of complex cells, see [3]) or the max, can effectively replace the cumulative distribution function. Notice that any linear combination of the moments is also invariant and a small number of linear combinations is likely to be sufficiently selective. We will discuss implications of this remark for models of complex cells in the last section.

The previous results apply to all groups, in particular to those which are not compact but only locally compact, such as translation and scaling. In this case it can be proved that invariance holds within an observable window of transformations [6], [7].
For the HW module the observable window corresponds to the receptive field of the complex cell (in space and scale). In order to maximize the range of invariance within the observable window, it is proved in [6] and [7] that the templates must be maximally sparse relative to generic input images (see below for a definition of sparseness). In the case of translation and scale invariance, this sparsity requirement is equivalent to localization in space and spatial frequency, respectively: templates must be maximally localized for maximum range of invariance in order to minimize boundary effects due to the finite window. Assuming therefore that the templates are required to have simultaneously a minimum size in space and spatial frequency, it follows from the results of Gabor ([40], see also [29]) that they must be Gabor functions. The following surprising property holds:

Optimal invariance result
Gabor functions of the form (here in 1D) e^(−x²/2σ²) e^(iωx) are the templates that are simultaneously maximally invariant for translation and scale (at each x and ω).
In general, templates chosen at random from the space of images can provide scale and position invariance. However, for optimal invariance under scaling and translation, templates of the Gabor form are optimal. This is the only computational justification we know of for the Gabor shape of simple cells in V1, which seems to be remarkably universal: it holds in primates [107], cats [61] and mice [93] (see also Figure 3 for results of simulations and [92]).

[Figure 3 caption: a) Each ''cell'' sees a set of images through a Gaussian window (its dendritic tree), shown in the top row. Each cell then ''learns'' the same weight vector, extracting the principal components of its input. b) This figure shows n_y = σ_y/λ vs n_x = σ_x/λ for the modulated (x) and unmodulated (y) direction of the Gabor wavelet. Notice that the value of the slope σ_y/σ_x is a robust finding in the theory and apparently also in the physiology data. Neurophysiology data from monkeys, cats and mice are reported together with our simulations.]

The main implication is that approximate invariance can be obtained for non-group transformations by using templates specific to the class of objects. This means that class-specific modules are needed, one for each class; each module requires highly specific templates, that is, cell tunings. The obvious example is face-tuned cells in the face patches. Unlike exact invariance for affine transformations, where tuning of the ''simple cells'' is non-specific in the sense that it does not depend on the type of image, non-group transformations require highly tuned neurons and yield at best only approximate invariance (see, e.g.,
[46], [87], [105], and [137]). It is illuminating to consider two extreme ''cartoon'' architectures for the first of the two stages described at the beginning of section II:

• one layer comprising one HW module and its KH complex cells, each one with a receptive field covering the whole visual field;
• a hierarchy comprising several layers of HW modules with receptive fields of increasing size, followed by parallel modules, each devoted to invariances for a specific object class.

In the first architecture invariance to affine transformations is obtained by pooling over KH templates, with each one transformed in all possible ways: each of the associated simple cells corresponds to the transformation of a template. Invariance over affine transformations is obtained by pooling over the whole visual field. In this case, it is not obvious how to incorporate invariance to non-group transformations directly in this one-hidden-layer architecture.

Notice however that a HW module dealing with non-group transformations can be added on top of the affine module. The results in [5] and [7] allow for this factorization. Interestingly, they do not allow in general for factorization of translation and scaling (e.g., one layer computing translation invariance and the next computing scale invariance). Instead, what the mathematics allows is factorization of the range of invariance for the same group of transformations (see also [5] par 3.6-7-8-9). This justifies the first layers of the second architecture, corresponding to Figure 4 stage 1, where the size of the receptive field of each HW module and the range of its invariance increases from lower to higher layers.

[Footnote 2: Notice that because images are filtered by the retina with spatial bandpass filters (ganglion cells), the input to visual cortex is a rather sparse pattern of activities, somewhat similar to a sparse edge map.]

The equivariance property is also very important in modern neural networks
(see, e.g., [23] and [24]). It turns out that the architectures we describe have this property (see [5] and [7] par 3.5.3 for the translation case): isotropic architectures, like the ones considered in this paper, with point-wise nonlinearities are equivariant. The key difference from the architecture described above is that equivariance can be achieved when the complex cells pool over simple cell responses coming from templates transforming w.r.t. a subset of all group transformations. In this way the complex first-layer representation will be invariant to ''small'' transformations (e.g., small translations) but still carry information about ''large'' transformations (equivariance). Since each module in the architecture gives an invariant output if the transformed object is contained in the pooling range, and since the pooling range increases from one layer to the next, there is invariance over larger and larger transformations. The second point is that in order to make recognition possible for both parts and wholes of an image, the supervised classifier should receive signatures not only from the top layer (as in most modern neural architectures) but from the other levels as well (directly or indirectly).

The second biophysical model for the HW module that implements the computation required by our theory consists of a single cell where dendritic branches play the role of simple cells (each branch containing a set of synapses with weights providing, for instance, Gabor-like tuning of the dendritic branch) with inputs from the LGN (lateral geniculate nucleus); active properties of the dendritic membrane distal to the soma provide separate threshold-like nonlinearities for each branch separately, while the soma sums the contributions from all the branches.
This model would solve the puzzle that there seems to be no morphological difference between pyramidal cells classified as simple vs complex by physiologists.
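A minimal numerical sketch of this single-cell model follows; it is our illustration rather than a biophysical simulation. A rectified-linear branch nonlinearity stands in for active dendritic conductances, and cyclic shifts stand in for the stored transformations of a single template.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 64  # number of LGN afferents

# One dendritic branch per transformed template: the synaptic weights on
# each branch are a shifted copy of a single stored template.
template = rng.standard_normal(D)
branch_weights = np.stack([np.roll(template, g) for g in range(D)])

def complex_cell(image, theta=0.5):
    # Each branch applies a threshold-like nonlinearity to its synaptic
    # drive (distal dendritic membrane); the soma sums the branches.
    branch_out = np.maximum(branch_weights @ image - theta, 0.0)
    return branch_out.sum()

I = rng.standard_normal(D)
# Summing over branches tuned to all transformations makes the somatic
# output invariant: a shift of the input permutes the branch responses.
print(np.isclose(complex_cell(I), complex_cell(np.roll(I, 7))))  # True
```

The same cell thus behaves as ''simple'' at the level of a branch and ''complex'' at the soma, consistent with the absence of a morphological simple/complex distinction.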

It is interesting that our theory is robust with respect to the nonlinearity from the simple to the complex ''cells''. We conjecture that almost any set of nontrivial nonlinearities will work.

…for artificial neural networks, e.g., [13], [26], [134]. However their biological plausibility is not clear (see also [47]). The question is whether or not such a mechanism is compatible with our theory and how to implement it if so. We explore this question for V1 in a simplified setup that can be extended to other areas. We assume:
• a) that the synapses between LGN inputs and (immature) simple cells are Hebbian and in particular that their dynamics follows Oja's flow [64], [95]. In this case, the synaptic weights will converge to the eigenvector with the largest eigenvalue of the covariance of the input images.

• b) that the position and size of the untuned simple cells are set during development according to an inverted pyramidal lattice (see Figure 3 in [102]). The key point here is that the size of the Gaussian spread of the synaptic inputs and the positions of the ensemble of simple cells are assumed to be set independently of visual experience.

In summary, we assume that the neural equivalent of the memorization of frames (of transforming objects) is performed online via Hebbian synapses that change as an effect of visual experience. Specifically, we assume that the distribution of signals ''seen'' by a maturing simple cell is Gaussian in x, y, reflecting the distribution on the dendritic tree of synapses from the lateral geniculate nucleus. We also assume that there is a range of Gaussian distributions with different σ variances which increase with retinal eccentricity. As an effect of visual experience the weights of the synapses are modified by a Hebb rule [50]. Hebb's original rule can be written as

Δw_n = α y_n x_n

where α is the ''learning rate'', x_n is the input vector, w is the presynaptic weights vector, and y is the postsynaptic response. In order for this dynamical system to actually converge, the weights have to be normalized. In fact, there is considerable experimental evidence that the cortex employs normalization (see [130] and references therein). Hebb's rule, appropriately modified with a normalization factor, turns out to be an online algorithm to compute PCA from a set of input vectors. In this case it is called Oja's flow. Oja's rule [64], [95] defines the change in presynaptic weights w given the output response y of a neuron given its inputs x to be

Δw_n = w_{n+1} − w_n = α y_n (x_n − y_n w_n)    (7)

where y_n = w_n^T x_n. The equation follows from expanding Hebb's rule, normalized to avoid divergence of the weights, to first order.
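Oja's rule (eq. 7) can be checked numerically in a few lines. The data below are synthetic, with one dominant direction of variance standing in for the covariance structure of the LGN inputs; the learning rate and sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic inputs with one dominant direction of variance u.
u = np.array([3.0, 1.0]); u /= np.linalg.norm(u)
X = rng.standard_normal((5000, 2)) + rng.standard_normal((5000, 1)) * 4 * u

w = rng.standard_normal(2); w /= np.linalg.norm(w)
alpha = 1e-3
for x in X:
    y = w @ x                      # postsynaptic response, y_n = w_n^T x_n
    w += alpha * y * (x - y * w)   # Oja's rule (eq. 7)

# w converges (up to sign) to the top eigenvector of the input covariance.
top = np.linalg.eigh(np.cov(X.T))[1][:, -1]
print(abs(w @ top))  # close to 1
```

The subtractive term −α y² w is exactly the first-order normalization of the Hebb rule mentioned above; without it the weights would grow without bound.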

Since the Oja flow converges to the eigenvector of the covariance matrix of the x_n which has the largest eigenvalue, we analyze the spectral properties of the inputs to ''simple'' cells and study whether a PCA computation can be used by the HW algorithm and in particular whether it satisfies the selectivity and invariance results.

Alternatives to Oja's rule that still converge to PCAs can also be considered (see [113] and [96]). Also notice that a relatively small change in the Oja rule …

Suppose a maturing simple cell is exposed to the transformations of one of K templates. The cell is thus exposed to a set of images (columns of X), X = (g_1 T, . . . , g_|G| T). For the sake of this example, assume that G is the discrete equivalent of a group.

Then the covariance matrix determining the Oja flow is

C = X X^T = Σ_{i=1}^{|G|} (g_i T)(g_i T)^T.

It is immediately clear that if φ is an eigenvector of C then g_i φ is also an eigenvector with the same eigenvalue (for more details on what receptive fields look like in V1 and higher layers see also [5], [101] par 4.3.1 and 4.7.3, [10], [41], [42], [51]).

[Footnote 3: Suppose that the simple cells are exposed to patterns and their scaled and translated versions. Suppose further that images are defined on a lattice and translations and scaling (a discrete similitude group) are carefully defined on the same lattice. Then a set of discrete orthogonal wavelets - defined in terms of discrete dilation and shifts - exists and is invariant under the group. The Oja rule (extended beyond the top eigenvector) could converge to specific wavelets.]

Consider as an example G the discrete rotation group in the plane: then all the (discrete) rotations of an eigenvector are also eigenvectors. The Oja rule will converge to the eigenvectors with the top eigenvalue and thus to the subspace spanned by them. It can be shown that L2 pooling over the PCAs with the same eigenvalues represented by different simple cells is then equivalent to L2 pooling over transformations, as the theory of section II.B dictates, in order to achieve selectivity and invariance ([5] par 4.6.1 and [10]). This argument can be formalized in the following variation of the pooling step in the HW algorithm (the Spectral pooling proposition used below).

The results of section II on the HW module imply that the templates, and therefore the tuning of the simple cells, can be the image of any object. At higher levels in the hierarchy, the templates are neuroimages - patterns of neural activity - induced by actual images in the visual field. The previous section, however, offers a more biologically plausible way to learn the templates from unsupervised visual experience via Hebbian plasticity.
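Both claims above, that the eigenvectors of C come in group orbits and that L2 pooling over a degenerate eigenspace equals pooling over transformations, are easy to verify numerically for the cyclic shift group, where C is circulant and the doubly degenerate eigenspaces are spanned by cosine/sine pairs at a fixed frequency. This is a sketch under those simplifying assumptions, not the general proof.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 16
T = rng.standard_normal(D)

# Covariance of the transformed templates g_i T for the cyclic shift group.
X = np.stack([np.roll(T, g) for g in range(D)])
C = X.T @ X / D

# If phi is an eigenvector of C, so is the shifted g phi, with the same
# eigenvalue, because the circulant C commutes with every shift.
vals, vecs = np.linalg.eigh(C)
phi, lam = vecs[:, -1], vals[-1]
print(np.allclose(C @ np.roll(phi, 1), lam * np.roll(phi, 1)))  # True

# A cos/sin pair at one frequency spans a doubly degenerate eigenspace of
# any circulant C; L2 pooling over it gives the image's power at that
# frequency, which is invariant to shifts.
k, n = 3, np.arange(D)
pair = np.stack([np.cos(2 * np.pi * k * n / D), np.sin(2 * np.pi * k * n / D)])
energy = lambda img: np.sum((pair @ img) ** 2)
I = rng.standard_normal(D)
print(np.isclose(energy(I), energy(np.roll(I, 5))))  # True
```

Shifting the input rotates the (cos, sin) projection pair within the eigenspace, so their summed squares, the L2-pooled output of the corresponding ''simple'' cells, are unchanged.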
In the next sections we will discuss predictions following from these assumptions for the tuning of neurons in the various areas of the ventral stream.

As discussed in section IIE, approximate invariance for transformations beyond the affine group requires highly tuned templates, and therefore highly tuned simple cells, probably at a level in the hierarchy corresponding to AIT (anterior inferotemporal cortex). According to the considerations of section IIF this is expected to take place in higher visual areas of the hierarchy. In fact, the same localization condition of Equation 4 suggests Gabor-like templates for generic images in the first layers of a hierarchical architecture and specific tuned templates for the last stages of the hierarchy, since class-specific modules are needed, with each containing highly specific templates, and thus highly tuned cells. This is consistent with the architecture of the ventral stream and the existence of class-specific modules in primate cortex such as a face module and a body module [28], [30], […].

However, as our theory shows, approximate invariance to smooth non-group transformations can still be achieved in several cases (but not always) using the same HW module.

The reason this will often approximately work is because a new image can be assigned to the module whose stored templates and their transformations predict the new image's transformations well. If no module can be found with this property, the new image and all its transformations will be the seed of a new object cluster/module.

For the special case of rotation in depth, [77] ran a simulation using 3D modelling/rendering software to obtain the transformations of objects for which there exist 3D models. Faces had the highest degree of clustering of any naturalistic category - unsurprising since recognizability likely influenced face evolution. A set of chair objects had broad clustering, implying that little invariance would be obtained from a chair-specific region. A set of synthetic ''wire'' objects, very similar to the ''paperclip'' objects used in several classic experiments on view-based recognition (e.g., [12], [82], and [83]), were found to have the smallest index of clusterability: experience with familiar wire objects does not transfer to new wire objects (because the 3D structure is different for each individual paperclip object).

It is instructive to consider the limit case of object classes that consist of single objects - such as individual paperclips. If the object is observed under rotation, several frames are memorized as transformations of a single template (identity is implicitly assumed to be conserved by a Foldiak-like rule, as long as there is continuity in time of the transformation). The usual HW module pooling over them will allow view-independent recognition of the specific object. A few comments: in [100] the similarity operation was the Gaussian of a distance - instead of the dot product required by our theory. Notice that for normalized vectors, L2 norms and dot products are equivalent.

In the case of ''simple'' neurons in the AL face patch [37], [39], [77], exposure to several different faces - each one generating several images corresponding to different rotations in depth - yields a set of views with a covariance function which has eigenvectors (PCs) that are either even or odd functions (because faces are bilaterally symmetric).

The Class-specific result together with the Spectral pooling proposition suggests that square pooling (over these face PCs) provides approximate invariance to rotations in depth. The full argument goes as follows. Rotations in depth of a face around a certain viewpoint - say θ = θ_0 - can be well approximated by linear transformations (by transformations in the general linear group, g ∈ GL(2)). The HW algorithm can then provide invariance around θ = θ_0. Finally, if different sets of ''simple'' cells are plastic at somewhat different times, exposure to a partly different set of faces yields different eigenvectors summarizing different sets of faces. The different sets of faces play the role of different object templates in the standard theory.

The limit case of object classes that consist of single objects is important to understand the functional architecture of most of IT. If an object is observed under transformations, several images of it can be memorized and linked together by continuity in time of the transformation. As we mentioned, the usual HW module pooling over them will allow view-independent recognition of the specific object. Since this is equivalent to the Edelman-Poggio model for view invariance [100] there is physiological support for this proposal (see […]).

The theory then offers a direct interpretation of the Tsao-Freiwald data (see [37], [38], and [39]) on the face patch system. The most posterior patches (ML/MF) provide a view- and identity-specific input to the anterior patch AL, where most neurons show tuning that is an even function of the rotation angle around the vertical axis. AM, which receives inputs from AL, is identity-specific and view-invariant. The puzzling aspect of these data is the mirror-symmetric tuning in AL: why does this appear in the course of a computation that leads to view-invariance? According to our theory this result should be expected if AL contains ''simple'' cells that are tuned by a synaptic Hebb-like Oja rule and the output of the cells is roughly a squaring nonlinearity, as required by the Spectral pooling proposition. In this interpretation, cells in AM pool over several of the squared eigenvector filters to obtain invariant second moments (see Figure 6).
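The mirror-symmetry argument can be sketched numerically: if the set of training views is closed under left-right reflection (as it is for a bilaterally symmetric face seen at ±θ), the covariance commutes with the reflection, its eigenvectors are even or odd, and a squaring output is identical for a view and its mirror image. The random vectors below are stand-ins for face views, not rendered faces.

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 12, 400
R = np.eye(D)[::-1]            # left-right reflection operator

# Training views closed under reflection: every view appears with its
# mirror image, as for a bilaterally symmetric face at +theta and -theta.
V = rng.standard_normal((N, D))
V = np.vstack([V, V @ R])
C = V.T @ V / len(V)

# C commutes with R, so each eigenvector is even or odd under reflection.
vals, vecs = np.linalg.eigh(C)
parity = np.array([phi @ R @ phi for phi in vecs.T])  # +1 even, -1 odd
print(np.allclose(np.abs(parity), 1.0))  # True

# A squaring output ("energy") is therefore mirror-symmetric: the cell
# responds identically to a view and its reflection, as in AL.
v = rng.standard_normal(D)
print(np.allclose((vecs.T @ v) ** 2, (vecs.T @ (R @ v)) ** 2))  # True
```

The even/odd parity of the learned filters reproduces the AL observation, while pooling the squared responses (as AM is proposed to do) yields a view-invariant second moment.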

• IT is a complex of parallel class-specific modules for a large number of large and small object classes. These modules receive position- and scale-invariant inputs (invariance in the inputs greatly facilitates unsupervised learning of class-specific transformations). We recall that, from the perspective of the theory, the data of [83] concern single-object modules and strongly support the prediction that exposure to a transformation leads to neuronal tuning to several ''frames'' of it.

Object-based vs 3D vs view-based recognition. We should mention here an old controversy about whether visual recognition is based on views or on 3D primitive shapes called geons. In the light of our theory, image views retain the main role, but ideas related to 3D shape may also be valid. The psychophysical experiments of Edelman and Buelthoff [32] concluded that generalization for rotations in depth was limited to a few degrees (≈ ±30 degrees) around a view, independently of whether 2D or 3D information was provided to the human observer (psychophysics in monkeys [82], [83] yielded similar results). The experiments were carried out using ''paperclip'' objects with random 3D structure (or similar but smoother objects). For this type of object, class-specific learning is impossible (they do not satisfy the second condition in the class-specific result) and thus our theory predicts the result obtained by Edelman and Buelthoff. For other objects, however, such as faces, the generalization that can be achieved from a single view by our theory can span a much larger range than ±30 degrees, effectively exploiting 3D-like information from templates of the same class.

Genes or learning. Our theory shows how the tuning of the ''simple'' cells in V1 and other areas could be learned in an unsupervised way. It is possible, however, that the tuning - or the ability to quickly develop it in interaction with the environment - may have been partly compiled during evolution into the genes.5 Notice that this hypothesis implies that most of the time the specific function is not fully encoded in the genes: genes facilitate learning but do not replace it completely. It is then to be expected in the ''nature vs nurture debate'' that usually nature needs nurture and nurture is made easier by nature. An interesting result in this respect comes from a recent paper [66] where the authors propose a genetic bottleneck as an effective regularizer that enables evolution to select simple circuits that can be readily adapted to important real-world tasks (see also [48]).

5 If a function learned by an individual represents a significant evolutionary advantage, we could expect that aspects of learning the specific function may be encoded in the genes, since an individual who learns more quickly has a significant advantage. In other words, the hypothesis implies a mix of nature and nurture in most competencies that depend on learning from the environment (like perception). This is an interesting implication of the ''Baldwin effect'' - a scenario in which a character or trait change occurring in an organism as a result of its interaction with its environment becomes gradually assimilated into its developmental genetic or epigenetic repertoire [27]. In the words of Daniel Dennett: ''Thanks to the Baldwin effect, species can be said to pretest the efficacy of particular different designs by phenotypic (individual) exploration of the space of nearby possibilities. If a particularly winning setting is thereby discovered, this discovery will create a new selection pressure: organisms that are closer in the adaptive landscape to that discovery will have a clear advantage over those more distant.'' (p. 69, quoting [27]).

Computational structure of the HW module. The HW module computes the CDF of ⟨I, g_i t^k⟩ over all g_i ∈ G. The computation consists of [...]. Learning the invariance, rather than hardwiring it, can potentially give an advantage in terms of sample complexity w.r.t. hardwired translation-invariant convolutional networks. Note however that, differently from state-of-the-art CNNs (Convolutional Neural Networks), our proposed learning algorithm is unsupervised, and a direct comparison could be potentially misleading. In general, supervised methods are nowadays superior to unsupervised ones, although unsupervised models based on contrastive embeddings [139], biologically plausible backpropagation [57], and contrastive predictive coding [11] might offer an alternative. A more appropriate comparison can be made with architectures where the signal representation is learned in an unsupervised way and supervision is only used to adapt the representation to the specific task(s) [68]. Finally, note that our unsupervised model differs from those cited above in that we make the hypothesis that the input is a collection of group-transformed sensory signals. We take advantage of this data structure in deriving the equivariant and invariant properties of the representation.
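As a minimal concrete sketch (our own construction; using cyclic shifts as the group G and a histogram as the discretized CDF are assumptions), the ''simple'' cells compute ⟨I, g_i t⟩ for all transformations g_i of a template t, and a ''complex'' cell pools them into a histogram approximating the CDF, which is unchanged when the input itself is transformed:

```python
import numpy as np

def hw_module_signature(image, template, n_bins=10):
    """One Hubel-Wiesel module. 'Simple' cells: dot products of the image
    with every cyclic shift of the template. 'Complex' cell: the empirical
    distribution (histogram ~ discretized CDF) of those responses, which is
    unchanged when the image itself is cyclically shifted."""
    n = len(template)
    responses = [image @ np.roll(template, i) for i in range(n)]
    hist, _ = np.histogram(responses, bins=n_bins, range=(-1.0, 1.0))
    return hist / n

rng = np.random.default_rng(0)
t = rng.standard_normal(32); t /= np.linalg.norm(t)   # one stored template
I = rng.standard_normal(32); I /= np.linalg.norm(I)   # input 'image'

sig = hw_module_signature(I, t)
sig_shifted = hw_module_signature(np.roll(I, 7), t)   # translated input
assert np.allclose(sig, sig_shifted)                  # signature is invariant
```

Shifting the image only permutes the set of simple-cell responses, so the pooled distribution, and hence the signature, is exactly invariant.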

Invariance in 2D and 3D vision. We have assumed here that ''images'' as well as templates are 2D. This is the case if possible sources of 3D information, such as stereopsis and/or motion, are eliminated. Interestingly, it seems that stereopsis does not facilitate recognition, suggesting that 3D information, even when available, is not used by the human visual system (see [15]).

Explicit or implicit gating of object classes. The second stage of the recognition architecture consists of a large set of object-class-specific modules, of which probably the most important is the face system. It is natural to think that signals from lower areas should be gated, in order to route access only to the appropriate module. In fact, Tsao [129] (but see also [18]) postulated a gating mechanism for the network of face patches. The structure of the modules, however, suggests that the modules themselves automatically provide a gating function, even if their primary computational function is invariance. This is especially clear in the case of the module associated with a single object (the object class consists of a single object, as in the case of a paperclip). The input to the module is subject to dot products with each of the stored views of the object: if none matches well enough, the output of the module will be close to zero, effectively gating off the signal and switching off subsequent stages of processing.
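The implicit gating can be sketched in a few lines (a toy illustration of our own; the threshold value and unit-norm views are assumptions): the module takes dot products with its stored views, and its output stays at zero unless one of them matches:

```python
import numpy as np

def module_output(x, stored_views, threshold=0.8):
    """Implicit gating by a single-object module: dot products with each
    stored (unit-norm) view; if no view matches well enough, the module's
    output is 0, silencing downstream stages for this object class."""
    sims = stored_views @ x        # one 'simple' cell response per stored view
    best = float(sims.max())
    return best if best >= threshold else 0.0

rng = np.random.default_rng(0)
views = rng.standard_normal((5, 64))
views /= np.linalg.norm(views, axis=1, keepdims=True)   # stored views of one object

other = rng.standard_normal(64)
other /= np.linalg.norm(other)                          # unrelated stimulus

assert module_output(views[2], views) > 0.99   # a stored view passes the gate
assert module_output(other, views) == 0.0      # unrelated input is gated off
```

In high dimensions, dot products between unrelated unit vectors concentrate near zero, so the gate opens essentially only for inputs resembling a stored view.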

Invariance to X and estimation of X. The description of our theory focuses on the problem of recognition as estimating identity or category invariantly to a transformation X - such as translation, scale, or pose. Often, however, the complementary problem of estimating X, for instance pose, is also important. The same neural population may be able to support [...] an attentional focus of processing to reduce clutter effects and also to run visual routines [118] at various levels of the hierarchy. An interesting, but not biologically plausible, alternative might be offered by the recently introduced transformer architecture [136].

Motion helps learning isolated templates. Ideally, templates and their transformations should be learned without clutter. It can be argued that if the background changes between transformed images of the same template, then the averaging effect intrinsic to pooling will mostly ''average out'' the effect of clutter during the unsupervised learning stage. Though this is correct - and we have computer simulations that provide empirical support for the argument - it is interesting to speculate that motion could provide a simple way to eliminate most of the background. Sensitivity to motion is one of the earliest visual computations to appear in the course of evolution and one of the most primitive. Stationary images on the retina tend to fade away. Detection of relative movement is a strong perceptual cue in primate vision as well as in insect vision, probably with similar normalization-like mechanisms [49], [103]. Motion induced by the transformation of a template may then serve two important roles:

• To bind together images of the same template while transforming: continuity of motion is implicitly used to ensure that identity is preserved;

• To eliminate background and clutter by effectively using relative motion.

The required mechanisms are probably available in the retina and early visual cortex.
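The averaging argument above can be checked with a toy simulation (our own construction; the 1D frames and Gaussian clutter are assumptions): a fixed ''template'' foreground plus a background that changes from frame to frame is recovered nearly clean by the averaging that pooling performs during unsupervised learning:

```python
import numpy as np

rng = np.random.default_rng(1)
template = np.zeros(100)
template[40:60] = 1.0                      # fixed foreground 'template'

# each frame = template + clutter that changes between frames
frames = [template + rng.standard_normal(100) for _ in range(500)]
avg = np.mean(frames, axis=0)              # pooling-like average over frames

err_single = np.linalg.norm(frames[0] - template)   # clutter in one frame
err_avg = np.linalg.norm(avg - template)            # clutter after averaging
assert err_avg < err_single / 5            # clutter is largely averaged out
```

With independent clutter, the residual shrinks roughly as 1/sqrt(number of frames); motion-based segmentation would remove the background per frame rather than only in the average.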

Despite significant advances in sensory neuroscience over the last five decades, a true understanding of the basic functions of the ventral stream in visual cortex has proven to be elusive. Thus it is interesting that the theory used in this paper follows from a novel hypothesis about the main computational function of the ventral stream: the representation of new objects/images in terms of a signature which is invariant to transformations learned during visual experience, thereby allowing recognition from very few labeled examples - in the limit, just one. This view of the cortex may also represent a novel theoretical framework for the next major challenge in learning theory beyond the relatively mature supervised learning: the problem of representation learning, formulated here as the unsupervised learning of invariant representations that significantly reduce the sample complexity of the supervised learning stage.

The authors declare no conflict of interest.

ACKNOWLEDGMENT
The authors would like to thank Danny Harari, Lorenzo Rosasco, and especially Gabriel Kreiman for discussions, and Ryan Pyle for reading the manuscript.