A Survey of Hardware Self-Organizing Maps

Self-organizing feature maps (SOMs) are a commonly used technique for clustering and data dimensionality reduction in many application fields. Indeed, their inherent topology preservation and their unsupervised learning of processed data without any prior knowledge place them among the leading candidates for data reduction in the Internet of Things (IoT) and big data (BD) technologies. However, the high computational cost of SOMs limits their use to offline approaches and makes online, real-time, high-performance SOM processing more challenging and mostly reserved to specific hardware implementations. In this article, we present a survey of hardware (HW) SOM implementations found in the literature so far: the most widely used computing blocks, architectures, design choices, adaptation, and optimization techniques that have been reported in the field of hardware SOMs. Moreover, we give an overview of the main challenges and trends for their ubiquitous adoption as hardware accelerators in many application fields. This article is expected to be useful for researchers in the areas of artificial intelligence, hardware architecture, and system design.

Slaviša Jovanović, Member, IEEE, and Hiroomi Hikawa, Member, IEEE

I. INTRODUCTION
The self-organizing map (SOM) is a special type of artificial neural network (ANN) proposed by Kohonen [1]. The SOM is an unsupervised learning algorithm performing a nonlinear mapping from a given high-dimensional input vector space to a low-dimensional map of neurons, usually a regular 2-D grid. It acts as an unsupervised clustering algorithm as well as a powerful visualization tool, and it has been used to visualize, interpret, and classify large high-dimensional data in many application domains, such as economy, industry, management, sociology, geography, and text mining [2], [3].
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2022.3152690.
Digital Object Identifier 10.1109/TNNLS.2022.3152690
[...] CUDA, and so on) or in dedicated hardware (HW) implemented on field-programmable gate arrays (FPGAs) or in application-specific integrated circuits (ASICs), by proposing specific hardware architectures that exploit the inherent parallelism of the SOM algorithm for better performance [4]-[28].
In applications requiring a large number of neurons (i.e., more than $10^2$) and/or processing huge volumes of data (i.e., more than $10^4$ samples with dimensions $> 10^2$), the SOM algorithm requires significant processing power that conventional CPU-based computing platforms often cannot provide. In the last decade, as shown in Fig. 1, with the massive surge of general-purpose computation on GPUs (GPGPU), GPU-based SOM implementations have gained increasing interest due to their significant speedups with respect to their CPU counterparts, as well as impressive performance improvements in comparison with HW SOMs [33]. On the other hand, the GPU processing power comes at the expense of the overall energy efficiency (number of operations, i.e., connection updates, per second per consumed watt), which is more than ten times higher for the FPGA-based SOMs, as reported in [35]. The substantial parallelism found in the SOM algorithm, combined with this high energy efficiency, is the main reason to target HW for SOM implementations. This article gives an overview of the hardware, application-specific implementations of the SOM algorithm: the most widely used computing blocks, architectures, design choices, adaptation, and optimization techniques that have been reported in the literature so far in the field of HW SOMs.

Fig. 1. Performance overview of different SOM implementations (in millions of connection updates per second, MCUPS; see Section V-C) (1995-2021): HW (FPGA and ASIC) implementations (data extracted from [4]-[28], [35]); GPU implementations (data extracted from [29]-[35]); and CPU implementations (data extracted from [29]-[33], [35]-[37]).

Fig. 2. Illustration of a 2-D SOM structure: the black nodes represent the neurons, the grayed area represents the direct neighbors of the neuron K + 3, and the light gray lines represent the input vector delivery to all neurons.
The remainder of this article is organized as follows. Section II describes the original SOM algorithm, while Section III provides an overview of hardware adaptations of the original SOM algorithm in all phases (initialization, vector distance computation, BMU search operation, neighborhood function, and weight update). Section IV describes the types of architectures commonly found in hardware implementations of SOM. In Section V, the methods, tools, datasets, and application use cases for validation of hardware SOMs as well as an overview of their performance measurements are provided. Open research and challenging problems in the field of HW SOM implementations are discussed in Section VI. Finally, Section VII provides the final conclusion of the present survey. In addition, Table I shows acronyms and terms frequently used in this article.

A. Original Self-Organizing Map Algorithm
The original SOM algorithm proposed by Kohonen is summarized in Algorithm 1. Its starting point is a map of neurons $i \in G$, usually placed in a two-dimensional $L \times K$ grid, as shown in Fig. 2. Every neuron $i \in G$ on the map includes a $D$-dimensional vector $m_i \in \mathbb{R}^D$, called the weight vector

$$m_i = \{\mu_{i,0}, \mu_{i,1}, \ldots, \mu_{i,j}, \ldots, \mu_{i,D-1}\} \in \mathbb{R}^D. \tag{1}$$

The learning phase starts with an appropriate initialization, where each weight vector element $\mu_{i,j}$ ($0 < i \le L \cdot K$, $0 \le j < D$) is initialized with a random value. In the learning phase, which is carried out through $\lambda \in \mathbb{N}$ steps or iterations, the map is trained with a set of training vectors $x \in X \subset \mathbb{R}^D$

$$x = \{\xi_0, \xi_1, \ldots, \xi_j, \ldots, \xi_{D-1}\} \in \mathbb{R}^D. \tag{2}$$

At the beginning of each iteration, a training vector $x$ is delivered to all neurons of the map, as shown with light gray lines in Fig. 2. In each iteration, all neurons calculate the distances of their weight vectors $m_i$ with respect to the input vector $x$. Then, the neuron $C \in G$ that has the closest weight vector $m_C$ to the input vector $x$ is determined from the calculated vector distances of all neurons

$$C = \arg\min_{i \in G} \|x - m_i\|. \tag{3}$$

This search for the neuron having the shortest vector distance is often called a winner-take-all (WTA) or best matching unit (BMU) operation, while the elected neuron is called the winner or BMU neuron. In Kohonen's study [1], the Euclidean metric is used as the vector distance to find similarities between the input $x$ and the weight $m_i$

$$\|x - m_i\| = \sqrt{\sum_{j=0}^{D-1} (\xi_j - \mu_{i,j})^2}. \tag{4}$$

After the winner neuron is determined, the weight vectors of the neurons in its neighborhood are updated toward the input vector as

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)] \tag{5}$$

where $t \in \mathbb{N}$ represents the discrete-time coordinate and $h_{ci}$ is the neighborhood function used to find the neighborhood neurons in the vicinity of the winner neuron whose weights are updated at the end of an iteration. Originally, the neighborhood function $h_{ci}$ is defined as follows [1]:

$$h_{ci}(t) = \alpha(t) \exp\left(-\frac{\|r_C - r_i\|^2}{2\sigma^2(t)}\right) \tag{6}$$

where $r_C \in \mathbb{R}^2$ and $r_i \in \mathbb{R}^2$ are the position vectors of the winner neuron $C$ and neuron $i$, respectively, and $\alpha(t)$ and $\sigma(t)$ are the learning rate and the neighborhood radius, respectively.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE II
TIME COMPLEXITY OF THE MAIN OPERATIONS OF THE SOM ALGORITHM [42]
After the unsupervised learning with the training vectors, the SOM builds the weight map representing the quantized projection of the input vector space, whose probability density function is represented by the distributed prototype vectors of the SOM weight map. Moreover, the weights of the SOM are retained and are used in the recall phase, where only the winner neuron search is carried out without weight update (see Algorithm 1). An example of the behavior of a 16 × 16 SOM processing three-dimensional vectors during the learning phase is shown in Fig. 3. Small red dots distributed in a triangular shape represent the input vectors, the larger blue dots represent the weights of the neurons, and lines connect the weights of neighboring neurons. Since the weight vectors are initialized with small values, they cluster at the origin of the plot, as shown in Fig. 3(a). The weights then gradually expand in an orderly way through numerous training iterations (E represents training iterations) until they approximate the distribution of the input samples. Note that the positional relationship between neurons in the vector space is maintained during training. This is possible due to the inherent topology-preserving nature of the SOM, which maps input vectors close to each other in the input vector space onto neighboring neurons of the SOM.
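The learning and recall phases described above can be sketched in a few lines of software. The following is a minimal, sequential illustration only, not the authors' Algorithm 1: the function names (`make_som`, `bmu_index`, `train_step`), the zero-valued initial weights, and the way `alpha` and `sigma` are passed in per step are all assumptions made for this example.

```python
import math


def make_som(rows, cols, dim):
    """Build a rows x cols map; weights start at zero (an assumption of this sketch)."""
    positions = [(r, k) for r in range(rows) for k in range(cols)]
    weights = [[0.0] * dim for _ in positions]
    return weights, positions


def bmu_index(weights, x):
    """Recall phase: index of the neuron whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_step(weights, positions, x, alpha, sigma):
    """One learning iteration: winner search, then Gaussian-neighborhood weight update."""
    c = bmu_index(weights, x)
    for i, m in enumerate(weights):
        # Squared grid distance between neuron i and the winner c.
        d2 = sum((a - b) ** 2 for a, b in zip(positions[c], positions[i]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        # Move the weight vector toward the input by the factor h.
        weights[i] = [mu + h * (xi - mu) for mu, xi in zip(m, x)]
    return c
```

In a full training run, `train_step` would be called once per training vector while `alpha` and `sigma` are gradually decreased; the decay schedule is left to the caller here because the text above does not fix one.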
In the SOM algorithm, during an iteration, each neuron carries out the vector distance computation and the weight vector update. These operations are independent between neurons and can be performed simultaneously (see Algorithm 1). Consequently, this inherent parallelism of the SOM algorithm makes it suitable for hardware implementations. Neurons implemented in hardware can work in parallel, while the neurons in software are executed sequentially. Therefore, a hardware-implemented SOM can process its input vectors more efficiently than a software-implemented one. This computational efficiency is especially desired when the size of the SOM or the input vector dimension is large. Due to the parallelism of the algorithm, the more neurons or vector processing elements (PEs) are implemented in hardware, the higher the performance obtained. In addition, the number of neurons implemented in hardware can be increased easily because they work independently of other neurons. As a result, the hardware SOM has the potential to provide high scalability. However, the bottleneck of the SOM algorithm is the winner search operation, in which the shortest vector distance must be found. Indeed, all vector distance values must be compared to find the shortest one, and various hardware comparison methods have been proposed so far, as discussed in Section IV. Moreover, Table II, which summarizes the time complexity of the main operations of the SOM algorithm for both sequential and parallel implementations [42], supports the idea that a considerable gain in overall performance can be obtained by introducing SOM operation-specific hardware.

III. SOM ALGORITHM FOR HARDWARE IMPLEMENTATION
As mentioned earlier, the SOM algorithm is suitable for hardware implementation because of its inherent parallelism. However, hardware-expensive functions, such as the Gaussian function in the original SOM algorithm, make the hardware implementation challenging because the hardware resources of implementation platforms are limited. Therefore, the SOM algorithm has been further modified to be more suitable for hardware implementations.
This section provides an overview of hardware adaptations of the original SOM algorithm in all phases: initialization, vector distance computation, BMU search operation, neighborhood function, and weight update (see Table III for the overall summary).

A. Initialization
Even though the original SOM algorithm suggests using random initial values for demonstration purposes, it turns out that this initialization policy does not necessarily provide the best performance in terms of convergence speed to the stationary values and quality of the resulting maps [1], [43]. The initialization of SOMs has been widely studied in the literature [1], [43]-[48]. Basically, two groups of initialization methods can be found [1], [44], [45]: random initialization and linear initialization (often called data-driven initialization). In the random initialization, the weights of the SOM neurons are initialized by randomly choosing data in the range of values observed in the input dataset, by randomly selecting data from the input dataset, or by randomly choosing perturbed values around the mean values observed in the input dataset. In the data-driven or linear initialization, the input datasets are analyzed before initializing the SOM weights. Different methods can be found in the literature: based on data analysis with k-means, where input datasets are projected on a map of the same size in order to find the cluster centers, which are then rearranged with some heuristics to fit the used SOM map [44]; on statistical analysis such as principal component analysis (PCA) on regular grids [47], where the largest eigenvectors of the projected input dataset are chosen; on the same PCA-based analysis with projection on irregular grids [45]; on connected graphs used to rearrange the most distant elements in an input dataset of nonvectorial elements [46]; or on placing the SOM neurons on the Hilbert self-similar curves [47]. Akinduko et al. [48] compared random and PCA-based initialization on quasi-linear and nonlinear datasets. They concluded that there is no universal initialization policy giving the best performance in terms of convergence and quality of obtained results; the random initialization performs better on nonlinear datasets, whereas the PCA-based initialization gives better results for quasi-linear ones.
When the SOMs are implemented in software, as was the case in all the initialization techniques previously presented, the weights can be easily programmed. However, the vector initialization is a burden for the hardware implementation since it requires additional circuits such as a linear feedback shift register (LFSR) that provides random values. If a single LFSR is employed, all vector elements must be initialized one by one, and thus, a communication link between the initialization circuit and all neurons is necessary. Alternatively, each neuron can have its own LFSR. In this case, the global communication link is not necessary, but all LFSRs must be initialized differently to generate different values, which breaks the modularity of the neurons. Weight programmability is necessary for neurons to modify their weights, but additional circuits to initialize all weight vector elements increase the hardware cost of the neurons. Thus, if possible, it is desirable to omit the initialization of the weight vector.
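The single-LFSR option discussed above can be illustrated with a small software model. This is a sketch under stated assumptions rather than a circuit from any surveyed work: it uses a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11 (a commonly used maximal-length configuration), and the names `lfsr16` and `init_weights`, the default seed, and the 8-bit weight width are hypothetical choices for this example.

```python
def lfsr16(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps at bits 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def init_weights(num_neurons, dim, seed=0xACE1, bits=8):
    """Fill every weight element one by one from a single shared LFSR.

    The serial loop mirrors the need for a link between the one
    initialization circuit and all neurons.
    """
    state = seed  # must be nonzero, otherwise the LFSR stays at zero
    mask = (1 << bits) - 1
    weights = []
    for _ in range(num_neurons):
        row = []
        for _ in range(dim):
            state = lfsr16(state)
            row.append(state & mask)  # keep the low `bits` bits as the weight
        weights.append(row)
    return weights
```

With one LFSR per neuron instead, each neuron would call `lfsr16` on its own state, and every neuron would need a distinct seed, which is the loss of modularity noted in the text.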
Kolasa et al. [49] investigated the weight initialization problem of the hardware SOM by simulations. Comprehensive simulations were carried out for several initialization strategies with different scenarios, and the results revealed that the hardware SOMs in many situations could be trained without any initialization, simply by using zeroed weights. These results are explained by the influence of the SOM algorithm's neighborhood mechanism, which in a given learning cycle also stimulates the activity of the neurons topologically positioned in a broader vicinity of the winner neuron.
This study was followed by [50] where an efficient initialization of HW SOM neuron weights was investigated.The obtained results confirmed that the SOM could learn properly even if the learning process started with zeroed weights.
In addition, in this work, an initialization circuit with full programmability of the weights was also proposed for the map without the neighborhood function.

B. Vector Distance Computation
The Euclidean metric is one particular type of Minkowski metric, which can be considered as a generalization of both the Euclidean distance and the Manhattan distance. The Minkowski distance is defined as

$$\|x - m_i\|_L = \left( \sum_{j=0}^{D-1} |\xi_j - \mu_{i,j}|^L \right)^{1/L}.$$

By setting $L = 2$, the Minkowski distance corresponds to the Euclidean distance. The Minkowski norm with $L = 1$ is known as the Manhattan, city block, or taxicab metric

$$\|x - m_i\|_1 = \sum_{j=0}^{D-1} |\xi_j - \mu_{i,j}|.$$

Since no squaring or square root circuit is required, the silicon area saving offered by the Manhattan metric is significant, and it is well suited for hardware implementation. Dlugosz et al. [51] investigated the effect of the Euclidean and Manhattan distances on the learning. The detailed system-level simulations showed that the Euclidean and Manhattan distances both lead to similar learning results. Another variant of the Minkowski distance is the Chebyshev distance (also known as the chessboard distance), which can be obtained with $L \to \infty$

$$\|x - m_i\|_\infty = \max_{0 \le j < D} |\xi_j - \mu_{i,j}|.$$

Pena et al. [11] investigated the use of the Chebyshev distance in HW SOMs and concluded from FPGA synthesis results that the Manhattan distance outperforms the Chebyshev one in terms of speed but at the expense of higher HW resources (bigger chip area).
In the original SOM algorithm, the computed Euclidean distances of all neurons are compared to find the shortest one. Omitting the square root has no effect on the magnitude comparison of the distances. Therefore, another popular vector distance metric is the squared Euclidean distance, because the hardware cost of the square root function is very high

$$\|x - m_i\|^2 = \sum_{j=0}^{D-1} (\xi_j - \mu_{i,j})^2.$$

The types of the vector distance that have been used in the hardware SOMs so far are summarized in Fig. 4 and Table III. It can be noticed that more than 75% of the vector distances used in the HW SOMs in the last 25 years (1995-2021) are the Manhattan, squared Euclidean, and Euclidean distances. This ongoing trend is also found in the recent state-of-the-art HW SOM implementations. The last quarter ("Other" in Fig. 4) comprises various attempts to use uncommon distance metrics for the HW SOM algorithm: modified Hamming distance [52], frequency comparison [25], dot product cosine [53], inner product [27], count of cycle slips [54], and so on.
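The distance metrics discussed in this section can be compared side by side in software. The sketch below is illustrative only; the function names and the helper `bmu` are assumptions of this example. It also shows that dropping the square root does not change which neuron wins, because the square root is monotonic on nonnegative values.

```python
import math


def manhattan(x, m):
    """Minkowski distance with L = 1: only subtractions, absolute values, and additions."""
    return sum(abs(a - b) for a, b in zip(x, m))


def squared_euclidean(x, m):
    """Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def euclidean(x, m):
    """Minkowski distance with L = 2."""
    return math.sqrt(squared_euclidean(x, m))


def chebyshev(x, m):
    """Limit of the Minkowski distance for L going to infinity."""
    return max(abs(a - b) for a, b in zip(x, m))


def bmu(weights, x, metric):
    """Index of the neuron with the shortest distance under the given metric."""
    dists = [metric(x, m) for m in weights]
    return dists.index(min(dists))
```

Note that different metrics can, in general, elect different winners for the same input; only the Euclidean and squared Euclidean distances are guaranteed to agree.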

C. BMU (Winner) Search
In the winner search, often referred to as the BMU or WTA search operation, the neuron unit that has the closest weight to the input vector (shortest distance) is searched.In HW SOMs, this operation is implemented either with analog or digital BMU circuits.
1) Analog BMU Search Circuit: Different analog hardware WTA computation circuits have been proposed in the literature. One of the WTA circuits is MAXNET [55], where neurons in the network mutually inhibit each other while activating themselves. Eventually, only one neuron is kept activated and becomes the winner. Lazzaro et al. [56] proposed a CMOS WTA circuit, where signals are represented as analog currents. Oster et al. [57] analytically examined the ability of a spike-based WTA network. Similarly, other examples of spiking WTAs with temporal coding have been reported in [58]-[60].
2) Digital BMU Search Circuit: Among digitally implemented WTAs, a bit-serial parallel minimum search circuit, shown in Fig. 5(a), is reported in [1]. The bit-serial winner determination is based on a bit-by-bit comparison of all neurons' vector distances, performed in parallel. This calculation requires a global AND operation between all neurons and feedback to them. The distance of each neuron is loaded into a shift register, and its corresponding flag is set to "1" at the beginning of the search operation. All the neurons in the network, in particular the most significant bits (MSBs) of their shift registers, are sequentially connected through AND gates. The AND chain signal becomes low if at least one of these MSBs is zero. The bit comparison starts from the MSB and proceeds to the least significant bit (LSB) by shifting the register to the left, covering one bit per stage (per clock). At any stage, the flags of the neurons with MSB = "1" are reset to zero if one of the MSBs of all the neurons is zero (the AND chain signal is zero). Otherwise, the flags are left unchanged; a flag that is already zero stays zero. Resetting the flag of a neuron to zero eliminates that neuron from further competition. After the last stage, only the neurons with the shortest distance are left with their flag at "1." Therefore, this method takes L clock periods to complete the search, where L is the bit length of the distance norm. Similarly, Tamukoh et al. [9] proposed a WTA circuit, which is a modified version of the bit-serial comparison. The proposed WTA circuit performs a rough comparison of the neurons' distances in the early stage and a strict one later, which allows faster learning in massively parallel SOM architectures.
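The bit-serial minimum search can be modeled in software as follows. This is a behavioral sketch, not the circuit of Fig. 5(a): the name `bit_serial_min` is hypothetical, and the model assumes that neurons already eliminated no longer pull the AND chain low, which is one way to make the elimination rule well defined.

```python
def bit_serial_min(distances, width):
    """Return one flag per neuron; flag 1 marks a neuron holding the minimum distance.

    distances: nonnegative integers that fit in `width` bits.
    One loop iteration corresponds to one clock period (one bit position).
    """
    flags = [1] * len(distances)
    for bit in range(width - 1, -1, -1):  # from the MSB down to the LSB
        bits = [(d >> bit) & 1 for d in distances]
        # Global AND over the current bit of all neurons still in competition
        # (assumption: eliminated neurons contribute a logic 1 to the chain).
        and_chain = all(b == 1 or f == 0 for b, f in zip(bits, flags))
        if not and_chain:
            # Some candidate has a 0 here: every candidate with a 1 drops out.
            flags = [int(f and not b) for f, b in zip(flags, bits)]
    return flags
```

For example, `bit_serial_min([5, 3, 7, 3], width=3)` leaves the flags of the two neurons with distance 3 set after three iterations, matching the statement that the search takes as many clock periods as the distance has bits.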
Hikawa et al. [23] modified the bit-serial comparison circuit so that all neurons' distance bits at the same position are tested simultaneously without any shift registers. Thus, the most significant bits of the neurons' distances are excluded from the winner search. The comparison results, indicating whether a neuron is still a candidate for the winner or not, are propagated to the lower bit through a signal instead of being stored in a flag register, as was the case in the bit-serial winner search. At the LSB position, the neuron having the shortest distance is determined. This winner search circuit is a purely combinational circuit (no clock), which, however, leads to a longer latency (poorer performance) of the circuit.
Another popular method for fast BMU search uses several parallel comparators in a binary tree structure. An example of the binary tree search circuit is shown in Fig. 5(b). Each comparator is accompanied by two 2-to-1 multiplexers (MUX) whose selection signals are controlled by the comparator's output. The winner is selected by tournament selections, and the multiplexers forward the shortest distance and the corresponding neuron ID to the next stage. This parallel binary-tree BMU search method is implemented as a global WTA circuit that collects vector distance data from all the neurons of the map.
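The tournament selection performed by the comparator tree can be sketched as below. The function name `tree_bmu`, the tie-breaking rule (the lower-indexed input wins on equal distances), and the handling of an odd number of entries are assumptions of this example, not details taken from Fig. 5(b).

```python
def tree_bmu(distances):
    """Binary-tree minimum search: returns (winner_id, winner_distance, tree_depth)."""
    if not distances:
        raise ValueError("at least one neuron is required")
    # Each entry carries the distance and the neuron ID, like the two MUX outputs.
    stage = [(d, i) for i, d in enumerate(distances)]
    depth = 0
    while len(stage) > 1:
        nxt = []
        for k in range(0, len(stage) - 1, 2):
            a, b = stage[k], stage[k + 1]
            # Comparator output selects which (distance, ID) pair is forwarded.
            nxt.append(a if a[0] <= b[0] else b)
        if len(stage) % 2:
            nxt.append(stage[-1])  # an unpaired entry passes through unchanged
        stage = nxt
        depth += 1
    return stage[0][1], stage[0][0], depth
```

The returned depth grows with the logarithm of the number of neurons, whereas the number of comparators grows linearly with it, which is the usual trade-off of this structure.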
Mailachalam et al. [61] compared the bit-serial minimum and the binary-tree search circuits in terms of speed, hardware resources (chip area), and circuit complexity. They concluded that the binary tree method is faster than the bit-serial method, which allows building HW SOMs with better performance. On the other hand, in terms of the hardware resources needed for their implementation, often expressed as the chip area that the designed circuit will occupy once manufactured, no difference was found.
The hardware SOM proposed by Hendry et al. [7] employs a unique winner search circuit. In this method, the distance value stored within each neuron is decreased by one at each clock cycle until zero is reached. The neuron or neurons that first reach zero nominate themselves to become the winner by attempting to output their identifier to the global output bus. Since it is possible that several neurons have the same minimum distance, the global winner is randomly chosen from those neurons.
Kurdthongmee [62] proposed a low-latency digital BMU search circuit for HW SOM quantizers. The general idea is to use a K-address 1-bit memory, where K corresponds to the maximal value of distance encountered in a given application (here color quantization). The 1-bit memory location is used to indicate whether the corresponding distance is found ("1" or "occupied") or not ("0" or "unoccupied") in a given learning iteration. At the end, the first "occupied" memory location starting from the beginning designates the BMU for a given learning iteration. The index of the first "occupied" memory location is retrieved with a custom lookup table (LUT)-based circuit. The proposed BMU search approach reduces the latency of the overall BMU operation (0.62 times that of the conventional method with comparators and binary tree) at the expense of the hardware cost (2× overhead for an FPGA implementation).
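The countdown-based winner search can be modeled in a few lines. This is a behavioral sketch only: the name `countdown_bmu` is hypothetical, the distances are assumed to be nonnegative integers, and the bus arbitration is reduced to a random choice among the neurons that reach zero in the same cycle.

```python
import random


def countdown_bmu(distances, rng=None):
    """Return (winner_id, clock_cycles) for a countdown-style winner search."""
    if not distances:
        raise ValueError("at least one neuron is required")
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    counters = list(distances)
    cycles = 0
    while True:
        # Neurons whose counter has reached zero compete for the output bus.
        zeros = [i for i, c in enumerate(counters) if c == 0]
        if zeros:
            return rng.choice(zeros), cycles
        counters = [c - 1 for c in counters]  # all neurons decrement in parallel
        cycles += 1
```

One consequence visible in this model is that the search time equals the minimum distance itself, so the latency depends on the data rather than on the number of neurons.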

D. Neighborhood Function
A very important feature of the SOM is its topology-preserving nature, where two adjacent vectors in the input vector space are mapped onto adjacent neurons on the map. This topology-preserving nature is realized by the weight update with the neighborhood function, which also affects the performance of the SOM in the recall phase. The original algorithm uses the Gaussian neighborhood function shown in (6) and Fig. 6(a). However, the Gaussian neighborhood function is not suitable for hardware implementation because of its high hardware cost. A straightforward way to implement the Gaussian function is to store precomputed values in a lookup table (LUT). The size of the neighborhood depends on the SOM size, and thus, the memory content that implements the LUT-based Gaussian function tends to be large, leading to the need for more hardware resources. Moreover, the LUT-based neighborhood function requires the use of a multiplier for vector adaptation, which should be avoided due to its higher hardware cost. Thus, most of the hardware SOMs use simplified neighborhood functions.
The simplest neighborhood function is a rectangular or step function, as shown in Fig. 6(b). With this function, only the weight vectors of the neurons within a certain radius R from the winner neuron are updated, all with the same update coefficient α. Since no multiplication is necessary to implement the neighborhood function, its hardware cost is reduced. The works in [5], [7], [15], [52], and [63] used the rectangular function.
A function whose value is restricted to negative powers of two is another popular neighborhood function in HW SOMs. Because the values are restricted to negative powers of two, the multiplication can be replaced by a shift operation. This function is shown in Fig. 6(c). The multiplication by the neighborhood function is executed by an arithmetic right shift, and its shift size is determined by the distance to the winner neuron. As shown in Fig. 7 and Table III, many hardware SOMs employ the negative-powers-of-two neighborhood function.
Dlugosz et al. [14], [64] proposed a triangular neighborhood function.As shown in Fig. 6(d), the function value linearly decreases with the distance to the winner neuron.The effect of this triangular neighborhood function on the SOM learning was studied by Hspice simulation [65], [66] and revealed that it can be successfully used instead of the Gaussian neighborhood function.Moreover, the sound performance of the SOM was also confirmed in the cases where the signal resolution of the neighborhood function was low.However, the triangular neighborhood function requires multipliers for its computation, leading to a higher hardware cost.
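The neighborhood functions compared above can be written as functions of the grid distance `d` between a neuron and the winner. The following sketch is illustrative: the function names, the `base_shift` parameter, the rule that the exponent grows by one per unit of distance, and the point at which the triangular function reaches zero are all assumptions, since the surveyed works differ in these details.

```python
import math


def gaussian(d, alpha, sigma):
    """Original neighborhood function: needs an exponential and a multiplier."""
    return alpha * math.exp(-(d ** 2) / (2.0 * sigma ** 2))


def rectangular(d, alpha, radius):
    """Step function: the same coefficient inside the radius, zero outside."""
    return alpha if d <= radius else 0.0


def neg_power_of_two(d, radius, base_shift=1):
    """Values restricted to 2**-Q, so the update reduces to a right shift by Q."""
    return 2.0 ** -(base_shift + d) if d <= radius else 0.0


def triangular(d, alpha, radius):
    """Linearly decreasing value; assumed here to reach zero at d == radius."""
    return max(0.0, alpha * (1.0 - d / radius))
```

Only `neg_power_of_two` and `rectangular` avoid a general multiplication in the weight update, which is consistent with the hardware cost remarks made for the Gaussian and triangular functions.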
The hardware SOM proposed by Pohl et al. [10] is based on the modular array architecture where neurons exchange data via a shared bus.In this approach, during each iteration, the distance value to the winner in every neuron is decreased by one at each clock cycle.Then, the neuron whose distance reaches zero takes a value from the shared data bus, which represents the update coefficient for the neighborhood function.These values, provided by the bus to all the neurons of the architecture, are the neighborhood function values previously precalculated in software and stored in the memory.
The neighborhood function proposed in [16] is realized as a combination of an update pulse generator and an update pulse selector. The update pulse generator generates three pulse signals with different frequencies. The amount of update applied by the neighborhood function is proportional to the chosen frequency. Thus, the winner neuron uses the highest frequency update pulse, whereas its neighboring neurons use the lower frequency ones. To provide satisfactory learning results, these frequencies are determined and predefined offline. The neighborhood mechanism used in [17] was implemented in a similar way, where three precalculated neighborhood function values are broadcast to all the neurons of the map. Each neuron uses the appropriate value according to its distance to the winner neuron. In addition, the function values are made adaptively large to cover the cases where the distance to the winner neuron is larger than the usual one. Thus, the learning performance in terms of quantization error is also improved.
Despite the optimizations mentioned above, all hardware implementations of the SOM presented in the literature are affected by a common problem, i.e., the performance decreases with an increasing number of neurons. The winner search operation, commonly realized with a comparator tree, is the main reason for this performance degradation (see Section III-C and Table II). Cardarilli et al. [67] proposed the all winner-SOM (AW-SOM), a modified version of the SOM algorithm. The AW-SOM is based on the following idea: if the input vector is close enough to the winner neuron, the coordinates of the former can be used directly in the neighborhood function. Because the AW-SOM does not use the conventional neighborhood function that requires the coordinates of the global winner, it does not require the identification of the winner neuron either, which is the most critical operation in terms of propagation delay. Experimental results show that if the neurons are initialized using a uniform or random distribution, the results of AW-SOM and traditional SOM clustering are comparable in 92% of the cases. The failed cases are related to a bad position of the clusters with respect to the initial position of the neurons. In addition, the absence of the comparator tree for the winner neuron selection considerably improves the system performance (see Section V-C). On the other hand, because the topological relations between neurons are not considered and the winner-based neighborhood function is not used, the topology-preserving nature of the SOM, stated earlier as one of the pillars of the SOM algorithm, is not guaranteed in this approach.
The types of the neighborhood functions that are used in the hardware SOMs are summarized in Fig. 7 and Table III. The most widely used one is the negative-powers-of-two function, which is also the current trend in the recent state-of-the-art HW SOM implementations, followed by the triangular (16.7%) and rectangular (11.1%) neighborhood functions. Some examples of LUT-based [10], [16] or programmed [25] neighborhood functions can also be found, as well as HW SOMs without a neighborhood function [67], [68].

E. Weight Vector Update
Hardware SOMs handling low-dimensional vectors, such as [11], [24], [26], use registers for storing the weight vector elements.Fig. 9(a) shows an example of weight update circuit for a single-element vector.This circuit comprises a register, where the weight value is stored, a barrel shifter for the neighborhood function, and a 2-to-1 multiplexer to choose if the weight value needs to be updated or not.The circuit first calculates (μ − μ), which is the difference between the input and the weight vector element.This difference is then fed to a barrel shifter that outputs (μ − μ)/2 Q , which is a Q-bit shifted value to the right of the initial difference.If the signal S is one, the barrel shifter output is added to the register; otherwise, no change in the weight value.The values of S and Q are determined by computing the distance between the neuron whose weights are updating and the winner one.Consequently, by controlling S and Q, the negative-powers-of-two neighborhood function is implemented.In addition, to process D-dimensional vectors, D update circuits are employed to update D vector elements simultaneously.
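The register-based update path described above can be mimicked with integer arithmetic. This is a software sketch of the idea, not a description of Fig. 9(a): the names `update_element`, `update_vector`, and `control_signals`, the integer weight format, and the rule mapping the distance to the winner onto S and Q are assumptions of this example.

```python
def control_signals(dist_to_winner, radius, base_shift=1):
    """Derive (S, Q) from the grid distance to the winner (assumed mapping)."""
    if dist_to_winner > radius:
        return 0, 0  # outside the neighborhood: the multiplexer keeps the old weight
    return 1, base_shift + dist_to_winner


def update_element(mu, xi, s, q):
    """Model of one update circuit: subtractor, barrel shifter, adder, and 2-to-1 MUX."""
    diff = xi - mu        # input element minus weight element
    shifted = diff >> q   # arithmetic right shift, i.e., division by 2**q (floored)
    return mu + shifted if s else mu


def update_vector(m, x, s, q):
    """D update circuits working side by side, one per vector element."""
    return [update_element(mu, xi, s, q) for mu, xi in zip(m, x)]
```

In the memory-based variant for higher dimensional vectors, the same `update_element` step would be applied to one element per memory access instead of to all elements at once.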
On the other hand, in the HW SOMs targeting higher dimensional vectors, such as [9], [19], [69], a memory block is often employed to store the weight vector elements.In this case, a typical update circuit is shown in Fig. 9(b).It can be noticed that is heavily inspired by the update circuit used for low-dimensional vectors and presented in Fig. 9(a).The main difference is the replacement of the register with a memory block and the addition of an address generator, whose function is to provide memory addresses to access all weight vector elements.In this update circuit, the weight vector elements are read out sequentially from the memory and are modified in the same way as in Fig. 9(a).
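The shift-based update described above can be illustrated with a short behavioral sketch. It is not taken from any of the cited designs: integer weights stand in for fixed-point registers, and the function names as well as the mapping from grid distance to (S, Q) in `neighborhood_control` are assumptions chosen only for illustration.

```python
def update_element(weight, x, s, q):
    """Behavioral model of a Fig. 9(a)-style update for one weight element."""
    diff = x - weight        # subtractor: input element minus stored weight
    shifted = diff >> q      # barrel shifter: arithmetic right shift by Q bits
    # 2-to-1 multiplexer: keep the old weight when S is 0
    return weight + shifted if s == 1 else weight


def neighborhood_control(grid_distance, radius, q_winner=1):
    """Hypothetical mapping from the distance to the winner to (S, Q)."""
    s = 1 if grid_distance <= radius else 0
    q = q_winner + grid_distance   # one extra shift per step away from the winner
    return s, q


def update_from_memory(memory, base, xs, s, q):
    """Fig. 9(b)-style variant: elements are read and written sequentially."""
    for offset, x in enumerate(xs):    # address generator: base, base + 1, ...
        addr = base + offset
        memory[addr] = update_element(memory[addr], x, s, q)
```

Because every step of the update is a subtraction, a shift, and an addition, no multiplier is needed, which is the main attraction of the negative-powers-of-two neighborhood function in hardware.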

IV. HARDWARE SOM ARCHITECTURE
As substantial parallelism is found in the SOM algorithm, various hardware SOM architectures aiming to speed up its computation and to improve both learning and recall performance have been proposed in the literature. These hardware SOMs, which employ a high degree of parallel computation to accelerate the SOM algorithm, are listed in Fig. 10 and Table III. Typically, four types of SOM computing architecture, shown in Fig. 8, can be found in the HW SOMs (see Fig. 10 for the percentage chart): 1) dedicated processor; 2) systolic array; 3) distributed; and 4) modular architecture.

A. Dedicated Processor
The dedicated processor hardware, shown in Fig. 8(a), typically consists of computing components dedicated to the SOM computation. These components are a memory for storing all weight vectors, a vector distance computing unit, a WTA unit for the winner search operation, and a component performing the weight update. Since the SOM computation is based on vector arithmetic, the computing units in this processor (the vector distance and weight update units) can handle vectors.
Asanovic [4] implemented the SOM algorithm within the Spert-II system based on a T0 vector microprocessor. The T0 microprocessor includes a MIPS-II compatible RISC CPU with a 1-kB on-chip instruction cache, a fixed-point vector coprocessor, and an external memory interface. By making use of these available parallel computation resources, the SOM computation was highly accelerated (more than 10× compared to the state-of-the-art architectures at the time).
Kurdthongmee [70] proposed a color image compression system based on a hardware SOM, which employs a rational-numbering system for the codebook and learning kernel. The use of the rational numbering system and an approximated nonlinear learning kernel extends the capabilities of the quantizer. The experimental results showed that the quality of the output images is superior to that of predecessor implementations, with an acceptable throughput and FPGA resource utilization.
Tanaka et al. [27] developed a single layer of a deep SOM network and a fully connected neural network (FCNN), used to mimic the function and structure of the amygdala. The amygdala is a specific area of the brain associated with classical fear conditioning. In general, the performance of deep neural networks (DNNs) relies heavily on the availability of large amounts of training data, which are not always available. The authors tackled this problem with the brain-inspired amygdala model to achieve computer learning with limited training data. Hardware for the deep SOM network and the FCNN was designed and implemented on an FPGA, and the proposed amygdala model was applied to a robot waiter task in a restaurant. The experimental results show that the model learns a customer's preferences after only a few human-robot interactions.
Sun et al. [71] presented a hardware platform for accelerating the SOM algorithm. In the proposed platform, four types of neuron network topology are supported: 1-D array, 2-D square, 3-D cube, and binary tree. The proposed acceleration circuit contains eight processing units performing weight update and distance calculations for eight neurons simultaneously. It also includes a comparator tree to find the winner neuron. The proposed accelerator was applied to three applications: chromaticity diagram learning, labeling of handwritten number images, and image vector quantization. The functionality of the proposed platform was verified by MATLAB simulations of these applications, whereas the hardware cost estimation was carried out through FPGA synthesis.
de Sousa et al. [72] proposed an FPGA-based SOM architecture called SOMprocessor. The proposed architecture explores two different computational strategies to improve both the data flow through the processor and its flexibility to implement different network topologies. The first improvement is achieved by multiplexer components, which support alternating processing of neuron sets by the arithmetic circuits. This strategy provides a more flexible use of the chip, in which larger networks can be processed even in low-density FPGAs. The second improvement is the inclusion of a pipeline architecture for the training algorithm so that different parts of the circuit can process data at the same time. The SOMprocessor was applied to a video categorization task, namely the categorization of human action videos for autonomous surveillance.
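As a rough software analogue of the four components listed above (weight memory, vector distance unit, WTA unit, and weight update unit), the following minimal sketch performs one learning step on a 1-D map. The squared Euclidean distance, the 1-D topology, and the halving neighborhood are assumptions made for brevity and do not describe any specific cited processor.

```python
def squared_distance(x, w):
    """Vector distance unit: squared Euclidean distance (assumed metric)."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def winner_take_all(distances):
    """WTA unit: index of the smallest distance (lowest index wins ties)."""
    return min(range(len(distances)), key=distances.__getitem__)


def som_step(weights, x, rate):
    """One learning step on a 1-D map; `weights` plays the role of the weight memory."""
    distances = [squared_distance(x, w) for w in weights]
    winner = winner_take_all(distances)
    for n, w in enumerate(weights):
        # Assumed neighborhood: strength halves with each step away from the winner.
        h = rate * 0.5 ** abs(n - winner)
        weights[n] = [wi + h * (xi - wi) for xi, wi in zip(x, w)]
    return winner
```

In a dedicated processor the list comprehensions above would correspond to vector instructions or vector functional units, while the surrounding control flow would run on the scalar part of the processor.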

B. Distributed Architecture
The distributed architecture shown in Fig. 8(b) is based on the use of multiple PEs. Thus, it can be regarded as a dedicated array processor for the SOM architecture. It is also considered a massively parallel architecture when a large number of PEs are included in the array. All PEs perform a part of the SOM computation in parallel. Global computations, such as the winner neuron search, are carried out by the PEs, whereas interconnections propagate the necessary information throughout the entire structure. Each PE may process multiple neurons, and the highest parallelism is achieved if each PE processes a single neuron, which also yields the highest efficiency.
1) Processing Element: The PEs execute the same computation on different data, i.e., weight vectors. Thus, some researchers, such as [7], [8], [10], employed a single instruction multiple data (SIMD) processor for flexibility. The same instructions or commands are generated by a global controller and fed to the SIMD processors. Hendry et al. [7] presented their SIMD array for the hardware SOM as a soft IP core so that the number of neurons, the number of elements per vector, and the number of bits of each element could be defined by synthesis time parameters.
Another example is the hardware SOM proposed by Sudha et al. [68]. They presented a novel hardware architecture for a 3-D SOM, designed for color quantization. Color quantization is a process of generating a color palette containing a limited number of colors from a full-color image and can be done with the SOM. The generated color palette is afterward used to reconstruct the image (i.e., for image compression purposes). Each pixel of the reconstructed image is replaced with one of the colors from the palette. This process quantizes the initial color of each pixel in the image and thus reduces the image data size without significantly degrading the image quality. In the 3-D SOM, the neurons are arranged as points on a cube with their red, green, and blue axes. The sequence of operations in the coding phase is controlled by the instructions generated and propagated by the control unit of the proposed SOM architecture.
Another approach to building an SOM PE is to use dedicated PE circuits made of hardwired logic defined at the register transfer level (RTL). Since this approach does not have to decode instructions or commands, higher performance is obtained at the expense of the overall flexibility. This type of hardware neuron was used in the hardware SOMs listed in Table III and labeled as "distributed" in the "architecture type" column.
2) Interconnection Links: The distributed architecture requires communication links between the PEs, the controller, and I/O. Examples of a global bus-based communication method can be found in [9], [16], [17]. In these approaches, a global winner search (GWS) operation is commonly carried out. The GWS circuit collects the weight vector distances from all neurons, and the global winner is searched and identified.
Works such as [11], [14], [73] employed a local interconnection network, in which the PEs are connected to their directly adjacent neurons. The input vectors are fed to the SOM through the local links, and the winner neuron search is distributed among groups of neurons by using the local interconnections. The local winner within a group of directly neighboring neurons (next to each other) is searched first. Then, the global winner is determined by comparing the local winners from all the neuron groups.
de Abreu de Sousa et al. [21] compared three types of architecture for executing the SOM learning and recall phases: distributed, centralized, and hybrid. The centralized architecture is defined as the architecture using a central control unit to collect the distance information from all neurons and to perform the BMU search. In the distributed architecture, the local winner of the neighboring neurons is computed and broadcast to all other neurons. This process is continuously repeated, and at the end, the global shortest distance value is propagated through the entire network. The third architecture is a hybrid model, which is a mix of the two previous architectures. The FPGA implementations of the three models were compared to support the system design choices. Results show that the centralized model outperforms the other models in terms of chip area occupation and maximum operating clock frequency.
Rodriguez et al. [74] presented the generic iterative grid principles for distributed computing. The use of the iterative grid to implement three types of SOMs, i.e., the original Kohonen SOM (KSOM), the dynamic SOM (DSOM), and the pruning cellular SOM (PCSOM), was investigated. Thanks to the iterative grid, the implementations of those SOMs are fully decentralized. The behavior of these iterative grid models was simulated and was found to be competitive with centralized models using the Manhattan distance as the vector distance metric.
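The two-stage search described above, local winners first and the global winner afterward, can be sketched as follows. The grouping of neurons into consecutive blocks of `group_size` is an assumption for illustration; the cited designs group physically adjacent neurons on the map.

```python
def local_winners(distances, group_size):
    """Stage 1: each group reports its (distance, neuron index) winner."""
    winners = []
    for start in range(0, len(distances), group_size):
        group = distances[start:start + group_size]
        best = min(range(len(group)), key=group.__getitem__)
        winners.append((group[best], start + best))
    return winners


def global_winner(distances, group_size):
    """Stage 2: compare the local winners; ties go to the lowest index."""
    return min(local_winners(distances, group_size))[1]
```

Only one (distance, index) pair per group has to travel beyond the group, which is what keeps the wiring local in such architectures.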
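The repeated local exchange in the distributed model can be illustrated with a small simulation in which every neuron keeps the minimum of its own value and those of its four grid neighbors. This is only a behavioral sketch of the principle under an assumed 4-neighbor mesh, not the circuit evaluated in [21].

```python
def propagate_minimum(dist, steps):
    """Each step, every cell takes the minimum over itself and its 4 neighbors."""
    rows, cols = len(dist), len(dist[0])
    current = [row[:] for row in dist]
    for _ in range(steps):
        nxt = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        nxt[r][c] = min(nxt[r][c], current[rr][cc])
        current = nxt
    return current
```

On a rows × cols mesh, (rows − 1) + (cols − 1) steps are enough for the global minimum to reach every neuron, so the search latency grows with the map side length rather than with a centralized comparator tree depth.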

C. Systolic Array Architecture
Based on the distributed PE array structure introduced previously, some HW SOM architectures, such as those presented in [5], [12], and [28], also employ the systolic model of data exchange. For instance, the systolic array proposed by Kung [75] is a homogeneous network of tightly coupled PEs, as shown in Fig. 8(c). Each PE independently computes a partial result using the data received from its neighboring PEs in the upstream direction(s), stores the result within itself, and passes it to its neighboring PEs in the downstream direction(s). Thus, every PE performs a different computation. In addition, the data path is aggressively pipelined, which increases the overall performance of the architecture.
Ienne et al. [5] employed the MANTRA I system to validate the new SOM learning algorithm they proposed. The MANTRA I system is a massively parallel system based on a systolic array of up to 1600 PEs. The systolic array at the heart of the SIMD part of the proposed architecture is a square mesh of GENES IV PEs, designed in CMOS 1-μm standard-cell technology. In the original learning algorithm, the weights are updated after the presentation of every input vector. On the contrary, in the proposed architecture, the weight vectors are updated after the presentation of a group of inputs in a batch fashion. Consequently, the authors proposed a new weight update algorithm called the mantra algorithm, which is an intermediate between the original and the batch algorithm. With the mantra algorithm, the winner selection is based on the batch algorithm, and the weights are updated by using the original method. It was validated through theory and simulations that the mantra algorithm performs almost as well as the original learning algorithm. The main advantage of the mantra algorithm is its finer grain of parallelism, which allows it to be used in hardware with a very large number of processors without compromising the properties of the algorithm.
Manolakos et al. [12] designed a modular SOM systolic architecture, described as a soft IP core in synthesizable VHDL. The network size, the vector dimension, the weight and data element bit width precision, and so on are all tunable parameters. The proposed array is made of two types of PEs: the recall mode PE (PER) and the weights update PE (PEU). An SOM module control unit (MCU) generates all the necessary control signals for both array columns. The proposed SOM was implemented on an FPGA and validated on a real-time vector classification task working on input vectors with thousands of elements.
Similarly, Ben Khalifa et al. [28] proposed a new SOM architecture called systolic-SOM (SSOM). The SSOM is based on the use of a generic model, also inspired by a systolic movement. In this model, two levels of nested parallelism of neurons and connections are used. The proposed approach was validated and its performance was evaluated through several different SOM networks integrated on an FPGA platform.

D. Modular Architecture
A problem of the distributed architecture is its limited expandability. In order to increase the number of neurons, the whole system must be redesigned. The modular architecture shown in Fig. 8(d) divides the whole SOM hardware into modules. Each module is usually a distributed architecture SOM (see Section IV-B), and the number of neurons can be increased easily by adding new modules. Thus, this approach provides expandability to the HW SOM architecture.
Lachmair et al. [15] proposed a hardware SOM based on the modular architecture, called gNBXe. This work is based on the principles of NBX and gNBX, previously introduced in [6] and [10], respectively. It consists of a global controller (GC) and the PEs, which are the hardware units performing the calculations for simulating the neurons. The local controller in a PE translates the macro commands from the global controller to the local control signals, allowing the corresponding PE to perform the calculations related to the neuron specified in the macro command. Consequently, the system can be extended by adding gNBXe modules to the system bus architecture.
Ramirez-Agundis et al. [13] proposed a modular, massively parallel, SOM-based vector quantizer for real-time video coding. The hardware architecture is divided into three sections: the processing units array, the address generator, and the control unit. The processing units array is distributed in modules, with 16 units each and a maximum of 16 modules (up to 256 neurons). Input vectors are stored in an external memory. To process one input vector, its elements are read from the memory and applied to the input of the network, one element at a time, until the complete vector is scanned. At the training stage, the scan is done twice. The first scan determines the winner neuron, whereas the second one updates the weights. Each module consists of 16 processing units and a comparator to identify the local winner neuron inside that module. The control unit determines the global winner. The maximum frequency obtained after place and route was 71.43 MHz when the design was implemented on an FPGA device.
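The two-scan training flow can be sketched behaviorally as follows. Several details are assumptions made only for illustration and are not stated for [13]: the Manhattan distance, the update of the winner neuron alone, and the shift-based learning step.

```python
MODULE_SIZE = 16  # processing units per module, as described above


def first_scan(weights, x):
    """Scan 1: stream x one element at a time and accumulate distances."""
    acc = [0] * len(weights)
    for k, element in enumerate(x):          # one vector element per step
        for n, w in enumerate(weights):      # all units work in parallel in hardware
            acc[n] += abs(element - w[k])    # assumed Manhattan distance
    local = []
    for start in range(0, len(acc), MODULE_SIZE):
        module = acc[start:start + MODULE_SIZE]
        # per-module comparator: local winner as (distance, neuron index)
        local.append(min((d, start + i) for i, d in enumerate(module)))
    return min(local)[1]                     # control unit: global winner


def second_scan(weights, x, winner, shift):
    """Scan 2: stream x again and move the winner toward it (assumed rule)."""
    for k, element in enumerate(x):
        weights[winner][k] += (element - weights[winner][k]) >> shift
```

The second scan is needed because the winner is only known once the last element of the vector has been accumulated in the first scan.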

E. Scalable/Flexible Architecture
A joint research group from the University of Lorraine and the University of Sousse has proposed a hardware SOM architecture that uses network-on-chip (NoC) communication as an alternative communication approach [18], [19], [22], [40], [76]. The NoC is a network-based communication subsystem and is presented as an alternative to a traditional shared bus, allowing the connection of several PEs on a single chip [77], [78]. NoCs offer explicit parallelism, high bandwidth, and a high degree of modularity, which make them very suitable for distributed architectures such as SOM networks. The common structure of a 2-D mesh NoC is composed of the same number of PEs and routers. The data are transmitted among neurons by using packets and well-defined protocols and routing policies (i.e., wormhole routing and the XY routing algorithm). The architecture can define its communication links dynamically by programming the packets. Its significant feature is that the NoC-based architecture can perform different applications in a time-sharing manner by dynamically defining the use of neurons and their interconnections [76]. The proposed hardware SOM architecture also employs the systolic way of data exchange through the NoC, which provides high flexibility and scalability.
Recently, Hikawa [26], [69] proposed a hierarchical architecture, called nested architecture. In this layered architecture, the top module is made of four submodules placed in the second layer, and each submodule is made of smaller subsubmodules in the third layer. Each module in the bottom layer is made of four neuron units. Consequently, this architecture provides high expandability where, only by adding a module, the number of neurons is quadrupled.
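For readers unfamiliar with XY routing, the following minimal sketch lists the routers visited by a packet in a 2-D mesh: the packet first travels along the X dimension until the destination column is reached and then along the Y dimension. The coordinate convention and the function name are assumptions; the sketch does not model wormhole flow control or the packet format of the cited works.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst, inclusive."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # phase 1: move along X
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # phase 2: move along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

The number of hops equals the Manhattan distance between the two routers, and the fixed dimension order is the property usually cited as making this routing deadlock-free in a mesh.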

F. Vector Representation
Researchers have been developing hardware SOMs and their components using digital or analog signals and techniques to represent the vector element values. Some HW SOM components are designed by using different analog techniques (i.e., current mode circuits [51] and WTA classifiers [79]). On the other hand, for a complete HW SOM implementation, all of the hardware implementations found in the literature so far employed digital techniques, as summarized in Table III. This implementation preference can be explained by the fact that digital implementation can take advantage of the benefits of the state-of-the-art VLSI and ULSI techniques [80] and powerful design environment tools and methods, such as hardware description languages (HDL) and the easy accessibility of FPGAs. On the contrary, the primary disadvantage of analog implementations is low design flexibility, even though they can possibly provide higher speed with lower hardware cost.
1) Analog: As mentioned before, some computing components implemented with analog techniques have been reported so far. However, to the best of our knowledge, no full analog implementation of the SOM can be found in the literature. Here, the full implementation of the SOM is assumed to include both the learning and the recall capabilities. Indeed, it is hard for analog implementations to provide the learning mechanism of the SOM, where the weight values must be adjusted according to the input vectors. In addition, it is difficult to design and build an analog circuit that holds a number of analog weight values that are also programmable.
Dlugosz et al. [51] proposed analog components for the SOM, which include an analog current-mode circuit for distance calculation between the weight and input vectors and a new analog programmable neighborhood mechanism, providing the triangular neighborhood function [64].
Shah et al. [79] presented a vector classifier based on a vector-matrix multiplier (VMM) and a winner-take-all (WTA) classifier structure. The proposed classifier was implemented on a large-scale system-on-chip (SoC) field-programmable analog array (FPAA). The implemented design is a mixed-mode system, including the analog classifier data path and the control circuitry for weight updates, which are done with the microprocessor available on the SoC FPAA.
2) Digital: The most effective solution to design the hardware SOM is to implement the whole system in silicon with digital technology. As shown in Fig. 11 and Table III, two technology choices are available: complementary metal-oxide-semiconductor (CMOS) implementation or FPGAs.
The T0 processor for the Spert-II vector microprocessor system that implemented the SOM was fabricated with 1-μm CMOS design rules [4]. Similarly, the GENES IV processor, which is the basic block of the MANTRA I system implementing the SOM with the mantra algorithm, was designed in CMOS 1-μm standard-cell technology [5]. The NBX processor was synthesized with the 0.8-μm CMOS standard-cell process and was used to build the MoNA system to implement the SOM [8]. Kim et al. [63] implemented their HW SOM, used for electrocardiogram (ECG) clustering, with 65-nm CMOS technology.
Key design issues for the implementation technology are area, signal delay, and power consumption. The CMOS implementation is superior to the FPGA in terms of all these design issues. In contrast, the major advantages of the FPGA technology are its reconfigurability and smaller nonrecurring engineering (NRE) costs. The use of the FPGA can be viewed as an intermediate solution between the ASIC and the software approach. In terms of area, speed, and power consumption, the recent families of the major FPGA vendors have improved significantly, which has caused a huge increase in their use in various hardware designs. Motivated by the high speed and the low cost of parallel computation, numerous studies have employed the FPGA technology as the hardware platform to implement the SOM. As shown in Fig. 11, almost two thirds of the HW SOM implementations have been implemented by using the FPGA technology. Moreover, this trend has been especially pronounced over the last five years.
In digital design techniques, fixed-point binary representation is usually used instead of floating-point representation to reduce the computing cost and memory usage. The crucial issue of the binary representation is the data size, which affects the obtained precision. In addition, the reduction of the number of bits has an influence on the circuit size of the computing components, which significantly affects the number of neurons that can be synthesized on a single FPGA device.
Dlugosz et al. [81] studied the allowable reduction of the number of bits of some internal signals in an HW SOM implementation that does not deteriorate the SOM network behavior. To determine the influence of the bit lengths of signals on the quality of the learning process, a series of simulations was completed using an accurate software model of the SOM implemented on an FPGA device. They revealed that the length of some internal signals can be shortened to 7 bits without disrupting the learning process or deteriorating the overall quantization error. Consequently, by shortening the data length of some internal signals, the number of neurons that can be implemented on a single device increases by 240% compared to the case where the resolution is kept at 16 bits.
A tristate SOM as a resource-efficient architecture for implementation on FPGA was proposed by Appiah et al. [52]. The tristate SOM maintains tristate weights with {0, 1, #} as the possible values, where "#" represents the "don't care" state. In addition, a modified version of the Hamming distance is used to compare input vectors to weight vectors. Since the weight takes only three states, each state can be represented by a two-bit binary code.
Kleyko et al. [53] presented a generalization of the tristate self-organizing maps. In the proposed SOM, weights are allowed to be updated beyond [−1, 1] to the wider range of [−κ, κ]. A clipping function is used as a nonlinear activation function, which is applied to all weights of each neuron. The proposed SOM achieves a better accuracy in a classification task when compared to the original tristate SOM.
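To make the idea of tristate weights concrete, the sketch below computes a Hamming-like distance in which "don't care" positions never contribute. The exact modification of the Hamming distance used in [52] is not reproduced here; treating "#" as matching both 0 and 1, as well as the particular two-bit codes, are assumptions for illustration.

```python
# Hypothetical two-bit encoding of the three weight states.
TRISTATE_CODE = {"0": 0b00, "1": 0b01, "#": 0b10}


def tristate_distance(x_bits, weights):
    """Count positions where a non-'#' weight differs from the binary input."""
    return sum(1 for b, w in zip(x_bits, weights) if w != "#" and b != w)


def best_match(x_bits, weight_vectors):
    """Index of the neuron with the smallest tristate distance."""
    dists = [tristate_distance(x_bits, w) for w in weight_vectors]
    return min(range(len(dists)), key=dists.__getitem__)
```

With binary inputs, such a comparison needs only bitwise logic and a population count, which is consistent with the resource efficiency claimed for this family of SOMs.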
3) Pulse-Based Architecture: One of the important objectives in brain modeling is to explain how the organizational order emerges by itself in the various brain maps. Kohonen [82] demonstrated that the SOM has a similar self-ordering function to that found in the biological brain. In biological neural systems, information is conveyed by electric pulses. Neural network hardware architectures that use pulse signals to perform the neuron's computation have been proposed and investigated [83]. A popular approach for implementing hardware neural networks is stochastic computation [84]. The advantage of stochastic computing is that the computing elements can be realized with smaller circuits than the conventional arithmetic ones. In stochastic computing systems, a global clock provides a time interval during which each weight value is defined. A pulse stream signal and its pulse density, i.e., the probability of the presence of a pulse during the interval, are used to represent a vector value. Thus, various computations can be implemented with a simple gate circuit.
Moran et al. [85] proposed a novel HW/SW hybrid SOM implementation using stochastic computing. Several stochastic block designs were implemented as the squared Euclidean distance and the WTA similarity check hardware. The proposed SOM was implemented on a hard processor system (HPS) and an FPGA. The WTA unit, implemented on the FPGA, carries out the WTA computation and returns the winning neuron index to the HPS. Then, software running on the HPS performs the weight update of all neurons. In this way, the SOM learning is carried out within a combined HW/SW codesign.
As another pulse-based SOM architecture, Hikawa [16] proposed a hardware SOM that uses a phase-modulated pulse signal to represent the weight value of neurons. The elements of the input and weight vectors are given as the phase of the input and internal carrier signals, and a digital phase-locked loop (DPLL) is employed as a computing element because the operation of the DPLL is very similar to that of the SOM's computation. The vector distance is computed as a phase difference of the two signals, and the winner is found by a binary-tree type BMU search circuit. Another DPLL-based SOM is proposed in [86]. It employs a new WTA method, in which the winner neuron is determined by the competition between the neurons. In all the neurons, the similarity of the carrier signals is given as pulse signals. The accumulation of these pulse signals is carried out within a digital counter. The neuron whose counter overflows first is chosen as the winner. To determine the winner, these neurons compete with each other on the time it takes for the counter to overflow. The winner neuron then spreads an update pulse signal, and its frequency is halved when it goes through other neurons. As a result, the negative-powers-of-two type of neighborhood function is implemented. The proposed winner search method does not require global communication between neurons, which makes the architecture scalable.
Hikawa [87] proposed a vector classifier based on the WTA neural network (WTANN). In the WTANN, the input and weight vectors are represented by frequencies of carrier signals, and the weight update is carried out with a digital frequency-locked loop (DFLL). The winner search operation is implemented by using frequency comparators distributed among all neurons, which makes it easy to increase the number of neurons in the proposed approach. In [88], a modified winner search method is proposed, where a cycle slip detector is employed to estimate the frequency difference of the signals. Since the number of cycle slips increases in proportion to the frequency difference of the two signals, the similarity between the signals in frequency is assessed by counting the cycle slips. A VHDL simulation revealed that the accuracy of the WTA operation of the proposed method is much better than that of the frequency comparator-based WTA circuit. However, the above-mentioned WTANNs are not classical SOM implementations since they perform the learning without the neighborhood function, thus violating the topology preservation. On the other hand, the proposed WTANNs were used to test and demonstrate the use of the frequency-modulated pulse signal in the WTA function, i.e., vector distance computation and BMU search. In addition, the use of the frequency-modulated signal with the DFLL was also adopted to build SOM hardware in [25]. In the proposed SOM, the vector values are conveyed by the frequency-modulated signals and the DFLL is used for the neuron's computation. For the winner search, a cycle slip detector is also employed. In [54], this DFLL-based SOM is improved by employing the triangular neighborhood function. This function is implemented without multipliers by using pulse-width-modulated signals spread from the winner neuron.
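A standard textbook illustration of computing with pulse densities is multiplication with a single AND gate: if two independent streams carry pulses with probabilities p and q, their bitwise AND carries pulses with probability p·q. The sketch below simulates this in software; it is a generic example of stochastic computing and is not taken from the surveyed SOM designs. The stream length and the seeded generator are arbitrary choices.

```python
import random


def to_stream(p, length, rng):
    """Encode a value p in [0, 1] as a random pulse stream of the given length."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def density(stream):
    """Decode a stream back to a value: fraction of time slots holding a pulse."""
    return sum(stream) / len(stream)


def stochastic_multiply(a, b):
    """One AND gate per time slot multiplies the two encoded values."""
    return [x & y for x, y in zip(a, b)]


rng = random.Random(0)                 # fixed seed for repeatability
a = to_stream(0.5, 20000, rng)
b = to_stream(0.4, 20000, rng)
product = density(stochastic_multiply(a, b))   # close to 0.5 * 0.4 = 0.2
```

The price of the tiny circuit is precision: the result is only an estimate whose accuracy improves with the stream length, which is the usual trade-off reported for stochastic computing.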
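The counter-overflow competition and the halving of the update pulse frequency can be abstracted as follows. The sketch works on idealized, pre-generated pulse trains and ignores all DPLL details of [86]; the function names and the tie-breaking rule are assumptions.

```python
def overflow_winner(pulse_trains, counter_limit):
    """Return (neuron index, tick) of the first counter to reach the limit."""
    counters = [0] * len(pulse_trains)
    for tick in range(len(pulse_trains[0])):
        for n, train in enumerate(pulse_trains):
            counters[n] += train[tick]     # accumulate similarity pulses
        full = [n for n, count in enumerate(counters) if count >= counter_limit]
        if full:
            return full[0], tick           # assumed tie-break: lowest index
    return None                            # no counter overflowed in the window


def update_pulse_rate(winner_rate, hops):
    """Update pulse frequency seen by a neuron `hops` steps from the winner."""
    return winner_rate / 2 ** hops
```

Since the update strength of a neuron is proportional to the pulse rate it receives, halving the rate at every hop yields strengths of 1, 1/2, 1/4, and so on, i.e., the negative-powers-of-two neighborhood function.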

G. Off-Chip Learning Architecture
The works listed in Table III can be divided into two categories: the first, called on-chip learning, where the learning algorithm is also implemented in hardware; and the second, called off-chip learning, where the HW implementation performs only the recall operation. The recall operation, as shown in Algorithm 1, requires only the BMU search without the weight vector update operation. Thus, only the vector distance calculation and the argmin function are implemented in hardware. In the off-chip learning approach, weight vectors are trained beforehand, often by software running on a computer, and are then loaded into the hardware to perform the recall operation.
Li et al. [89] developed an SOM-based positioning scheme for a continuous crystal-based PET detector. The SOM training phase was accomplished off-line with MATLAB software, and the test phase of the scheme was implemented on an FPGA. Similarly, Hikawa et al. [90] proposed a hardware hand sign recognition system with a hybrid network called the SOM-Hebb classifier. The weight vectors were trained by an off-chip computer, and the SOM implemented on an FPGA performed the recall function only. Other examples of hardware SOMs working in the recall mode were proposed by Kurdthongmee [91] and Huang et al. [20] for image compression. In the former work, a memory-based BMU search unit was proposed, resulting in a reduced final number of colors. In Huang's work, a reconfigurable complete-binary-adder-tree (RCBAT), in which the employed arithmetic units can be reused, was devised to reduce the hardware usage. In addition, by distributing the codebook into parallel PE blocks, the proposed design successfully demonstrated a high compression speed of up to 500 frames/s.

A. Implementation Platforms
To demonstrate or investigate the proposed HW SOM architectures or their computing components, authors use various methods and tools, including simulation and/or physical experimental validations. Some of the works presented in Fig. 11 and Table III were investigated by simulations (almost one fifth of the proposed HW SOM implementations). Among simulation validation tools, we can find examples of VHDL simulation [87], [88], HSpice simulation [65], and MATLAB simulation [71]. Simulation is a very flexible and useful approach for validating algorithms or circuits before their physical implementation. On the other hand, physical experimental verification is the common way of validating HW SOM implementations, where CMOS and/or FPGA technologies are targeted (almost 80% of the reported works in Fig. 11 and Table III).

B. Datasets/Target Applications
Table III summarizes the datasets and target applications used for validation of the proposed HW SOM architectures: artificial datasets, image coding and compression experiments, and classification tasks on publicly available datasets (MNIST, IRIS, and so on). From Table III, it can be noticed that most of the proposed HW SOM architectures were validated by using in-lab artificial datasets, which are often simple datasets used to confirm the SOM's basic functions, such as its topology-preserving nature.
Many HW SOMs are applied to image coding or image compression applications. The main reason for this type of validation is that the obtained results are easily observable. Indeed, in these applications, the weight vectors are often represented as colors from the input dataset, that is, from the images used. In a true color digital image, each pixel is represented by R-G-B components, whose intensity is coded with 24 bits (8 bits per component). Thus, the size of a high-resolution image can be quite large (number of pixels × 24 bits). In image compression, the main goal is to reduce the overall data size of the manipulated images for storage or transmission purposes. One method of compressing an image is to reduce the total number of colors used to code it, which is commonly called color compression. Color compression limits the number of colors in an image and can be done with SOMs. Indeed, when random input vectors (i.e., pixels) from an image are presented to it, the SOM selects the best colors by using its vector quantization capability. The result of this operation is a palette of a limited number of the most representative colors in the input image. This palette is simply the set of weights of the trained map. To produce the compressed image, each pixel of the initial image is replaced with the index of the most similar color from the SOM palette (obtained through the SOM recall operation). Thus, instead of using 24 bits to code a pixel, each pixel is coded with the number of bits necessary to uniquely identify each neuron in the map (i.e., 8 bits for a 16 × 16 map). An example of color compression with an SOM is shown in Fig. 12 [17]. The original image, the color palette consisting of 256 colors obtained after training a 16 × 16 SOM network, and the image reconstructed by using the palette colors are shown. In addition, instead of using only one pixel as the input vector of the SOM, B × B blocks of pixels can be presented as input vectors and quantized by the SOM in the same way as previously presented. The HW SOMs in [13], [17], [18], [20], [28], [40], [67], [68], [70], [71], [76], [91] were applied to real-time color/image compression.
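The color compression procedure described above can be sketched in software as follows. This is only an illustrative sketch, not the algorithm of any of the cited HW SOMs: the function names, the linearly decaying learning rate and radius, and the square (step) neighborhood are all assumptions chosen for brevity.

```python
import random

def nearest_index(palette, pixel):
    """Recall step: index of the palette color closest to `pixel`
    (squared Euclidean distance in RGB space)."""
    return min(range(len(palette)),
               key=lambda i: sum((p - q) ** 2 for p, q in zip(palette[i], pixel)))

def train_palette(pixels, rows, cols, iterations=2000, seed=0):
    """Train a rows x cols SOM on RGB pixels and return its weights as a flat palette.
    The schedules below (linear decay, step neighborhood) are assumptions."""
    rng = random.Random(seed)
    palette = [list(rng.choice(pixels)) for _ in range(rows * cols)]
    for t in range(iterations):
        frac = 1.0 - t / iterations
        alpha = 0.5 * frac                     # learning rate
        radius = max(rows, cols) / 2 * frac    # neighborhood radius
        x = rng.choice(pixels)                 # random input vector (one pixel)
        br, bc = divmod(nearest_index(palette, x), cols)
        for i, w in enumerate(palette):
            r, c = divmod(i, cols)
            if max(abs(r - br), abs(c - bc)) <= radius:
                for k in range(3):
                    w[k] += alpha * (x[k] - w[k])
    return palette

def compress(pixels, palette):
    """Replace each pixel by the index of its most similar palette color."""
    return [nearest_index(palette, p) for p in pixels]

def reconstruct(indices, palette):
    """Rebuild an approximate image from the indices and the palette."""
    return [tuple(round(v) for v in palette[i]) for i in indices]

def bits_per_pixel(num_neurons):
    """Bits needed to uniquely identify one neuron of the map."""
    return max(1, (num_neurons - 1).bit_length())
```

With a 16 × 16 map, `bits_per_pixel(256)` gives the 8 bits per pixel mentioned above, compared with 24 bits for the original true color pixel; the palette itself must also be stored or transmitted.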
Classification tasks are another popular way to validate HW SOMs. In [4], a speech coding benchmark provided by EPFL was used to measure the training performance of the proposed SOM. Appiah et al. [52] applied their HW SOM to recognize handwritten digits. The authors used the Modified National Institute of Standards and Technology (MNIST) database [92], which is a database of handwritten digit images. In this database, each image is a 28 × 28 greyscale image, which can be considered as an input vector whose dimension is 28 × 28 = 784. An example of MNIST data clustering [69] is shown in Fig. 13, where 784-D vectors were mapped onto a two-dimensional 16 × 16 SOM. It can be seen that, due to the SOM's topology-preserving nature, instances of the same digit are assigned to adjacent neurons. Note also that clusters representing different digits of similar shape are assigned to neighboring neurons. In this way, the SOM can be used to visualize relations among input vectors. Other examples of the use of the MNIST dataset for recognition or clustering applications can be found in [53], [69], [71]. Another example of a classification task used to validate an HW SOM architecture is the hand sign recognition system proposed in [90]. In the proposed system, preprocessed input images were applied to an off-line trained SOM used in the recall mode, which made it possible to achieve real-time hand sign recognition (60 frames/s with a recognition accuracy of 97.1%). Moreover, the IRIS dataset, which is widely used to evaluate statistical classification techniques in machine learning, has also been used for HW SOM validation. The IRIS dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Each sample consists of four features: the length and the width of the sepals and petals, in centimeters. The classification performance of the works in [85], [87] was demonstrated by using the IRIS dataset. Finally, the mixed-mode classifier proposed by Shah et al. [79] was trained to identify the sound source, whether it is a generator, truck, or car.
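A common way to use a trained SOM as a classifier in recall mode can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure of any cited work: the map is assumed to be already trained, each neuron is labeled by a majority vote of the training samples it wins, and all function names are hypothetical.

```python
from collections import Counter

def bmu(weights, x):
    """Index of the best matching unit: the neuron whose weight vector
    is closest to x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def label_neurons(weights, samples, labels):
    """Give each neuron the majority label of the training samples it wins.
    Neurons that win no sample stay unlabeled (None)."""
    votes = [Counter() for _ in weights]
    for x, y in zip(samples, labels):
        votes[bmu(weights, x)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def classify(weights, neuron_labels, x):
    """Recall: return the label attached to the BMU of input x."""
    return neuron_labels[bmu(weights, x)]
```

Because only the BMU search is executed during recall, no weight update is needed at run time, which is consistent with the off-line trained, recall-only use described for the hand sign recognition system.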
Another data clustering application based on the SOM's vector quantization capability is presented in [15]. Lachmair et al. [15] used hyperspectral image data of the lunar crater volcanic field (LCVF) in Nevada, taken by NASA's AVIRIS sensor, as the clustering target. Since the LCVF dataset is fairly large, consists of high-dimensional vectors (in the hyperspectral image), and has a complex cluster structure, the clustering of these hyperspectral images is seen as a challenging and nontrivial task. Similarly, Kim et al. [63]

C. Performance Measure
The processing speed of the HW SOM architectures is usually evaluated in connection updates per second (CUPS). This metric quantifies the number of weight updates that the SOM system performs per second during the learning process

$$\mathrm{CUPS} = \frac{D \times N \times f}{T_I} \tag{11}$$

where D is the input vector dimension, N is the total number of neurons in the SOM network, T_I is the total time in clock cycles needed to finish one learning iteration, and f is the maximal operating frequency of the HW architecture.
The performances of the HW SOM architectures reported in the literature so far are shown in Fig. 14 (in millions of CUPS (MCUPS) over the period 1995-2021), in Fig. 15 (in MCUPS as a function of the number of neurons and input vector dimension, as raw and normalized values), and in the "MCUPS" column of Table III. From the presented results, it can be noticed that the highest performance of 5.7 GCPS was achieved by the approach proposed by Lachmair et al. [35]. It should be mentioned that this reported value is given in CPS (evaluated only in the recall phase, without weight updates) and not in CUPS. Thus, to obtain an order of magnitude in MCUPS, this value should be divided by a factor of at least 2. The second best performance of 109 800 MCUPS was achieved by the AW-SOM [67]. This is not surprising and is mainly due to the modified SOM algorithm at the heart of this approach, which uses no neighborhood function and no BMU search at all. As a reminder, the BMU search operation is the biggest performance bottleneck of all HW SOM architectures. However, the AW-SOM does not have all the features provided by conventional SOMs because the neighborhood function and the topological relations of the neurons were omitted. Another way of presenting the performances of the HW SOM architectures is to normalize the values of MCUPS by the number of neurons and the input vector dimension. These results are presented in Fig. 15(b). According to (11), the normalized MCUPS is simply the ratio between the maximum operating frequency and the time needed to finish one learning operation. This value is more representative, as it highlights the most optimized HW SOM architectures (higher is better).
In addition, it should also be noted (see Fig. 14) that the performances of the most recent HW SOM implementations are almost all in a range between 10 and 100 Giga CUPS. This can be explained by the use of the most recent FPGA devices and by the adoption of the architectural choices discussed earlier in this article. It should also be highlighted that the performance of the hardware SOM with the off-chip architecture presented in [20] was given in Figs. 14 and 15 and Table III in connections per second (CPS) (evaluated only in the recall phase).
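The CUPS metric and its normalized form can be illustrated with a few lines of code. The sketch below assumes that CUPS is computed as D × N × f / T_I, which is consistent with the variable definitions and with the statement that the normalized value reduces to the ratio of the operating frequency to the time of one learning operation. The function names and the numerical values in the usage note are hypothetical and do not correspond to any surveyed architecture.

```python
def cups(dim, neurons, freq_hz, cycles_per_iteration):
    """Connection updates per second, assuming CUPS = D * N * f / T_I.
    dim: input vector dimension D
    neurons: total number of neurons N
    freq_hz: maximal operating frequency f, in Hz
    cycles_per_iteration: clock cycles T_I needed for one learning iteration
    """
    return dim * neurons * freq_hz / cycles_per_iteration

def mcups(dim, neurons, freq_hz, cycles_per_iteration):
    """The same metric expressed in millions of CUPS."""
    return cups(dim, neurons, freq_hz, cycles_per_iteration) / 1e6

def normalized_cups(dim, neurons, freq_hz, cycles_per_iteration):
    """CUPS divided by D * N, which reduces to f / T_I
    (learning iterations per second)."""
    return cups(dim, neurons, freq_hz, cycles_per_iteration) / (dim * neurons)
```

For a hypothetical 16 × 16 map with 3-D inputs clocked at 100 MHz and needing 20 cycles per learning iteration, `mcups(3, 256, 100e6, 20)` evaluates to 3840 MCUPS, while `normalized_cups` returns 5 million learning iterations per second regardless of D and N. This is why the normalized value isolates how well an architecture is optimized from how large it is.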

VI. OPEN RESEARCH PROBLEMS
From the previous discussion on the different aspects of the HW SOM architectures found in the literature, it can be concluded that significant improvements have been achieved in the last few decades at all levels: circuit, algorithmic, and architectural. Consequently, these improvements pave the way to using the SOM algorithm in a large variety of applications with specific data and computing requirements [93]-[98].
Although the SOM algorithm is not new, its simplicity and its capability to tackle a large variety of problems, from vector quantization and clustering to data visualization and image/video processing, keep it attractive now more than ever. This is especially true today, when big data (BD) and Internet of Things (IoT) infrastructures are at the origin of a tremendous amount of data of different types, continuously generated and supplied by a myriad of sensors deployed everywhere.
Making HW SOMs ubiquitous and easily accessible as HW IPs and accelerators in many application fields (and on many HW platforms) requires solving some challenging research problems that are still open. These problems can be categorized into four groups.
1) Huge Scalable HW SOM: Recent state-of-the-art works show that the new trends in the design of HW SOMs are oriented toward scalable and expandable architectures [18], [19], [22], [26], [40], [69], [76]. This trend is in line with the explosive growth of data volumes witnessed over the last few years, where very large SOM architectures in terms of neurons are necessary to satisfy the needs related to these huge data volumes. The design of new high-performance, scalable, and expandable HW SOM architectures will make this problem easier to tackle, with these new architectures used as basic building blocks for larger networks of different sizes.
2) Fast BMU Search: A high degree of scalability and the possibility of building huge HW SOMs bring to the forefront the problem of BMU search in such networks, an operation that is identified as the biggest performance bottleneck even in today's HW SOM architectures. Potential alternatives could be newly adapted SOM algorithms, such as the one proposed in AW-SOM [67], or shortening the BMU search operation by exploiting the inherent topology preservation property of SOMs [99].
3) High Configurability: The new HW SOM architectures have to be not only as high performance and scalable as possible but also highly configurable and adaptable to a large variety of applications. Ideally, the new HW SOM architectures should be application-agnostic, and all application-specific needs should be provided as a simple list of parameters that are configurable online and in real time.
4) Toward Growing HW SOM Models: High configurability of the HW SOM architectures also implies the possibility of changing the number of neurons dynamically or adapting it to the learning requirements of a given application. Consequently, dynamic and growing counterparts of the SOM algorithm should be targeted in HW SOM architectures. For instance, the pioneering work presented in [100], where a growing grid (GG)-based algorithm has been implemented in hardware, is an example of these dynamic and growing HW architectures. Ideally, the growing neural gas (GNG)-based models, which are quite challenging but more advantageous than other growing models, should find their implementations in HW in the near future.
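To make the idea behind the fast BMU search problem more concrete, the sketch below contrasts an exhaustive BMU search with a greedy local search that walks over the grid toward better matching neighbors. The greedy variant is only an illustrative heuristic and is not the method of [99]: it assumes a well-ordered (topology-preserving) map, it may stop at a local minimum on a poorly organized map, and the function names and the 4-neighbor grid are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def full_bmu_search(weights, x):
    """Exhaustive search over a rows x cols grid of weight vectors.
    Evaluates the distance for every neuron."""
    rows, cols = len(weights), len(weights[0])
    return min(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: sq_dist(weights[rc[0]][rc[1]], x))

def greedy_bmu_search(weights, x, start=(0, 0)):
    """Greedy descent on the grid: repeatedly move to the best of the four
    direct neighbors until none is closer to x than the current neuron.
    Returns the neuron found and the number of distance evaluations."""
    rows, cols = len(weights), len(weights[0])
    best, best_d = start, sq_dist(weights[start[0]][start[1]], x)
    evaluations = 1
    while True:
        r, c = best
        improved = False
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                d = sq_dist(weights[nr][nc], x)
                evaluations += 1
                if d < best_d:
                    best, best_d, improved = (nr, nc), d, True
        if not improved:
            return best, evaluations
```

On an ordered map, the greedy walk visits only the neurons along its path, so the number of distance evaluations grows with the side length of the map rather than with the total number of neurons. The price is the loss of the guarantee that the global BMU is found.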

VII. CONCLUSION
In this article, a survey of hardware SOM implementations has been presented. The SOM algorithm, thanks to its simplicity and its capability to tackle a large variety of problems (vector quantization, clustering, data visualization, image/video processing, and so on), is still attractive in many application fields and is gaining considerably more attention in today's BD and IoT infrastructures. Its inherent properties of topology preservation and unsupervised learning of processed data without any prior knowledge put it at the forefront of candidates for data reduction. However, its high computational cost makes online, real-time, high-performance SOM processing mostly reserved for specific hardware implementations. In this article, we gave an overview of the hardware, application-specific implementations of the SOM algorithm: the most widely used computing blocks, architectures, design choices, adaptation, and optimization techniques that have been reported in the literature so far in the field of hardware SOMs. Moreover, an overview of the main challenges and trends for their ubiquitous adoption as hardware accelerators in many application fields has also been given. This article is expected to be useful for researchers in the areas of artificial intelligence in a broader sense, real-time hardware architecture, and system design with tight design constraints in terms of performance (timing), power, and area.
It can be noticed that almost a half of the used neighborhood functions in the HW SOMs in the last 25 years (1995-2021) is

Fig. 15. Overview of performances of the HW SOM architectures in millions of CUPS (MCUPS) as a function of the number of neurons and input vector dimension: (a) raw values and (b) normalized by the number of neurons and input vector dimension.

TABLE I ACRONYMS AND TERMS USED FREQUENTLY IN THIS ARTICLE