A Digital Processing in Memory Architecture Using TCAM for Rapid Learning and Inference Based on Spike Location Dependent Plasticity

In this paper, we present a digital processing in memory (DPIM) architecture configured as a stride edge-detection search frequency neural network (SE-SFNN), which is trained through spike location dependent plasticity (SLDP), a learning mechanism reminiscent of spike timing dependent plasticity (STDP). This mechanism allows rapid online learning as well as a simple memory-based implementation. In particular, we employ a ternary data scheme to take advantage of ternary content addressable memory (TCAM). The scheme uses a ternary representation of the image pixels, and the TCAMs are used in a two-layer format to significantly reduce the computation time. The first layer applies several filtering kernels, and the second layer reorders the pattern dictionaries of the TCAMs to place the most frequent patterns at the top of each supervised TCAM dictionary. Numerous TCAM blocks in both layers operate in a massively parallel fashion on digital ternary values. No complicated multiply operations are performed, and learning proceeds in a feedforward scheme. This allows rapid and robust learning as a trade-off with the parallel memory block size. Furthermore, we propose a method to reduce the TCAM memory size using a two-tiered minor to major promotion (M2MP) of frequently occurring patterns. This reduction is performed concurrently with the learning operation, incurring no preconditioning overhead. We show that with minimal circuit overhead, the required memory size is reduced by 84.4%, the total clock cycles required for learning decrease by 97.31%, and the accuracy decreases by only 1.12%. We classified images with 94.58% accuracy on the MNIST dataset. Using a 100 MHz clock, our simulation results show that MNIST training takes about 6.3 ms while dissipating less than 4 mW of average power. In terms of inference speed, the trained hardware can process 5,882,352 images per second.

generated from the user's edge device. Due to these shortcomings, artificial intelligence of things (AIoT) technology that can learn on edge devices was recently introduced [2].
Unlike existing convolutional neural network (CNN) and deep neural network (DNN) methods, in which heavy iterative computation between tensors is performed during training [3], [4], an SNN has a simple structure that sends a spike when the potential accumulated from incoming spikes exceeds a critical point, which invokes STDP for feedforward-style training [5]. This method does not require complicated feedback computations. Thus, it is possible to design a neural network that trains directly at the circuit level to enable low power consumption and low latency [6]. One unique feature of SNNs is that they move information as spikes using action potentials, and the outputs are activated using accumulating membrane potentials and threshold voltages. Due to this feedforward nature, unsupervised learning is possible in that no computations are performed to optimize the difference between the ground truth and the inference value, which is a necessity for weight updates in gradient-descent-based backpropagation methods.
In this paper, we present a digital processing in memory (DPIM) architecture with a learning algorithm that is completely feedforward; no computation stage affects previous stages. Although STDP does not feed back to the input stage, it does require computation against pre-synaptic spikes [7], which adds some computational complexity. Furthermore, the input spikes must be preconditioned to desynchronize them using a Poisson distribution [8]. Processing-in-memory-based SNNs have been investigated to further reduce computational complexity and increase energy efficiency [9].
Despite the complexities associated with SNNs, they are still considered the next generational step in neural processing [10], but several challenges remain. One problem is the reduced accuracy compared to the well-established DNNs and CNNs [11]. To compensate for this deficiency, the group in [12] combined both artificial neural network (ANN) and SNN hardware. Recently, a group used an SNN as a low-level extractor for a spiking hyper-dimensional network to make the system more robust in terms of accuracy [13]. Another problem associated with SNNs is scalability. The Loihi architecture introduced a scalable design using multiple cores in a chip as well as multiple chips on a board [14]; however, it required complex networking controllers to manage data as packets. PIM architectures are being developed to address the scalability issues associated with increasing input and output sizes [15], [16], [17]. These approaches all take advantage of advanced non-volatile memories to store weights as analog values, multiply them with analog inputs, and then utilize the inherent current summation capability of the memory structure. Such structures are considered the most compact form of the multiply and accumulate (MAC) function, also used for ANNs [18]. There are two major problems associated with analog PIMs. First, the process, voltage, and temperature variations in the multiple D/A and A/D converters required to convert input and output data can result in inaccuracies. Second, the asymmetry between weight update values, which causes the hardware implementation to deviate from software-predicted results, also contributes to inaccuracies [19]. Furthermore, a DPIM cannot take advantage of the MAC operation inherent in these analog PIM structures. A DPIM was studied to perform SNN inference using fused weights [20]; in that work, the inference task was optimized by reducing the overhead of reading the weights.
We introduce spike-location-dependent plasticity (SLDP), a learning mechanism reminiscent of the STDP, for use in image classification. For the proposed method, we use a ternary content addressable memory (TCAM), a memory that outputs the stored address upon entering a search pattern. If higher frequency patterns searched in the TCAM are configured to be moved to higher address locations, it is possible to rank the input patterns. Using this memory-based approach, we demonstrate both rapid learning and inference on MNIST images.
In the case of SNNs, STDP learning occurs by increasing or decreasing the amount of influence that each pre-synapse has on subsequent post-synapses. The post-synapse is activated through the summation of all the connected pre-synaptic weights over time. Our SLDP has a similarity in that learning proceeds by observing the frequencies of occurrences of all observed patterns in a specified block space and subsequently reinforcing or weakening the influence of the specified block space on the final outcome. The entire image space is composed of these block spaces, and the sum of all the influences from all the block spaces indicates the value representing the input image. Thus, if many images that represent a class are put through the SLDP, the similarities will be reinforced, and the summed value will increase. Note that since we do not need to track back to the previous time for updates, the learning is completely feedforward.
Although this work primarily focuses on supervised learning, the SLDP can be applied to unsupervised learning as well. The SLDP can be considered a spiking-based algorithm since the first layer of our stride edge-detection search frequency neural network (SE-SFNN) converts a ternary image signal into a 1-bit spiking data representation for the second layer. These spikes per block space are formatted as a word pattern and are put through the SLDP to update the pattern's address in the second layer. The SE-SFNN can easily be parallelized in hardware since the block space ranking process is essentially independent of other block space ranking processes. Due to this, frequently occurring results can be learned online, and the inference can occur in a very short time.
The major innovations and contributions of this paper are as follows.
• Many TCAM blocks are configured to perform PIM tasks on the MNIST dataset during both the training and inference phases. This drastically improves throughput.
• It is a digital PIM without multiplier operations. Therefore, it is robust against process variations and can be fast and compact.
• A completely feedforward learning algorithm is created for the TCAM PIM. This enables a simple data flow and thus facilitates hardware realization.
• We introduce data filtering as well as data striding and neighbor searches, all utilizing the PIM to improve accuracy with minimal impact on processing time.
• Furthermore, the TCAM memory size required for learning and inference is significantly reduced by limiting the learned content to the most frequently accessed values. The algorithm to reduce the TCAM content also utilizes the PIM to concurrently perform the reduction task and the learning task. This is performed as the training images are streaming in.
Section II describes the hardware building blocks for the SE-SFNN. Section III describes the SLDP algorithm. Section IV describes the methods to improve accuracy. Section V shows how TCAM blocks are organized in our algorithm development process. Section VI shows how significant improvements are made in the memory configuration for the SLDP, and the subsequent performance increases in the SE-SFNN with the minor-to-major promotion (M2MP) scheme. Finally, Section VII summarizes the entire architecture, discussing the similarities and differences with SNNs, followed by the conclusion.

II. HARDWARE BUILDING BLOCKS
A. TCAM
A content addressable memory (CAM) is a data storage device used for very high-speed search applications. CAM compares input search data against a table of stored data and returns the address of the matching data in a single clock cycle [21]. A TCAM is a type of CAM that allows ''don't cares'' in addition to 1s and 0s when searching for patterns, providing more flexibility in searching.
A typical TCAM has word lines and match lines; the word lines allow writing content into the memory. The match lines, one per row address, remain high if the searched pattern and the stored contents match, indicating a hit or hits; otherwise, the unmatched match lines are pulled low. Two bits are used to express the ternary values of a high, a low, and a don't care. Fig. 1 summarizes the operation of the TCAM, where the search pattern is shown at the top of the memory array.
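To make these match semantics concrete, the following minimal Python model (our illustration, not the hardware) emulates a ternary search: an 'X' stored in a row, or supplied in the search pattern, matches either bit value, and the rows that a real TCAM compares in parallel in a single clock cycle are checked here with a loop.

```python
# Minimal software model of a TCAM search (illustration only; real
# hardware compares all rows in parallel in a single clock cycle).
def tcam_search(table, pattern):
    """Return the addresses of all rows that match the search pattern.
    'X' denotes a don't care in either the stored word or the pattern."""
    return [addr for addr, word in enumerate(table)
            if all(s == "X" or p == "X" or s == p
                   for s, p in zip(word, pattern))]

# 4-bit example: row 2 stores a don't care in its last position.
table = ["0000", "0001", "001X", "0100"]
print(tcam_search(table, "0011"))  # -> [2]; the don't care bit matches
```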
TCAMs, on the other hand, have not been used for large-scale pattern searches due to their inherently high power consumption and large cell size. Our group demonstrated that such a large system is possible through the use of a dynamic memory configuration [22].

B. DESIGNED CIRCUIT FOR FEASIBILITY VERIFICATION
In our SFNN, the patterns are sorted in memory according to the frequency at which they are searched. Specifically, patterns with high search frequencies are stored at higher addresses, and patterns with low search frequencies are stored at lower addresses. If any pattern is searched after this ranking stage of the learning, the match line at the location where the pattern is stored becomes high. This is the inference operation, where the match line location indicates how often the pattern was searched during the learning stage. Two-bit binary values are output for the inference: 10₂ if the output address is greater than or equal to the upper threshold, 00₂ if it is less than or equal to the lower threshold, and 01₂ if it is between the two thresholds.
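This 2-bit encoding can be captured by a small helper function; the thresholds and the convention that higher addresses hold more frequent patterns follow the text above, while the function itself is our illustrative naming.

```python
def two_bit_output(match_addr, lower_thr, upper_thr):
    """Encode a matched row address as the 2-bit inference output
    (higher addresses hold more frequently searched patterns)."""
    if match_addr >= upper_thr:
        return 0b10  # frequently searched pattern
    if match_addr <= lower_thr:
        return 0b00  # rarely searched pattern
    return 0b01      # in between
```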
To perform the learning operation, a control circuit that feeds information back to the TCAM was added. We define this structure as a computational TCAM (CTCAM). The hardware structure is largely divided into three modules: a search count storage module, a control module, and a system clock module. Fig. 2 shows the block diagram of an FPGA-implemented prototype circuit for concept verification. It has a 16 × 4-bit TCAM, whose 16 rows represent all possible binary 4-bit patterns.

1) SEARCH COUNT STORAGE MODULE
The search count storage module is located at the upper right side of Fig. 2 and is connected to the match lines of the TCAM. A counter is attached to each match line to store the number of times the corresponding pattern is searched. The counter values are then compared to find the addresses that need to be sorted, and the result is transferred to the control module.

2) CONTROL MODULE
The TCAM is controlled by the control module to change the value of the pattern stored in the input address and the address immediately above it. It is located in the lower-right corner of Fig. 2. The operation of changing the storage position of the pattern was made possible by configuring two shift registers in the TCAM. Two position-swapping patterns are retrieved by the two shift registers, and the shift registers are configured to write back to the opposite positions. The operation of swapping the storage location consumes six clocks. Fig. 3 shows the clock step processes in which the control circuit performs the position-swapping operation of the TCAM.
This miniature CTCAM operation was tested on an FPGA, where TCAM was emulated using a behavioral module.
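Behaviorally, the six-clock swap is a compare-and-exchange of adjacent rows. A Python sketch of its net effect (the shift registers and clock steps of Fig. 3 are abstracted away) might look as follows.

```python
def swap_up_if_more_frequent(patterns, counts, addr):
    """Net effect of the six-clock swap: exchange the pattern at `addr`
    with the row immediately above it (the next higher address) when its
    search count is larger, so frequent patterns bubble upward."""
    above = addr + 1  # higher address = higher rank in this model
    if above < len(patterns) and counts[addr] > counts[above]:
        patterns[addr], patterns[above] = patterns[above], patterns[addr]
        counts[addr], counts[above] = counts[above], counts[addr]
```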

III. SLDP LEARNING ALGORITHM
A. CTCAM
The CTCAM for training an image is arranged by the size of the image. Subsequently, each CTCAM learns a pattern that mainly appears at a specific location in the image.
For computational convenience, the original 28 × 28 MNIST images are truncated to 27 × 27 images. Since the CTCAMs are arranged primarily to search for the most frequently occurring 3 × 3 image subset patterns, we opted to truncate rather than rescale the 28 × 28 images to 27 × 27; rescaling would distort and blur the images. A 27 × 27 truncation is the largest image size that can be evenly subdivided into 3 × 3 subsets. When training 27 × 27 MNIST images, a 9 × 9 grid of CTCAMs, each learning 3 × 3 patterns, is arranged. This CTCAM arrangement is shown in Fig. 4. The highlighted square on the right represents a CTCAM with a 3 × 3 pattern as the search input. Each CTCAM is initially loaded with all possible patterns in an orderly fashion. Thus, the grid on the right is an array of TCAM memories representing a label.
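The truncation and block decomposition can be sketched in a few lines of NumPy; this is our formulation of the segmentation just described, not code from the design.

```python
import numpy as np

def to_blocks(img28):
    """Truncate a 28x28 image to 27x27 (no rescaling) and split it into
    the 9x9 grid of non-overlapping 3x3 blocks, one per CTCAM."""
    img27 = img28[:27, :27]
    return img27.reshape(9, 3, 9, 3).swapaxes(1, 2)  # (9, 9, 3, 3)

blocks = to_blocks(np.zeros((28, 28), dtype=np.int8))
print(blocks.shape)  # (9, 9, 3, 3); blocks[i, j] feeds CTCAM (i, j)
```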
Training is performed by repeatedly searching for training data with the same label. In the case of a pattern that occurs frequently in the corresponding location, the search count increases. At certain intervals, the frequently occurring pattern data is moved to the upper address in the CTCAM, thereby disrupting the previous order. Since the learning is performed through the frequency of the searched pattern, we named it the search frequency neural network (SFNN).
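A behavioral sketch of this SLDP loop for a single CTCAM follows. The hit counting and periodic reordering are as described above; the list-based TCAM model and the 1000-image sorting interval (used later in the text to respect the 10-bit counters) are our simplifications.

```python
def train_ctcam(order, counts, block_stream, sort_interval=1000):
    """SLDP sketch for one CTCAM. `order` lists the stored 3x3 patterns
    (e.g., 9-bit strings; index = address, all 512 patterns pre-stored).
    Frequent patterns move to higher addresses at set intervals. Python
    ints do not overflow, so the hardware's 10-bit counter limit only
    motivates the interval here."""
    for n, block in enumerate(block_stream, start=1):
        addr = order.index(block)      # TCAM search: 1 clock in hardware
        counts[addr] += 1              # match-line counter
        if n % sort_interval == 0:     # periodic rank sorting
            ranked = sorted(zip(counts, order))  # ascending by count
            counts[:] = [c for c, _ in ranked]
            order[:] = [p for _, p in ranked]
```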

B. DETECTING ALGORITHM
We define a CTCAM array that has been trained using training data with a specific label as a learning group. During inference, each image is searched in parallel across all previously trained learning groups. If a pattern belongs to the frequently learned patterns, the CTCAM outputs a 2; if it ranked low, it outputs a 0; otherwise, it outputs a 1. To provide these 2-bit outputs, an upper threshold and a lower threshold were set on the pattern storage addresses. The left side of Fig. 5 shows the case when a pattern belonging to the higher-occurring patterns learned by the corresponding learning group is searched, and the right side shows the case when a pattern belonging to the lower-occurring patterns is searched.

It is important to note that the thresholds can affect accuracy. If a threshold is too low, many patterns contribute to the scores, and feature discrepancies may not be distinct. If it is too high, only specific sets of patterns contribute to the scores, and similar patterns cannot contribute. Even at optimal threshold levels, frequently occurring patterns of equal frequency will be bunched together in the CTCAM and could be split by the threshold level; this bunching order also depends on the initial order of the patterns. In our implementation, we found sufficient accuracy with global threshold values and chose this approach for its simplicity. Accuracy could be further improved by finding local threshold values for each CTCAM that do not split these bunches.

The output of a learning group is the sum of the output values of its CTCAMs. The learning group with the largest output is determined to be the inference result, as shown in Fig. 6. This means that the patterns that moved to the upper addresses during the learning process also appeared in the tested input image.
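Putting the pieces together, inference reduces to summing the 2-bit outputs of each learning group's CTCAMs and taking the largest total. The sketch below reuses two_bit_output from Section II; the container layout (a dict of ranked pattern lists per CTCAM, with blocks[i][j] holding the 9-bit pattern string for segment (i, j)) is our own.

```python
def classify(blocks, learning_groups, lower_thr, upper_thr):
    """Score each learning group (one per label) by summing the 2-bit
    outputs of its 81 CTCAMs, then return the best-scoring label."""
    scores = []
    for group in learning_groups:             # group[(i, j)] = ranked order
        total = 0
        for (i, j), order in group.items():
            addr = order.index(blocks[i][j])  # parallel search in hardware
            total += two_bit_output(addr, lower_thr, upper_thr)
        scores.append(total)
    return scores.index(max(scores))          # label with the largest sum
```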

C. RESULT
The results henceforth are obtained through Python simulations. All the TCAM operations are emulated by unrolling the parallel operations.

1) ACCURACY
After learning the MNIST images through the SFNN, the numeric handwriting images were recognized with an accuracy of 82.83%. Table 1 is a confusion matrix showing the recognition rate.

2) LEARNING CONSUMPTION TIME
Using the SFNN, we estimate the time it takes to learn the MNIST images. To learn 6000 training images, the search is performed a total of 6000 times. Since a 10-bit counter is connected to each match line of the TCAM, a total of 6 rank sorting operations are performed to prevent counter overflow; that is, a ranking operation is performed after every 1000 image searches. Table 2 shows the result of calculating the time consumed for the search. In summary, learning is completed in about 0.0415 seconds at a 100 MHz clock speed.

3) INFERENCE SPEED
Using the trained learning groups, 7,142,857 images can be processed per second. One clock cycle is consumed to send an image into the learning group. Since we send 27 × 27 images to 9 × 9 CTCAMs with 3 × 3 inputs, the output comes out as 9 × 9 × 2 bits. These bits are summed using a binary-tree-style adder, so all 162 values, each 2 bits wide, can be conservatively summed in 8 clock cycles. If the 10 summed results from the 10 learning groups are compared in pairs, the learning group that produces the largest output can be conservatively extracted in 5 clock cycles. In other words, one image is processed in a total of 14 clocks. Assuming a 100 MHz clock rate, this yields 7,142,857 images per second.

4) REQUIRED MEMORY SIZE
The total memory size required for learning 27 × 27 MNIST images using the SFNN is as follows: to train the digits 0 to 9, 10 learning groups are needed, and 81 CTCAMs per learning group are required to process 27 × 27 images segmented into 3 × 3 space blocks covering a 9 × 9 grid per image. Therefore, the total required number of CTCAMs is 810, the number of storage bits per CTCAM is 9 × 2⁹ bits, and the total number of storage bits is 9 × 2⁹ × 810 bits. Table 3 summarizes the memory consumption estimate for MNIST processing. It shows that approximately 4M cells are required, and since all 810 CTCAMs should work in parallel, each CTCAM requires less than 5K cells of memory.

IV. ACCURACY IMPROVEMENT METHOD
When light enters, the human visual system uses the difference in light intensity to find edges and subsequently the orientation of those edges [23].
In this section, we add an edge detection algorithm to the learning group to improve the SFNN's learning accuracy. We will refer to this algorithm as an edge-detection and search frequency neural network (E-SFNN). The edge detection algorithm creates a featured image by extracting edge information from a defined direction in the image. The proposed algorithm is implemented through the edge filtering TCAM module. If the image is passed through the edge filtering TCAM, an image containing edge information can be acquired without performing any arithmetic computations.
In our implementation, a vertical mask, a horizontal mask, an anti-diagonal mask, and a main-diagonal mask are used to filter the MNIST image pixels quantized to 1, 0, and don't care values. These filters extract four edge feature images from an input image. These four feature images are subsequently processed in parallel by the TCAM blocks assigned to each 3 × 3 pixel segment.

A. EDGE DETECTION METHOD
1) EDGE FILTERING TCAM
This module searches for edge information in images. In the edge filtering TCAM, the presence or absence of horizontal, vertical, anti-diagonal, and main-diagonal edges in the image is identified using a 4-bit pattern. Fig. 7 shows the values of the edge filtering TCAM's internal storage. Patterns representing horizontal, vertical, anti-diagonal, and main-diagonal edges are stored for each location, and the patterns corresponding to each edge are arranged as a group for each feature class to determine the existence of an edge at a particular location. Thus, a total of four images containing edge information can be obtained. These directional edge images are generated using a single filtering TCAM.

Fig. 8 shows an example of an operation in the edge filtering TCAM. The image block input, which can be interpreted as having both horizontal and anti-diagonal edge features, is shown. The image pattern input to the edge filtering TCAM is [1, 0.5, 0, 0], where 0.5 represents a don't care. When this ternary pattern is input to the TCAM, it outputs binary values; the horizontal and anti-diagonal edges output 1 and the rest output 0s. The figure shows how each pixel position is filtered. The red box shows the position in the image and the 2 × 2 image segment associated with that pixel position. The segment is filtered by the corresponding 8 × 4 bit TCAM associated with the pixel position to output the 4 directional edge results simultaneously.

Fig. 9 shows a detailed example of searching for an edge using the edge filtering TCAM. A pattern of 4 bits covers a four-pixel space on the image to find the edge. As shown in the figure, when the anti-diagonal pattern is found, a hit occurs at the position where the anti-diagonal edge pattern is stored. A 1 is written to the anti-diagonal feature image, and 0s are written to the other edge direction feature images. Since all positions are filtered in parallel, the horizontal, vertical, anti-diagonal, and main-diagonal feature images can be obtained in one clock cycle.
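The following sketch emulates one such filtering step in software. The per-direction pattern organization and the ternary matching follow the text; the concrete bit encodings are our assumptions for illustration, since the actual stored values (two 4-bit patterns per direction) are those of Fig. 7.

```python
# Illustrative edge filtering TCAM. A 2x2 segment is flattened to
# [top-left, top-right, bottom-left, bottom-right]; "X" is a don't care.
# The encodings below are assumptions; the real table stores two 4-bit
# patterns per direction as shown in Fig. 7.
EDGE_TABLE = {
    "horizontal":    ["1100", "0011"],  # top row lit / bottom row lit
    "vertical":      ["1010", "0101"],  # left column lit / right column lit
    "main-diagonal": ["1001"],          # pixels on the main diagonal lit
    "anti-diagonal": ["0110"],          # pixels on the anti-diagonal lit
}

def edge_filter(segment):
    """Search the segment against every stored pattern and return one
    bit per direction (1 on any hit): four feature-image pixels."""
    def hit(stored):
        return all(s == "X" or p == "X" or s == p
                   for s, p in zip(stored, segment))
    return {d: int(any(hit(s) for s in pats))
            for d, pats in EDGE_TABLE.items()}

print(edge_filter("1X00"))  # ternary input; "X" is a don't-care pixel
```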

2) RESULT
Applying the E-SFNN, the MNIST images were recognized with an accuracy of 92.49%, a 9.66% improvement over the SFNN. Table 4 is a confusion matrix showing the recognition rate.
Compared to the SFNN, the E-SFNN requires more time to learn. This is due to the additional edge-filtering TCAM search operation that is required. Table 5 shows the result of calculating the time consumed for the search. As shown, learning is completed in about 0.0472 seconds at a 100 MHz clock speed. Note that this is virtually the same amount of time consumed by the SFNN without the edge filtering operation. To produce edge-filtered images, the edge filtering operation requires one clock cycle per image. Thus, compared to the SFNN, only a few extra clock cycles are required for the edge preprocessing and the pattern match score updating. This is visible by comparing the number of searches between Tables 2 and 5.

For inference, 6,666,666 images per second can be processed using the trained learning groups. It takes 2 clocks to send an image into a learning group, because a search process to find the edges has been added. After the two searches, the output is 9 × 9 × 2 bits, which takes 8 clocks to sum up, plus 5 clocks to compare the 10 summed results from the 10 learning groups in a tournament format; therefore, a total of 15 clocks are needed to process one image. Assuming a 100 MHz clock rate, 6,666,666 images can be processed per second.
The total memory size required for learning 28 × 28 MNIST images using the E-SFNN is as follows: to train the digits 0 to 9, 10 learning groups are needed, and 4 edge images are trained per learning group. 324 TCAMs per learning group are required to put the 27 × 27 × 4 images into 9 × 9 × 4 TCAMs with 3 × 3 inputs. Therefore, the total required number of TCAMs is 3240, the number of storage bits per TCAM is 9 × 2⁹ bits, and the total number of storage bits is 9 × 2⁹ × 3240 bits. In addition, for the edge detection TCAM, a total of 4 (directions) × 8 (two 4-bit patterns per direction) × 27 × 27 bits are required (Table 6).

B. STRIDE AND NEIGHBOR LEARNING
For higher accuracy, the stride method, a technique frequently used in convolutional neural networks (CNNs) [24], was introduced to our learning group. Because learning is based on the frequency of a pattern at a specific location, a pattern that has a similar shape but sits at a slightly different location within the image may be recognized as a completely different pattern. Thus, if a trained image is slightly translated, it is unrecognized: it is treated as an unlearned and unfamiliar pattern, which adversely affects accuracy [25].
To prevent this from happening, the stride technique is added to the E-SFNN. In addition, we present an algorithm that improves accuracy by allowing one pattern to be searched by overlapping neighboring CTCAMs. These two techniques combined with the E-SFNN will be referred to as the stride edge-detection search frequency neural network (SE-SFNN).

1) STRIDE
When performing convolution in a CNN, the sliding interval of the convolution filter is adjusted using the stride, and the next convolution window overlaps the previous one; this concept is applied to our learning algorithm. Fig. 11 shows the difference between the existing learning method and the stride-applied learning method. Unlike in the E-SFNN, where the image regions did not overlap, the regions searched by each CTCAM now overlap and respond less sensitively to positional discrepancies between similar images.

2) SIMULTANEOUS LEARNING OF NEIGHBOR TCAMS
In addition to the stride method, a method of learning with the surrounding TCAMs is introduced. During the learning process, a TCAM block receives not only the assigned 3 × 3 pixel values, but 4 more clocks are used to receive 4 successive 3 × 3 images. These successive images are neighboring up, down, left, and right 3 × 3 image segments, as shown in Fig. 12. This allowed a broader range of patterns to contribute during the learning process. Since a 3 × 3 pattern is searched for in all the strided 625 CTCAMs at the same time, the search must be performed 5 times to perform both strided and near neighbor schemes for an image. In addition, since there is a limit to the number of searches that can be stored in the 10-bit counter, it requires five times as many sorting processes as compared to the E-SFNN.
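A sketch of the pattern streams seen by one strided CTCAM during learning is given below; the 25 × 25 overlapping-window grid follows the text, while the clipping of neighbors at the image border is our assumption.

```python
import numpy as np

def strided_blocks(img27):
    """All overlapping 3x3 windows (stride 1) of a 27x27 image: the
    25x25 grid of patterns searched by the strided CTCAM array."""
    return np.lib.stride_tricks.sliding_window_view(img27, (3, 3))

def neighbor_patterns(blocks, i, j):
    """The five patterns CTCAM (i, j) learns from: its own window plus
    the up, down, left, and right neighbors (clipped at the borders,
    which is our assumption for edge positions)."""
    out = []
    for di, dj in [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]:
        ni, nj = i + di, j + dj
        if 0 <= ni < blocks.shape[0] and 0 <= nj < blocks.shape[1]:
            out.append(blocks[ni, nj])
    return out
```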

3) RESULT
As a result of the SE-SFNN, the MNIST numeric images were recognized with an accuracy of 95.69%. Table 7 is a confusion matrix showing the recognition rate.
Using the SE-SFNN, we estimate the time it takes to learn the MNIST images. The search for edge detection is required only once per image, but the neighboring TCAMs located on the top, bottom, left, and right sides also need to be searched, so six searches are performed per image. In total, learning consumes 23,582,880 clocks, or about 0.236 seconds at a 100 MHz clock speed.

4) INFERENCE SPEED
Two clocks are required to search the image through the edge filtering TCAM and the learning group TCAMs. After the two searches, the output comes out as 25 × 25 × 2 bits, because 25 × 25 TCAMs with 3 × 3 inputs are arranged to learn the 27 × 27 images while using the stride method. Adding it all up takes 11 clocks. Afterward, it takes 5 clocks to compare the 10 summed results from the 10 learning groups in a tournament format, so a total of 18 clocks are consumed to process one image. At a 100 MHz clock rate, 5,555,555 images can be processed per second.
It shows that the time required for learning increased by adding the stride and the neighboring CTCAM learning algorithm. On the other hand, the inference time did not increase significantly.

5) REQUIRED MEMORY SIZE
The total memory size required for learning 28 × 28 MNIST images using the SE-SFNN is as follows: to train the digits 0 to 9, 10 learning groups are needed, and 4 edge images are trained per learning group. 2500 CTCAMs per learning group are required to send the 27 × 27 × 4 images into 25 × 25 × 4 TCAMs with 3 × 3 inputs. Therefore, the total required number of TCAMs is 25000, the number of storage bits per TCAM is 9 × 2⁹ bits, and the total number of storage bits is 9 × 2⁹ × 25000 bits. The edge detection TCAM additionally requires 4 × 8 × 27 × 27 bits, and this amounts to a total of approximately 115M cells, which is a large memory size for TCAMs. The memory estimate for the SE-SFNN is summarized in Table 9.
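These totals can be verified with a few lines of arithmetic. The script below reproduces the counts quoted in the text for Tables 3 and 9; the E-SFNN total is our own sum of the terms quoted for Table 6.

```python
ROWS, WIDTH = 2**9, 9                 # each full CTCAM: 512 x 9-bit
edge = 4 * 8 * 27 * 27                # edge detection TCAM bits

sfnn    = 10 * 9 * 9 * ROWS * WIDTH               # Table 3: 810 CTCAMs
e_sfnn  = 10 * 4 * 9 * 9 * ROWS * WIDTH + edge    # Table 6: 3240 TCAMs
se_sfnn = 10 * 4 * 25 * 25 * ROWS * WIDTH + edge  # Table 9: 25000 TCAMs

print(f"SFNN:    {sfnn:>11,} bits")    #   3,732,480  (~4M cells)
print(f"E-SFNN:  {e_sfnn:>11,} bits")  #  14,953,248
print(f"SE-SFNN: {se_sfnn:>11,} bits") # 115,223,328  (~115M cells)
```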

V. MEMORY ORGANIZATION
For each of the developed schemes, we will briefly describe the TCAM memory organization to show how rapid learning is possible.

A. TCAMS FOR SFNN
Eqs. (1) and (2) represent the number of CTCAMs used and the functions performed during the learning process.
Each CTCAM has a TCAM organized as 512 × 9 bit memory to store all possible patterns for a 3 × 3 image segment. The variables i and j represent the index of the image segment. The variable l represents the 10 types of labels in MNIST.
The Score function counts the hit frequency for each row in the memory. Eq. (2) shows the Sort operations that occur at set intervals for the same memory as in Eq. (1). As previously stated, the Sort operations occur several times during the learning stage to prevent counter overflow. This parallel CTCAM organization allows the Score operations to finish in one clock cycle for the 10 images representing each label at the same time.

B. TCAMS FOR E-SFNN
In the E-SFNN, preprocessing of the image into 4 directional edge feature images is performed. Eq. (3) describes the memory organization necessary for this preprocessing. Eqs. (4) and (5) show the next layer's CTCAM memory organization and the functions performed, such as Sort(CTCAM_{l,e,i,j}). They are similar to Eqs. (1) and (2), except that the number of TCAMs has been multiplied by four to process the four directional image maps. These increases are represented by the variable e.

C. TCAMS FOR SE-SFNN
The NN function represents the nearest neighbor operation performed 5 times on the 10 × 4 × 25 × 25 TCAMs of 512 × 9 bits to tally up the frequency scores for each row. Note that the Sort function on the same TCAMs already utilizes the nearest-neighbor counted data.

VI. MEMORY SIZE REDUCTION
One significant issue that complicates the SE-SFNN implementation is that the number of patterns to be stored in memory grows exponentially with the number of bits in the searched pattern. From the point of view of a TCAM, which learns by taking on a part of an image, it could extract characteristics more advantageous for learning if it could operate on a wider area (a larger pattern). Merely increasing the current 9-bit pattern (a 3 × 3 space block) to 16 bits (a 4 × 4 space block) widens the pattern by only 1 bit horizontally and vertically, but the memory required for all possible patterns increases by 2⁷ = 128 times. Such an exponential increase in memory is a significant issue in terms of accuracy and scalability. The problem is even greater in edge device circuits, where area and power consumption are limited.
The two thresholds that yielded the highest accuracy in the SE-SFNN method were at the 95% and 90% positions of the total memory address range, respectively. Therefore, we judged that the patterns actually used for recognition in the TCAM were limited. In light of this observation, we developed a way to store only the patterns used very frequently for recognition, thereby reducing the amount of memory used.
The proposed M2MP scheme divides the TCAM into major and minor memories. A pattern is promoted from the minor TCAM to the major TCAM when it is searched twice within a certain interval. A promoted pattern is not deleted or overwritten until the entire learning process is finished.

1) ALGORITHM
The M2MP algorithm is depicted in Fig. 13 and proceeds as follows. Every pattern to be learned is first searched for; if a pattern is found, counting proceeds as in the SE-SFNN. The improved method focuses on filling only the important patterns into the much smaller major TCAM. To do so, the minor TCAM is used as a temporal window that promotes only the frequent patterns occurring within that window.
First, a pointer that indicates the next pattern location to be written for the minor TCAM is denoted by check bit 1 as shown on the left side of Fig. 13(a). This pointer is stored in the left shift register that can shift down, and the pointer points to the adjacent row of the major-minor TCAM. Thus, initial pattern searches are filled in chronological order from the top to the bottom. The rows above and including the check bit are assigned to the minor TCAM. In the beginning, the rows are filled up sequentially from top to bottom until the check bit is reached. Subsequently, the top row is designated as the oldest row through the shift register on the right of the minor TCAM initialization stage. After this initial state, new image segments are loaded into the oldest row, and the oldest row pointer is updated by shifting down. The oldest row pointer cycles around the minor TCAM regions.
If a pattern is found in the minor TCAM, the pattern is not rewritten, and the major TCAM counter value is increased. In this case, shown in Fig. 13(b), the check bit is shifted down to maintain or fill up to the designated minor row count. The pointer to the oldest minor TCAM row remains unchanged unless the hit occurs in the oldest minor TCAM row; in that case, the pointer is moved to the next oldest row. Note that the hit location now becomes part of the major TCAM, and the promoted pattern is maintained without being deleted. These operations continue, as shown on the right side of the figure, where the memory fills up and the major TCAM grows. The figure also shows a case where the designated minor TCAM row count was reached but no hits were recorded for 5 searches, so those patterns were filled from the top into the next available minor TCAM locations. Furthermore, when the pointer reaches the bottom of the minor TCAM locations, it cycles back to the top, to the next oldest minor TCAM row.
When the check bit reaches the bottom row of the entire TCAM, it indicates that the number of minor TCAM rows is less than or equal to the designated number of minor TCAM rows. In this case, the designation count is decreased to allow more promotions, and the patterns are searched and promoted until the entire TCAM becomes a major TCAM. Note that in this situation, the check bit remains in the bottom position while the oldest minor TCAM pointer cycles through steadily decreasing numbers of minor TCAM rows.
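The promotion policy can be summarized behaviorally as follows. This Python sketch models the policy only: the shift registers, check bit, and hit mask of the actual circuit (Fig. 13) are abstracted into plain lists, and the row counts are those used in our simulations.

```python
class M2MPTcam:
    """Behavioral sketch of minor-to-major promotion (policy only).
    A pattern seen a second time while it still sits in the minor
    window is promoted to the major set and kept until training ends."""

    def __init__(self, rows=80, minor_rows=16):
        self.rows, self.minor_rows = rows, minor_rows
        self.major, self.counts = [], {}   # promoted patterns + counters
        self.minor = []                    # recent unpromoted patterns

    def search(self, pattern):
        if pattern in self.major:          # hit in the major TCAM
            self.counts[pattern] += 1
        elif pattern in self.minor:        # second sighting: promote
            self.minor.remove(pattern)
            self.major.append(pattern)
            self.counts[pattern] = 1
        else:                              # miss: use the oldest minor row
            free = self.rows - len(self.major)  # rows left for the minor part
            if free > 0:
                if len(self.minor) >= min(self.minor_rows, free):
                    self.minor.pop(0)      # overwrite the oldest entry
                self.minor.append(pattern)
            # once the whole TCAM is major, new patterns are ignored
```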

2) HARDWARE IMPLEMENTATION
To implement the M2MP operation in hardware, the same CTCAM structure is augmented with an M2MP block, reducing the TCAM memory size. Since the number of rows in the TCAM is reduced from 512 to 80, the number of counters required to keep the frequency scores is also reduced. All other hardware requirements remain the same as for the SE-SFNN. Fig. 14 shows the CTCAM block augmented with the M2MP block; only 80 patterns are stored. In our simulations, we chose 16 as the initial designated minor TCAM row count. The left shift register serves as the check bit that marks the bottom boundary of the minor TCAM, and the right shift register tracks the oldest minor TCAM row. The hit mask in the M2MP block marks which addresses are skipped during the right shift register operation. The M2MP control block is in charge of maintaining the minor TCAM's counters and eventually converting the entire TCAM into a major TCAM.
A. RESULT
1) ACCURACY
As a result of learning the MNIST images through the SE-SFNN with the M2MP algorithm, the MNIST images were recognized with an accuracy of 94.58%, a loss of only 1.16% compared to the SE-SFNN. The size of the TCAMs within the CTCAMs was reduced from 512 × 9 bits to 80 × 9 bits. As shown in Fig. 15, this size was obtained experimentally by trying out various reduced sizes to obtain a good trade-off between accuracy and memory reduction. Table 10 is a confusion matrix showing the recognition rate. The upper and lower thresholds were changed to optimize for the M2MP implementation. The optimal thresholds for the reduced total TCAM size of 80 were 9 and 46. Converted to percentages, these are 11.25% and 42.5%.

2) LEARNING CONSUMPTION TIME
Measuring the time it takes to learn the MNIST images using the SE-SFNN with M2MP, the cost of the sorting operations was significantly reduced. Because the time required for bubble sorting is proportional to (number of TCAM rows) × (number of TCAM rows − 1)/2, sorting consumed only 2.42% of the sorting clocks used by the SE-SFNN; the number of sorting operations itself did not change. One clock for edge detection and 5 × 2 clocks for the stride and near neighbor operations are required per image; searching takes twice as long as in the non-M2MP approach due to the minor-to-major TCAM update time. Therefore, 66,000 clocks are consumed for 6000 images. Adding the clocks required for sorting, the total required clock count is 634,800. This is only 2.69% of the 23,582,880 clocks of the method without M2MP. Table 11 shows the result of calculating the time consumed for the search. The result indicates that learning is completed in about 0.0063 seconds at a 100 MHz clock speed.
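As a sanity check, the quoted clock counts can be reproduced in a few lines. The six-clock compare-and-swap cost and the 30 sorting passes (6 overflow intervals × 5 searches) are our reading of Sections II and IV, so this is an interpretation rather than the exact hardware schedule.

```python
def sort_clocks(rows, passes=30, swap_clocks=6):
    """Bubble-sort cost: rows*(rows-1)/2 compare-and-swaps per pass,
    six clocks each (our reading of the Section II swap sequence)."""
    return rows * (rows - 1) // 2 * swap_clocks * passes

se_sfnn = 6000 * 6 + sort_clocks(512)   # 1 edge + 5 neighbor searches
m2mp    = 6000 * 11 + sort_clocks(80)   # searches take twice as long

print(f"{se_sfnn:,}")           # 23,582,880 clocks
print(f"{m2mp:,}")              # 634,800 clocks -> 6.3 ms at 100 MHz
print(f"{m2mp / se_sfnn:.2%}")  # 2.69%
```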

3) REQUIRED MEMORY SIZE
This method reduces the amount of memory used by each TCAM. The operations performed on the TCAM are otherwise the same as in the SE-SFNN, but the 2⁹ = 512 rows required for each TCAM are reduced to 80. This means that the same inference can be performed using only 15.6% of the SE-SFNN's memory. However, the M2MP memory reduction is only applied to the CTCAMs; the edge-processing TCAMs do not shrink. This is reflected in Table 12.
Although the memory size is significantly reduced by M2MP, each CTCAM block now requires the M2MP controller block to fill the memory with relevant patterns during the learning phase. It is also worth noting that the learning time per search doubles; this increase is due to the additional clock required for the major TCAM pattern insertion.

4) COMPARISONS WITH PREVIOUS STUDIES
To gauge our implementation's performance, we compared it with SNN learning systems that were implemented or simulated in hardware. Table 13 compares our proposed design with others on various performance and feature parameters. We have also included power consumption estimates using the TCAM bit energy measurements from [26], which were based on a low-power TCAM design fabricated in a 65 nm CMOS process. Our estimate, which assumes that TCAMs dissipate 90% of the total power, shows a lower average power dissipation for training (3.1 mW) than for inference (4.6 mW). This lower training power consumption is mainly due to the large number of clock cycles spent on sorting operations, which have a relatively small impact on power consumption.
The compared papers are as follows. The coordinate rotational digital computer (CORDIC) algorithm was applied to implement an SNN in hardware [27]; using this algorithm, learning values are obtained through repetitive calculations of the complex expressions represented in the SNN as exponential functions. Another group used a backpropagation STDP (BP-STDP) algorithm and a weighted neuron model [28]; their accelerator is implemented on a Xilinx Virtex-7 VC707 FPGA development board and achieves better accuracy due to the efficient, supervised BP-STDP rule. In [29], the authors reduce hardware costs and power consumption with a parallel neuromorphic architecture; their work achieves competitive recognition performance with smaller network complexity, although there were losses when the software was implemented in hardware. In contrast, our proposed work shows good accuracy with no difference between software and hardware implementations, along with a very fast image processing speed. The SNN processor proposed in [30] is suitable for real-time and energy-constrained applications; with an adaptive clock/event-driven computing scheme and a neighboring PE borrowing technique, it reduces energy and classification time. Finally, a group proposed a training algorithm using SNN, including STDP and the neural sampling method [31]; for hardware, they proposed an RRAM SNN in a PIM configuration for good power efficiency and fast learning speed. Table 13 verifies the usefulness of the proposed algorithm based on several parameters. It shows an accuracy about 1.02% lower than the maximum accuracy obtained with an SNN algorithm, which is 95.6%. The parameter that shows our strength is the inference (recognition) time, which is much faster than the other SNN-based methods. Even compared to [27], the fastest SNN method, our design recognizes images about 3 times faster. In addition, the time required to learn is overwhelmingly small: it took 6.3 ms to train on 6000 images, or only about 0.001 ms per image. Compared to the fastest learning system [31], training can be completed about 256 times faster.
Although our results are very good for the MNIST dataset, there are still major challenges ahead. In particular, our hardware architecture only achieved 89.94% accuracy on the N-MNIST dataset with Gaussian noise and did not produce a meaningful result for CIFAR-10. This relatively poor result is due to two problems: one is the ternary value we use for a pixel, and the other is the localized single-layer learning performed in our algorithm. These challenges are discussed in the conclusion section.

VII. ENTIRE ARCHITECTURE
Fig. 16 shows the entire architecture of the proposed system. The two main TCAM blocks are the edge detection filtering TCAMs and the CTCAMs with M2MP, which apply SLDP to incoming 2D spikes. During supervised learning, each learning group, represented by its associated CTCAMs and their summed output value, is updated with the corresponding class of inputs. For inferencing, shown in Fig. 16(a), the largest of the learning groups' outputs indicates the recognized class. At this stage, the inputs are distributed to 10 groups that operate in parallel.

In contrast, only the inputs corresponding to their group are processed in parallel during the learning stage. Note that the nearest neighbor operation is only performed during the learning stage, as shown in Fig. 16(b). A sequenced multiplexor is used to allow each of the 2500 CTCAMs with M2MP to choose between 5 different positional patterns in order to implement the near neighbor searches sequentially.

VIII. CONCLUSION
In this paper, we have developed a highly computationally efficient DPIM neural network system. A two-layer TCAM scheme, in which the first layer preprocesses an image into multiple spatial spikes and the second applies spike-location-dependent plasticity, is used for both rapid learning and inference. Since the memory processing is performed digitally, it is a much more practical approach than many analog PIM approaches. Although power consumption is an issue with generic TCAMs, the problem can be resolved with the low-power TCAM designs that have been studied. To further address this issue, our approach reduced the required memory size significantly by applying a two-tiered caching memory approach that selectively stores the most frequently occurring patterns. In turn, this allowed a significant reduction in the number of clocks required for training operations. Moreover, the proposed architecture, similar to STDP, can facilitate unsupervised learning due to its rapid, feedforward learning. We demonstrated the feasibility of using TCAMs for MNIST learning. For general datasets, however, a couple of major challenges remain for rapid learning. The first is the construction of a multi-layer learning TCAM architecture to learn a variety of datasets. The second challenge in our framework is to use higher resolution quantization to detect minute details.