FASTA: Revisiting Fully Associative Memories in Computer Microarchitecture

Associative access is widely used in fundamental microarchitectural components, such as caches and TLBs. However, associative (or content addressable) memories (CAMs) have traditionally been considered too large, too energy-hungry, and not scalable, and therefore have seen limited use in modern computer microarchitecture. This work revisits these presumptions and proposes an energy-efficient fully-associative tag array (FASTA) architecture, based on a novel complementary CAM (CCAM) bitcell. CCAM offers a full-CMOS solution for CAM, removing the need for time- and energy-consuming precharge and combining the speed of NOR CAM with the low energy consumption of NAND CAM. While providing better performance and energy consumption, CCAM features a larger area compared to state-of-the-art CAM designs. We further show how FASTA can be used to construct a novel aliasing-free, energy-efficient, Very-Many-Way Associative (VMWA) cache. Circuit-level simulations using 16 nm FinFET technology show that a 128 kB FASTA-based 256-way 8-set associative cache is 28% faster and consumes 88% less energy per access than a same-sized 8-way (256-set) SRAM-based cache, while also providing aliasing-free operation. System-level evaluation performed on the Sniper simulator shows that the VMWA cache exhibits lower Misses Per Kilo Instructions (MPKI) for the majority of benchmarks. Specifically, the 256-way associative cache achieves 17.3%, 11.5%, and 1.2% lower average MPKI for the L1, L2, and L3 caches, respectively, compared to a 16-way associative cache. The average IPC improvements for the L1, L2, and L3 caches are 1.6%, 1.4%, and 0.2%, respectively.


I. INTRODUCTION
Associative access, i.e., access by content rather than by explicit address, is widely used across computer microarchitecture. This type of access is required for caches of all levels, translation lookaside buffers (TLBs), reservation stations and register renaming in out-of-order execution engines, branch target buffers (BTBs), and other fundamental microarchitectural components [1], [2], [3], [4], [5], [6], [7], [8]. Caches in typical processors are limited to 8-way or 16-way associativity [9], [10], [11], [12], [13]. However, higher associativity has at least two distinct advantages over lower way-associative options: 1) Higher associativity leads to fewer conflict misses, with fully associative caches entirely devoid of such misses, thus reducing the overall miss rate.
A CAM is a memory that, in addition to write and read, enables simultaneous comparison of a query (search) pattern with the entire content of the memory [19]. The top-level scheme of a CAM array is presented in Fig. 1(a). The array is based on bitcells (CAM BCs) with the same basic structure as a 6T-SRAM bitcell. This memory structure enables read and write operations to a cross-coupled inverter core through a set of wordlines (WLs) and complementary bitlines (BLs). In the context of CAM arrays, these BLs can be denoted as searchlines (SLs) [28]. Write and read operations are implemented similarly to SRAM. A set of search data registers (SDRs) drive the data onto the SLs during a write and sample the data from the SLs following a read. The search, or compare, operation is where a CAM deviates from random-access memories [19], [29]. To carry out a compare operation, the search pattern is loaded into the SDRs and driven onto the SLs. If the search pattern matches the pattern stored in a certain row, the matchline (ML) of that row signals a hit. In the mismatching rows, a miss is signaled by the ML.
Logically, every i-th row in an m-entry CAM contains n bits of stored data and an n-bit comparator, whose output is ML_i. The comparator is typically implemented by one of the following two logically equivalent functions:

ML_i = AND_{j=1..n} (S_j ⊕ D_{i,j})    (1)
ML_i = NOT( OR_{j=1..n} NOT(S_j ⊕ D_{i,j}) )    (2)

where S is the search (query) pattern, D is the pattern stored in the CAM row, and ⊕ represents a bitwise XNOR operation between the S_j value broadcast on column j and the D_{i,j} value stored in that column. Accordingly, there are two types of CAMs: NOR and NAND. The bitcells implementing NOR and NAND CAM arrays are presented in Fig. 1(b) and Fig. 1(c), respectively. The internal data nodes (D and its complement) of the cross-coupled inverter core of both cells are connected to the access transistors (denoted ''MT'' in the figure). The cell is accessed for write and read operations by asserting the WL, and driving the BL pair to opposite logic values for a write, or precharging them for a read. The data nodes are also connected to XNOR circuitry that is implemented differently for the two types of cells. The NOR CAM cell implements the XNOR by conditionally pulling down the ML through M1/M3 and M2/M4. The NAND CAM cell implements the XNOR by the M2/M3 transistor pair, conditionally enabling a pass gate (M1). The NOR and NAND CAM row designs are illustrated in Fig. 1(d) and Fig. 1(e), respectively.
Both the NOR and the NAND CAMs carry out a compare operation in two phases: precharge and evaluation. During the precharge phase, the pre signal is asserted to pull the ML high in both configurations. However, during the evaluation phase, the behavior of the NOR and NAND CAMs is complementary. In the NOR CAM, an inverted search pattern is asserted on the SLs. If the stored bit equals the inverted search bit in even one cell, i.e., D_{i,j} ≠ S_j for row i and some column j, the ML discharges through either M1/M3 or M2/M4. Consequently, the value of ML_i at the end of the compare cycle is '0', signaling a mismatch. Otherwise, if the stored bit differs from the inverted search bit in every cell, the ML has no discharge path. Therefore, its value remains high at the end of the compare cycle, signaling a match. For the NAND CAM, if the stored bit equals the search bit in every cell (i.e., D_{i,j} = S_j for row i and every column j), the ML discharges through the series of M1 transistors, dropping to '0' and signaling a match at the end of the compare cycle. If even a single NAND CAM cell mismatches, the ML remains high, signaling a mismatch. While both options provide the required functionality, they present very different design tradeoffs when implementing an associative memory.
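As a functional illustration of this complementary behavior, the following minimal Python sketch (our own behavioral model, not taken from the paper; the function names are hypothetical) evaluates the end-of-cycle ML value for a NOR row and a NAND row.

def row_matches(stored, query):
    # A row matches when every stored bit equals the corresponding query bit.
    return all(d == s for d, s in zip(stored, query))

def nor_ml_after_evaluate(stored, query):
    # NOR CAM: ML is precharged high; any mismatching bit opens a discharge
    # path, so ML ends at '0' on a mismatch and stays at '1' on a match.
    return 1 if row_matches(stored, query) else 0

def nand_ml_after_evaluate(stored, query):
    # NAND CAM: ML is precharged high; the series pull-down path conducts only
    # when every bit matches, so ML ends at '0' on a match and stays at '1' otherwise.
    return 0 if row_matches(stored, query) else 1

stored = [1, 0, 1, 1]
assert nor_ml_after_evaluate(stored, [1, 0, 1, 1]) == 1   # match
assert nand_ml_after_evaluate(stored, [1, 0, 1, 1]) == 0  # match
assert nor_ml_after_evaluate(stored, [1, 1, 1, 1]) == 0   # mismatch
assert nand_ml_after_evaluate(stored, [1, 1, 1, 1]) == 1  # mismatch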

1) OTHER TECHNOLOGY ALTERNATIVES FOR ASSOCIATIVE MEMORIES

a: CNTFET-BASED CAM
Carbon Nanotube Field-Effect Transistor (CNTFET) technology has been proposed as a potential alternative to silicon transistors. With high mobility and a fabrication process compatible with various substrates, CNTFETs have demonstrated the capability to build high-speed CMOS digital circuits and microprocessors, providing as much as a 10× improvement in performance over silicon counterparts [30]. Various CNTFET-based CAMs have been proposed in the literature [30], [31], [32], [33], [34], showing that CNTFET technology can excel in terms of search power and delay as compared to conventional silicon-transistor-based CAMs. Some CAM cells utilize techniques such as top-gated CNTFET (TG-CNTFET) and gate-all-around CNTFET (GAA-CNTFET) to improve performance [31], [32]. Such approaches have been shown to be efficient in reducing search delay and power consumption, while improving the noise immunity of CNTFET-based CAMs. Other types of optimizations include shorted-gate CNTFET (SG-CNTFET) and independent-gate CNTFET (IG-CNTFET). Unfortunately, CNTFET technology suffers from yield issues in large-scale manufacturing [35], which currently limits the size of CNTFET-based CAMs.

b: FeRAM-BASED CAM
Ferroelectric Random Access Memory (FeRAM) is a type of capacitor-based memory, where a ferroelectric material is used as the dielectric layer between two electrodes [36]. FeRAM has been on the market for some time and has niche usage in low-power embedded systems, thanks to its speed, low-power access, and low-voltage operation. However, it has not reached widespread adoption in CAMs, mainly due to its moderate area footprint and its destructive read operation [37], [38]. To address this issue, a novel approach has been developed utilizing quasi-nondestructive readout (QNRO) of the capacitor polarization. By selectively switching a small portion of the polarization, enough to activate a read transistor channel, this technique allows for more than 10^6 read cycles before necessitating a write-back operation [37]. This enhances the feasibility of integrating FeRAM technology into CAM systems, presenting a viable solution to its previous limitations.

c: FeFET-BASED CAM
As in FeRAM technology, the ferroelectric field-effect transistor (FeFET) uses the polarization of ferroelectric materials. By toggling between the two polarities, a FeFET can be used as memory storage. Several FeFET-based CAMs have been proposed [39], [40], [41], [42], [43]. FeFET CAMs offer significant advantages over traditional SRAM-based CAMs, demonstrating improved memory density and energy efficiency [41]. These designs include the conventional 2-FeFET CAM, multi-level cell architectures, and single-FeFET CAMs [42], [43]. The latter has very high density, but requires several steps to carry out a search operation. Recent advancements include single-FeFET CAM designs that exploit the ambipolar transport of devices, such as the Schottky-barrier FeFET or the ambipolar ferroelectric tunnel FET (FeTFET), allowing pattern searching in just one step [44]. However, the asymmetry in ambipolar FeFETs impacts the CAM functionality and necessitates careful tuning [44]. Furthermore, FeFET-based CAMs use two different voltage levels for programming and search operations, which increases the design complexity and area overhead of the overall memory architecture.

d: MRAM-BASED CAM
Magnetoresistive Random-Access Memory (MRAM) is a spintronic memory technology that exploits the intrinsic spin of the electron to store information. MRAM technology has seen significant advancements in the last decade, with the two latest generations, namely Spin-Transfer Torque (STT) and Spin-Orbit Torque (SOT), gaining great interest. These MRAM technologies have been proposed as memory cells in CAMs [45], [46], [47], [48], [49], [50], including a hybrid SOT/STT CAM cell [51]. However, the widespread adoption of MRAM faces challenges, such as the delicate balance between high programming current density and fast access speed, as well as reliability concerns arising from operational disturbance and process variations. Other magnetic memory technologies, including skyrmions and Domain-Wall ''racetrack'' Memory (DWM), have also been proposed for the construction of associative memories [52], [53].

e: RRAM-BASED CAM
Resistive Random-Access Memory (RRAM) stores data in the form of resistance. RRAM is built using oxide-based materials, which change their ion distribution in the presence of heat or an electric field. Several RRAM-based CAMs have been proposed [54], [55], [56], [57], [58]. Among the most important issues of RRAM technology are the uniformity and reliability of resistive switching. This impacts RRAM-based CAM performance, leading to false mismatch results [56].

f: FLASH-BASED CAM
Flash is based on moving electric charge onto and off of a floating-gate structure, thereby modifying the effective threshold voltage of the device. A number of CAM designs based on both NOR and NAND Flash alternatives have been proposed [59], [60], [61], [62], [63], [64]. In addition to the small area footprint and reduced power consumption compared to conventional CMOS-based CAM designs, Flash technology is mature enough for large-scale integration [60]. NAND Flash, employing 3D stacking and multi-level cells, has achieved significant density improvements. However, this has introduced reliability issues necessitating advanced wear-leveling and error-correction mechanisms in modern Flash controllers. Moreover, 3D NAND Flash CAM performance is limited by the charge-trapping mechanism, and high-area-cost peripherals are required, mainly due to the large voltages needed for write operations [60].

B. TRADEOFFS AND CHALLENGES IN CAM DESIGN
The CAM circuits described above face several design tradeoffs that have limited their use in many microarchitectural components [16], [26], [65], giving preference instead to alternative implementations based on SRAM or register files [66]. These tradeoffs and challenges, including speed, energy consumption, scalability, and area overhead, are elaborated upon hereafter.

1) SPEED
The worst-case discharge path of the ML in a NOR CAM is a pair of serially connected transistors (M1/M3 or M2/M4 of a single mismatching bit). In comparison, the ML discharge path of a NAND CAM goes through w serially connected transistors, where w is (tag_size + number_of_auxiliary_bits) in the case of a cache tag array. Therefore, NOR CAM has a clear speed advantage over NAND. For many high-speed applications [24], [67], [68], and especially when the associative array is on the critical path as in a TLB [69], [70], [71], the speed penalty of a NAND implementation is too high, leaving the NOR implementation as the likely option. The novel CCAM bitcell and the FASTA tag array architecture, introduced in Section III, do not require precharge and feature tree-like logic for resolving the long resistive paths.

2) ENERGY CONSUMPTION
Unfortunately, the speed advantage of a NOR CAM comes at a very significant power consumption cost [24], [26], [66]. In the majority of CAM applications in computer microarchitecture (including cache and TLB), only a single CAM row matches in a compare cycle in the case of a hit, and no rows match in the case of a miss [25], [72]. Therefore, at best, all but one of the MLs in a NOR CAM discharge during every compare cycle, and hence need to be fully charged again before the next access. The resulting power consumption is excessive and is often the reason to prefer an alternative solution based on a non-associative memory. In contrast, a NAND CAM has lower energy consumption, as only the matching ML discharges, while the rest remain high and likely do not need to be re-charged for the next comparison (access). However, as mentioned above, the slow operation of a NAND CAM strongly limits its use in most microarchitectures. The proposed CCAM bitcell and FASTA architecture resolve this challenge as well, by enabling low-energy CAM operation, comparable to NAND CAM.
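To first order, the matchline energy per compare scales with the number of toggling MLs. Assuming each ML has capacitance C_ML and swings the full V_DD, a simple CV^2 estimate (our own approximation, not a figure from the paper) is:

\[
E_{\mathrm{ML}}^{\mathrm{NOR}} \approx (m-1)\, C_{\mathrm{ML}} V_{DD}^{2}
\qquad\text{vs.}\qquad
E_{\mathrm{ML}}^{\mathrm{NAND/CCAM}} \approx 1 \cdot C_{\mathrm{ML}} V_{DD}^{2},
\]

since in an m-row NOR CAM nearly all matchlines discharge and must be recharged on every cycle, whereas in a NAND-style (and, as shown later, CCAM-based) array only the single matching matchline toggles.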

3) SCALABILITY
Another concern with a CAM-based fully-associative cache is its limited scalability [73], [74]. The reason is as follows. Large random-access memory structures are typically implemented hierarchically, for example with hierarchy levels such as ranks, banks, and subarrays in dynamic random-access memory (DRAM). A subarray's size is limited due to physical restrictions, such as wordline and bitline lengths. Address decoding in such memories is also hierarchical, with a global decoder or pre-decoder selecting the subarray or a limited set of subarrays, and a local (row) decoder selecting the memory row. Because of that, only the subarrays where the data is located are activated, while the rest remain idle, saving energy. In CAMs, such a hierarchical approach is typically impossible, since the searched-for pattern might be located anywhere in the memory. Therefore, the entire memory needs to be searched, for which the entire memory needs to be active (i.e., all searchlines must be asserted), thus making truly large CAMs energetically impractical. The FASTA-based VMWA cache architecture, presented in Section IV, trades off full associativity for a slightly lower, but still very high, associativity, while idling the non-accessed portions of a large cache. This feature enables the construction of large and energy-efficient caches.

4) AREA OVERHEAD
A final challenge that limits the use of CAMs is their area overhead [75]. CAMs are well known to be larger than SRAMs of the same dimensions. Both NOR and NAND CAM cells have at least 50% more transistors than a 6T-SRAM cell. This has been another factor that has traditionally deterred computer architects when silicon area is scarce.
However, in most computer microarchitecture applications, associative access is not used standalone. An associative memory array is typically coupled with an SRAM. For example, in a cache, while the tag array might be implemented with CAM, the cache blocks are stored in an SRAM controlled by the tag array. The data SRAM has the same height as the tag array, but is much wider, for example, a 64-bit-wide tag array next to a 512-bit-wide data SRAM. Therefore, even if the tag array carries a significant area overhead, this overhead becomes much less substantial when amortized over the entire cache area. Accordingly, we believe that a certain degree of area overhead can be tolerated, given the benefits provided by full (and very-many-way) associativity and the relative size of the associative tag memory compared to the adjacent random-access data memories.

C. RELATED WORK
Fully associative structures in computer microarchitecture, in general, and in cache architecture, in particular, have not been a major focus of academic research in recent years. Therefore, positioning and evaluating the proposed FASTA-based VMWA cache design vis-à-vis recent developments in cache architecture is challenging. In this fairly limited related work section, we focus on several recent works that attempted to improve the associativity of caches, while optimizing the associativity vs. power consumption trade-off.
Increasing the associativity is a common technique for conflict miss reduction in caches [76], [77]. Taken to its extreme, the proposed VMWA cache eliminates conflict misses almost entirely, as shown in Section VI.
Non-temporal streaming, a conflict miss rate optimization technique, uses a separate fully associative buffer in parallel with the main direct-mapped cache [78]. In [79] and [80], the conflict miss reduction is achieved by optimally selecting address bits for the cache index, for a fixed set of applications. MCC-DB [81] uses prior knowledge of data access patterns to minimize LLC conflicts in multi-core systems through an enhanced OS facility for cache partitioning. CCProf [82] is a lightweight measurement-based profiler that identifies conflict cache misses and associates them with program source code and data structures.
Higher associativity often results in increased cache power consumption. Sub-banking was proposed in [83] as a method to reduce power consumption in way-associative caches by activating only parts of the cache while keeping the rest idle. Somewhat similarly, the proposed VMWA cache resolves the power consumption bottleneck and the scalability limitation of many-way associative caches by limiting the associativity level to the size of a single memory subarray. However, the proposed VMWA cache is built with an actual associative memory designed upon a novel CCAM cell, including additional circuit techniques. This approach enables very high associativity (e.g., 256-way), while at the same time featuring much lower energy consumption.
The B-cache was introduced as a technique for conflict miss reduction [84]. One disadvantage of this method is its high per-access power consumption. An enhanced B-Cache [85] reduced the total access energy consumption; however, it remained higher than that of a lower-associativity cache. The FASTA-based VMWA cache enables almost complete conflict miss elimination and, at the same time, significantly reduces the cache access energy consumption compared to a typical way-associative cache.
Another work that discusses the concept of a many-way associative cache is [86], as one of the applications of a programmable resistive address decoder. However, the solution proposed there is based on emerging resistive memories, and therefore its applicability in traditional computer microarchitecture is limited.

III. PROPOSED CCAM BITCELL AND FASTA ARCHITECTURE
The previous section overviewed CAM and introduced the various design tradeoffs and challenges that have limited its use in modern microarchitectures. In this section, we introduce FASTA, an energy-efficient and scalable associative memory architecture based on CCAM, a novel CAM bitcell that enables FASTA to meet the performance and power requirements of cache memories and other microarchitectural components.

A. COMPLEMENTARY CAM BITCELL
The basic building block of the energy-efficient FASTA architecture is the complementary CAM bitcell, shown in Fig. 2. While the storage mechanism of the CCAM bitcell is identical to that of standard CAM and SRAM cells (M1, M2, and the cross-coupled inverters of Fig. 2), the comparison logic is a combination of the NOR and NAND CAM schemes. This complementary (full-CMOS) design enables static, precharge-free operation with low power consumption, as explained hereafter.
The storage nodes of the CCAM bitcell (D and its complement) are connected to the gates of a pair of nMOS pass transistors (M3 and M4). These devices are connected to the searchlines (SL and its complement) on one side and to each other on the other side, similar to a NOR CAM cell. The node connecting M3 and M4 represents an XNOR operation between the stored bit and the bitline data, as noted in Fig. 2. The XNOR node is connected to the gates of two additional transistors: a pMOS (M_NOR) and an nMOS (M_NAND). Similar to the NAND CAM configuration of Fig. 1(e), M_NAND is connected in series with the M_NAND transistors of the adjacent bitcells in the same row, i.e., the ML_NAND line of each CCAM cell is connected to the left end of the M_NAND transistor in the neighboring CCAM cell. The source of the M_NAND of the leftmost bitcell in a row is connected to ground, such that if every XNOR node in a row is high, the ML_NAND line is discharged. This is illustrated in Fig. 3(a) for a 3-bit row of CCAM cells with all query patterns (driven onto SL_j and its complement) matching the data stored in the bitcells.
To complement this, the source of M_NOR is connected to the power supply (V_DD), while its drain is connected to the ML_NOR signal, shared by all bitcells in a row. Therefore, it is sufficient for the XNOR node of one cell in a row to be low, i.e., a single mismatching bit, to charge ML_NOR. This is illustrated in Fig. 3(b), where SL_1 = '0' and D_1 = '1', such that XNOR_1 is pulled down, opening a pull-up path to ML_NOR through M_NOR. By shorting together the ML_NOR and ML_NAND lines at the drain node of the rightmost bitcell in a row, a row of CCAM cells makes up an n-input CMOS NAND gate. This structure benefits from the statically driven, full-swing output characteristics of CMOS digital logic.
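The row-level behavior can be summarized by a short Python sketch (an illustrative behavioral model we add here, not part of the original design files): each bitcell produces an XNOR node, the series M_NAND path can discharge the shared matchline only when every XNOR node is high, and any low XNOR node statically pulls the matchline up to V_DD through its M_NOR device.

def ccam_row_ml(stored_bits, query_bits):
    # Behavioral model of a w-bit CCAM row. Returns the static matchline value:
    # 0 (discharged) on a full match, 1 (held at VDD) if any bit mismatches.
    xnor_nodes = [int(d == s) for d, s in zip(stored_bits, query_bits)]
    # NAND path: all series M_NAND devices conduct only if every XNOR node is high.
    pull_down = all(xnor_nodes)
    # NOR path: any low XNOR node turns on its pMOS M_NOR and pulls the ML to VDD.
    pull_up = any(node == 0 for node in xnor_nodes)
    # The two paths are complementary, so the ML is always statically driven
    # (full-CMOS behavior, with no precharge phase).
    assert pull_down != pull_up
    return 0 if pull_down else 1

# 3-bit examples, mirroring Fig. 3:
assert ccam_row_ml([1, 0, 1], [1, 0, 1]) == 0  # match: ML discharged through the NAND chain
assert ccam_row_ml([1, 0, 1], [0, 0, 1]) == 1  # mismatch: ML held at VDD through M_NOR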
The complementary functionality of the CCAM ML dramatically reduces the switching activity of this highly capacitive signal, because all but one ML (in the case of a cache/TLB hit), or all MLs (in the case of a cache/TLB miss), are statically held at V_DD. This is in contrast to a conventional NOR CAM, where all MLs (or all MLs but one) discharge during every compare cycle, and hence need to be precharged before the next cache access [87]. Table 1 summarizes the electrical simulation results in terms of search latency and energy consumption of NOR CAM and NAND CAM relative to CCAM. All CAM designs were organized as 256 64-bit words, built using a 16 nm FinFET technology operating at a V_DD of 0.8 V. CCAM has the lowest search energy but the highest delay among the three. The latter is due to the long pull-down path, similar to NAND CAM, exacerbated in CCAM by the additional capacitance of the pull-up pMOS transistors absent in NAND CAM. However, the resulting CCAM frequency is still higher than that of NAND and NOR CAM, because those require a precharge cycle. CCAM-based FASTA requires no precharge, as elaborated upon in the next subsection.

B. FULLY-ASSOCIATIVE TAG ARRAY (FASTA)
Using the CCAM cell as the basis for data storage and comparison, a memory architecture is proposed to implement a fully associative tag array (FASTA). As shown in Fig. 4(a), FASTA comprises three components: (1) an array of CCAM cells, (2) a column of matchline sense amplifiers (MLSAs), and (3) hit/miss logic. The array is organized as m rows of w CCAM bitcells, which output m ML signals that are sensed by the MLSA circuitry. A w-bit wide row with its MLSA is shown in Fig. 4(b). For the current study, a simple inverter was used to implement the MLSAs; however, a faster sensing circuit could be used at the expense of increased area. The output signals of the MLSAs (hit[0 : m−1]) typically enable the wordlines of an adjacent SRAM data array, e.g., in a cache, where the SRAM row contains a cache block. The hit/miss logic is essentially an m-input OR gate that signals a hit if there is a match in (at least) one of the rows and a miss otherwise. This logic can be implemented using various design approaches, such as a wired-OR or a NOR tree, which is the structure used in this work.
The Query pattern (Tag) register of Fig. 4(a) is used by all three basic operations (write, read, and compare) in the FASTA architecture. During writes, this register drives the data onto the CCAM searchlines (SLs), and during memory reads (if optionally implemented in the array), the data that is read out is sampled by that register. For a compare operation, the query pattern (a tag in the case of a cache) is written to this register and driven onto the SLs. The XNOR between the search pattern bit and the stored data bit drives the ML transistors in the CCAM cells (see Fig. 2). This, in turn, discharges (match) or charges (mismatch) ML_i, as demonstrated in Fig. 3. For a matching row, when the ML voltage discharges past the trip point of the MLSA inverter, the hit[i] signal of that row rises. This signal is then output from the array and transmitted through the hit/miss logic to generate a global hit signal.
Unlike conventional NOR and NAND CAMs, the compare operation in the CCAM-based FASTA does not require precharge. This saves the delay of the precharge phase, which can be substantial, as it has to be designed assuming all MLs are fully discharged before the compare operation. However, the pull-down path still needs to traverse w nMOS devices to signal a match, as illustrated in Fig. 5(a). One option to reduce the delay of this path is to use low or ultra-low threshold voltage (LVT or ULVT) transistors (Fig. 5(b)), which are typically provided as part of any modern process design kit, to implement M_NAND. While these transistors are much faster than standard or high threshold (SVT or HVT) devices, they suffer from subthreshold leakage several orders of magnitude higher than that of an SVT device, leading to excessive leakage power. However, in the case of multiple serially connected cutoff transistors, the so-called ''stack effect'' [88], [89] reduces this leakage to a tolerable level.
For FASTA arrays with a large w (wide tags), the discharge delay may be excessive, even with ULVT transistors. To mitigate this, the comparison logic of the FASTA row is segmented to limit the number of serially connected nMOS devices in the discharge path. Specifically, the w-bit row is divided into s comparison segments, such that each pull-down path has w/s transistors in series. In each such segment, the pull-down (ML_NAND) and the pull-up (ML_NOR) paths are connected to the output of the segment, as shown in Fig. 5(c). All segment outputs are propagated through an s-input CMOS OR gate. Such OR functionality can further be implemented as an OR logic tree.
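The logical effect of segmentation can be sketched as follows (again an illustrative Python model, under the assumption that each segment behaves like a small CCAM row): the OR of the s segment outputs reproduces the semantics of the unsegmented matchline, while each series pull-down path is limited to w/s transistors.

def segmented_ml(stored_bits, query_bits, num_segments):
    # Segmented CCAM row: OR of per-segment matchlines (0 = match, 1 = mismatch).
    w = len(stored_bits)
    assert w % num_segments == 0
    seg_len = w // num_segments  # series transistors per pull-down path
    segment_outputs = []
    for k in range(num_segments):
        seg_d = stored_bits[k * seg_len:(k + 1) * seg_len]
        seg_q = query_bits[k * seg_len:(k + 1) * seg_len]
        # Each segment discharges (output 0) only if all of its bits match.
        segment_outputs.append(0 if all(d == s for d, s in zip(seg_d, seg_q)) else 1)
    # The s-input OR gate (or OR tree): the row matches only if every segment output is low.
    return 1 if any(segment_outputs) else 0

# A 56-bit tag split into 14 segments of 4 cells, as used in the evaluation of Section V:
stored = [1] * 56
assert segmented_ml(stored, [1] * 56, 14) == 0        # full match
assert segmented_ml(stored, [1] * 55 + [0], 14) == 1  # a single mismatching bit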
These two methods (LVT devices and segmentation) can be combined to create an optimal trade-off between speed and leakage power (while also taking into consideration the area overhead of the segmentation option).

IV. VERY-MANY-WAY ASSOCIATIVE CACHE

A. MANY-WAY VS. FULLY-ASSOCIATIVE
The FASTA architecture, presented above, implements an energy-efficient associative memory, based on the CCAM bitcell, that mitigates the energy bottleneck of traditional CAM. While this solution is suitable for relatively small associative memories, the implementation of a very large cache (MBs or tens of MBs) would lead to excessive energy consumption compared to a conventional SRAM. This is because a conventional SRAM employs a hierarchy of many memory banks and subarrays, enabling the pre-decoding mechanism to disable large portions of the memory during access, thereby saving power. An associative memory, on the other hand, has to search the entire memory content, and therefore all parts of the memory need to be accessed on every compare operation. For an application that requires large memories, such as a last-level cache, the excessive energy consumption might be prohibitive. We propose resolving this by reducing the way-associativity to a level equal to the height of a typical memory subarray (i.e., 256- to 1024-way associativity), as presented below.

B. ALIASING-FREE CACHE
Before going into the proposed very-many-way associative cache architecture, let us revisit one of the main advantages of high associativity. In a virtual-memory-based processor, a straightforward physically-indexed physically-tagged cache access requires translating the virtual tag into a physical tag by searching the TLB and only then accessing the cache with the full physical address. Since this type of access is slow, a popular solution is virtually-indexed physically-tagged (VIPT) cache access, which overlaps the TLB lookup with the cache access, thereby reducing the access latency. However, the size of a VIPT cache is limited by aliasing [90], [91]. Aliasing can occur when the index extends beyond the page offset, such that a part of the index requires virtual-to-physical translation. This limits the index size to (page_offset − block_offset), which in turn limits the cache size to (page_size × associativity). Thus, a processor with a 4 kB page size (page_offset = 12) and a 64 B block size (block_offset = 6) can have at most 2^6 = 64 sets, and with 16-way associativity, the maximum aliasing-free cache size is 64 kB.
FASTA-enabled 256- to 1,024-way associativity allows constructing a 1 MB to 4 MB cache with only 64 sets, such that the index requires no virtual-to-physical translation, thus both preserving the low latency and eliminating the aliasing. The VMWA approach detailed below supports such levels of associativity, such that large, low-latency, energy-efficient, and aliasing-free caches can be built.
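Making the above arithmetic explicit for the stated parameters (4 kB pages, 64 B blocks):

\[
\text{sets}_{\max} = 2^{\,\text{page\_offset}-\text{block\_offset}} = 2^{12-6} = 64,
\qquad
\text{size} = \text{sets}_{\max}\times\text{block\_size}\times\text{associativity},
\]
\[
64 \times 64\,\text{B} \times 16 = 64\,\text{kB (conventional VIPT limit)},
\qquad
64 \times 64\,\text{B} \times \{256,\ldots,1024\} = 1\,\text{MB}\ldots 4\,\text{MB (VMWA)}.
\]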

C. VMWA ARCHITECTURE
The VMWA cache is implemented by several separate FASTA subarrays, each coupled with corresponding data SRAM modules that store the cache blocks (lines). Each of the FASTA subarrays is a tag array, with its hit[i] signals enabling the wordlines of the adjacent data SRAM modules. Each individual FASTA is a set, and each FASTA row is a way. The index bits of the virtual address are used to select one of the FASTA arrays (sets), such that all the others can remain idle during the access to save energy.
Enabling the idling of all but one of the FASTA tag arrays (sets) is a very important advantage of the VMWA cache. In a conventional VIPT way-associative cache, all tag and all data SRAM arrays must be accessed in parallel to synchronize the arrival of the physical tags from the tag arrays and the TLB. While this allows a reduction in access time, it also leads to significant energy consumption overhead. Such overhead is avoided in a VMWA cache because the working FASTA array (set) is selected by the virtual index, hence only one of many sets needs to be active. Fig. 6 visualizes the implementation of an aliasing-free 1 MB VIPT cache based on the proposed VMWA architecture. The cache is implemented with 64 sets, which are 256-row (i.e., 256-way) FASTA tag arrays and corresponding SRAM data modules. The VMWA address decoder activates one of the 64 sets. If the physical tag is found in the selected set, the corresponding data line is read out and passed through a 64-to-1 multiplexer as the cache output.
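The access flow of Fig. 6 can be summarized by the following behavioral Python sketch (an illustration of the lookup sequence only, with plain lists standing in for the FASTA sets and data SRAM modules; the class and method names are ours, not from the paper):

class VMWACache:
    # Behavioral sketch: 64 sets, each a 256-way FASTA tag array coupled with
    # a 256-row data SRAM module.
    def __init__(self, num_sets=64, ways=256):
        self.num_sets = num_sets
        self.ways = ways
        self.tags = [[None] * ways for _ in range(num_sets)]  # physical tags
        self.data = [[None] * ways for _ in range(num_sets)]  # cache lines

    def lookup(self, virtual_index, physical_tag):
        # 1) The virtual index selects a single FASTA set; all other sets stay idle.
        tag_array = self.tags[virtual_index]
        # 2) The selected FASTA compares the physical tag against all of its rows
        #    in parallel, producing a one-hot hit vector.
        hit_vector = [stored is not None and stored == physical_tag for stored in tag_array]
        # 3) An asserted hit[i] enables the corresponding wordline of the adjacent
        #    data SRAM; the global hit signal is the OR of the vector.
        if any(hit_vector):
            way = hit_vector.index(True)
            return True, self.data[virtual_index][way]
        return False, None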
Revisiting the area overhead of the FASTA-based solution: the tag arrays are made up of 256 × 53 CCAM bits (including the valid bit), while the adjacent cache line arrays are 256 × 512 SRAM bits, so there are approximately 10× more (small-area) SRAM bits than (large-area) CCAM bits. While this overhead is not negligible, it is not unique. For example, similar or larger area overheads are incurred to enable Error Correction Code (ECC) protection in SRAM memories [92], [93] (12.5% for 64 data bits [93]) and fault tolerance in Phase-Change Memory (PCM) (from 11.9% [94] to 12.5% [95]). We argue that such overhead can be afforded in microarchitectures where performance is among the most important requirements.
Similarly to how the tag array is ECC-protected in a conventional cache, FASTA can also support ECC and provide tag protection by slightly adjusting the sense amplifier design, as suggested in [29] and silicon-proven by [28].

D. REPLACEMENT POLICY OVERHEAD
A VMWA cache may support a variety of replacement policies; however, their overhead may differ from that of a conventional cache. The hardware overhead of Least Recently Used (LRU) is O(number_of_sets × associativity × log2(associativity)). For a 256-way 8-set associative VMWA cache, this overhead is 1.56%, while for a same-sized conventional 256-set 8-way associative cache, the overhead is 0.58%. However, for replacement policies such as Dynamic Insertion Policy (DIP) [96], Static Re-Reference Interval Prediction (SRRIP), Dynamic Re-Reference Interval Prediction (DRRIP) [97], or Not Recently Used (NRU), the hardware complexity is O(number_of_sets × associativity). Therefore, the overhead should be very similar for a same-sized X-way Y-set associative VMWA cache and a Y-way X-set associative conventional cache.
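As a quick check of these figures, assuming log2(associativity) LRU bits per block and counting against the 2^20 data bits of a 128 kB cache:

\[
\text{LRU}_{\text{VMWA}} = 8 \times 256 \times \log_2 256 = 16{,}384\ \text{bits}
\;\Rightarrow\; 16{,}384 / 2^{20} \approx 1.56\%,
\]
\[
\text{LRU}_{\text{conventional}} = 256 \times 8 \times \log_2 8 = 6{,}144\ \text{bits}
\;\Rightarrow\; 6{,}144 / 2^{20} \approx 0.58\%.
\]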
To conclude, a VMWA cache provides three major advantages: 1) It resolves the scalability limitation of VIPT caches and enables the construction of very large aliasing-free caches.
2) It enables a lower miss rate and performance improvement, as shown further in Section VI.
3) It saves energy by accessing only a single tag/data array during a search operation, as compared to an SRAM-based cache, which needs to access and compare the entire memory (all ways) in parallel.

V. EVALUATION AND RESULTS
A FASTA-based 256-way 8-set associative cache was custom-designed, including layout, using a commercial 16 nm FinFET PDK featuring a nominal supply voltage of 0.8 V. The results are obtained from post-layout simulations carried out within the Cadence Virtuoso environment, using the Spectre family of circuit simulators. Therefore, the results take into account the impact of the RC parasitics. For comparison, a baseline 8-way 256-set associative cache was also custom-designed using the same PDK under the same assumptions. The comparison between the 256-way 8-set associative cache and its baseline 8-way 256-set associative counterpart is presented below.

A. EVALUATION SETUP
The architecture of the baseline 8-way 256-set associative cache is shown in Fig. 7(a). The choice of 8-way set associativity is based on the associativity typically reported for modern processors, such as the AMD Zen-3 [12] and the Intel Skylake quad-core or Intel i7-6700 CPU [11], [13]. Such associativity requires 256 sets to provide a cache size of 128 kB; however, note that when indexed virtually and tagged physically, this configuration is prone to aliasing.
The SRAM tag arrays are 256 rows × 51 columns (64-bit virtual address, less the 6-bit block offset, less the 8-bit virtual index, plus one valid bit), as illustrated in Fig. 7(b). The outputs of the eight comparators are OR-ed to generate the global hit signal. The outputs of the sense amplifiers of the data SRAM arrays (eight ways) are fed into an 8-to-1 multiplexer that outputs the 64 B cache line.
The structure of the 256-way 8-set FASTA-based VMWA cache is shown in Fig. 7(c). Note that, contrary to the convention of Fig. 7(a), FASTA arrays are sets while FASTA rows are ways. Each FASTA block (set) has 256 rows × 56 columns (64-bit virtual address, less the 6-bit block offset, less the 3-bit virtual index, plus one valid bit), as shown in Fig. 7(d). The 3-bit index activates one of the eight sets (FASTA-based tag arrays), while the 55-bit physical tag is fed to the selected FASTA as the search (query) pattern. The Hit/Miss signal outputs from each set are OR-ed to provide the global hit signal, while the data outputs of each set are multiplexed to output the cache line.

B. ACCESS TIME AND ENERGY SIMULATION
To compare the proposed 256-way 8-set VMWA architecture with the baseline SRAM-based 8-way 256-set cache, we evaluate the access time and energy consumption. A FASTA row is divided into 14 segments of 4 cells. The evaluation is done by simulation; however, the main delay and energy components are detailed below. For the baseline cache, an access comprises the following (refer to Fig. 7(a) and (b) for the annotations):
1) Index Decoding: The time to decode the 8-bit index, denoted as t_INDEX = 140 ps. This is essentially the delay of an 8:256 SRAM address decoder (we assume the 2 MSBs of the index are not translated).
2) Tag and Data Readout: The time to read out the 50-bit physical tag, valid bit, and 64 B cache line, denoted as t_DATA = 128 ps. This time includes the bitline discharge and sense amplifier delay of the 512+51=563-bit wide SRAM (way) and is applied in parallel to all 8 ways (to ensure the simultaneous readiness of the data and all physical tags). This operation is energy-expensive, since all 8 ways of the baseline cache are accessed in parallel, both for data and tags. An alternative is to access only the tags (while keeping the data parts of the SRAM idle), and once the way is resolved, read the cache block from the relevant way only. This option is slower but less energy-demanding. For delay evaluation purposes, we select the more aggressive first (parallel access) alternative.
3) Tag Comparison: The time to compare the physical tags within a single way, denoted as t_COMPARE = 163 ps.
4) Multiplexing: The time it takes to multiplex out the valid data from all 8 ways, denoted as t_MUX = 25 ps. This operation takes the one-hot encoded Hit/Miss vector from the 8 ways and selects one to output from the cache.
Altogether, the baseline VIPT cache hit latency is calculated as:

t_HIT(baseline) = t_INDEX + t_DATA + t_COMPARE + t_MUX.    (3)

For the 256-way 8-set VMWA cache, the hit latency comprises the following components (refer to Fig. 7(c) and (d) for the annotated timing):
1) Index Decoding: The time to decode the 3-bit index to select one of the eight 256-way FASTA sets, while disabling the others. This time is denoted as t_INDEX = 10 ps.
2) FASTA Search: The time to compare the physical tag within the selected set (FASTA). This delay is denoted t_FASTA = 166 ps. Note that this delay ends with the assertion of the matched row's hit[i] signal, without the need to wait for the global Hit/Miss signal of the set.
3) Data Readout: The time to read out the 64 B cache line, initiated once the matched row's hit[i] signal enables the assertion of a wordline in the SRAM array. This delay is denoted t_DATA = 128 ps.
4) Multiplexing: The time it takes to multiplex out the valid data from one of the 8 sets, denoted as t_MUX = 25 ps. Note that this delay is slightly shorter than the t_MUX of the baseline cache, as the multiplexer is directly controlled by the 3-bit index (which requires no translation).

Altogether, the VMWA hit latency is calculated as:

t_HIT(VMWA) = t_INDEX + t_FASTA + t_DATA + t_MUX.    (4)

For the energy consumption comparison, the baseline 8-way 256-set associative cache includes the following components:
1) Tag Readout: The energy to read out a 50-bit tag and valid bit from a 256-row SRAM tag array, denoted as E_TAG = 0.71 pJ. This figure also includes the energy of the tag comparator and the Hit/Miss generation circuitry.
2) Cache Line Readout: The energy to read out a 64 B cache line from the data SRAM array, denoted E_DATA = 3.88 pJ. This includes precharging the bitlines, charging a single wordline, and sensing the bitlines. Both tag and cache line readouts are performed in parallel in all eight ways.
3) Multiplexing: The energy required to multiplex out the valid data from the eight ways, denoted as E_MUX = 0.19 pJ.

Altogether, the baseline 8-way 256-set associative cache energy is calculated as:

E(baseline) = 8 × (E_TAG + E_DATA) + E_MUX.    (5)

The overall energy consumption of the 256-way 8-set VMWA cache includes:
1) Tag Search: The energy to apply a 56-bit search within a 256-row FASTA, denoted as E_SEARCH = 0.43 pJ. This includes decoding the index to select the set, driving the searchlines, discharging and sensing a single matchline, and generating the global Hit/Miss signal of the set. Since only one set is selected and the others are disabled, the search energy applies to a single FASTA array.
2) Cache Line Readout: The energy to read out a 64 B cache line from the FASTA-coupled SRAM array, denoted E_DATA = 3.88 pJ. This includes precharging the bitlines, charging a single wordline, and sensing the bitlines, but does not include address decoding, as the wordline is selected based on the hit[i] outputs of the active set.
3) Multiplexing: The energy required to multiplex out the data from the selected set to the cache output, denoted as E_MUX.

Altogether, the VMWA energy is calculated as:

E(VMWA) = E_SEARCH + E_DATA + E_MUX.    (6)

Note that the energy during a cache miss is slightly lower, since no ML is discharged; however, this small deviation is omitted from the results.
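For reference, the following Python snippet reproduces the arithmetic implied by equations (3)-(6) using the component values listed above; the VMWA multiplexing energy is not stated explicitly in the text, so it is passed in as an assumed parameter here.

# Hit-latency and energy bookkeeping from the reported component values.
# All times in ps, all energies in pJ.

# Baseline 8-way 256-set VIPT cache, Eqs. (3) and (5)
t_index, t_data, t_compare, t_mux = 140, 128, 163, 25
t_hit_baseline = t_index + t_data + t_compare + t_mux   # 456 ps
e_tag, e_data, e_mux = 0.71, 3.88, 0.19
e_baseline = 8 * (e_tag + e_data) + e_mux               # ~36.9 pJ (all 8 ways active)

# 256-way 8-set VMWA cache, Eqs. (4) and (6)
t_index_v, t_fasta, t_data_v, t_mux_v = 10, 166, 128, 25
t_hit_vmwa = t_index_v + t_fasta + t_data_v + t_mux_v   # 329 ps

def e_vmwa(e_search=0.43, e_data=3.88, e_mux_set=0.19):
    # e_mux_set is an assumed placeholder, not a value reported in the paper.
    return e_search + e_data + e_mux_set

speedup = 1 - t_hit_vmwa / t_hit_baseline               # ~0.28, matching the reported 28%
energy_saving = 1 - e_vmwa() / e_baseline               # ~0.88 under the assumption above
print(t_hit_baseline, t_hit_vmwa, round(speedup, 2), round(e_baseline, 2), round(energy_saving, 2))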

C. POST-LAYOUT ACCESS TIME, ENERGY AND AREA RESULTS
Table 2 presents the results of the post-layout comparison between the proposed 256-way 8-set VMWA cache and the baseline SRAM-based 8-way 256-set design for any microarchitectural application that requires associative access, as presented in Fig. 7. The energy consumption and delay figures are calculated at a supply voltage of 0.8 V and during a search operation resulting in a hit. All memory peripherals are taken into account in the calculations (I/O peripherals, e.g., BL drivers and buses, as well as the control circuitry, are not).

a: ACCESS TIME
The FASTA-based 256-way 8-set cache provides 28% faster access when compared with an SRAM-based 8-way 256-set associative cache. The access time advantage is due to the optimized division of the matchline into comparison segments, and to the number of fins and the low threshold voltage devices used in the comparison transistors.

b: ENERGY CONSUMPTION
The energy consumption of a single FASTA array is 0.43 pJ (Table 2). This is 40% lower than the energy consumption of the building block of the SRAM-based tag array, which is a 256-row SRAM bank. However, during a search operation, all ways (i.e., individual tag/data arrays) of the baseline cache are accessed in parallel, while the VMWA activates only a single set (i.e., a single FASTA+data array). This leads to energy savings of 88% for the VMWA as compared to the baseline, when considering all of the energy components of equations (5) and (6). Leakage power consumption is also reported in Table 2, showing that the baseline consumes 70% lower standby power than the VMWA. Fig. 8 shows the breakdown of the energy consumption during a hit case for a single way of the SRAM-based 8-way 256-set baseline cache and a single set of the proposed 256-way 8-set VMWA cache. The SRAM data array and the multiplexing present about the same energy consumption for both designs. The main difference is in the SRAM tag and FASTA arrays. As mentioned above, FASTA achieves 40% lower energy consumption compared to the SRAM baseline. This energy gain is mainly due to the absence of address decoders and differential sense amplifiers, among other peripherals. Fig. 9 shows the 'pie-to-pie' energy consumption breakdown for a hit case for the proposed and baseline caches. The SRAM tag and FASTA arrays consume about 15% and 9.5% of the total energy, respectively, during a hit. Out of the 15% SRAM tag consumption, about 9.5% is consumed by the SRAM array, while the remaining energy consumption is distributed mainly between BL conditioning circuitry, comparison circuitry, sense amplifiers, and address decoders. Out of the 9.5% energy consumption of the FASTA array, its memory core consumes most of the energy, while the sensing and hit/miss logic consume only about 0.04%. This is mainly because inverters are used as sense amplifiers in each row, and only one row is active/toggled during a hit, thereby reducing the overall energy consumption per operation.

c: AREA OVERHEAD
The primary contributor to the area overhead is the memory array. The 10-transistor CCAM cell is 1.86× larger than the baseline 6T-SRAM cell. Additional area is needed for the per-row segmented NAND/OR tree and the Hit/Miss logic. However, in most FASTA applications, each tag array is coupled with an SRAM data array. While these two components feature the same number of rows, a typical cache tag array has approximately 10× fewer columns, e.g., a 56-bit FASTA vs. a 512-bit cache line in the VMWA sets. Fig. 10 shows the relative area footprint of the baseline and proposed cache designs, where the data array occupies (on average) about 85% of the total area. The overall area footprints of a single baseline way and a VMWA set are about 0.075 mm^2 and 0.082 mm^2, respectively. Therefore, the total area increase for the standalone FASTA and VMWA implementations is limited to 9%.
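A rough first-order estimate based only on the stated cell-size ratio and array widths (our own approximation, ignoring peripherals, which shift the split slightly) is consistent with the roughly 85%/15% data/tag breakdown of Fig. 10:

\[
\frac{A_{\text{data}}}{A_{\text{set}}} \approx \frac{512}{56 \times 1.86 + 512} \approx \frac{512}{616} \approx 83\%.
\]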
It is noteworthy that there is negligible hardware overhead if the proposed FASTA is compared with a fully associative tag array built with conventional NAND CAM cells. The proposed 10T CCAM cell features an extra transistor (M_NOR) compared to the conventional 9T NAND CAM cell. Despite this, the layout footprints of the CCAM and NAND CAM cells remain nearly identical. This equivalence is primarily attributed to the alignment of the cell pitch (or cell height) with the CAM row pitch, which corresponds to the standard cell pitch. In this way, the vacant space within the NAND CAM cell macro can be filled with the M_NOR pMOS transistor. Consequently, the M_NOR area overhead in the CCAM cell is negligible or non-existent.

d: DESIGN TRADE-OFFS
When compared to a conventional SRAM-based associative memory, the proposed FASTA-based VMWA cache features the following advantages:
• Cost-effective very high associativity
• Aliasing-free large-scale cache design
• Improved search access time
• Lower energy consumption
Overall, the above benefits are obtained at a moderate area overhead of 9%. Such overhead is not extraordinary in contemporary memory designs; for example, a 12.5% overhead is typical in ECC-protected memories.

e: FASTA ARRAY SIZE CONSIDERATIONS
• FASTA associativity increases linearly with its height (number of rows). Increasing the FASTA height therefore allows efficient associativity scaling, which is infeasible in an SRAM-based design. However, increasing the array height may adversely affect FASTA performance and energy. Therefore, a tradeoff exists that balances associativity, performance, and energy efficiency.
• The FASTA width is defined by the tag size, which in turn is determined by the memory address space (64-bit). Therefore, changing the width is not a design choice.
• The number of FASTA arrays (number of sets) is determined by the desired cache size, given the fixed FASTA height.

VI. VMWA CACHE PERFORMANCE EVALUATION
We focus the performance evaluation on the FASTA-based VMWA cache presented in Section IV. Other associative structures of computer microarchitecture, such as the TLB, BTB, etc., are not evaluated, since popular public-domain architecture simulators (e.g., GEM5 [98]) do not easily support user-defined associativity for any of these structures but the cache. Sniper [27], another commonly used CPU simulator, supports user-defined associativity only for caches and small TLBs.
The VMWA cache performance is evaluated using Sniper version 7.4 [99], which is arguably the fastest, yet sufficiently accurate, tool for system-level analysis. Sniper supports variable associativity in a three-level cache hierarchy, including the L1, L2, and last-level L3 caches.
The figures of merit in our performance evaluation are Misses Per Kilo Instructions (MPKI), Instructions Per Cycle (IPC), and execution time. We study their behavior as a function of the associativity and the replacement policy. While the relations between miss rate, cache size, and associativity have been extensively investigated in the past [100], our evaluation differs in that we extend it to a very high associativity level (256).
We performed three distinct experiments: 'cache-conflicts', 'cache associativity', and 'energy consumption'. Further details on each experiment are provided below.

A. CACHE-CONFLICTS
The purpose of this first experiment is to stress and highlight the adverse effect of conflict misses on MPKI and IPC. We use the cache-conflicts benchmark from the hardware-effects suite [101], which is part of the recent DAMOV benchmark suite [102].
The cache-conflicts benchmark stresses the cache associativity by periodically writing to the same set, causing quick thrashing in low-associativity caches. For example, in a CPU with a 32 kB 8-way associative L1 cache and 64-byte cache blocks, conflict-miss-induced cache thrashing occurs when the CPU continuously writes to memory addresses at an interval of 4096 bytes. The simulator was set to the Gainestown configuration, with one logical and physical core, a 32 kB L1 cache, and a block size of 64 bytes. We run the cache-conflicts benchmark while varying the L1 associativity. The memory access interval is set to vary according to the associativity. The overall number of writes is set such that the capacity miss component remains negligible, to allow the focus on conflict misses. The resulting MPKI and IPC are presented in Fig. 11(a) and (b), respectively. The continuous write access to the same set causes a significant MPKI increase and a resulting IPC drop in low-associativity configurations. The increase of associativity to 128 and 256 resolves this problem almost entirely (the small residual MPKI is due to capacity misses), allowing a significant improvement in IPC.
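The stressing access pattern can be illustrated with the following Python sketch (a simplified rendering of the benchmark's behavior, not the actual hardware-effects source code):

# Simplified model of the cache-conflicts access pattern: every write maps to the
# same set of a 32 kB, 8-way, 64 B-block L1 cache, so a conventional cache
# thrashes once the number of distinct conflicting lines exceeds the associativity.
CACHE_SIZE = 32 * 1024
WAYS = 8
BLOCK = 64
SETS = CACHE_SIZE // (WAYS * BLOCK)   # 64 sets
STRIDE = SETS * BLOCK                 # 4096 B: every access lands in the same set

def set_index(address):
    return (address // BLOCK) % SETS

addresses = [i * STRIDE for i in range(32)]       # 32 conflicting lines
assert all(set_index(a) == 0 for a in addresses)  # all map to set 0
# An 8-way cache can hold only 8 of these lines per set and thrashes continuously,
# while a 256-way (VMWA-like) organization keeps all of them resident.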

B. CACHE ASSOCIATIVITY
This second experiment is devised to study the typical impact of high associativity on MPKI, IPC, and execution time using more general-purpose workloads. This experiment involves several workloads from the splash2 benchmark suite [103], as well as NetworkX (running large-scale graph traversal), Pytorch, and Tensorflow (processing large-scale tensor datasets). The splash2 workloads are specified in the legends of Figs. 12-14. In each run, we vary the cache associativity of one of the cache hierarchy levels (L1, L2, or L3), while the rest are set to the Sniper defaults (32 kB, 256 kB, and 8 MB for L1, L2, and L3, respectively). The LRU and random replacement policies were tested. At high associativity levels, random performs slightly worse than LRU; hence, we show here only the random replacement results.
The MPKI, IPC, and execution time results are presented in Fig. 12 and Fig. 13, respectively. The following are the main results of our analysis:
• The VMWA cache exhibits lower MPKI in the majority of the benchmarks. Specifically, the average MPKI improvement achieved by the 256-way associative L1 compared to the 16-way associative L1 is 17.3%. The average IPC improvement is 1.6%.
• The average MPKI improvement achieved by the 256-way associative L2 compared to the 16-way associative L2 is 11.5%. The average IPC improvement is 1.4%.
• The average MPKI improvement achieved by the 256-way associative L3 compared to the 16-way associative L3 is 1.2%, which translates to a 0.2% IPC improvement on average.
• The s-curve summarizing the IPC gains of the VMWA cache compared to the 16-way associative L1, L2, and L3 caches is presented in Fig. 14.

C. ENERGY CONSUMPTION
This third experiment evaluates the energy consumption at the workload level. Fig. 15 shows the total cache energy consumption for different workloads, for the proposed VMWA and a 16-way set associative cache. The VMWA presents about 94% lower energy consumption (on average) compared to the 16-way set associative cache. This energy gain is achieved mainly because, during a search operation, the VMWA activates a single set, while the baseline has to access all 16 ways in parallel.

VII. CONCLUSION
Associative access is widely employed across modern computer microarchitecture. In this work, we revisit the conventional wisdom surrounding associative memory (CAM) and propose a set of solutions that enable the reintroduction of CAM to computer microarchitecture. Specifically, we focus on developing a fully associative tag array (FASTA), which enables full associativity while providing better access time and lower energy consumption compared with a way-associative SRAM-based solution. We also introduce, design, and evaluate a FASTA-based Very-Many-Way Associative (VMWA) cache, which enables very high associativity, scalability, and aliasing-free large-scale cache implementation. The benefits of VMWA cache associativity, scalability, and reduced energy consumption come with an area overhead of around 9%. We argue that such an overhead is quite typical in contemporary memory designs and hence might be afforded in applications where high associativity plays a critical role in enabling performance improvement.
Performance evaluation shows that for many benchmarks and cache configurations (sizes and levels), VMWA achieves lower MPKI and shorter execution time compared to a 16-way associative cache. Looking beyond the cache, the VMWA architecture may deliver significant benefits to a variety of associative memory structures widely deployed across computer microarchitecture.

APPENDIX A GLOSSARY
Acronyms used in this manuscript are given in Table 3.

FIGURE 2. Schematic of the ten-transistor CCAM bitcell. *Only the source of the M_NAND of the leftmost bitcell in a row is connected to ground.

FIGURE 3. Match and mismatch cases for a 3-bit wide row of CCAM cells.

TABLE 1. Search latency and energy consumption of NOR CAM and NAND CAM relative to CCAM.

FIGURE 4. FASTA architecture: (a) top-level schematic diagram of the CCAM array, and (b) zoom in on a w-bit wide row of the array.

FIGURE 5. FASTA NAND chain discharge path in the case of a match operation. (a) Simplified scheme of the FASTA row. (b) FASTA NAND chain built from different transistor flavors. (c) w-bit wide FASTA NAND chain divided into s segments of w/s bitcells.

FIGURE 6. Top-level view of a VMWA 256-way associative cache. Note: contrary to the convention, tag arrays are the sets, and memory rows are the ways.



FIGURE 8. Breakdown of the energy consumption during a hit case for an SRAM-based 8-way 256-set cache and the proposed 256-way 8-set VMWA cache. The energy figures refer to one way and one set of the baseline and proposed designs, respectively.

FIGURE 9. 'Pie-to-pie' chart energy breakdown for (a) the SRAM-based 8-way 256-set cache and (b) the proposed 256-way 8-set VMWA cache. The energy breakdowns of the SRAM tag and FASTA arrays are highlighted in the right charts.

FIGURE 10. Relative area footprint of the baseline and proposed cache designs. The total areas of a single way (for the baseline) and a single set (for VMWA) are about 0.075 mm^2 and 0.082 mm^2, respectively.

FIGURE 11. (a) MPKI and (b) IPC as a function of associativity for the cache-conflicts benchmark. The experiment emphasizes the practical implications of the proposed VMWA design on cache performance.

FIGURE 15. Total cache energy consumption for different workloads.

TABLE 2. Post-layout comparison results: 256-way 8-set associative VMWA cache vs. 8-way 256-set baseline cache. Reported energy and access time results are obtained for a hit case during a search operation.