Introduction
Internet routers are required to look up Internet protocol (IP) addresses by referencing their forwarding tables to identify an appropriate output link for each incoming packet at wire-speed [1], [2]. Given the potential influx of packets (up to several million per second), processing them at wire-speed is a highly challenging task for routers.
IP addresses have a two-level hierarchy comprising network and host parts. The network part (termed the prefix) is used for IP address lookup. To accommodate the increasing number of Internet users, the classless inter-domain routing (CIDR) scheme has been introduced, allowing prefixes of arbitrary length [3]. Under CIDR, a single destination IP address can match multiple prefixes, and the longest matching prefix is determined as the best matching prefix (BMP) of the IP address. A packet is forwarded to the output port associated with the BMP of its destination IP address.
Although the advent of CIDR imposes complexities on the IP lookup process, it significantly improves the scalability of Internet addressing. This complexity, coupled with the rapid growth of Internet traffic, underscores the necessity of enhancing the efficiency of the IP address lookup process [4]. Addressing this issue requires a comprehensive approach that includes both sophisticated algorithmic strategies and effective hardware solutions [2], [5], [6]. The concept of applying on-chip Bloom filters to minimize the number of accesses to slower off-chip memory is highly regarded for its potential to reduce overall processing delays [7], [8]. Mun and Lim proposed an algorithm that applies an on-chip Bloom filter to improve the search performance of a binary trie [9].
In our previous research, we demonstrated that combining a leaf-pushing binary trie with a Bloom filter and employing the longest-first search strategy yields promising performance [10]. In this paper, we extend these initial findings by proposing a hardware architecture implementing our algorithm on a field programmable gate array (FPGA).
Packet forwarding functions in an Internet router are often custom-built by manufacturers, typically using application-specific integrated circuits (ASICs). While ASICs are designed to meet a specific performance metric in terms of speed, they cannot be altered once manufactured. In contrast, FPGAs offer flexibility by allowing reprogramming. Accordingly, FPGAs serve as excellent tools for testing and optimizing hardware architectures before they are implemented as ASICs [11], [12].
The contribution of this paper can be summarized as follows:
We extend our previous work [10] by proposing a hardware architecture that implements the longest-first search using an on-chip Bloom filter storing the prefixes of a leaf-pushed binary trie.
The proposed hardware architecture is implemented on an FPGA, demonstrating its practical applicability and effectiveness.
C++ code is developed to perform an algorithmic-level performance comparison with existing algorithms. In addition, Verilog code is developed for the FPGA implementation. Both are publicly available, allowing other researchers to use them in their own studies.
The remainder of this paper is organized as follows: Section II presents related works, including Bloom filter theory, the binary trie, and previous IP address lookup algorithms. Section III explains the proposed algorithm, while Section IV describes the proposed hardware architecture. Section V shows algorithmic-level simulation results, and Section VI presents FPGA implementation results. Section VII concludes the paper.
Related Works
A. Bloom Filter Theory
A Bloom filter is a space-efficient probabilistic data structure that tests whether an input element is a member of a given set [13]. Bloom filters and their variants [14], [15], [16] are used in many applications [17], [18], [19] due to their simplicity, low memory requirements, and high-speed operations [20]. A Bloom filter uses an array of m bits that are initially set to 0.
When a set of n elements is programmed into the filter, each element is hashed by k independent hash functions, and the k corresponding bits of the m-bit array are set to 1. For given m and n, the number of hash functions that minimizes the false positive rate is \begin{equation*} k=\frac {m}{n}\ln 2 . \tag {1}\end{equation*}
Figure 1 shows a Bloom filter programmed with two elements.
In Fig. 1, querying for inputs that were not programmed can still return a positive result when all k queried bits happen to have been set by other elements; such a result is a false positive. The false positive rate of a Bloom filter is \begin{equation*} f = \left ({{1 - \left ({{1 - \frac {1}{m}}}\right )^{k n}}}\right )^{k}. \tag {2}\end{equation*}
The false positive rate in (2) is influenced by m, n, and k. In this paper, the size of the Bloom filter is determined by a size factor α as \begin{equation*} m = \alpha 2^{\lceil \log _{2}n \rceil }. \tag {3}\end{equation*}
A higher value of α yields a larger Bloom filter and thus a lower false positive rate, at the cost of a larger on-chip memory.
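As a concrete illustration of (1)–(3), a minimal C++ sketch of a Bloom filter is given below. The class layout and the use of std::hash with a per-index seed are assumptions made here for readability; the implementation described in Section IV instead derives its indices from a 64-bit CRC code.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch. std::hash with a per-index seed stands in for
// the k hash functions (an illustrative assumption); the hardware uses CRC.
class BloomFilter {
public:
    // n: number of elements to be programmed, alpha: size factor in (3).
    BloomFilter(std::size_t n, std::size_t alpha) {
        std::size_t m = alpha << static_cast<std::size_t>(std::ceil(std::log2(n)));   // (3)
        k_ = std::max<std::size_t>(1, static_cast<std::size_t>(
                 std::lround(static_cast<double>(m) / n * std::log(2.0))));           // (1)
        bits_.assign(m, false);
    }

    void program(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[index(key, i)] = true;
    }

    // Returns false only if the key was definitely not programmed; a positive
    // result may be a false positive with the probability f given in (2).
    bool query(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    std::size_t index(const std::string& key, std::size_t seed) const {
        return std::hash<std::string>{}(key + '#' + std::to_string(seed)) % bits_.size();
    }
    std::vector<bool> bits_;
    std::size_t k_;
};
```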
B. Binary Trie and Leaf Pushing Trie
The binary trie is a useful data structure for locating prefixes of various lengths in the nodes of the trie [21], [22]. Each node in a binary trie has at most two children, representing the binary digits 0 (left) and 1 (right). Starting from the most significant bit, each bit of a prefix determines the path from the root node, while the length of the prefix determines the level of the node in which the prefix is placed [23]. Fig. 2 displays an example set of prefixes and the corresponding binary trie.
Prefixes exhibit a nesting relationship when one prefix is a sub-string of other prefixes. For example, prefix 000* is a sub-string of prefix 00001*. The sub-string prefix is located at an internal node of a binary trie. This means that the search process must continue to the deepest level of the trie to find the longest matching prefix, even when a prefix node is encountered earlier in the process. Leaf-pushing is applied to free every prefix from the nesting relationship, enabling the search to end upon encountering a prefix node. Figure 3 shows the leaf-pushed version of the binary trie from Fig. 2, where each internal prefix is pushed down to one or more leaf nodes by extending its length for longest prefix matching. For instance, prefix 000* is relocated to the leaf nodes 00000* and 0001*, and prefix 11* is relocated to the leaf nodes 110*, 1111*, and 11101*.
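For illustration, the C++ sketch below shows how a binary trie could be built from bit-string prefixes and how leaf-pushing could be applied. The TrieNode layout and the function names are illustrative assumptions and do not reproduce the released source code.

```cpp
#include <memory>
#include <string>

// Illustrative binary trie node; a prefix such as "000" is stored as a
// bit-string and its output port is kept in the node where it ends.
struct TrieNode {
    int port = -1;                       // -1: no prefix stored at this node
    std::unique_ptr<TrieNode> child[2];  // child[0]: bit 0, child[1]: bit 1
};

// Insert a prefix (e.g. "000") by walking one level per bit from the root.
void insert(TrieNode* root, const std::string& prefix, int port) {
    TrieNode* node = root;
    for (char bit : prefix) {
        int b = bit - '0';
        if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
        node = node->child[b].get();
    }
    node->port = port;
}

// Leaf-pushing: an internal prefix is pushed down so that prefixes remain
// only at leaf nodes, removing nesting relationships between prefixes.
void leafPush(TrieNode* node, int inheritedPort) {
    if (!node) return;
    const bool isLeaf = !node->child[0] && !node->child[1];
    if (node->port != -1) inheritedPort = node->port;
    if (isLeaf) {
        node->port = inheritedPort;      // leaf keeps the closest ancestor prefix
        return;
    }
    // Internal node: create missing children so the pushed prefix gets a leaf.
    if (inheritedPort != -1) {
        for (int b = 0; b < 2; ++b)
            if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
        node->port = -1;                 // the prefix no longer lives here
    }
    leafPush(node->child[0].get(), inheritedPort);
    leafPush(node->child[1].get(), inheritedPort);
}
```

Applied to the example of Fig. 2, this procedure relocates 000* to the leaves 00000* and 0001*, as described above.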
C. Using Bloom Filters for IP Address Lookup
Dharmapurikar et al. [7] explored how to integrate Bloom filters into the IP address lookup process. Their approach employed multiple Bloom filters, and each Bloom filter was designated to manage prefixes of a specific length. These filters were stored in on-chip memories and queried in parallel. Only the lengths that yielded a positive result from the Bloom filter query were considered for further off-chip accesses.
Lim et al. [8] improved the search performance of trie-based algorithms by incorporating Bloom filters to check the presence of a node within a binary trie. The search efficiency improved by removing the need to access the trie stored in an off-chip memory when the Bloom filter returned a negative result.
Mun et al. [9] reversed the principle of using a Bloom filter in a binary trie. In this approach, starting from the root node, the off-chip hash table storing the nodes of the trie was not accessed as long as the Bloom filter produced positive results. The Bloom filter querying continued with an increasing number of query bits and stopped when the Bloom filter produced a negative result. This negative result confirmed that the trie did not have a node at the current level or at any longer level, because child nodes cannot exist without ancestor nodes in a binary trie. Hence, the last positive level would be the longest matching level if it was not a false positive. In this way, the number of times the external hash table was accessed was limited to once [9].
In particular, a refined version of the algorithm, Mun2R, employed leaf-pushing and exploited the structural characteristics of the leaf-pushed binary trie to reduce the required amount of hash table memory. To the best of our knowledge, Mun2R [9] is the fastest algorithmic solution for IP address lookup.
Proposed Algorithm
We propose performing a longest-first search using an on-chip Bloom filter that stores the prefixes of a leaf-pushed trie. The proposed algorithm constructs the binary trie for a given prefix set and then applies leaf-pushing to the binary trie to ensure that all prefix nodes are positioned at leaves. The prefixes of the leaf-pushed trie are then programmed into the on-chip Bloom filter. The prefixes are also stored in an off-chip hash table along with the output port information associated with each prefix.
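Assuming a trie node with two child pointers and a port field, and a Bloom filter with a program() method (as in the sketches of Section II), the construction step could look like the following; the function name and interfaces are hypothetical and given only to make the build flow concrete.

```cpp
#include <string>

// Build-time sketch: traverse a leaf-pushed trie and, for every leaf prefix,
// program the on-chip Bloom filter and insert the prefix with its output port
// into the off-chip hash table. TrieNode, BF, and HT are any types providing
// the small interfaces used below (for example, the earlier TrieNode and
// BloomFilter sketches together with std::unordered_map<std::string, int>).
template <class TrieNode, class BF, class HT>
void buildFromLeafPushedTrie(const TrieNode* node, const std::string& path,
                             BF& bloomFilter, HT& hashTable) {
    if (!node) return;
    const bool isLeaf = !node->child[0] && !node->child[1];
    if (isLeaf) {
        if (node->port != -1) {
            bloomFilter.program(path);     // membership information (on-chip)
            hashTable[path] = node->port;  // prefix and output port (off-chip)
        }
        return;
    }
    buildFromLeafPushedTrie(node->child[0].get(), path + '0', bloomFilter, hashTable);
    buildFromLeafPushedTrie(node->child[1].get(), path + '1', bloomFilter, hashTable);
}
```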
Figure 4 shows the overall structure of the proposed approach. The on-chip Bloom filter is queried starting from the longest length, denoted as W. In the case of a negative result, which implies that no prefix node of that length exists in the binary trie, the Bloom filter query is repeated with the length reduced by one. In the case of a positive result, which implies a possible matching prefix node, the off-chip hash table is accessed to obtain the output port associated with the matching prefix. The hash table determines whether the positive result of the Bloom filter query is true or false. If there is no matching entry in the hash table, the query result of the Bloom filter is identified as a false positive; hence, the querying length is reduced by one and the Bloom filter querying continues. Otherwise, if there is a matching entry in the hash table, it becomes the best matching prefix (BMP) and the search is completed.
Algorithm 1 presents the pseudo-code for finding the best matching prefix of each incoming IP address in the proposed algorithm.
Optimizing the size of the Bloom filter is crucial because it directly impacts the false positive rate. A sufficiently large Bloom filter can significantly reduce the number of false positives, potentially allowing the search operation to be completed with a single access to the off-chip hash table and thus maximizing the speed of the lookup process.
To illustrate the proposed algorithm, we use the example of Fig. 3. Consider a simplified 6-bit IP address, 000110. The search begins at the longest length (which is 6), where the hash indices corresponding to 000110 are generated and queried to the Bloom filter. If the initial result from the Bloom filter is assumed to be negative, the length of the input is reduced by 1 to yield 00011. If the query for input 00011 to the Bloom filter is assumed to be positive, the hash table is accessed using the relevant hash index. However, the absence of a matching prefix for 00011 in the hash table reveals this to be a false positive. Hence, the input is further truncated to 0001, initiating another search. The Bloom filter query returns a positive result again, and the subsequent access to the hash table with 0001 successfully locates a valid best matching prefix (BMP). Thus, for the input 000110, the corresponding BMP is obtained by accessing the hash table twice. If no false positive occurred in the Bloom filter querying, the hash table would only need to be accessed once.
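A behavioral C++ sketch of this longest-first search is given below. The filter type, the hash-table representation, and the shortestPrefixLen bound are assumptions of the sketch (any type with a query() method, such as the Bloom filter sketch in Section II, would fit); the released simulation code may differ in detail.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Longest-first search: query the on-chip Bloom filter from length W downward
// and access the off-chip hash table only when the filter returns a positive.
template <class BF>
std::optional<int> lookup(const std::string& addrBits, int W, int shortestPrefixLen,
                          const BF& bloomFilter,
                          const std::unordered_map<std::string, int>& hashTable) {
    for (int len = W; len >= shortestPrefixLen; --len) {
        const std::string key = addrBits.substr(0, len);   // truncate to len bits
        if (!bloomFilter.query(key)) continue;             // negative: reduce length
        auto it = hashTable.find(key);                     // positive: one off-chip access
        if (it != hashTable.end()) return it->second;      // true positive: BMP found
        // False positive: no matching entry, so reduce the length and continue.
    }
    return std::nullopt;                                   // no matching prefix
}
```

Applied to the example above (input 000110 with W = 6), this loop queries the Bloom filter at lengths 6, 5, and 4 and accesses the hash table twice: once for the false positive at length 5 and once for the BMP at length 4.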
Proposed Hardware Architecture
The hardware architecture of the proposed algorithm is structured into two distinct modules: the controller and the data path. The controller contains a state machine that controls the overall search process shown in Algorithm 1. Each state of the state machine generates signals to control the data path, and the data path generates signals for the state machine to transition to the next state.
A. Controller Design
Figure 5 displays the state machine controlling the search process of the proposed algorithm, after the Bloom filter has been programmed and the hash table constructed. The state machine consists of seven states: the Initial state; the Start state for preparing the input IP address; the CRC and TakeIdx states for generating hash indices; the q_BF state for querying the Bloom filter; the RL state for reducing the query length in the case of a negative result; and the s_HT state for accessing the hash table in the case of a positive result from the Bloom filter query.
The search process begins at the Initial state. When the IPgo signal arrives from the data path, the state changes to the Start state, where the IP_on signal is set to trigger fetching an input IP address in the data path. Upon receiving the IPready signal from the data path, the state moves to the CRC state.
In the proposed architecture, hash indices for both the Bloom filter and the hash table are obtained by a 64-bit cyclic redundancy check (CRC) generator in the data path. The generated CRC code is very useful for obtaining multiple hash indices for inputs of arbitrary lengths, because predetermined parts of the CRC code can be extracted and used as hash indices. At the CRC state, the CRC_on signal is set to trigger the CRC calculation. After the CRC_Code is generated, the CRCdone signal arrives from the data path, prompting a transition to the next state (TakeIdx).
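The C++ sketch below shows one way such index extraction could work: a bit-serial CRC-64 (using the CRC-64/ECMA-182 polynomial as an illustrative choice, not necessarily the generator used in the implementation) followed by slicing non-overlapping fields of the code into indices for a power-of-two-sized Bloom filter.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bit-serial CRC-64 over the first `len` bits of a bit-string address.
// Because the bits enter serially, the intermediate CRC after `len` bits is
// the code for the length-`len` truncation of the address.
uint64_t crc64(const std::string& addrBits, int len) {
    const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;  // CRC-64/ECMA-182 (illustrative)
    uint64_t crc = 0;
    for (int i = 0; i < len; ++i) {
        const uint64_t feedback = (crc >> 63) ^ static_cast<uint64_t>(addrBits[i] - '0');
        crc <<= 1;
        if (feedback) crc ^= kPoly;
    }
    return crc;
}

// Slice k hash indices out of predetermined, non-overlapping parts of the
// 64-bit CRC code, for a Bloom filter whose size m is a power of two
// (assumes k * log2(m) <= 64).
std::vector<uint32_t> takeIndices(uint64_t crc, int k, uint32_t m) {
    int bitsPerIndex = 0;
    while ((1u << bitsPerIndex) < m) ++bitsPerIndex;       // log2(m)
    std::vector<uint32_t> indices(k);
    for (int i = 0; i < k; ++i)
        indices[i] = static_cast<uint32_t>(crc >> (i * bitsPerIndex)) & (m - 1);
    return indices;
}
```

The bit-serial formulation also matches the observation in the data path design below that the CRC codes for all input lengths can be produced in a single serial pass and stored at once.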
At the TakeIdx state, the required hash indices are retrieved from the CRC_Code. At the q_BF state, Bloom filter querying is conducted by setting the BFQuery signal, resulting in either BFNeg or BFPos from the data path.
If BFNeg is generated, the next state becomes RL. With the length reduced by 1 at the RL state, the state returns to the TakeIdx state through the reduceLen signal to retrieve the Bloom filter indices corresponding to the reduced length.
Otherwise, if the BFPos signal is generated from the data path, the state machine transitions to the s_HT state, where the off-chip hash table is accessed. If there is no entry in the hash table, which is indicated by the FalsePos signal from the data path, the process continues to the RL state. Otherwise, if there is a matching entry in the hash table, which is indicated by the TruePos signal from the data path, the BMP is returned and the process goes to the Start state to process the next input IP address. If there is no matching entry even after the length has been reduced to the shortest among the prefixes, the Nomatch signal also drives the state machine to the Start state. The same steps are repeated for the next input.
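As a compact behavioral reference for these transitions, the following C++ sketch models the seven states with the data-path signals represented as plain boolean values. The unconditional TakeIdx-to-q_BF transition and the evaluation of Nomatch in the RL state are modeling assumptions; the actual Verilog controller may differ.

```cpp
// Behavioral model of the controller state machine in Fig. 5 (a sketch, not
// the Verilog source). One call to nextState advances the machine by one step.
enum class State { Initial, Start, CRC, TakeIdx, q_BF, RL, s_HT };

struct DataPathSignals {
    bool IPgo = false, IPready = false, CRCdone = false;
    bool BFNeg = false, BFPos = false;
    bool FalsePos = false, TruePos = false, Nomatch = false;
};

State nextState(State s, const DataPathSignals& d) {
    switch (s) {
        case State::Initial: return d.IPgo    ? State::Start   : State::Initial;
        case State::Start:   return d.IPready ? State::CRC     : State::Start;
        case State::CRC:     return d.CRCdone ? State::TakeIdx : State::CRC;
        case State::TakeIdx: return State::q_BF;   // indices taken from CRC_Code
        case State::q_BF:    return d.BFNeg ? State::RL
                                  : d.BFPos ? State::s_HT : State::q_BF;
        case State::RL:      // length reduced by 1; stop if below the shortest prefix
                             return d.Nomatch ? State::Start : State::TakeIdx;
        case State::s_HT:    // one off-chip hash table access
            if (d.TruePos)  return State::Start;   // BMP returned
            if (d.FalsePos) return State::RL;      // false positive: reduce length
            return State::s_HT;
    }
    return State::Initial;
}
```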
B. Data Path Design
Figure 6 displays the data path of the proposed architecture communicating with the controller. The data path consists of four blocks: IPAddr, CRC_64, QueryBF, and SearchHT. The black lines in Fig. 6 indicate the signals communicated with the controller, and the blue lines represent signals through the data path.
The lookup process begins with the IPAddr block. The IPgo signal forces the controller to change its state from Initial to Start. Then, triggered by the IP_on signal, this block retrieves an IP address and passes it to the CRC_64 block.
Upon receiving a CRC_on signal from the state machine, the CRC_64 block calculates the CRC codes for all lengths of the input IP address and stores them in an SRAM. Since each bit of the input IP address is serially entered into the CRC generator (starting from the most significant bit), it is more efficient to calculate and store the CRC codes of all lengths at once.
The completion of this process is communicated back to the state machine via the CRCdone signal. The QueryBF block performs the querying of the Bloom filter. Depending on the querying result, this block signals back to the state machine with either a BFNeg or a BFPos signal. The BFNeg signal indicates the need to re-query the Bloom filter with a reduced length. Conversely, the BFPos signal suggests a potential match, prompting an access through the SearchHT block.
The final stage of the process is managed by the SearchHT block, which accesses the hash table and reports back to the state machine with the results. Depending on whether a matching entry is found, it sends either a FalsePos or a TruePos signal. The FalsePos signal necessitates a reduction of the search length. In contrast, the TruePos signal confirms a successful match, resulting in the return of the best matching prefix (BMP). If the search has been conducted down to the shortest length but failed to find the BMP, the Nomatch signal is set.
Simulation
We conducted behavioral simulations in C++ for three real-world routing sets. The source codes are available for public use [24]. The performance of the developed algorithm was then compared with Mun2R [9]. Hash indices for both the Bloom filter and the hash table were generated through a 64-bit CRC generator. The number of Bloom filter indices (k) was determined based on the formula presented in (1). The size of the Bloom filter (m) was calculated using (3), where n represents the total number of prefixes obtained after applying leaf-pushing to the binary trie. The Bloom filter size factor (α in (3)) was varied to evaluate its effect on search performance.
Table 1 presents a comparison of the prefix counts before and after applying leaf-pushing and the number of input IP addresses used in the simulation.
Table 2 presents a comparison of the nodes used when programming the Bloom filters. While Mun2R requires programming all the nodes in the trie, the proposed algorithm programs only the prefix nodes. As indicated in the table, the number of nodes programmed into the Bloom filter of the proposed algorithm was almost half of that in Mun2R. Hence, the amount of memory for the Bloom filter was also half of that required for Mun2R.
Table 3 compares the average number of times the Bloom filter was accessed in the proposed algorithm and in Mun2R. The number of Bloom filter accesses was affected by the search method of each algorithm, independent of the Bloom filter size factor.
Since both algorithms utilize leaf-pushing, it is more likely for prefixes to be located at longer levels, especially for large routing sets. Therefore, the average number of Bloom filter queries was smaller in the proposed algorithm. Moreover, the proposed algorithm demonstrated superior scalability, requiring fewer Bloom filter queries as the size of the routing set increased.
Figure 7 compares the average number of times the hash table was accessed for each algorithm according to the size factor of the Bloom filter. The dashed line indicates the results of Mun2R [9], and the solid line represents the results of the proposed algorithm. We can see the impact of varying the size factor: a larger Bloom filter lowers the false positive rate and hence reduces the average number of hash table accesses.
Fig. 7. Average number of times the hash table was accessed according to the size factor of the Bloom filter.
Fig. 8 compares the average number of times the hash table was accessed for each algorithm according to the memory size of the Bloom filter. The dashed line indicates the results of Mun2R [9], and the solid line represents the results of the proposed algorithm. Since the proposed algorithm required less memory for the Bloom filter compared to Mun2R (Table 2), the proposed algorithm demonstrated superior performance in nearly all cases.
Fig. 8. Average number of times the hash table was accessed according to the memory size of the Bloom filter.
Table 4 compares the number of entries in the hash table for each algorithm.
Hardware Implementation
The hardware implementation of our algorithm was carried out using the Verilog hardware description language (HDL) with the Vivado 2023.1 development environment. The source codes have been made publicly available [25]. The target device was the Nexys4 DDR FPGA, which operates at a clock frequency of 100 MHz. The implementation used block RAM (BRAM) to construct the hash table, which was pre-populated with data obtained from the behavior-level simulations conducted prior to hardware implementation. In the experiments, the input IP addresses were also stored in BRAM, configured as dual-port RAM, to ensure compatibility with the operational speed of the FPGA.
Before synthesis, the Verilog HDL code was verified through a testbench to ensure that the desired operations were performed correctly. Figure 9 shows the waveform generated by the simulation using the testbench. Fig. 10 presents the resource utilization of the implemented hardware. BRAM utilization was high because both the hash table and the input IP addresses had to be stored. After porting to the FPGA, the results were successfully verified.
The throughput of the proposed architecture was calculated as the number of packets processed per second. Since obtaining the BMPs of 3,000 pre-stored IP addresses in BRAM took 3.753 ms, the throughput was approximately 0.8 million packets per second. The throughput could be significantly improved by increasing the clock speed and applying the pipeline technique to the data path of the architecture.
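For completeness, the reported figure follows directly from the measurement above: \begin{equation*} \text {throughput} = \frac {3000~\text {packets}}{3.753~\text {ms}} \approx 0.8 \times 10^{6}~\text {packets/s}.\end{equation*}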
Conclusion
This paper presented an IP address lookup algorithm that efficiently combined the leaf-pushing trie with a Bloom filter, which was optimized through the longest-first searching strategy. The use of an on-chip Bloom filter with the leaf-pushing trie structure significantly enhanced the search performance by minimizing the dependency on slower off-chip memory access.
We systematically designed and implemented the proposed algorithm in hardware, dividing it into a controller with a finite state machine and a data path consisting of four processing blocks. The feasibility of our approach was demonstrated through a Verilog HDL implementation, highlighting its potential for real-world applications. Experimental results confirmed that our method effectively reduces memory accesses and improves lookup speed, making it well-suited for high-performance network processing.
Future research will focus on further optimizing the hardware implementation by exploring alternative hash functions, improving pipelining efficiency, and refining the overall architecture to achieve even greater performance and scalability [26], [27], [28].