
Longest-First Search Using Bloom Filter: Algorithm and FPGA Implementation



Abstract:

Due to the surge in Internet traffic and the rapid increase in forwarding table entries, achieving wire-speed packet forwarding in Internet routers demands both algorithmic enhancements and hardware innovations. In this paper, we propose the longest-first search algorithm with a Bloom filter that stores prefixes in a leaf-pushing trie. Our approach utilizes an on-chip Bloom filter to indicate the presence of prefixes within the trie, while an off-chip hash table stores the corresponding output port information for each prefix. For each incoming IP address, the Bloom filter query begins from the longest length, reducing the number of queried bits by one for each negative result. The off-chip hash table is only accessed when the query to the Bloom filter produces a positive result. Therefore, access to the slower off-chip memory is minimized to once, given a reasonable Bloom filter size. The proposed approach is simulated using C++ and constructed with Verilog for field programmable gate array (FPGA) implementation. The experimental results indicate that the proposed approach achieves a throughput of 0.8 million packets per second at a clock frequency of 100 MHz.
Published in: IEEE Access (Volume: 13)
Page(s): 49354 - 49361
Date of Publication: 17 March 2025
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Internet routers are required to look up Internet protocol (IP) addresses by referencing their forwarding tables to identify an appropriate output link for each incoming packet at wire speed [1], [2]. Given the potential influx of packets (reaching up to several million per second), processing them at wire speed is a highly challenging task for routers.

IP addresses have a two-level hierarchy comprising network and host parts. The network part (termed the prefix) is used for IP address lookup. To accommodate the increasing number of Internet users, the classless inter-domain routing (CIDR) scheme was introduced, allowing prefixes of arbitrary length [3]. Under CIDR, a single destination IP address can match multiple prefixes, and the longest matching prefix is determined as the best matching prefix (BMP) of the IP address. A packet is forwarded to the output port associated with the BMP of its destination IP address.

Although the advent of CIDR imposes complexities on the IP lookup process, it significantly improves the scalability of Internet addressing. This complexity, coupled with the rapid growth of Internet traffic, underscores the necessity of enhancing the efficiency of the IP address lookup process [4]. Addressing this issue requires a comprehensive approach that includes both sophisticated algorithmic strategies and effective hardware solutions [2], [5], [6]. The concept of applying on-chip Bloom filters to minimize the number of times slower off-chip memory is accessed is highly regarded for its potential to reduce overall processing delays [7], [8]. Mun and Lim proposed an algorithm that applies an on-chip Bloom filter to improve the search performance in a binary trie [9].

In our previous research, we demonstrated that combining a leaf-pushing binary trie with a Bloom filter and employing the longest-first search strategy yields promising performance [10]. In this paper, we extend these initial findings by proposing a hardware architecture implementing our algorithm on a field programmable gate array (FPGA).

Packet forwarding functions in an Internet router are often custom-built by manufacturers, typically using application-specific integrated circuits (ASICs). While ASICs are designed to meet a specific performance metric in terms of speed, they cannot be altered once manufactured. In contrast, FPGAs offer flexibility by allowing reprogramming. Accordingly, FPGAs serve as excellent tools for testing and optimizing hardware architectures before they are implemented as ASICs [11], [12].

The contribution of this paper can be summarized as follows:

  • We extend our previous work [10] by proposing a hardware architecture that implements the longest-first search using an on-chip Bloom filter, which stores prefixes in a leaf-pushing binary trie.

  • The proposed hardware architecture is implemented on an FPGA, demonstrating its practical applicability and effectiveness.

  • C++ code is developed to perform an algorithmic-level performance comparison with existing algorithms. In addition, Verilog code is developed for the FPGA implementation. Both are publicly available, allowing other researchers to use them in their studies.

The remaining parts of this paper are organized as follows: Section II presents related works, including Bloom filter theory, a binary trie, and some previous IP address lookup algorithms. In Section III, we explain the proposed algorithm, while Section IV contains a description of the proposed hardware architecture. Section V shows algorithmic-level simulation results and Section VI presents FPGA implementation results. Section VII concludes the paper.

SECTION II.

Related Works

A. Bloom Filter Theory

A Bloom filter is a space-efficient probabilistic data structure that tests whether an input element is a member of a given set [13]. Bloom filters and their variants [14], [15], [16] are used in many applications [17], [18], [19] due to their simplicity, low memory requirements, and high-speed operations [20]. A Bloom filter uses an array of m bits that are initially set to 0.

When programming set S=\{x_{1}, x_{2}, \ldots , x_{n}\} to a Bloom filter, k hash functions h_{i}(x) (for i=1 to k) are used. Each element x in the set is processed by these hash functions to produce k indices in the range of \{0, 1, \ldots , m-1\}. The bits at these indices in the Bloom filter are set to 1. For a set with n elements, the optimal number of hash functions with an m-bit Bloom filter is determined as follows:
\begin{equation*} k=\frac{m}{n}\ln 2. \tag{1}\end{equation*}


Figure 1 shows a Bloom filter programmed with two elements (x_{1} and x_{2} ) using three hash functions. The same k hash functions are used for querying an input. The input is hashed to obtain k indices, and if any of the bits at these indices in the Bloom filter are 0, the result is negative, implying that the input is not a member of the set. If all bits are 1, the result is positive, implying that the input is considered a member of the set. However, false positives can occur due to hash collisions.

FIGURE 1. Bloom filter.

In Fig. 1, querying for inputs q_{1} , q_{2} , and q_{3} would produce the following results: negative, positive, and positive, respectively. It should be noted that although all the queried bits for q_{3} are 1, those bits were set by two different elements, and hence the result is a false positive. The false positives of a Bloom filter are identified through an additional memory access, such as a hash table lookup. The false positive rate f of an m-bit Bloom filter with k hash indices for each of n elements is obtained as follows [9]:
\begin{equation*} f = \left( 1 - \left( 1 - \frac{1}{m} \right)^{kn} \right)^{k}. \tag{2}\end{equation*}

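As an illustration of the programming and querying operations described above, the following is a minimal C++ sketch, not the implementation used in this paper; the salted std::hash functions are an assumption standing in for the CRC-derived indices described in Section IV.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch following the description above. The salted
// std::hash functions are illustrative assumptions; the paper derives its
// k indices from a 64-bit CRC code instead.
class BloomFilter {
public:
    BloomFilter(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {}

    // Programming: set the k hashed bit positions for element x.
    void program(const std::string& x) {
        for (std::size_t i = 0; i < k_; ++i)
            bits_[index(x, i)] = true;
    }

    // Querying: negative if any bit is 0; positive (possibly false) otherwise.
    bool query(const std::string& x) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[index(x, i)]) return false;  // definite non-member
        return true;  // member, or a false positive
    }

private:
    std::size_t index(const std::string& x, std::size_t salt) const {
        return std::hash<std::string>{}(x + '#' + std::to_string(salt)) % bits_.size();
    }
    std::vector<bool> bits_;
    std::size_t k_;
};
```

A negative answer is always correct, while a positive answer may be false, which is why a slower membership structure such as a hash table must confirm it.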

The false positive rate in Equation (2) is influenced by m, and \frac{m}{n} represents the number of bits allocated per element in the Bloom filter. A higher \frac{m}{n} reduces hash collisions, thereby lowering the false positive rate. We use a size factor \alpha to define the Bloom filter size as:
\begin{equation*} m = \alpha\, 2^{\lceil \log_{2} n \rceil}. \tag{3}\end{equation*}


A higher value of \alpha achieves a lower false positive rate at the cost of a higher memory overhead.
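As a worked example of (1) and (3), assume a hypothetical leaf-pushed prefix count of n = 1000 and a size factor \alpha = 8 (the actual prefix counts for each routing set are listed in Table 1); rounding k to an integer is our assumption, as the rounding rule is not stated:
\begin{equation*} m = \alpha\, 2^{\lceil \log_{2} n \rceil} = 8 \cdot 2^{10} = 8192 \text{ bits}, \qquad k = \frac{m}{n}\ln 2 = \frac{8192}{1000}\ln 2 \approx 5.7 \;\Rightarrow\; k = 6. \end{equation*}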

B. Binary Trie and Leaf Pushing Trie

The binary trie is a useful data structure for locating prefixes of various lengths in the nodes of the trie [21], [22]. Each node in a binary trie has at most two children, representing the binary digits 0 (left) and 1 (right). Starting from the most significant bit, each bit of a prefix determines the path from the root node, while the length of the prefix determines the level of the node in which the prefix is placed [23]. Fig. 2 displays an example set of prefixes and the corresponding binary trie.

FIGURE 2. Binary trie.

Prefixes exhibit a nesting relationship when one prefix is a sub-string of another prefix. For example, prefix 000* is a sub-string of prefix 00001*. The sub-string prefix is located at an internal node in a binary trie. This means that the search process must continue to the deepest level of the trie to find the longest matching prefix, even when a prefix node is encountered earlier in the process. Leaf-pushing is applied to free every prefix from the nesting relationship, enabling the search to terminate upon encountering a prefix node. Figure 3 shows the leaf-pushed version of the binary trie from Fig. 2, where each internal prefix is pushed down to one or more leaf nodes by extending its length for longest prefix matching. For instance, prefix 000* is relocated to the leaf nodes 00000* and 0001*, and prefix 11* is relocated to the leaf nodes 110*, 1111*, and 11101*.
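The construction just described can be sketched in C++ as follows; the node layout and port encoding are illustrative assumptions, not the data structures used in the implementation.

```cpp
#include <memory>
#include <string>

// Sketch of a binary trie with leaf pushing, as described above.
// Prefixes are given as strings of '0'/'1'; ports are illustrative.
struct Node {
    int port = -1;                 // -1: no prefix stored at this node
    std::unique_ptr<Node> child[2];
};

void insert(Node* root, const std::string& prefix, int port) {
    Node* n = root;
    for (char b : prefix) {
        int i = b - '0';
        if (!n->child[i]) n->child[i] = std::make_unique<Node>();
        n = n->child[i].get();
    }
    n->port = port;                // prefix length decides the node level
}

// Push every internal prefix down so that only leaf nodes carry prefixes.
void leafPush(Node* n, int inherited) {
    if (!n) return;
    if (n->port != -1) inherited = n->port;   // this node's prefix overrides ancestors
    bool isLeaf = !n->child[0] && !n->child[1];
    if (isLeaf) {
        n->port = inherited;                  // leaf keeps the longest matching prefix
        return;
    }
    // Internal node: create any missing child so the inherited prefix survives
    // one level below, then clear the prefix from this internal node.
    if (inherited != -1) {
        for (int i = 0; i < 2; ++i)
            if (!n->child[i]) n->child[i] = std::make_unique<Node>();
    }
    n->port = -1;
    leafPush(n->child[0].get(), inherited);
    leafPush(n->child[1].get(), inherited);
}
```

Running this on the prefixes of Fig. 2 reproduces the relocation in Fig. 3: prefix 000* reappears at 00000* and 0001*, and only leaves carry prefixes afterwards.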

FIGURE 3. Leaf-pushing binary trie.

C. Using Bloom Filters for IP Address Lookup

Dharmapurikar et al. [7] explored how to integrate Bloom filters into the IP address lookup process. Their approach employed multiple Bloom filters, and each Bloom filter was designated to manage prefixes of a specific length. These filters were stored in on-chip memories and queried in parallel. Only the lengths that yielded a positive result from the Bloom filter query were considered for further off-chip accesses.

Lim et al. [8] improved the search performance of trie-based algorithms by incorporating Bloom filters to check the presence of a node within a binary trie. The search efficiency improved by removing the need to access the trie stored in an off-chip memory when the Bloom filter returned a negative result.

Mun et al. [9] reversed this principle of using a Bloom filter in a binary trie. In their approach, starting from the root node, the off-chip hash table storing the nodes of the trie was not accessed as long as the Bloom filter produced positive results. The Bloom filter querying continued with an increasing number of query bits and stopped when the Bloom filter produced a negative result. This negative result confirmed that the trie had no node at the current level or at any longer level, because child nodes cannot exist without ancestor nodes in a binary trie. Hence, the last positive level would be the longest matching level, provided it was not a false positive. In this way, the number of times the external hash table was accessed was limited to once [9].

In particular, a refined version of the algorithm, Mun2R, employed leaf-pushing to exploit the structural characteristics of the leaf-pushed binary trie and reduce the required amount of hash table memory. To the best of our knowledge, Mun2R [9] is the fastest algorithmic solution for IP address lookup.
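For comparison with the proposed algorithm in Section III, the following C++ sketch captures the shortest-first search behavior described above; bfQuery and htLookup are placeholder interfaces, and the leaf-pushing refinement and false-positive backtracking of Mun2R are omitted.

```cpp
#include <functional>
#include <string>

// Simplified sketch of the shortest-first search of Mun et al. [9] as
// described above. bfQuery and htLookup stand in for the on-chip Bloom
// filter and the off-chip hash table; they are placeholders, not the
// authors' exact interfaces. htLookup returns the output port or -1.
int shortestFirstLookup(const std::function<bool(const std::string&)>& bfQuery,
                        const std::function<int(const std::string&)>& htLookup,
                        const std::string& addrBits,   // e.g. "000110"
                        int minLen) {
    int lastPositive = -1;
    for (int len = minLen; len <= static_cast<int>(addrBits.size()); ++len) {
        if (bfQuery(addrBits.substr(0, len)))
            lastPositive = len;   // a trie node may exist here, try longer lengths
        else
            break;                // no node at this level, hence none at longer ones
    }
    if (lastPositive < 0) return -1;                     // no match
    return htLookup(addrBits.substr(0, lastPositive));   // single off-chip access
}
```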

SECTION III.

Proposed Algorithm

We propose to perform the longest-first search using a Bloom filter that stores prefixes in a leaf-pushing trie. The proposed algorithm constructs the binary trie for a given prefix set, and then leaf-pushing is applied to the binary trie to ensure that all prefix nodes are positioned at leaves. The prefixes in the leaf-pushing trie are then programmed into an on-chip Bloom filter. The prefixes are also stored in an off-chip hash table with output port information associated with each prefix.

Figure 4 shows the overall structure of the proposed approach. The on-chip Bloom filter is queried starting from the longest length, denoted as W. In the case of a negative result, implying that there is no prefix node in the binary trie, the Bloom filter query is repeated with the length reduced by one. In the case of a positive result, implying a possible matching prefix node, the off-chip hash table is accessed to obtain the output port associated with the matching prefix. The hash table determines whether the positive result from the Bloom filter query is true or false. If there is no matching entry in the hash table, the query result of the Bloom filter is identified as a false positive; hence, the querying length is reduced by one and the Bloom filter querying continues. Otherwise, if there is a matching entry in the hash table, it becomes the best matching prefix (BMP) and the search is completed.

FIGURE 4. Overall structure of the proposed approach.

Algorithm 1 presents the pseudo-code for finding the best matching prefix of each incoming IP address in the proposed algorithm.

Algorithm 1. Proposed IP Address Lookup Algorithm.

Optimizing the size of the Bloom filter is crucial because it directly impacts the false positive rate. A sufficiently large Bloom filter can significantly reduce the number of false positives, potentially allowing the search operation to be completed with a single access to the off-chip hash table and thereby maximizing the speed of the lookup process.

To illustrate the proposed algorithm, we use the example of Fig. 3. Consider a simplified 6-bit IP address, 000110. The search begins at the longest length (which is 6), where the hash indices corresponding to 000110 are generated and queried to the Bloom filter. If the initial result from the Bloom filter is assumed to be negative, the length of the input is reduced by 1 to yield 00011. If querying the Bloom filter with input 00011 is assumed to be positive, the hash table is accessed using a relevant hash index. However, the absence of a matching prefix for 00011 in the hash table reveals this to be a false positive. Hence, the input undergoes a further truncation to 0001, initiating another search. The Bloom filter query returns a positive result again. Subsequent access to the hash table with 0001 successfully locates a valid best matching prefix (BMP). Thus, for the input 000110, the example demonstrates that the corresponding BMP is obtained by accessing the hash table twice. If no false positive occurred in the Bloom filter querying, the hash table would only need to be accessed once.
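Since Algorithm 1 is provided as a figure, the following C++ sketch restates the longest-first search loop consistently with the worked example above; bfQuery and htLookup are placeholder interfaces for the on-chip Bloom filter and the off-chip hash table, not the pseudo-code itself.

```cpp
#include <functional>
#include <string>

// Longest-first search sketch. htLookup returns the output port of a stored
// prefix, or -1 when no entry exists (i.e. the Bloom filter positive was false).
int longestFirstLookup(const std::function<bool(const std::string&)>& bfQuery,
                       const std::function<int(const std::string&)>& htLookup,
                       const std::string& addrBits,    // e.g. "000110", W = 6
                       int minLen) {
    for (int len = static_cast<int>(addrBits.size()); len >= minLen; --len) {
        const std::string key = addrBits.substr(0, len);
        if (!bfQuery(key))
            continue;                    // negative: no prefix of this length, shorten
        int port = htLookup(key);        // positive: one off-chip access
        if (port != -1)
            return port;                 // true positive: best matching prefix found
        // false positive: keep reducing the length
    }
    return -1;                           // no matching prefix
}
```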

SECTION IV.

Proposed Hardware Architecture

The hardware architecture of the proposed algorithm is structured into two distinct modules: the controller and the data path. The controller contains a state machine controlling the overall search process shown in Algorithm 1. Each state of the state machine in the controller generates signals to control the data path, and the data path generates signals that cause the state machine to transition to the next state.

A. Controller Design

Figure 5 displays the state machine controlling the search process of the proposed algorithm after the Bloom filter has been programmed and the hash table constructed. The controller consists of seven states: the Initial state; the Start state for preparing the input IP address; the CRC and TakeIdx states for generating hash indices; the q_BF state for querying the Bloom filter; the RL state for reducing the query length in case of a negative result; and the s_HT state for accessing the hash table in case of a positive result from the Bloom filter query.

FIGURE 5. Controller composed of a state machine controlling the proposed algorithm.

The search process begins at the Initial state. When the IPgo signal arrives from the data path, the state changes to the Start state, where the IP_on signal is set to trigger fetching an input IP address in the data path. Upon receiving the IPready signal from the data path, the state moves to the CRC state.

In the proposed architecture, hash indices for both the Bloom filter and the hash table are obtained by a 64-bit cyclic redundancy check (CRC) generator in the data path. The generated CRC code is very useful for obtaining multiple hash indices of arbitrary lengths because predetermined parts of the CRC code can be extracted and used as hash indices. At the CRC state, the CRC_on signal is set to trigger the CRC calculation. After the CRC_Code is generated, the CRCdone signal arrives from the data path, prompting a transition to the next state (TakeIdx).
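A possible way of taking hash indices from predetermined parts of the CRC code is sketched below in C++; the field width, field spacing, and modulo reduction are assumptions for illustration, as the paper does not specify which bits are extracted.

```cpp
#include <cstdint>
#include <vector>

// Extract k hash indices from a 64-bit CRC code by slicing fixed-width fields.
// fieldBits, step, and the final modulo are illustrative assumptions.
std::vector<uint32_t> takeIndices(uint64_t crcCode, int k, uint32_t m) {
    const int fieldBits = 16;   // assumed width of each extracted field
    const int step = 8;         // assumed spacing; fields may overlap
    std::vector<uint32_t> idx(k);
    for (int i = 0; i < k; ++i) {
        int shift = (i * step) % (64 - fieldBits + 1);          // keep the slice in range
        uint64_t field = (crcCode >> shift) & ((1ULL << fieldBits) - 1);
        idx[i] = static_cast<uint32_t>(field % m);              // wrap into the m-bit filter
    }
    return idx;
}
```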

At the TakeIdx state, the required hash indices are retrieved from the CRC_Code. At the q_BF state, Bloom filter querying is conducted by setting the BFQuery signal, resulting in either BFNeg or BFPos from the data path.

If BFNeg is generated, the next state becomes RL. With the length reduced by 1 at the RL state, the state returns to the TakeIdx state to retrieve the Bloom filter indices corresponding to the reduced length through the reduceLen signal.

Otherwise, if the BFPos signal is generated from the data path, the state transitions to the s_HT state, accessing the off-chip hash table. If there is no entry in the hash table, which is indicated by the FalsePos signal from the data path, the process continues to the RL state. Otherwise, if there is a matching entry in the hash table, which is indicated by the TruePos signal from the data path, the BMP is returned and the process goes to the Start state to process the next input IP address. If there is no matching entry even after reducing the length to the shortest among the prefixes, the Nomatch signal also drives the state to Start. The same steps are repeated for the next input.
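The controller itself is written in Verilog; the following C++ switch statement only mirrors the transitions described above, with the handshake signals abstracted as boolean flags. The unconditional TakeIdx-to-q_BF transition and the placement of the Nomatch handling in the s_HT state are our assumptions, since Fig. 5 is not reproduced here in text.

```cpp
// C++ mirror of the controller state transitions described above. The real
// design is a Verilog state machine; the flags stand in for the handshake
// signals exchanged with the data path.
enum class State { Initial, Start, CRC, TakeIdx, q_BF, RL, s_HT };

State nextState(State s, bool IPgo, bool IPready, bool CRCdone,
                bool BFNeg, bool BFPos, bool FalsePos, bool TruePos, bool Nomatch) {
    switch (s) {
        case State::Initial: return IPgo    ? State::Start   : s;
        case State::Start:   return IPready ? State::CRC     : s;
        case State::CRC:     return CRCdone ? State::TakeIdx : s;
        case State::TakeIdx: return State::q_BF;              // indices taken from CRC_Code
        case State::q_BF:    return BFNeg ? State::RL : (BFPos ? State::s_HT : s);
        case State::RL:      return State::TakeIdx;           // re-query with length - 1
        case State::s_HT:
            if (TruePos || Nomatch) return State::Start;       // BMP returned or no match
            if (FalsePos)           return State::RL;          // keep reducing the length
            return s;
    }
    return s;
}
```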

B. Data Path Design

Figure 6 displays the data path of the proposed architecture communicating with the controller. The data path consists of four blocks: IPAddr, CRC_64, QueryBF, and SearchHT. The black lines in Fig. 6 indicate the signals communicated with the controller, and the blue lines represent signals through the data path.

FIGURE 6. Data path for the proposed algorithm.

The lookup process begins with the IPAddr block. The IPgo signal forces the controller to change state from Initial to Start. Then, triggered by the IP_on signal, this block retrieves an IP address and passes it to the CRC_64 block.

Upon receiving a CRC_on signal from the state machine, the CRC_64 block calculates the CRC codes for all lengths of the input IP address and stores them in an SRAM. Since each bit of the input IP address is entered serially into the CRC generator (starting from the most significant bit), it is more efficient to calculate and store the CRC codes of all lengths at once.
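A bit-serial sketch of this one-pass computation is given below in C++; the CRC-64 generator polynomial and the zero initial value are assumptions, as the paper does not specify them.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Compute the CRC code for every prefix length of the input in one serial pass,
// storing one code per length (as the CRC_64 block does into its SRAM). The
// CRC-64/ECMA polynomial and zero initial value are illustrative assumptions.
std::vector<uint64_t> crcAllLengths(const std::string& addrBits) {
    const uint64_t poly = 0x42F0E1EBA9EA3693ULL;   // assumed generator polynomial
    uint64_t crc = 0;
    std::vector<uint64_t> codes;                   // codes[l-1] = CRC of the first l bits
    for (char b : addrBits) {                      // MSB first, one bit per cycle
        uint64_t in = static_cast<uint64_t>(b - '0');
        uint64_t feedback = (crc >> 63) ^ in;      // bit-serial LFSR update
        crc <<= 1;
        if (feedback) crc ^= poly;
        codes.push_back(crc);                      // store the code for this length
    }
    return codes;
}
```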

The completion of this process is communicated back to the state machine via the CRCdone signal. The QueryBF block conducts the querying of the Bloom filter. Depending on the query result, this block signals back to the state machine with either a BFNeg or a BFPos signal. The BFNeg signal indicates the need to re-query the Bloom filter with a reduced length. Conversely, the BFPos signal suggests a potential match, prompting a transition to the SearchHT block.

The final stage of the process is managed by the SearchHT block, which accesses the hash table and reports back to the state machine with the results. Depending on whether a matching entry is found, it sends either a FalsePos or a TruePos signal. The FalsePos signal necessitates a reduction of the search length. In contrast, the TruePos signal confirms a successful match, resulting in a return of the best matching prefix (BMP). If the search reaches the shortest length without finding the BMP, the Nomatch signal is set.
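For illustration, a minimal C++ sketch of a hash table entry and the match check performed by SearchHT follows; the entry layout and the single-probe lookup are assumptions, not the implemented table organization.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Assumed entry layout: the stored prefix, its length, and the output port.
struct HashEntry {
    uint32_t prefixBits = 0;   // prefix value, left-aligned into 32 bits
    uint8_t  length     = 0;   // prefix length; 0 marks an empty slot
    uint16_t outPort    = 0;   // forwarding information for the prefix
};

// Single-probe lookup: a hit corresponds to TruePos, a miss to FalsePos.
std::optional<uint16_t> searchHT(const std::vector<HashEntry>& table,
                                 uint32_t key, uint8_t len, uint32_t index) {
    const HashEntry& e = table[index % table.size()];
    if (e.length == len && e.prefixBits == key)
        return e.outPort;      // best matching prefix found
    return std::nullopt;       // Bloom filter positive was false
}
```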

SECTION V.

Simulation

We conducted behavioral simulations using C++ for three real-world routing sets. The source code is available for public use [24]. The performance of the developed algorithm was then compared with Mun2R [9]. Hash indices for both the Bloom filter and the hash table were generated through a 64-bit CRC generator. The number of Bloom filter indices (k) was determined based on the formula presented in (1). The size of the Bloom filter (m) was calculated using (3), where n represents the total number of prefixes obtained after applying leaf-pushing to the binary trie. The Bloom filter size factor (\alpha ) was varied between 4, 8, and 16.

Table 1 presents a comparison of the prefix counts before and after applying leaf-pushing and the number of input IP addresses used in the simulation.

TABLE 1. Prefix Sets and Inputs Used for Simulation.

Table 2 presents a comparison of the nodes used when programming the Bloom filters. While Mun2R requires programming all the nodes in the trie, the proposed algorithm only programs prefix nodes. As indicated in the table, the number of nodes programmed in the Bloom filter of the proposed algorithm was almost half that of Mun2R. Hence, the amount of memory for the Bloom filter was also half that required for Mun2R.

TABLE 2. Number of Nodes Used in Programming Bloom Filters.

Table 3 compares the average number of times the Bloom filter was accessed in the proposed algorithm and Mun2R. The number of Bloom filter accesses was determined by the search method of each algorithm, independent of \alpha . Mun2R initiated the search with the shortest length. When the Bloom filter query results were positive, the search continued with a longer length. When the Bloom filter query result was negative, backtracking occurred to the previous level (the level of the last positive result) to access the hash table. Conversely, the proposed algorithm started the search from the longest length, refraining from accessing the hash table when the Bloom filter query results were negative. The hash table was accessed only when the Bloom filter query produced a positive result. Table 3 demonstrates that the proposed algorithm requires fewer Bloom filter queries than Mun2R.

TABLE 3. Average Number of Bloom Filter Queries of Each Algorithm.

Since both algorithms utilize leaf-pushing, it is more likely for prefixes to be located at longer levels, especially for large routing sets. Therefore, the average number of Bloom filter queries was smaller in the proposed algorithm. Moreover, the proposed algorithm demonstrated superior scalability, requiring fewer Bloom filter queries as the size of the routing set increased.

Figure 7 compares the average number of times the hash table was accessed for each algorithm according to the size factor of the Bloom filter. The dashed line indicates the results of Mun2R [9], and the solid line represents the results of the proposed algorithm. We can see the impact of varying \alpha on the performance. As \alpha increases, the false positive rate decreases, resulting in a lower average number of hash table accesses. Both algorithms only accessed the hash table once for each IP address lookup when \alpha was 16. For size factors 4 and 8, Mun2R delivered slightly better performance than the proposed algorithm.

FIGURE 7. Average number of times the hash table was accessed according to the size factor of the Bloom filter.

Fig. 8 compares the average number of times the hash table was accessed for each algorithm according to the memory size of the Bloom filter. The dashed line indicates the results of Mun2R [9], and the solid line represents the results of the proposed algorithm. Since the proposed algorithm required less memory for the Bloom filter compared to Mun2R (Table 2), the proposed algorithm demonstrated superior performance in nearly all cases.

FIGURE 8. Average number of times the hash table was accessed according to the memory size of the Bloom filter.

Table 4 compares the number of entries in the hash table (N_{h} ) and the memory consumption (M) of each algorithm. While Mun2R included both prefix and non-prefix nodes with a single child in the hash table, the proposed algorithm only stored prefix nodes in the hash table, saving off-chip memory capacity as shown in Table 4.

TABLE 4. Number of Stored Nodes and Off-Chip Memory Requirement for Each Algorithm.

SECTION VI.

Hardware Implementation

The hardware implementation of our algorithm was carried out using the Verilog hardware description language (HDL) within the Vivado 2023.1 development environment. The source code has been made publicly available [25]. The target device was the Nexys4 DDR FPGA, which operates at a clock frequency of 100 MHz. The implementation used block RAM (BRAM) to construct the hash table, which was pre-populated with data obtained from behavior-level simulations conducted prior to hardware implementation. In the experiments, the input IP addresses were also stored in BRAM to match the operational speed of the FPGA. These IP addresses were placed in dual-port RAMs.

Before synthesizing the Verilog HDL code, it was verified with a testbench to ensure that the desired operations were performed correctly. Figure 9 shows the waveform generated by the testbench simulation. Fig. 10 presents the resource utilization of the implemented hardware. BRAM utilization was high because both the hash table and the input IP addresses were stored in BRAMs. After porting to the FPGA, the results were successfully verified.

FIGURE 9. Waveform.

FIGURE 10. Resource utilization.

The throughput of the proposed architecture was calculated as the number of packets processed per second. Since obtaining the BMPs of 3,000 pre-stored IP addresses in BRAM took 3.753 ms, the throughput was approximately 0.8 million packets per second. The throughput could be significantly improved by increasing the clock speed and applying the pipeline technique to the data path of the architecture.
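The reported figure follows directly from the measurement:
\begin{equation*} \text{throughput} = \frac{3000 \text{ packets}}{3.753 \text{ ms}} \approx 7.99 \times 10^{5} \text{ packets/s} \approx 0.8 \text{ million packets per second}. \end{equation*}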

SECTION VII.

Conclusion

This paper presented an IP address lookup algorithm that efficiently combined the leaf-pushing trie with a Bloom filter, which was optimized through the longest-first searching strategy. The use of an on-chip Bloom filter with the leaf-pushing trie structure significantly enhanced the search performance by minimizing the dependency on slower off-chip memory access.

We systematically designed and implemented the proposed algorithm in hardware, dividing it into a control path with a discrete state machine and a data path consisting of four processing blocks. The feasibility of our approach was demonstrated through a Verilog HDL implementation, highlighting its potential for real-world applications. Experimental results confirmed that our method effectively reduces memory accesses and improves lookup speed, making it well-suited for high-performance network processing.

Future research will focus on further optimizing the hardware implementation by exploring alternative hash functions, improving pipelining efficiency, and refining the overall architecture to achieve even greater performance and scalability [26], [27], [28].
