## HyunYong Lee* and Byung-Tak Lee*## |

BFs only | nBF | sBF | |
---|---|---|---|

Successful lookups | 83 | 84 | 85 |

False positives | 210 | Solved: 112 | 109 |

Not solved: 295 | 294 | ||

Total average: 114 | 111 | ||

Total average | 120 | 95 | 86 |

The routing table lookup performance of sBF is slightly better than that of nBF. This is due to the fact that sBF just needs to examine one additional filter while nBF may need to examine more than one additional filter. The routing table lookup performance when a false positive is not handled by nBF and sBF is worse than the case of BFs only case. This is due to the fact that L2 cache hit rate is affected by the number of accesses. In other words, BFs only case shows better performance in this case because it accesses the off-chip slow memory more than nBF or sBF.

The proposed approaches that exploit the additional filters have three attractive points. First, the proposed approaches reduce the false positive probability compared to the original BF and thus improves the performance of BF-based applications. Second, the use of additional filters does not cause the identified limitations of existing approaches described in Section 3.1. Third, the additional filters can be used for existing approaches to reduce the false positive probability of those approaches further.

nBF outperforms sBF in most cases, particularly when the percentage of [TeX:] $$N_{A C T I V E}$$ is high. Therefore, nBF is more suitable for the case where subsets change rarely. On the other hand, in the case where subsets change frequently and the percentage of [TeX:] $$N_{A C T I V E}$$ is low (like Internet environment where around 10% of IP prefixes accounts most traffic [17]), sBF is more suitable. However, if the update overhead of nBF can be properly managed while not causing performance degradation, nBF would also be a good choice for such a case.

Most BF-related approaches have two requirements: space efficiency and membership query. However, one problem (i.e., processing overhead) may happen when BF-based system is implemented using software. In the software-based system, required processing (i.e., k hash operations for each filter) is done sequentially. As a result, the processing time increases somewhat linearly as the number of subsets and the number of hash functions used increase. This processing overhead may become problematic for time-sensitive jobs (e.g., routing table lookup and traffic monitoring). In other words, in those time-sensitive jobs, there are three requirements: space efficiency, membership query, and fast processing time. To this end, we propose one approach to satisfy those three requirements for the time-sensitive jobs.

One simple approach to support fast membership query is to use a hash table. In the hash table, one hash computation is enough to find a bucket (i.e., basic unit of a hash table for storing data) to be examined. However, the hash table is not space-efficient. Most hash table implementations store both key (e.g., element itself) and value (e.g., subset ID) to handle hash collisions (i.e., the case where more than one element is mapped to the same bucket). As a result, the space overhead of the hash table increases as the number of elements and the size of element in bits increase.

In this paper, we try to achieve the space efficiency while supporting fast membership query. For this, our proposed approach (called sTable) keeps value only without keys (Fig. 10). As a result, the space overhead of sTable is not affected by the number of elements and the size of element in bits, but affected by the number of subsets that affects the required number of bits to represent subsets. The detailed descriptions about sTable are as follows.

sTable consists of buckets like a hash table. Each bucket includes a value (e.g., subset ID of an element mapped to this bucket or NULL as the initial value) and 1 bit-sized collision bit. Therefore, the size of one bucket is [TeX:] $$\left[\log _{2}(h+2)\right]$$ bits, where h is the number of subsets. Collision bit is used to indicate the status of a corresponding bucket in terms of hash collisions (i.e., 0 means no hash collision and 1 means hash collision). Because sTable does not store key that is required to handle hash collisions, sTable cannot handle hash collisions. In other words, sTable achieves the space efficiency at the cost of some membership query errors (i.e., sTable cannot return correct subset information for a given element when the given element is mapped to a bucket where hash collisions happen). Instead, sTable utilizes the collision bit to indicate (i) the value of a bucket may not be correct and/or (ii) additional operation is required to find correct subset information. For example, the additional operation for handling membership query error of sTable may be examining the additional data structures managed separately (e.g. a hash table kept in large off-chip slow memory). Given the storage space M (in bits) and the number of subsets h, the number of buckets of sTable is [TeX:] $$M /\left[\log _{2}(h+2)\right]$$.

The content of sTable’s buckets is determined by a corresponding hash table that is kept in large off-chip slow memory for the purpose of the management (Fig. 10). The hash table is first generated based on a given set. In this case, the key is element itself and the value is a corresponding subset ID. For the purpose of easy management, the number of buckets of the hash table is set to the number of buckets of sTable. In other words, buckets of sTable and the hash table have 1:1 mapping. The content of ith bucket of sTable is determined by the content of i^{th} bucket of sTable is determined by the content of i^{th} bucket of the hash table. If the bucket of the hash table does not contain any elements (i.e., 2nd bucket in Fig. 10), the value and the collision bit of the bucket of sTable are set to NULL and 0, respectively. If the bucket of the hash table contains one element (i.e., 1st bucket) or if the values of elements in the bucket of the hash table are same (i.e., 3rd bucket), the value and the collision bit of the bucket of sTable are set to the value of the hash table’s bucket and 0, respectively. In other cases (i.e., the case where hash collisions happen, the last bucket), the value of the bucket of sTable is randomly set to one of the values of the hash table’s bucket while the collision bit is set to 1.

The membership query with sTable is done as in a hash table. One hash function is applied to a given element to find a bucket to be examined. If the collision bit of the determined bucket is 0 and the value of the bucket is not NULL, the value of the bucket indicates correct subset information. On the other hand, if the collision bit is 1, the value of the bucket may or may not indicate correct subset information. Therefore, if some membership query error is allowable, the value of the bucket may be able to be used. If additional operations are available to handle this membership query error at the cost of additional overhead (e.g., access to slow memory to examine the hash table) and the additional overhead is allowable, additional operations can be done to find correct subset information.

When subsets change, the hash table is first updated. If contents of hash table’s buckets are changed (i.e., a new element is added to an empty bucket, different values of the same bucket become same, or same values of the same bucket becomes different), contents of corresponding buckets of sTable are updated.

In sTable, membership query error happens when a given element is mapped to a bucket whose collision bit is 1. The membership query error probability can be calculated using birthday problem [19] as follows. Given b buckets for sTable and n elements, the membership query error probability can be approximately expressed as

Eq. (12) means that given the number of elements, the membership query error probability is mainly affected by the number of subsets and the storage space that affect b.

Using the simulation settings described in Section 3.4, we conduct simulations. We first measure the change of the membership query error rate with regard to the number of subsets (Fig. 11). As we described before, the error rate of BF increases somewhat linearly as the number of subsets increases. On the other hand, the error rate of sTable increases logarithmically as the number of subsets increases. The number of subsets affects the error rate of sTable by influencing the number of buckets, but the number of buckets of sTable does not decrease linearly as the number of subsets increases. Please note that the size of one bucket of sTable is [TeX:] $$\left[\log _{2}(h+2)\right]$$. As a result, the error rate of sTable increases more slowly than that of BF.

We measure the effect of the number of subsets on the processing time (Fig. 12). The processing time of BF increases linearly as the number of subsets increases because the number of filters to be examined increases (i.e., from 2.6 ms for 10 subsets to 26 ms for 100 subsets). On the other hand, sTable shows somewhat constant processing time regardless of the number of subsets and the processing time is much smaller than that of BF. sTable shows 0.076 ms and 0.081 ms processing time for 10 subsets and 100 subsets, respectively.

To examine how much sTable improves performance compared to a hash table, we conduct the experiment using the same settings described in Section 3.4 except the following. We use 30 outgoing interfaces and 1024 kB is used for sTable. Table 2 shows the results. The successful routing table lookup with sTable (i.e., 85% of packets are handled within the fast memory) is much faster than that with the hash table. This is due to the fact that sTable handles packets within the fast memory while routing table lookup with the hash table causes accesses to the slow memory. As a result, sTable reduces the routing table lookup time by 58% compared to the hash table.

We also measure the required amount of storage space to achieve the target error rate (Fig. 13). For the hash table, we assume that the size of key is 48 bits (e.g., MAC address). Please note that the hash table always returns a correct result and thus requires the same amount of storage space regardless of the target error rate. On the other hand, BF and sTable require a larger amount of storage space to achieve the lower target error rate. The required amount of storage space of BF and sTable increases linearly and logarithmically as the target error rate decreases, respectively. As a result, the difference between the required amount of storage space for BF and sTable decreases as the target error rate decreases.

Above results first show that sTable achieves the fast and somewhat constant processing time at the cost of additional storage space. However, the required amount of storage space is much smaller than that of the hash table. In addition, unlike the hash table where the space overhead is affected by the size of key and the number of subsets, the space overhead of sTable is only affected by the number of subsets. The amount of additional storage space decreases as the target error rate decreases. Therefore, sTable may be more preferable than BF when the target error rate is low and the processing time is important.

We briefly compare the proposed three approaches as follows in the hope that the comparison is helpful to determine which one is proper for a given environment. In the following, [TeX:] $$A<B$$ B means A outperforms B. For the sake of simplicity, we assume that enough storage space is given for our approaches so that our approaches outperform BFs only case.

Space overhead: nBF < sBF < BF < sTable

Error rate: nBF < sBF < BF < sTable

Processing overhead: sTable < sBF < nBF < BF

Management overhead: sTable < BF < sBF < nBF

**Space overhead: ** For the same target membership query error rate, nBF and sBF require less amount of storage space than BF. sTable achieves the same error rate at the cost of slight additional storage.

**Error rate: ** Given the same amount of storage space, nBF and sBF cause a smaller number of errors than BF, which causes a smaller number of errors than sTable.

**Processing overhead:** sTable provides the fastest membership query. On the other hand, the relationship between BF and nBF/sBF depends on the case. If there is no false positives, BF, nBF, and sBF may show the same membership query speed because the examination of additional filters is not required. If there are false positives and additional operations (e.g., accessing the off-chip slow memory) are required, nBF and sBF may show better membership query speed than BF. In this case, nBF may take slightly more time than sBF because it may need to examine more than one filter.

**Management overhead: ** To insert or delete elements, sTable just needs one memory access for the bucket while BF-based approaches need to access k bits. The additional filters cause additional management overhead. In particular, nBF requires higher update overhead than sBF because it uses more than one additional filter.

Space-efficient membership query is useful in many areas. BF has been one of the good choices for the space-efficient membership query. Therefore, improving BFr may have noticeable impact on applications using BF. In this paper, we propose two approaches that utilize additional filters to reduce false positive probability. Simulations and experiments show that using additional filters reduce the false positive probability of existing approaches as well as the original BF and improve the BF-based routing table lookup performance. We also introduce one approach for improving membership query speed. Keeping values only, our approach archives fast and space-efficient membership query. Evaluation results prove that our approach is feasible in improving the membership query speed in a space-efficient manner.

He received the M.S. and Ph.D. degrees in computer science from the Gwangju Institute of Science and Technology (GIST), Korea, in 2003 and 2010, respectively. He is a currently senior researcher of Electronics Telecommunications and Research Institute (ETRI), Korea. His research interests include deep learning for renewable energy system.

He received his B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1992 and his M.S. and Ph.D. degrees in electrical and electronic engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1994 and 2000, respectively. From 2000 to 2004, he was a principal R&D engineer at LG Electronics, Seoul, Korea, where he was engaged in the area of 1.6 Tbps long-haul dense wavelength division multiplexing systems. Since 2005, he has been a senior research engineer at ETRI, Gwangju, Korea. His current research interests include Internet of Things and data analytics.

- 1 S. Dharmapurikar, P. Krishnamurthy, D. E. Taylor, "Longest prefix matching using bloom filters," in
*IEEE/ACM Transactions on Networking (TON)*, 2006;vol. 14, no. 2, pp. 397-409. custom:[[[-]]] - 2 M. Yu, A. Fabrikant, J. Rexford, "BUFFALO: bloom filter forwarding architecture for large organizations," in
*Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies*, Rome, Italy, 2009;pp. 313-324. custom:[[[-]]] - 3 H. Wu, D. Yang, G. Zhang, "SybilBF: defending against Sybil attacks via Bloom filters,"
*ETRI Journal*, vol. 33, no. 5, pp. 826-829, 2011.doi:[[[10.4218/etrij.11.0210.0431]]] - 4 J. Byers, J. Considine, M. Mitzenmacher, S. Rost, "Informed content delivery across adaptive overlay networks,"
*ACM SIGCOMM Computer Communication Review*, vol. 32, no. 4, pp. 47-60, 2002.doi:[[[10.1109/TNET.2004.836103]]] - 5 L. Fan, P. Cao, J. Almeida, A. Z. Broder, "Summary cache: a scalable wide-area web cache sharing protocol,"
*IEEE/ACM Transactions on Networking (TON)*, vol. 8, no. 3, pp. 281-293, 2000.doi:[[[10.1109/90.851975]]] - 6 B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors,"
*Communications of the ACM*, vol. 13, no. 7, pp. 422-426, 1970.doi:[[[10.1145/362686.362692]]] - 7
*Click modular router,*, https://github.com/kohler/click - 8 B. Fan, D. G. Andersen, M. Kaminsky, M. D. Mitzenmacher, "Cuckoo filter: Practically better than bloom," in
*Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies*, Sydney, Australia, 2014;pp. 75-88. custom:[[[-]]] - 9 A. Crainiceanu, "Bloofi: a hierarchical Bloom filter index with applications to distributed data provenance," in
*Proceedings of the 2nd International Workshop on Cloud Intelligence*, Trento, Italy, 2013;custom:[[[-]]] - 10 J. Lu, T. Yang, Y. Wang, H. Dai, L. Jin, H. Song, B. Liu, "One-hashing bloom filter," in
*Proceedings of 2015 IEEE 23rd International Symposium on Quality of Service (IWQoS)*, Portland, OR, 2015;pp. 289-298. custom:[[[-]]] - 11 T. Yang, A. X. Liu, M. Shahzad, Y. Zhong, Q. Fu, Z. Li, G. Xie, X. Li, "A shifting bloom filter framework for set queries,"
*Proceedings of the VLDB Endowment*, vol. 9, no. 5, pp. 408-419, 2016.doi:[[[10.14778/2876473.2876476]]] - 12 Y. Liu, X. Ge, D. H. C. Du, X. Huang, "Par-BF: a parallel partitioned Bloom filter for dynamic data sets," in
*The International Journal of High Performance Computing Applications*, 2016;vol. 30, no. 3, pp. 259-275. custom:[[[-]]] - 13 S. Tarkoma, C. E. Rothenberg, E. Lagerspetz, "Theory and practice of bloom filters for distributed systems,"
*IEEE Communications Surveys & Tutorials*, vol. 14, no. 1, pp. 131-155, 2011.doi:[[[10.1109/SURV.2011.031611.00024]]] - 14 M. Zhong, P. Lu, K. Shen, J. Seiferas, "Optimizing data popularity conscious bloom filters," in
*Proceedings of the 27th ACM Symposium on Principles of Distributed Computing*, Toronto, Canada, 2008;pp. 355-364. custom:[[[-]]] - 15 J. Bruck, J. Gao, A. Jiang, "Weighted bloom filter," in
*Proceedings of 2006 IEEE International Symposium on Information Theory*, Seattle, WA, 2006;pp. 2304-2308. custom:[[[-]]] - 16 B. Donnet, B. Baynat, T. Friedman, "Retouched bloom filters: allowing networked applications to trade off selected false positives against false negatives," in
*Proceedings of the 2006 ACM CoNEXT Conference*, Lisbon, Portugal, 2006;custom:[[[-]]] - 17 C. Kim, M. Caesar, A. Gerber, J. Rexford,
*in Passive and Active Network Measurement*, Heidelberg: Springer, pp. 3-12, 2009.custom:[[[-]]] - 18
*COIN-OR Interior Point Optimizer (IPOPT) (Online). Available:*, https://github.com/coin-or/Ipopt - 19 D. Wagner, "A generalized birthday problem,"
*in Advances in Cryptology-CRYPTO 2002. Heidelberg: Springer*, pp. 288-304, 2002.doi:[[[10.1137/1033051]]]