Overview and Python implementation of Bloom filters

Bloom Filter

Bloom filter is a probability space efficient data structure. It is very similar to a hashmap and is used to retrieve whether an element is in a set or not. It makes a good trade-off between space utilization and false alarm ratio when retrieving whether an element exists or not. It is due to this property that it is called probabilistic data structure (PDS).

spatial efficiency

Let's take a closer look at how space efficient it is. If you want to store a series of elements in a collection, there are many different ways to do it. You can store the data in a hashmap and subsequently retrieve whether the element exists or not in the hashmap. hashmap is very efficient for both insertion and querying. However, since the hashmap stores the content directly, it is not very space efficient.

If we wish to improve space utilization, we can do a hash transformation before the element is inserted into the collection. What about other methods? We could use an array of bits to store the hash values of the elements. Any more? Any more? We also allow hash conflicts in the bit array. This is exactly how Bloom filters work, they are based on bit arrays that allow hash conflicts, which may cause some false positives. Hash conflicts are allowed at the design stage of Bloom filters, otherwise the space usage wouldn't be compact enough.

When using either lists or collections, where space efficiency is important and significant, then Bloom filters should be considered.

Bloom Filter Basics

Bloom filters areNA bit array of bits whereNis the size of the bit array. It also has another parameterk, indicates the number of hash functions used. These hash functions are used to set the value of the bit array. When inserting elements into the filterxwhenh1(x), h2(x), ..., hk(x)The value of the corresponding index position is set to "1", and the index value is calculated by each hash function. Note that if we increase the number of hash functions, the probability of false positives tends to 0. However, the time overhead for insertion and lookup is greater and the capacity of the Bloom filter is reduced.

In order to check for the existence of an element using the Bloom filter, we need to check if all positions are set to "1", very similar to the process we use to insert an element. If all positions are set to "1", it means that the elementeasilyExists in the Bloom filter. If a position is not set to "1", then the element must not exist.

bamboo strips used for writing (old)The python implementation of the single

To implement a simple Bloom filter, we can do this:

from bitarray import bitarray
# 3rd party
import mmh3
class BloomFilter(set):
 def __init__(self, size, hash_count):
  super(BloomFilter, self).__init__()
  self.bit_array = bitarray(size)
  self.bit_array.setall(0)
   = size
  self.hash_count = hash_count
 def __len__(self):
  return 
 def __iter__(self):
  return iter(self.bit_array)
 def add(self, item):
  for ii in range(self.hash_count):
   index = (item, ii) % 
   self.bit_array[index] = 1
  return self
 def __contains__(self, item):
  out = True
  for ii in range(self.hash_count):
   index = (item, ii) % 
   if self.bit_array[index] == 0:
    out = False
  return out
def main():
 bloom = BloomFilter(100, 10)
 animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
    'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
    'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
 # First insertion of animals into the bloom filter
 for animal in animals:
  (animal)
 # Membership existence for already inserted animals
 # There should not be any false negatives
 for animal in animals:
  if animal in bloom:
   print('{} is in bloom filter as expected'.format(animal))
  else:
   print('Something is terribly went wrong for {}'.format(animal))
   print('FALSE NEGATIVE!')
 # Membership existence for not inserted animals
 # There could be false positives
 other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
      'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
      'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
      'hawk' ]
 for other_animal in other_animals:
  if other_animal in bloom:
   print('{} is not in the bloom, but a false positive'.format(other_animal))
  else:
   print('{} is not in the bloom filter as expected'.format(other_animal))
if __name__ == '__main__':
 main()

The output is shown below:

dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected


badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom, but a false positive
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom, but a false positive
fish is not in the bloom, but a false positive
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloommisinformation filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom, but a false positive
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected

As can be seen from the output, there are a number of false positive samples, but there are no false negatives.

Unlike this Bloom Filter implementation code, multiple implementations in other languages do not provide parameters for hash functions. This is because the false alarm ratio is a more important metric than the hash function in practical applications, and the user can adjust the number of hash functions according to the needs of the false alarm ratio. In general.sizecap (a poem)error_rateis the true percentage of false positives for the Bloom filter. If you reduce the initialization phase oferror_rate, they adjust the number of hash functions.

misinformation

Bloom filters are able to say "definitely does not exist" for a certain element, but for some elements they say "probably exists". Depending on the application, this can be a huge drawback or a non-issue. If you don't mind introducing false positives when retrieving the existence of an element, then you should consider using a Bloom filter.

In addition, if the false alarm rate is arbitrarily reduced, the number of hash functions will have to increase accordingly, and the latency during insertion and querying will increase accordingly. Another point of this section is that if the hash functions are independent of each other and the input elements are uniformly distributed in space, then the true false alarm rate will theoretically not exceed the theoretical value. Otherwise, due to the correlation of the hash functions and more frequent hash conflicts, the true false alarm percentage of the Bloom filter will be higher than the theoretical value.

The potential impact of false positives needs to be considered when using Bloom filters.

deterministic

When you use the same size and number of hash functions, the result of whether an element gets positive or negative feedback through the Bloom filter is determined. For a given elementxIf it's nowprobableThe state of affairs five minutes later, or an hour later, or a day later, or even a week later, are allprobable. I was a little surprised when I learned about this feature. Because the Bloom filter isprobabilisticof that, then there should clearly be some sort of random factor in the results, shouldn't there? Indeed there isn't. It'sprobabilisticis reflected in our inability to determine exactly which elements are in a state ofprobable。

In other words, once the filter has madeprobableThe conclusion will not change after the conclusion of the

drawbacks

Bloom filters are not perfect.

Bloom Filter Capacity

Bloom filters require prior knowledge of the number of elements to be inserted. This is not good if you don't know or have difficulty estimating the number of elements. You could also randomly specify a very large volume, but that would waste a lot of storage space, and storage space is the number one thing we are trying to optimize for, and one of the reasons why we chose to use a Bloom filter. One solution would be to create a Bloom filter that dynamically adapts to the amount of data, but this is not effective in some application scenarios. There is a scalable Bloom filter that is able to adjust its capacity to accommodate different numbers of elements. It makes up for some of the shortcomings.

Construction and retrieval of Bloom filters

When using Bloom filters, we have to accept not only a small amount of false positives, but also an additional time overhead in terms of speed. Compared to hashmap, there is bound to be some additional time overhead in doing hash mapping of elements and building Bloom filters.

Cannot return the element itself

Bloom filters do not save the contents of the inserted element, they can only retrieve whether an element exists or not, because of the existence of hash functions and hash conflicts we can not get a complete list of elements. This is its most significant advantage over other data structures, and space utilization contributes to this shortcoming.

Delete an element

Trying to remove an element from a Bloom filter is not an easy task, and you can't undo an insertion because the hash results of different items can be indexed in the same position. If you want to undo an insertion, you can only keep track of the number of times each indexed position has been set, or create it again. Both methods have additional overhead. Based on different application scenarios, if we want to remove some elements, we prefer to rebuild the Bloom filter.

Implementation in different languages

In a product, you certainly don't want to implement a Bloom filter yourself. There are two reasons for this, one of which is that a well-chosen hash function and implementation can effectively improve the distribution of error rates. Second, it needs to pass real-world testing, with both error rates and capacity sizes standing up to real-world scrutiny. Various languages have open source implementations, in my own experience, the following and Python version of the implementation works very well:

NodePython

There's a faster version.pybloomfilter(both insertion and retrieval are 10x faster than the python library above), but it needs to run in a PyPy environment and does not support Python3.

summarize

The above is a small introduction to the Bloom filter overview and Python implementation method, I hope to help you, if you have any questions welcome to leave me a message, I will reply to you in a timely manner!