Bloom Filter
Bloom filter is a probability space efficient data structure. It is very similar to a hashmap and is used to retrieve whether an element is in a set or not. It makes a good trade-off between space utilization and false alarm ratio when retrieving whether an element exists or not. It is due to this property that it is called probabilistic data structure (PDS).
spatial efficiency
Let's take a closer look at how space efficient it is. If you want to store a series of elements in a collection, there are many different ways to do it. You can store the data in a hashmap and subsequently retrieve whether the element exists or not in the hashmap. hashmap is very efficient for both insertion and querying. However, since the hashmap stores the content directly, it is not very space efficient.
If we wish to improve space utilization, we can do a hash transformation before the element is inserted into the collection. What about other methods? We could use an array of bits to store the hash values of the elements. Any more? Any more? We also allow hash conflicts in the bit array. This is exactly how Bloom filters work, they are based on bit arrays that allow hash conflicts, which may cause some false positives. Hash conflicts are allowed at the design stage of Bloom filters, otherwise the space usage wouldn't be compact enough.
When using either lists or collections, where space efficiency is important and significant, then Bloom filters should be considered.
Bloom Filter Basics
Bloom filters areN
A bit array of bits whereN
is the size of the bit array. It also has another parameterk
, indicates the number of hash functions used. These hash functions are used to set the value of the bit array. When inserting elements into the filterx
whenh1(x)
, h2(x)
, ..., hk(x)
The value of the corresponding index position is set to "1", and the index value is calculated by each hash function. Note that if we increase the number of hash functions, the probability of false positives tends to 0. However, the time overhead for insertion and lookup is greater and the capacity of the Bloom filter is reduced.
In order to check for the existence of an element using the Bloom filter, we need to check if all positions are set to "1", very similar to the process we use to insert an element. If all positions are set to "1", it means that the elementeasily
Exists in the Bloom filter. If a position is not set to "1", then the element must not exist.
bamboo strips used for writing (old)The python implementation of the single
To implement a simple Bloom filter, we can do this:
from bitarray import bitarray # 3rd party import mmh3 class BloomFilter(set): def __init__(self, size, hash_count): super(BloomFilter, self).__init__() self.bit_array = bitarray(size) self.bit_array.setall(0) = size self.hash_count = hash_count def __len__(self): return def __iter__(self): return iter(self.bit_array) def add(self, item): for ii in range(self.hash_count): index = (item, ii) % self.bit_array[index] = 1 return self def __contains__(self, item): out = True for ii in range(self.hash_count): index = (item, ii) % if self.bit_array[index] == 0: out = False return out def main(): bloom = BloomFilter(100, 10) animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle', 'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear', 'chicken', 'dolphin', 'donkey', 'crow', 'crocodile'] # First insertion of animals into the bloom filter for animal in animals: (animal) # Membership existence for already inserted animals # There should not be any false negatives for animal in animals: if animal in bloom: print('{} is in bloom filter as expected'.format(animal)) else: print('Something is terribly went wrong for {}'.format(animal)) print('FALSE NEGATIVE!') # Membership existence for not inserted animals # There could be false positives other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox', 'whale', 'shark', 'fish', 'turkey', 'duck', 'dove', 'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla', 'hawk' ] for other_animal in other_animals: if other_animal in bloom: print('{} is not in the bloom, but a false positive'.format(other_animal)) else: print('{} is not in the bloom filter as expected'.format(other_animal)) if __name__ == '__main__': main()
The output is shown below:
dog is in bloom filter as expected cat is in bloom filter as expected giraffe is in bloom filter as expected fly is in bloom filter as expected mosquito is in bloom filter as expected horse is in bloom filter as expected eagle is in bloom filter as expected bird is in bloom filter as expected bison is in bloom filter as expected boar is in bloom filter as expected butterfly is in bloom filter as expected ant is in bloom filter as expected anaconda is in bloom filter as expected bear is in bloom filter as expected chicken is in bloom filter as expected dolphin is in bloom filter as expected donkey is in bloom filter as expected crow is in bloom filter as expected crocodile is in bloom filter as expected badger is not in the bloom filter as expected cow is not in the bloom filter as expected pig is not in the bloom filter as expected sheep is not in the bloom, but a false positive bee is not in the bloom filter as expected wolf is not in the bloom filter as expected fox is not in the bloom filter as expected whale is not in the bloom filter as expected shark is not in the bloom, but a false positive fish is not in the bloom, but a false positive turkey is not in the bloom filter as expected duck is not in the bloom filter as expected dove is not in the bloommisinformation filter as expected deer is not in the bloom filter as expected elephant is not in the bloom, but a false positive frog is not in the bloom filter as expected falcon is not in the bloom filter as expected goat is not in the bloom filter as expected gorilla is not in the bloom filter as expected hawk is not in the bloom filter as expected
As can be seen from the output, there are a number of false positive samples, but there are no false negatives.
Unlike this Bloom Filter implementation code, multiple implementations in other languages do not provide parameters for hash functions. This is because the false alarm ratio is a more important metric than the hash function in practical applications, and the user can adjust the number of hash functions according to the needs of the false alarm ratio. In general.size
cap (a poem)error_rate
is the true percentage of false positives for the Bloom filter. If you reduce the initialization phase oferror_rate
, they adjust the number of hash functions.
misinformation
Bloom filters are able to say "definitely does not exist" for a certain element, but for some elements they say "probably exists". Depending on the application, this can be a huge drawback or a non-issue. If you don't mind introducing false positives when retrieving the existence of an element, then you should consider using a Bloom filter.
In addition, if the false alarm rate is arbitrarily reduced, the number of hash functions will have to increase accordingly, and the latency during insertion and querying will increase accordingly. Another point of this section is that if the hash functions are independent of each other and the input elements are uniformly distributed in space, then the true false alarm rate will theoretically not exceed the theoretical value. Otherwise, due to the correlation of the hash functions and more frequent hash conflicts, the true false alarm percentage of the Bloom filter will be higher than the theoretical value.
The potential impact of false positives needs to be considered when using Bloom filters.
deterministic
When you use the same size and number of hash functions, the result of whether an element gets positive or negative feedback through the Bloom filter is determined. For a given elementx
If it's nowprobable
The state of affairs five minutes later, or an hour later, or a day later, or even a week later, are allprobable
. I was a little surprised when I learned about this feature. Because the Bloom filter isprobabilistic
of that, then there should clearly be some sort of random factor in the results, shouldn't there? Indeed there isn't. It'sprobabilistic
is reflected in our inability to determine exactly which elements are in a state ofprobable
。
In other words, once the filter has madeprobable
The conclusion will not change after the conclusion of the
drawbacks
Bloom filters are not perfect.
Bloom Filter Capacity
Bloom filters require prior knowledge of the number of elements to be inserted. This is not good if you don't know or have difficulty estimating the number of elements. You could also randomly specify a very large volume, but that would waste a lot of storage space, and storage space is the number one thing we are trying to optimize for, and one of the reasons why we chose to use a Bloom filter. One solution would be to create a Bloom filter that dynamically adapts to the amount of data, but this is not effective in some application scenarios. There is a scalable Bloom filter that is able to adjust its capacity to accommodate different numbers of elements. It makes up for some of the shortcomings.
Construction and retrieval of Bloom filters
When using Bloom filters, we have to accept not only a small amount of false positives, but also an additional time overhead in terms of speed. Compared to hashmap, there is bound to be some additional time overhead in doing hash mapping of elements and building Bloom filters.
Cannot return the element itself
Bloom filters do not save the contents of the inserted element, they can only retrieve whether an element exists or not, because of the existence of hash functions and hash conflicts we can not get a complete list of elements. This is its most significant advantage over other data structures, and space utilization contributes to this shortcoming.
Delete an element
Trying to remove an element from a Bloom filter is not an easy task, and you can't undo an insertion because the hash results of different items can be indexed in the same position. If you want to undo an insertion, you can only keep track of the number of times each indexed position has been set, or create it again. Both methods have additional overhead. Based on different application scenarios, if we want to remove some elements, we prefer to rebuild the Bloom filter.
Implementation in different languages
In a product, you certainly don't want to implement a Bloom filter yourself. There are two reasons for this, one of which is that a well-chosen hash function and implementation can effectively improve the distribution of error rates. Second, it needs to pass real-world testing, with both error rates and capacity sizes standing up to real-world scrutiny. Various languages have open source implementations, in my own experience, the following and Python version of the implementation works very well:
NodePython
There's a faster version.pybloomfilter(both insertion and retrieval are 10x faster than the python library above), but it needs to run in a PyPy environment and does not support Python3.
summarize
The above is a small introduction to the Bloom filter overview and Python implementation method, I hope to help you, if you have any questions welcome to leave me a message, I will reply to you in a timely manner!