Python Bloom Filter: Check String Existence in a Sea of Billions Without the Storage Bloat
Assume the following scenario, you are developing a Python application that processes content from text files. The files might contain…
Assume the following scenario, you are developing a Python application that processes content from text files. The files might contain duplicate text and if you want to avoid reprocessing files that have already been processed, let's examine some possible solutions
file comparison
It is a straightforward method but has some important cons
- You need to store all files processed, this means lots of storage!
- Slow, and will get slower as new files accumulated
Hashing comparison
It is another straightforward method but it also has some important cons
- It is CPU-consuming, and CPU cost increases as file size increases.
- The md5 hash output is always 16 Bytes despite that the input file might be smaller than that.
Would be nice if there was a mechanism that had less CPU usage, was fast, and did not require lots of storage? well, there is it called Bloom filter! The Bloom filter works in a very simple manner.
The Bloom filter
A Bloom filter is a smart and memory-efficient way to check if something is part of a set, however, there is some chance of making false positive mistakes, the chance of false positives can be minimized (but cannot never be zero) by configuring the error rate percentage in the cost of CPU and Memory usage.
The library we are going to use is rbloom, it is written in Rust so it should be super fast! To install it with pip enter
$ pip3 install rbloomCreating the Bloom filter
Create the following file as bloom.py , make it executable with chmod +x bloom.py
#!/usr/bin/env python3
from rbloom import Bloom
num_of_elements = 1000
error_rate = 0.001
bf = Bloom(num_of_elements,error_rate)
bf.add("user1")
bf.add("user2")
if "user1" in bf:
print("user1 exists")
else:
print("user1 does not exist")
if "user3" in bf:
print("user3 exists")
else:
print("user3 does not exist")Let's break down what it does, the first four lines import the rbloom library, fourth line creates a Bloom filter named bf with a number of elements equal to 1000 and an error rate of 1/1000. All values of the bit array created after the initialization are set to 0.
from rbloom import Bloom
num_of_elements = 1000
error_rate = 0.001
bf = Bloom(num_of_elements,error_rate)Next, we add two elements to the Bloom filter using the add function, when we add an element to the filter, the actual element is not stored, but instead, some hashing calculations take place that provide multiple positions in the bit array which is initialized when we define the number of elements.
bf.add("user1")
bf.add("user2")Next, the last lines show how we can check if an element exists on the filter, quite self self-explanatory example!
if "user1" in bf:
print("user1 exists")
else:
print("user1 does not exist")
if "user3" in bf:
print("user3 exists")
else:
print("user3 does not exist")Things to note
Deletion is not supported due to the very nature of the Bloom filter, this is because elements might share bits in the array, so un-setting those bits will also cause problems with other elements.
Merging filters works by doing bitwise OR with the bits of the two arrays, it will work only if the filters have the same size, the size will remain the same and not expand to double.
bf = bf | Bloom(num_of_elements,error_rate)Conclusion
In this article, we saw how we can use Bloom filters to increase our efficiency in scenarios like processing text files, etc, there are numerous more areas in which Bloom filters can be useful, I hope you enjoyed this article as much as I enjoyed writing it!