A Bloom filter is a representation of a set of _n_ items, where the main
8
+
A Bloom filter is a concise/compressed representation of a set, where the main
9
9
requirement is to make membership queries; _i.e._, whether an item is a
10
-
member of a set.
10
+
member of a set. A Bloom filter will always correctly report the presence
11
+
of an element in the set when the element is indeed present. A Bloom filter
12
+
can use much less storage than the original set, but it allows for some 'false positives':
13
+
it may sometimes report that an element is in the set when it is not.
11
14
12
-
A Bloom filter has two parameters: _m_, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and _k_, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a [BitSet](https://github.com/bits-and-blooms/bitset); a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is done by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.
15
+
When you construct a filter, you need to know how many elements you have (the desired capacity) and the false-positive rate you are willing to tolerate. A common false-positive rate is 1%. The
16
+
lower the false-positive rate, the more memory you are going to require. Similarly, the higher the
17
+
capacity, the more memory you will use.
18
+
You may construct a Bloom filter capable of receiving 1 million elements with a false-positive
19
+
rate of 1% in the following manner.
13
20
14
-
In this implementation, the hashing functions used is [murmurhash](github.com/spaolacci/murmur3), a non-cryptographic hashing function.
21
+
```Go
22
+
filter := bloom.NewWithEstimates(1000000, 0.01)
23
+
```
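To get a feel for how capacity and false-positive rate translate into memory, the sketch below applies the textbook Bloom filter sizing formulas, m = ceil(-n ln(p) / (ln 2)^2) and k = ceil((m/n) ln 2). The helper name `estimateParameters` is hypothetical and the rounding is an assumption; the library's own calculation may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// estimateParameters is a hypothetical helper (not the library's code).
// It returns the number of bits m and the number of hash functions k
// suggested by the textbook formulas for n items and a target
// false-positive rate p (0 < p < 1).
func estimateParameters(n uint, p float64) (m uint, k uint) {
	// m = ceil(-n * ln(p) / (ln 2)^2)
	m = uint(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
	// k = ceil(ln(2) * m / n)
	k = uint(math.Ceil(math.Ln2 * float64(m) / float64(n)))
	return m, k
}

func main() {
	m, k := estimateParameters(1000000, 0.01)
	fmt.Printf("bits: %d, bytes: %d, hash functions: %d\n", m, (m+7)/8, k)
}
```

For 1 million elements at a 1% rate, these formulas give roughly 9.6 million bits (about 1.2 MB) and 7 hash functions. Tightening the rate or raising the capacity increases _m_ accordingly.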
15
24
16
-
This implementation accepts keys for setting and testing as `[]byte`. Thus, to
25
+
You should call `NewWithEstimates` conservatively: if you specify a number of elements that is
26
+
too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure:
27
+
you must know ahead of time what your desired capacity is.
28
+
29
+
Our implementation accepts keys for setting and testing as `[]byte`. Thus, to
17
30
add a string item, `"Love"`:
18
31
19
32
```Go
20
-
n:=uint(1000)
21
-
filter:= bloom.New(20*n, 5) // load of 20, 5 keys
22
33
filter.Add([]byte("Love"))
23
34
```
24
35
@@ -37,16 +48,6 @@ For numerical data, we recommend that you look into the encoding/binary library.
37
48
filter.Add(n1)
38
49
```
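As a rough illustration of the `encoding/binary` recommendation, the sketch below turns a `uint32` into a 4-byte key that could then be passed to `filter.Add`. The helper name `uint32Key` and the choice of big-endian byte order are assumptions made for this example; any fixed encoding works as long as keys are added and tested with the same one.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uint32Key is a hypothetical helper that encodes i as a 4-byte
// big-endian slice, suitable for use as a []byte key.
func uint32Key(i uint32) []byte {
	key := make([]byte, 4)
	binary.BigEndian.PutUint32(key, i)
	return key
}

func main() {
	n1 := uint32Key(100)
	// n1 can now be used as a key, e.g. filter.Add(n1).
	fmt.Println(n1)
}
```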
39
50
40
-
Finally, there is a method to estimate the false positive rate of a particular
41
-
bloom filter for a set of size _n_:
42
-
43
-
```Go
44
-
if filter.EstimateFalsePositiveRate(1000) > 0.001
45
-
```
46
-
47
-
Given the particular hashing scheme, it's best to be empirical about this. Note
48
-
that estimating the FP rate will clear the Bloom filter.
@@ -74,3 +75,13 @@ Before committing the code, please check if it passes all tests using (note: thi
74
75
make deps
75
76
make qa
76
77
```
78
+
79
+
## Design
80
+
81
+
A Bloom filter has two parameters: _m_, the number of bits used in storage, and _k_, the number of hashing functions applied to elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation.) A Bloom filter is backed by a [BitSet](https://github.com/willf/bitset); a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is checked by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is reported as being in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.
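The set-and-test mechanism described above can be sketched in a few lines. This is a simplified illustration, not this library's code: the type `sketchFilter` and its helpers are hypothetical, it packs bits into a plain `[]uint64` rather than a BitSet, and it derives the _k_ positions from two FNV-1a hashes (standard library) combined by double hashing instead of murmurhash. It assumes _m_ is greater than zero.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sketchFilter is a hypothetical, simplified Bloom filter.
type sketchFilter struct {
	m    uint64   // number of bits
	k    uint64   // number of hash functions
	bits []uint64 // m bits packed into 64-bit words
}

func newSketchFilter(m, k uint64) *sketchFilter {
	return &sketchFilter{m: m, k: k, bits: make([]uint64, (m+63)/64)}
}

// baseHashes derives two 64-bit values from key using FNV-1a
// (an assumption for this sketch; the library uses murmurhash).
func baseHashes(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // extend the input to obtain a second value
	h2 := h.Sum64() | 1   // force odd so the step is never zero
	return h1, h2
}

// location returns the bit index of the i-th hash function, modulo m.
func (f *sketchFilter) location(h1, h2, i uint64) uint64 {
	return (h1 + i*h2) % f.m
}

// Add sets the k bits that correspond to key.
func (f *sketchFilter) Add(key []byte) {
	h1, h2 := baseHashes(key)
	for i := uint64(0); i < f.k; i++ {
		pos := f.location(h1, h2, i)
		f.bits[pos/64] |= 1 << (pos % 64)
	}
}

// Test reports whether all k bits for key are set.
func (f *sketchFilter) Test(key []byte) bool {
	h1, h2 := baseHashes(key)
	for i := uint64(0); i < f.k; i++ {
		pos := f.location(h1, h2, i)
		if f.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false // a missing bit proves the key was never added
		}
	}
	return true // either the key was added or this is a false positive
}

func main() {
	f := newSketchFilter(20000, 5)
	f.Add([]byte("Love"))
	fmt.Println(f.Test([]byte("Love"))) // an added key is always found
	fmt.Println(f.Test([]byte("Hate"))) // usually false; a false positive is possible
}
```

Because `Add` only ever sets bits, a key that was added can never fail `Test`, which is why the true positive rate is 1.0.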
82
+
83
+
In this implementation, the hashing function used is [murmurhash](https://github.com/spaolacci/murmur3), a non-cryptographic hashing function.
84
+
85
+
86
+
Given the particular hashing scheme, it's best to be empirical about this. Note
87
+
that estimating the FP rate will clear the Bloom filter.
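When measuring empirically, it can help to have a theoretical baseline to compare against. The sketch below computes the standard approximation (1 - e^(-kn/m))^k, which assumes ideal, independent hash functions. The function name `expectedFalsePositiveRate` is hypothetical and is not part of this library; the sample values (20 bits per element, 5 hash functions) are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// expectedFalsePositiveRate is a hypothetical helper that returns the
// textbook approximation (1 - e^(-k*n/m))^k for a filter with m bits and
// k hash functions after n distinct insertions. Real hash functions may
// deviate from this idealized model.
func expectedFalsePositiveRate(m, k, n uint) float64 {
	exponent := -float64(k) * float64(n) / float64(m)
	return math.Pow(1-math.Exp(exponent), float64(k))
}

func main() {
	// 1000 elements, 20 bits per element, 5 hash functions.
	fmt.Printf("%.5f\n", expectedFalsePositiveRate(20000, 5, 1000))
}
```

With these sample values the approximation comes out near 0.0005; a measured rate well above the baseline suggests the hashing scheme or the parameters deserve a second look.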