Commit 5792172

Better documentation.
1 parent 094aacf commit 5792172

2 files changed: 35 additions, 17 deletions

README.md

Lines changed: 28 additions & 17 deletions
@@ -5,20 +5,31 @@ Bloom filters
 [![Go Report Card](https://goreportcard.com/badge/github.com/willf/bloom)](https://goreportcard.com/report/github.com/willf/bloom)
 [![GoDoc](https://godoc.org/github.com/bits-and-blooms/bloom?status.svg)](http://godoc.org/github.com/bits-and-blooms/bloom)

-A Bloom filter is a representation of a set of _n_ items, where the main
+A Bloom filter is a concise/compressed representation of a set, where the main
 requirement is to make membership queries; _i.e._, whether an item is a
-member of a set.
+member of a set. A Bloom filter will always correctly report the presence
+of an element in the set when the element is indeed present. A Bloom filter
+can use much less storage than the original set, but it allows for some 'false positives':
+it may sometimes report that an element is in the set whereas it is not.

-A Bloom filter has two parameters: _m_, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and _k_, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a [BitSet](https://github.com/bits-and-blooms/bitset); a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is done by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.
+When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The
+lower the false-positive rate, the more memory you are going to require. Similarly, the higher the
+capacity, the more memory you will use.
+You may construct the Bloom filter capable of receiving 1 million elements with a false-positive
+rate of 1% in the following manner.

-In this implementation, the hashing functions used is [murmurhash](github.com/spaolacci/murmur3), a non-cryptographic hashing function.
+```Go
+filter := bloom.NewWithEstimates(1000000, 0.01)
+```

-This implementation accepts keys for setting and testing as `[]byte`. Thus, to
+You should call `NewWithEstimates` conservatively: if you specify a number of elements that it is
+too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure:
+you must know ahead of time what your desired capacity is.
+
+Our implementation accepts keys for setting and testing as `[]byte`. Thus, to
 add a string item, `"Love"`:

 ```Go
-n := uint(1000)
-filter := bloom.New(20*n, 5) // load of 20, 5 keys
 filter.Add([]byte("Love"))
 ```

@@ -37,16 +48,6 @@ For numerical data, we recommend that you look into the encoding/binary library.
 filter.Add(n1)
 ```

-Finally, there is a method to estimate the false positive rate of a particular
-bloom filter for a set of size _n_:
-
-```Go
-if filter.EstimateFalsePositiveRate(1000) > 0.001
-```
-
-Given the particular hashing scheme, it's best to be empirical about this. Note
-that estimating the FP rate will clear the Bloom filter.
-
 Discussion here: [Bloom filter](https://groups.google.com/d/topic/golang-nuts/6MktecKi1bE/discussion)

 Godoc documentation: https://godoc.org/github.com/bits-and-blooms/bloom
@@ -74,3 +75,13 @@ Before committing the code, please check if it passes all tests using (note: thi
 make deps
 make qa
 ```
+
+## Design
+
+A Bloom filter has two parameters: _m_, the number of bits used in storage, and _k_, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a [BitSet](https://github.com/willf/bitset); a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is done by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.
+
+In this implementation, the hashing functions used is [murmurhash](github.com/spaolacci/murmur3), a non-cryptographic hashing function.
+
+
+Given the particular hashing scheme, it's best to be empirical about this. Note
+that estimating the FP rate will clear the Bloom filter.
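
Note on the new "Design" text: it says the art is to choose _k_ and _m_ correctly, and the new introduction derives both from a capacity and a target false-positive rate. The sketch below shows the textbook sizing formulas, m = ceil(-n ln p / (ln 2)^2) and k = ceil((m / n) ln 2). The helper name `sizeForEstimates` is hypothetical, and whether `NewWithEstimates` uses exactly these formulas and this rounding is an assumption that should be checked against the library source.

```go
package main

import (
	"fmt"
	"math"
)

// sizeForEstimates is a hypothetical helper (not part of the library).
// It applies the textbook Bloom filter sizing formulas for n expected
// elements and a target false-positive rate p:
//
//	m = ceil(-n * ln(p) / (ln 2)^2)   bits of storage
//	k = ceil((m / n) * ln 2)          hash functions
//
// Rounding k up is one possible choice; rounding to nearest is another.
func sizeForEstimates(n uint, p float64) (m uint, k uint) {
	m = uint(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
	k = uint(math.Ceil(math.Ln2 * float64(m) / float64(n)))
	return m, k
}

func main() {
	// Same inputs as the README example: 1 million elements, 1% rate.
	n := uint(1000000)
	m, k := sizeForEstimates(n, 0.01)
	fmt.Printf("m = %d bits (%.2f bits per element), k = %d\n",
		m, float64(m)/float64(n), k)
}
```

Under these formulas, a 1% target costs a little under 10 bits per element regardless of element size, which is why the README can say that lowering the rate or raising the capacity both increase memory use.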

murmur_test.go

Lines changed: 7 additions & 0 deletions
@@ -30,6 +30,13 @@ func TestHashBasic(t *testing.T) {
 	}
 }

+func TestDocumentation(t *testing.T) {
+	filter := NewWithEstimates(1000, 0.01)
+	if filter.EstimateFalsePositiveRate(1000) > 0.0101 {
+		t.Errorf("Bad false positive rate")
+	}
+}
+
 // We want to preserve backward compatibility
 func TestHashRandom(t *testing.T) {
 	max_length := 1000
