What Are Probabilistic Data Structures?
When dealing with large-scale data, storing every element exactly requires enormous amounts of memory. Probabilistic data structures are a family of data structures that achieve dramatic memory savings by tolerating a small margin of error.
The fundamental trade-off of probabilistic data structures is as follows:
| Aspect | Traditional Data Structures | Probabilistic Data Structures |
|---|---|---|
| Accuracy | 100% accurate | Allows small errors |
| Memory usage | Proportional to data size | Extremely small relative to data |
| Query speed | \(O(1)\) to \(O(n)\) | \(O(k)\) (number of hashes) |
| Element removal | Possible | Generally not possible |
This article covers two representative probabilistic data structures: Bloom Filter (set membership testing) and HyperLogLog (cardinality estimation), along with theoretical background and Python implementations from scratch.
Bloom Filter
A Bloom Filter is a probabilistic data structure for quickly determining whether an element belongs to a set. It was proposed by Burton Howard Bloom in 1970.
How It Works
A Bloom Filter consists of two components:
- Bit array: An array of size \(m\) (initialized to all zeros)
- Hash functions: \(k\) independent hash functions \(h_1, h_2, \dots, h_k\) (each returning a value in \(\{0, 1, \dots, m-1\}\))
Insert operation (add): When adding element \(x\), apply all \(k\) hash functions and set the corresponding bits to 1.
| Operation | Bit array state |
|---|---|
| Initial state | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| Add “apple” | [0, 1, 0, 0, 1, 0, 0, 1, 0, 0] (positions 1,4,7) |
| Add “banana” | [0, 1, 1, 0, 1, 0, 1, 1, 0, 0] (positions 2,6 added) |
Query operation (contains): To check whether element \(x\) is in the set, examine the bits at all \(k\) hash positions.
- All bits are 1 → “Probably in the set” (false positive possible)
- Any bit is 0 → “Definitely not in the set” (no false negatives)
This is the key property of a Bloom Filter: false negatives never occur, but false positives can occur.
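The add/contains mechanics above can be sketched in a few lines. This toy uses a 10-slot bit array and two illustrative "hash functions" derived from one SHA-256 digest, so the bit positions will differ from the table above:

```python
import hashlib

M = 10  # toy bit array size
bits = [0] * M

def positions(item: str) -> list[int]:
    # Two cheap hash functions derived from one digest (illustration only)
    d = hashlib.sha256(item.encode()).digest()
    return [d[0] % M, d[1] % M]

def add(item: str) -> None:
    for p in positions(item):
        bits[p] = 1

def contains(item: str) -> bool:
    return all(bits[p] == 1 for p in positions(item))

add("apple")
add("banana")
print(contains("apple"))   # inserted -> always True (no false negatives)
print(contains("durian"))  # not inserted -> usually False; True only on a collision
```

With only 10 bits, collisions (and hence false positives) appear quickly; the full implementation below sizes the array from the target false positive rate instead.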
False Positive Rate Theory
The false positive rate after inserting \(n\) elements is approximated by:
\[P(fp) = \left(1 - e^{-kn/m}\right)^k \tag{1}\]
where \(m\) is the bit array size, \(k\) is the number of hash functions, and \(n\) is the number of inserted elements.
The optimal number of hash functions that minimizes the false positive rate is:
\[k_{\text{opt}} = \frac{m}{n} \ln 2 \tag{2}\]
At this optimum, the false positive rate simplifies to:
\[P(fp)_{\min} = \left(\frac{1}{2}\right)^{k_{\text{opt}}} = (0.6185)^{m/n} \tag{3}\]
Python Implementation
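Before the full class, a quick numeric check of equations (2) and (3), using n = 10,000 and p = 0.01 (the same parameters as the demo below):

```python
import math

n, p = 10_000, 0.01
# Optimal bit array size: m = -n * ln(p) / (ln 2)^2
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
# Optimal hash count, eq. (2): k = (m/n) * ln 2
k = math.ceil(m / n * math.log(2))
print(m, k)  # 95851 7
# False positive rate at the optimum, eq. (3)
print(0.6185 ** (m / n))  # ~0.01, as requested
```

These match the bit array size (95,851 bits) and hash count (7) reported in the example output further down.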
```python
import hashlib
import math


class BloomFilter:
    """Bloom Filter implementation from scratch."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        """
        :param expected_items: Expected number of items to insert
        :param fp_rate: Acceptable false positive rate (default 1%)
        """
        # Optimal bit array size: m = -n*ln(p) / (ln2)^2
        self.size = self._optimal_size(expected_items, fp_rate)
        # Optimal number of hash functions: k = (m/n) * ln2
        self.hash_count = self._optimal_hash_count(self.size, expected_items)
        self.bit_array = [0] * self.size
        self.item_count = 0

    @staticmethod
    def _optimal_size(n: int, p: float) -> int:
        """Calculate the optimal bit array size."""
        m = -n * math.log(p) / (math.log(2) ** 2)
        return int(math.ceil(m))

    @staticmethod
    def _optimal_hash_count(m: int, n: int) -> int:
        """Calculate the optimal number of hash functions."""
        k = (m / n) * math.log(2)
        return int(math.ceil(k))

    def _hashes(self, item: str) -> list[int]:
        """Generate k hash values using double hashing."""
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str) -> None:
        """Add an element to the filter."""
        for pos in self._hashes(item):
            self.bit_array[pos] = 1
        self.item_count += 1

    def contains(self, item: str) -> bool:
        """Check if an element is in the filter (may have false positives)."""
        return all(self.bit_array[pos] == 1 for pos in self._hashes(item))

    def false_positive_rate(self) -> float:
        """Calculate the current theoretical false positive rate."""
        n = self.item_count
        m = self.size
        k = self.hash_count
        return (1 - math.exp(-k * n / m)) ** k

    def memory_bytes(self) -> int:
        """Approximate memory usage in bytes."""
        return self.size // 8 + 1


# --- Example usage ---
if __name__ == "__main__":
    bf = BloomFilter(expected_items=10000, fp_rate=0.01)
    print(f"Bit array size: {bf.size:,} bits ({bf.memory_bytes():,} bytes)")
    print(f"Number of hash functions: {bf.hash_count}")

    # Add a word list
    words = [f"word_{i}" for i in range(10000)]
    for w in words:
        bf.add(w)

    # Query inserted elements (no false negatives)
    fn_count = sum(1 for w in words if not bf.contains(w))
    print(f"\nFalse negatives: {fn_count} (always 0)")

    # Query non-inserted elements (measure false positives)
    test_words = [f"test_{i}" for i in range(10000)]
    fp_count = sum(1 for w in test_words if bf.contains(w))
    print(f"False positives: {fp_count} / {len(test_words)}")
    print(f"Measured FP rate: {fp_count / len(test_words):.4f}")
    print(f"Theoretical FP rate: {bf.false_positive_rate():.4f}")

    # Memory comparison
    import sys

    actual_set = set(words)
    set_size = sys.getsizeof(actual_set)
    print(f"\n--- Memory comparison ---")
    print(f"Bloom Filter: {bf.memory_bytes():,} bytes")
    print(f"Python set: {set_size:,} bytes")
    print(f"Reduction: {(1 - bf.memory_bytes() / set_size) * 100:.1f}%")
```
Example Output
```text
Bit array size: 95,851 bits (11,982 bytes)
Number of hash functions: 7

False negatives: 0 (always 0)
False positives: 107 / 10000
Measured FP rate: 0.0107
Theoretical FP rate: 0.0101

--- Memory comparison ---
Bloom Filter: 11,982 bytes
Python set: 524,504 bytes
Reduction: 97.7%
```
When storing 10,000 elements, the Bloom Filter uses approximately 97% less memory than a Python set.
Practical Applications
| Use case | Description |
|---|---|
| Web crawlers | URL deduplication (used in Google Bigtable) |
| Databases (LSM-Tree) | Avoiding disk reads for non-existent keys (LevelDB, RocksDB) |
| Spell checkers | Fast detection of words not in the dictionary |
| Network routers | Packet filtering, cache lookup |
| CDN / Caching | Pre-checking whether content exists in cache |
HyperLogLog
HyperLogLog is an algorithm for estimating the cardinality (number of unique elements) of a set using extremely little memory. It was proposed by Flajolet et al. in 2007.
How It Works
The core idea of HyperLogLog is to estimate the number of unique elements by observing the maximum number of leading zero bits in hash values.
Intuitive understanding: Imagine flipping a coin repeatedly and recording how many flips until you get heads. The probability of getting heads on the first flip is \(1/2\), of getting tails twice before heads is \(1/4\), and of getting \(r\) consecutive tails is \(1/2^r\). If across many trials you observe a maximum of \(r\) consecutive tails, you can estimate there were approximately \(2^r\) trials.
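The same effect can be observed with hash bits instead of coins (a small illustrative experiment, not part of HyperLogLog proper): over n distinct items, the longest run of leading zero bits in their 64-bit hashes is close to \(\log_2 n\).

```python
import hashlib

def leading_zero_run(item: str, bits: int = 64) -> int:
    # Take the first 64 bits of SHA-256 as the hash value
    h = int(hashlib.sha256(item.encode()).hexdigest()[:16], 16)
    return bits - h.bit_length()  # leading zeros in a 64-bit value

n = 10_000
max_run = max(leading_zero_run(f"item_{i}") for i in range(n))
print(max_run)       # typically near log2(10000) ≈ 13.3
print(2 ** max_run)  # a rough, high-variance estimate of n
```

A single maximum is a very noisy estimator, which is exactly why HyperLogLog averages over many registers, as described next.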
Improvement over LogLog: Since a single register has high variance, HyperLogLog uses the first \(p\) bits of each hash value to distribute elements across \(m = 2^p\) registers (stochastic averaging), improving estimation accuracy.
Each register \(M[j]\) stores the maximum number of leading zero bits + 1 observed among all elements assigned to that register.
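As a concrete toy example of this bit split, consider a 16-bit hash with \(p = 4\) (values chosen for illustration only; the implementation below uses 64-bit hashes):

```python
p = 4
h = 0b1011_0000_0101_1010               # toy 16-bit hash value
j = h >> (16 - p)                        # first p bits -> register index
remaining = h & ((1 << (16 - p)) - 1)    # remaining 12 bits
rank = (16 - p) - remaining.bit_length() + 1  # leading zeros + 1
print(j, rank)  # 11 6
```

Here the remaining bits are `000001011010`: five leading zeros, so the rank is 6, and register \(M[11]\) is updated to \(\max(M[11], 6)\).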
The estimation formula is:
\[E = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1} \tag{4}\]
where \(\alpha_m\) is a bias correction factor defined as:
\[\alpha_m = \left(m \int_0^{\infty} \left(\log_2 \left(\frac{2+u}{1+u}\right)\right)^m \, du\right)^{-1} \tag{5}\]
In practice, the following approximations are commonly used:
| \(m\) | \(\alpha_m\) |
|---|---|
| 16 | 0.673 |
| 32 | 0.697 |
| 64 | 0.709 |
| \(\geq 128\) | \(0.7213 / (1 + 1.079 / m)\) |
The standard error of HyperLogLog is:
\[\sigma = \frac{1.04}{\sqrt{m}} \tag{6}\]
With \(m = 2^{14} = 16384\) registers, the standard error is approximately 0.81%.
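Equation (6) makes the accuracy/memory trade-off easy to tabulate (memory assumes 6 bits per register, the same assumption as the implementation below):

```python
import math

for p in (10, 12, 14, 16):
    m = 1 << p                  # number of registers
    err = 1.04 / math.sqrt(m)   # standard error, eq. (6)
    mem = m * 6 // 8            # bytes at 6 bits per register
    print(f"p={p}: m={m:>6}, std err={err * 100:.2f}%, memory={mem:,} bytes")
```

Doubling the register count costs twice the memory but shrinks the standard error only by a factor of \(\sqrt{2}\), which is why \(p = 14\) (about 12 KB) is a common default.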
Python Implementation
```python
import hashlib
import math


class HyperLogLog:
    """HyperLogLog implementation from scratch."""

    def __init__(self, precision: int = 14):
        """
        :param precision: Precision parameter p that determines register count.
            Register count m = 2^p (default p=14 -> 16384 registers)
        """
        self.p = precision
        self.m = 1 << precision  # 2^p
        self.registers = [0] * self.m
        self.alpha = self._compute_alpha(self.m)

    @staticmethod
    def _compute_alpha(m: int) -> float:
        """Compute the bias correction factor."""
        if m == 16:
            return 0.673
        elif m == 32:
            return 0.697
        elif m == 64:
            return 0.709
        else:
            return 0.7213 / (1 + 1.079 / m)

    def _hash(self, item: str) -> int:
        """Convert an element to a 64-bit hash value."""
        h = hashlib.sha256(item.encode()).hexdigest()
        return int(h[:16], 16)  # Use the first 64 bits

    @staticmethod
    def _leading_zeros(value: int, max_bits: int) -> int:
        """Return the number of leading zero bits + 1."""
        if value == 0:
            return max_bits + 1
        count = 1
        for i in range(max_bits - 1, -1, -1):
            if value & (1 << i):
                break
            count += 1
        return count

    def add(self, item: str) -> None:
        """Add an element."""
        h = self._hash(item)
        # Use the first p bits to determine register index
        j = h >> (64 - self.p)
        # Compute leading zeros from the remaining bits
        remaining = h & ((1 << (64 - self.p)) - 1)
        rank = self._leading_zeros(remaining, 64 - self.p)
        self.registers[j] = max(self.registers[j], rank)

    def count(self) -> int:
        """Estimate the cardinality (number of unique elements)."""
        # Harmonic mean-based estimation
        indicator = sum(2.0 ** (-r) for r in self.registers)
        estimate = self.alpha * self.m * self.m / indicator

        # Small range correction (Linear Counting)
        if estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros > 0:
                estimate = self.m * math.log(self.m / zeros)
        # Large range correction
        if estimate > (1 << 32) / 30.0:
            estimate = -(1 << 32) * math.log(1 - estimate / (1 << 32))
        return int(estimate)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Merge two HyperLogLog instances."""
        if self.p != other.p:
            raise ValueError("Precision parameters must match")
        merged = HyperLogLog(self.p)
        merged.registers = [
            max(a, b) for a, b in zip(self.registers, other.registers)
        ]
        return merged

    def standard_error(self) -> float:
        """Return the standard error."""
        return 1.04 / math.sqrt(self.m)

    def memory_bytes(self) -> int:
        """Approximate memory usage in bytes."""
        # Each register needs at most 6 bits (max value is 64-p+1)
        return self.m * 6 // 8


# --- Example usage ---
if __name__ == "__main__":
    hll = HyperLogLog(precision=14)
    true_count = 1_000_000
    for i in range(true_count):
        hll.add(f"user_{i}")

    estimated = hll.count()
    error = abs(estimated - true_count) / true_count * 100
    print(f"True unique count: {true_count:,}")
    print(f"Estimated unique count: {estimated:,}")
    print(f"Error: {error:.2f}%")
    print(f"Theoretical standard error: {hll.standard_error() * 100:.2f}%")

    # Memory comparison
    import sys

    actual_set = {f"user_{i}" for i in range(true_count)}
    set_size = sys.getsizeof(actual_set)
    hll_size = hll.memory_bytes()
    print(f"\n--- Memory comparison ---")
    print(f"HyperLogLog: {hll_size:,} bytes")
    print(f"Python set: {set_size:,} bytes")
    print(f"Reduction: {(1 - hll_size / set_size) * 100:.1f}%")

    # Merge demo
    hll_a = HyperLogLog(precision=14)
    hll_b = HyperLogLog(precision=14)
    for i in range(500_000):
        hll_a.add(f"user_{i}")
    for i in range(300_000, 800_000):
        hll_b.add(f"user_{i}")
    merged = hll_a.merge(hll_b)
    print(f"\n--- Merge demo ---")
    print(f"HLL_A unique count: {hll_a.count():,} (true: 500,000)")
    print(f"HLL_B unique count: {hll_b.count():,} (true: 500,000)")
    print(f"Merged unique count: {merged.count():,} (true: 800,000)")
```
Example Output
```text
True unique count: 1,000,000
Estimated unique count: 1,007,822
Error: 0.78%
Theoretical standard error: 0.81%

--- Memory comparison ---
HyperLogLog: 12,288 bytes
Python set: 33,554,656 bytes
Reduction: 99.96%

--- Merge demo ---
HLL_A unique count: 498,653 (true: 500,000)
HLL_B unique count: 501,247 (true: 500,000)
Merged unique count: 802,134 (true: 800,000)
```
With 1 million unique elements, HyperLogLog achieves less than 1% error using only about 12 KB of memory – a 99.96% reduction compared to a Python set.
Practical Applications
| Use case | Description |
|---|---|
| Redis PFCOUNT | Built-in HyperLogLog support, cardinality estimation in 12 KB |
| Web analytics | Real-time unique visitor counting |
| Network monitoring | Estimating unique IP addresses in network flows |
| Database query optimization | Fast approximation of COUNT(DISTINCT ...) |
| Distributed systems | Mergeable across nodes for independent computation and aggregation |
Comparison Table
| Feature | Bloom Filter | HyperLogLog | Python set |
|---|---|---|---|
| Purpose | Set membership testing | Cardinality estimation | General-purpose set ops |
| Memory (100K elements) | ~120 KB | ~12 KB | ~4 MB |
| Accuracy | False positives (rate tunable) | Std. error 0.81% (\(p=14\)) | 100% accurate |
| Insert | \(O(k)\) | \(O(1)\) | Amortized \(O(1)\) |
| Query | \(O(k)\) “Is it in the set?” | \(O(m)\) “How many unique?” | \(O(1)\) various ops |
| Deletion | Not possible (Counting BF can) | Not possible | Possible |
| Merge | OR operation | Max operation | Union |
| False negatives | None | N/A | N/A |
Probabilistic data structures provide dramatic improvements in memory efficiency and speed when 100% accuracy is not strictly required. Choosing the right data structure for the task at hand is key.
Related Articles
- Fast Modular Exponentiation in Python Using Repeated Squaring
- Fundamentals of Genetic Algorithms (GA) with Python Implementation
- Cross-Entropy Method: A Practical Monte Carlo Optimization Technique
References
- Bloom, B. H. (1970). “Space/time trade-offs in hash coding with allowable errors.” Communications of the ACM, 13(7), 422-426.
- Flajolet, P., Fusy, E., Gandouet, O., & Meunier, F. (2007). “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.” Discrete Mathematics and Theoretical Computer Science, AH, 137-156.
- Broder, A., & Mitzenmacher, M. (2004). “Network applications of Bloom filters: A survey.” Internet Mathematics, 1(4), 485-509.