Introduction to Probabilistic Data Structures: Bloom Filter and HyperLogLog in Python

A look at probabilistic data structures (Bloom Filter and HyperLogLog), with theory and from-scratch Python implementations.

What Are Probabilistic Data Structures?

When dealing with large-scale data, storing every element exactly requires enormous amounts of memory. Probabilistic data structures are a family of data structures that achieve dramatic memory savings by tolerating a small margin of error.

The fundamental trade-off of probabilistic data structures is as follows:

| Aspect | Traditional data structures | Probabilistic data structures |
|---|---|---|
| Accuracy | 100% accurate | Allows small errors |
| Memory usage | Proportional to data size | Extremely small relative to data |
| Query speed | \(O(1)\) to \(O(n)\) | \(O(k)\) (number of hash functions) |
| Element removal | Possible | Generally not possible |

This article covers two representative probabilistic data structures: Bloom Filter (set membership testing) and HyperLogLog (cardinality estimation), along with theoretical background and Python implementations from scratch.

Bloom Filter

A Bloom Filter is a probabilistic data structure for quickly determining whether an element belongs to a set. It was proposed by Burton Howard Bloom in 1970.

How It Works

A Bloom Filter consists of two components:

  • Bit array: An array of size \(m\) (initialized to all zeros)
  • Hash functions: \(k\) independent hash functions \(h_1, h_2, \dots, h_k\) (each returning a value in \(\{0, 1, \dots, m-1\}\))

Insert operation (add): When adding element \(x\), apply all \(k\) hash functions and set the corresponding bits to 1.

| Operation | Bit array state |
|---|---|
| Initial state | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| Add “apple” | [0, 1, 0, 0, 1, 0, 0, 1, 0, 0] (positions 1, 4, 7 set) |
| Add “banana” | [0, 1, 1, 0, 1, 0, 1, 1, 0, 0] (positions 2, 6 newly set) |

Query operation (contains): To check whether element \(x\) is in the set, examine the bits at all \(k\) hash positions.

  • All bits are 1 → “Probably in the set” (false positive possible)
  • Any bit is 0 → “Definitely not in the set” (no false negatives)

This is the key property of a Bloom Filter: false negatives never occur, but false positives can occur.
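These semantics can be sketched with a toy 10-bit filter. This is a minimal sketch, not the article's full implementation; the bit positions come from real MD5/SHA-256 hashes, so they will not match the illustrative table above:

```python
import hashlib

M, K = 10, 3  # toy bit array size and hash count

def positions(item: str) -> list[int]:
    # Double hashing: combine two base hashes into K positions.
    h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
    h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
    return [(h1 + i * h2) % M for i in range(K)]

bits = [0] * M
for word in ("apple", "banana"):
    for pos in positions(word):
        bits[pos] = 1  # insert: set all K bits

def contains(item: str) -> bool:
    # Query: every bit must be 1; a single 0 bit proves absence.
    return all(bits[p] for p in positions(item))

print(contains("apple"))  # inserted, so always True
print(contains("grape"))  # not inserted: usually False, occasionally a false positive
```

With only 10 bits and two words inserted, collisions are likely, which is exactly why the real implementation sizes the array from the formulas in the next section.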

False Positive Rate Theory

The false positive rate after inserting \(n\) elements is approximated by:

\[P(fp) = \left(1 - e^{-kn/m}\right)^k \tag{1}\]

where \(m\) is the bit array size, \(k\) is the number of hash functions, and \(n\) is the number of inserted elements.

The optimal number of hash functions that minimizes the false positive rate is:

\[k_{\text{opt}} = \frac{m}{n} \ln 2 \tag{2}\]

At this optimum, the false positive rate simplifies to:

\[P(fp)_{\min} = \left(\frac{1}{2}\right)^k = (0.6185)^{m/n} \tag{3}\]
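Formulas (1) and (2) translate directly into a sizing helper. A quick sketch, using \(n = 10{,}000\) and a 1% target to match the example later in the article:

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    # m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2, both rounded up.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = math.ceil((m / n) * math.log(2))
    return m, k

m, k = bloom_parameters(10_000, 0.01)
print(m, k)  # 95851 bits, 7 hash functions
```

These are the same values (95,851 bits, 7 hashes) the full implementation below computes in its constructor.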

Python Implementation

import hashlib
import math


class BloomFilter:
    """Bloom Filter implementation from scratch."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        """
        :param expected_items: Expected number of items to insert
        :param fp_rate: Acceptable false positive rate (default 1%)
        """
        # Optimal bit array size: m = -n*ln(p) / (ln2)^2
        self.size = self._optimal_size(expected_items, fp_rate)
        # Optimal number of hash functions: k = (m/n) * ln2
        self.hash_count = self._optimal_hash_count(self.size, expected_items)
        self.bit_array = [0] * self.size
        self.item_count = 0

    @staticmethod
    def _optimal_size(n: int, p: float) -> int:
        """Calculate the optimal bit array size."""
        m = -n * math.log(p) / (math.log(2) ** 2)
        return int(math.ceil(m))

    @staticmethod
    def _optimal_hash_count(m: int, n: int) -> int:
        """Calculate the optimal number of hash functions."""
        k = (m / n) * math.log(2)
        return int(math.ceil(k))

    def _hashes(self, item: str) -> list[int]:
        """Generate k hash values using double hashing."""
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str) -> None:
        """Add an element to the filter."""
        for pos in self._hashes(item):
            self.bit_array[pos] = 1
        self.item_count += 1

    def contains(self, item: str) -> bool:
        """Check if an element is in the filter (may have false positives)."""
        return all(self.bit_array[pos] == 1 for pos in self._hashes(item))

    def false_positive_rate(self) -> float:
        """Calculate the current theoretical false positive rate."""
        n = self.item_count
        m = self.size
        k = self.hash_count
        return (1 - math.exp(-k * n / m)) ** k

    def memory_bytes(self) -> int:
        """Approximate memory usage in bytes."""
        return self.size // 8 + 1


# --- Example usage ---
if __name__ == "__main__":
    bf = BloomFilter(expected_items=10000, fp_rate=0.01)
    print(f"Bit array size: {bf.size:,} bits ({bf.memory_bytes():,} bytes)")
    print(f"Number of hash functions: {bf.hash_count}")

    # Add a word list
    words = [f"word_{i}" for i in range(10000)]
    for w in words:
        bf.add(w)

    # Query inserted elements (no false negatives)
    fn_count = sum(1 for w in words if not bf.contains(w))
    print(f"\nFalse negatives: {fn_count} (always 0)")

    # Query non-inserted elements (measure false positives)
    test_words = [f"test_{i}" for i in range(10000)]
    fp_count = sum(1 for w in test_words if bf.contains(w))
    print(f"False positives: {fp_count} / {len(test_words)}")
    print(f"Measured FP rate: {fp_count / len(test_words):.4f}")
    print(f"Theoretical FP rate: {bf.false_positive_rate():.4f}")

    # Memory comparison
    actual_set = set(words)
    import sys
    # Note: getsizeof counts only the set object itself, not the stored
    # strings, so the real memory gap is even larger than shown.
    set_size = sys.getsizeof(actual_set)
    print(f"\n--- Memory comparison ---")
    print(f"Bloom Filter: {bf.memory_bytes():,} bytes")
    print(f"Python set:   {set_size:,} bytes")
    print(f"Reduction: {(1 - bf.memory_bytes() / set_size) * 100:.1f}%")

Example Output

Bit array size: 95,851 bits (11,982 bytes)
Number of hash functions: 7
False negatives: 0 (always 0)
False positives: 107 / 10000
Measured FP rate: 0.0107
Theoretical FP rate: 0.0101

--- Memory comparison ---
Bloom Filter: 11,982 bytes
Python set:   524,504 bytes
Reduction: 97.7%

When storing 10,000 elements, the Bloom Filter uses approximately 97% less memory than a Python set.

Practical Applications

| Use case | Description |
|---|---|
| Web crawlers | URL deduplication (used in Google Bigtable) |
| Databases (LSM-tree) | Avoiding disk reads for non-existent keys (LevelDB, RocksDB) |
| Spell checkers | Fast detection of words not in the dictionary |
| Network routers | Packet filtering, cache lookup |
| CDN / caching | Pre-checking whether content exists in cache |

HyperLogLog

HyperLogLog is an algorithm for estimating the cardinality (number of unique elements) of a set using extremely little memory. It was proposed by Flajolet et al. in 2007.

How It Works

The core idea of HyperLogLog is to estimate the number of unique elements by observing the maximum number of leading zero bits in hash values.

Intuitive understanding: Imagine flipping a coin repeatedly and recording the flips until the first heads. The probability of heads on the first flip is \(1/2\), of starting with two tails is \(1/4\), and of starting with \(r\) consecutive tails is \(1/2^r\). So if, across many trials, the longest opening run of tails you observe is \(r\), there were roughly \(2^r\) trials.
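This intuition is easy to simulate: draw many random 32-bit "hash values", record the longest run of leading zero bits, and use \(2^r\) as the estimate. This is a toy sketch, not part of the implementation below; the single-observation estimate is deliberately crude, which is exactly what the registers introduced next are meant to fix:

```python
import random

random.seed(42)  # deterministic demo

def leading_zero_run(value: int, width: int = 32) -> int:
    # Consecutive 0 bits from the most significant end
    # ("tails before the first heads").
    run = 0
    for i in range(width - 1, -1, -1):
        if value & (1 << i):
            break
        run += 1
    return run

n_trials = 100_000
r = max(leading_zero_run(random.getrandbits(32)) for _ in range(n_trials))
print(f"max zero run: {r} -> estimate ~ 2^{r} = {2 ** r:,} (true: {n_trials:,})")
```

The estimate is typically within a power of two of the truth but can easily be off by a large factor, since it rests on a single extreme observation.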

Improvement over LogLog: Since a single register has high variance, HyperLogLog uses the first \(p\) bits of each hash value to distribute elements across \(m = 2^p\) registers (stochastic averaging), improving estimation accuracy.

Each register \(M[j]\) stores the maximum number of leading zero bits + 1 observed among all elements assigned to that register.

The estimation formula is:

\[E = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1} \tag{4}\]

where \(\alpha_m\) is a bias correction factor defined as:

\[\alpha_m = \left(m \int_0^{\infty} \left(\log_2 \left(\frac{2+u}{1+u}\right)\right)^m du\right)^{-1} \tag{5}\]

In practice, the following approximations are commonly used:

| \(m\) | \(\alpha_m\) |
|---|---|
| 16 | 0.673 |
| 32 | 0.697 |
| 64 | 0.709 |
| \(\geq 128\) | \(0.7213 / (1 + 1.079 / m)\) |

The standard error of HyperLogLog is:

\[\sigma = \frac{1.04}{\sqrt{m}} \tag{6}\]

With \(m = 2^{14} = 16384\) registers, the standard error is approximately 0.81%.
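The accuracy-memory trade-off from equation (6) can be tabulated for a few precision values. A small sketch, assuming 6 bits of storage per register (the packing the implementation below also assumes):

```python
import math

for p in (10, 12, 14, 16):
    m = 1 << p                 # number of registers, m = 2^p
    err = 1.04 / math.sqrt(m)  # standard error, equation (6)
    mem = m * 6 // 8           # bytes at 6 bits per register
    print(f"p={p}: m={m:>6}, std error={err * 100:.2f}%, memory={mem:,} bytes")
```

At \(p = 14\) this gives the 0.81% error and roughly 12 KB cited throughout the article; each extra bit of precision halves the memory-error product's denominator (doubling \(m\)) while shrinking the error only by \(\sqrt{2}\).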

Python Implementation

import hashlib
import math


class HyperLogLog:
    """HyperLogLog implementation from scratch."""

    def __init__(self, precision: int = 14):
        """
        :param precision: Precision parameter p that determines register count.
                          Register count m = 2^p (default p=14 -> 16384 registers)
        """
        self.p = precision
        self.m = 1 << precision  # 2^p
        self.registers = [0] * self.m
        self.alpha = self._compute_alpha(self.m)

    @staticmethod
    def _compute_alpha(m: int) -> float:
        """Compute the bias correction factor."""
        if m == 16:
            return 0.673
        elif m == 32:
            return 0.697
        elif m == 64:
            return 0.709
        else:
            return 0.7213 / (1 + 1.079 / m)

    def _hash(self, item: str) -> int:
        """Convert an element to a 64-bit hash value."""
        h = hashlib.sha256(item.encode()).hexdigest()
        return int(h[:16], 16)  # Use the first 64 bits

    @staticmethod
    def _leading_zeros(value: int, max_bits: int) -> int:
        """Return the number of leading zero bits + 1."""
        if value == 0:
            return max_bits + 1
        count = 1
        for i in range(max_bits - 1, -1, -1):
            if value & (1 << i):
                break
            count += 1
        return count

    def add(self, item: str) -> None:
        """Add an element."""
        h = self._hash(item)
        # Use the first p bits to determine register index
        j = h >> (64 - self.p)
        # Compute leading zeros from the remaining bits
        remaining = h & ((1 << (64 - self.p)) - 1)
        rank = self._leading_zeros(remaining, 64 - self.p)
        self.registers[j] = max(self.registers[j], rank)

    def count(self) -> int:
        """Estimate the cardinality (number of unique elements)."""
        # Harmonic mean-based estimation
        indicator = sum(2.0 ** (-r) for r in self.registers)
        estimate = self.alpha * self.m * self.m / indicator

        # Small range correction (Linear Counting)
        if estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros > 0:
                estimate = self.m * math.log(self.m / zeros)

        # Large range correction (from the original 32-bit formulation;
        # rarely triggered with a 64-bit hash, kept for completeness)
        if estimate > (1 << 32) / 30.0:
            estimate = -(1 << 32) * math.log(1 - estimate / (1 << 32))

        return int(estimate)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Merge two HyperLogLog instances."""
        if self.p != other.p:
            raise ValueError("Precision parameters must match")
        merged = HyperLogLog(self.p)
        merged.registers = [
            max(a, b) for a, b in zip(self.registers, other.registers)
        ]
        return merged

    def standard_error(self) -> float:
        """Return the standard error."""
        return 1.04 / math.sqrt(self.m)

    def memory_bytes(self) -> int:
        """Approximate memory usage in bytes."""
        # Each register needs at most 6 bits (max value is 64-p+1)
        return self.m * 6 // 8


# --- Example usage ---
if __name__ == "__main__":
    hll = HyperLogLog(precision=14)
    true_count = 1_000_000

    for i in range(true_count):
        hll.add(f"user_{i}")

    estimated = hll.count()
    error = abs(estimated - true_count) / true_count * 100

    print(f"True unique count: {true_count:,}")
    print(f"Estimated unique count: {estimated:,}")
    print(f"Error: {error:.2f}%")
    print(f"Theoretical standard error: {hll.standard_error() * 100:.2f}%")

    # Memory comparison
    import sys
    actual_set = {f"user_{i}" for i in range(true_count)}
    set_size = sys.getsizeof(actual_set)
    hll_size = hll.memory_bytes()
    print(f"\n--- Memory comparison ---")
    print(f"HyperLogLog: {hll_size:,} bytes")
    print(f"Python set:  {set_size:,} bytes")
    print(f"Reduction: {(1 - hll_size / set_size) * 100:.1f}%")

    # Merge demo
    hll_a = HyperLogLog(precision=14)
    hll_b = HyperLogLog(precision=14)
    for i in range(500_000):
        hll_a.add(f"user_{i}")
    for i in range(300_000, 800_000):
        hll_b.add(f"user_{i}")
    merged = hll_a.merge(hll_b)
    print(f"\n--- Merge demo ---")
    print(f"HLL_A unique count: {hll_a.count():,} (true: 500,000)")
    print(f"HLL_B unique count: {hll_b.count():,} (true: 500,000)")
    print(f"Merged unique count: {merged.count():,} (true: 800,000)")

Example Output

True unique count: 1,000,000
Estimated unique count: 1,007,822
Error: 0.78%
Theoretical standard error: 0.81%

--- Memory comparison ---
HyperLogLog: 12,288 bytes
Python set:  33,554,656 bytes
Reduction: 99.96%

--- Merge demo ---
HLL_A unique count: 498,653 (true: 500,000)
HLL_B unique count: 501,247 (true: 500,000)
Merged unique count: 802,134 (true: 800,000)

With 1 million unique elements, HyperLogLog achieves less than 1% error using only about 12 KB of memory – a 99.96% reduction compared to a Python set.

Practical Applications

| Use case | Description |
|---|---|
| Redis PFCOUNT | Built-in HyperLogLog support; cardinality estimation in 12 KB |
| Web analytics | Real-time unique visitor counting |
| Network monitoring | Estimating unique IP addresses in network flows |
| Database query optimization | Fast approximation of COUNT(DISTINCT ...) |
| Distributed systems | Mergeable across nodes for independent computation and aggregation |

Comparison Table

| Feature | Bloom Filter | HyperLogLog | Python set |
|---|---|---|---|
| Purpose | Set membership testing | Cardinality estimation | General-purpose set ops |
| Memory (100K elements) | ~120 KB | ~12 KB | ~4 MB |
| Accuracy | False positives (rate tunable) | Std. error 0.81% (\(p=14\)) | 100% accurate |
| Insert | \(O(k)\) | \(O(1)\) | Amortized \(O(1)\) |
| Query | \(O(k)\) ("Is it in the set?") | \(O(m)\) ("How many unique?") | \(O(1)\) (various ops) |
| Deletion | Not possible (Counting BF can) | Not possible | Possible |
| Merge | Bitwise OR | Register-wise max | Union |
| False negatives | None | N/A | N/A |

Probabilistic data structures provide dramatic improvements in memory efficiency and speed when 100% accuracy is not strictly required. Choosing the right data structure for the task at hand is key.

References

  • Bloom, B. H. (1970). “Space/time trade-offs in hash coding with allowable errors.” Communications of the ACM, 13(7), 422-426.
  • Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007). “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.” Discrete Mathematics and Theoretical Computer Science, AH, 137-156.
  • Broder, A., & Mitzenmacher, M. (2004). “Network applications of Bloom filters: A survey.” Internet Mathematics, 1(4), 485-509.