Reservoir Sampling: The Ultimate Cheatsheet

1. Detailed Explanation

- What is reservoir sampling? Reservoir sampling is a family of randomized algorithms for selecting a random sample of k items from a list S containing n items, where n is either very large or unknown in advance. Typically, n is large enough that S cannot fit in main memory. The algorithm ensures that each item in S has an equal probability of being chosen for the sample.
- Why is it important, and what kinds of problems does it solve? Reservoir sampling is crucial when dealing with massive datasets, data streams, or situations where the size of the data is unknown or impractical to determine beforehand. It lets you obtain a representative sample without storing the entire dataset. This is particularly useful for:
- Estimating statistics (e.g., mean, variance) of a large dataset.
- Machine learning tasks on streaming data.
- Randomly selecting data for testing or validation.
- Situations where you only need a subset of the data to make inferences.
- Core concepts, underlying principles, and key terminology:
- Reservoir: The sample of k items that is maintained throughout the algorithm.
- Streaming Data: Data that arrives sequentially and continuously, often at a high rate.
- Uniform Random Sampling: Each element has an equal probability of being selected.
- Probability of Selection: The probability that a given element is included in the reservoir. This probability changes as more elements are processed.
- Random Replacement: The key idea is to replace an existing element in the reservoir with a newly encountered element based on a probability that ensures uniform sampling.
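The last two points can be made precise: after n items have been processed, every item is in the reservoir with probability k/n. A sketch of the argument for an item first seen at position i > k (items among the first k start in the reservoir and only need the survival factor):

```latex
% The item enters the reservoir the moment it arrives:
P(\text{enter at step } i) = \frac{k}{i}
% At each later step t, the new item displaces our slot with probability
% (k/t) \cdot (1/k) = 1/t, so the item survives step t with probability (t-1)/t.
% The survival probabilities telescope:
P(\text{survive steps } i+1, \dots, n) = \prod_{t=i+1}^{n} \frac{t-1}{t} = \frac{i}{n}
% Multiplying gives the uniform inclusion probability:
P(\text{in final reservoir}) = \frac{k}{i} \cdot \frac{i}{n} = \frac{k}{n}
```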
2. When to Use Reservoir Sampling (and When Not To)
- Problem patterns that suggest reservoir sampling is a good fit:
- You need to select a random sample of k items from a large dataset.
- The size of the dataset is unknown or too large to fit in memory.
- Data is arriving as a stream.
- You need a uniform random sample, meaning each element has an equal chance of being selected.
- Scenarios where a different data structure or algorithm would be more appropriate:
- Dataset fits in memory and random access is efficient: If the dataset is small enough to fit into memory and you can efficiently access any element at random, simpler methods like generating random indices and selecting elements based on those indices are preferable. Reservoir sampling introduces overhead and complexity that are unnecessary in such scenarios.
- Non-uniform sampling is required: If you need to sample elements with different probabilities (e.g., weighted sampling), reservoir sampling in its basic form isn’t suitable. You’ll need to use more complex algorithms or modifications of reservoir sampling to handle non-uniform probabilities.
- Exact statistics are required: Reservoir sampling provides a sample of the data, not the entire dataset. If you need to calculate exact statistics or perform operations requiring the full dataset, reservoir sampling is not appropriate.
- Order is important: If the order of elements in the sample matters or needs to be preserved, reservoir sampling might not be ideal. While you get a random sample, the order of the elements within the reservoir will generally reflect the order they appeared in the stream.
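For the in-memory case above, the standard library already does the work. A minimal sketch using Python's `random.sample`, which draws k distinct elements uniformly at random:

```python
import random

data = list(range(100))  # Entire dataset fits in memory

# No streaming machinery needed: random.sample picks k distinct
# elements uniformly at random from the sequence.
sample = random.sample(data, k=10)

print(sample)
```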
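For the non-uniform case, one well-known modification is the Efraimidis–Spirakis algorithm (A-ES): assign each item a random key u^(1/w), where u is uniform on (0, 1) and w is the item's weight, and keep the k items with the largest keys. A minimal sketch (the function name and the example weights are illustrative, not from the original text):

```python
import heapq
import random

def weighted_reservoir_sampling(stream, k):
    """Keep k items from (item, weight) pairs; heavier items are
    more likely to be kept (Efraimidis-Spirakis A-ES sketch)."""
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# A heavily weighted item should be selected almost every time.
weighted = [("heavy", 1000.0)] + [(f"light{i}", 1.0) for i in range(5)]
picked = weighted_reservoir_sampling(weighted, k=1)
print(picked)
```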
3. Core Algorithm / Data Structure Template
Here’s a general pseudo-code template for Reservoir Sampling (specifically, Algorithm R):
Input:
- S: Stream of items (potentially infinite)
- k: Size of the reservoir (number of items to sample)

Output:
- Reservoir: A sample of k items chosen uniformly at random from S

Algorithm:
1. Initialize a reservoir R of size k with the first k items from S.
2. For each remaining item in S (at index i, from i = k+1 to n):
   a. Generate a random integer j between 1 and i.
   b. If j <= k, replace the j-th item in the reservoir R with the current item.
3. Return the reservoir R.

4. Code Implementations (Python, Java, C++)
Python

```python
import random

def reservoir_sampling(stream, k):
    """
    Selects k items uniformly at random from a stream of data.

    Args:
        stream: An iterable (e.g., list, generator) representing the data stream.
        k: The number of items to sample.

    Returns:
        A list containing k randomly selected items from the stream.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # Random integer between 0 and i (inclusive)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example usage:
if __name__ == '__main__':
    data_stream = range(100)  # Simulate a stream of 100 numbers
    k = 10  # Sample size
    sample = reservoir_sampling(data_stream, k)
    print(f"Sample of size {k}: {sample}")
```

Java

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSampling {

    public static List<Integer> reservoirSampling(int[] stream, int k) {
        List<Integer> reservoir = new ArrayList<>(k);
        Random random = new Random();

        // Fill the reservoir with the first k elements
        for (int i = 0; i < k; i++) {
            reservoir.add(stream[i]);
        }

        // Iterate through the rest of the stream
        for (int i = k; i < stream.length; i++) {
            // Generate a random index between 0 and i (inclusive)
            int j = random.nextInt(i + 1);

            // If the random index is less than k, replace the element in the reservoir
            if (j < k) {
                reservoir.set(j, stream[i]);
            }
        }

        return reservoir;
    }

    public static void main(String[] args) {
        int[] dataStream = new int[100];
        for (int i = 0; i < 100; i++) {
            dataStream[i] = i;
        }
        int k = 10;
        List<Integer> sample = reservoirSampling(dataStream, k);
        System.out.println("Sample of size " + k + ": " + sample);
    }
}
```

C++

```cpp
#include <iostream>
#include <vector>
#include <random>

std::vector<int> reservoirSampling(const std::vector<int>& stream, int k) {
    std::vector<int> reservoir(k);

    // Fill the reservoir with the first k elements
    for (int i = 0; i < k; ++i) {
        reservoir[i] = stream[i];
    }

    // Iterate through the rest of the stream
    std::random_device rd;
    std::mt19937 gen(rd());
    const int n = static_cast<int>(stream.size());

    for (int i = k; i < n; ++i) {
        // Generate a random index between 0 and i (inclusive)
        std::uniform_int_distribution<> distrib(0, i);
        int j = distrib(gen);

        // If the random index is less than k, replace the element in the reservoir
        if (j < k) {
            reservoir[j] = stream[i];
        }
    }

    return reservoir;
}

int main() {
    std::vector<int> dataStream(100);
    for (int i = 0; i < 100; ++i) {
        dataStream[i] = i;
    }
    int k = 10;
    std::vector<int> sample = reservoirSampling(dataStream, k);

    std::cout << "Sample of size " << k << ": ";
    for (int val : sample) {
        std::cout << val << " ";
    }
    std::cout << std::endl;

    return 0;
}
```

5. Complexity Analysis
| Operation | Time Complexity | Space Complexity | Explanation |
|---|---|---|---|
| Initialization | O(k) | O(k) | Filling the reservoir with the first k elements. |
| Processing item | O(1) (avg) | O(1) | Each item in the stream is processed in constant time. The random number generation and potential replacement in the reservoir are constant-time operations. |
| Overall | O(n) | O(k) | We iterate through the entire stream of n items, processing each in O(1) time on average. The space complexity is dominated by the reservoir, which stores k items. |
- Best Case: N/A - The algorithm always processes the entire stream.
- Average Case: O(n) time and O(k) space, as described above.
- Worst Case: O(n) time and O(k) space. The performance is consistent regardless of the data in the stream.
6. Pro Tips, Tricks, and Common Pitfalls
- Pro Tips:
- Use efficient random number generators: The quality of the random number generator affects the uniformity of the sample. Use a good quality PRNG provided by your language’s standard library.
- Consider using a fixed seed for reproducibility: If you need to generate the same sample multiple times for testing or debugging, seed the random number generator.
- Optimization for very large k: If k is very large compared to the initial portion of the stream, you can optimize the initial filling of the reservoir.
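The seeding tip can be paired with a quick empirical sanity check of uniformity. A self-contained sketch (the sampler is re-declared here so the snippet runs standalone; it matches the Python implementation in Section 4):

```python
import random
from collections import Counter

def reservoir_sampling(stream, k):
    """Algorithm R, as in Section 4."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

random.seed(42)  # Fixed seed: identical samples on every run, handy for debugging

# Count how often each of 20 elements lands in a size-5 sample over many trials.
counts = Counter()
trials = 20_000
for _ in range(trials):
    counts.update(reservoir_sampling(range(20), 5))

# Each element should appear with empirical frequency close to k/n = 5/20 = 0.25.
for element in range(20):
    print(element, counts[element] / trials)
```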
- Common Pitfalls:
- Off-by-one errors: Ensure that the random index j is generated correctly within the appropriate range (0 to i, or 1 to i, depending on the specific implementation).
- Incorrect random number generation: Using a biased or poorly seeded random number generator will lead to a non-uniform sample.
- Memory management: In languages like C++, be mindful of memory allocation when dealing with potentially very large streams.
7. Classic Problem Examples
Description: Given a singly linked list, return a random node’s value from the linked list. Each node must have the same probability of being chosen. Implement the Solution class, with a constructor that takes the head of the list and a getRandom() method that returns a uniformly random node’s value.
High-Level Approach:
Treat the linked list as a stream of data and use reservoir sampling with k = 1. Iterate through the list, keeping track of the index i of the current node. For each node, generate a random integer j between 0 and i; if j is 0, replace the reservoir’s value with the current node’s value (the reservoir initially holds the first node’s value). After the full pass, the reservoir holds the value of a node selected uniformly at random; return it.
```python
import random

# Definition for singly-linked list.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

class Solution:
    def __init__(self, head: ListNode):
        self.head = head

    def getRandom(self) -> int:
        reservoir = None
        i = 0
        curr = self.head
        while curr:
            if i == 0:
                reservoir = curr.val
            else:
                j = random.randint(0, i)
                if j == 0:
                    reservoir = curr.val
            i += 1
            curr = curr.next
        return reservoir
```
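A quick harness for the class above (the classes are re-stated so the snippet runs standalone; the list values 1 → 2 → 3 are illustrative). Each value should come back roughly a third of the time:

```python
import random
from collections import Counter

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

class Solution:
    def __init__(self, head: ListNode):
        self.head = head

    def getRandom(self) -> int:
        # Reservoir sampling with k = 1 over the list treated as a stream.
        # For i == 0, randint(0, 0) is always 0, so the first node seeds the reservoir.
        reservoir = None
        i = 0
        curr = self.head
        while curr:
            if random.randint(0, i) == 0:
                reservoir = curr.val
            i += 1
            curr = curr.next
        return reservoir

# Build 1 -> 2 -> 3 and estimate how often each value is returned.
head = ListNode(1, ListNode(2, ListNode(3)))
solution = Solution(head)
counts = Counter(solution.getRandom() for _ in range(30_000))
print({val: round(counts[val] / 30_000, 3) for val in (1, 2, 3)})
```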