NIST

reservoir sampling

(algorithm)

Definition: Randomly select k items from a stream of items of unknown length. Save the first k items in an array of size k. For each item j, j > k, choose a random integer M from 1 to j (inclusive). If M ≤ k, replace item M of the array with item j.

Generalization (I am a kind of ...)
randomized algorithm.

Note: Each possible selection of k items has an equal probability of occurring.

This algorithm is an in-memory variant of Knuth's Algorithm 3.4.2R [Knuth97, 2:144]. He credits Alan G. Waterman. This algorithm is suggested in exercise 10.

Author: PEB

Implementation

(Perl).

More information

Sahil (Thaker?)'s explanation of correctness. The extension to distributed reservoir sampling is flawed.

Vitter's algorithms X, Y, and Z use far fewer random numbers by choosing how many items to skip, rather than deciding whether or not to skip each item.
Jeffrey Scott Vitter, Random Sampling with a Reservoir, ACM Transactions on Mathematical Software (TOMS), 11(1):37-57, March 1985.
Available on-line, for instance at http://www.cs.umd.edu/~samir/498/vitter.pdf.


Go to the Dictionary of Algorithms and Data Structures home page.

If you have suggestions, corrections, or comments, please get in touch with Paul Black.

Entry modified 21 January 2020.
HTML page formatted Tue Jan 21 16:28:40 2020.

Cite this as:
Paul E. Black, "reservoir sampling", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed. 21 January 2020. (accessed TODAY) Available from: https://www.nist.gov/dads/HTML/reservoirSampling.html