Definition: Randomly select k items from a stream of items of unknown length. Save the first k items in an array of size k. For each item j, j > k, choose a random integer M from 1 to j (inclusive). If M ≤ k, replace item M of the array with item j.
Generalization (I am a kind of ...)
Note: Each possible selection of k items has an equal probability of occurring.
This algorithm is an in-memory variant of Knuth's Algorithm 3.4.2R [Knuth97, 2:144]. He credits Alan G. Waterman. This algorithm is suggested in exercise 10.
Sahil's explanation of correctness. The extension to distributed reservoir sampling is flawed.
Vitter's algorithms X, Y, and Z use far fewer random numbers by choosing how many items to skip, rather than deciding whether or not to skip each item.
Jeffrey Scott Vitter, Random Sampling with a Reservoir, ACM Transactions on Mathematical Software (TOMS), 11(1):37-57, March 1985.
Available on-line, for instance at http://www.cs.umd.edu/~samir/498/vitter.pdf.
If you have suggestions, corrections, or comments, please get in touch with Paul Black.
Entry modified 22 April 2019.
HTML page formatted Mon Apr 22 12:17:39 2019.
Cite this as:
Paul E. Black, "reservoir sampling", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed. 22 April 2019. (accessed TODAY) Available from: https://www.nist.gov/dads/HTML/reservoirSampling.html