Definition: Randomly select k items from a stream of items of unknown length. Save the first k items in an array of size k. For each item j, j > k, choose a random integer M from 1 to j (inclusive). If M ≤ k, replace item M of the array with item j.
Generalization (I am a kind of ...)
Note: Each possible selection of k items has an equal probability of occurring.
This algorithm is an in-memory variant of Knuth's Algorithm 3.4.2R [Knuth97, 2:144]. He credits Alan G. Waterman. This algorithm is suggested in exercise 10.
Sahil's explanation of correctness and extension to distributed reservoir sampling.
Vitter's algorithms X, Y, and Z use far fewer random numbers by choosing how many items to skip, rather than deciding whether or not to skip each item.
Jeffrey Scott Vitter, Random Sampling with a Reservoir, ACM Transactions on Mathematical Software (TOMS), 11(1):37-57, March 1985.
Available on-line, for instance at http://www.cs.umd.edu/~samir/498/vitter.pdf.
If you have suggestions, corrections, or comments, please get in touch with Paul E. Black.
Entry modified 8 October 2010.
HTML page formatted Tue Dec 6 16:16:32 2011.
Cite this as:
Paul E. Black, "reservoir sampling", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 8 October 2010. (accessed TODAY) Available from: http://www.nist.gov/dads/HTML/reservoirSampling.html