未知输入流总量,取出n个数据,要求取出的数据是等概率的
概述
对于一个输入流,如果已知文件的总大小L,要取出其中的n个数据,并且要求所取出的文件都是等概率的,那么只需要根据n/L的概率来取数据就okay了。
然而在大多数情况下,L并不可知,此时还要求按照等概率来取n个数据,如何做到呢?
方法
- 先依次从数据流中取n个数据
- 从第n+1个数据开始,以n/i的概率来保留新数据,如果新数据要保留,那么旧数据以1/n的等概率淘汰
- 旧数据保留的概率为1-n/i
论证(假设n=10)
- 当有10个数据时,数据全部都保存,item的保留概率是1
- 当有11个数据时,数据的保留概率为:10/11。旧数据的保留概率为(1)*(1/11 + 10/11 * 9/10)= 10/11。
(1)为2的前提1的概率,1/11为旧数据的保留概率:1-10/11=1/11,10/11为保留新数据的概率,9/10为,当出现保留新数据时,旧数据保留的概率:1-1/n
- 当有12个数据时,数据的保留概率为:10/12。旧数据的保留概率为(10/11)*(2/12 + 10/12 * 9/10)= 10/12。
以上可得出,当有i个数据的时候,数据保留的概率为 10/i。
Congratulations @bgping! You have completed some achievement on Steemit and have been rewarded with new badge(s) :
You got a First Reply
Award for the number of comments
Click on any badge to view your own Board of Honor on SteemitBoard.
For more information about SteemitBoard, click here
If you no longer want to receive notifications, reply to this comment with the word
STOP
Congratulations @bgping! You have received a personal award!
1 Year on Steemit
Click on the badge to view your Board of Honor.
Do not miss the last post from @steemitboard!
Participate in the SteemitBoard World Cup Contest!
Collect World Cup badges and win free SBD
Support the Gold Sponsors of the contest: @good-karma and @lukestokes
Congratulations @bgping! You received a personal award!
You can view your badges on your Steem Board and compare to others on the Steem Ranking
Do not miss the last post from @steemitboard:
Vote for @Steemitboard as a witness to get one more award and increased upvotes!