a*u
Post #1
A friend was just asked this in an interview. I think someone discussed a similar massive-data duplicate-finding problem here before, but I can't find the thread. This kind of question is pretty classic, so I'm posting it again to hear everyone's opinions.
1TB data on disk with around 1KB per data record. Find duplicates using
512MB RAM and infinite disk space.
My thought is external sorting. The data comprise around 1 billion records; 512MB of RAM can hold around 0.5 million records, so we need around 2000 rounds of in-memory quick sort to produce sorted runs, then merge the runs and find duplicates among adjacent equal records.
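The external-sort plan above can be sketched like this. It's a minimal illustration, not a tuned implementation: records are assumed to be one text line each (a simplification of the 1KB binary records), `chunk_size` stands in for "what fits in 512MB", and `heapq.merge` performs the k-way merge over the sorted runs:

```python
import heapq
import os
import tempfile

def _spill(chunk):
    """Sort one in-memory chunk and write it to a temp file (a sorted run)."""
    chunk.sort()
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(rec + "\n" for rec in chunk)
    return path

def find_duplicates(records, chunk_size):
    """External sort: spill sorted runs to disk, then k-way merge
    and report records that appear more than once."""
    runs, chunk = [], []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:      # "RAM" is full: spill a run
            runs.append(_spill(chunk))
            chunk = []
    if chunk:
        runs.append(_spill(chunk))

    dups, prev = set(), None
    files = [open(p) for p in runs]
    # After merging, duplicates are adjacent in the sorted stream.
    for line in heapq.merge(*files):
        if line == prev:
            dups.add(line.rstrip("\n"))
        prev = line
    for f in files:
        f.close()
    for p in runs:
        os.remove(p)
    return dups
```

For example, `find_duplicates(["b", "a", "c", "a", "b", "d"], chunk_size=2)` writes three sorted runs and returns `{"a", "b"}`. At real scale the merge would be staged (merging a few hundred runs at a time) so the number of open file handles and merge buffers stays within memory.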
A Bloom filter may not work, since 512MB of memory can spare only about 4 bits per record, which will yield a high error rate (opt