An algorithm question - 未名空间 MITBBS historical archive


Board: JobHunting (待字闺中)

w*f 2012-03-30 07:03

Post 1

There is a file containing 1,000,000,000 records of the form (user, login-timing, ...).
We need the top 1000 by login time, plus the median. Which algorithm is best?

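s*n 2012-03-30 07:03

Post 2

Hash on userid to distribute the records across k machines. Each machine aggregates and sorts its own share.
Then do a k-way merge (using a heap) to get the global top 1000 and the median.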
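
For illustration, the partition, sort, and merge steps might look roughly like the sketch below. It runs in a single process and treats each "machine" as an in-memory list; the function names, the (userid, login_time) tuple layout, and the choice of the lower median are assumptions, not something stated in the post.

```python
import heapq

def partition_by_user(records, k):
    """Assign each (userid, login_time) record to one of k buckets by hashing userid."""
    buckets = [[] for _ in range(k)]
    for userid, login_time in records:
        buckets[hash(userid) % k].append((userid, login_time))
    return buckets

def top_n_and_median(records, k=4, n=1000):
    """Return the n records with the smallest login_time and the lower-median record."""
    # Each simulated machine sorts its own bucket by login time.
    sorted_buckets = [sorted(b, key=lambda r: r[1])
                      for b in partition_by_user(records, k)]
    total = sum(len(b) for b in sorted_buckets)
    mid = (total - 1) // 2  # position of the lower median (assumed convention)
    # heapq.merge does the heap-based k-way merge lazily.
    merged = heapq.merge(*sorted_buckets, key=lambda r: r[1])
    top, median = [], None
    for i, rec in enumerate(merged):
        if i < n:
            top.append(rec)
        if i == mid:
            median = rec
        if i >= n and i >= mid:
            break  # nothing further is needed from the merge
    return top, median
```

Because the merge is lazy, only one record per bucket sits in the heap at a time; the memory cost is in the per-bucket sort, which is what the next post asks about.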

w*f 2012-03-30 07:03

Post 3

Thanks, swan.
This approach seems to use a lot of memory. Is there a better algorithm?

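z*8 2012-03-30 07:03

Post 4

For the top 1000, a max-heap is enough; for the median, you have to do it that way.

[w****f wrote:]

: Thanks, swan.
: This approach seems to use a lot of memory. Is there a better algorithm?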
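
A minimal sketch of the max-heap idea, assuming "top 1000" means the 1000 smallest login-time values (for the 1000 largest, a min-heap would be used instead). The function name `smallest_n` is hypothetical.

```python
import heapq

def smallest_n(values, n=1000):
    """Return the n smallest values using a max-heap capped at n entries."""
    if n <= 0:
        return []
    heap = []  # stores negated values, so -heap[0] is the largest value kept
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v beats the current worst candidate: replace it in O(log n)
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)
```

Memory stays at n entries no matter how long the input is, and each record costs at most O(log n).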

s*n 2012-03-30 07:03

Post 5

Finding the median requires sorting; you can use an external sort.
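
As a rough illustration of what an external sort involves: sort chunks that fit in memory, write each to disk, then stream a merge of the sorted runs. The sketch below assumes integer login times, one value per line in the temporary files, and the lower median; `external_sort_median` and `chunk_size` are hypothetical names.

```python
import heapq
import os
import tempfile

def external_sort_median(values, chunk_size=100_000):
    """Sort integers in memory-sized runs on disk, then merge to find the lower median."""
    paths, chunk, total = [], [], 0

    def flush():
        # Sort the current chunk and write it out as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        total += 1
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()
    if total == 0:
        return None

    files = [open(p) for p in paths]
    try:
        # Stream the k-way merge; only one line per run is held in memory.
        merged = heapq.merge(*[(int(line) for line in f) for f in files])
        mid = (total - 1) // 2
        for i, v in enumerate(merged):
            if i == mid:
                return v
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.remove(p)
```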

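d*n 2012-03-30 07:03

Post 6

+1
It seems the 1 billion records cannot all be loaded into memory.
Should we put the users into a BST or a hash table and keep only the earliest time for each?
With 1 billion users, how many unique names are there?

[z*********8 wrote:]

: For the top 1000, a max-heap is enough; for the median, you have to do it that way.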
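
The hash-table suggestion could be sketched as follows, assuming records are (userid, login_time) tuples and that the set of distinct users fits in memory; the function name is hypothetical.

```python
def earliest_login_per_user(records):
    """Map each userid to the smallest login_time seen for it."""
    earliest = {}
    for userid, login_time in records:
        # Keep the record only if it is the first or an earlier one for this user.
        if userid not in earliest or login_time < earliest[userid]:
            earliest[userid] = login_time
    return earliest
```

Memory then scales with the number of distinct users rather than with the number of records.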

w*f 2012-03-30 07:03

Post 7

There are 1 billion records, but only 1 million unique userids.
"Finding the median requires sorting; you can use an external sort." What is an external sort?
Would the following work?
1.) Take the first 1001 records and use the 501st as the initial median value of login-timing.
2.) Read the next record and shift the median to account for it.
3.) Repeat step 2 until the end of the file.
(But this only gives the timing value, not the associated userid.)
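
For comparison, the usual way to "shift the median" exactly as values arrive is the two-heap running median sketched below. This is a different technique from the three steps above: it has to keep every value seen so far, so it does not save memory, which suggests why a fixed window of 1001 values cannot track the exact median. The class name and the lower-median convention are assumptions.

```python
import heapq

class RunningMedian:
    """Exact running median using two heaps that together hold all values."""

    def __init__(self):
        self.low = []   # max-heap (negated values) holding the lower half
        self.high = []  # min-heap holding the upper half

    def add(self, v):
        # Push into the lower half, then move its largest value up.
        heapq.heappush(self.low, -v)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so the lower half is never smaller than the upper half.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        # Lower median: the largest value in the lower half.
        return -self.low[0]
```

Storing (login_time, userid) tuples instead of bare times would also address the last point, since the userid would travel with the value.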

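w*x 2012-03-30 07:03

Post 8

1. Split the file into 2^n small files of equal size, such that any two of them fit in memory.
2. Load them into memory two at a time and do a median merge on each pair, merging the 2 files into one file of the same size.
3. Keep merging like this until all the files are condensed into a single file, then take the median of that file. That said, this problem should allow an approximate (non-exact) median.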
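
The post does not define "median merge". One possible reading, sketched below purely as an assumption, is to merge each sorted pair and keep only the middle half, so the output is the same size as one input. The function names are hypothetical, lists stand in for files, and the number of chunks is assumed to be a power of two.

```python
def median_merge(a, b):
    """Merge two equal-sized lists and keep the middle half (assumed interpretation)."""
    merged = sorted(a + b)
    m = len(a)
    start = (len(merged) - m) // 2
    return merged[start:start + m]

def approximate_median(chunks):
    """chunks: 2^n lists of equal size. Returns an approximate lower median."""
    level = [sorted(c) for c in chunks]
    while len(level) > 1:
        # Halve the number of chunks on each pass by merging neighbours.
        level = [median_merge(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    final = level[0]
    return final[(len(final) - 1) // 2]
```

Each pass discards the extremes of every pair, so the result is only an estimate: values dropped early can belong near the true median when the chunks have very different distributions.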