请教个面试题：大数据求中值 - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>JobHunting - 待字闺中

请教个面试题：大数据求中值

请教个面试题：大数据求中值# JobHunting - 待字闺中

j*32014-08-04 07:08

1 楼

这个题150题里有，用2个heap做，
可是如果memory不够大，放不下数据，要怎么做？

r*k2014-08-04 07:08

2 楼

heap就是2个数组
如果连这个都装不下，又怎么算中值？
至少得遍历一遍吧？
除非知道数的范围，碰到最大最小值可以删一对
否则没办法减少空间要求吧
要不用bucket统计一下，能猜出中值在哪个bucket里
然后等于是求那个bucket的第k个值
要遍历数组2遍

【在 j**********3 的大作中提到】

: 这个题150题里有，用2个heap做，
: 可是如果memory不够大，放不下数据，要怎么做？

j*32014-08-04 07:08

3 楼

哎呀妈呀终于有人看了，谢谢！
我看一下回复哈。。。

【在 r*******k 的大作中提到】

: heap就是2个数组
: 如果连这个都装不下，又怎么算中值？
: 至少得遍历一遍吧？
: 除非知道数的范围，碰到最大最小值可以删一对
: 否则没办法减少空间要求吧
: 要不用bucket统计一下，能猜出中值在哪个bucket里
: 然后等于是求那个bucket的第k个值
: 要遍历数组2遍

j*32014-08-04 07:08

4 楼

就是装不下这2个数组，
举例：4GB的data， memory 100mb
我想到的方法，和你说的一样，bucket的，但我总觉得有更好的方法。
我自己遇到过这个题，朋友被面过。前2天面镜貌似也出现过。不知道有什么好的方法
。。。

【在 r*******k 的大作中提到】

s*c2014-08-04 07:08

5 楼

Heap不是把所有的数据都放下，只维持内存能放下大小的heap,每次从文件里读数据进
内存，不断做heapify

j*32014-08-04 07:08

6 楼

可是一个heap，只能方下内存大小的数据，那咋找中值阿？每次进来一个新的data，要
把以前读过的data再读一遍？

【在 s******c 的大作中提到】

: Heap不是把所有的数据都放下，只维持内存能放下大小的heap,每次从文件里读数据进
: 内存，不断做heapify

w*k2014-08-04 07:08

7 楼

一次取一片，求取中值。
然后这些中值再取中值。
可以肯定，如果一个片段的中值Ki小于这个中值K，那片段中小于Ki的必不包含中值，
可以扔掉。
反之Kj大于K的话，大于Kj的片段也可以扔掉，
你只要保证扔掉的大头和小头数量一样，那中值就不会变。
就这么做下去就行。

j*32014-08-04 07:08

8 楼

嗯。。。让偶考虑一下。。。
这个方法比bucket好？

【在 w****k 的大作中提到】

: 一次取一片，求取中值。
: 然后这些中值再取中值。
: 可以肯定，如果一个片段的中值Ki小于这个中值K，那片段中小于Ki的必不包含中值，
: 可以扔掉。
: 反之Kj大于K的话，大于Kj的片段也可以扔掉，
: 你只要保证扔掉的大头和小头数量一样，那中值就不会变。
: 就这么做下去就行。

S*12014-08-04 07:08

9 楼

histogram, 你不是会做吗

【在 j**********3 的大作中提到】

: 嗯。。。让偶考虑一下。。。
: 这个方法比bucket好？

j*32014-08-04 07:08

10 楼

会做咋了？

【在 S******1 的大作中提到】

:
: histogram, 你不是会做吗

j*32014-08-04 07:08

11 楼

我需要更好的办法。。。

【在 S******1 的大作中提到】

:
: histogram, 你不是会做吗

c*y2014-08-04 07:08

12 楼

bucket神马的感觉靠谱，不同bucket存到不同机器上去，估计最好用平衡二叉树做一个
可以O lgn 插入，Olgn 查找第k个元素的结构。节点结构至少要包括node左边几个元素
，右边几个元素。每次插入还有平衡都得把这个也考虑一下。

【在 r*******k 的大作中提到】

T*u2014-08-04 07:08

13 楼

借个只能估计了。以前看过一篇paper，stream超大数据的情况下。我给你找找。

T*u2014-08-04 07:08

14 楼

先给你来2gb大的，再给你来2gb小的。

【在 w****k 的大作中提到】

d*52014-08-04 07:08

15 楼

我也遇到过这个问题，用双heap和bucket都做了，然后面试官说还有更好的办法。。
后来想到如果有好多个machine也许可以这么做，把这堆数字分堆放到machine里，对每
个machine的数字进行排序，同时计算出有多少个数字，假设为N。然后做一个heap，每
次都往里面push和pop数字，做N/2次，就能算出来的。
实在不行，就mapreduce吧

y*n2014-08-04 07:08

16 楼

使用BST作，可以容纳更大的数据量。
BST的每个结点包含value和value出现的次数。
代码再循环过程中保存当前中数，比当前值大的个数，和比当前值小的个数。
如果当前值已经不是中数，使用BST的（前值结点）和(后值结点)的算法移动中数。

p*e2014-08-04 07:08

17 楼

http://stackoverflow.com/questions/3572640/interview-question-f

j*32014-08-04 07:08

18 楼

我敢脚，可能是比较简单的方法，偶们都忽视了？

【在 T*****u 的大作中提到】

: 借个只能估计了。以前看过一篇paper，stream超大数据的情况下。我给你找找。

T*u2014-08-04 07:08

19 楼

算上统计背景的话不太简单。它那个方法基本不用空间。

【在 j**********3 的大作中提到】

: 我敢脚，可能是比较简单的方法，偶们都忽视了？

c*y2014-08-04 07:08

20 楼

大哥review一下我的思路？

【在 j**********3 的大作中提到】

: 我敢脚，可能是比较简单的方法，偶们都忽视了？

j*32014-08-04 07:08

21 楼

哪个方法？就是你提paper里的？

【在 T*****u 的大作中提到】

: 算上统计背景的话不太简单。它那个方法基本不用空间。

j*32014-08-04 07:08

22 楼

好的好的，有问题没看懂的话我问你哈！

【在 c*******y 的大作中提到】

: 大哥review一下我的思路？

f*s2014-08-04 07:08

23 楼

one approach to squeeze the space is to use a bit array or bit set: instead
of saving each number as an integer in the memory, use bit 1 to represent
this number's existence, and use bit 0 to represent a number's non-existence
.
For instance, to represent a set of numbers in range [0.. 10] : { 2, 4, 5, 7
, 8}, you can use a bit vector: 00101101100.
If the range of the data is all the integer (32 bits), then the space needed
to store this big bit vector will be 2^32 bits or 128 MB, which fits the
required 200MB memory.
Then find the median by counting 1's from both ends of the vector, until you
reach the center 1, or pair of 1's, and from its/their index in the vector,
you can find the number/numbers represented and get the median.

s*t2014-08-04 07:08

24 楼

instead
existence
7
needed
you
不对吧，您这是默认没有重复数字啊？要是有100万个1，一个100000，median是啥啊。
数字大小跟median无关，主要是每个不同数字的个数，得想法把这个记下来

【在 f******s 的大作中提到】

: one approach to squeeze the space is to use a bit array or bit set: instead
: of saving each number as an integer in the memory, use bit 1 to represent
: this number's existence, and use bit 0 to represent a number's non-existence
: .
: For instance, to represent a set of numbers in range [0.. 10] : { 2, 4, 5, 7
: , 8}, you can use a bit vector: 00101101100.
: If the range of the data is all the integer (32 bits), then the space needed
: to store this big bit vector will be 2^32 bits or 128 MB, which fits the
: required 200MB memory.
: Then find the median by counting 1's from both ends of the vector, until you