Each of N machines holds an array of N numbers. How can we compute the median of all N^2 numbers efficiently?
c*e
#3
I was asked this by an Indian interviewer at Google Mountain View 2.5 years ago. I remember we discussed the communication cost between machines and the available memory and disk on each machine, but I couldn't give a convincing answer at the time.
【Quoting a**********k】
: Each of the N machines has an array of N numbers, how to calculate
: the median of the N^2 numbers efficiently?
a*k
#5
A more general form: the median of M*N numbers across M machines, each holding N numbers. When M=2 there is a well-known solution. How do we generalize it to an arbitrary integer M?
【Quoting c*****e】
: I was asked by an Indian interviewer at Google Mountain View 2.5 years ago.
: I remembered we discussed communication cost between machines, available
: memory on each machine and available disk on each machine, but I couldn't
: give a convincing answer then.
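The well-known M=2 case is the classic "median of two sorted arrays" problem, solvable in O(log min(m, n)) by binary-searching the partition point. A minimal sketch, assuming each machine's array is already sorted (the function name is illustrative):

```python
def median_two_sorted(a, b):
    """Median of two sorted lists by binary search on the partition
    of the shorter list -- O(log(min(len(a), len(b))))."""
    if len(a) > len(b):
        a, b = b, a
    m, n = len(a), len(b)
    total = m + n
    half = (total + 1) // 2              # size of the combined left partition
    lo, hi = 0, m
    while lo <= hi:
        i = (lo + hi) // 2               # elements of a on the left side
        j = half - i                     # elements of b on the left side
        a_left  = a[i - 1] if i > 0 else float('-inf')
        a_right = a[i]     if i < m else float('inf')
        b_left  = b[j - 1] if j > 0 else float('-inf')
        b_right = b[j]     if j < n else float('inf')
        if a_left <= b_right and b_left <= a_right:
            if total % 2:                # odd count: max of the left halves
                return max(a_left, b_left)
            return (max(a_left, b_left) + min(a_right, b_right)) / 2
        if a_left > b_right:
            hi = i - 1
        else:
            lo = i + 1
```

For example, `median_two_sorted([1, 2], [3, 4])` gives 2.5. The open question in the post is exactly that this partition trick doesn't extend directly past two arrays.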
b*h
#6
If you can use MapReduce, you can partition the numbers into several buckets (in the mapper) and count how many numbers fall into each bucket. The counts tell you which bucket to search for the median. If that bucket is small enough, load it into memory and find the median there; otherwise, run another MapReduce job on it.
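The counting pass above can be sketched on one array. Assuming the values fall in a known range [lo, hi) and equal-width buckets (the bucket count is an illustrative choice, as are the function name and return convention):

```python
def find_median_bucket(numbers, lo, hi, num_buckets=1000):
    """One counting pass: histogram the values into equal-width buckets
    on [lo, hi), then walk the counts to find which bucket contains the
    k-th smallest element (here k = rank of the lower median)."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for x in numbers:
        # clamp so a value at the top edge lands in the last bucket
        idx = min(int((x - lo) / width), num_buckets - 1)
        counts[idx] += 1
    k = (len(numbers) - 1) // 2          # 0-based rank of the lower median
    seen = 0
    for idx, c in enumerate(counts):
        if seen + c > k:
            return idx, k - seen         # (bucket index, rank inside bucket)
        seen += c
```

Only the returned bucket's values then need a second, much smaller pass.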
a*k
#7
If the numbers have been distributed beforehand, then each bucket will itself be spread across the N machines, which reduces it back to the original problem. Also, the MapReduce infrastructure itself has a cost.
【Quoting b********h】
: If you can use MapReduce, you could partition the numbers into several
: buckets (Mapper) and count how many numbers are in each bucket. Then, you
: know which bucket to look for the median. If the bucket is small enough, we
: could put it into memory and find the median. Otherwise, do another
: mapreduce job.
j*x
#8
This is a good question with real room for discussion, unlike those dumb, boring problems...
b*h
#9
No. After the map phase, every bucket will be on one machine; that's what the shuffle between map and reduce does. And yes, the shuffle between map and reduce is the largest cost.
【Quoting a**********k】
: If the numbers have been distributed beforehand, then each bucket will
: be distributed among N machines, then it is reduced to the original problem.
: Also, MapReduce infrastructure itself has a cost.
c*e
#10
I agree. This seems to be a good way to cut the communication cost and disk I/O (I didn't know this two years ago, though).
【Quoting b********h】
: If you can use MapReduce, you could partition the numbers into several
: buckets (Mapper) and count how many numbers are in each bucket. Then, you
: know which bucket to look for the median. If the bucket is small enough, we
: could put it into memory and find the median. Otherwise, do another
: mapreduce job.
a*k
#11
So the map task runs on each machine, and the data is shuffled so that each bucket ends up on one machine. The reduce task on each machine then does the counting and reports to a master machine, which calculates which machine holds the bucket containing the median and asks that machine to finish the rest of the work.
【Quoting b********h】
: No. After the map phase, every bucket will be on one machine. This is due to
: the shuffling between map and reduce.
: Yes, the largest cost is the shuffling between map and reduce.
j*u
#12
Actually, MapReduce doesn't reduce the communication cost or the disk I/O. Each M/R phase requires m*r connections/intermediate files (for m mappers and r reducers).
【Quoting c*****e】
: I agree. This seems to be a good way to decrease the communication cost and
: disk I/O (but didn't know this two years ago).
c*e
#13
Actually, this is a special Map/Reduce scenario. Each machine can pass its distribution statistics over the buckets (which are small compared to the original data) to one machine, which can then determine which bucket the median is in.
【Quoting j*****u】
: actually map reduce doesn't reduce the communication cost nor the disk I/O.
: Each M/R phase requires m*r connections/intermediate files.
a*k
#14
Actually, we can optimize over the classic Map/Reduce. Leave the buckets distributed across the N machines as they are; each machine counts the numbers in each bucket and passes the results to a master machine, which figures out which bucket holds the median. Then we only need to merge the data of that particular bucket onto one machine to finish the remaining work. This way we don't shuffle the data of all the buckets as in classic MR, which, as someone pointed out earlier, is the biggest overhead; we merge the data of just one bucket.
【Quoting c*****e】
: Actually, this is a special Map/Reduce scenario. All the machine can pass
: its distribution (over each bucket) statistic (which is small compared to
: the original data) to one machine and then it can be determined that which
: bucket the median is in.
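The optimization above can be simulated on one machine to make the phases concrete. The machine lists, value range, and bucket count below are illustrative stand-ins, not part of the original discussion; in a real deployment only the count vectors and the one chosen bucket would cross the network:

```python
def distributed_median(machines, lo, hi, num_buckets=100):
    """Sketch of the count-then-merge scheme, simulated locally.
    `machines` is a list of N lists (one array per machine), with all
    values assumed to lie in [lo, hi)."""
    width = (hi - lo) / num_buckets

    def bucket_of(x):
        return min(int((x - lo) / width), num_buckets - 1)

    # Phase 1: each machine histograms its own data (local work only).
    # Phase 2: the master sums the tiny count vectors.
    totals = [0] * num_buckets
    for arr in machines:                 # one iteration per "machine"
        for x in arr:
            totals[bucket_of(x)] += 1

    n = sum(totals)
    k = (n - 1) // 2                     # 0-based rank of the lower median
    seen = 0
    for target, c in enumerate(totals):
        if seen + c > k:                 # this bucket contains the median
            break
        seen += c

    # Phase 3: merge only the target bucket's values and select within it.
    merged = [x for arr in machines for x in arr if bucket_of(x) == target]
    return sorted(merged)[k - seen]
```

Per round, each machine ships `num_buckets` integers instead of its N data records, which is the saving the post describes.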
j*j
#15
good
【Quoting a**********k】
: Actually we can do some optimization over the classic Map/Reduce:
: Let all buckets distributed on N machines as is, each machine counts
: the numbers in all buckets, and pass the result to a master machine,
: which calculate and figure out which bucket has the median, then,
: we need merge the data in that particular bucket to one machine
: to finish the remaining job. In this case, we don't need to shuffle the data
: in all the buckets as in classic MR, which has the most overhead as
: some one pointed out earlier. We only need to merge data in one bucket.
c*e
#16
bladesmith suggested repeating the first step if the bucket is still large.
【Quoting a**********k】
: Actually we can do some optimization over the classic Map/Reduce:
: Let all buckets distributed on N machines as is, each machine counts
: the numbers in all buckets, and pass the result to a master machine,
: which calculate and figure out which bucket has the median, then,
: we need merge the data in that particular bucket to one machine
: to finish the remaining job. In this case, we don't need to shuffle the data
: in all the buckets as in classic MR, which has the most overhead as
: some one pointed out earlier. We only need to merge data in one bucket.
a*k
#17
Yes, if the master finds that the bucket is still too large to fit in memory, it can ask each machine to divide that particular bucket into more sub-buckets and count each sub-bucket locally. This way the only cross-machine communication is collecting the counters (no shuffling of data records), plus the final merge: a huge saving in communication bandwidth as well as time/space cost.
【Quoting c*****e】
: bladesmith suggested to repeat the first step if the bucket is still large.
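That repeated refinement can be sketched for the general k-th smallest element: the master narrows the value range to the chosen bucket and requests a fresh round of counts until the surviving values fit in memory. The bucket count and memory limit below are illustrative parameters; only counters cross machines until the final, small merge:

```python
def kth_by_refinement(machines, k, lo, hi, num_buckets=64, mem_limit=1024):
    """k-th smallest (0-based) of all values across `machines`, found
    by repeatedly histogramming the current range [lo, hi) and zooming
    into the bucket that contains rank k."""
    while True:
        width = (hi - lo) / num_buckets
        counts = [0] * num_buckets
        for arr in machines:             # each "machine" counts locally
            for x in arr:
                if lo <= x < hi:
                    counts[min(int((x - lo) / width), num_buckets - 1)] += 1
        seen = 0
        for b, c in enumerate(counts):   # master walks the summed counts
            if seen + c > k:
                break
            seen += c
        b_lo, b_hi = lo + b * width, lo + (b + 1) * width
        in_bucket = [x for arr in machines for x in arr if b_lo <= x < b_hi]
        if len(in_bucket) <= mem_limit:  # small enough: merge and select
            return sorted(in_bucket)[k - seen]
        lo, hi, k = b_lo, b_hi, k - seen # otherwise zoom in and recount
```

Each round shrinks the candidate set by roughly a factor of `num_buckets`, so very few rounds are needed in practice (assuming the values aren't all identical).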
b*h
#18
True. This is how people do K-means with MapReduce: instead of passing all the objects that belong to a cluster to the reducers, just pass some statistics.
【Quoting c*****e】
: Actually, this is a special Map/Reduce scenario. All the machine can pass
: its distribution (over each bucket) statistic (which is small compared to
: the original data) to one machine and then it can be determined that which
: bucket the median is in.
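The same "ship statistics, not records" idea for K-means can be sketched as a single iteration, with 1-D points for brevity (the function name and in-memory simulation are illustrative; in real MapReduce the sums and counts would be the mapper/combiner output):

```python
def kmeans_step(points, centers):
    """One K-means iteration in MapReduce style: the map side emits only
    per-cluster (sum, count) statistics instead of the points themselves;
    the reduce side divides sums by counts to get the new centers."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for p in points:                     # map side: accumulate local stats
        c = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        sums[c] += p
        counts[c] += 1
    # reduce side: new center = mean of assigned points (unchanged if empty)
    return [sums[i] / counts[i] if counts[i] else centers[i]
            for i in range(len(centers))]
```

The statistics per mapper are O(number of clusters), independent of the number of points, which mirrors the O(number of buckets) count vectors in the median discussion.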