一道大数据题，求最优解。 - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>JobHunting - 待字闺中

一道大数据题，求最优解。

一道大数据题，求最优解。# JobHunting - 待字闺中

l*z2013-05-24 07:05

1 楼

给两个文件，employee file 和 department file，其结构如下：
employee file:
employee_id, employ_name,department_id
.
.
.
employee_id, employ_name,department_id
department file:
department_id, department_name, manager_id
.
.
.
department_id, department_name, manager_id
要求生成一个输出文件：
employee_id, employ_name,manager_id,department_id,number_of_employee_in_this
_department
简单的解法是scan一遍employee file 就可以count每个部门的人数。
问题是输入的文件很大，不能全部load到内存。如何实现？
不考虑mapreduce啥的。

l*z2013-05-24 07:05

2 楼

自己顶一下。

t*a2013-05-24 07:05

3 楼

不给用map-reduce的话，
step 1. sort两个表，按照department_id，复杂度nlogn
step 2. join两个表，按照department_id，复杂度n+m
step 2.1. join的时候，把每个department的东东load进buffer，或者别的什么地方，
如果太大的话。数一下行数也就是count了，在output时候放在尾巴上

d*x2013-05-24 07:05

4 楼

说起来，这种relational还大数据的东西，其实就是明显的暗示用Hive/Pig吧。。。

【在 t****a 的大作中提到】

: 不给用map-reduce的话，
: step 1. sort两个表，按照department_id，复杂度nlogn
: step 2. join两个表，按照department_id，复杂度n+m
: step 2.1. join的时候，把每个department的东东load进buffer，或者别的什么地方，
: 如果太大的话。数一下行数也就是count了，在output时候放在尾巴上

t*a2013-05-24 07:05

5 楼

楼主说不给map-reduce，hive pig估计就更不给用了。relational的东西sort好了join
也很快的啊。

【在 d**********x 的大作中提到】

: 说起来，这种relational还大数据的东西，其实就是明显的暗示用Hive/Pig吧。。。

z*32013-05-24 07:05

6 楼

不考虑mapreduce你就自己实现一个mapreduce么
用spring来做mapreduce
同样原理，impl一下就好了
还能怎么做？不能全部读入内存，就只能拆分
挨个处理，话说mapreduce也是这个思想
要是这个如果不让做，那还能怎么做？
多线程？要问问考官要求什么

z*o2013-05-24 07:05

7 楼

build hash table with dept id as key a.

【在 z*******3 的大作中提到】

: 不考虑mapreduce你就自己实现一个mapreduce么
: 用spring来做mapreduce
: 同样原理，impl一下就好了
: 还能怎么做？不能全部读入内存，就只能拆分
: 挨个处理，话说mapreduce也是这个思想
: 要是这个如果不让做，那还能怎么做？
: 多线程？要问问考官要求什么

t*a2013-05-24 07:05

8 楼

Good point! 如果department总数较少，可以直接把第二个文件放进内存。

【在 z********o 的大作中提到】

: build hash table with dept id as key a.

s*n2013-05-24 07:05

9 楼

传统的sql题目吧。
dept info可能可以放入内存。然后开始读emp, append, read into the output file.

【在 l*****z 的大作中提到】

: 自己顶一下。

A*g2013-05-24 07:05

10 楼

就是实现一个sql的join，根本不算大数据
employee (employee_id, employ_name,department_id)
department (department_id, department_name, manager_id)
select employee_id, employ_name,manager_id,department_id,number_of_employee_
in_this
_department from employee as E, department as D where E.department_id=D.
department_id;
算法有
nested loop join m, n 分别是tuple数的I/Os
blocked nested loop join， O(M*N) M, N是block数的I/Os
external sort merge然后两个都只要各扫一遍 O(N(logN) + M + N)
hash join，hash小的表，这个情况是department，扫另一个，适合小表放进内存， O(
N) N是大表block树
要根据具体数据分布选择，计算cost最小的方法。一般来说，如果其中一个可以hash
进内存，用hash join, 如果两个都很大，用external sort merge。

b*t2013-05-24 07:05

11 楼

MPI 实现Terasort

y*u2013-05-24 07:05

12 楼

最近在上coursera的课。看到一个mapreduce的伪实现，参考一下。
in python
MapReduce.py
import json
class MapReduce:
def __init__(self):
self.intermediate = {}
self.result = []
def emit_intermediate(self, key, value):
self.intermediate.setdefault(key, [])
self.intermediate[key].append(value)
def emit(self, value):
self.result.append(value)
def execute(self, data, mapper, reducer):
for line in data:
record = json.loads(line)
mapper(record)
for key in self.intermediate:
reducer(key, self.intermediate[key])
#jenc = json.JSONEncoder(encoding='latin-1')
jenc = json.JSONEncoder()
for item in self.result:
print jenc.encode(item)
wordcount.py
import MapReduce
import sys
"""
Word Count Example in the Simple Python MapReduce Framework
"""
mr = MapReduce.MapReduce()
# =============================
# Do not modify above this line
def mapper(record):
# key: document identifier
# value: document contents
key = record[0]
value = record[1]
words = value.split()
for w in words:
mr.emit_intermediate(w, 1)
def reducer(key, list_of_values):
# key: word
# value: list of occurrence counts
total = 0
for v in list_of_values:
total += v
mr.emit((key, total))
# Do not modify below this line
# =============================
if __name__ == '__main__':
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)