Reviewer needed on human sensing, sensing system, image processing - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>Immigration - 落地生根

Reviewer needed on human sensing, sensing system, image processing

Reviewer needed on human sensing, sensing system, image processing# Immigration - 落地生根

l*e2013-08-17 07:08

1 楼

刚刚在中offline coding test中挂了，比较郁闷。我把问题和我的code 贴出来，希望
大家给些指导意见。
题目是这样的：
有几个非常大的file，每个file里包括最多一个million的行，每一行是一个sentence
，比如:
The sky is blue.
She said "she is a good girl".
Don't put it into firewood!
要求是：
1. 读取和处理这些file，当处理完毕后，给出提示符 "query>",等待输入
2. 输入的是一个query sentence, 包括很多很多key words。输出的是那些file中的所
有行，这些行包括所有query sentence的key words。要求打印出所在file的行数，行
的内容，和file的名字。
考察的重点是，如何efficiency的返回所有行。同时包括clean code，和corner case
的考虑。
题目要求必须2个小时做完, time的。
我费了九牛二虎之力，还是挂了。我把我的code上传
https://github.com/lsersime/oki/blob/master/test.cpp
new grad, 写得不好，平时刷leetcode，没顾及这种project，希望大家给些建设性意
见。
谢谢！

q*32013-08-17 07:08

2 楼

one more needed. you had better work on human sensing, intelligent sensing
system, and image processing fields. thx.

m*n2013-08-17 07:08

3 楼

A couple issues that come to mind:
1. With your def of wordLineMap you can only handle one file correctly.
2. you should presort occurrences of each word by filename:lineno, then
query could be done similar to merge-sort.
3. for each occurrence of a word, record file name, line number, and offset
of the line-start in file. Use file seek to retrieve line, do not read line
by line.
4. Your code records duplicate occurrences if a word appears on a line more
than once. Think 'a' and 'the' etc.
5. Since each file can have a million line, memory requirement may be part
of the test. You should make ball park estimates on size of input your
solution can handle. Or you can just shard the world line map and store them
in a few files.

sentence

【在 l******e 的大作中提到】

: 刚刚在中offline coding test中挂了，比较郁闷。我把问题和我的code 贴出来，希望
: 大家给些指导意见。
: 题目是这样的：
: 有几个非常大的file，每个file里包括最多一个million的行，每一行是一个sentence
: ，比如:
: The sky is blue.
: She said "she is a good girl".
: Don't put it into firewood!
: 要求是：
: 1. 读取和处理这些file，当处理完毕后，给出提示符 "query>",等待输入

q*32013-08-17 07:08

4 楼

b*52013-08-17 07:08

5 楼

a million line, and u need to shard???!! what kind of machine are u using???

offset
line
more

【在 m*****n 的大作中提到】

: A couple issues that come to mind:
: 1. With your def of wordLineMap you can only handle one file correctly.
: 2. you should presort occurrences of each word by filename:lineno, then
: query could be done similar to merge-sort.
: 3. for each occurrence of a word, record file name, line number, and offset
: of the line-start in file. Use file seek to retrieve line, do not read line
: by line.
: 4. Your code records duplicate occurrences if a word appears on a line more
: than once. Think 'a' and 'the' etc.
: 5. Since each file can have a million line, memory requirement may be part

q*32013-08-17 07:08

6 楼

d*e2013-08-17 07:08

7 楼

CPP？ good luck.
程序其实蛮简单的。python为例。
1.一个函数open file, getlines, yield
2. 一个函数match, yield
3. 一个函数打印
最后main,loop files, 穿起来三个寒暑做pipeline,结束等待输入
熟手半个小时写完，半个小时调试。
cpp就啧啧了。
楼下的大作业做过没？1M行慢的机器2分钟也搞定了。

sentence

【在 l******e 的大作中提到】

m*n2013-08-17 07:08

8 楼

Not sharding the data file. I mean breaking up the processed wordLineMap
into several files.
Assume 20 words per line on average, a million-line files gives you
20 million word occurrences, and each occurrence probably needs 16 bytes (
for file id, line no, and line offset], assuming a naive representation).
That's 3G memory for one file. Using
file to store the wordLineMap on disk we can handle maybe 10s of thousands
of files. Besides, records for unlike query words like 'a' and 'the' etc can
be stored separately from other words. These records may not even be loaded
into memory.

??

【在 b**********5 的大作中提到】

: a million line, and u need to shard???!! what kind of machine are u using???
:
: offset
: line
: more

m*n2013-08-17 07:08

9 楼

Do you mean to repeat this sequence for every query?
I think the test wants you to preprocess the input to support
multiple queries over time, just like web search. OP seems to
think so too.

【在 d******e 的大作中提到】

: CPP？ good luck.
: 程序其实蛮简单的。python为例。
: 1.一个函数open file, getlines, yield
: 2. 一个函数match, yield
: 3. 一个函数打印
: 最后main,loop files, 穿起来三个寒暑做pipeline,结束等待输入
: 熟手半个小时写完，半个小时调试。
: cpp就啧啧了。
: 楼下的大作业做过没？1M行慢的机器2分钟也搞定了。
:

d*e2013-08-17 07:08

10 楼

那里有every query?只有几个文件。
而且就算有多个quey,就算是多个observers把。程序小改两行就可以了。

【在 m*****n 的大作中提到】

: Do you mean to repeat this sequence for every query?
: I think the test wants you to preprocess the input to support
: multiple queries over time, just like web search. OP seems to
: think so too.

d*e2013-08-17 07:08

11 楼

你不需要吧文件都load到内存。
readlines函数返回一个iterator。
你next几步之后它回读新的到内存，否则就停在那里。

thousands
can
loaded

【在 m*****n 的大作中提到】

: Not sharding the data file. I mean breaking up the processed wordLineMap
: into several files.
: Assume 20 words per line on average, a million-line files gives you
: 20 million word occurrences, and each occurrence probably needs 16 bytes (
: for file id, line no, and line offset], assuming a naive representation).
: That's 3G memory for one file. Using
: file to store the wordLineMap on disk we can handle maybe 10s of thousands
: of files. Besides, records for unlike query words like 'a' and 'the' etc can
: be stored separately from other words. These records may not even be loaded
: into memory.

l*s2013-08-17 07:08

12 楼

百万级的查询恐怕需要Hash存储查询

d*e2013-08-17 07:08

13 楼

i see what you are saying.
要做inverted index话，需要map reduce.
这个2个小时搞不定。
直接用hash table话。内存又不够。
所以最简单的，每个keyword 写一个file.
query时候打开n个file,内容都dump出来好了。

【在 d******e 的大作中提到】

: 你不需要吧文件都load到内存。
: readlines函数返回一个iterator。
: 你next几步之后它回读新的到内存，否则就停在那里。
:
: thousands
: can
: loaded

b*52013-08-17 07:08

14 楼

直接用hash table话。内存又不够。..
based on what?

【在 d******e 的大作中提到】

: i see what you are saying.
: 要做inverted index话，需要map reduce.
: 这个2个小时搞不定。
: 直接用hash table话。内存又不够。
: 所以最简单的，每个keyword 写一个file.
: query时候打开n个file,内容都dump出来好了。

z*o2013-08-17 07:08

15 楼

TreeMap轻松搞定

Z*02013-08-17 07:08

16 楼

如果用python来整invert table，2个小时有可能。用map／reduce太高级了，直接强攻。

【在 d******e 的大作中提到】

S*n2013-08-17 07:08

17 楼

这是EA的，我也面了。其实这个题和文件搜索一点关系都没有，我当时把它当成文件搜
索来做，加了stop word removal，weighting，inverted indexing之类的，但是两个
小时实在来不及写，就全写在一个程序里，后来他们又给了我4个小时让我重写。但是
测试的那个人居然不懂stop word removal最后没让我过。我就去找他们argu了，然后
给我加面了一轮coding后拿到了onsite。不过onsite各种悲惨，最后还是挂了。
至于你的code，一开始的设定应该是所有的文件路径写在一个file里面，然后输入的是
这个包含路径的文件的路径。然后tokenize的时候，你只考虑了小写的字母，但实际上
你事先需要把所有的换成小写的，还有就是有的单词中间有-，你不能把这个去掉，还
有就是有时候两个单词中间只有标点没有空格。然后因为是大数据，你不能直接用
string，你得把每个单词换成数字，至于ls说的单词重复次数，在这个project里面不
需要考虑。
我觉得你的主要问题应该是处理的时候没有把string换成word id，导致来不及。不过
他们不会看你的代码的，只会去跑你的程序，只有一定数量的testing case通过了才算
过。我当时就是他们没设好stop word list，导致缺失了一部分。

S*n2013-08-17 07:08

18 楼

如果有需要的话，我可以把我的代码给你看。我的最后他们换了stop word list后是通
过了

k*r2013-08-17 07:08

19 楼

想知道这道题tokenize的时候，找符号之前的单词，除了!,.?";还有其他符号吗？大家
都用什么办法tokenize呢？

【在 S**********n 的大作中提到】

: 这是EA的，我也面了。其实这个题和文件搜索一点关系都没有，我当时把它当成文件搜
: 索来做，加了stop word removal，weighting，inverted indexing之类的，但是两个
: 小时实在来不及写，就全写在一个程序里，后来他们又给了我4个小时让我重写。但是
: 测试的那个人居然不懂stop word removal最后没让我过。我就去找他们argu了，然后
: 给我加面了一轮coding后拿到了onsite。不过onsite各种悲惨，最后还是挂了。
: 至于你的code，一开始的设定应该是所有的文件路径写在一个file里面，然后输入的是
: 这个包含路径的文件的路径。然后tokenize的时候，你只考虑了小写的字母，但实际上
: 你事先需要把所有的换成小写的，还有就是有的单词中间有-，你不能把这个去掉，还
: 有就是有时候两个单词中间只有标点没有空格。然后因为是大数据，你不能直接用
: string，你得把每个单词换成数字，至于ls说的单词重复次数，在这个project里面不