one algorithm question - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>CS - 计算机科学

one algorithm question

one algorithm question# CS - 计算机科学

c*d2005-07-19 07:07

1 楼

a log file contains n diferent kinds of items
how to count the occurence for each items if n is too big(out of memory)?
thx

f*p2005-07-19 07:07

2 楼

Is it a algorithm question? Are you counting how many stars in the universe?

【在 c**d 的大作中提到】

: a log file contains n diferent kinds of items
: how to count the occurence for each items if n is too big(out of memory)?
: thx

b*n2005-07-19 07:07

3 楼

how big is too big? greater than 2^(4G) ?

【在 c**d 的大作中提到】

: a log file contains n diferent kinds of items
: how to count the occurence for each items if n is too big(out of memory)?
: thx

c*d2005-07-19 07:07

4 楼

assume it does not fit into memory

【在 b***n 的大作中提到】

: how big is too big? greater than 2^(4G) ?

s*e2005-07-19 07:07

5 楼

允许写文件吗？说个简单的。
读item,遇到新type，用一个新空间记录count，遇到已有type，增加count
如果memory已满，输出没处理的新type的item到新文件。
下次再扫描这个新文件。

【在 c**d 的大作中提到】

: a log file contains n diferent kinds of items
: how to count the occurence for each items if n is too big(out of memory)?
: thx

f*p2005-07-19 07:07

6 楼

我怀疑他根本就没听懂问题。

【在 s***e 的大作中提到】

: 允许写文件吗？说个简单的。
: 读item,遇到新type，用一个新空间记录count，遇到已有type，增加count
: 如果memory已满，输出没处理的新type的item到新文件。
: 下次再扫描这个新文件。

m*s2005-07-19 07:07

7 楼

Right on. It is too vague.
What is contained in the log? Strings? Objects?
If strings, use tries/hash.
If objects, convert to binary, then use hash.
Either way, split and store chunks of temp data on disk. cache them if needed.

【在 f*****p 的大作中提到】

: 我怀疑他根本就没听懂问题。

c*c2005-07-19 07:07

8 楼

这个好像是search engine的interview里头比较容易碰到的问题?
如果是有多台机器可以一起跑, 指定一台为front end, 其余机器
为back end, 每台back end机器负责一部分item的计数(比如第一
台机器负责A打头的字符串, 第二台机器负责B打头的字符串, 诸
如此类); 由front end机器顺序读那个文件, 把每个item发给相
应的backend机器让它计数. 最后处理完后每台back end机器按
顺序把计好的数写到一个文件里就行了.
如果只有一台机器, 没有多台机器一起跑, 但允许随时写文件,
俺突然想到的, 似乎有一种比较通用的一种做法, 在memory满之
前跟你说的一样, 但memory满之后, 要把当前memory的内容排个
序然后dump到一个新文件里, 然后接着读原来的文件, 一样的做
法, 到memory满了再dump到一个新的文件. 最后全部处理完了,
把新生成的那一堆文件打开, 因为都已经排好序了, 很容易就可
以把结果combine起来, 写入一个最终的结果文件.

【在 s***e 的大作中提到】

w*e2005-07-19 07:07

9 楼

Assume n is too big, and existed items are sparse compared to n, you may
want to look at the multiple-dimension aggregation algorithm in data mining
. Although it is multi-dimension, the algorithm have good performance on how
to iterate through the whole dimension and acount for each item in cell.

【在 c**c 的大作中提到】

: 这个好像是search engine的interview里头比较容易碰到的问题?
: 如果是有多台机器可以一起跑, 指定一台为front end, 其余机器
: 为back end, 每台back end机器负责一部分item的计数(比如第一
: 台机器负责A打头的字符串, 第二台机器负责B打头的字符串, 诸
: 如此类); 由front end机器顺序读那个文件, 把每个item发给相
: 应的backend机器让它计数. 最后处理完后每台back end机器按
: 顺序把计好的数写到一个文件里就行了.
: 如果只有一台机器, 没有多台机器一起跑, 但允许随时写文件,
: 俺突然想到的, 似乎有一种比较通用的一种做法, 在memory满之
: 前跟你说的一样, 但memory满之后, 要把当前memory的内容排个