1G内存读10G文件 - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>JobHunting - 待字闺中

1G内存读10G文件

1G内存读10G文件# JobHunting - 待字闺中

m*12012-11-02 07:11

1 楼

在看cracking的时候想到了这个东西，有一下几个不明白的地方：
1 假设有一个10G的文件里面都是string，怎么将所有的string全部排序，书上说
external sort，完了不停的merge，但是怎么split文件呢。
2 单纯的来说读大文件的问题，如果只有1G内存但是要读10G的文件，怎么搞。
求论坛上的大神们解释。

g*e2012-11-02 07:11

2 楼

os用virtual mem给你搞定了。当然你也可以自己分block读入

【在 m*******1 的大作中提到】

: 在看cracking的时候想到了这个东西，有一下几个不明白的地方：
: 1 假设有一个10G的文件里面都是string，怎么将所有的string全部排序，书上说
: external sort，完了不停的merge，但是怎么split文件呢。
: 2 单纯的来说读大文件的问题，如果只有1G内存但是要读10G的文件，怎么搞。
: 求论坛上的大神们解释。

m*12012-11-02 07:11

3 楼

程序里面怎么分block实现？

os用virtual mem给你搞定了。当然你也可以自己分block读入

【在 g*****e 的大作中提到】

: os用virtual mem给你搞定了。当然你也可以自己分block读入

c*a2012-11-02 07:11

4 楼

read 1g data in to main memory and sort with some algorithm
write the sorted data on disk, repeat the step. then u have 10 files on the
disk, 1 g each
read first 90 mb of each file,total 900mb, sort them, the last 100mb for
output buffer , repeat......
这样行不行

m*12012-11-02 07:11

5 楼

如何控制只读1g? 有什么stream控制器可以控制停在1G位置处，下一次接着读呢？还是
说有别的控制的方法？

the
for

【在 c*****a 的大作中提到】

: read 1g data in to main memory and sort with some algorithm
: write the sorted data on disk, repeat the step. then u have 10 files on the
: disk, 1 g each
: read first 90 mb of each file,total 900mb, sort them, the last 100mb for
: output buffer , repeat......
: 这样行不行

c*a2012-11-02 07:11

6 楼

c好像有个lseek什么的

f*n2012-11-02 07:11

7 楼

"read first 90 mb of each file,total 900mb, sort them"
But this is not necessarily the first 900mb of the sorted output.

the
for

【在 c*****a 的大作中提到】

m*12012-11-02 07:11

8 楼

g*e2012-11-02 07:11

9 楼

os用virtual mem给你搞定了。当然你也可以自己分block读入

【在 m*******1 的大作中提到】

m*12012-11-02 07:11

10 楼

程序里面怎么分block实现？

os用virtual mem给你搞定了。当然你也可以自己分block读入

【在 g*****e 的大作中提到】

: os用virtual mem给你搞定了。当然你也可以自己分block读入

c*a2012-11-02 07:11

11 楼

m*12012-11-02 07:11

12 楼

如何控制只读1g? 有什么stream控制器可以控制停在1G位置处，下一次接着读呢？还是
说有别的控制的方法？

the
for

【在 c*****a 的大作中提到】

c*a2012-11-02 07:11

13 楼

c好像有个lseek什么的

f*n2012-11-02 07:11

14 楼

"read first 90 mb of each file,total 900mb, sort them"
But this is not necessarily the first 900mb of the sorted output.

the
for

【在 c*****a 的大作中提到】

s*22012-11-02 07:11

15 楼

不知道string排序是什么意思
假设就是10G LW 整数，给一个4G内存空间对10G数排序，也假设有个subroutine能把1G
整数在4G内存中排序
1. read 1st 1G numbers from disk
2. sort using subroutine
3. read numbers from disk， make sure the smallest 1G number in ram and
write the larger ones back in disk
4. write the smallest 1G numbers back in disk
repeat 1 ~4 for the second 1G number till the end
Notes:
You may rescale the 4G memory to 1G or any other size
string 排序要是指有index之类的结构（比方每个string做hash），又可以基于此方
法对index操作

【在 m*******1 的大作中提到】

f*n2012-11-02 07:11

16 楼

你这个“make sure the smallest 1G number in ram and write the larger ones
back in disk”打算具体怎么操作？什么复杂度？
还有，如果有n G，你要重复这个过程n次，每次要起码O(n)时间，那整个过程就起码要
O(n^2)。不是很好效率

1G

【在 s******2 的大作中提到】

: 不知道string排序是什么意思
: 假设就是10G LW 整数，给一个4G内存空间对10G数排序，也假设有个subroutine能把1G
: 整数在4G内存中排序
: 1. read 1st 1G numbers from disk
: 2. sort using subroutine
: 3. read numbers from disk， make sure the smallest 1G number in ram and
: write the larger ones back in disk
: 4. write the smallest 1G numbers back in disk
: repeat 1 ~4 for the second 1G number till the end
: Notes:

s*22012-11-02 07:11

17 楼

我认为这道题的主要问题是对用户程序来说ram和disk都不是transparent，系统调用背
后实际上都是块操作，disk更是通过dma进行的，所以实际解决这个问题要用几个数据
结构rescaling和fine tuning每次传输的数据块的大小，才能提高效率/速度
单纯的强调操作次数是纯理论的，实际上根本不可能只对一个数在磁盘上改写，而这些
数又是随机的，所以必须首先对数据结构化，最基本的就是，划分1G内存，比如
- 读512M数进行排序
- 再每次从disk上读256M，
- 跟前面的512M数据比较，小的留下，大的存在一个临时的buffer，size 4K～256K，
这样可以一边比较一边把大的数回写到磁盘上
- 像你说的，这个可能O(n^2)，但这是cpu比较的次数，磁盘的读写要少的多，cpu比起
disk要便宜的多
我前面说的是一个大概的思路，实际实现要看具体数据内容，你有好的方法不妨说一下

【在 f*******n 的大作中提到】

: 你这个“make sure the smallest 1G number in ram and write the larger ones
: back in disk”打算具体怎么操作？什么复杂度？
: 还有，如果有n G，你要重复这个过程n次，每次要起码O(n)时间，那整个过程就起码要
: O(n^2)。不是很好效率
:
: 1G

S*12012-11-02 07:11

18 楼

Unix has file descriptor, and use lseek such kind of thing so you can
control the size you want to read.
Not sure if it is in this case.

f*n2012-11-02 07:11

19 楼

你这个“跟前面的512M数据比较，小的留下”很容易需要O(n * 512mb)

【在 s******2 的大作中提到】

: 我认为这道题的主要问题是对用户程序来说ram和disk都不是transparent，系统调用背
: 后实际上都是块操作，disk更是通过dma进行的，所以实际解决这个问题要用几个数据
: 结构rescaling和fine tuning每次传输的数据块的大小，才能提高效率/速度
: 单纯的强调操作次数是纯理论的，实际上根本不可能只对一个数在磁盘上改写，而这些
: 数又是随机的，所以必须首先对数据结构化，最基本的就是，划分1G内存，比如
: - 读512M数进行排序
: - 再每次从disk上读256M，
: - 跟前面的512M数据比较，小的留下，大的存在一个临时的buffer，size 4K～256K，
: 这样可以一边比较一边把大的数回写到磁盘上
: - 像你说的，这个可能O(n^2)，但这是cpu比较的次数，磁盘的读写要少的多，cpu比起

s*22012-11-02 07:11

20 楼

512M是sorted，是拿新的block 256MB 或 whatever 跟这个512M 里的数比，等512M排
好序，就被写回disk，下次就10G-512MB，以此类推 9G，8.5G, 。。。
当然就比较次数上算还是太复杂，但强调这个没有太多意义
还要看，这10G里的内容是什么，比方要是纯32bits整数肯定有重复的，就要看它们的
发布等等，而要是100个100MB的video就简单多了

【在 f*******n 的大作中提到】

: 你这个“跟前面的512M数据比较，小的留下”很容易需要O(n * 512mb)