Random split of a file - MITBBS (未名空间) historical archive

Java board (爪哇娇娃)

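c*e 2013-11-16 08:11

#1

I have a file containing many lines of text. I want to split it randomly by a given ratio, say 3:7, and write the two parts to two other files. What is the best way to do this?
The approach below does not seem ideal:
1. First pass: read the whole file line by line and count the lines, n.
2. From the 3:7 ratio, compute the line counts of the two parts, a and b.
3. Second pass: read the whole file line by line and use a random sampling algorithm to pick the lines for the first part, storing them in listA (a lines).
4. Third pass: read the whole file line by line and store every line that is not in listA into listB (b lines).
Do I really need to scan the file three times?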
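
For reference, the exact-count plan above can be sketched in Java with two scans instead of three. This is only an illustration under assumptions: it fills in the unspecified "random sampling algorithm" with selection sampling (pick each line with probability needed/remaining), which is not necessarily what the poster had in mind, and it rounds 0.3 * n to the nearest integer when n is not a multiple of 10. The names TwoScanExactSplit and split are hypothetical, and UTF-8 input is assumed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

final class TwoScanExactSplit {
    /**
     * Writes exactly round(fraction * n) randomly chosen lines to fileA and
     * the rest to fileB, keeping the original line order in both files.
     * Returns {linesInA, linesInB}.
     */
    static long[] split(Path input, Path fileA, Path fileB, double fraction, Random rng)
            throws IOException {
        // Scan 1: count the lines (step 1 of the plan).
        long n = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            while (in.readLine() != null) {
                n++;
            }
        }

        // Step 2: a = rounded share for file A, b = n - a.
        long a = Math.round(fraction * n);
        long need = a;       // lines still to be picked for file A
        long remaining = n;  // lines not yet read in scan 2

        // Scan 2: selection sampling, writing both outputs as we go.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter outA = Files.newBufferedWriter(fileA, StandardCharsets.UTF_8);
             BufferedWriter outB = Files.newBufferedWriter(fileB, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Pick this line with probability need / remaining.
                if (rng.nextDouble() * remaining < need) {
                    outA.write(line);
                    outA.newLine();
                    need--;
                } else {
                    outB.write(line);
                    outB.newLine();
                }
                remaining--;
            }
        }
        return new long[] {a, n - a};
    }
}
```

Because each line goes straight to one of the two output files, listA never has to be held in memory and the third scan is not needed.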

b*y 2013-11-16 08:11

#2

Init a random number generator;
Parse the file once;
For each line, run the random number generator and get a number between 0.0 and 1.0;
If the number is below 0.3, put the line into file A, else put the line into file B;
Done.
One pass is all you need.
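
A minimal Java sketch of the one-pass procedure described above, assuming UTF-8 text files. The class name OnePassSplit, the method split, and its parameters are made up for illustration and do not come from the thread. Each line is sent to file A with probability p (0.3 for a 3:7 split), so the output sizes are only approximately in the requested ratio.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

final class OnePassSplit {
    /**
     * Reads input once and sends each line to fileA with probability p,
     * otherwise to fileB. Returns {linesInA, linesInB}.
     */
    static long[] split(Path input, Path fileA, Path fileB, double p, Random rng)
            throws IOException {
        long a = 0;
        long b = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter outA = Files.newBufferedWriter(fileA, StandardCharsets.UTF_8);
             BufferedWriter outB = Files.newBufferedWriter(fileB, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // nextDouble() is uniformly distributed in [0.0, 1.0).
                if (rng.nextDouble() < p) {
                    outA.write(line);
                    outA.newLine();
                    a++;
                } else {
                    outB.write(line);
                    outB.newLine();
                    b++;
                }
            }
        }
        return new long[] {a, b};
    }
}
```

A call such as `OnePassSplit.split(Paths.get("in.txt"), Paths.get("a.txt"), Paths.get("b.txt"), 0.3, new Random())` would write roughly 30% of the lines to a.txt; the file names are placeholders.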

b*y 2013-11-16 08:11

#3

Also, the simplest algorithm is the best algorithm; don't overcomplicate things. In real work, keeping it simple is the way to go.

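c*e 2013-11-16 08:11

#4

Cool, that simple. But one question: does this guarantee that every line is picked with equal probability? For example, generating random integers between 0 and 9:
Random r = new Random();
for (int i = 0; i < 10; i++) {
    System.out.println(r.nextInt(10) + " ");
}
In the result, some values repeat several times (4, 6, 3) and some never appear (2, 7, 9, 0). If the file were split according to this run of random numbers, two lines (0, 1) would go to one file while the other file would contain 8 lines:
4
8
4
6
1
5
3
6
3
3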

[Quoting b******y:]

: Init a random number generator;
: Parse the file once;
: For each line, run the random number generator and get a number between 0.0
: ~ 1.0;
: If the number is below 0.3, put the line into file A, else put the line into
: file B;
: done.
: One pass is all you need.

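w*z 2013-11-16 08:11

#5

Run it a thousand times and then look.

[Quoting c*******e:]

: Cool, that simple. But one question: does this guarantee that every line is
: picked with equal probability? For example, generating random integers
: between 0 and 9:
: Random r = new Random();
: for(int i=0; i<10; i++){
: System.out.println(r.nextInt(10) + " ");
: }
: In the result, some values repeat several times (4, 6, 3) and some never
: appear (2, 7, 9, 0). If the file were split according to this run of random
: numbers, two lines (0, 1) would go to one file while the other file would
: contain 8 lines:
: 4
: 8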

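c*e 2013-11-16 08:11

#6

I tested it: over 1000 iterations with a 3:7 split I got 299:701, so it is roughly random and the probabilities are roughly equal. If an exact split is required, some other method is needed. But this simple method meets my needs. The method I described above gives an exact ("absolute") split with no repeats.
Random r = new Random();
int a = 0;
int b = 0;
for (int i = 0; i < 1000; i++) {
    int random = r.nextInt(10);
    System.out.println(random);

    if (random < 3) {
        a++;
    } else {
        b++;
    }
}

System.out.println("a=" + a + " b=" + b);

a=299 b=701

[Quoting w**z:]

: Run it a thousand times and then look.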

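c*e 2013-11-16 08:11

#7

When I increase it to 1 million iterations, the gap gets bigger and bigger:
a=299373 b=700627

[Quoting c*******e:]

: I tested it: over 1000 iterations with a 3:7 split I got 299:701, so it is
: roughly random and the probabilities are roughly equal. If an exact split is
: required, some other method is needed. But this simple method meets my needs.
: The method I described above gives an exact ("absolute") split with no repeats.
: Random r = new Random();
: int a = 0;
: int b = 0;
: for(int i=0; i<1000; i++){
: int random = r.nextInt(10);
: System.out.println(random);
: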

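d*i 2013-11-16 08:11

#8

With that many iterations, it would be best to reseed every hundred iterations.

[Quoting c*******e:]

: When I increase it to 1 million iterations, the gap gets bigger and bigger:
: a=299373 b=700627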

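g*g 2013-11-16 08:11

#9

It did not get bigger; you have to look at the ratio.

[Quoting c*******e:]

: When I increase it to 1 million iterations, the gap gets bigger and bigger:
: a=299373 b=700627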
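
The arithmetic behind this reply can be checked with a small Java sketch using the two results c*e reported (299:701 and 299373:700627). The class and method names are hypothetical. It computes the observed fraction in bucket a and, for scale, the standard deviation sqrt(n * p * (1 - p)) of a binomial count, assuming the draws are independent with p = 0.3.

```java
final class RatioCheck {
    /** Fraction of all draws that landed in bucket a. */
    static double fraction(long a, long b) {
        return (double) a / (a + b);
    }

    /** How far the count in bucket a is from the target p * n, in draws. */
    static double absoluteGap(long a, long b, double p) {
        return Math.abs(a - p * (a + b));
    }

    /** Standard deviation of the bucket-a count for n independent draws. */
    static double expectedSpread(long n, double p) {
        return Math.sqrt(n * p * (1 - p));
    }
}
```

For 299:701 the fraction is 0.299, which is 1 draw short of 300 against an expected spread of about 14.5. For 299373:700627 the fraction is 0.299373, which is 627 draws short of 300000 against an expected spread of about 458. The absolute gap grows with n, but the fraction moves closer to 0.3.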

w*z 2013-11-16 08:11

#10

You want exactly 3:7? What if the line count is not a multiple of 10?

[Quoting c*******e:]

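m*t 2013-11-16 08:11

#11

It seems two passes are enough:
1. First pass: read through the file once to get the total line count n.
2. Take the array A = [1..n] and random-shuffle it to get a new array B (in C++ the STL has random_shuffle; Java probably has a similar method). The first 30% of B (B1) and the last 70% (B2) are then the line numbers for the two parts.
3. Second pass: read the file sequentially; for each line number a, if a is in B1, store the line in listA, otherwise store it in listB.
This version is unoptimized; for the lookup, a hash set would be better.

[Quoting c*******e:]

: I have a file containing many lines of text. I want to split it randomly by a
: given ratio, say 3:7, and write the two parts to two other files. What is the
: best way to do this?
: The approach below does not seem ideal:
: 1. First pass: read the whole file line by line and count the lines, n.
: 2. From the 3:7 ratio, compute the line counts of the two parts, a and b.
: 3. Second pass: read the whole file line by line and use a random sampling
: algorithm to pick the lines for the first part, storing them in listA (a lines).
: 4. Third pass: read the whole file line by line and store every line that is
: not in listA into listB (b lines).
: Do I really need to scan the file three times?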
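
The two-pass shuffle idea from m*t's post might look roughly like this in Java. Java's counterpart to random_shuffle is Collections.shuffle, and the chosen line numbers (B1) are kept in a HashSet for fast lookup. ShuffleSplit, pickLineNumbers and split are hypothetical names. The sketch assumes UTF-8 input and a line count that fits in an int, holds the n line numbers in memory, and rounds the first part's size to the nearest integer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class ShuffleSplit {
    /** Returns round(fraction * n) distinct 1-based line numbers chosen at random. */
    static Set<Integer> pickLineNumbers(int n, double fraction, Random rng) {
        List<Integer> numbers = new ArrayList<>(n);
        for (int i = 1; i <= n; i++) {
            numbers.add(i);                          // A = [1..n]
        }
        Collections.shuffle(numbers, rng);           // B = shuffled A
        int a = (int) Math.round(fraction * n);
        return new HashSet<>(numbers.subList(0, a)); // B1, the first part of B
    }

    /** Pass 1 counts lines; pass 2 routes each line by its line number. */
    static void split(Path input, Path fileA, Path fileB, double fraction, Random rng)
            throws IOException {
        int n = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            while (in.readLine() != null) {
                n++;
            }
        }
        Set<Integer> partA = pickLineNumbers(n, fraction, rng);
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter outA = Files.newBufferedWriter(fileA, StandardCharsets.UTF_8);
             BufferedWriter outB = Files.newBufferedWriter(fileB, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                BufferedWriter out = partA.contains(lineNo) ? outA : outB;
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Unlike the one-pass method, this always puts exactly round(fraction * n) lines in the first file, at the cost of a second read and memory proportional to n.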