random sampling with replacement, how? - MITBBS (未名空间) historical archive

Board: Database (数据库)

b*m 2008-07-09 07:07

Post 1

What a wonderful thing it is to have hope!

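c*t 2008-07-09 07:07

Post 2

I have a table (id bigint, char* data), where id is unique but not
necessarily consecutive.
Question: how do I pick N rows from this table (in random order, repeats allowed)?
For example, given the table
1 a
9 b
7 c
8 d
picking 3 rows could yield abc, aac, and so on. PostgreSQL has
select * from table order by random(), but that appears to be
random sampling w/o replacement, which is not what I want.
thx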

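l*h 2008-07-09 07:07

Post 3

Nice music.
There is always a road once the cart reaches the mountain, and every wall has a door, so there is no need to agonize over anything.
Look at more beautiful things, listen to more beautiful music, and store up positive energy :)

★ Sent from iPhone App: ChineseWeb 8.7

[b***m wrote:]

: What a wonderful thing it is to have hope!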

I*e 2008-07-09 07:07

Post 4

If you know the number of rows in the table, it should be easy, right?
Otherwise, it is a classic problem: pool sampling (usually called reservoir sampling).
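For the case where the row count is not known up front, one way to get a sample with replacement from a single pass is to keep N independent one-slot reservoirs. The sketch below assumes that "pool sampling" means reservoir sampling; the function name and the in-memory row list are hypothetical, and the cost is O(N) work per row, so it is an illustration rather than something tuned for a million-row table.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_with_replacement_stream(
    rows: Iterable[T], n: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw n rows with replacement from a stream of unknown length.

    Each of the n slots is its own size-1 reservoir: when the k-th row
    arrives, a slot is overwritten by that row with probability 1/k.
    The slots are independent, so the same row may fill several slots.
    """
    rng = rng or random.Random()
    slots: List[T] = []
    for k, row in enumerate(rows, start=1):
        if k == 1:
            # The first row fills every slot (probability 1/1).
            slots = [row] * n
            continue
        for i in range(n):
            if rng.random() < 1.0 / k:
                slots[i] = row
    return slots


# Hypothetical rows matching the example table in the question.
example_rows = [(1, "a"), (9, "b"), (7, "c"), (8, "d")]
picked = sample_with_replacement_stream(iter(example_rows), 3)
```

When the row count is known, drawing N random positions directly is cheaper, since it avoids the per-row loop over all slots.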

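b*m 2008-07-09 07:07

Post 5

Thanks, always~

[l*****h wrote:]

: Nice music.
: There is always a road once the cart reaches the mountain, and every wall has a door, so there is no need to agonize over anything.
: Look at more beautiful things, listen to more beautiful music, and store up positive energy :)
:
: ★ Sent from iPhone App: ChineseWeb 8.7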

c*t 2008-07-09 07:07

Post 6

I need to do it on the server side (i.e., in a UDF). How do I do that?
As for the number of rows, I can run a query to count the rows in the table.
Thanks.

[I******e wrote:]

: If you know the number of rows in the table, it should be easy, right?
: Otherwise, it is a classic problem: pool sampling.

m*n 2008-07-09 07:07

Post 7

Sounds lovely.

[b***m wrote:]

: What a wonderful thing it is to have hope!

B*g 2008-07-09 07:07

Post 8

Can you create a temp table on the server?

[c*****t wrote:]

: I need to do it on the server side (i.e., in a UDF). How do I do that?
: As for the number of rows, I can run a query to count the rows in the table.
: Thanks.

A*e 2008-07-09 07:07

Post 9

Mm, lovely. Thanks to makejian for bumping the new thread back up~

c*t 2008-07-09 07:07

Post 10

Ya.

[B*****g wrote:]

: Can you create a temp table on the server?

b*m 2008-07-09 07:07

Post 11

Thanks from me too~
That was quick~

[A***e wrote:]

: Mm, lovely. Thanks to makejian for bumping the new thread back up~

B*g 2008-07-09 07:07

Post 12

1. Write a procedure that stores N random PK values from the original table
in a temp table.
2. Select by joining the temp table with the original table.
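The two steps above can be sketched end to end as follows. This is a minimal illustration using Python's bundled sqlite3 rather than PostgreSQL or Oracle; the table names tab1 and tab_temp, the columns id and data, and the function name are assumptions made for the example, and the random picks are drawn on the client side instead of inside a stored procedure.

```python
import random
import sqlite3
from typing import List, Optional, Tuple


def sample_rows_via_temp_table(
    conn: sqlite3.Connection, n: int, rng: Optional[random.Random] = None
) -> List[Tuple[int, str]]:
    """Return n rows of tab1 chosen uniformly with replacement."""
    rng = rng or random.Random()
    cur = conn.cursor()

    # Step 1: store n random primary-key values in a temp table.
    ids = [row[0] for row in cur.execute("SELECT id FROM tab1")]
    if not ids:
        return []
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS tab_temp "
        "(seq INTEGER PRIMARY KEY, id INTEGER)"
    )
    cur.execute("DELETE FROM tab_temp")
    picks = rng.choices(ids, k=n)  # choices() samples with replacement
    cur.executemany(
        "INSERT INTO tab_temp (seq, id) VALUES (?, ?)",
        list(enumerate(picks)),
    )

    # Step 2: join the temp table back to the original table.
    # A key stored twice in tab_temp yields its row twice.
    return cur.execute(
        "SELECT t.id, t.data FROM tab_temp s "
        "JOIN tab1 t ON t.id = s.id ORDER BY s.seq"
    ).fetchall()
```

Because the join runs from the temp table outward, duplicates survive into the result, which is what distinguishes this from an ORDER BY random() query.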

[c*****t wrote:]

: Ya.

Y*t 2008-07-09 07:07

Post 13

Bumping the new thread.

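c*t 2008-07-09 07:07

Post 14

Step one is exactly my problem...
My current approach is to use an aggregate function to build an id array,
pick the N samples I need from it to produce a second id array, and then
enumerate that array. It is cumbersome; I don't know whether there is
something simpler.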
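The core of the id-array approach described above is only the index draw, which can be sketched independently of any database. This is an assumption-laden illustration, not the poster's UDF code: the function names are hypothetical and the id list stands in for the aggregated array. Drawing positions into the array, rather than id values, is what makes gaps in the ids harmless.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def pick_ids_with_replacement(
    ids: List[int], n: int, rng: Optional[random.Random] = None
) -> List[int]:
    """Pick n ids uniformly with replacement by drawing array positions."""
    rng = rng or random.Random()
    if not ids:
        return []
    # randrange(len(ids)) is a uniform position 0..len(ids)-1, so gaps
    # between id values do not bias the draw.
    return [ids[rng.randrange(len(ids))] for _ in range(n)]


def multiplicities(picked: List[int]) -> Dict[int, int]:
    """Collapse the picks to {id: how many times it was drawn}."""
    return dict(Counter(picked))


# Hypothetical ids matching the example table in the question.
ids = [1, 9, 7, 8]
picked = pick_ids_with_replacement(ids, 3)
counts = multiplicities(picked)
```

Collapsing to multiplicities means each distinct row needs to be fetched only once and can then be repeated according to its count, which may matter when the sample is close to the table size.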

[B*****g wrote:]

: 1. Write a procedure that stores N random PK values from the original table
: in a temp table.
: 2. Select by joining the temp table with the original table.

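B*g 2008-07-09 07:07

Post 15

It's not too much trouble. I tested the following procedure in Oracle with
:N = 1000 and lnCOUNT = 4427396.
It took 10 secs, not too bad, hehe.
DECLARE
    lnNum       NUMBER := :N;   -- number of samples to draw
    lnCOUNT     NUMBER;         -- number of rows in tab1
    lnRandomSeq NUMBER;         -- random 1-based position in lrecPkID

    TYPE ltypPkID IS TABLE OF tab1.pk_col%TYPE;
    lrecPkID ltypPkID;
BEGIN
    -- Load every primary key into an in-memory collection.
    SELECT pk_col
    BULK COLLECT INTO lrecPkID
    FROM tab1;

    lnCOUNT := lrecPkID.COUNT;
    FOR I IN 1..lnNum
    LOOP
        -- dbms_random.value(low, high) returns a fractional value in
        -- [low, high), so truncate over [1, lnCOUNT + 1) to get a
        -- uniform integer position in 1..lnCOUNT.
        SELECT TRUNC(dbms_random.value(1, lnCOUNT + 1))
        INTO lnRandomSeq
        FROM DUAL;

        INSERT INTO tab_temp(COL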

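[c*****t wrote:]

: Step one is exactly my problem...
: My current approach is to use an aggregate function to build an id array,
: pick the N samples I need from it to produce a second id array, and then
: enumerate that array. It is cumbersome; I don't know whether there is
: something simpler.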

c*t 2008-07-09 07:07

Post 16

That won't work. If I have 1 million rows and take a 90% sample w/ replacement,
this approach is doomed. On top of that, I need to repeat it N times...

[B*****g wrote:]

: It's not too much trouble. I tested the following procedure in Oracle with
: :N = 1000 and lnCOUNT = 4427396.
: It took 10 secs, not too bad, hehe.
: DECLARE
: lnNum NUMBER := :N;
: lnCOUNT NUMBER;
: lnRandomSeq NUMBER;
:
: TYPE ltypPkID IS TABLE OF tab1.pk_col%TYPE;
: lrecPkID ltypPkID;

B*g 2008-07-09 07:07

Post 17

Nod.
I tried 730k rows with a 650k sample; the insert takes 4.5 mins.
I also tried the query below, which is even slower than my procedure, hehe.
I'm out of ideas.
SELECT *
  FROM (SELECT *
          FROM tab1
         ORDER BY DBMS_RANDOM.VALUE)
 WHERE ROWNUM <= 1000

[c*****t wrote:]

: That won't work. If I have 1 million rows and take a 90% sample w/ replacement,
: this approach is doomed. On top of that, I need to repeat it N times...

c*t 2008-07-09 07:07

Post 18

Finally finished the code for this approach.
For 1,000,000 rows with a 900,000 sample:
Total runtime: 3141.381 ms
Not bad at all, but it takes 4 UDFs to do the job.

[c*****t wrote:]

B*g 2008-07-09 07:07

Post 19

-- a: every row of tab1 tagged with a row number
-- b: 4000000 random row numbers (repeats allowed)
SELECT a.*
  FROM (SELECT ROWNUM rn,
               c1.*
          FROM tab1 c1) a,
       (SELECT CEIL (DBMS_RANDOM.VALUE (0, 4427396)) rn
          FROM tab1
         WHERE ROWNUM <= 4000000) b
 WHERE a.rn = b.rn
4427396 records in tab1
sample of 4000000
Total 6 mins in the development environment. The production db is usually 10X
faster, so the time should be around 1 min.

[c*****t wrote:]

: Finally finished the code for this approach.
: For 1,000,000 rows with a 900,000 sample:
: Total runtime: 3141.381 ms
: Not bad at all, but it takes 4 UDFs to do the job.

c*t 2008-07-09 07:07

Post 20

Yours relies on the row ids being consecutive. Of course, finding a way to
renumber the row ids is also a good approach.
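The renumbering idea can be sketched as: copy the ids into a helper table that assigns consecutive numbers 1..count, draw random integers in that range, and join back. The example below uses Python's sqlite3 purely for illustration; the helper tables numbered and draws, the function name, and the tab1(id, data) layout are assumptions, and it relies on SQLite handing out consecutive INTEGER PRIMARY KEY values when rows are inserted into an empty table.

```python
import random
import sqlite3
from typing import List, Optional, Tuple


def sample_via_dense_numbering(
    conn: sqlite3.Connection, n: int, rng: Optional[random.Random] = None
) -> List[Tuple[int, str]]:
    """Sample n rows of tab1 with replacement through a dense row number."""
    rng = rng or random.Random()
    cur = conn.cursor()

    # Map the possibly gappy ids onto consecutive numbers 1..count.
    cur.execute("DROP TABLE IF EXISTS temp.numbered")
    cur.execute("CREATE TEMP TABLE numbered (rn INTEGER PRIMARY KEY, id INTEGER)")
    cur.execute("INSERT INTO numbered (id) SELECT id FROM tab1")
    count = cur.execute("SELECT COUNT(*) FROM numbered").fetchone()[0]
    if count == 0:
        return []

    # Draw n row numbers; randint is inclusive on both ends.
    cur.execute("DROP TABLE IF EXISTS temp.draws")
    cur.execute("CREATE TEMP TABLE draws (rn INTEGER)")
    cur.executemany(
        "INSERT INTO draws (rn) VALUES (?)",
        ((rng.randint(1, count),) for _ in range(n)),
    )

    # Each draw matches exactly one numbered row, so repeats are preserved.
    return cur.execute(
        "SELECT t.id, t.data FROM draws d "
        "JOIN numbered m ON m.rn = d.rn "
        "JOIN tab1 t ON t.id = m.id"
    ).fetchall()
```

The dense numbering only has to be rebuilt when the table changes, so repeated sampling runs can reuse it and pay only for the draws and the join.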

[B*****g wrote:]

: SELECT a.*
: FROM (SELECT ROWNUM rn,
: c1.*
: FROM tab1 c1) a,
: (SELECT CEIL (DBMS_RANDOM.VALUE (0, 4427396)) rn
: FROM tab1
: WHERE ROWNUM <= 4000000) b
: WHERE a.rn = b.rn
: 4427396 records in tab1
: sample of 4000000