怎样快速得到两个表的交集 - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>Database - 数据库

怎样快速得到两个表的交集

怎样快速得到两个表的交集# Database - 数据库

x*e2006-11-03 08:11

1 楼

我有两个表，都只有一个属性 “ID”，每张表差不多都有5万条记录
比如
ID
10
13
300
100
12
1001
。。。
另外一个表也一样，只是记录的数据可能不同
现在要求2个表的交集，可以这么做
select * from table1, table2 where table1.ID = table2.ID;
但是因为表比较大，上面这个命令费时间太多(因为对所有记录两两比较， O(n^2))
按理说如果两个表先进行排序，后在求交集的话，时间复杂度只是线性的(O(2n))
但是我不知道怎么用SQL语言实现，请各位大牛帮忙
多谢

c*d2006-11-03 08:11

2 楼

select * from table1
INTERSECT
select * from table2

w*r2006-11-03 08:11

3 楼

if both table have indexex enabled on ID column, you can simply do
SELECT * FROM
(SELECT ID FROM TB1) A INNER JOIN (SELECT ID FROM TB2) B
ON (A.ID = B.ID)
most of databases should handle this situation by direct index hit/index
join
automatically... BTW, 50k records are tiny according to today's hardware/
software platform. this type of join should never cause any problem if you
have proper index schema enabled on the tables.

【在 x****e 的大作中提到】

: 我有两个表，都只有一个属性 “ID”，每张表差不多都有5万条记录
: 比如
: ID
: 10
: 13
: 300
: 100
: 12
: 1001
: 。。。

x*e2006-11-03 08:11

4 楼

我在自己的PC上用MYSQL做了一下测试，
对于都是5万条记录的两张表 table1, table2
第一个方法：
select * from table1, table2 where table1.ID = table2.ID;
大约需要600秒
第二个方法：
create index table1_index on table1 (id);
create index table2_index on table2 (id);
SELECT * FROM (SELECT ID FROM TB1) A INNER JOIN (SELECT ID FROM TB2) B ON (A
.ID = B.ID)
大约需要350秒
的确快了
第三个方法：
用JAVA编程，随机生成两个size都是5万的array
先分别对两个array排序（从小到大），然后从小比到大，时间复杂度是2×50000
最终运算只需要不到1秒！！！
为什么数据库的效率会比JAVA的程序慢那么多？
哪位兄台能否解释一下？
多谢

I*e2006-11-03 08:11

5 楼

You basically need sort merge join or hash join to do this fast. But, MySQL
does not support any of these. Do you have other databases?
After you create index, can you do:
explain select * from table1, table2 where table1.ID = table2.ID;
and tell us the resule.
BTW, if you output a lot of records, to measure performance, you should do:
select count(*) from table1, table2 where table1.ID = table2.ID;

(A

【在 x****e 的大作中提到】

: 我在自己的PC上用MYSQL做了一下测试，
: 对于都是5万条记录的两张表 table1, table2
: 第一个方法：
: select * from table1, table2 where table1.ID = table2.ID;
: 大约需要600秒
: 第二个方法：
: create index table1_index on table1 (id);
: create index table2_index on table2 (id);
: SELECT * FROM (SELECT ID FROM TB1) A INNER JOIN (SELECT ID FROM TB2) B ON (A
: .ID = B.ID)

x*e2006-11-03 08:11

6 楼

MySQL
sort merge join正是我想要的，是否只有ORACLE才有这个命令？
结果是：
id select_type table type possible_keys key key_len ref
rows Extra
1 SIMPLE table1 index table1_index table1_index 9
47506 Using index
1 SIMPLE table2 ref table2_index table2_index 9 codis.
table1.id 1 Using where; Using index
似乎采用这个命令，速度非常快，也是不到1秒
这个跟 create table jointable select table1.id from table1, table2 where
table1.ID = table2.ID; 好像一样的快
但select table1.i

【在 I******e 的大作中提到】

: You basically need sort merge join or hash join to do this fast. But, MySQL
: does not support any of these. Do you have other databases?
: After you create index, can you do:
: explain select * from table1, table2 where table1.ID = table2.ID;
: and tell us the resule.
: BTW, if you output a lot of records, to measure performance, you should do:
: select count(*) from table1, table2 where table1.ID = table2.ID;
:
: (A

I*e2006-11-03 08:11

7 楼

All major databases all have it. But hash join is much faster in most cases.
From the explain, it looks like that MySQL is still using nested index join.
The slowness should be in the output. If you use Python or Java API to
retrieve rows, it will be much faster.

x*e2006-11-03 08:11

8 楼

非常感谢！

w*r2006-11-03 08:11

9 楼

mysql...well, free software usually need lot tuning effort. I believe mysql
need some tuning to make the db engine choose hash join on index. Other
commercial db software should use merge or hash join for the index
automatically in your scenario. get away from mysql if you do not want to
waste your time to fine tune the RDBMS behavior.

(A

【在 x****e 的大作中提到】

I*e2006-11-03 08:11

10 楼

This is a hard lesson to me: Mysql does not support hash join or sort merge
join. In most system, Mysql is always I/O bounded.