【Quoting l**b】
: e.g., a 20000x2000 matrix, which is very common for data mining.
l*b
Post 5
We know that the common web server log file is just a flat text file. Suppose we have 2000 web pages on a web site. If a row records all pages visited by a certain user (he may visit some of the 2000 pages), and if 20000 users have been identified, how do we load such records (in the text file it would be a 20000x2000 matrix, where the cell value is how many seconds the user spent on the page) into a database?
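To make that layout concrete, here is a minimal parsing sketch (not from the thread): it assumes the log matrix is a whitespace-delimited text file with one row per user and one column per page, and the file name weblog_matrix.txt plus the id-by-position convention are purely illustrative.

```python
def read_triples(path):
    """Yield (user_id, page_id, seconds) for every non-zero cell of the
    flat 20000x2000 matrix file; ids are just row/column positions here."""
    with open(path) as f:
        for user_id, line in enumerate(f):
            for page_id, value in enumerate(line.split()):
                seconds = int(value)
                if seconds > 0:          # skip zeros to keep the data sparse
                    yield (user_id, page_id, seconds)

if __name__ == "__main__":
    triples = list(read_triples("weblog_matrix.txt"))  # hypothetical file name
    print(len(triples), "non-zero cells")
```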
【Quoting aw】
: Pardon my ignorance, but are you SURE one of your RECORDs has 1000 ATTRIBUTES? ORACLE's LIMIT is only about 1000 anyway.
: Could you describe your DESIGN in a bit more detail?
s*e
Post 6
table 1: web page (page_id, ...)
table 2: user (user_id, ...)
table 3: access record (user_id, page_id, access_time)
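As an editorial aside, a minimal sqlite3 sketch of this three-table layout follows; the table and column names (web_page, access_record, weblog.db) are illustrative, the extra columns are placeholders, and sqlite3 merely stands in for whatever RDBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect("weblog.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS web_page      (page_id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE IF NOT EXISTS user          (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS access_record (user_id INTEGER, page_id INTEGER, access_time INTEGER);
""")

# In practice the triples would come from parsing the flat log matrix
# (see the sketch earlier in the thread); a literal sample keeps this runnable.
triples = [(0, 17, 42), (0, 955, 3), (1, 17, 120)]
conn.executemany(
    "INSERT INTO access_record (user_id, page_id, access_time) VALUES (?, ?, ?)",
    triples,
)
conn.commit()
```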
【Quoting l**b】
: We know that the common web server log file is just a flat text file.
: Suppose we have 2000 web pages on a web site. If a row records all pages
: visited by a certain user (he may visit some of the 2000 pages), and if 20000
: users have been identified, how do we load such records (in the text file it
: would be a 20000x2000 matrix, where the cell value is how many seconds the
: user spent on the page) into a database?
aw
Post 7
lieb, you should go back and review basic DATABASE concepts. No offense intended.
【Quoting s***e】
: table 1: web page (page_id, ...)
: table 2: user (user_id, ...)
: table 3: access record (user_id, page_id, access_time)
l*b
Post 8
Thanks, shuke. You are definitely right from the database design perspective. Then, for table 3 we would have 40,000,000 rows (much fewer if the original matrix is sparse) while having only 3 columns. The only concern is that with the data split into more tables, more aggregation code (join queries) needs to be written when getting the data back to do matrix-oriented computation. Not sure which way is faster, the database tables or a single 20000x2000 matrix-like text file, in terms of getting the data back.
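For the matrix-oriented side of that concern, one option (a sketch only, assuming the illustrative access_record table above and that SciPy is available) is to pull table 3 back out with a single query and rebuild it as a sparse matrix in memory.

```python
import sqlite3
from scipy.sparse import coo_matrix   # assumes SciPy is installed

# Rebuild the 20000x2000 usage matrix from table 3 in one round trip.
conn = sqlite3.connect("weblog.db")   # illustrative database from the sketch above
rows, cols, vals = [], [], []
for user_id, page_id, seconds in conn.execute(
        "SELECT user_id, page_id, access_time FROM access_record"):
    rows.append(user_id)
    cols.append(page_id)
    vals.append(seconds)

usage = coo_matrix((vals, (rows, cols)), shape=(20000, 2000))
print(usage.nnz, "non-zero cells loaded")
```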
【Quoting s***e】
: table 1: web page (page_id, ...)
: table 2: user (user_id, ...)
: table 3: access record (user_id, page_id, access_time)
l*b
Post 9
(flush) Good reminder, I have not used a database for such a long time... too much text mining...
【Quoting aw】
: lieb, you should go back and review basic DATABASE concepts. No offense intended.
:
: Pardon my ignorance, but are you SURE one of your RECORDs has 1000 ATTRIBUTES? ORACLE's LIMIT is only about 1000 anyway.
b*e
Post 10
exactly.
【Quoting aw】
: lieb, you should go back and review basic DATABASE concepts. No offense intended.
:
: Pardon my ignorance, but are you SURE one of your RECORDs has 1000 ATTRIBUTES? ORACLE's LIMIT is only about 1000 anyway.
s*e
Post 11
If you import the data into a database, you'd better stick with the database for further computation as much as possible. For example, creating indexes on user_id or page_id in table 3 could improve your join query performance a lot. If you can't take advantage of the database, just forget it.
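A small sketch of that suggestion, using the illustrative schema from the sketches above; the index names and the aggregate query are just examples of keeping the computation inside the database rather than in matrix code.

```python
import sqlite3

conn = sqlite3.connect("weblog.db")   # illustrative database from the sketch above

# Indexes on the join/filter columns of table 3, per the suggestion above.
conn.executescript("""
CREATE INDEX IF NOT EXISTS idx_access_user ON access_record (user_id);
CREATE INDEX IF NOT EXISTS idx_access_page ON access_record (page_id);
""")

# Example of keeping the computation in the database: total seconds per page,
# instead of summing a column of an in-memory matrix.
for page_id, total_seconds in conn.execute("""
        SELECT page_id, SUM(access_time) AS total_seconds
        FROM access_record
        GROUP BY page_id
        ORDER BY total_seconds DESC
        LIMIT 10"""):
    print(page_id, total_seconds)
```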
【Quoting l**b】
: Thanks, shuke. You are definitely right from the database design perspective.
: Then, for table 3 we would have 40,000,000 rows (much fewer if the original
: matrix is sparse) while having only 3 columns.
: The only concern is that with the data split into more tables, more aggregation
: code (join queries) needs to be written when getting the data back to do
: matrix-oriented computation.
: Not sure which way is faster, the database tables or a single 20000x2000
: matrix-like text file, in terms of getting the data back.