A file-splitting question # Java - 爪哇娇娃
j*s
1
A question, please: I have a large file, a txt table, that needs to be split into multiple files keyed on the values in the first column.
For example
g*g
2
Do something in between: keep a "file pool". Cap it at, say, 5000 open handles,
and keep the 5000 most recently used files open. Put the handles in a queue;
when it grows past 5000, pop the head and append the new one at the tail.
When you write to a file that is already in the queue, remove it and re-append
it at the tail. To speed up the lookup, use a hashmap to track which files
are open.
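A minimal Java sketch of that pool, under my own assumptions (class name `FilePool`, tab-free keys used directly as `.txt` file names, a small `maxOpen` for illustration): a queue of keys ordered by recency plus a hashmap from key to open writer, exactly the evict-head / re-append-on-write policy described above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// "File pool": at most maxOpen writers are open at once. The queue holds
// keys from least to most recently used; the map answers "is it open?" in O(1).
class FilePool {
    private final int maxOpen;
    private final Path dir;
    private final ArrayDeque<String> queue = new ArrayDeque<>();
    private final Map<String, BufferedWriter> open = new HashMap<>();

    FilePool(Path dir, int maxOpen) {
        this.dir = dir;
        this.maxOpen = maxOpen;
    }

    void append(String key, String line) throws IOException {
        BufferedWriter w = open.get(key);
        if (w != null) {
            // Already open: move its key to the tail (most recently used).
            queue.remove(key);   // O(n) scan of the deque; fine for a sketch
            queue.addLast(key);
        } else {
            if (open.size() >= maxOpen) {
                // Evict the least recently used writer (head of the queue).
                String victim = queue.pollFirst();
                open.remove(victim).close();
            }
            // Reopen in append mode so earlier contents are preserved.
            w = Files.newBufferedWriter(dir.resolve(key + ".txt"),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            open.put(key, w);
            queue.addLast(key);
        }
        w.write(line);
        w.newLine();
    }

    void closeAll() throws IOException {
        for (BufferedWriter w : open.values()) w.close();
        open.clear();
        queue.clear();
    }
}
```

The `queue.remove(key)` call is linear in the pool size; with 5000 entries per write that may matter, which is one reason the `LinkedHashMap` idiom further down the thread is attractive.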

【Quoting j*******s】
: A question, please: I have a large file, a txt table, that needs to be split into multiple files keyed on the values in the first column.
: For example

j*s
3
Great approach, thanks a lot! This "stack" trick is excellent.

【Quoting g*****g】
: Do something in between, let's say you keep a "file pool",
: you can open a maximum of 5000, and you keep the most recent 5000
: open. Put it in a queue, pop the head out and append the new one
: at the tail when it's over 5000. When you write a file and the file
: is already in the queue, remove it and append it to the tail.
: To speed up search, you can use a hashmap to track if the files are
: open.

j*s
4
Is a queue or a stack better here? The keys in the first column arrive in random order, so FIFO vs. LIFO makes no difference, right?

【Quoting g*****g】
: Do something in between, let's say you keep a "file pool",
: you can open a maximum of 5000, and you keep the most recent 5000
: open. Put it in a queue, pop the head out and append the new one
: at the tail when it's over 5000. When you write a file and the file
: is already in the queue, remove it and append it to the tail.
: To speed up search, you can use a hashmap to track if the files are
: open.

g*g
5
If the keys are truly random, either works. For most practical workloads, though, you want first-in-first-out eviction, i.e. the policy known as least recently used (LRU).
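For what it's worth, Java's `LinkedHashMap` implements exactly this LRU policy out of the box: construct it in access order and override `removeEldestEntry`. A short sketch (string values stand in for open writers to keep it small; in the real pool you would close the evicted writer inside `removeEldestEntry`):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LinkedHashMap in access order gives LRU eviction for free:
// removeEldestEntry is consulted after every put, and returning true
// silently drops the least recently used entry.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // true = iterate in access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once over capacity
    }
}
```

Both `get` and `put` count as accesses, so a file that is written again automatically moves to the "most recent" end, which is the re-append-at-the-tail behavior from the queue version, without the linear scan.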

【Quoting j*******s】
: Is a queue or a stack better here? The keys in the first column arrive in random order, so FIFO vs. LIFO makes no difference, right?
s*e
6
Since you're working on Mac OS anyway, wouldn't sorting the big file first make this much simpler? I'd even go a step further and write the whole thing in a scripting language (shell, python, or perl, any of them); it would surely be easier.
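Once the file is sorted on the first column (for example with the system `sort` utility), all rows with the same key are adjacent, so the split needs only one output writer open at a time. A rough Java sketch of that second stage, assuming tab-separated columns and `.txt` output names (both my assumptions, since the original table format was not shown):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Split a file already sorted by its first column: roll to a new
// output file whenever the key changes, so only one writer is open.
class SortedSplit {
    static void split(Path sortedInput, Path outDir) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(sortedInput)) {
            String currentKey = null;
            BufferedWriter out = null;
            String line;
            while ((line = in.readLine()) != null) {
                String key = line.split("\t", 2)[0];   // first column is the key
                if (!key.equals(currentKey)) {
                    if (out != null) out.close();      // finish the previous group
                    out = Files.newBufferedWriter(outDir.resolve(key + ".txt"));
                    currentKey = key;
                }
                out.write(line);
                out.newLine();
            }
            if (out != null) out.close();
        }
    }
}
```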

【Quoting j*******s】
: A question, please: I have a large file, a txt table, that needs to be split into multiple files keyed on the values in the first column.
: For example

b*y
7
Yeah, I also prefer the sort-first approach. It feels more methodical.

【Quoting s***e】
: Since you're working on Mac OS anyway, wouldn't sorting the big file first make this much simpler? I'd even go a step
: further and write the whole thing in a scripting language (shell, python, or perl, any of them); it would surely be easier.

F*n
8
If memory can hold it, sorting first is definitely the way to go. Sorting is only O(n log n), which is far faster than all the repeated I/O.

【Quoting b******y】
: Yeah, I also prefer the sort-first approach. It feels more methodical.

A*o
9
Or keep all the file names in memory, and write to only 10k of the files on each pass through the raw file.
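A sketch of that multi-pass idea, with names of my own choosing (`MultiPassSplit`, `batchSize`, tab-separated input as before): one pass collects the distinct keys, then the raw file is re-read once per batch of keys, with only that batch's writers open.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

// Multi-pass split: at most batchSize writers are open during any pass,
// at the cost of re-reading the input once per batch of keys.
class MultiPassSplit {
    static void split(Path input, Path outDir, int batchSize) throws IOException {
        // Pass 0: collect the distinct keys (only the names are held in memory).
        Set<String> keys = new LinkedHashSet<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null)
                keys.add(line.split("\t", 2)[0]);
        }
        // One full pass over the raw file per batch of keys.
        List<String> all = new ArrayList<>(keys);
        for (int start = 0; start < all.size(); start += batchSize) {
            Set<String> batch = new HashSet<>(
                    all.subList(start, Math.min(start + batchSize, all.size())));
            Map<String, BufferedWriter> writers = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                String line;
                while ((line = in.readLine()) != null) {
                    String key = line.split("\t", 2)[0];
                    if (!batch.contains(key)) continue;  // handled in another pass
                    BufferedWriter w = writers.get(key);
                    if (w == null) {
                        w = Files.newBufferedWriter(outDir.resolve(key + ".txt"));
                        writers.put(key, w);
                    }
                    w.write(line);
                    w.newLine();
                }
            }
            for (BufferedWriter w : writers.values()) w.close();
        }
    }
}
```

The trade-off versus the file pool is plain: the number of input reads is `ceil(distinctKeys / batchSize)` instead of one, but no open/close churn ever happens mid-pass.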

【Quoting g*****g】
: Do something in between, let's say you keep a "file pool",
: you can open a maximum of 5000, and you keep the most recent 5000
: open. Put it in a queue, pop the head out and append the new one
: at the tail when it's over 5000. When you write a file and the file
: is already in the queue, remove it and append it to the tail.
: To speed up search, you can use a hashmap to track if the files are
: open.
