Can the virtual CD-ROM drive on a USB flash drive be deleted? # Windows - The Window with a View
B*t
1
Major Season 4, Episode 4: This Is the Majors
Goro is recruited by the Salmons straight into their Major League training camp and even given a No. 56 jersey (which in Japanese reads close to "Goro"). Goro is as much of a badass as ever, getting into a brawl with the team's ace as soon as he joins the club. The ace seems to think highly of him all the same; after all:
"A rookie who's going to survive in the Majors from here on can't go handing the mound over to some complete stranger."
To make it in the Majors, you must have the conviction never to give up the mound lightly... So the ace proposes settling it with a contest of control, and Goro loses outright... after all, there is still a decisive gap in strength between him and a Major League ace. Just as a dejected Goro concludes he simply doesn't have Major League ability, the club asks him to prepare to start in an exhibition game. Goro asks his catcher, Fox (the only person on the team who understands Japanese), to pass along a request to the coach to push back his start, and Fox scorns him for it: everyone else is fighting desperately to stay on the Major League roster, even if that means enduring long bus rides and living on hamburgers and hot dogs down in the minors.
That said, pay in American minor-league baseball really is terrible: generally only a bit over a thousand dollars a month, perhaps plus some meal money on away games. You genuinely cannot support yourself on minor-league ball alone. By contrast, just making a Major League roster is worth at least $300,000 a year.
s*x
2
Hi, guys,
I need your expert opinions on this problem. I need to do random sampling
of an input file. Suppose this file consists of 1000 lines (1000 data points).
I want to randomly choose 60%, 70%, and 85% of them and do
some analysis. Does anyone have any suggestions? Thanks!
l*e
3
It seems that nowadays, when you plug in some USB flash drives, the system shows a CD-ROM drive, with autorun too.
Is it possible to remove that virtual drive, or the autorun.inf on it?
thanx,
h*e
4
It's only 1000 lines; any approach will do. Generate 1000*60%, *70%, or *85% random numbers in [1,1000], then read the file line by line, keep a line counter, and pick out the matching lines.
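A minimal sketch of this idea in Python (the file name and the 60% figure are illustrative); note that the drawn line numbers can repeat, which the replies below get into:

import random

lines = open("data.txt").readlines()   # read all 1000 lines into memory
k = int(len(lines) * 0.6)              # how many lines to sample
picks = [random.randrange(len(lines)) for _ in range(k)]  # indices, possibly repeated
sample = [lines[i] for i in picks]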

【Quoting s*********x's post】
: Hi, guys,
: I need your expert opinions on this problem. I need to do random sampling
: of an input file. Suppose this file consists of 1000 lines (1000 data points).
: I want to randomly choose 60%, 70%, and 85% of them and do some analysis.
: Does anyone have any suggestions? Thanks!

y*r
5
Google U3-Uninstaller

【Quoting l***e's post】
: It seems that nowadays, when you plug in some USB flash drives, the system
: shows a CD-ROM drive, with autorun too. Is it possible to remove that
: virtual drive, or the autorun.inf on it?
: thanx,

s*x
6
Sorry, I have to type in English. My question is: if you generate (60% of
1000) random numbers, some of them will almost certainly be the same. Then
you are not covering 60% of the population, maybe only 55%.
What I need is 600 different random numbers between 1 and 1000.
Is my understanding correct?

【Quoting h*******e's post】
: It's only 1000 lines; any approach will do. Generate 1000*60%, *70%, or
: *85% random numbers in [1,1000], then read the file line by line, keep a
: line counter, and pick out the matching lines.

x*g
7
all the U3 stuff can be uninstalled with a utility program from
their web site.

【Quoting l***e's post】
: It seems that nowadays, when you plug in some USB flash drives, the system
: shows a CD-ROM drive, with autorun too. Is it possible to remove that
: virtual drive, or the autorun.inf on it?
: thanx,

g*g
8
First read all the lines into memory; 1000 lines is nothing.
Then generate random numbers, keeping every generated number
in a hashtable so you can make sure you get enough distinct ones.
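A sketch of this in Python, with a set standing in for the hashtable (file name illustrative, and assuming the file fits in memory):

import random

lines = open("data.txt").readlines()
n, k = len(lines), int(len(lines) * 0.6)
chosen = set()                       # the "hashtable" of indices drawn so far
while len(chosen) < k:               # keep drawing until we have k distinct indices
    chosen.add(random.randrange(n))
sample = [lines[i] for i in sorted(chosen)]   # sorted() keeps file order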

【Quoting s*********x's post】
: Sorry, I have to type in English. My question is: if you generate (60% of
: 1000) random numbers, some of them will almost certainly be the same. Then
: you are not covering 60% of the population, maybe only 55%. What I need is
: 600 different random numbers between 1 and 1000.
: Is my understanding correct?

l*e
9
not U3.
h*e
10
That's a different problem. There's a well-known technique to generate, say,
600 distinct numbers from 1 to 1000 uniformly. For simplicity, you can just use
STL's random_shuffle to get a random permutation of 1 to 1000 and use the first
600 numbers.
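The same trick in Python, where random.shuffle plays the role of STL's random_shuffle:

import random

indices = list(range(1, 1001))   # the numbers 1 to 1000
random.shuffle(indices)          # uniform random permutation (Fisher-Yates)
picked = indices[:600]           # the first 600 entries are the sample

In fact, random.sample(range(1, 1001), 600) does the same thing in a single call.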

【Quoting s*********x's post】
: Sorry, I have to type in English. My question is: if you generate (60% of
: 1000) random numbers, some of them will almost certainly be the same. Then
: you are not covering 60% of the population, maybe only 55%. What I need is
: 600 different random numbers between 1 and 1000.
: Is my understanding correct?

x*g
11
Then what is it?

【Quoting l***e's post】
: not U3.
s*x
12
Oh, man! That's exactly what I need! Thank you!

【Quoting h*******e's post】
: That's a different problem. There's a well-known technique to generate, say,
: 600 distinct numbers from 1 to 1000 uniformly. For simplicity, you can just
: use STL's random_shuffle to get a random permutation of 1 to 1000 and use
: the first 600 numbers.

m*t
13

Surely you meant _map_ the file into memory? Plainly
reading all the lines could be rather, uhm, hazardous, if
we don't know how long a line can be.

【Quoting g*****g's post】
: First read all the lines into memory; 1000 lines is nothing.
: Then generate random numbers, keeping every generated number
: in a hashtable so you can make sure you get enough distinct ones.

g*g
14
Something like
String str;
while ((str = bufferedReader.readLine()) != null) {
    arrList.add(str);
}
I don't think long lines matter.

【Quoting m******t's post】
:
: Surely you meant _map_ the file into memory? Plainly
: reading all the lines could be rather, uhm, hazardous, if
: we don't know how long a line can be.

h*n
15
Randomly shuffle the 1000 numbers, then take the first 600.
r*t
16
Do you use Python for such problems?
import random
lines = open("file", "r").readlines()
mylines = random.sample(lines, int(len(lines) * 0.6))
... (do whatever you want with mylines)

【Quoting s*********x's post】
: Hi, guys,
: I need your expert opinions on this problem. I need to do random sampling
: of an input file. Suppose this file consists of 1000 lines (1000 data points).
: I want to randomly choose 60%, 70%, and 85% of them and do some analysis.
: Does anyone have any suggestions? Thanks!

r*t
17
If you don't read the whole file, how do you know how many lines it has? Even if you
memory-map it, you still have to read it all before the random sampling can start.
Whether we need to worry about file size we don't know yet; one problem at a time.

【在 m******t 的大作中提到】
:
: Surely you meant _map_ the file into memory? Plainly
: reading all the lines could be rather, uhm, hazardous, if
: we don't know how long a line can be.

m*t
18

Of course they matter. I have seen lines as long as
a couple of hundred MB each.
We can _assume_ we won't run into any of them, and that's
probably usually a good assumption, but that's different
from saying they don't matter.

【Quoting g*****g's post】
: Something like
: String str;
: while ((str = bufferedReader.readLine()) != null) {
:     arrList.add(str);
: }
: I don't think long lines matter.

m*t
19

The idea is that memory mapping won't keep the whole thing in
memory - the OS can feel free to swap out all the rest
of the pages except for the one currently being accessed.

【Quoting r****t's post】
: If you don't read the whole file, how do you know how many lines it has?
: Even if you memory-map it, you still have to read it all before the random
: sampling can start. Whether we need to worry about file size we don't know
: yet; one problem at a time.

g*g
20
I think you can know beforehand the likelihood of running into
such trouble by looking at the file size.
At worst you run into an out-of-memory exception, and you
correct it. Well, we all run into them in some cases no matter
how carefully you design the code.

【Quoting m******t's post】
:
: The idea is that memory mapping won't keep the whole thing in
: memory - the OS can feel free to swap out all the rest
: of the pages except for the one currently being accessed.

r*t
21
I understand your point about using mmap. But just as was discussed here a while ago,
mmap doesn't completely solve the big-file problem either: you can't map an 8G file,
or if you can, then you can't map an 8T file.
My point is: mmap might not help until you start dealing with really big files, and
it doesn't necessarily help performance anyway, because you still need to know how
many lines the file has first.

【Quoting m******t's post】
:
: The idea is that memory mapping won't keep the whole thing in
: memory - the OS can feel free to swap out all the rest
: of the pages except for the one currently being accessed.

r*t
22
Another approach (which doesn't require knowing the total line count; inspired by magicfat):
if there are reasonably many lines, then by the law of large numbers you can read the file
from start to end and select each line with probability 60%. The catch is that the number
of lines you end up with is only close to 60%; it may not exactly equal a specified count,
in which case you drop the surplus lines, or go back for a few more (that final adjustment
is a hassle, but you may not need an exact line count at all?)
from random import random
from itertools import ifilter
def select(line):
    if random() <= 0.6:
        return line
    else:
        return False
mylines = ifilter(select, open("myfile"))
This way mylines is an iterator, and this code doesn't read the whole file into memory, so
it may save some memory. The drawback is that you can't get an exact line count: ask for
60% and you get roughly 60%, but a few lines more or fewer are both possible.

【Quoting s*********x's post】
: Hi, guys,
: I need your expert opinions on this problem. I need to do random sampling
: of an input file. Suppose this file consists of 1000 lines (1000 data points).
: I want to randomly choose 60%, 70%, and 85% of them and do some analysis.
: Does anyone have any suggestions? Thanks!

r*t
23
Stupid me. That way of writing it is clumsy, and I don't like one-liners. But the following is enough (python >= 2.5):
from random import random
from itertools import ifilter
mylines = ifilter(lambda x: x if random() <= .6 else False,
                  open("myfile"))

【Quoting r****t's post】
: Another approach (no need to know the total line count; inspired by magicfat):
: if there are reasonably many lines, then by the law of large numbers you can
: read the file start to end and select each line with probability 60%. The
: count you end up with is close to 60% but may not exactly equal a specified
: number, in which case you drop the surplus lines or go back for a few more
: (that final adjustment is a hassle, but you may not need an exact count at all?)
: def select(line):
:     if random() <= 0.6:
:         return line
:     else:
:         return False

c*t
24
Your algorithm may take more or fewer samples than required.
Actually, a fairly simple approach is to random_shuffle line[N], where N is the total
number of data lines. Even if N is a few million, that's no big deal. Then sort line[].
Then read the file one line at a time: if that line's number is not in line[], throw it
away; if it is in line[], keep it. This approach can handle very large files...
The above is sampling w/o replacement; sampling w/ replacement can be done the same way.
The only tricky part is if some algorithm happens to be sensitive to order...
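A Python sketch of this streaming variant (assuming N, the total line count, is already known; the file name is hypothetical):

import random

N = 5000000                         # total number of lines, assumed known beforehand
k = int(N * 0.6)
wanted = sorted(random.sample(range(N), k))   # k distinct line numbers, in file order

sample, j = [], 0
with open("big.txt") as f:
    for i, line in enumerate(f):    # stream the file; never hold it all in memory
        if j == len(wanted):
            break
        if i == wanted[j]:          # this line number was drawn, so keep the line
            sample.append(line)
            j += 1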

【Quoting r****t's post】
: Stupid me. That way of writing it is clumsy, and I don't like one-liners.
: But the following is enough (python >= 2.5):
: mylines = ifilter(lambda x: x if random() <= .6 else False,
:                   open("myfile"))

r*t
25
I already made it clear that the line count may be off; the exact-count solution is in
an earlier post. In fact, the OP may not need the line count to be exact, just the
sampling probability to be constant.
The problem with your method is: given an arbitrary file, how do you get that N?
Simply assuming N is known isn't very reasonable.

【Quoting c*****t's post】
: Your algorithm may take more or fewer samples than required.
: Actually, a fairly simple approach is to random_shuffle line[N], where N is
: the total number of data lines. Even if N is a few million, that's no big
: deal. Then sort line[]. Then read one line at a time: if that line's number
: is not in line[], throw it away; if it is, keep it. This can handle very
: large files... The above is sampling w/o replacement; w/ replacement works
: the same way. The only tricky part is if some algorithm is order-sensitive...

c*t
26
You can read through it once first, or run wc -l; both are fast.
This stuff is easy to write even in C...

【Quoting r****t's post】
: I already made it clear that the line count may be off; the exact-count
: solution is in an earlier post. In fact, the OP may not need the line count
: to be exact, just the sampling probability to be constant. The problem with
: your method is: given an arbitrary file, how do you get that N? Simply
: assuming N is known isn't very reasonable.

r*t
27
So you do need to read it through once first.
wc is actually slow (I didn't expect that either) and simply can't be used here: last
time, my Python code had finished the whole job (sanitizing some csv file) while wc,
spending 3-4x as long just counting lines, still wasn't done.

【Quoting c*****t's post】
: You can read through it once first, or run wc -l; both are fast.
: This stuff is easy to write even in C...

c*t
28
The wc in flex's test suite is super fast (it exists precisely to show off flex's
speed). Just use that one; it can be 10x faster or more. The main reason is that the
everyday wc is hand-written, does no block reads, and is full of jumps.
Your approach can't be used for proper sampling; neither too many nor too few will do.

【Quoting r****t's post】
: So you do need to read it through once first.
: wc is actually slow (I didn't expect that either) and simply can't be used
: here: last time, my Python code had finished the whole job (sanitizing some
: csv file) while wc, spending 3-4x as long just counting lines, still wasn't done.

r*t
29
If you need an exact number of samples, just use my first solution.
What is this flex test suite? Never heard of it...

【Quoting c*****t's post】
: The wc in flex's test suite is super fast (it exists precisely to show off
: flex's speed). Just use that one; it can be 10x faster or more. The main
: reason is that the everyday wc is hand-written, does no block reads, and is
: full of jumps. Your approach can't be used for proper sampling; neither too
: many nor too few will do.

h*e
30
The test samples that ship with flex.

【Quoting r****t's post】
: If you need an exact number of samples, just use my first solution.
: What is this flex test suite? Never heard of it...

m*t
31

Not when you don't keep all the lines in memory
without any apparent benefit.

【Quoting g*****g's post】
: I think you can know beforehand the likelihood of running into
: such trouble by looking at the file size.
: At worst you run into an out-of-memory exception, and you
: correct it. Well, we all run into them in some cases no matter
: how carefully you design the code.

m*t
32

That's some funny logic, though. Why do we switch to 64-bit?
It can't handle 31498 trillion gazillion google bungle number
of bytes anyway. 8-)
It doesn't have to be really huge files. And yes, it does help
performance.
Trust me, I would know. I wrote a log-parsing program (in Java)
a little while ago that does exactly this - builds an index of
fseek positions for every 100 lines. On a 200MB file,
the difference between readLine() and a memory-mapped buffer
was beyond visible.
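A rough Python sketch of that indexing idea (the original was Java; the 100-line interval comes from the post, and the file name is hypothetical):

import mmap

index = [0]                                  # byte offset where line 0 starts
with open("big.log", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pos, lineno = 0, 0
    while True:
        nl = mm.find(b"\n", pos)             # next newline, no per-line copies
        if nl == -1:
            break
        pos, lineno = nl + 1, lineno + 1
        if lineno % 100 == 0:                # record an offset every 100 lines
            index.append(pos)
    mm.close()
# to reach line L: start at index[L // 100], then skip L % 100 newlines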

【Quoting r****t's post】
: I understand your point about using mmap. But just as was discussed here a
: while ago, mmap doesn't completely solve the big-file problem either: you
: can't map an 8G file, or if you can, then you can't map an 8T file.
: My point is: mmap might not help until you start dealing with really big
: files, and it doesn't necessarily help performance anyway, because you still
: need to know how many lines the file has first.

s*x
33
Will random.sample pick a truly random set of lines, or pseudo-random ones? I
tried STL random_shuffle and found it's not a real random shuffle... meaning
if you run it twice, the "shuffled" numbers come out in the same order, and
then the sampling procedure is not right.

【Quoting r****t's post】
: Do you use Python for such problems?
: import random
: lines = open("file", "r").readlines()
: mylines = random.sample(lines, int(len(lines) * 0.6))
: ... (do whatever you want with mylines)

r*t
34
random.sample in Python uses the random.random pseudo-RNG, but on Linux
random.random is automatically seeded from /dev/urandom, so if you run it a
second time the order should probably (it's still not /dev/random) be different.
random.random() uses the Mersenne Twister as the core generator, which is
probably what many places use.
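In Python terms: seed explicitly when you want a reproducible shuffle, or let the library reseed itself from OS entropy so that two runs differ. A small illustration:

import random

random.seed(42)                     # fixed seed: identical output on every run
print(random.sample(range(10), 5))

random.seed()                       # reseed from OS entropy (/dev/urandom on Linux)
print(random.sample(range(10), 5))  # now differs from run to run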

【Quoting s*********x's post】
: Will random.sample pick a truly random set of lines, or pseudo-random ones?
: I tried STL random_shuffle and found it's not a real random shuffle...
: meaning if you run it twice, the "shuffled" numbers come out in the same
: order, and then the sampling procedure is not right.

t*t
35
STL random_shuffle calls the system std::rand() if you don't supply a rand
object yourself,
so the caller is actually responsible for guaranteeing the randomness.

【Quoting s*********x's post】
: Will random.sample pick a truly random set of lines, or pseudo-random ones?
: I tried STL random_shuffle and found it's not a real random shuffle...
: meaning if you run it twice, the "shuffled" numbers come out in the same
: order, and then the sampling procedure is not right.

r*t
36

That was the criterion coconut proposed a while back; I was just passing it along.
Fine, then: >200MB helps performance, but those are still big files.

【Quoting m******t's post】
:
: That's some funny logic, though. Why do we switch to 64-bit?
: It can't handle 31498 trillion gazillion google bungle number
: of bytes anyway. 8-)
: It doesn't have to be really huge files. And yes, it does help
: performance.
: Trust me, I would know. I wrote a log-parsing program (in Java)
: a little while ago that does exactly this - builds an index of
: fseek positions for every 100 lines. On a 200MB file,
: the difference between readLine() and a memory-mapped buffer

r*t
37
So you mean one needs to take care of seeding in his own code?

【Quoting t****t's post】
: STL random_shuffle calls the system std::rand() if you don't supply a rand
: object yourself,
: so the caller is actually responsible for guaranteeing the randomness.

t*t
38
An STL algorithm is... an algorithm only. Randomness is not part of the shuffle
algorithm.

【Quoting r****t's post】
: So you mean one needs to take care of seeding in his own code?