If possible, using a distributed cache will give you better QoS in terms of availability, scalability, and performance. If you have to store them in a single JVM, you'd better write your own hash function by extending HashMap. The default implementation of HashMap is not designed for big data like that.
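In practice you cannot change HashMap's internal hash spreading by subclassing (the spreading method is static in modern JDKs), so "write your own hash function" usually means giving the keys a different `hashCode()`. A minimal sketch of that approach, using an FNV-1a-style mix purely as an illustrative alternative hash (all names here are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: wrap the String key in a class whose hashCode()
// uses a different mixing function than String.hashCode().
final class CustomKey {
    final String s;
    CustomKey(String s) { this.s = s; }

    @Override
    public boolean equals(Object o) {
        return o instanceof CustomKey && ((CustomKey) o).s.equals(s);
    }

    @Override
    public int hashCode() {
        // FNV-1a over the characters; purely illustrative, not a recommendation.
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x01000193;
        }
        return h;
    }
}

public class CustomKeyDemo {
    public static void main(String[] args) {
        Map<CustomKey, Integer> map = new HashMap<>();
        map.put(new CustomKey("hello"), 1);
        System.out.println(map.get(new CustomKey("hello"))); // prints 1
    }
}
```

Whether this helps at all depends on the key distribution; for typical English strings the built-in hash is usually fine.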
g*g
Floor 7
1M is nothing, just do it, don't even bother with a benchmark.
"The default implementation of hashmap is not designed for the big data like that." Any documentation or experiments on this point?
【Quoting s******e】 : if possible, using distributed cache will provide you better QoS of : availability, scalability and performance. : If you have to store them in one single jvm, you'd better write your own : hash function by extending hashmap. The default implementation of hashmap is : not designed for the big data like that.
【Quoting g*****g】 : 1M is nothing, just do it, don't even bother with benchmark.
t*e
Floor 12
When the key space grows big, the probability of hash collisions increases. Java handles hash collisions in a less efficient manner, which can lead to slowness.
Figure out an algorithm to evenly distribute all the keys into 100 slots (maps), so each slot contains 10K records. Consistent hashing in BigTable might give you some inspiration.
【Quoting l******0】 : Do you mean splitting it into two maps?
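The 100-slot idea above can be sketched as a small sharded map that routes each key to one of 100 sub-maps by its hash. The class and method names here are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the "100 slots" idea: each key is routed to one
// of 100 sub-maps by its hash code.
public class ShardedMap {
    private static final int SHARDS = 100;

    @SuppressWarnings("unchecked")
    private final Map<String, String>[] shards = new HashMap[SHARDS];

    public ShardedMap() {
        for (int i = 0; i < SHARDS; i++) {
            shards[i] = new HashMap<>();
        }
    }

    private Map<String, String> shardFor(String key) {
        // floorMod avoids a negative index for negative hash codes
        return shards[Math.floorMod(key.hashCode(), SHARDS)];
    }

    public void put(String key, String value) { shardFor(key).put(key, value); }
    public String get(String key) { return shardFor(key).get(key); }

    public static void main(String[] args) {
        ShardedMap m = new ShardedMap();
        m.put("apple", "fruit");
        System.out.println(m.get("apple")); // prints fruit
    }
}
```

Note that within a single JVM this mostly just moves the bucketing problem up one level; it pays off mainly when the shards can live on different machines, as in the distributed-cache suggestion earlier in the thread.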
w*z
Floor 14
I don't think you are smarter than the people who implemented Java's String hash function. Don't over-engineer things. If you think you have a performance issue, run some benchmarks, see where the bottleneck is, and go from there. As I said, 1M is nothing nowadays.
【Quoting t*******e】 : When key space grows big, the probability of hash collision enlarges. Java : handles hash collision in a less-efficient manner, which could lead to : slowness.
【Quoting w**z】 : I don't think you are smarter than the people who implemented Java string : hash function. don't over engineer things. if you think you have performance : issue, run some benchmark and see where is the bottle neck and go from : there. as I said, 1m is nothing nowadays .
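A crude version of the suggested benchmark might look like the following. This is not a rigorous microbenchmark (no warmup control, and JIT effects will skew the numbers); for serious measurement use a harness such as JMH. The key pattern is made up for the test:

```java
import java.util.HashMap;
import java.util.Map;

// Rough timing sketch: 1M puts and 1M gets on a plain HashMap.
public class MapBench {
    public static void main(String[] args) {
        int n = 1_000_000;
        Map<String, Integer> map = new HashMap<>();

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            map.put("key-" + i, i);
        }
        long putMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += map.get("key-" + i); // sum forces the lookups to happen
        }
        long getMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("puts: " + putMs + " ms, gets: " + getMs
                + " ms, sum=" + sum);
    }
}
```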
w*z
Floor 16
I am talking about the String hash function, not the handling of hash map collisions. By the way, if you run OpenJDK in production, then good luck. Of course, they are smarter than me.
Same issue with Oracle JDK. BTW, you sound pretty naive and emotional. Please be professional, not personal, when discussing technical subjects.
【Quoting w**z】 : I am talking about String hash function, not handling of hash map collision : . by the way, if you run openjdk in production, then good luck. Of course : , they are smarter than me.
w*z
Floor 18
Which part is emotional and not professional?
【Quoting t*******e】 : Same issue with Oracle JDK. BTW, you sound pretty naive and emotional. : Please be professional not personal when discussing technical subjects.
s*e
Floor 19
Rehashing and collisions are the two major issues.
【Quoting l******0】 : "The default implementation of hashmap is not designed for the big data like : that." : Any documentation or experiments on this point?
c*e
Floor 20
Collisions in a hashtable are common; switch to a different algorithm.
【Quoting s******e】 : rehash and collision are two major issues.
s*e
Floor 21
Read the context please. Do not jump into the topic too fast without knowing the situation... we were talking about a big number of string keys.
Well, it seems people have different assumptions. If the initial loading time is not counted and your string keys are DB-id type strings, then HashMap might be fine, since neither rehashing nor collisions will be an issue. But according to the original post, it seems we are talking about building the hashmap over time, on the fly, with random, big strings as the keys. If so, rehashing and collisions will be issues.
g*g
Floor 26
1M is such a small dataset that it doesn't matter how it's built. A 32-bit integer gives you 4 billion buckets; 1M is just 1/4000 of them. You'll see a few collisions here and there, but not enough to cause a performance issue.
【Quoting s******e】 : well. seems that people have the different assumptions. : If the initial loading time is not counted and your string key is db id type : of string, then hashmap might be fine since both rehashing and collision : will not be issue any more. : But according to the original post, it seems that we are talking about : building the hashmap over the time on the fly with random and big strings as : the key. if so, rehashing and collision will be issues.
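The collision claim is easy to check empirically: hash 1M distinct strings and count how many share a full 32-bit hashCode. By the birthday approximation, roughly n^2 / 2^33 (about 116 for n = 1,000,000) collisions are expected if the hash were uniform; a real String.hashCode over a real key set may do better or worse. The key pattern below is invented for the experiment:

```java
import java.util.HashSet;
import java.util.Set;

// Count full 32-bit hashCode collisions among 1M distinct string keys.
public class CollisionCount {
    public static void main(String[] args) {
        int n = 1_000_000;
        Set<Integer> seen = new HashSet<>(n * 2);
        int collisions = 0;
        for (int i = 0; i < n; i++) {
            if (!seen.add(("key-" + i).hashCode())) {
                collisions++;
            }
        }
        System.out.println("collisions among " + n + " keys: " + collisions);
    }
}
```

Note that HashMap's table is far smaller than 2^32 slots (it grows in powers of two), so per-bucket collisions are much more frequent than full 32-bit collisions; the point of the experiment is only to gauge how well the key set spreads.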
s*e
Floor 27
You misunderstand the relationship between the number of buckets and collisions. Having enough buckets does not necessarily translate into fewer collisions. As said, a DB-id type string key (especially a surrogate id), due to the constraints applied, is more likely to be uniformly distributed than random big string keys. Rehashing is another issue because of the way the default implementation of HashMap works. If you build the map on the fly, I remember somebody reported that after 200 thousand string-key entries, rehashing could cause some users to suffer quite a bit. Even if not all users suffer, it should be avoided in any robust system.
【Quoting g*****g】 : 1M is such a small dataset it doesn't matter how it's built. : A 32 bits integer gives your 4 billion buckets, 1M is just 1/4000 of them, : you'll see a few collisions here and there, but not enough to cause : performance issue.
g*g
Floor 28
No, you misunderstand. The number of available buckets is constant; the less data, the less chance of collision, simple as that. Is it possible that 1M entries cause heavy collisions? Possible, but unlikely for most use cases. And when that is the case, it's easy to tune the hash function. The OP didn't mention any random big strings to begin with; try a simple approach and don't over-engineer before you have to.
【Quoting s******e】 : you misunderstand the relationship between #buckets and collision. Having : enough buckets does not necessarily transfer into less collisions. As said, : db id type string (especially surrogate id), due to the constraint applied, : is more likely to be uniformly distributed than random big string keys. : rehashing is another issue because of the way how the default implementation : of hashmap works. if you build the map on the fly, I remember that somebody : reported after 200 thousand string key entries, rehashing could cause some : users to suffer quite a bit. Even not all the users will suffer, but it : should be avoided for any robust systems.
w*z
Floor 29
Java's HashMap implementation does a supplemental hash on top of the hashCode the caller provides: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/ It's not likely you will have a lot of hash collisions for random String keys. Another way to avoid rehashing is to set a bigger initial capacity if you are sure you will have 1M records. I agree with goodbug: at least do some benchmarking before trying to tune anything.
【Quoting g*****g】 : No, you misunderstand. The number of available buckets is constant, the less : data, the less chance for collision, simple as that. : It is possible 1M data causes heavy collision? Possible, but unlikely for : most use cases. And when it's case, it's easy to tune the hashing function. : OP didn't mention any random big string to begin with, try a simple approach : and don't overengineer before you have to.
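On the supplemental hash being referred to: HashMap does not index buckets with the raw `hashCode()` but first spreads the high bits downward, since the bucket index only uses the low bits. In JDK 8 the spreading step is, to my recollection, roughly `h ^ (h >>> 16)` (JDK 7 used a longer sequence of shifts); a sketch of the idea:

```java
// Sketch of HashMap's bit-spreading step (JDK 8 style, from memory):
// XOR the top 16 bits of the hash into the bottom 16, so keys whose
// hash codes differ only in high bits still land in different buckets.
public class SpreadDemo {
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int raw = "example".hashCode();
        System.out.println(Integer.toBinaryString(raw));
        System.out.println(Integer.toBinaryString(spread(raw)));
    }
}
```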
l*m
Floor 30
HashMap will increase its memory allocation when the current allocation is not large enough.
【Quoting w**z】 : Java Hashmap implementation does a supplement hash on the hasCode caller : provides. : http://grepcode.com/file/repository.grepcode.com/java/root/jdk/ : It's not likely you will have a lot of hash collisions for random String Key : Another way to avoid rehashing is to set a bigger initial size if you are : sure you will have 1M records. : I agree with goodbug, at least do some benchmarking before trying to tune : anything.
w*z
Floor 31
At that point, rehashing happens... If you know your size in advance, just set the initial capacity to a bigger number to avoid rehashing.
【Quoting l*******m】 : hashmap will increase the memory allocation when the current allocation is : not large enough
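The presizing trick works because a HashMap rehashes once its size exceeds capacity times load factor (0.75 by default), so requesting expectedSize / 0.75 buckets up front means the table is never resized. A sketch (the helper name is made up):

```java
import java.util.HashMap;
import java.util.Map;

// Presize a HashMap so that expectedSize entries fit without rehashing:
// expectedSize <= capacity * 0.75 must hold, hence the division.
public class PresizedMap {
    static <K, V> Map<K, V> newMapFor(int expectedSize) {
        int capacity = (int) (expectedSize / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = newMapFor(1_000_000);
        for (int i = 0; i < 1_000_000; i++) {
            map.put("key-" + i, i); // no resize happens along the way
        }
        System.out.println(map.size()); // prints 1000000
    }
}
```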
s*e
Floor 32
First, the number of buckets is never constant for the default implementation of HashMap; that is why you have rehashing. Second, collisions depend on all three factors: the input, the hash function, and the number of buckets. Even if you could use all 64 bits of address space for your buckets, you might still have heavy collisions for a biased input. That is why I said DB-id type string keys might work fine with the default HashMap; I believe in such a case the default hash function will distribute them uniformly.
【Quoting g*****g】 : No, you misunderstand. The number of available buckets is constant, the less : data, the less chance for collision, simple as that. : It is possible 1M data causes heavy collision? Possible, but unlikely for : most use cases. And when it's case, it's easy to tune the hashing function. : OP didn't mention any random big string to begin with, try a simple approach : and don't overengineer before you have to.
s*e
Floor 33
When you build your map dynamically, that is a big if. If you can preload, I think even several million records should be fine if the input is not too biased.
【Quoting w**z】 : At that time, rehashing happens... If you know your size in advance, just : set to a bigger number to avoid rehashing.
w*z
Floor 34
That is true; it's a tradeoff between time and space. Only the OP knows his use case.
【Quoting s******e】 : When you build your map dynamically, that is a big if. If you can preload, I : think even several million records should be fine if the input is not that : biased.
l*0
Floor 35
In my case: 1) The keys are English words or phrases, with no repetitions. 2) I have to build this map in a static initialization function. BTW, what do you mean by 'preload'? Is it an initialization?
【Quoting s******e】 : When you build your map dynamically, that is a big if. If you can preload, I : think even several million records should be fine if the input is not that : biased.
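"Preload" here just means building the whole map once, up front (e.g. in exactly the kind of static initializer described) instead of growing it during request handling. A sketch with placeholder word data standing in for the real phrase list:

```java
import java.util.HashMap;
import java.util.Map;

// Preloading sketch: the map is fully built and presized in a static
// initializer, so no rehashing happens after class loading.
public class PhraseTable {
    private static final Map<String, Integer> PHRASES;

    static {
        // Placeholder data; in reality this would be loaded from a file.
        String[] words = {"hash map", "bucket", "load factor"};
        Map<String, Integer> m =
                new HashMap<>((int) (words.length / 0.75f) + 1);
        for (int i = 0; i < words.length; i++) {
            m.put(words[i], i);
        }
        PHRASES = m;
    }

    public static void main(String[] args) {
        System.out.println(PHRASES.get("bucket")); // prints 1
    }
}
```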
l*0
Floor 36
Sounds like a good experiment. I will try using one map.