If possible, using a distributed cache will give you better QoS in terms of availability, scalability, and performance. If you have to store them in a single JVM, you'd better write your own hash function by extending HashMap. The default implementation of HashMap is not designed for big data like that.
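In practice you cannot change HashMap's internal hash spreading by subclassing (the spreading method is static in modern JDKs), so "write your own hash function" usually means giving the keys a different `hashCode()`. A minimal sketch of that approach, using an FNV-1a-style mix purely as an illustrative alternative hash (all names here are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: wrap the String key in a class whose hashCode()
// uses a different mixing function than String.hashCode().
final class CustomKey {
    final String s;
    CustomKey(String s) { this.s = s; }

    @Override
    public boolean equals(Object o) {
        return o instanceof CustomKey && ((CustomKey) o).s.equals(s);
    }

    @Override
    public int hashCode() {
        // FNV-1a over the characters; purely illustrative, not a recommendation.
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x01000193;
        }
        return h;
    }
}

public class CustomKeyDemo {
    public static void main(String[] args) {
        Map<CustomKey, Integer> map = new HashMap<>();
        map.put(new CustomKey("hello"), 1);
        System.out.println(map.get(new CustomKey("hello"))); // prints 1
    }
}
```

Whether this helps at all depends on the key distribution; for typical English strings the built-in hash is usually fine.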
g*g
Floor 7
1M is nothing, just do it, don't even bother with a benchmark.
"The default implementation of hashmap is not designed for the big data like that." Any documentation or experiments on this point?
【Quoting s******e】 : if possible, using distributed cache will provide you better QoS of : availability, scalability and performance. : If you have to store them in one single jvm, you'd better write your own : hash function by extending hashmap. The default implementation of hashmap is : not designed for the big data like that.
【Quoting g*****g】 : 1M is nothing, just do it, don't even bother with benchmark.
t*e
Floor 12
When the key space grows big, the probability of hash collisions increases. Java handles hash collisions in a less efficient manner, which can lead to slowness.
Figure out an algorithm to evenly distribute all the keys into 100 slots (maps), so each slot contains 10K records. Consistent hashing in BigTable might give you some inspiration.
【Quoting l******0】 : Do you mean splitting it into two maps?
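The 100-slot idea above can be sketched as a small sharded map that routes each key to one of 100 sub-maps by its hash. The class and method names here are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the "100 slots" idea: each key is routed to one
// of 100 sub-maps by its hash code.
public class ShardedMap {
    private static final int SHARDS = 100;

    @SuppressWarnings("unchecked")
    private final Map<String, String>[] shards = new HashMap[SHARDS];

    public ShardedMap() {
        for (int i = 0; i < SHARDS; i++) {
            shards[i] = new HashMap<>();
        }
    }

    private Map<String, String> shardFor(String key) {
        // floorMod avoids a negative index for negative hash codes
        return shards[Math.floorMod(key.hashCode(), SHARDS)];
    }

    public void put(String key, String value) { shardFor(key).put(key, value); }
    public String get(String key) { return shardFor(key).get(key); }

    public static void main(String[] args) {
        ShardedMap m = new ShardedMap();
        m.put("apple", "fruit");
        System.out.println(m.get("apple")); // prints fruit
    }
}
```

Note that within a single JVM this mostly just moves the bucketing problem up one level; it pays off mainly when the shards can live on different machines, as in the distributed-cache suggestion earlier in the thread.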
w*z
Floor 14
I don't think you are smarter than the people who implemented Java's String hash function. Don't over-engineer things. If you think you have a performance issue, run some benchmarks, see where the bottleneck is, and go from there. As I said, 1M is nothing nowadays.
【Quoting t*******e】 : When key space grows big, the probability of hash collision enlarges. Java : handles hash collision in a less-efficient manner, which could lead to : slowness.
【Quoting w**z】 : I don't think you are smarter than the people who implemented Java string : hash function. don't over engineer things. if you think you have performance : issue, run some benchmark and see where is the bottle neck and go from : there. as I said, 1m is nothing nowadays .
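A crude version of the suggested benchmark might look like the following. This is not a rigorous microbenchmark (no warmup control, and JIT effects will skew the numbers); for serious measurement use a harness such as JMH. The key pattern is made up for the test:

```java
import java.util.HashMap;
import java.util.Map;

// Rough timing sketch: 1M puts and 1M gets on a plain HashMap.
public class MapBench {
    public static void main(String[] args) {
        int n = 1_000_000;
        Map<String, Integer> map = new HashMap<>();

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            map.put("key-" + i, i);
        }
        long putMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += map.get("key-" + i); // sum forces the lookups to happen
        }
        long getMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("puts: " + putMs + " ms, gets: " + getMs
                + " ms, sum=" + sum);
    }
}
```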
w*z
Floor 16
I am talking about the String hash function, not the handling of hash map collisions. By the way, if you run OpenJDK in production, then good luck. Of course, they are smarter than me.
Same issue with Oracle JDK. BTW, you sound pretty naive and emotional. Please be professional, not personal, when discussing technical subjects.
【Quoting w**z】 : I am talking about String hash function, not handling of hash map collision : . by the way, if you run openjdk in production, then good luck. Of course : , they are smarter than me.
w*z
Floor 18
Which part is emotional and not professional?
【Quoting t*******e】 : Same issue with Oracle JDK. BTW, you sound pretty naive and emotional. : Please be professional not personal when discussing technical subjects.
s*e
Floor 19
Rehashing and collisions are the two major issues.
【Quoting l******0】 : "The default implementation of hashmap is not designed for the big data like : that." : Any documentation or experiments on this point?
c*e
Floor 20
Collisions in a hashtable are common; switch to a different algorithm.
【Quoting s******e】 : rehash and collision are two major issues.
s*e
Floor 21
Read the context please. Do not jump into the topic too fast without knowing the situation... we were talking about a big number of string keys.
Well, it seems people have different assumptions. If the initial loading time is not counted and your string keys are DB-id type strings, then HashMap might be fine, since neither rehashing nor collisions will be an issue. But according to the original post, it seems we are talking about building the hashmap over time, on the fly, with random, big strings as the keys. If so, rehashing and collisions will be issues.
g*g
Floor 26
1M is such a small dataset that it doesn't matter how it's built. A 32-bit integer gives you 4 billion buckets; 1M is just 1/4000 of them. You'll see a few collisions here and there, but not enough to cause a performance issue.
【Quoting s******e】 : well. seems that people have the different assumptions. : If the initial loading time is not counted and your string key is db id type : of string, then hashmap might be fine since both rehashing and collision : will not be issue any more. : But according to the original post, it seems that we are talking about : building the hashmap over the time on the fly with random and big strings as : the key. if so, rehashing and collision will be issues.
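The collision claim is easy to check empirically: hash 1M distinct strings and count how many share a full 32-bit hashCode. By the birthday approximation, roughly n^2 / 2^33 (about 116 for n = 1,000,000) collisions are expected if the hash were uniform; a real String.hashCode over a real key set may do better or worse. The key pattern below is invented for the experiment:

```java
import java.util.HashSet;
import java.util.Set;

// Count full 32-bit hashCode collisions among 1M distinct string keys.
public class CollisionCount {
    public static void main(String[] args) {
        int n = 1_000_000;
        Set<Integer> seen = new HashSet<>(n * 2);
        int collisions = 0;
        for (int i = 0; i < n; i++) {
            if (!seen.add(("key-" + i).hashCode())) {
                collisions++;
            }
        }
        System.out.println("collisions among " + n + " keys: " + collisions);
    }
}
```

Note that HashMap's table is far smaller than 2^32 slots (it grows in powers of two), so per-bucket collisions are much more frequent than full 32-bit collisions; the point of the experiment is only to gauge how well the key set spreads.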
s*e
Floor 27
You misunderstand the relationship between the number of buckets and collisions. Having enough buckets does not necessarily translate into fewer collisions. As said, a DB-id type string key (especially a surrogate id), due to the constraints applied, is more likely to be uniformly distributed than random big string keys. Rehashing is another issue because of the way the default implementation of HashMap works. If you build the map on the fly, I remember somebody reported that after 200 thousand string-key entries, rehashing could cause some users to suffer quite a bit. Even if not all users suffer, it should be avoided in any robust system.
【Quoting g*****g】 : 1M is such a small dataset it doesn't matter how it's built. : A 32 bits integer gives your 4 billion buckets, 1M is just 1/4000 of them, : you'll see a few collisions here and there, but not enough to cause : performance issue.
g*g
Floor 28
No, you misunderstand. The number of available buckets is constant; the less data, the less chance of collision, simple as that. Is it possible that 1M entries cause heavy collisions? Possible, but unlikely for most use cases. And when that is the case, it's easy to tune the hash function. The OP didn't mention any random big strings to begin with; try a simple approach and don't over-engineer before you have to.
【Quoting s******e】 : you misunderstand the relationship between #buckets and collision. Having : enough buckets does not necessarily transfer into less collisions. As said, : db id type string (especially surrogate id), due to the constraint applied, : is more likely to be uniformly distributed than random big string keys. : rehashing is another issue because of the way how the default implementation : of hashmap works. if you build the map on the fly, I remember that somebody : reported after 200 thousand string key entries, rehashing could cause some : users to suffer quite a bit. Even not all the users will suffer, but it : should be avoided for any robust systems.
w*z
Floor 29
Java's HashMap implementation does a supplemental hash on top of the hashCode the caller provides: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/ It's not likely you will have a lot of hash collisions for random String keys. Another way to avoid rehashing is to set a bigger initial capacity if you are sure you will have 1M records. I agree with goodbug: at least do some benchmarking before trying to tune anything.
【Quoting g*****g】 : No, you misunderstand. The number of available buckets is constant, the less : data, the less chance for collision, simple as that. : It is possible 1M data causes heavy collision? Possible, but unlikely for : most use cases. And when it's case, it's easy to tune the hashing function. : OP didn't mention any random big string to begin with, try a simple approach : and don't overengineer before you have to.
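On the supplemental hash being referred to: HashMap does not index buckets with the raw `hashCode()` but first spreads the high bits downward, since the bucket index only uses the low bits. In JDK 8 the spreading step is, to my recollection, roughly `h ^ (h >>> 16)` (JDK 7 used a longer sequence of shifts); a sketch of the idea:

```java
// Sketch of HashMap's bit-spreading step (JDK 8 style, from memory):
// XOR the top 16 bits of the hash into the bottom 16, so keys whose
// hash codes differ only in high bits still land in different buckets.
public class SpreadDemo {
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int raw = "example".hashCode();
        System.out.println(Integer.toBinaryString(raw));
        System.out.println(Integer.toBinaryString(spread(raw)));
    }
}
```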
l*m
Floor 30
HashMap will increase its memory allocation when the current allocation is not large enough.
【Quoting w**z】 : Java Hashmap implementation does a supplement hash on the hasCode caller : provides. : http://grepcode.com/file/repository.grepcode.com/java/root/jdk/ : It's not likely you will have a lot of hash collisions for random String Key : Another way to avoid rehashing is to set a bigger initial size if you are : sure you will have 1M records. : I agree with goodbug, at least do some benchmarking before trying to tune : anything.
w*z
Floor 31
At that point, rehashing happens... If you know your size in advance, just set the initial capacity to a bigger number to avoid rehashing.
【Quoting l*******m】 : hashmap will increase the memory allocation when the current allocation is : not large enough
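The presizing trick works because a HashMap rehashes once its size exceeds capacity times load factor (0.75 by default), so requesting expectedSize / 0.75 buckets up front means the table is never resized. A sketch (the helper name is made up):

```java
import java.util.HashMap;
import java.util.Map;

// Presize a HashMap so that expectedSize entries fit without rehashing:
// expectedSize <= capacity * 0.75 must hold, hence the division.
public class PresizedMap {
    static <K, V> Map<K, V> newMapFor(int expectedSize) {
        int capacity = (int) (expectedSize / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = newMapFor(1_000_000);
        for (int i = 0; i < 1_000_000; i++) {
            map.put("key-" + i, i); // no resize happens along the way
        }
        System.out.println(map.size()); // prints 1000000
    }
}
```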
s*e
Floor 32
First, the number of buckets is never constant for the default implementation of HashMap; that is why you have rehashing. Second, collisions depend on all three factors: the input, the hash function, and the number of buckets. Even if you could use all 64 bits of address space for your buckets, you might still have heavy collisions for a biased input. That is why I said DB-id type string keys might work fine with the default HashMap; I believe in such a case the default hash function will distribute them uniformly.
【Quoting g*****g】 : No, you misunderstand. The number of available buckets is constant, the less : data, the less chance for collision, simple as that. : It is possible 1M data causes heavy collision? Possible, but unlikely for : most use cases. And when it's case, it's easy to tune the hashing function. : OP didn't mention any random big string to begin with, try a simple approach : and don't overengineer before you have to.
s*e
Floor 33
When you build your map dynamically, that is a big if. If you can preload, I think even several million records should be fine if the input is not too biased.
【Quoting w**z】 : At that time, rehashing happens... If you know your size in advance, just : set to a bigger number to avoid rehashing.
w*z
Floor 34
That is true; it's a tradeoff between time and space. Only the OP knows his use case.
【Quoting s******e】 : When you build your map dynamically, that is a big if. If you can preload, I : think even several million records should be fine if the input is not that : biased.
l*0
Floor 35
In my case: 1) The keys are English words or phrases, with no repetitions. 2) I have to build this map in a static initialization function. BTW, what do you mean by 'preload'? Is it an initialization?
【Quoting s******e】 : When you build your map dynamically, that is a big if. If you can preload, I : think even several million records should be fine if the input is not that : biased.
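"Preload" here just means building the whole map once, up front (e.g. in exactly the kind of static initializer described) instead of growing it during request handling. A sketch with placeholder word data standing in for the real phrase list:

```java
import java.util.HashMap;
import java.util.Map;

// Preloading sketch: the map is fully built and presized in a static
// initializer, so no rehashing happens after class loading.
public class PhraseTable {
    private static final Map<String, Integer> PHRASES;

    static {
        // Placeholder data; in reality this would be loaded from a file.
        String[] words = {"hash map", "bucket", "load factor"};
        Map<String, Integer> m =
                new HashMap<>((int) (words.length / 0.75f) + 1);
        for (int i = 0; i < words.length; i++) {
            m.put(words[i], i);
        }
        PHRASES = m;
    }

    public static void main(String[] args) {
        System.out.println(PHRASES.get("bucket")); // prints 1
    }
}
```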
l*0
Floor 36
Sounds like a good experiment. I will try using one map.