问一道google题 - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>JobHunting - 待字闺中

问一道google题

问一道google题# JobHunting - 待字闺中

s*t2011-01-31 08:01

1 楼

一个单词的列表，要找出两个单词，它们没有相同的字母出现，而且长度乘积最大
原帖在这里：
http://mitbbs.com/article_t/JobHunting/31744763.html
感觉没有哪个解法比较好。

f*42011-01-31 08:01

2 楼

有比O(n^2)更快的办法咩？

c*t2011-01-31 08:01

3 楼

我觉得lolhaha的算法挺好，就是不知道怎么算复杂度。

【在 s*******t 的大作中提到】

: 一个单词的列表，要找出两个单词，它们没有相同的字母出现，而且长度乘积最大
: 原帖在这里：
: http://mitbbs.com/article_t/JobHunting/31744763.html
: 感觉没有哪个解法比较好。

i*e2011-01-31 08:01

4 楼

lolhaha 的算法的复杂度是 O(N^2).
应该没有比 O(N^2) 更好的算法了，
但是有一个优化可以把算法运行速度大大提升。
根据 lolhaha 的算法，
1.create hash for each word
2.sort the dictionary by word length.
assume the dictionary become
a[0],a[1]......
find the largest k so that hash(a[0])&hash(a[k])==0,
get max=len(a[0])*len(a[k])
find max i in [2,k-1] so that hash(a[1])&hash(a[i])==0,
compare max and len(a[1])*len(a[i])
next step find max j in [3,i-1]....
Using this online word list:
http://www.sil.org/linguistics/wordlists/english/wordlist/words
(containing total 109562 valid words with characters a-z (deleting 20
invalid words with apostrophe ')),
the run time is 20 seconds using lolhaha's algorithm on my machine.
However, by just adding this condition before the inner loop:
if (a[j].length * a[j].length < max) break; // stop comparing
the run time is reduced to less than 1 second!!!
If you're interested, the answer using the above word list is:
microphotographic defenselessness 255
photomicrographic defenselessness 255
Isn't this interesting?
It is able to reduce the run time by such a large factor because the
dictionary only contains a handful of long words, compared to the majority
words which are shorter. Since the word list is sorted by descending order
in length, the max is likely to be found during the beginning of loop and
could be terminated earlier.
一些常见面试题的答案与总结 -
http://www.ihas1337code.com

【在 c********t 的大作中提到】

: 我觉得lolhaha的算法挺好，就是不知道怎么算复杂度。

s*t2011-01-31 08:01

5 楼

then I guess how to design the hash function will be key issue here.
someone suggest use bitset to calculate the hash, however in some word,
there are duplicated characters which maybe an issue.

【在 i**********e 的大作中提到】

: lolhaha 的算法的复杂度是 O(N^2).
: 应该没有比 O(N^2) 更好的算法了，
: 但是有一个优化可以把算法运行速度大大提升。
: 根据 lolhaha 的算法，
: 1.create hash for each word
: 2.sort the dictionary by word length.
: assume the dictionary become
: a[0],a[1]......
: find the largest k so that hash(a[0])&hash(a[k])==0,
: get max=len(a[0])*len(a[k])

i*e2011-01-31 08:01

6 楼

Duplicated characters is not an issue here.
It is only required that characters that appeared in word A do not appear in
word B.
Therefore, assuming a word can contain only characters from 'a' to 'z' all
lowercase, then each character can be mapped to a bit in an integer using
only 26 bits.
If 'a' appeared in a word, bit 1 is set. If 'z' appeared, bit 26 is set.
This integer will form the hash for a word.
To find out if all characters that are in word A are not in word B, just
simply do bit AND operation of their hash. Duplicate characters will not
cause an issue here.
一些常见面试题的答案与总结 -
http://www.ihas1337code.com

s*t2011-01-31 08:01

7 楼

Thanks. I understand now. Even two words may have the same hash value, it
doesn't affect the final result.

in

【在 i**********e 的大作中提到】

: Duplicated characters is not an issue here.
: It is only required that characters that appeared in word A do not appear in
: word B.
: Therefore, assuming a word can contain only characters from 'a' to 'z' all
: lowercase, then each character can be mapped to a bit in an integer using
: only 26 bits.
: If 'a' appeared in a word, bit 1 is set. If 'z' appeared, bit 26 is set.
: This integer will form the hash for a word.
: To find out if all characters that are in word A are not in word B, just
: simply do bit AND operation of their hash. Duplicate characters will not

g*s2011-01-31 08:01

8 楼

good observation.
however, how can you ensure this hash is evenly distributed?

in

【在 i**********e 的大作中提到】

i*e2011-01-31 08:01

9 楼

Do you mean hash table?
This 'hash' is not a hash table. It is like a signature for a word which
contains just enough information of the set of characters that appeared in
the word.
一些常见面试题的答案与总结 -
http://www.ihas1337code.com

【在 g*********s 的大作中提到】

: good observation.
: however, how can you ensure this hash is evenly distributed?
:
: in

c*t2011-01-31 08:01

10 楼

太赞了，我觉得还可以优化的就是，不需要一开始就把所有单词的hashcode算出来储存
。不如用到的时候才算出来存。这样因为a[j].length*a[j].length用算了。

【在 i**********e 的大作中提到】

z*o2011-01-31 08:01

11 楼

1. use an integer Mark[i] to represent the character set for word[i] in the
dict; use lower 26 bits, each represent whether a certain character appears
in the word.
(Mark[i] & Mark[j]) == 0 means word[i] and
word[j] are disjoint.
2. at the mean time of calculating the Mark[i], we update the maximum-length
word index in
Longest[(~Mark[i]) & 0x03ffffff];
this means eg.
if word[i] = "abcdefg........uvwabcdefg.....uvw", Longest[(~
Mark[i]) & 0x03ffffff] = i :
the longest word contains and only contains {a,b,c,....,u,v,w} is
word[i] with length 46.
3. we modify the definition of Longest[mark] from
"longest word contains and only contains all unmarked
characters" to
"longest word contains at most all unmarked characters".
To modify the values for Longest[mark], we update Longest[ mark] using
all the Longest[ mark' ] values, where:
mark' = turn one 1 in mark into 0;
if we update the Longest[ mark] using "number of 1s in mark" as order,
increasingly, every Longest[mark] will only be recalculated once.
4. using queue to recalculate every "updated" Longest[mark], instead of
recalculate every Longest[mark] can sometimes dramatically reduce the
calculation. But complexity keeps the same.
5. For complexity: 2^26*26+O(n*m), which is not a good bit-O representation
but more useful.
6. I wrote it just practicing to explain something in English. It sucks to
me.