问一下Repeated DNA Sequences - 未名空间MITBBS历史存档

国际科技财经博客移民网络热点娱乐民生时事公众号

Redian新闻

>未名空间

>JobHunting - 待字闺中

问一下Repeated DNA Sequences

问一下Repeated DNA Sequences# JobHunting - 待字闺中

a*e2015-02-09 08:02

1 楼

自己做的总是memory exceeded，
看了讨论，
https://oj.leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c
还是不懂为啥m[t = t << 3 & 0x3FFFFFFF | s[i] & 7]++ == 1就能达到rolling hash
的效果呢?
i只是一个char啊，这个是哪里推导出来的？多谢！
Neat idea. The additional 1 bit per letter still encode each substring in
10x3 = 30 bits, just under 4 bytes for a 32-bit integer.
Your code could be further simplified. By observing that s[i] & 7 is never 0
, each of the first nine substrings with length < 10 will have unique hash
key and will never collide with other 10-letter long sequences. Therefore
the first loop could be removed and be compacted into a single loop.
vector findRepeatedDnaSequences(string s) {
unordered_map m;
vector r;
for (int t = 0, i = 0; i < s.size(); i++)
if (m[t = t << 3 & 0x3FFFFFFF | s[i] & 7]++ == 1)
r.push_back(s.substr(i - 9, 10));
return r;
}

s*a2015-02-09 08:02

2 楼

因为他说ACTG在ASC里面后三位不一样吧
其实你就说A001 C010 T011 G100啥都没有是000 一样的
无非赶得巧省几行代码

p*e2015-02-09 08:02

3 楼

你不一定非要用这种方法，一个字母用2个bit表示，就不会memory exceed了
m[t = t << 3 & 0x3FFFFFFF | s[i] & 7]++ == 1
这里的s[i] & 7 是取下一个字母N，
t << 3 & 0x3FFFFFFF 是保留当前10-letter string的后9个letter S,
| 是把 N 合并到S中，这样就更新了key