A design question from company L: distributed inverted index design - MITBBS (未名空间) historical archive


A design question from company L: distributed inverted index design # JobHunting (待字闺中)

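s*m 2015-03-19 07:03

Post 1

I got an inverted index question: there is a huge pile of docs, and you need to build an inverted
index over the words that appear in them. There are so many docs that they are distributed across
many machines. How would you implement this inverted index?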

g*g 2015-03-19 07:03

Post 2

Cassandra is a perfect DB for illustration. Each row maps a word to a
list of doc ids. The doc id can be a UUID or a URL, as long as it's
unique. For each index row, the row key (the word) is hashed and the row is
replicated, so you can have N copies in the cluster and the keys are
distributed evenly. You may also use a timestamp etc. to arrange your index
row, so that you can optionally run a time range query, which is very
common in such designs.

[Quoted from s*******m's post:]

: I got an inverted index question: there is a huge pile of docs, and you need to build an inverted
: index over the words that appear in them. There are so many docs that they are distributed across
: many machines. How would you implement this inverted index?
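A minimal sketch of the layout described in post 2, under stated assumptions: plain Python dicts stand in for cluster nodes, and the names `ReplicatedIndex`, `replicas_for`, `num_nodes`, and `replication_factor` are hypothetical and are not Cassandra's API. It only illustrates hashing the row key (the word) to pick a node and writing N copies.

```python
import hashlib
from collections import defaultdict


class ReplicatedIndex:
    """Toy cluster: each node is a dict mapping word -> set of doc ids."""

    def __init__(self, num_nodes=4, replication_factor=2):
        self.nodes = [defaultdict(set) for _ in range(num_nodes)]
        self.replication_factor = replication_factor

    def replicas_for(self, word):
        # Hash the row key (the word), then take the next N nodes in order.
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replication_factor)]

    def add(self, word, doc_id):
        # Write the posting to every replica of this word's row.
        for node_id in self.replicas_for(word):
            self.nodes[node_id][word].add(doc_id)

    def lookup(self, word):
        # Any replica could serve the read; this sketch asks the first one.
        node_id = self.replicas_for(word)[0]
        return set(self.nodes[node_id].get(word, set()))


index = ReplicatedIndex(num_nodes=4, replication_factor=2)
index.add("cassandra", "doc-1")
index.add("cassandra", "doc-2")
index.add("index", "doc-2")
print(sorted(index.lookup("cassandra")))  # ['doc-1', 'doc-2']
```

The modulo placement here is a simplification for readability; it is not how a real cluster assigns rows to nodes.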

s*m 2015-03-19 07:03

Post 3

Thanks.
A beginner question: does a key-value database like Cassandra
have an index internally? For example, if I look up a key, will it complete quickly?

[Quoted from g*****g's post:]

: Cassandra is a perfect DB for illustration. Each row maps a word to a
: list of doc ids. The doc id can be a UUID or a URL, as long as it's
: unique. For each index row, the row key (the word) is hashed and the row is
: replicated, so you can have N copies in the cluster and the keys are
: distributed evenly. You may also use a timestamp etc. to arrange your index
: row, so that you can optionally run a time range query, which is very
: common in such designs.

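p*2 2015-03-19 07:03

Post 4

Looking up a key is fast.
Beyond that there is basically no index.
But isn't an inverted index usually kept in memory? I would probably just use Redis for it.

[Quoted from s*******m's post:]

: Thanks.
: A beginner question: does a key-value database like Cassandra
: have an index internally? For example, if I look up a key, will it complete quickly?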

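s*m 2015-03-19 07:03

Post 5

Can the app layer know the key that Cassandra generates?
If the database is distributed, this key is needed for consistent hashing, to find which node
holds the data.
Is my understanding correct?
If lookups are fast, does that mean a NoSQL database doesn't need a cache layer like memcache?

[Quoted from p*****2's post:]

: Looking up a key is fast.
: Beyond that there is basically no index.
: But isn't an inverted index usually kept in memory? I would probably just use Redis for it.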
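For readers unfamiliar with the consistent hashing mentioned in post 5, here is a rough sketch of the general technique. The names `HashRing`, `node_for`, `vnodes`, and the node labels are made up for illustration, and this does not claim to reproduce any particular database's partitioner.

```python
import bisect
import hashlib


def _hash(key):
    # Map any string to a large integer position on the ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node owns several points ("virtual nodes") on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
words = [f"word{i}" for i in range(1000)]
moved = [w for w in words if ring.node_for(w) != bigger.node_for(w)]
# Keys that change owner after adding node-d all land on node-d.
print(all(bigger.node_for(w) == "node-d" for w in moved))
```

The point of the ring is that adding a node only takes keys away from existing nodes and hands them to the new one; keys never shuffle between the old nodes.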

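s*m 2015-03-19 07:03

Post 6

One more question.
Does a key-value database have a notion of objects?
For example, a person has key = 1, value = ......
and an animal also has key = 1, value = .......

[Quoted from p*****2's post:]

: Looking up a key is fast.
: Beyond that there is basically no index.
: But isn't an inverted index usually kept in memory? I would probably just use Redis for it.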
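One common way to handle the person/animal collision raised in post 6, sketched under assumptions: in a flat key-value store the usual convention is to prefix the key with the entity type, while a store with tables would use a separate table per type. The dict `store` and the helper `make_key` are hypothetical stand-ins, not any database's API.

```python
store = {}  # stand-in for a flat key-value store


def make_key(kind, entity_id):
    # Namespace the id with its entity type so ids from different types never collide.
    return f"{kind}:{entity_id}"


store[make_key("person", 1)] = {"name": "Alice"}
store[make_key("animal", 1)] = {"species": "cat"}

print(store["person:1"], store["animal:1"])
```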

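h*0 2015-03-19 07:03

Post 7

CREATE TABLE invertedIndex (
    word text,
    positions list<text>,  -- element type assumed; it was lost in the original post
    PRIMARY KEY (word)
);
With a distributed database you don't have to work out which node the data is on yourself;
otherwise it would be far too painful to use...

[Quoted from s*******m's post:]

: Can the app layer know the key that Cassandra generates?
: If the database is distributed, this key is needed for consistent hashing, to find which node
: holds the data.
: Is my understanding correct?
: If lookups are fast, does that mean a NoSQL database doesn't need a cache layer like memcache?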

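b*5 2015-03-19 07:03

Post 8

Damn it, let me complain once more: plenty of people whose concepts are this fuzzy (no offense)
get hired by FLG, while someone like me gets rejected everywhere...

[Quoted from s*******m's post:]

: One more question.
: Does a key-value database have a notion of objects?
: For example, a person has key = 1, value = ......
: and an animal also has key = 1, value = .......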

g*v 2015-03-19 07:03

Post 9

Can't this problem be done with MapReduce?

c*z 2015-03-19 07:03

Post 10

Mark.

g*g 2015-03-19 07:03

Post 11

The app doesn't need to know. It knows the keyword, which is a unique word; it
doesn't need to know the hash value. Cassandra can cache rows in memory, so for
access you don't need memcache. But memcache can be convenient for
other things, like caching a rich object in memory, which you don't do in
NoSQL.

[Quoted from s*******m's post:]

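h*0 2015-03-19 07:03

Post 12

Keep grinding practice problems. Also, if you do better than others on system design, your level will be a bit higher when the offer comes.

[Quoted from b**********5's post:]

: Damn it, let me complain once more: plenty of people whose concepts are this fuzzy (no offense)
: get hired by FLG, while someone like me gets rejected everywhere...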

h*0 2015-03-19 07:03

Post 13

好虫, great master, what is a rich object? Could you give an example?

[Quoted from g*****g's post:]

: The app doesn't need to know. It knows the keyword, which is a unique word; it
: doesn't need to know the hash value. Cassandra can cache rows in memory, so for
: access you don't need memcache. But memcache can be convenient for
: other things, like caching a rich object in memory, which you don't do in
: NoSQL.

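b*5 2015-03-19 07:03

Post 14

I do grind, day and night. Then in an interview I get asked how to generate a random bejewel.
What am I supposed to do? I basically solved it, I think, but didn't write it all out. What am I supposed to do?
Then I interview at a second-tier company and solve every problem, and as I'm leaving the
interviewer says "we will get back to u very soon"... Two weeks go by, I email to ask, and they
don't even bother to reply.

[Quoted from h*******0's post:]

: Keep grinding practice problems. Also, if you do better than others on system design, your level will be a bit higher when the offer comes.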

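h*0 2015-03-19 07:03

Post 15

Luck plays a pretty big part in interviews. Hang in there. If it really doesn't work out, go to a non-FLG company as a stepping stone.

[Quoted from b**********5's post:]

: I do grind, day and night. Then in an interview I get asked how to generate a random bejewel.
: What am I supposed to do? I basically solved it, I think, but didn't write it all out. What am I supposed to do?
: Then I interview at a second-tier company and solve every problem, and as I'm leaving the
: interviewer says "we will get back to u very soon"... Two weeks go by, I email to ask, and they
: don't even bother to reply.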

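b*5 2015-03-19 07:03

Post 16

In that case I might as well stay home and sell myself as a freelancer.

[Quoted from h*******0's post:]

: Luck plays a pretty big part in interviews. Hang in there. If it really doesn't work out, go to a non-FLG company as a stepping stone.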

h*0 2015-03-19 07:03

Post 17

I feel MapReduce isn't a good fit here.

[Quoted from g****v's post:]

: Can't this problem be done with MapReduce?

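b*5 2015-03-19 07:03

Post 18

The original question asks how to build this inverted index, yet everyone is discussing how to
store this inverted index...
wordID,
isn't this exactly basic MapReduce?

[Quoted from h*******0's post:]

: I feel MapReduce isn't a good fit here.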

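h*0 2015-03-19 07:03

Post 19

I misread the question... this guy asked several questions. Still, MapReduce has a lot of
overhead; running Hadoop all over again every time a new doc is added would be quite painful.

[Quoted from b**********5's post:]

: The original question asks how to build this inverted index, yet everyone is discussing how to
: store this inverted index...
: wordID,
: isn't this exactly basic MapReduce?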

g*g 2015-03-19 07:03

Post 20

Think of it as a JSON object, a doc. Anything that's a value and too big to
fit into the C* row cache.

[Quoted from h*******0's post:]

: 好虫, great master, what is a rich object? Could you give an example?
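To make the "rich object in a cache" idea from posts 11 and 20 concrete, here is a small cache-aside sketch. It assumes a dict as a stand-in for memcache and a hypothetical `load_doc` function as the slow backing-store read; `DocCache` and the sample document fields are invented for illustration.

```python
import json


class DocCache:
    def __init__(self, load_doc):
        self._load_doc = load_doc  # hypothetical slow read from the backing store
        self._cache = {}           # stand-in for memcache: key -> serialized JSON
        self.misses = 0

    def get(self, doc_id):
        key = f"doc:{doc_id}"
        cached = self._cache.get(key)
        if cached is None:
            # Cache miss: load the full object, then keep a serialized copy.
            self.misses += 1
            doc = self._load_doc(doc_id)
            self._cache[key] = json.dumps(doc)
            return doc
        return json.loads(cached)


def load_doc(doc_id):
    # Pretend this is an expensive fetch of a large JSON document.
    return {"id": doc_id, "title": "Distributed inverted index", "tags": ["design", "index"]}


cache = DocCache(load_doc)
first = cache.get("doc-1")
second = cache.get("doc-1")
print(first == second, cache.misses)  # True 1
```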

g*g 2015-03-19 07:03

Post 21

How is this MapReduce? It's just an index. Everybody knows what an
inverted index is; the question is how to implement it in a distributed
system so that it can scale.

[Quoted from b**********5's post:]

: The original question asks how to build this inverted index, yet everyone is discussing how to
: store this inverted index...
: wordID,
: isn't this exactly basic MapReduce?

g*v 2015-03-19 07:03

Post 22

map:
(word, docID)
reduce:
(word, docID1, docID2....)
Isn't this a classic MapReduce application? Please enlighten me.

[Quoted from g*****g's post:]

: How is this MapReduce? It's just an index. Everybody knows what an
: inverted index is; the question is how to implement it in a distributed
: system so that it can scale.
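The map and reduce steps outlined in post 22 can be sketched in a single process as follows. This is only an illustration of the data flow, not a Hadoop job: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, the sample docs are made up, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    # Emit one (word, doc_id) pair per distinct word in the doc.
    for word in sorted(set(text.lower().split())):
        yield word, doc_id


def shuffle(pairs):
    # Group all emitted doc ids by word, as the framework would between phases.
    groups = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups


def reduce_phase(word, doc_ids):
    # Collapse the grouped values into one sorted posting list per word.
    return word, sorted(set(doc_ids))


docs = {
    "doc1": "distributed inverted index",
    "doc2": "inverted index design",
}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())
print(index["inverted"])  # ['doc1', 'doc2']
```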

g*g 2015-03-19 07:03

Post 23

If you are taking counts, it can be MapReduce; otherwise, what are you
reducing in an inverted index?

[Quoted from g****v's post:]

: map:
: (word, docID)
: reduce:
: (word, docID1, docID2....)
: Isn't this a classic MapReduce application? Please enlighten me.

d*a 2015-03-19 07:03

Post 24

http://grids.ucs.indiana.edu/ptliupages/publications/Scalable%2

[Quoted from g*****g's post:]

: How is this MapReduce? It's just an index. Everybody knows what an
: inverted index is; the question is how to implement it in a distributed
: system so that it can scale.

b*5 2015-03-19 07:03

Post 25

All I'm saying is that the original problem is: you only have some HDFS files, and you need to
build this inverted index. You can store the inverted index in HBase or in Cassandra; either works.

[Quoted from g*****g's post:]

: If you are taking counts, it can be MapReduce; otherwise, what are you
: reducing in an inverted index?

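x*u 2015-03-19 07:03

Post 26

That is only the first step. Then what?
How do you store it? How do you partition? How do you scale? How do you update? How do you guarantee availability?
The interviewer may also extend the question: what if search by conditions is required? How do you do real-time updates?
In a design question you can only follow the interviewer's lead and see what they want to ask,
although if you are really strong and can cover everything from start to finish without a gap,
even better.

[Quoted from g****v's post:]

: map:
: (word, docID)
: reduce:
: (word, docID1, docID2....)
: Isn't this a classic MapReduce application? Please enlighten me.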
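One possible reading of the "search by conditions" follow-up in post 26 is a multi-word AND query. A rough sketch of that case, assuming each word's posting list is kept sorted; `intersect`, `search_all`, and the sample index are illustrative names and data, not something the thread specifies.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_all(index, words):
    # Fetch each word's posting list; a missing word yields an empty list.
    postings = [index.get(word, []) for word in words]
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list to keep work small
    result = postings[0]
    for posting in postings[1:]:
        result = intersect(result, posting)
    return result


index = {"distributed": [1, 4, 7], "inverted": [1, 2, 4, 9], "index": [1, 2, 3, 4, 9]}
print(search_all(index, ["inverted", "index", "distributed"]))  # [1, 4]
```

In a partitioned index the posting lists for different words may live on different nodes, so the fetch step would become one lookup per word before the intersection runs.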

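b*5 2015-03-19 07:03

Post 27

How to store it: just store it in Cassandra or HBase. HBase and Cassandra take care of
partitioning and scaling for you. You can talk about the architecture of HBase and Cassandra.
A real-time update is just a lookup, overwrite, or insert into your NoSQL table...

[Quoted from x****u's post:]

: That is only the first step. Then what?
: How do you store it? How do you partition? How do you scale? How do you update? How do you guarantee availability?
: The interviewer may also extend the question: what if search by conditions is required? How do you do real-time updates?
: In a design question you can only follow the interviewer's lead and see what they want to ask,
: although if you are really strong and can cover everything from start to finish without a gap,
: even better.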
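The "lookup, overwrite, insert" update path from post 27 might look roughly like this in memory. One assumption goes beyond the thread: a forward map from doc id to its indexed words (`doc_words`) is kept so that stale postings can be removed when a doc changes. `LiveIndex` and `upsert` are hypothetical names.

```python
from collections import defaultdict


class LiveIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word -> doc ids
        self.doc_words = {}               # doc id -> words currently indexed (assumed forward map)

    def upsert(self, doc_id, text):
        new_words = set(text.lower().split())
        old_words = self.doc_words.get(doc_id, set())  # lookup
        for word in old_words - new_words:             # drop stale postings
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:             # insert new postings
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = new_words             # overwrite the forward entry

    def lookup(self, word):
        return sorted(self.postings.get(word, set()))


live = LiveIndex()
live.upsert("doc1", "distributed inverted index")
live.upsert("doc2", "inverted index design")
live.upsert("doc1", "distributed cache design")  # doc1 is re-indexed in place
print(live.lookup("inverted"), live.lookup("design"))  # ['doc2'] ['doc1', 'doc2']
```

Only the words that changed are touched, which avoids the full rebuild that post 19 complains about.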