Big Data's impact on the job market? - MITBBS (未名空间) archive

Board: Java (爪哇娇娃)

w*n 2013-07-07 07:07

#1

I don't know big data. Could the experts weigh in: will job ads soon routinely require
Hadoop, HBase and the like, the way Spring/Hibernate are basically mandatory today?

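z*e 2013-07-07 07:07

#2

Yes.
You should find a chance to get hands-on with the following two products as early as you can, get familiar with them, and eventually master them:
hadoop
cassandra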

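g*e 2013-07-07 07:07

#3

Do any top-tier companies use Cassandra now? Even FB has basically abandoned it; it's all HBase now.

[Quoting z****e:]

: Yes.
: You should find a chance to get hands-on with the following two products as early as you can, get familiar with them, and eventually master them:
: hadoop
: cassandra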

w*z 2013-07-07 07:07

#4

Netflix, eBay.

[Quoting g**e:]

: Do any top-tier companies use Cassandra now? Even FB has basically abandoned it; it's all HBase now.

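g*e 2013-07-07 07:07

#5

In a slide deck I saw a while ago, eBay's HBase deployment had far more nodes than its Cassandra one (thousands vs. hundreds).
I don't know much about Netflix; waiting for goodbug to explain.

[Quoting w**z:]

: Netflix, eBay.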

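z*e 2013-07-07 07:07

#6

HBase isn't mature enough yet. Its version number hasn't even reached 1, which means there is no official release yet,
so using it is too risky.
FB uses HBase because Cassandra used to be not very compatible with Hadoop,
or rather not as native to it as HBase is, so integrating the two was a hassle.
Apache has now started integrating Cassandra with Hadoop,
so going with Cassandra is not a bad outcome, and Cassandra also works quite well on its own.
HBase on its own is a hassle.

[Quoting g**e:]

: Do any top-tier companies use Cassandra now? Even FB has basically abandoned it; it's all HBase now.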

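g*g 2013-07-07 07:07

#7

HBase is basically meant to be used with Hadoop, while Cassandra is a general-purpose database; their uses are not quite the same.

[Quoting g**e:]

: In a slide deck I saw a while ago, eBay's HBase deployment had far more nodes than its Cassandra one (thousands vs. hundreds).
: I don't know much about Netflix; waiting for goodbug to explain.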

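t*e 2013-07-07 07:07

#8

That's the opposite of my impression. HBase has a SPOF, and it also doesn't scale as well as
Cassandra. In HBase, each data region has only one region server handling its reads and writes.
In Cassandra, every node in a replica set can handle reads and writes. HBase is a master-slave
topology; Cassandra is peer-to-peer. That said, if MongoDB is good enough, it may still be the
easier one to use, since it supports ad hoc queries. Cassandra 0.7 added native indexing, which
basically amounts to supporting ad hoc queries.
Also, Cassandra is modeled after Amazon Dynamo; it is not FB technology.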
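
To make the "every node in a replica set can handle reads and writes" point concrete, here is a minimal sketch of Dynamo-style replica placement on a hash ring. It is an illustration only, not Cassandra's actual code: the MD5 hash, the node names, the `Ring` class and the replication factor of 3 are all assumptions chosen for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: a key is owned by the next `replication_factor` nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key: str):
        """Return the distinct nodes that hold a copy of row_key."""
        n = len(self.entries)
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token(row_key)) % n
        return [self.entries[(start + i) % n][1] for i in range(min(self.rf, n))]


# Hypothetical five-node cluster.
ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)  # three distinct nodes; any of them may serve this key
```

In the single-owner model described for HBase, the equivalent lookup would return exactly one server per region, which is the contrast the post is drawing.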

[Quoting g**e:]

: Do any top-tier companies use Cassandra now? Even FB has basically abandoned it; it's all HBase now.

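g*e 2013-07-07 07:07

#9

Dynamo is something I actually use every day, and it has plenty of problems. The worst one lately is stale cache nodes, which directly push
put latency on some tables to p50 > 50 ms and p90 > 5000 ms. And they won't fix it.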
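
For readers unfamiliar with the p50/p90 notation: these are latency percentiles. A rough sketch of how they can be computed with the nearest-rank method follows; the sample latencies are made up for illustration and are not measurements from any real system.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.

    `p` is an integer between 1 and 100.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up put latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [12, 15, 18, 20, 22, 60, 75, 90, 4800, 5200]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
print(p50, p90)  # 22 4800
```

A small p50 with a huge p90 means most requests are fine but roughly one in ten is very slow, which is why tail percentiles are reported separately from the median.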

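[Quoting t*******e:]

: That's the opposite of my impression. HBase has a SPOF, and it also doesn't scale as well as
: Cassandra. In HBase, each data region has only one region server handling its reads and writes.
: In Cassandra, every node in a replica set can handle reads and writes. HBase is a master-slave
: topology; Cassandra is peer-to-peer. That said, if MongoDB is good enough, it may still be the
: easier one to use, since it supports ad hoc queries. Cassandra 0.7 added native indexing, which
: basically amounts to supporting ad hoc queries.
: Also, Cassandra is modeled after Amazon Dynamo; it is not FB technology.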

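t*e 2013-07-07 07:07

#10

That was a fast reply.

[Quoting g**e:]

: Dynamo is something I actually use every day, and it has plenty of problems. The worst one lately is stale cache nodes, which directly push
: put latency on some tables to p50 > 50 ms and p90 > 5000 ms. And they won't fix it.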

g*e 2013-07-07 07:07

#11

I just happened to see it...

[Quoting t*******e:]

: That was a fast reply.

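w*z 2013-07-07 07:07

#12

Cassandra is a combination of Dynamo and Google Bigtable: easier to configure
than HBase, with no single point of failure. But its integration with Hadoop is
not good.

[Quoting t*******e:]

: That's the opposite of my impression. HBase has a SPOF, and it also doesn't scale as well as
: Cassandra. In HBase, each data region has only one region server handling its reads and writes.
: In Cassandra, every node in a replica set can handle reads and writes. HBase is a master-slave
: topology; Cassandra is peer-to-peer. That said, if MongoDB is good enough, it may still be the
: easier one to use, since it supports ad hoc queries. Cassandra 0.7 added native indexing, which
: basically amounts to supporting ad hoc queries.
: Also, Cassandra is modeled after Amazon Dynamo; it is not FB technology.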

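p*2 2013-07-07 07:07

#13

I just took a quick look, and Cassandra's outlook doesn't seem great. Supposedly Facebook itself doesn't use it much,
and Twitter has stopped using it too.

[Quoting z****e:]

: Yes.
: You should find a chance to get hands-on with the following two products as early as you can, get familiar with them, and eventually master them:
: hadoop
: cassandra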

r*k 2013-07-07 07:07

#14

HBase has no SPOF, for sure.
Where did you hear that? Are you saying Facebook's system isn't scalable?
This is by design, to guarantee strong consistency. HBase chooses CP of CAP
and Cassandra chooses AP of CAP. Those are their design choices.
Cassandra's gossip protocol sounds ideal, but the real throughput would be a
big issue. That's why FB dropped Cassandra and adopted HBase in their
production systems.
I don't get it: is there a NoSQL store that doesn't support ad hoc queries? Do you mean secondary indexing?
Distributed indexing for a large-scale distributed DB is not that easy.
Cassandra was started at Facebook.

[Quoting t*******e:]

: That's the opposite of my impression. HBase has a SPOF, and it also doesn't scale as well as
: Cassandra. In HBase, each data region has only one region server handling its reads and writes.
: In Cassandra, every node in a replica set can handle reads and writes. HBase is a master-slave
: topology; Cassandra is peer-to-peer. That said, if MongoDB is good enough, it may still be the
: easier one to use, since it supports ad hoc queries. Cassandra 0.7 added native indexing, which
: basically amounts to supporting ad hoc queries.
: Also, Cassandra is modeled after Amazon Dynamo; it is not FB technology.

r*k 2013-07-07 07:07

#15

I heard in a conference keynote given by an FB engineer that Facebook
is not using Cassandra for ANY of their products; it's all MySQL, memcache
and HBase now.
Twitter tried out Cassandra in 2010 but failed (just like Digg). They're now
using Redis + MySQL for their tweets and are investigating HBase
(as told by their engineering director).

[Quoting p*****2:]

: I just took a quick look, and Cassandra's outlook doesn't seem great. Supposedly Facebook itself doesn't use it much,
: and Twitter has stopped using it too.

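r*k 2013-07-07 07:07

#16

The version number has little to do with maturity. In 2010 Digg's VP of engineering got fired because
Cassandra failed their whole system. Cassandra should have been well past 1.0 by then.
http://www.neowin.net/news/digg-vp-of-engineering-fired-after-v
That's not the main reason. It's covered at https://www.facebook.com/UsingHbase; I can't be bothered to look it up, but it was mainly
write throughput.
No such thing. Integrate how? Cassandra on HDFS? MapReduce optimization on Cassandra?
Loosely gluing them together is fine, but real integration is impossible.

[Quoting z****e:]

: HBase isn't mature enough yet. Its version number hasn't even reached 1, which means there is no official release yet,
: so using it is too risky.
: FB uses HBase because Cassandra used to be not very compatible with Hadoop,
: or rather not as native to it as HBase is, so integrating the two was a hassle.
: Apache has now started integrating Cassandra with Hadoop,
: so going with Cassandra is not a bad outcome, and Cassandra also works quite well on its own.
: HBase on its own is a hassle.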

b*y 2013-07-07 07:07

#17

Good thread. Looks like MySQL is still pretty reliable.

p*2 2013-07-07 07:07

#18

Many thanks to the expert.

[Quoting r*******k:]

: I heard in a conference keynote given by an FB engineer that Facebook
: is not using Cassandra for ANY of their products; it's all MySQL, memcache
: and HBase now.
: Twitter tried out Cassandra in 2010 but failed (just like Digg). They're now
: using Redis + MySQL for their tweets and are investigating HBase
: (as told by their engineering director).

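t*e 2013-07-07 07:07

#19

Brisk, i.e. Hadoop on CFS, though that isn't done by Apache. Besides, Cassandra can work
natively as a Hadoop data source or sink.

: Apache has now started integrating Cassandra with Hadoop
No such thing. Integrate how? Cassandra on HDFS? MapReduce optimization on Cassandra?
Loosely gluing them together is fine, but real integration is impossible.

[Quoting r*******k:]

: The version number has little to do with maturity. In 2010 Digg's VP of engineering got fired because
: Cassandra failed their whole system. Cassandra should have been well past 1.0 by then.
: http://www.neowin.net/news/digg-vp-of-engineering-fired-after-v
: That's not the main reason. It's covered at https://www.facebook.com/UsingHbase; I can't be bothered to look it up, but it was mainly
: write throughput.
: No such thing. Integrate how? Cassandra on HDFS? MapReduce optimization on Cassandra?
: Loosely gluing them together is fine, but real integration is impossible.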

t*e 2013-07-07 07:07

#20

NameNode is a SPOF.
Cassandra is more efficient with respect to scalability.
Cassandra allows you to trade off between consistency and availability.
The consistency level can be tuned for each read/write operation.
Key-value stores and column-family-based NoSQL fall short on ad hoc query
capability.
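
The per-operation trade-off mentioned above comes down to simple arithmetic in Dynamo-style systems: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. A minimal sketch of that rule follows; the function names and the dictionary labels (written as write level / read level) are just for this illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """True when every set of r read replicas overlaps every set of w write replicas."""
    return r + w > n


N = 3  # replication factor assumed for the example
settings = {
    "ONE/ONE": (1, 1),                        # fastest, but a read may miss the latest write
    "QUORUM/QUORUM": (quorum(N), quorum(N)),  # 2 + 2 > 3, so reads overlap writes
    "ALL/ONE": (N, 1),                        # consistent reads, but a write fails if any replica is down
}
results = {name: reads_see_latest_write(N, w, r) for name, (w, r) in settings.items()}
print(results)  # {'ONE/ONE': False, 'QUORUM/QUORUM': True, 'ALL/ONE': True}
```

Lower levels favor availability and latency; higher levels favor consistency, which is the consistency-versus-availability dial the post refers to.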

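[Quoting r*******k:]

: The version number has little to do with maturity. In 2010 Digg's VP of engineering got fired because
: Cassandra failed their whole system. Cassandra should have been well past 1.0 by then.
: http://www.neowin.net/news/digg-vp-of-engineering-fired-after-v
: That's not the main reason. It's covered at https://www.facebook.com/UsingHbase; I can't be bothered to look it up, but it was mainly
: write throughput.
: No such thing. Integrate how? Cassandra on HDFS? MapReduce optimization on Cassandra?
: Loosely gluing them together is fine, but real integration is impossible.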

t*e 2013-07-07 07:07

#21

To developers, MongoDB acts much like a relational DB. Column-family DBs are
different animals. Coding against column families is obviously more involved.

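t*e 2013-07-07 07:07

#22

I haven't actually compared HBase and Cassandra myself. But googling recent reviews, a lot of people are moving from HBase to
Cassandra.
goodbug is using Cassandra; could you tell us what you're least satisfied with?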

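t*e 2013-07-07 07:07

#23

The engineers at Taobao weren't satisfied with MySQL Cluster, and they even caught a few bugs in it.

[Quoting b******y:]

: Good thread. Looks like MySQL is still pretty reliable.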

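p*2 2013-07-07 07:07

#24

Now this is interesting. Why does everyone keep switching back and forth?

[Quoting t*******e:]

: I haven't actually compared HBase and Cassandra myself. But googling recent reviews, a lot of people are moving from HBase to
: Cassandra.
: goodbug is using Cassandra; could you tell us what you're least satisfied with?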

w*z 2013-07-07 07:07

#25

We use Cassandra in production. No big complaints, except for some operational
stuff, like repair and compaction, which slow the system down a bit. There
are some big installations, such as Netflix, eBay, Spotify and Ooyala. To
compare NoSQL solutions, do some homework: understand your use case, your HA
and latency requirements, and run benchmarks.
It's hard to say which one is better than another.

[Quoting p*****2:]

: Now this is interesting. Why does everyone keep switching back and forth?

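t*e 2013-07-07 07:07

#26

NoSQL, unlike relational databases with SQL-92, has no standard; every product is completely different. Try them all and several years
will have gone by. And simple use cases often don't reveal the real problems. Since you can't try everything yourself,
looking at other people's experience first helps you avoid detours. Put simply, key-value stores generally don't support ad hoc
queries, while document DBs basically all do; if your use case must have them, choose from among the document DBs.
Likewise, if you need multi-row transactions, most NoSQL stores don't support them.

[Quoting p*****2:]

: Now this is interesting. Why does everyone keep switching back and forth?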

p*2 2013-07-07 07:07

#27

Could the experts tell us what Ooyala is actually like as a company?

[Quoting w**z:]

: We use Cassandra in production. No big complaints, except for some operational
: stuff, like repair and compaction, which slow the system down a bit. There
: are some big installations, such as Netflix, eBay, Spotify and Ooyala. To
: compare NoSQL solutions, do some homework: understand your use case, your HA
: and latency requirements, and run benchmarks.
: It's hard to say which one is better than another.

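p*2 2013-07-07 07:07

#28

Many thanks to the expert. My next step is to study these things properly.

[Quoting t*******e:]

: NoSQL, unlike relational databases with SQL-92, has no standard; every product is completely different. Try them all and several years
: will have gone by. And simple use cases often don't reveal the real problems. Since you can't try everything yourself,
: looking at other people's experience first helps you avoid detours. Put simply, key-value stores generally don't support ad hoc
: queries, while document DBs basically all do; if your use case must have them, choose from among the document DBs.
: Likewise, if you need multi-row transactions, most NoSQL stores don't support them.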

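r*k 2013-07-07 07:07

#29

My guess is that you've only read some outdated blogs and online articles and have no hands-on experience. (Sorry,
really not nice.)
NameNode HA was already fairly mature by mid-2012, and most companies I know of had upgraded
their production systems to use NameNode HA by the end of 2012. If you've heard of the NN being a SPOF, that was
before 2013; please don't bring it up again.
The nature of this field is that everything keeps moving. If you're not sure, please don't make careless claims.
This isn't the place to discuss Cassandra and HBase implementation details. Theory and practice are far apart; things that look
beautiful are a different matter once implemented. You have to make many compromises to reach those beautiful goals, and Cassandra
has to compromise in too many places. Don't assume that FB, Twitter and some other companies dropped Cassandra
without strong reasons. To repeat: Cassandra was first developed at FB, and they pulled out of it
long ago.
As for MapReduce on CFS, that's my point: it's just glue. Cassandra merely implements the HDFS
client API on CFS. If they didn't even do that, DataStax would be out of business. As for how it actually performs
and whether it's stable, who knows. Apache Hadoop certainly won't do this.

[Quoting t*******e:]

: NoSQL, unlike relational databases with SQL-92, has no standard; every product is completely different. Try them all and several years
: will have gone by. And simple use cases often don't reveal the real problems. Since you can't try everything yourself,
: looking at other people's experience first helps you avoid detours. Put simply, key-value stores generally don't support ad hoc
: queries, while document DBs basically all do; if your use case must have them, choose from among the document DBs.
: Likewise, if you need multi-row transactions, most NoSQL stores don't support them.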

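t*e 2013-07-07 07:07

#30

You're working on HBase, so naturally you know more. The NN being a SPOF was still a fact well into 2012. Calling that a careless claim
is one thing; my knowledge not being up to date is another matter. In theory Cassandra is indeed more elegant than HBase;
I agree that in practice it may be a different story.
A forum is meant for discussion. Nobody can guarantee they're right about everything, you included. I suggest you also
shore up your fundamentals and first figure out what an ad hoc query is.

[Quoting r*******k:]

: My guess is that you've only read some outdated blogs and online articles and have no hands-on experience. (Sorry,
: really not nice.)
: NameNode HA was already fairly mature by mid-2012, and most companies I know of had upgraded
: their production systems to use NameNode HA by the end of 2012. If you've heard of the NN being a SPOF, that was
: before 2013; please don't bring it up again.
: The nature of this field is that everything keeps moving. If you're not sure, please don't make careless claims.
: This isn't the place to discuss Cassandra and HBase implementation details. Theory and practice are far apart; things that look
: beautiful are a different matter once implemented. You have to make many compromises to reach those beautiful goals, and Cassandra
: has to compromise in too many places. Don't assume that FB, Twitter and some other companies dropped Cassandra
: without strong reasons. To repeat: Cassandra was first developed at FB, and by now they have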

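z*e 2013-07-07 07:07

#31

As a rule of thumb, anything below 2.0 is generally not considered stable.
Could you elaborate on the write throughput reason?
On the last point, I see Cassandra is already listed as a Hadoop-related project,
ahead of HBase. It's not very fast yet, but it will be optimized over time.
Loosely gluing them together is good enough; what I'd worry about is integration so thorough that they're hard to pull apart later.
HBase is actually hard to work with if you're not using Hadoop.

[Quoting r*******k:]

: My guess is that you've only read some outdated blogs and online articles and have no hands-on experience. (Sorry,
: really not nice.)
: NameNode HA was already fairly mature by mid-2012, and most companies I know of had upgraded
: their production systems to use NameNode HA by the end of 2012. If you've heard of the NN being a SPOF, that was
: before 2013; please don't bring it up again.
: The nature of this field is that everything keeps moving. If you're not sure, please don't make careless claims.
: This isn't the place to discuss Cassandra and HBase implementation details. Theory and practice are far apart; things that look
: beautiful are a different matter once implemented. You have to make many compromises to reach those beautiful goals, and Cassandra
: has to compromise in too many places. Don't assume that FB, Twitter and some other companies dropped Cassandra
: without strong reasons. To repeat: Cassandra was first developed at FB, and by now they have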

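z*e 2013-07-07 07:07

#32

It shows that things aren't stable.
This is a newly developed area, and every product has its shortcomings.

[Quoting p*****2:]

: Many thanks to the expert. My next step is to study these things properly.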

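z*e 2013-07-07 07:07

#33

Isn't HBase being led by that group of Waterloo people?

[Quoting r*******k:]

: My guess is that you've only read some outdated blogs and online articles and have no hands-on experience. (Sorry,
: really not nice.)
: NameNode HA was already fairly mature by mid-2012, and most companies I know of had upgraded
: their production systems to use NameNode HA by the end of 2012. If you've heard of the NN being a SPOF, that was
: before 2013; please don't bring it up again.
: The nature of this field is that everything keeps moving. If you're not sure, please don't make careless claims.
: This isn't the place to discuss Cassandra and HBase implementation details. Theory and practice are far apart; things that look
: beautiful are a different matter once implemented. You have to make many compromises to reach those beautiful goals, and Cassandra
: has to compromise in too many places. Don't assume that FB, Twitter and some other companies dropped Cassandra
: without strong reasons. To repeat: Cassandra was first developed at FB, and by now they have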

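p*2 2013-07-07 07:07

#34

What exactly does "ad hoc query" mean? I thought about it when I saw it just now and couldn't work it out.

[Quoting t*******e:]

: You're working on HBase, so naturally you know more. The NN being a SPOF was still a fact well into 2012. Calling that a careless claim
: is one thing; my knowledge not being up to date is another matter. In theory Cassandra is indeed more elegant than HBase;
: I agree that in practice it may be a different story.
: A forum is meant for discussion. Nobody can guarantee they're right about everything, you included. I suggest you also
: shore up your fundamentals and first figure out what an ad hoc query is.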

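t*e 2013-07-07 07:07

#35

Simply put, it means supporting arbitrary, general-purpose queries. Real-world OLTP generally requires it. Column-family
NoSQL, by contrast, uses query-driven schema design: you work out the queries first, then design the column
families, and especially the row key. Fundamentally, a column-family DB can only search by row key; a full
table scan is too slow. Cassandra has supported native secondary indexes since 0.7; although they still have limitations,
adding ad hoc queries based on column values became fairly easy. I don't know about the latest HBase (someone
said I make careless claims, so here's a disclaimer); in older versions it's generally done through coprocessors, and my guess is the index is built on the client
side. In short, neither is as convenient as MongoDB or a relational database.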
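
A toy illustration of why "search by row key" and "search by column value" are so different in cost. This is a plain in-memory dict, not any real database client; the table contents and function names are invented for the example.

```python
# Hypothetical column-family-style table: row key -> columns.
table = {
    "user:1": {"name": "alice", "city": "Boston"},
    "user:2": {"name": "bob", "city": "Seattle"},
    "user:3": {"name": "carol", "city": "Boston"},
}


def get_by_rowkey(table, rowkey):
    """Direct lookup: touches only the requested row."""
    return table.get(rowkey)


def scan_by_column(table, column, value):
    """Query on a non-key column without an index: every row must be examined."""
    examined = 0
    matches = []
    for rowkey, columns in table.items():
        examined += 1
        if columns.get(column) == value:
            matches.append(rowkey)
    return matches, examined


row = get_by_rowkey(table, "user:2")
matches, examined = scan_by_column(table, "city", "Boston")
print(row)                # {'name': 'bob', 'city': 'Seattle'}
print(matches, examined)  # ['user:1', 'user:3'] 3
```

With three rows the scan is harmless; with billions of rows spread over many machines, it is the "full table scan is too slow" case, which is why the row key has to be designed around the queries you already know you need.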

[Quoting p*****2:]

: What exactly does "ad hoc query" mean? I thought about it when I saw it just now and couldn't work it out.

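r*k 2013-07-07 07:07

#36

I'm not working on HBase now; I just happen to be in the big data and open source space. Our projects have these
needs, so every project touches on it a bit.
Man, I'm about to cough up blood. (I have nothing against you personally, and I don't know who you are; if I wasn't very
nice earlier, please forgive me.)
In the big data field, when we say ad-hoc query, its opposite is so-called batch
processing. Batch processing generally means running a MapReduce job or some other job, for example scanning a
huge file; whether the result takes an hour or a whole night doesn't matter. Ad-hoc, by contrast,
means arbitrary queries that return results quickly (< 1 sec), for example querying user data by user ID. For small data
in a single-machine DB this is no problem at all, but on top of a distributed file system we need dedicated
technology to meet this need, which is why Bigtable and its successors exist. For example, finding one user ID
and its corresponding record in 100 TB of data within one second is something only NoSQL can do.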
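
A rough sketch of how a Bigtable-style store can answer "get the record for this key" quickly: row keys are kept sorted and split into contiguous ranges (regions), so a lookup only has to find the one range that can contain the key and ask the server responsible for it. The region boundaries, server names and data below are all hypothetical.

```python
import bisect

# Hypothetical layout: region i holds row keys in [REGION_STARTS[i], REGION_STARTS[i + 1]).
REGION_STARTS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]

# Toy stand-in for the data each server holds locally.
SERVER_DATA = {
    "rs1": {"alice": {"plan": "free"}},
    "rs2": {"mike": {"plan": "pro"}},
    "rs3": {},
    "rs4": {"zoe": {"plan": "free"}},
}


def locate(row_key):
    """Binary-search the sorted region start keys to find the server owning row_key."""
    idx = bisect.bisect_right(REGION_STARTS, row_key) - 1
    return REGION_SERVERS[idx]


def get(row_key):
    """Route the request to exactly one server instead of scanning every machine."""
    server = locate(row_key)
    return server, SERVER_DATA[server].get(row_key)


print(get("mike"))    # ('rs2', {'plan': 'pro'})
print(get("nobody"))  # ('rs3', None)
```

The cost of the lookup depends on the number of regions (logarithmically), not on the total data size, which is the difference from a batch job that scans the whole file.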
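Given a key, getting the value(s) is itself the most typical ad-hoc query. It's the most basic function of every NoSQL
store; otherwise what is it for? A distributed file system would do. If you say NoSQL
does not support ad-hoc queries, anyone in the field will get upset with you.
What most NoSQL stores lack is non-key indexing support, or secondary indexing. There are
technical limitations behind that; it's a long story. Those that claim to support it all have their own limitations.
Frankly, I don't think we share the same tech background and context. When even a basic term like this
is in dispute, deeper technical discussion is pointless.
Sorry for hijacking the original thread; I'll reply separately in a bit with my thoughts on big data and jobs.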

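r*k 2013-07-07 07:07

#37

No. Even now Spring/Hibernate aren't mandatory. :)
Here's what I've observed:
Hadoop/big data really is hot right now. A couple of days ago I browsed the job postings of some hot companies, such as
box.net, Dropbox, Square, Twitter and Pinterest; all of them have Hadoop positions, in many different
areas.
On the other hand, few people have experience in this area. Last year we interviewed lots of Hadoop engineer
candidates (including one from Twitter) and none was a fit; in the end we paid top dollar to poach people we knew to fill the openings,
and two positions are still unfilled.
One reason is that most people had been taking a wait-and-see attitude, so few have hands-on experience. For example,
Dropbox is looking for their first Hadoop engineer: what were you doing all this time?
As for where this technology is used, the main force is of course web companies, but there are also quite a few traditional enterprise vendors,
such as EMC and Intel, plus a whole bunch of specialist companies.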

g*g 2013-07-07 07:07

#38

The main beef is that you need to create your own index and maintain it for
everything. Secondary indexes have performance issues and are not recommended.
And you really, really have to plan your queries: while you can change your
schema without downtime, it also means that every time you change your mind, you
have to migrate your data.
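
To show what "create your own index and maintain it" means in practice, here is a minimal in-memory sketch. It is not Cassandra client code; the class, table and column names are invented. The point is that the application itself has to keep the lookup table in step with the main table on every write.

```python
class UsersWithEmailIndex:
    """Toy model of a main table plus a hand-maintained lookup table."""

    def __init__(self):
        self.users = {}           # user_id -> {"email": ...}
        self.users_by_email = {}  # email -> user_id (the manual index)

    def put(self, user_id, email):
        old = self.users.get(user_id)
        if old is not None and old["email"] != email:
            # The application must remove the stale index entry itself.
            self.users_by_email.pop(old["email"], None)
        self.users[user_id] = {"email": email}
        self.users_by_email[email] = user_id

    def find_by_email(self, email):
        user_id = self.users_by_email.get(email)
        if user_id is None:
            return None
        return user_id, self.users[user_id]


store = UsersWithEmailIndex()
store.put("u1", "a@example.com")
store.put("u1", "b@example.com")  # changing the email touches both tables
print(store.find_by_email("a@example.com"))  # None
print(store.find_by_email("b@example.com"))  # ('u1', {'email': 'b@example.com'})
```

In a real distributed store the two writes in `put` are separate operations that can partially fail, and adding a new lookup later means backfilling a new table from existing data, which is the "migrate your data" cost mentioned above.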
On the pro side, a peer to peer structure is really made for cloud. The
built-in multi-DC capability is very useful.

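[Quoting t*******e:]

: I haven't actually compared HBase and Cassandra myself. But googling recent reviews, a lot of people are moving from HBase to
: Cassandra.
: goodbug is using Cassandra; could you tell us what you're least satisfied with?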

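t*e 2013-07-07 07:07

#39

"Given a key, getting the value(s) is itself the most typical ad-hoc query. It's the most basic function of every NoSQL
store; otherwise what is it for? A distributed file system would do. If you say NoSQL
does not support ad-hoc queries, anyone in the field will get upset with you."
If that's how you understand ad hoc queries, I'm speechless too. Also, about the Digg example you gave: I haven't seen anyone say
Cassandra failed the project; everyone says it was caused by inadequate testing. I haven't used HBase, but
the results of googling "hbase vs cassandra" differ a lot from what you say. If you don't believe me, try it yourself; don't
tell me all those people are less hands-on than you.
I'm currently evaluating NoSQL databases and want to read widely before deciding, to avoid a one-sided view.

[Quoting r*******k:]

: No. Even now Spring/Hibernate aren't mandatory. :)
: Here's what I've observed:
: Hadoop/big data really is hot right now. A couple of days ago I browsed the job postings of some hot companies, such as
: box.net, Dropbox, Square, Twitter and Pinterest; all of them have Hadoop positions, in many different
: areas.
: On the other hand, few people have experience in this area. Last year we interviewed lots of Hadoop engineer
: candidates (including one from Twitter) and none was a fit; in the end we paid top dollar to poach people we knew to fill the openings,
: and two positions are still unfilled.
: One reason is that most people had been taking a wait-and-see attitude, so few have hands-on experience. For example,
: Dropbox is looking for their first Hadoop engineer: what were you doing all this time?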

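z*e 2013-07-07 07:07

#40

This area hasn't really stabilized yet, so there are still plenty of opportunities for newcomers.
In time, once the various problems are solved, the products will mature, and by then there won't be many opportunities for newcomers.
Every field is like this.
If you go into traditional IT now, Spring at least is mandatory; you can't get by without Spring.
Because web companies are expanding into new areas, the gap between newcomers and veterans is small, but it will gradually widen.

[Quoting r*******k:]

: No. Even now Spring/Hibernate aren't mandatory. :)
: Here's what I've observed:
: Hadoop/big data really is hot right now. A couple of days ago I browsed the job postings of some hot companies, such as
: box.net, Dropbox, Square, Twitter and Pinterest; all of them have Hadoop positions, in many different
: areas.
: On the other hand, few people have experience in this area. Last year we interviewed lots of Hadoop engineer
: candidates (including one from Twitter) and none was a fit; in the end we paid top dollar to poach people we knew to fill the openings,
: and two positions are still unfilled.
: One reason is that most people had been taking a wait-and-see attitude, so few have hands-on experience. For example,
: Dropbox is looking for their first Hadoop engineer: what were you doing all this time?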

t*e 2013-07-07 07:07

#41

Is there a dedicated DBA team to manage the production environment, or
do developers play dual roles?

[Quoting g*****g:]

: The main beef is that you need to create your own index and maintain it for
: everything. Secondary indexes have performance issues and are not recommended.
: And you really, really have to plan your queries: while you can change your
: schema without downtime, it also means that every time you change your mind, you
: have to migrate your data.
: On the pro side, a peer to peer structure is really made for cloud. The
: built-in multi-DC capability is very useful.

w*z 2013-07-07 07:07

#42

Depends on the company. We have a Tech Ops team, but I did all the
installation, implementation, monitoring and maintenance for our first
cluster. It was pretty fun. If your Tech Ops team doesn't have much
experience with Cassandra, you'd better know how to do it yourself. Eventually
they will ask you to fix the problems.

[Quoting t*******e:]

: Is there a dedicated DBA team to manage the production environment, or
: do developers play dual roles?

g*g 2013-07-07 07:07

#43

We have a small Cassandra ops team doing backups, version upgrades, etc. But we
have several hundred clusters. We also have a small cloud DB team, with a
couple of DBAs giving consultation on all kinds of cloud DB options. But you're
mostly on your own.

[Quoting t*******e:]

: Is there a dedicated DBA team to manage the production environment, or
: do developers play dual roles?

t*e 2013-07-07 07:07

#44

If you two don't mind, could you describe your use cases? Is Cassandra used for OLTP, OLAP, or
batch data processing?

w*z 2013-07-07 07:07

#45

We use Cassandra to store friends, persistent notifications and newsfeed.
You can't really call it OLTP since it doesn't have transactions.
We don't do analytical processing (yet); you can set up a cluster just for
data analysis. The integration with Hadoop is not great, but it works for
some people.
You can subscribe to the Cassandra user group, u**[email protected]
and you can also join the IRC channel: #cassandra on irc.freenode.net.
People there are really helpful.
And DataStax is a great resource.
Good luck.

[Quoting t*******e:]

: If you two don't mind, could you describe your use cases? Is Cassandra used for OLTP, OLAP, or
: batch data processing?

t*e 2013-07-07 07:07

#46

Thanks for sharing. "Cassandra integration with Hadoop is not great." Is it
due to the lack of data locality?

[Quoting w**z:]

: We use Cassandra to store friends, persistent notifications and newsfeed.
: You can't really call it OLTP since it doesn't have transactions.
: We don't do analytical processing (yet); you can set up a cluster just for
: data analysis. The integration with Hadoop is not great, but it works for
: some people.
: You can subscribe to the Cassandra user group, u**[email protected]
: and you can also join the IRC channel: #cassandra on irc.freenode.net.
: People there are really helpful.
: And DataStax is a great resource.
: Good luck.

w*z 2013-07-07 07:07

#47

You can buy the DataStax Enterprise version, which comes with Hadoop and Solr
integration. We haven't tried it yet, since we run Hadoop off Scribe log data.
Cassandra doesn't use HDFS as its file system, so you will have to transfer
data in and out of Cassandra. I am no expert on Hadoop, so I don't want to give
you wrong information. But as far as I know, the biggest advantage of HBase is that
it runs on HDFS, so Hadoop integration is much easier.

[Quoting t*******e:]

: Thanks for sharing. "Cassandra integration with Hadoop is not great." Is it
: due to the lack of data locality?