c*z
#2
In case you are curious about what data scientists do, here is one case that
spans multiple projects and involves multiple teams.
It is a big undertaking and not entirely within my scope, but I will try my
best to describe it.
Stage 1. We need the clickstream data. It is the crawler/parser team's job
to get the URLs (ideally the whole pages as well) from websites and classify
them, and the Hadoop admin team's job to store them. That pipeline is a
monster in its own right, and I am no expert on it at all.
How do you know who did what? Well, that is a trade secret. You can also buy
such data from app developers (e.g. from the Chrome app store). Much free
software collects more data than necessary; nothing is really free.
Stage 2. We need to clean the data. It is the data team's job to remove
garbage, impose structure, and compensate for bad data. The work can differ
greatly depending on the final product.
There are many difficulties beyond the sheer size of the data; no
traditional SDE or statistician alone can handle them. Just some examples:
Issue 1: data format. The parser might mess up and emit garbage; we should
be able to detect and remove such records.
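As a minimal sketch of that kind of record-level filtering (the field names and schema here are hypothetical; real clickstream schemas vary by parser):

```python
# Reject clickstream records with missing fields or malformed URLs.
# Field names ("url", "timestamp", "user_id") are illustrative only.
from urllib.parse import urlparse

REQUIRED = {"url", "timestamp", "user_id"}

def is_valid(record: dict) -> bool:
    """Keep a record only if all required fields exist and the URL parses."""
    if not REQUIRED <= record.keys():
        return False
    parts = urlparse(record["url"])
    return parts.scheme in ("http", "https") and bool(parts.netloc)

records = [
    {"url": "https://example.com/item/42", "timestamp": 1, "user_id": "a"},
    {"url": "garbage!!", "timestamp": 2, "user_id": "b"},   # malformed URL
    {"timestamp": 3, "user_id": "c"},                        # missing url
]
clean = [r for r in records if is_valid(r)]
```

In practice this runs as a filter step inside the cleaning pipeline, with the rejects logged for the parser team to inspect.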
Issue 2: date and time. The clickstream timestamp comes from the host
computer's system clock, which may be wrong; there are also time zone
differences.
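The time zone part at least has a mechanical fix: normalize everything to UTC. A sketch, assuming (hypothetically) that each event carries a local time string plus an IANA zone name, using the standard-library `zoneinfo` module (Python 3.9+):

```python
# Normalize a local clickstream timestamp to UTC.
# The (local_str, tz_name) event layout is an assumption for illustration.
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(local_str: str, tz_name: str) -> datetime:
    local = datetime.fromisoformat(local_str).replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))

# March 1 is before US DST starts, so Los Angeles is UTC-8 here.
utc = to_utc("2014-03-01 12:00:00", "America/Los_Angeles")
```

A wrong system clock is harder; one common heuristic is to compare event times against server receipt times and discard or shift outliers.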
Issue 3: item names. The same item can have different names on different
pages. One way to deal with this is to build a product database keyed by SKU
or ASIN, but not all product page URLs carry these identifiers. Another way
is to use some kind of string distance measure, such as the Jaccard index;
but as with any unsupervised learning, evaluating this method is difficult.
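For concreteness, here is one simple way to apply the Jaccard index to item names, over character 3-grams (the shingle size and any matching threshold you would pick are illustrative choices, not a prescription):

```python
# Jaccard similarity of two item names over character 3-grams.
def shingles(s: str, k: int = 3) -> set:
    s = s.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Two listings of (presumably) the same product:
sim = jaccard("Apple iPhone 5s 16GB", "iPhone 5s (16 GB) by Apple")
```

Pairs scoring above some threshold would be flagged as the same item; choosing that threshold, and validating it without labels, is exactly the difficulty mentioned above.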
Issue 4: sample bias. Clickstream data collected from apps is naturally
biased towards app users, and towards less geeky users, since the geeky ones
can disable the data collection. This matters because clients want unbiased
data. One way to deal with it is RIM weighting, using some third-party data
as ground truth. Another is bootstrapping. Only one thing is certain: there
will be bias.
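RIM weighting (also called raking, or iterative proportional fitting) can be sketched in a few lines: repeatedly rescale sample weights so each demographic margin matches the third-party target shares. The categories and targets below are made up for illustration:

```python
# RIM weighting / raking sketch: adjust weights until marginal shares
# match the target distributions. Fields and targets are illustrative.
def rake(rows, margins, iters=20):
    """rows: dicts of categorical fields; margins: {field: {level: share}}."""
    w = [1.0] * len(rows)
    for _ in range(iters):
        for field, targets in margins.items():
            totals = {}
            for r, wi in zip(rows, w):
                totals[r[field]] = totals.get(r[field], 0.0) + wi
            total = sum(w)
            for i, r in enumerate(rows):
                observed_share = totals[r[field]] / total
                w[i] *= targets[r[field]] / observed_share
    return w

# A sample that over-represents one group (3 of 4 are "m"):
rows = [{"gender": "m"}, {"gender": "m"}, {"gender": "m"}, {"gender": "f"}]
weights = rake(rows, {"gender": {"m": 0.5, "f": 0.5}})
```

With several margins (age, region, gender, ...) the loop cycles through them until the weights stabilize; convergence is fast when the targets are mutually consistent.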
Issue 5: incomplete data. We only have data from part of the population, and
even that data is incomplete. For example, we may see only 2% of the
shopping cart information. One way to deal with this is statistical
inference.
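The simplest version of that inference is scaling up a partial count and attaching a rough standard error. The sketch below treats the observed count as a binomial draw with known coverage; the 2% coverage and the counts are illustrative, and real coverage rates are themselves estimated:

```python
# Scale a partially observed count up to a population estimate, with a
# rough binomial standard error. Numbers are illustrative only.
import math

def estimate_total(observed: int, coverage: float):
    est = observed / coverage
    # If observed ~ Binomial(total, coverage), then
    # var(observed) = total * coverage * (1 - coverage).
    se = math.sqrt(est * coverage * (1 - coverage)) / coverage
    return est, se

# 1,000 cart events seen at 2% coverage -> roughly 50,000 in total.
est, se = estimate_total(observed=1_000, coverage=0.02)
```

Real pipelines need more than this (coverage varies by segment, events are not independent), but it shows the shape of the problem.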
Stage 3. We need to build and test models to answer questions such as:
1. popular items
2. paths to purchase
3. market share
4. sales prediction
5. recommender system
But before that, we need to extract features for modeling whenever simple
aggregations won't work (e.g. 2, 4, 5), i.e. transform lists of pages
visited into predictors such as the number of times an item was viewed, its
ordinal position, whether it came from a search, and so on.
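That transformation can be sketched directly: fold an ordered visit list into per-item features. The page-record layout (`item`, `from_search`) is hypothetical:

```python
# Turn an ordered list of page visits into per-item features:
# view count, first ordinal position, and a came-from-search flag.
from collections import defaultdict

def extract_features(visits):
    """visits: ordered dicts like {"item": ..., "from_search": bool}."""
    feats = defaultdict(
        lambda: {"views": 0, "first_pos": None, "from_search": False}
    )
    for pos, v in enumerate(visits):
        f = feats[v["item"]]
        f["views"] += 1
        if f["first_pos"] is None:
            f["first_pos"] = pos
        f["from_search"] = f["from_search"] or v["from_search"]
    return dict(feats)

visits = [
    {"item": "A", "from_search": True},
    {"item": "B", "from_search": False},
    {"item": "A", "from_search": False},
]
feats = extract_features(visits)
```

At scale the same fold runs as a group-by on user/session in the MapReduce layer; the per-item dicts become the predictor rows the models consume.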
When we finally build and test the models, things return to the ordinary for
statisticians, except that the training data can now be huge. One can pilot
with a sample in R, Matlab, or Python, then push to large scale. I have been
using Scalding, with mixed feelings of love and hatred.
Thanks for reading! Please share your comments and/or workflow!
d*n
#4
Ad targeting is also a crowded space: besides Google, I have seen plenty of
companies, big and small, working on it, both upstream and downstream.
Still, I never found it very interesting. The money is real, but there is no
sense of accomplishment.
N*f
#5
Moving on to the undomesticated types...
I ran into that pelican on the beach at San Clemente; I wanted to get
closer, but one look at its expression and I gave up. The squirrel is a
resident of the Huntington Library; to this day I still have no idea what
delicacy the little fellow dug out of that cactus. It was eating so
blissfully that it didn't even mind the spines pricking its bottom.
D*u
#6
Thanks for sharing the insights! Very helpful. A question: do you prototype
in R and then use Scalding to calculate the scores? Also, do you do any
real-time analytics?
N*f
#7
Coming up next: the people category, five photos of 气龙, none of them very
flattering. :-D
g*l
#12
It is so hard to get real-time data, for technical or organizational
reasons.
Of course, real-time product recommendation for e-commerce is always the
biggest use case.
A*a
#17
狼兄, you really captured the animals' spirit in these shots.
The second cat looks like it is deep in thought, lol
r*d
#22
Nice work, +1.