c*z
#2
In case you are curious about what data scientists do, here is one case that
spans multiple projects and involves multiple teams.
It is a big undertaking and not entirely within my scope, but I will try my
best to describe it.
Stage 1. We need the clickstream data. It is the crawler/parser team's job
to get the URLs (ideally the whole pages as well) from websites and classify
them, and the Hadoop admin team's job to store them. That pipeline is a
monster in its own right, and I am no expert on it at all.
How do you know who did what? Well, that is a trade secret. You can also buy
such data from app developers (e.g. from the Chrome app store). Much free
software collects more data than necessary; nothing is really free.
Stage 2. We need to clean the data. It is the data team's job to remove
garbage, impose structure, and compensate for bad data. The work can differ
greatly depending on the final product.
There are many difficulties beyond the sheer size of the data; no
traditional SDE or statistician alone can handle them. Just some examples:
Issue 1: data format. The parser might mess up and emit garbage; we should
be able to detect and remove such records.
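As a minimal sketch of that kind of record-level filtering (the field names and schema here are hypothetical; real clickstream schemas vary by parser):

```python
# Reject clickstream records with missing fields or malformed URLs.
# Field names ("url", "timestamp", "user_id") are illustrative only.
from urllib.parse import urlparse

REQUIRED = {"url", "timestamp", "user_id"}

def is_valid(record: dict) -> bool:
    """Keep a record only if all required fields exist and the URL parses."""
    if not REQUIRED <= record.keys():
        return False
    parts = urlparse(record["url"])
    return parts.scheme in ("http", "https") and bool(parts.netloc)

records = [
    {"url": "https://example.com/item/42", "timestamp": 1, "user_id": "a"},
    {"url": "garbage!!", "timestamp": 2, "user_id": "b"},   # malformed URL
    {"timestamp": 3, "user_id": "c"},                        # missing url
]
clean = [r for r in records if is_valid(r)]
```

In practice this runs as a filter step inside the cleaning pipeline, with the rejects logged for the parser team to inspect.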
Issue 2: date and time. The clickstream timestamp comes from the host
computer's system clock, which may be wrong; there are also time zone
differences.
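The time zone part at least has a mechanical fix: normalize everything to UTC. A sketch, assuming (hypothetically) that each event carries a local time string plus an IANA zone name, using the standard-library `zoneinfo` module (Python 3.9+):

```python
# Normalize a local clickstream timestamp to UTC.
# The (local_str, tz_name) event layout is an assumption for illustration.
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(local_str: str, tz_name: str) -> datetime:
    local = datetime.fromisoformat(local_str).replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))

# March 1 is before US DST starts, so Los Angeles is UTC-8 here.
utc = to_utc("2014-03-01 12:00:00", "America/Los_Angeles")
```

A wrong system clock is harder; one common heuristic is to compare event times against server receipt times and discard or shift outliers.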
Issue 3: item names. The same item can have different names on different
pages. One way to deal with this is to build a product database keyed by SKU
or ASIN, but not all product page URLs carry these identifiers. Another way
is to use some kind of string distance measure, such as the Jaccard index;
but as with any unsupervised learning, evaluating this method is difficult.
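For concreteness, here is one simple way to apply the Jaccard index to item names, over character 3-grams (the shingle size and any matching threshold you would pick are illustrative choices, not a prescription):

```python
# Jaccard similarity of two item names over character 3-grams.
def shingles(s: str, k: int = 3) -> set:
    s = s.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Two listings of (presumably) the same product:
sim = jaccard("Apple iPhone 5s 16GB", "iPhone 5s (16 GB) by Apple")
```

Pairs scoring above some threshold would be flagged as the same item; choosing that threshold, and validating it without labels, is exactly the difficulty mentioned above.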
Issue 4: sample bias. Clickstream data collected from apps is naturally
biased towards app users, and towards less geeky users, since the geeky ones
can disable the data collection. This matters because clients want unbiased
data. One way to deal with it is RIM weighting, using some third-party data
as ground truth. Another is bootstrapping. Only one thing is certain: there
will be bias.
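RIM weighting (also called raking, or iterative proportional fitting) can be sketched in a few lines: repeatedly rescale sample weights so each demographic margin matches the third-party target shares. The categories and targets below are made up for illustration:

```python
# RIM weighting / raking sketch: adjust weights until marginal shares
# match the target distributions. Fields and targets are illustrative.
def rake(rows, margins, iters=20):
    """rows: dicts of categorical fields; margins: {field: {level: share}}."""
    w = [1.0] * len(rows)
    for _ in range(iters):
        for field, targets in margins.items():
            totals = {}
            for r, wi in zip(rows, w):
                totals[r[field]] = totals.get(r[field], 0.0) + wi
            total = sum(w)
            for i, r in enumerate(rows):
                observed_share = totals[r[field]] / total
                w[i] *= targets[r[field]] / observed_share
    return w

# A sample that over-represents one group (3 of 4 are "m"):
rows = [{"gender": "m"}, {"gender": "m"}, {"gender": "m"}, {"gender": "f"}]
weights = rake(rows, {"gender": {"m": 0.5, "f": 0.5}})
```

With several margins (age, region, gender, ...) the loop cycles through them until the weights stabilize; convergence is fast when the targets are mutually consistent.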
Issue 5: incomplete data. We only have data from part of the population, and
even that data is incomplete. For example, we may see only 2% of the
shopping cart information. One way to deal with this is statistical
inference.
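The simplest version of that inference is scaling up a partial count and attaching a rough standard error. The sketch below treats the observed count as a binomial draw with known coverage; the 2% coverage and the counts are illustrative, and real coverage rates are themselves estimated:

```python
# Scale a partially observed count up to a population estimate, with a
# rough binomial standard error. Numbers are illustrative only.
import math

def estimate_total(observed: int, coverage: float):
    est = observed / coverage
    # If observed ~ Binomial(total, coverage), then
    # var(observed) = total * coverage * (1 - coverage).
    se = math.sqrt(est * coverage * (1 - coverage)) / coverage
    return est, se

# 1,000 cart events seen at 2% coverage -> roughly 50,000 in total.
est, se = estimate_total(observed=1_000, coverage=0.02)
```

Real pipelines need more than this (coverage varies by segment, events are not independent), but it shows the shape of the problem.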
Stage 3. We need to build and test models to answer questions such as:
1. popular items
2. paths to purchase
3. market share
4. sales prediction
5. recommender system
But before that, we need to extract features for modeling whenever simple
aggregations won't work (e.g. 2, 4, 5), i.e. transform lists of pages
visited into predictors such as the number of times an item was viewed, its
ordinal position, whether it came from a search, and so on.
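That transformation can be sketched directly: fold an ordered visit list into per-item features. The page-record layout (`item`, `from_search`) is hypothetical:

```python
# Turn an ordered list of page visits into per-item features:
# view count, first ordinal position, and a came-from-search flag.
from collections import defaultdict

def extract_features(visits):
    """visits: ordered dicts like {"item": ..., "from_search": bool}."""
    feats = defaultdict(
        lambda: {"views": 0, "first_pos": None, "from_search": False}
    )
    for pos, v in enumerate(visits):
        f = feats[v["item"]]
        f["views"] += 1
        if f["first_pos"] is None:
            f["first_pos"] = pos
        f["from_search"] = f["from_search"] or v["from_search"]
    return dict(feats)

visits = [
    {"item": "A", "from_search": True},
    {"item": "B", "from_search": False},
    {"item": "A", "from_search": False},
]
feats = extract_features(visits)
```

At scale the same fold runs as a group-by on user/session in the MapReduce layer; the per-item dicts become the predictor rows the models consume.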
When we finally build and test the models, things return to the ordinary for
statisticians, except that the training data can now be huge. One can pilot
with a sample in R, Matlab, or Python, then push to large scale. I have been
using Scalding, with mixed feelings of love and hatred.
Thanks for reading! Please share your comments and/or workflow!
d*n
#4
Ad targeting is also a crowded space: besides Google, I have seen plenty of
companies, big and small, working on it, both upstream and downstream.
Still, I never found it very interesting. The money is real, but there is no
sense of accomplishment.
N*f
#5
Moving on to the undomesticated types...
I ran into that pelican on the beach at San Clemente; I wanted to get
closer, but one look at its expression and I gave up. The squirrel is a
resident of the Huntington Library; to this day I still have no idea what
delicacy the little fellow dug out of that cactus. It was eating so
blissfully that it didn't even mind the spines pricking its bottom.
D*u
#6
Thanks for sharing the insights! Very helpful. A question: do you prototype
in R and then use Scalding to calculate the scores? Also, do you do any
real-time analytics?
N*f
#7
Coming up next: the people category, five photos of 气龙, none of them very
flattering. :-D
g*l
#12
It is so hard to get real-time data, for technical or organizational
reasons.
Of course, real-time product recommendation for e-commerce is always the
biggest use case.
A*a
#17
狼兄, you really captured the animals' spirit in these shots.
The second cat looks like it is deep in thought, lol
r*d
#22
Nice work, +1.