c*z
#2
In case you are curious about what data scientists do, here is one case that
spans multiple projects and involves multiple teams.
It is a big undertaking and not completely within my scope, but I will try my
best to describe it.
Stage 1. We need the clickstream data. It is the crawler/parser team's job
to get the urls (ideally, the whole pages as well) from websites and
classify them, and the Hadoop admin team's job to store them properly. It is
a monster in its own right, and I am no expert on it at all.
How do you know who did what? Well, that is a trade secret. You can also buy
such data from app developers (e.g. from the Chrome app store). Much free
software collects more data than necessary; nothing is really free.
Stage 2. We need to clean the data. It is the data team's job to remove
garbage, inject structure, and compensate for bad data. The work can differ
greatly depending on the final product.
There are many difficulties beyond the data simply being huge; no traditional
SDE or statistician alone can do this. Just some examples:
Issue 1: data format. The parser might mess up and emit garbage; we need to
be able to detect and remove those records.
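To make this concrete, here is a minimal sketch of garbage detection on parser output. The record schema (`url` and `ts` fields) is an assumption for illustration, not the actual pipeline's format.

```python
# Hypothetical sketch: flag parsed clickstream records that fail basic
# sanity checks, so garbage from the parser can be filtered out early.
import re

URL_RE = re.compile(r"^https?://\S+$")

def is_garbage(record):
    """Return True if a parsed record looks malformed."""
    url = record.get("url", "")
    ts = record.get("ts")
    if not URL_RE.match(url):
        return True  # e.g. raw HTML leaked into the url field
    if not isinstance(ts, (int, float)) or ts <= 0:
        return True  # missing or nonsensical timestamp
    return False

records = [
    {"url": "https://example.com/item/42", "ts": 1400000000},
    {"url": "<html><body>parse error", "ts": 1400000001},
    {"url": "https://example.com/item/43", "ts": None},
]
clean = [r for r in records if not is_garbage(r)]
```

In a real pipeline these checks would run as a filter step on the cluster, but the logic is the same.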
Issue 2: date and time. The clickstream timestamp comes from the host
computer's system clock, which can be wrong; there are also time zone
differences to reconcile.
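A small sketch of the time zone part: normalizing a local timestamp to UTC when the UTC offset is known. In practice the offset itself may have to be inferred per user, and the input format here is hypothetical.

```python
# Sketch: convert a naive local timestamp with a known UTC offset to UTC.
from datetime import datetime, timezone, timedelta

def to_utc(local_str, utc_offset_hours):
    """Interpret a naive local timestamp in a fixed-offset zone, return UTC."""
    naive = datetime.strptime(local_str, "%Y-%m-%d %H:%M:%S")
    tz = timezone(timedelta(hours=utc_offset_hours))
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)

# Noon at UTC-7 (e.g. US Pacific daylight time) is 19:00 UTC.
utc = to_utc("2014-06-01 12:00:00", -7)
```

This does nothing about clocks that are simply wrong; that usually requires cross-checking against server-side timestamps where available.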
Issue 3: item names. The same item can have different names on different
pages. One way to deal with this is to build a product database keyed by SKU
or ASIN, but not all product page urls carry these identifiers. Another way
is to use some kind of string distance measure, such as the Jaccard index.
But just like any unsupervised learning, evaluating this method is difficult.
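The Jaccard index mentioned above can be computed on the word sets of two item names. This is a toy sketch; real matching would need proper tokenization and a similarity threshold tuned to the catalog.

```python
# Jaccard index on word sets: |A ∩ B| / |A ∪ B|, 1.0 means identical sets.
def jaccard(a, b):
    """Similarity of two item names, ignoring word order and case."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Same product, differently formatted names from two pages:
score = jaccard("Apple iPhone 5s 16GB Gold", "iPhone 5s Gold 16GB by Apple")
```

Word order is discarded on purpose, since retailers reorder brand, model, and attributes freely.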
Issue 4: sample bias. Clickstream data collected from apps is naturally
biased towards app users, as well as towards less geeky users, since the
geeky ones can disable the data collection. This matters because clients
want unbiased data. One way to deal with it is RIM weighting, using some
third-party data as ground truth. Another way is bootstrapping. Only one
thing is certain: there will be bias.
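RIM weighting (also called raking, or iterative proportional fitting) can be sketched as follows: repeatedly rescale per-row weights so each categorical margin matches a trusted target. The fields (`os`, `age`) and target shares here are made up for illustration.

```python
# Minimal raking (RIM weighting) sketch over two categorical dimensions.
def rake(rows, targets, n_iter=50):
    """rows: list of dicts of categorical fields.
    targets: {field: {level: population share}} from third-party data.
    Returns one weight per row so weighted margins match the targets."""
    w = [1.0] * len(rows)
    total = float(len(rows))
    for _ in range(n_iter):
        for field, margin in targets.items():
            # current weighted count of each level of this field
            cur = {}
            for r, wi in zip(rows, w):
                cur[r[field]] = cur.get(r[field], 0.0) + wi
            # rescale so this field's margin matches the target
            for i, r in enumerate(rows):
                w[i] *= (margin[r[field]] * total) / cur[r[field]]
    return w

rows = [{"os": "android", "age": "young"},
        {"os": "android", "age": "old"},
        {"os": "ios",     "age": "young"},
        {"os": "ios",     "age": "old"}]
targets = {"os":  {"android": 0.7, "ios": 0.3},
           "age": {"young": 0.4, "old": 0.6}}
w = rake(rows, targets)
```

After convergence, the weighted share of Android rows is 70% and of young users 40%, matching the assumed third-party margins.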
Issue 5: incomplete data. We have only data from part of the population, and
that data is also incomplete. For example, we may have only 2% of the
shopping cart information. One way to deal with it is statistical inference.
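As a toy example of the kind of inference involved: if roughly 2% of carts are observed, the observed count can be scaled up, with a rough error bar attached. Treating each cart as independently observed with known probability is a simplification; real coverage is neither known exactly nor uniform.

```python
# Toy estimate of a population count from a partially observed sample.
import math

def estimate_total(observed, coverage):
    """Point estimate and ~95% interval for the population count,
    assuming each unit is seen independently with probability = coverage."""
    est = observed / coverage
    se = math.sqrt(observed * (1 - coverage)) / coverage
    return est, (est - 1.96 * se, est + 1.96 * se)

# 400 carts seen at ~2% coverage suggests ~20,000 carts overall.
est, ci = estimate_total(observed=400, coverage=0.02)
```

The interval widens quickly as coverage shrinks, which is why low-coverage signals like cart data need careful hedging when reported to clients.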
Stage 3. We need to build and test models to answer questions, such as
1. popular items
2. paths to purchase
3. market share
4. sales prediction
5. recommender system
But before that, we need to extract features for modeling when simple
aggregations won't work (e.g. 2, 4, 5), i.e. transform lists of pages
visited into predictors such as the number of times an item was viewed, its
ordinal position, whether it was reached from a search, etc.
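That feature extraction step can be sketched like this; the page schema (`item`, `from_search`) is assumed for illustration, not the actual data layout.

```python
# Sketch: turn one session's visited-page list into per-item model features.
def session_features(pages, item):
    """Features for a (session, item) pair: view count, ordinal position of
    the first view, and whether any view came from a search results page."""
    views = [i for i, p in enumerate(pages) if p["item"] == item]
    return {
        "n_views": len(views),
        "first_position": views[0] if views else -1,
        "via_search": any(pages[i]["from_search"] for i in views),
    }

pages = [
    {"item": "A", "from_search": True},
    {"item": "B", "from_search": False},
    {"item": "A", "from_search": False},
]
feats = session_features(pages, "A")
```

At scale the same per-session logic runs as a map step, with one output row per (session, item) pair.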
When we finally build and test models, things return to ordinary statistics,
except that the training data can now be huge. One can pilot with a sample
in R, Matlab or Python, then push to large scale. I have been using
Scalding, with mixed feelings of love and hatred.
Thanks for reading! Please share your comments and/or workflow!
d*n
#4
Ad targeting is also a very crowded space. Besides Google, I have seen many
players, large and small, working on it, both upstream and downstream.
Still, I never found it very interesting: it makes real money, but there is
no sense of accomplishment.
N*f
#5
Moving on to the undomesticated types...
I ran into that pelican on the beach at San Clemente; I wanted to get
closer, but one look at its expression and I gave up. The squirrel is a
resident of the Huntington Library; to this day I still have not figured
out what delicacy the little guy dug out of that cactus ball, but he was
eating so blissfully that he did not even mind the spines poking his rear.
D*u
#6
Thanks for sharing the insights! Very helpful. A question: do you prototype
in R, then use Scalding to calculate the scores? Also, do you use any
real-time analytics?
N*f
#7
Coming up next: the people category, five photos of 气龙, none of them very
flattering. :-D
g*l
#12
It is so hard to get real-time data, for technical or organizational
reasons.
Of course, real-time product recommendation for e-commerce is always the
biggest use case.
A*a
#17
You really captured the animals' spirit in these shots.
The second cat looks like it is deep in thought, lol.
r*d
#22
Nice work, +1.