Redian News
Some thoughts on data science and data scientists
# DataSciences - Data Science
c*z
1
Below are just some of my personal opinions, please don't take them
personally :)
1. Data science is a very broad term. If I dare to put down a definition,
the fundamental question for data science should be:
Are we really doing what we think we are doing?
In formal terms, data science is the science of drawing inferences from
data and measuring how much confidence those inferences deserve.
Data scientists are most concerned about what we don't know (e.g. data
quality, panel bias, model validity, etc), and this is exactly why we are
called scientists.
An analogy is that software engineers are most concerned about what hasn't
happened yet (e.g. site reliability, scalability, etc).
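To make the "inference plus confidence" idea in point 1 concrete, here is a minimal sketch that reports a point estimate together with a confidence interval. It uses a percentile bootstrap, which is just one common choice and not something the post prescribes; the function name `bootstrap_mean_ci`, its parameters, and the numbers in `data` are all made up for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap.

    Hypothetical helper for illustration; not from the original post.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(sample), lower, upper


# Made-up measurements, for illustration only.
data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.2, 4.9]
estimate, lo, hi = bootstrap_mean_ci(data)
print(f"mean = {estimate:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The mean alone is the inference; the width of the interval is a rough statement of how far to trust it.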
2. My definition is closer to that of statistics, although statisticians
seldom need to worry about having too much data, or about data that is
dirty, unstructured or unlabeled.
Under this definition, many data scientist positions are actually for
analysts and engineers, because they only care about inference or
reliability, rather than confidence and validity.
Specifically, by the nature of input data:
Statisticians work on small volumes of clean data, likely with lots of
assumptions, likely from academic literature;
Data analysts work on small volumes of dirty data, not knowing how to clean
data and making assumptions mostly from business knowledge;
Data engineers work on large volumes of clean data, likely structured for
query and display;
Data scientists work on large volumes of dirty data, likely unstructured and
unlabeled.
3. The key questions a data scientist working in business settings should
ask:
Do we have well defined questions?
Do we have truthfully labeled data?
Do we have an unbiased panel?
Features and models are secondary to questions and data. Specifically, the
first steps of research should be to ask the right questions and decide the
level and unit of analysis.
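As a rough illustration of the "unbiased panel" question above, the sketch below compares the group make-up of a panel with known population shares and derives simple post-stratification weights. This is one possible check under stated assumptions, not the author's method: the helper `panel_bias`, the group names, and all the numbers are hypothetical, and it assumes trustworthy population shares are available.

```python
from collections import Counter


def panel_bias(panel_groups, population_shares):
    """Compare group shares in a panel with known population shares.

    Returns {group: (panel_share, population_share, weight)}, where
    weight = population_share / panel_share is a simple
    post-stratification weight. Hypothetical helper for illustration.
    """
    counts = Counter(panel_groups)
    total = len(panel_groups)
    report = {}
    for group, pop_share in population_shares.items():
        panel_share = counts.get(group, 0) / total
        # A group missing from the panel cannot be fixed by reweighting.
        weight = pop_share / panel_share if panel_share > 0 else float("inf")
        report[group] = (panel_share, pop_share, weight)
    return report


# Made-up panel: mobile users are over-represented relative to the population.
panel = ["mobile"] * 70 + ["desktop"] * 30
population = {"mobile": 0.5, "desktop": 0.5}

report = panel_bias(panel, population)
for group, (p, q, w) in report.items():
    print(f"{group}: panel={p:.2f} population={q:.2f} weight={w:.2f}")
```

A weight far from 1 flags a group that is over- or under-sampled. Reweighting only corrects for the groups you thought to measure, which is why the question comes before features and models.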
Essentially, a data scientist needs skills from business, science and
engineering, which basically cover three functional roles:
A data architect,
A solution architect,
A software architect.
This is exactly why many data scientists face unreasonable expectations
and enormous stress.
d*n
2
Impressive, moderator.
OK, can I complain that half of the time in data science is spent on data mangling?
c*z
3
Thanks for the support.
Half of the time isn't so bad; for me it's 80% :(
The remaining 20% is curve fitting, which is pretty boring.

[Quoting d****n's post:]
: Impressive, moderator.
: OK, can I complain that half of the time in data science is spent on data mangling?
