An interview question (predictive model) (repost) - MITBBS (未名空间) historical archive


Board: DataSciences - Data Science

z*u 2015-03-19 07:03

Post 1

As the title says.

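b*r 2015-03-19 07:03

Post 2

[The following is reposted from the JobHunting board]
From: badweather (bad weather), Board: JobHunting
Subject: An interview question (predictive model)
Posted at: BBS 未名空间站 (Thu Mar 19 00:18:31 2015, US Eastern)
An interview question from a company: you are given a data table with only two
columns. One column is the actual weather (rain = 0, no rain = 1), and the
other is the predicted weather. The table has 365 rows, one per day. From the
table we can compute the error rate (the number of days predicted incorrectly
divided by 365). Now suppose there are several different methods of predicting
the weather, each giving a different error rate. Is the method with the lowest
error rate necessarily the best?
My answer was: not necessarily.
1. The method with the lowest error rate may be overfitting.
2. This is only the fit on the training data. For real prediction the chosen
method is not necessarily the best, so we should look at performance on test
data.
The interviewer did not seem very satisfied with my answer, and told me there
is no other data. How should this be answered?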
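
For concreteness, here is a minimal sketch of the error-rate calculation the question describes. It uses the post's encoding (0 = rain, 1 = no rain). The function name `error_rate` and the 10-day sample are made up for illustration and are not from the interview.

```python
def error_rate(actual, predicted):
    """Fraction of days on which the forecast disagrees with the actual weather."""
    if len(actual) != len(predicted):
        raise ValueError("both columns must have the same number of rows")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Made-up 10-day sample (the real table would have 365 rows).
actual    = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
predicted = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]

print(error_rate(actual, predicted))  # 0.2 (2 wrong days out of 10)
```

Note that this single number does not say which kind of mistake was made: a missed rainy day and a false rain forecast count the same.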

p*9 2015-03-19 07:03

Post 3

We plan to go to Hawaii or Europe. :)

[Quoting z*u:]

: As the title says.

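H*E 2015-03-19 07:03

Post 4

With only this much data, and binary at that, the question is really asking how
you would assess classification error. Talking about Type I and Type II errors,
or the confusion matrix, would be more on target. It has nothing to do with
model fit.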
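
A minimal sketch of the confusion matrix and the Type I / Type II error rates mentioned above, using only the standard library. Which class counts as "positive" is an assumption here: the sketch treats rain (0) as the event of interest, and the sample data is made up.

```python
from collections import Counter

RAIN, DRY = 0, 1  # encoding from the original post

def confusion_counts(actual, predicted, positive=RAIN):
    """Count TP, FP, FN, TN, treating `positive` as the event of interest."""
    c = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            c["TP" if a == positive else "FP"] += 1
        else:
            c["FN" if a == positive else "TN"] += 1
    return {k: c[k] for k in ("TP", "FP", "FN", "TN")}

def type_error_rates(counts):
    """Type I rate = FP / (FP + TN); Type II rate = FN / (FN + TP)."""
    negatives = counts["FP"] + counts["TN"]
    positives = counts["FN"] + counts["TP"]
    type1 = counts["FP"] / negatives if negatives else float("nan")
    type2 = counts["FN"] / positives if positives else float("nan")
    return type1, type2

# Made-up 10-day sample.
actual    = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
predicted = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]

counts = confusion_counts(actual, predicted)
print(counts)                    # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 7}
print(type_error_rates(counts))  # (0.125, 0.5)
```

The overall error rate here is 0.2, yet half of the rainy days were missed, which the Type II rate makes visible.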

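b*r 2015-03-19 07:03

Post 5

Thanks! That makes sense!

[Quoting H****E:]

: With only this much data, and binary at that, the question is really asking how
: you would assess classification error. Talking about Type I and Type II errors,
: or the confusion matrix, would be more on target. It has nothing to do with
: model fit.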

z*9 2015-03-19 07:03

Post 6

How would you talk about Type I and Type II errors here? Please explain! Thanks!

[Quoting H****E:]

H*E 2015-03-19 07:03

Post 7

Google is your friend.
http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf

b*r 2015-03-19 07:03

Post 8

Thanks for sharing! Very detailed!

[Quoting H****E:]

: Google is your friend.
: http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf

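f*y 2015-03-19 07:03

Post 9

What the interviewer wanted to hear was probably: precision, recall, ROC, and
gains chart. These are all better metrics or plots than the error rate.

[Quoting b********r:]

: Thanks for sharing! Very detailed!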
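
To illustrate why precision and recall can separate methods that the error rate cannot, here is a small sketch. It assumes rain (0) is the class of interest, and both forecasting "methods" and the 10-day data are invented for the example.

```python
def precision_recall(actual, predicted, positive=0):
    """Precision and recall for the `positive` label (0 = rain in the post's encoding)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else float("nan")  # undefined if rain is never forecast
    recall = tp / (tp + fn) if tp + fn else float("nan")     # undefined if it never rains
    return precision, recall

# Made-up data: 2 rainy days out of 10.
actual   = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
method_a = [1] * 10                        # never forecasts rain
method_b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # forecasts rain on the first four days

for name, pred in (("A", method_a), ("B", method_b)):
    errors = sum(a != p for a, p in zip(actual, pred))
    print(name, errors / len(actual), precision_recall(actual, pred))
```

Both methods are wrong on 2 of 10 days, so their error rates are identical, but method A catches no rainy days (recall 0) while method B catches all of them (recall 1.0, precision 0.5).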

t*e 2015-03-19 07:03

Post 10

What matters ultimately is the cost/benefit. The biggest problem with error
rate is that it's prevalence dependent. If there are many more sunny days
than rainy days (or vice versa), it's not appropriate. AUC is a better
choice, but it has problems too. It probably makes sense to ask clarifying
questions, such as: what's the cost of misclassifying a rainy day as a sunny
day and vice versa, and what's the benefit of correctly classifying sunny
days or rainy days? In the end, what you want to maximize is benefit - cost.

[Quoting b********r:]

: Thanks for sharing! Very detailed!
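
The benefit - cost idea can be sketched as a payoff table applied to each day. All payoff values below are hypothetical placeholders (the real ones would come from the clarifying questions), and the data and method names are invented.

```python
def net_benefit(actual, predicted, payoff):
    """Sum payoff[(actual, predicted)] over all days: total benefit minus total cost."""
    return sum(payoff[(a, p)] for a, p in zip(actual, predicted))

# Hypothetical payoffs, keyed by (actual, predicted); 0 = rain, 1 = no rain.
payoff = {
    (0, 0): 5,    # rain correctly forecast
    (1, 1): 1,    # dry day correctly forecast
    (0, 1): -20,  # rain missed (assumed to be the expensive mistake)
    (1, 0): -2,   # false rain alarm
}

# Made-up data: 2 rainy days out of 10.
actual     = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
always_dry = [1] * 10                        # error rate 0.2
cautious   = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # error rate 0.3

print(net_benefit(actual, always_dry, payoff))  # -32
print(net_benefit(actual, cautious, payoff))    # 9
```

Under these assumed payoffs, the method with the higher error rate gives the higher net benefit, so the ranking by error rate and the ranking by benefit - cost can disagree.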

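a*9 2015-03-19 07:03

Post 11

This is a very basic question. For example, the data may be highly unbalanced:
if rainy days make up only a very small fraction, then the overall error rate
is a poor metric. Something like AUC can mitigate this.

[Quoting b********r:]

: Thanks for sharing! Very detailed!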
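
A quick sketch of the unbalanced case: with few rainy days, a forecaster that never predicts rain already has a low error rate. The count of 18 rainy days in 365 is an invented figure, chosen only to make the imbalance visible.

```python
n_days, n_rain = 365, 18  # made-up counts; 0 = rain, 1 = no rain

actual = [0] * n_rain + [1] * (n_days - n_rain)
always_dry = [1] * n_days  # trivial forecaster: never predicts rain

error = sum(a != p for a, p in zip(actual, always_dry)) / n_days
rain_caught = sum(1 for a, p in zip(actual, always_dry) if a == 0 and p == 0)

print(f"error rate: {error:.3f}")                      # error rate: 0.049
print(f"rainy days caught: {rain_caught} of {n_rain}")  # rainy days caught: 0 of 18
```

An error rate under 5% looks good on paper, yet this forecaster is useless for anyone who needs to know when it will rain.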

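w*2 2015-03-19 07:03

Post 12

This question is testing evaluation metrics such as precision, recall, and F1.
The data you are given is the target variable and the predicted target
variable.
Start from the confusion matrix. Based on the business model, work out whether
to optimize precision or recall, and then how to do that concretely.
As for why not accuracy: you can say that accuracy can be very high, and error
very low, when the classifier simply always predicts the majority class.
If there are multiple models, you can compare their AUC scores, which indicate
how well a classifier separates the two classes.
Don't worry, it gets easier after a few more interviews.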
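
A sketch of F1 and AUC for the setting in the question, where only hard 0/1 forecasts are available. With hard labels the ROC curve has a single interior point, and the area under it works out to (TPR + TNR) / 2, which is the same as balanced accuracy. Treating rain (0) as the positive class is an assumption, and the data and helper names are invented.

```python
def counts(actual, predicted, positive=0):
    """Return (tp, fp, fn, tn) with `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")

def auc_from_hard_labels(tp, fp, fn, tn):
    """Area under the one-point ROC curve: (TPR + TNR) / 2."""
    if tp + fn == 0 or tn + fp == 0:
        return float("nan")  # undefined when one class is absent
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Made-up data: 4 rainy days out of 20.
actual   = [0] * 4 + [1] * 16
majority = [1] * 20                          # always predicts the majority class
model    = [0, 0, 0, 1] + [0, 0] + [1] * 14  # catches 3 of 4 rainy days, 2 false alarms

for name, pred in (("majority", majority), ("model", model)):
    tp, fp, fn, tn = counts(actual, pred)
    accuracy = (tp + tn) / len(actual)
    print(name, accuracy, f1_score(tp, fp, fn), auc_from_hard_labels(tp, fp, fn, tn))
```

The majority predictor reaches 0.8 accuracy but has F1 of 0 and AUC of 0.5 (no better than chance), while the second method scores 0.85 accuracy with an AUC of 0.8125. The accuracy gap is small, and the other two metrics show the real difference.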