How do you pick a model after 10-fold cross-validation? - MITBBS archive


How do you pick a model after 10-fold cross-validation? # DataSciences (Data Science board)

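t*i 2014-05-21 07:05

#1

I have a fairly small dataset: a bit over 100,000 records and 100 variables. I am
using a random forest for binary classification.
Because of over-fitting, I decided to use 10-fold cross-validation.
When it finished, I had ten random forest models.
What is the next step?
Do I pick the model with the smallest validation error (on its own set-aside
hold-out fold)? (I need one final model to put into a production system.)
Thanks!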

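T*u 2014-05-21 07:05

#2

I don't think that is how k-fold is meant to be used... If the parameters of your
10 models differ a lot, the question is not which one to cherry-pick but how much
confidence you have in the model at all. You can roughly think of the 10 fits as a
posterior distribution.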

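h*3 2014-05-21 07:05

#3

For model selection you can use it this way: just pick the one with the smallest
overall error.
What the previous poster describes is a confidence estimate. Roughly: running 10
folds gives you 10 test errors, and those 10 error values form a distribution. If
the variance of that distribution is large, your model is not consistent and is
not much better than random, so it is meaningless.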
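
A minimal sketch of the fold-error distribution described above, using scikit-learn on synthetic data (the poster's data is not available, so `make_classification` stands in for it). The two candidate settings in `candidates` are hypothetical. The last step, refitting the chosen setting on all the data to get one model for production, is common practice rather than something stated in this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# The same 10 folds are reused for every candidate so the errors are comparable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Hypothetical candidate settings to choose between.
candidates = {
    "max_depth=3": {"max_depth": 3},
    "max_depth=None": {"max_depth": None},
}

results = {}
for name, params in candidates.items():
    model = RandomForestClassifier(n_estimators=50, random_state=0, **params)
    accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    fold_errors = 1.0 - accuracy  # one test error per fold
    results[name] = (fold_errors.mean(), fold_errors.std(ddof=1))
    print(f"{name}: mean error {results[name][0]:.3f}, "
          f"std across folds {results[name][1]:.3f}")

# Choose the setting (not one of the ten fold models) with the lowest mean error.
best_name = min(results, key=lambda k: results[k][0])

# Refit that setting on all the data to obtain a single final model.
final_model = RandomForestClassifier(
    n_estimators=50, random_state=0, **candidates[best_name]
).fit(X, y)
```

A large standard deviation relative to the mean is the "not consistent" case from the post. The ten per-fold models are only used for scoring and are discarded afterwards.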

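[Quoting T*****u:]

: I don't think that is how k-fold is meant to be used... If the parameters of your
: 10 models differ a lot, the question is not which one to cherry-pick but how much
: confidence you have in the model at all. You can roughly think of the 10 fits as a
: posterior distribution.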

T*u 2014-05-21 07:05

#4

Thinking about it, the two views don't contradict each other. Thanks for sharing.

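[Quoting h********3:]

: For model selection you can use it this way: just pick the one with the smallest
: overall error.
: What the previous poster describes is a confidence estimate. Roughly: running 10
: folds gives you 10 test errors, and those 10 error values form a distribution. If
: the variance of that distribution is large, your model is not consistent and is
: not much better than random, so it is meaningless.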

c*z 2014-05-21 07:05

#5

I would do feature selection first.

d*y 2014-05-21 07:05

#6

Using cross-validation to select a random forest feels a bit odd to me.
In random forests, there is no need for cross-validation or a separate test
set to get an unbiased estimate of the test set error. It is estimated
internally, during the run
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
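
The internal estimate that the quoted passage refers to is the out-of-bag (OOB) error. A minimal sketch with scikit-learn follows, on synthetic data as a stand-in for the poster's dataset (an assumption). `oob_score=True` asks the forest to score each sample using only the trees whose bootstrap sample did not contain it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# bootstrap=True is the default and is required for OOB scoring.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            bootstrap=True, random_state=0)
rf.fit(X, y)

# oob_score_ is accuracy for a classifier, so the OOB error is its complement.
oob_error = 1.0 - rf.oob_score_
print(f"OOB error estimate: {oob_error:.3f}")
```

No data is held out here: the forest is trained on every record, and the error estimate comes from the run itself.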

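[Quoting t*****i:]

: I have a fairly small dataset: a bit over 100,000 records and 100 variables. I am
: using a random forest for binary classification.
: Because of over-fitting, I decided to use 10-fold cross-validation.
: When it finished, I had ten random forest models.
: What is the next step?
: Do I pick the model with the smallest validation error (on its own set-aside
: hold-out fold)? (I need one final model to put into a production system.)
: Thanks!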

l*s 2014-05-21 07:05

#7

Cross-validation is for selecting features here, not for selecting a model.
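
One way to read this suggestion is cross-validated feature selection. The sketch below uses scikit-learn's `RFECV` (recursive feature elimination scored by cross-validation) around a random forest. This is one possible technique, not necessarily what the poster had in mind, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 20 features, of which only 5 carry signal (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Drop the 2 least important features per step; score each subset by 5-fold CV.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=2,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("mask of kept features:", selector.support_)

# Reduced design matrix containing only the selected columns.
X_selected = selector.transform(X)
```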

b*o 2014-05-21 07:05

#8

You don't need cross-validation for a random forest. The out-of-bag (OOB) error is
somewhat similar to CV in spirit.
I suspect you are confusing training error with OOB error when you say the model
overfits. Try comparing the OOB error with the test error and see whether they
are similar.
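
The comparison suggested above can be sketched as follows with scikit-learn. The data is synthetic and the 75/25 split is an arbitrary choice, so the numbers only illustrate the pattern: training error close to zero, with OOB error and test error close to each other.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

# Error on the very data the trees were grown on; optimistic by construction.
train_error = 1.0 - rf.score(X_train, y_train)
# Error on samples each tree did not see in its bootstrap sample.
oob_error = 1.0 - rf.oob_score_
# Error on data the forest never saw at all.
test_error = 1.0 - rf.score(X_test, y_test)

print(f"train error: {train_error:.3f}")
print(f"OOB error:   {oob_error:.3f}")
print(f"test error:  {test_error:.3f}")
```

A near-zero training error next to a much larger test error is expected for fully grown trees and is not by itself evidence of a problem. The number to compare against the test error is the OOB error.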

[Quoting t*****i:]