avatar
Asking about an interview question # DataSciences - Data Science
T*t
1
avatar
a*l
2
What is the point of this question? Thanks.
Given 4,000,000 samples with 1000 features, y is 2.5% positive and 97.5%
negative, how do you take a sample from this dataset to build a reasonable
model?
avatar
z*1
3
Combat Imbalanced Classes
"You can change the dataset that you use to build your predictive model to
have more balanced data.
This change is called sampling your dataset and there are two main methods
that you can use to even-up the classes:
You can add copies of instances from the under-represented class called over
-sampling (or more formally sampling with replacement), or
You can delete instances from the over-represented class, called under-
sampling.
These approaches are often very easy to implement and fast to run. They are
an excellent starting point.
In fact, I would advise you to always try both approaches on all of your
imbalanced datasets, just to see if it gives you a boost in your preferred
accuracy measures.
You can learn a little more in the Wikipedia article titled “
Oversampling and undersampling in data analysis”."
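The two sampling moves described above fit in a few lines of NumPy. This is a minimal sketch on made-up toy data (25 positives out of 1000, roughly the 2.5% ratio in the question), not the thread's actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 25 positives, 975 negatives (2.5% positive).
X = rng.normal(size=(1000, 5))
y = np.array([1] * 25 + [0] * 975)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Over-sampling: draw minority indices WITH replacement until both
# classes have the same count.
over_idx = np.concatenate(
    [neg_idx, rng.choice(pos_idx, size=len(neg_idx), replace=True)])

# Under-sampling: draw majority indices WITHOUT replacement down to the
# minority count.
under_idx = np.concatenate(
    [pos_idx, rng.choice(neg_idx, size=len(pos_idx), replace=False)])

X_over, y_over = X[over_idx], y[over_idx]
X_under, y_under = X[under_idx], y[under_idx]
```

Note that over-sampling duplicates minority rows, so any duplicated row must stay on one side of a train/test split, or the test score will be optimistic.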
avatar
m*r
4
Let me toss out a rough idea.
Given this 2.5% vs. 97.5% split, can't we do imbalanced sampling?
Also, why are there so many features? Some of them are obviously useless at a
glance; just garbage-collect those directly.
avatar
y*g
5
1. The class imbalance determines the ratio at which you sample the two classes.
2. The feature size determines the minimum amount of data you need to get a meaningful model.
3. n/p is quite large in this problem, so regularization is not a big issue; just
be careful not to overfit.


【在 a****l 的大作中提到】
: What is the point of this question? Thanks.
: Given 4,000,000 samples with 1000 features, y is 2.5% positive and 97.5%
: negative, how do you take a sample from this dataset to build a reasonable
: model?
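Points 1 and 2 above can be sketched together: pick a target class ratio and a total size, then sample each class without replacement. The helper name `stratified_subsample` and the numbers (a 1:3 positive:negative mix, the rough 10-rows-per-feature rule of thumb) are illustrative assumptions, not from the thread:

```python
import numpy as np

def stratified_subsample(y, n_total, pos_frac, rng):
    """Pick n_total row indices with a target positive fraction.

    The class-imbalance point fixes pos_frac; the feature-size point
    fixes n_total. Each class is sampled without replacement.
    """
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = min(int(round(n_total * pos_frac)), len(pos_idx))
    n_neg = min(n_total - n_pos, len(neg_idx))
    chosen = np.concatenate([rng.choice(pos_idx, n_pos, replace=False),
                             rng.choice(neg_idx, n_neg, replace=False)])
    rng.shuffle(chosen)
    return chosen

rng = np.random.default_rng(0)
# Mimic the question's ratio on a smaller scale: 5,000 positives in 200,000.
y = np.zeros(200_000, dtype=int)
y[:5_000] = 1

# With 1000 features, a common rule of thumb is at least ~10 rows per
# feature, so ask for 20,000 rows at a 1:3 positive:negative mix.
idx = stratified_subsample(y, n_total=20_000, pos_frac=0.25, rng=rng)
```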

avatar
W*e
7
Over-sampling / under-sampling techniques. From the link you provided, these
only apply when the sampling is biased relative to the population and you know
it beforehand. A confusion matrix and a classification report may be one tool:
purposely adjust the class probability and use the F-score as the measure.
The feature count is big, so something probably needs to be done about it first;
I feel we need to reduce the dimensions rather than only shrink the data.
I'm a rookie here, so please feel free to comment.


【在 z*******1 的大作中提到】
: Combat Imbalanced Classes
: "You can change the dataset that you use to build your predictive model to
: have more balanced data.
: This change is called sampling your dataset and there are two main methods
: that you can use to even-up the classes:
: You can add copies of instances from the under-represented class called over
: -sampling (or more formally sampling with replacement), or
: You can delete instances from the over-represented class, called under-
: sampling.
: These approaches are often very easy to implement and fast to run. They are
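The confusion-matrix/F-score idea above can be sketched without any library: compute precision, recall, and F1 at a decision threshold, then move the threshold to favor the rare class. The toy scores below are fabricated for illustration:

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    """Precision/recall/F1 when `scores >= threshold` is the positive call.

    Lowering the threshold is the 'purposely adjusting the class
    probability' idea: it trades precision for recall on the rare class.
    """
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy scores: positives tend to score higher but overlap with negatives.
rng = np.random.default_rng(0)
y_true = np.array([1] * 25 + [0] * 975)
scores = np.where(y_true == 1,
                  rng.normal(0.6, 0.2, size=1000),
                  rng.normal(0.3, 0.2, size=1000))

# Compare the default 0.5 cut against a lower cut tuned for the rare class.
p_hi, r_hi, f_hi = f1_at_threshold(y_true, scores, 0.5)
p_lo, r_lo, f_lo = f1_at_threshold(y_true, scores, 0.45)
```

Lowering the threshold can only grow the predicted-positive set, so recall never decreases; whether F1 improves depends on how much precision is given up.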

avatar
b*s
8
A positive sample size of 4,000,000 × 2.5% is already a luxury to me.
Why would you need up-sampling or down-sampling? Even though my boss works on
sampling, I personally feel that after up-sampling or down-sampling you can no
longer provide an unbiased estimate.


【在 a****l 的大作中提到】
: What is the point of this question? Thanks.
: Given 4,000,000 samples with 1000 features, y is 2.5% positive and 97.5%
: negative, how do you take a sample from this dataset to build a reasonable
: model?

avatar
d*n
9
Does sampling always have to end up at 1:1?
avatar
a*z
10
Second this one:
"A positive sample size of 4,000,000 × 2.5% is already a luxury to me."
I would like to do dimension reduction first.
avatar
t*g
11
This question asks how to handle imbalanced samples, and then how to build a model in that situation.
avatar
x*t
12
Even though the absolute count of 4M × 2.5% is large, this is still a 2.5% vs. 97.5% imbalanced-class problem.
The usual strategies are:
1) over-sampling the minority class (drawback: overfitting; it only sharpens the
decision boundary without generalizing)
2) under-sampling the majority class
3) synthesizing data points
For the third, see the SMOTE and ADASYN methods. Python has a ready-made package: imbalanced-learn.
The SMOTE and ADASYN papers:
https://www.jair.org/media/953/live-953-2037-jair.pdf
http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf


【在 a****l 的大作中提到】
: What is the point of this question? Thanks.
: Given 4,000,000 samples with 1000 features, y is 2.5% positive and 97.5%
: negative, how do you take a sample from this dataset to build a reasonable
: model?
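In practice you would call `SMOTE` or `ADASYN` from imbalanced-learn, but the core interpolation idea from the SMOTE paper fits in a short NumPy sketch. The real implementations are more careful; `smote_sketch` and its toy data are made up for illustration:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style synthesis: each synthetic point lies on the
    segment between a minority sample and one of its k nearest minority
    neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest-neighbour indices

    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))         # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(25, 4))         # stand-in for the rare 2.5% class
X_syn = smote_sketch(X_min, n_new=100, k=5, rng=rng)
```

Because every synthetic point is a convex combination of two real minority points, the synthesized data stays inside the minority class's convex hull, which is why SMOTE generalizes better than plain duplication.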

avatar
S*o
13
Oversampling is very bad for decision-tree-based pipelines, since the splitting
policy is usually based on the Gini index, information gain, or similar criteria,
all of which are affected by the class distribution. But it can work very well
in some cases; penalty-based balancing is often upsampling in disguise.

【在 x***t 的大作中提到】
: Even though the absolute count of 4M × 2.5% is large, this is still a 2.5% vs. 97.5% imbalanced-class problem.
: The usual strategies are:
: 1) over-sampling the minority class (drawback: overfitting; it only sharpens the
: decision boundary without generalizing)
: 2) under-sampling the majority class
: 3) synthesizing data points
: For the third, see the SMOTE and ADASYN methods. Python has a ready-made package: imbalanced-learn.
: The SMOTE and ADASYN papers:
: https://www.jair.org/media/953/live-953-2037-jair.pdf
: http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf

avatar
x*t
14
That's why I recommend SMOTE or ADASYN; see the original papers for details.

【在 S*****o 的大作中提到】
: Oversampling is very bad for decision-tree-based pipelines, since the splitting
: policy is usually based on the Gini index, information gain, or similar criteria,
: all of which are affected by the class distribution. But it can work very well
: in some cases; penalty-based balancing is often upsampling in disguise.

avatar
a*s
15
Can someone explain more about these steps and where to learn all this? I
am learning DS from scratch, mainly self-teaching by looking for online
resources.
Thanks.

【在 x***t 的大作中提到】
: Even though the absolute count of 4M × 2.5% is large, this is still a 2.5% vs. 97.5% imbalanced-class problem.
: The usual strategies are:
: 1) over-sampling the minority class (drawback: overfitting; it only sharpens the
: decision boundary without generalizing)
: 2) under-sampling the majority class
: 3) synthesizing data points
: For the third, see the SMOTE and ADASYN methods. Python has a ready-made package: imbalanced-learn.
: The SMOTE and ADASYN papers:
: https://www.jair.org/media/953/live-953-2037-jair.pdf
: http://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
