A pharma company's computational biology interview question - MITBBS (未名空间) historical archive


A pharma company's computational biology interview question # Biology - 生物学

Q*F 2017-09-22 07:09

#1

Let the burst come soon.

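m*c 2017-09-22 07:09

#2

I recently had a phone interview for a position at a big pharma company, and I
am not sure how best to answer this question.
Two groups of patients are treated with the same drug; one group responds well
and the other responds poorly. We have RNAseq data for every patient, i.e.
normalized expression values for more than 20,000 genes. The values may range
from 0 to 100 and are non-random and non-linearly distributed, but the overall
mean is 1. Which machine learning or statistical method would you use to find a
set of genes, a small subset of the 20,000-plus, whose expression values can
predict a patient's response to the treatment?
Two patient cohorts, all treated with the same drug. One cohort is the
responders, who respond to the treatment, and the other is the
non-responders, who do not respond to the treatment. RNAseq was performed and
we have the normalized gene expression values of the 20,000 genes for each
of the patients. The expression values range from 0 to 100 with an overall
average of 1.
The question is how to find a gene set (a small portion of the 20,000
genes) and use their combined (maybe weighted) gene expression values to
predict whether a patient is a responder or a non-responder to the drug
treatment. It's a binary prediction.
Hope this is clear.
Thanks for any pointers.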

j*z 2017-09-22 07:09

#3

I'm at NSC, but I also hope the brothers and sisters at TSC get their green cards soon!

z*e 2017-09-22 07:09

#4

First ask how many patients are in each group;
then PLS-DA, OPLS-DA, or random forest should all work.
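A minimal sketch of the random forest option only, assuming scikit-learn and NumPy are available. The data are simulated, not real RNAseq: the group size of 100, the 200 genes, the five "informative" gene indices, and the 8-fold shift are all made up for illustration. The log-normal parameters are chosen so that background expression averages about 1, as in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_per_group, n_genes = 100, 200          # hypothetical sizes
informative = [3, 17, 42, 99, 150]       # hypothetical "true" marker genes

# Labels: 0 = non-responder, 1 = responder.
y = np.repeat([0, 1], n_per_group)

# Log-normal background with mean exp(-0.5 + 0.5) = 1.
expr = rng.lognormal(mean=-0.5, sigma=1.0, size=(2 * n_per_group, n_genes))
# Responders get 8-fold higher expression in the informative genes.
expr[np.ix_(y == 1, informative)] *= 8.0

# Fit the forest; the out-of-bag score is a rough internal accuracy estimate.
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(expr, y)

# Rank genes by impurity-based importance and keep the top 10 as candidates.
ranking = np.argsort(forest.feature_importances_)[::-1]
top_genes = ranking[:10]
print("candidate genes:", sorted(top_genes.tolist()))
print("out-of-bag accuracy:", forest.oob_score_)
```

Importance ranking is a heuristic rather than a formal selection procedure, so any shortlist obtained this way would still need to be checked on patients that were not used for the ranking.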

f*e 2017-09-22 07:09

#5

Green together, and more green. Let the meat stew in our own pot; don't hand it over to the Indian applicants.

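v*e 2017-09-22 07:09

#6

A median of 1 would be about right. But a mean of 1? A single value of 1000
would need about 2000 values of 0.5 to balance it out. Check whether the
question was written down wrong.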
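The balancing arithmetic can be checked directly. A small sketch; the helper name is made up for illustration, and the second call uses the upper bound of 100 stated in the question instead of 1000.

```python
def low_values_needed(high, low, target_mean):
    """Number n of `low` values needed so that one `high` value plus
    n `low` values average to `target_mean`.

    Solves (high + n * low) / (n + 1) = target_mean for n.
    """
    return (high - target_mean) / (target_mean - low)

n_for_1000 = low_values_needed(1000.0, 0.5, 1.0)
n_for_100 = low_values_needed(100.0, 0.5, 1.0)
print(n_for_1000, n_for_100)  # 1998.0 198.0
```

So one gene at 1000 needs 1998 genes at 0.5 to keep the mean at 1, while one gene at 100 needs only 198, which is easy to satisfy among 20,000 genes.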

p*j 2017-09-22 07:09

#7

Burst, my foot. It's streets behind NSC.

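m*c 2017-09-22 07:09

#8

Guru, could you walk through that in a bit more detail? Suppose each group has 100 patients.

[z*****e wrote:]

: First ask how many patients are in each group;
: then PLS-DA, OPLS-DA, or random forest should all work.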

F*u 2017-09-22 07:09

#9

They were speaking in the future tense.

[p****j wrote:]

: Burst, my foot. It's streets behind NSC.

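m*c 2017-09-22 07:09

#10

Possibly. Very high expression values aren't that meaningful anyway. Let's cap the maximum at 100, then.

[v*******e wrote:]

: A median of 1 would be about right. But a mean of 1? A single value of 1000
: would need about 2000 values of 0.5 to balance it out. Check whether the
: question was written down wrong.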

a*8 2017-09-22 07:09

#11

For now it's only a trickle; waiting for the burst.

d*m 2017-09-22 07:09

#12

Could you post the English version? I can't quite follow.

H*i 2017-09-22 07:09

#13

A double approval at TSC is the way to go!

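d*m 2017-09-22 07:09

#14

Statistically speaking, everything is density estimation. Think about which
variables you have, make a few assumptions, and construct the joint density;
then think about which methods can estimate the conditional density, and those
are the ones that can do prediction. This looks like a classification problem,
and the method depends on your assumption about the density function of the
expression values.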
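One way to write the density-estimation view down, as a sketch only. The symbols are introduced here for illustration: $x$ is a patient's expression vector over $G$ genes, $y \in \{0, 1\}$ is the response label, and $\pi_0, \pi_1$ are the class proportions. The second line is one possible assumption (genes independent given the class, i.e. naive Bayes), not something stated in the thread.

```latex
% Conditional density of the label from the class-conditional densities (Bayes' rule)
\[
P(y = 1 \mid x) =
  \frac{\pi_1 \, p(x \mid y = 1)}
       {\pi_0 \, p(x \mid y = 0) + \pi_1 \, p(x \mid y = 1)}
\]
% Illustrative assumption: genes are conditionally independent given the class
\[
p(x \mid y) = \prod_{j=1}^{G} p_j(x_j \mid y)
\]
```

Different choices for $p(x \mid y)$ lead to different classifiers, which is the sense in which the method depends on the assumed density of the expression values.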

F*u 2017-09-22 07:09

#15

Uh-oh, Lao He has turned into a broken record.

[H******i wrote:]

: A double approval at TSC is the way to go!

m*c 2017-09-22 07:09

#16

Just updated with an English description. Please check it out again.

[d********m wrote:]

: Could you post the English version? I can't quite follow.

x*0 2017-09-22 07:09

#17

Why does your emoticon look so re-re? It cracks me up; I want to get one too.

[H******i wrote:]

: A double approval at TSC is the way to go!

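s*s 2017-09-22 07:09

#18

I don't quite see what the mean of 1 is for; maybe it is meant to prompt you to
mention that some algorithms need the predictors normalized.
I haven't done biomarker work, but this question is not asking you to build a
model; it is asking you to find a subset.
You can find the subset either automatically or by manual stepwise selection.
For the former you can use lasso or similar. The latter means adding predictors
back one at a time, starting with those that have the smallest p-values or the
highest information gain, until the predictive power (e.g. AUC) stops
increasing. Of course you need a binary classifier in the end, so just wrap a
logistic regression around it, e.g. logistic lasso.
I don't do much machine learning; once you have the feature selection,
something like naive Bayes or a decision tree afterwards is fine. As for random
forests, neural networks and the like, quite apart from the fact that they don't
take care of feature selection, their results are harder to interpret, so pharma
companies generally wouldn't use them. Work on the clinical side usually has to
be easy to interpret.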
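A minimal sketch of the logistic lasso idea, assuming NumPy is available. It is a from-scratch proximal-gradient (ISTA) fit so that the soft-thresholding step that zeroes out genes is visible; in practice a library implementation would normally be used. The data are simulated, and the sizes, informative gene indices, 8-fold shift, and penalty `lam=0.05` are all hypothetical choices.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1_logistic(X, y, lam, n_iter=2000):
    """Minimize mean logistic loss + lam * ||w||_1 (intercept b unpenalized)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    Xa = np.hstack([X, np.ones((n, 1))])
    step = 4.0 * n / np.linalg.norm(Xa, 2) ** 2
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        resid = prob - y
        # Gradient step on the smooth loss, then soft-threshold the weights.
        w = soft_threshold(w - step * (X.T @ resid) / n, step * lam)
        b -= step * resid.mean()
    return w, b

rng = np.random.default_rng(0)
n_per_group, n_genes = 100, 200          # hypothetical sizes
informative = [3, 17, 42, 99, 150]       # hypothetical "true" marker genes
y = np.repeat([0.0, 1.0], n_per_group)   # 0 = non-responder, 1 = responder
expr = rng.lognormal(mean=-0.5, sigma=1.0, size=(2 * n_per_group, n_genes))
expr[np.ix_(y == 1, informative)] *= 8.0

# Log-transform and standardize each gene so one penalty suits all genes.
Z = np.log(expr)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

w, b = fit_l1_logistic(Z, y, lam=0.05)
selected = np.flatnonzero(w)             # genes with non-zero weight
pred = (Z @ w + b) > 0
print(len(selected), "genes kept; training accuracy:", (pred == (y == 1)).mean())
```

The non-zero weights give both the gene subset and the weighted score asked for in the question. Training accuracy is optimistic, so the penalty would normally be tuned, and the final model judged, by cross-validation or on a held-out cohort.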


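[m******c wrote:]

: I recently had a phone interview for a position at a big pharma company, and I
: am not sure how best to answer this question.
: Two groups of patients are treated with the same drug; one group responds well
: and the other responds poorly. We have RNAseq data for every patient, i.e.
: normalized expression values for more than 20,000 genes. The values may range
: from 0 to 100 and are non-random and non-linearly distributed, but the overall
: mean is 1. Which machine learning or statistical method would you use to find a
: set of genes, a small subset of the 20,000-plus, whose expression values can
: predict a patient's response to the treatment?
: Two patient cohorts, all treated with the same drug. One cohort is the
: responders, who respond to the treatment, and the other is the
: non-responders, who do not respond to the treatment. RNAseq was performed and
: we have the normalized gene expression values of the 20,000 genes for each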

C*n 2017-09-22 07:09

#19

TSC was probably busy with something else before.
USCIS announced today that data entry of all FY15 H-1B cap petitions has
been completed. USCIS will now begin returning all H-1B cap petitions that
were not selected in the lottery. U.S. businesses use the H-1B program to
employ foreign workers in occupations that require highly specialized
knowledge in fields such as science, engineering, and computer programming.

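d*m 2017-09-22 07:09

#20

Thanks for sharing your experience. When I saw such a huge feature space
yesterday afternoon, I kept wondering how to do the regularization. I vaguely
recalled there was a method for it but could not remember the name. Right,
it's Lasso.

[s******s wrote:]

: I don't quite see what the mean of 1 is for; maybe it is meant to prompt you to
: mention that some algorithms need the predictors normalized.
: I haven't done biomarker work, but this question is not asking you to build a
: model; it is asking you to find a subset.
: You can find the subset either automatically or by manual stepwise selection.
: For the former you can use lasso or similar. The latter means adding predictors
: back one at a time, starting with those that have the smallest p-values or the
: highest information gain, until the predictive power (e.g. AUC) stops
: increasing. Of course you need a binary classifier in the end, so just wrap a
: logistic regression around it, e.g. logistic lasso.
: I don't do much machine learning; once you have the feature selection,
: something like naive Bayes or a decision tree afterwards is fine. As for random
: forests, neural networks and the like, quite apart from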