A question about the reliability of AS transcripts - 未名空间 (MITBBS) historical archive


Board: Biology - 生物学

j*e 2012-02-02 08:02

#1

That was an exciting match. I had wanted to join the march, but that's out of the question now.

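n*7 2012-02-02 08:02

#2

Here is roughly the situation: we made a large number of cDNA clones, sequenced them, and picked some for downstream experiments, mainly in vitro protein assays. We identified many new splicing isoforms, and quite a few of them carry a premature stop codon (caused either by splicing or by an indel).

Now we want to do some quality control and remove unreliable transcripts, and this is where we disagree.

A senior colleague in our group wants the set to stay as consistent as possible with known proteins: take the protein corresponding to each isoform and align it to the known protein sequence. If essentially the whole protein aligns, it is kept as a partial protein even if it is truncated, whatever the cause. But if the frame-shifted amino acid sequence exceeds a certain length, the protein is considered too different from the known sequence and is removed.

I fully understand why she thinks this way, because what we actually test are protein sequences. For my part of the work, though, I need to focus on the effect of AS on downstream biological processes. So I plan to first remove all transcripts containing indels, since an indel may be sample-specific or may be a sequencing/assembly/PCR error; either way it cannot be counted as something that generally exists. Then I want to keep only transcripts that have canonical splice sites and would not become nonsense-mediated decay (NMD) targets. My reasoning is that, although this will wrongly discard some, the vast majority of real sequences should have only canonical splice sites, so this filters out some assembly errors. And although NMD is not 100% efficient (which is why we were able to clone some NMD targets at all), it can still serve as a criterion for whether a protein actually exists in the organism. Whatever transcripts remain, no matter how odd or how different from the known protein, I regard as coming from a specific gene locus, and they should all be treated as products of that gene.

I'm not sure whether this approach is appropriate. Any advice would be appreciated. Thanks!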
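
The filter proposed in this post (drop indel-containing transcripts, keep only canonical splice sites, drop predicted NMD targets) could be sketched roughly as follows. This is a minimal illustration, not the poster's pipeline: the `Transcript` record, the restriction to GT-AG introns, and the 50-nt threshold for the NMD rule are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumptions (not from the thread): splice-site dinucleotides have already
# been extracted from a genome alignment, and positions are 0-based
# transcript coordinates.
CANONICAL_SITES = {("GT", "AG")}   # donor/acceptor pairs accepted as canonical
NMD_THRESHOLD_NT = 50              # the commonly cited "50-nt rule"


@dataclass
class Transcript:
    name: str
    exon_lengths: List[int]                # exon lengths, 5' to 3'
    intron_motifs: List[Tuple[str, str]]   # (donor, acceptor) for each intron
    stop_end: int                          # transcript position just past the stop codon
    has_indel: bool = False                # any indel relative to the reference


def only_canonical_sites(tx: Transcript) -> bool:
    """True if every intron uses a donor/acceptor pair in CANONICAL_SITES."""
    return all((d.upper(), a.upper()) in CANONICAL_SITES
               for d, a in tx.intron_motifs)


def predicted_nmd_target(tx: Transcript) -> bool:
    """50-nt rule: the stop codon lies far upstream of the last exon-exon junction."""
    if len(tx.exon_lengths) < 2:
        return False                       # no exon-exon junction at all
    last_junction = sum(tx.exon_lengths[:-1])
    return last_junction - tx.stop_end > NMD_THRESHOLD_NT


def passes_filter(tx: Transcript) -> bool:
    """Indel-free, canonical splice sites only, and not a predicted NMD target."""
    return (not tx.has_indel
            and only_canonical_sites(tx)
            and not predicted_nmd_target(tx))
```

Because NMD is not fully efficient, as the post itself notes, `predicted_nmd_target` is better read as a flag than as proof that the protein is absent.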

y*u 2012-02-02 08:02

#3

I'm so upset too that I won't be able to sleep tonight.

l*1 2012-02-02 08:02

#4

If you are not limited to a purely bioinformatics approach to
> Now we want to do some quality control and remove unreliable transcripts
your team could try the steps below, one by one:
1) Chromatin immunoprecipitation (ChIP) and DNA microarrays (chip)
2) ChIP-PCR
3) Expression RT-PCR
4) Reporter plasmid construction
5) Cell culture, transfection, and reporter assay
6) Western blots
cited from
//www.ncbi.nlm.nih.gov/pmc/articles/PMC2643481/
During the ChIP assay above you can of course also try the bioinformatics tools listed here:
//info.gersteinlab.org/Tools#ChIP
such as PeakSeq:
A tool for calling peaks corresponding to transcription factor binding sites from ChIP-Seq data scored against
a matched control such as Input DNA. PeakSeq employs a two-pass strategy in which putative binding sites
are first identified in order to compensate for genomic variation in the 'mappability' of sequences, before a
second pass filters out sites not significantly enriched compared to the normalized control, computing
precise enrichments and significances. Our scoring procedure enables us to optimize experimental design by
estimating the depth of sequencing required for a desired level of coverage and demonstrating that more
than two replicates provides only a marginal gain in information.
cited from the Yale Gerstein Lab Tools page:
//info.gersteinlab.org/Tools
NB: if I have misunderstood your question, please let me know.

[In reply to n******7]

M*0 2012-02-02 08:02

#5

When things aren't going well, everything goes against your wish.

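t*d 2012-02-02 08:02

#6

The AS field still doesn't have much of a clear picture, I think.
Both of your ideas are fine; it depends on what you want to do in your follow-up experiments.
Your colleague's idea is more practical. Yours is harder to follow up on, but if it works out, the payoff is bigger.

[In reply to n******7]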
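
To make the colleague's protein-level criterion from post #2 concrete, here is a rough sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in for a real protein aligner, and the names `min_block` and `max_unmatched` and their default values are hypothetical thresholds chosen for illustration; the thread never states the actual cutoffs.

```python
from difflib import SequenceMatcher


def unmatched_residues(isoform: str, reference: str, min_block: int = 10) -> int:
    """Count isoform residues not covered by an exact shared block of
    at least `min_block` residues with the reference protein."""
    matcher = SequenceMatcher(None, isoform, reference, autojunk=False)
    matched = sum(block.size
                  for block in matcher.get_matching_blocks()
                  if block.size >= min_block)
    return len(isoform) - matched


def keep_by_protein_similarity(isoform: str, reference: str,
                               max_unmatched: int = 20,
                               min_block: int = 10) -> bool:
    """Keep truncated or exon-skipped proteins; drop those with a long
    stretch (for example a frame-shifted tail) absent from the reference."""
    return unmatched_residues(isoform, reference, min_block) <= max_unmatched
```

Under this sketch a truncated protein has no unmatched residues and is kept as a partial protein, while a long frame-shifted tail pushes the count over the cutoff. Exact block matching ignores conservative substitutions, so a real analysis would more likely score an alignment from a dedicated tool.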

n*7 2012-02-02 08:02

#7

Thank you very much for such a detailed reply, but for now we only want to do some theoretical analysis.

[In reply to l**********1]

n*7 2012-02-02 08:02

#8

The experiments are already done; we are cleaning up the garbage now...

[In reply to t*d]

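t*d 2012-02-02 08:02

#9

I meant the functional experiments that come next.
What is the point of finding a bunch of AS events? Is it just to list the possible AS events? If so, either your story or hers would hold up. Wouldn't it be better to blend the two and tell them together?

[In reply to n******7]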

n*k 2012-02-02 08:02

#10

What are your general thoughts about the AS field? Have you ever talked to Fu about it? I was hoping he would do something with AS and NGS, but it has been a long wait :))
As for the OP's problems, I was thinking:
1. The proteomic data are presumably very incomplete and perhaps not that reliable either...
2. Why do indels have to be eliminated? I think they could potentially be very interesting unless they were generated by the procedures...
3. I'm an idiot on the bioinformatics side, but if it is manageable, I wonder why the OP wouldn't run all the different analyses and see what happens... In the end something could (could, I mean) jump out from a biological point of view... well, I guess it depends on whether you are working for a publication or for something meaningful :)))

[In reply to t*d]

n*7 2012-02-02 08:02

#11

No, there are downstream experiments, which is exactly why we need to clean up; we don't want others to say our data are unreliable...

[In reply to t*d]

n*7 2012-02-02 08:02

#12

You mean Xiangdong Fu? I know his lab well, but they currently seem to focus on RNA-protein interactions.
Re 1: Good, that gives us more room to work with.
Re 2: By my estimate, roughly 1/3 of the indels are in the 5' end primer region and another 1/3 are in the 3' end primer region, so they are probably not biologically meaningful. Also, our samples have no associated phenotype and are assumed to be wild-type.
Re 3: We are doing a mid-throughput analysis and want to write the paper as soon as possible. Our boss has no experience or confidence in this and changes plans several times a day...

[In reply to n********k]

t*d 2012-02-02 08:02

#13

Oh, I see.
In that case, isn't it a matter of presenting things in whatever way agrees with the downstream experimental results?

[In reply to n******7]

t*d 2012-02-02 08:02

#14

I am not familiar with the AS field. There have been too few reports showing that different isoforms of an mRNA can have different functions, and I cannot understand why cells need AS. NGS may provide an excellent means to help us understand AS better. There were a few papers in Nature Genetics and NEJM showing that a component of the RNA splicing complex is mutated in CLL and that the mutation can be used for diagnosis/prognosis. That indicates AS may play a very important role in cancer progression, although the papers gave no details on AS.

[In reply to n********k]

n*k 2012-02-02 08:02

#15

In that case, Fu would be the ideal person to consult about your question... RNA-protein interaction has been their focus from the beginning, but they were also among the first to examine isoforms using genomic approaches... They have been publishing quite a bit on the TF/epigenetic side lately, and I was wondering where they are going with the AS work... BTW, whose lab are you in?

[In reply to n******7]

n*k 2012-02-02 08:02

#16

AS is very much an understudied field. I personally have a strong interest in it and believe there is a lot of interesting stuff to be revealed. As for biological significance, there are tons of examples in both basic biology and clinical pathology that involve AS.

[In reply to t*d]

n*7 2012-02-02 08:02

#17

Actually, I just know the people in his lab fairly well; our own lab is a small, unknown, shabby one.

[In reply to n********k]

n*7 2012-02-02 08:02

#18

Yes, a lot about AS is still unsettled.
We and our collaborators now have some fairly interesting data, but we are somewhat worried about their reliability.
There are too many things that look uncertain, and we can't tell whether they are AS noise or something with a real biological output.

[In reply to n********k]

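c*g 2012-02-02 08:02

#19

Some not very mature thoughts of mine; apologies if they are wrong.
1) If your downstream work is validation, then of course the more stringent the filter, the more likely a transcript is to be validated, so you could combine your filter with your colleague's. In fact, since you already have downstream validation, can you tell at which step the errors occurred? Let the validation results inform which filters you should apply.
2) If your data are from human or a model organism and you want to check whether your sequences are real and correct, you can compare them with existing cDNA databases, or even with recent RNA-seq data. If a sequence also appears in other data sets, it can reasonably be considered real.
3) Most alternative splicing in human and model organisms should already be annotated, and your samples are wild type, so comparing against the existing annotation should be a good "cross-validation" method.

[In reply to n******7]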
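
The comparison against existing annotation suggested in points 2) and 3) is often done at the level of intron chains. Below is a minimal sketch under stated assumptions: exons are sorted, half-open `(start, end)` genomic intervals on one strand of a single gene, and the function names and category labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]
Chain = Tuple[Interval, ...]


def intron_chain(exons: List[Interval]) -> Chain:
    """Introns implied by consecutive exons: (end of exon i, start of exon i+1)."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def build_index(known: Dict[str, List[Interval]]) -> Tuple[Set[Chain], Set[Interval]]:
    """Collect every annotated intron chain and every annotated intron."""
    chains = {intron_chain(exons) for exons in known.values()}
    introns = {intron for chain in chains for intron in chain}
    return chains, introns


def classify(exons: List[Interval], chains: Set[Chain], introns: Set[Interval]) -> str:
    """Compare one cloned isoform with the annotation index."""
    chain = intron_chain(exons)
    if not chain:
        return "single_exon"
    if chain in chains:
        return "known_isoform"
    if all(intron in introns for intron in chain):
        return "novel_combination"   # known junctions, new arrangement
    return "novel_junction"          # at least one unannotated junction
```

Isoforms in the `novel_junction` class are the ones whose support would most need checking against independent cDNA or RNA-seq data, as point 2) suggests.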

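M*n 2012-02-02 08:02

#20

If these come from cDNA clones, PCR/assembly errors are probably rare. Those that can be detected against the reference sequence should be removable.
Your colleague is too conservative, and has forgotten about overlapping genes and non-coding RNA.
Your idea is somewhat more reasonable, but splice sites are not 100% conserved either, so your approach is still a bit conservative.
BTW, what exactly is your goal? To work out the entire transcriptome, or something else? How many clones did you actually make? This is too inefficient.
You should do RNA-seq.

[In reply to n******7]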

n*7 2012-02-02 08:02

#21

> Some not very mature thoughts of mine; apologies if they are wrong.
> 1) If your downstream work is validation, then of course the more stringent the filter, the more likely a transcript is to be validated, so you could combine your filter with your colleague's. In fact, since you already have downstream validation, can you tell at which step the errors occurred? Let the validation results inform which filters you should apply.
The downstream work is not validation; it consists of in vitro characterization experiments, so it is not clear how much the results differ from what happens in the organism.
> 2) If your data are from human or a model organism and you want to check whether your sequences are real and correct, you can compare them with existing cDNA databases, or even with recent RNA-seq data. If a sequence also appears in other data sets, it can reasonably be considered real.
Quite a few overlap with known ones, but we care more about whether the newly discovered isoforms are reliable.
> 3) Most alternative splicing in human and model organisms should already be annotated, and your samples are wild type, so comparing against the existing annotation should be a good "cross-validation" method.
By our results, about half of the isoforms we found are not known. But we also think there is a bias here: known data sets tend to report "reliable" mRNAs, for example as judged by splice sites and NMD. So we want to control for that as well.

[In reply to c*****g]

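n*7 2012-02-02 08:02

#22

There are some PCR errors. Some clones are strange: Sanger sequencing returned sequences from who knows where, and we even got rRNA sequences.
I did the assembly, and to preserve the data size some cutoffs were not set very strictly, so some regions have low scores. There are also some alignment problems; I have found several errors. Our goal is to discover new isoforms, so the reference sequence is not always useful.
Being conservative can't be helped: our experimental pipeline makes some assumptions that are in fact not entirely correct. On top of that, what we ultimately study is protein, which is very sensitive to the mRNA sequence, so we have to be somewhat conservative, unless the boss or our collaborators are willing to do some high-throughput validation at the protein level, such as MS...
We want to discover full-length isoforms.
RNA-seq can currently only study splice sites, right? The isoforms produced by Cufflinks, Scripture and the like are still predicted from splice sites and do not necessarily exist. The purpose of cloning is precisely to overcome this weakness of RNA-seq. We chose a gene set for cloning, and across the several data sets we have, about 36K clones were sequenced in total.

[In reply to M*****n]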

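n*7 2012-02-02 08:02

#23

Also, I did observe dual-frame exons even within our reference sequences.
Non-coding RNA is also possible; some genes produce both mRNA and mRNA-like non-coding RNA (which now seems to be called lincRNA more often).
But what we are studying now is protein, so ncRNA has to be filtered out...

[In reply to M*****n]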
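
The thread does not say how ncRNA would be filtered out. One crude heuristic, offered here purely as an assumption, is to require a minimum open reading frame length on the sense strand. The function names and the 100-codon cutoff in this sketch are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq: str) -> int:
    """Length in codons (start codon included, stop excluded) of the longest
    ATG-to-stop open reading frame in the three forward frames."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        length = 0                     # 0 means we are not inside an ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if length == 0:
                if codon == "ATG":
                    length = 1         # open a new ORF at the first ATG
            elif codon in STOP_CODONS:
                best = max(best, length)
                length = 0             # close the ORF at the stop codon
            else:
                length += 1
    return best


def looks_protein_coding(seq: str, min_codons: int = 100) -> bool:
    """Heuristic only: an ORF of at least `min_codons` codons suggests coding potential."""
    return longest_orf_codons(seq) >= min_codons
```

ORF length alone misclassifies short proteins and transcripts with long chance ORFs, so a dedicated coding-potential tool would be the safer choice in practice.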