RNA-seq data assembler vs genomic shotgun data assembler - MITBBS (未名空间) historical archive

Board: Biology

c*y 2018-01-06 08:01

Post 1

Could someone on the board explain the differences between them?
I'd like to use Cufflinks to assemble shotgun genome data; not sure whether that's feasible.
Many thanks!

f*o 2018-01-06 08:01

Post 2

A "genomic shotgun data assembler" assembles genomic DNA sequences, while an
"RNA-Seq assembler", such as Cufflinks, assembles a transcriptome. Most de novo
assemblers in both classes use de Bruijn graph based methods for Illumina data
(Cufflinks itself is reference-guided and builds transcripts from read
alignments). But there are many differences between them.
For example, RNA-Seq reads can be stranded, while genomic DNA reads are
strandless. So a DNA assembler has to treat a read and its reverse complement
as the same sequence.
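
A minimal sketch of the reverse-complement point, assuming plain A/C/G/T reads and a simple k-mer counter. The function names here are hypothetical and are not taken from any particular assembler. Each k-mer is stored under its "canonical" form (the lexicographically smaller of the k-mer and its reverse complement), so both strands land in the same bucket:

```python
from collections import Counter

# Translation table for complementing A/C/G/T (no ambiguity codes handled).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads, k):
    """Count k-mers so that a read and its reverse complement agree."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts
```

With stranded RNA-Seq data an assembler could skip the `canonical` step and keep the two orientations apart, which is one reason the two kinds of tools are not interchangeable.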
Another major difference is the assumption about sequence coverage.
Disregarding fragmentation and sequencing biases, the coverage of the genome
should be more or less even, so regions with more reads imply multiple
copies/repeats. On the other hand, because different mRNAs are expressed at
different levels, the coverage of different transcripts can vary greatly.
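
To make the coverage assumption concrete, here is a rough sketch of how a genome-oriented tool might read copy number off read depth. It assumes the median depth across contigs represents single-copy sequence; the function name and the example depths are made up for illustration:

```python
from statistics import median


def estimate_copy_number(contig_depth):
    """Estimate copy number per contig from mean read depth.

    contig_depth maps a contig name to its mean depth (must be positive).
    Assumption: the median depth corresponds to single-copy sequence.
    """
    baseline = median(contig_depth.values())
    return {
        name: max(1, round(depth / baseline))
        for name, depth in contig_depth.items()
    }


# Hypothetical depths: three single-copy contigs and one collapsed repeat.
example = estimate_copy_number({"a": 30, "b": 31, "c": 29, "d": 90})
```

Applied to transcripts, the same arithmetic would report a highly expressed gene as a multi-copy repeat, which is why an RNA assembler cannot rely on this assumption.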
Additionally, an RNA assembler has to be aware of different isoforms of the
same gene, which correspond to different traversals of the graph.
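
As a toy illustration of "isoforms as traversals", the sketch below enumerates every source-to-sink path in a small acyclic splice graph. The graph, the exon names, and `enumerate_paths` are hypothetical; real assemblers prune paths using read support rather than listing all of them:

```python
def enumerate_paths(graph, node, sink, prefix=()):
    """Yield every path from node to sink in an acyclic graph."""
    path = prefix + (node,)
    if node == sink:
        yield path
        return
    for successor in graph.get(node, ()):
        yield from enumerate_paths(graph, successor, sink, path)


# Toy gene: exon2 is a cassette exon that can be skipped.
splice_graph = {
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon3"],
    "exon3": [],
}

isoforms = list(enumerate_paths(splice_graph, "exon1", "exon3"))
```

A genome assembler wants a single walk through its graph, whereas here each of the two paths is a legitimate answer.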
There are many other differences, but the bottom line is that it's not a
good idea to use Cufflinks to assemble a genome. SPAdes is my go-to for that
purpose.

c*y 2018-01-06 08:01

Post 3

But SPAdes is a de novo tool. What if I would like to include the
reference in the assembly process?

f*o 2018-01-06 08:01

Post 4

That's rare, and I'm not sure I understand why you'd want it, but I can think
of two ways:
1. Generate all possible K-mers (K = your read length) from the reference
genomic sequence. Include them in your FASTQ read files and do de novo
assembly. The information carried by the reference sequence should help
resolve ambiguity in the graph during assembly. A similar idea has been used
in Cufflinks.
2. Align your reads to the genome, then separate aligned and unaligned reads.
For the aligned reads, just call variants. For the unaligned reads, do de
novo assembly and compare the assembled contigs with the reference genome to
identify novel sequences or variations. A similar idea was used in TopHat to
identify new splice sites.
Hope this helps.
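
Both suggestions can be sketched in a few lines, under some loud assumptions. For the first, pseudo-reads are tiled across the reference at every offset and given a constant quality character (`I`, Phred 40 in the usual Phred+33 encoding), which is an arbitrary choice. For the second, read names are split using bit 0x4 of the SAM FLAG field, which marks an unmapped read. All function names and the example records are hypothetical:

```python
import io


def reference_pseudo_reads(reference, read_length, step=1):
    """Yield (start, sequence) windows tiled across the reference."""
    for start in range(0, len(reference) - read_length + 1, step):
        yield start, reference[start:start + read_length]


def to_fastq(name, seq, quality_char="I"):
    """Format one FASTQ record with a constant, made-up quality string."""
    return f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"


def write_pseudo_reads(reference, read_length, handle, prefix="ref"):
    """Write every pseudo-read to handle and return how many were written."""
    written = 0
    for start, seq in reference_pseudo_reads(reference, read_length):
        handle.write(to_fastq(f"{prefix}_{start}", seq))
        written += 1
    return written


def split_by_mapping(sam_lines):
    """Split read names into (aligned, unaligned) using SAM FLAG bit 0x4."""
    aligned, unaligned = [], []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        (unaligned if flag & 0x4 else aligned).append(fields[0])
    return aligned, unaligned


# Approach 1 on a toy reference.
buffer = io.StringIO()
written = write_pseudo_reads("ACGTAC", 4, buffer)

# Approach 2 on two toy SAM records (one mapped, one unmapped).
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
aligned, unaligned = split_by_mapping(sam)
```

In practice the pseudo-reads would be appended to the real FASTQ input before assembly, and the unaligned names would be used to pull the matching reads out for a separate de novo run. How much weight the assembler gives the synthetic reads is something this sketch does not address.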

C*X 2018-01-06 08:01

Post 9

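Are you a student or a postdoc?

[Quoting c***y:]

: Could someone on the board explain the differences between them?
: I'd like to use Cufflinks to assemble shotgun genome data; not sure whether that's feasible.
: Many thanks!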