RNA-Seq data assembler vs. genomic shotgun data assembler
Board: Biology
c*y
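Post 1
Could anyone on the board explain the differences between them? I'd like to use Cufflinks to assemble shotgun genome data and am not sure whether that is feasible. Many thanks!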
f*o
Post 2
A genomic shotgun assembler assembles genomic DNA sequences, while an RNA-Seq assembler, such as Cufflinks, assembles a transcriptome. Most de novo assemblers in both classes use de Bruijn graph-based methods for Illumina data, but there are many differences between them.

For example, RNA-Seq reads can be strand-specific, while genomic shotgun reads carry no strand information. So a DNA assembler has to treat a read and its reverse complement as the same sequence.

Another major difference is the assumption about sequence coverage. Disregarding fragmentation and sequencing biases, the coverage of a genome should be more or less even, so regions with noticeably more reads imply multiple copies or repeats. In a transcriptome, on the other hand, different mRNAs are expressed at different levels, so the coverage of different transcripts can vary greatly.

Additionally, an RNA assembler has to be aware of different isoforms of the same gene, which correspond to different traversals of the graph.

There are many other differences, but the bottom line is that it's not a good idea to use Cufflinks to assemble a genome. SPAdes is my go-to for that purpose.
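To make the strand point concrete, here is a minimal Python sketch (not taken from any particular assembler) of how k-mers can be counted so that a read and its reverse complement land on the same key. The function names are made up for illustration, and the code assumes uppercase A/C/G/T input.

```python
from collections import Counter

# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    The lexicographically smaller of the two is used, which is a common
    convention but only one possible choice.
    """
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical_kmers(reads, k):
    """Count k-mers so that both strands contribute to the same key."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


if __name__ == "__main__":
    # "CGTT" is the reverse complement of "AACG", so both reads
    # are counted under the same canonical k-mers.
    print(count_canonical_kmers(["AACG", "CGTT"], 3))
```

A strand-specific RNA-Seq assembler could skip the `canonical` step and keep the two orientations apart, which is one reason the two kinds of tools are not interchangeable.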
c*y
Post 3
But SPAdes is a de novo tool. What if I would like to include a reference in the assembly process?
f*o
Post 4
That's a rare use case, and I'm not sure I understand the motivation, but I can think of two ways:

1. Generate all possible k-mers (k = your read length) from the genomic sequence, include them in your FASTQ read files, and do de novo assembly. The information carried by the reference sequence should help resolve ambiguities in the graph during assembly. A similar idea has been used in Cufflinks.
2. Align your reads to the genome, then separate the aligned and unaligned reads. For the aligned reads, just call variants. For the unaligned reads, do de novo assembly and compare the assembled contigs with the reference genome to identify novel sequences or variations. A similar idea was used in TopHat to identify new splice sites.

Hope this helps.
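The two options above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the helper names `tile_reference`, `fastq_record`, and `split_sam_by_mapping` are hypothetical, the placeholder quality character `I` is an arbitrary choice, and the second helper assumes the aligner wrote standard SAM text, where bit 0x4 of the FLAG column marks an unmapped read.

```python
def tile_reference(ref_name, ref_seq, read_len, step=1):
    """Option 1: yield (name, sequence) for every read-length window
    of the reference, to be mixed in with the real reads."""
    for start in range(0, len(ref_seq) - read_len + 1, step):
        yield f"{ref_name}_pseudo_{start}", ref_seq[start:start + read_len]


def fastq_record(name, seq, qual_char="I"):
    """Format one FASTQ record with a constant placeholder quality.

    The quality value is an assumption; pseudo-reads have no real
    base qualities.
    """
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


def split_sam_by_mapping(sam_lines):
    """Option 2: separate SAM alignment lines into aligned and unaligned.

    Header lines (starting with '@') are skipped. The FLAG is the second
    tab-separated column, and bit 0x4 means the read is unmapped.
    """
    aligned, unaligned = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        (unaligned if flag & 0x4 else aligned).append(line)
    return aligned, unaligned


if __name__ == "__main__":
    # Option 1: turn a toy reference into pseudo-reads of length 4.
    for name, seq in tile_reference("chrToy", "ACGTAC", 4):
        print(fastq_record(name, seq), end="")
```

The aligned reads from `split_sam_by_mapping` would go to a variant caller and the unaligned ones to a de novo assembler such as SPAdes; both of those steps are outside this sketch.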