Help wanted: converting deep sequencing data - MITBBS (未名空间) historical archive


Help wanted: converting deep sequencing data # Biology

x*y 2013-02-08 08:02

#1

Donated it to the board; the moderator will distribute it to everyone once it's done...

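c*b 2013-02-08 08:02

#2

I've recently started working with a batch of published data and have no experience at all, so I'd appreciate a quick primer.
The data I received looks like this (please tell me what format this is):
ACAAACGACTCTCGGCAACGGTTGT 2
ATATGAAGACAAGTAGTGCAGCTCGGAGACGGG 1
ATAATAGAGGTTTTGCAAAACAAT 1
The number after each sequence is the read number.
I'd like to convert this data to FASTA format, but I don't know which software is suitable. There are several million lines, and I'm not sure my machine can handle it. Ideally the tool would have a simple UI, otherwise I won't be able to manage it (sob).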

l*a 2013-02-08 08:02

#3

Congrats

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

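u*1 2013-02-08 08:02

#4

I think you need to explain more clearly what kind of data this is.
For ordinary sequencing reads, first, even the shortest reads are 36 bp, and second, every base has a corresponding Phred value, so this is definitely not sequencing data.
As I recall, FASTA is generally the format for reference genomes, so my guess is that this is a reference sequence for some species? But I don't understand what the 1 and 2 at the end mean.
In short, I don't know what it is. Hoping someone more knowledgeable can weigh in.

[Quoting c********b:]

: I've recently started working with a batch of published data and have no experience at all, so I'd appreciate a quick primer.
: The data I received looks like this (please tell me what format this is):
: ACAAACGACTCTCGGCAACGGTTGT 2
: ATATGAAGACAAGTAGTGCAGCTCGGAGACGGG 1
: ATAATAGAGGTTTTGCAAAACAAT 1
: The number after each sequence is the read number.
: I'd like to convert this data to FASTA format, but I don't know which software is suitable. There are several million lines, and I'm not sure my machine can handle it. Ideally the tool would have a simple UI, otherwise I won't be able to manage it (sob).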

g*n 2013-02-08 08:02

#5

re

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

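c*b 2013-02-08 08:02

#6

Sorry, this is small RNA deep sequencing (not genome sequence), and the number at the end is the read number (abundance).
I just want to convert it to FASTA or some other format that downstream software can recognize.

[Quoting u*********1:]

: I think you need to explain more clearly what kind of data this is.
: For ordinary sequencing reads, first, even the shortest reads are 36 bp, and second, every base has a corresponding Phred value, so this is definitely not sequencing data.
: As I recall, FASTA is generally the format for reference genomes, so my guess is that this is a reference sequence for some species? But I don't understand what the 1 and 2 at the end mean.
: In short, I don't know what it is. Hoping someone more knowledgeable can weigh in.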
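
To make the meaning of the abundance column concrete, here is a small Python sketch that tallies such a file. It assumes the two-column "sequence count" layout from the original question, separated by spaces or tabs. The name `summarize_tag_counts` is hypothetical and not part of any small RNA tool.

```python
from collections import Counter


def summarize_tag_counts(lines):
    """Return (unique_sequences, total_reads, reads_per_length).

    Each input line is assumed to look like 'SEQUENCE COUNT'.
    Blank or malformed lines are skipped.
    """
    unique = 0
    total = 0
    reads_per_length = Counter()
    for line in lines:
        fields = line.split()  # splits on spaces or tabs
        if len(fields) != 2:
            continue
        seq, count = fields[0], int(fields[1])
        unique += 1                           # one collapsed sequence per line
        total += count                        # reads represented by this line
        reads_per_length[len(seq)] += count   # length profile weighted by abundance
    return unique, total, reads_per_length


if __name__ == "__main__":
    sample = [
        "ACAAACGACTCTCGGCAACGGTTGT 2",
        "ATATGAAGACAAGTAGTGCAGCTCGGAGACGGG 1",
        "ATAATAGAGGTTTTGCAAAACAAT 1",
    ]
    unique, total, by_length = summarize_tag_counts(sample)
    print(unique, total, dict(by_length))
```

Each line is a distinct (collapsed) sequence, so the number of lines gives the unique sequences while the sum of the second column gives the total reads.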

g*e 2013-02-08 08:02

#7

a*h 2013-02-08 08:02

#8

This looks like a tag count file to me (see the GBS pipeline).
A few Linux shell commands will do the conversion (assuming your file is space-delimited):
cat input.txt | nl | gawk -F' ' '{print ">"$1"_"$3"\n"$2}'
>1_2
ACAAACGACTCTCGGCAACGGTTGT
>2_1
ATATGAAGACAAGTAGTGCAGCTCGGAGACGGG
>3_1
ATAATAGAGGTTTTGCAAAACAAT
Each sequence ID is the line number plus the read number, so it is unique within one file and also keeps the read number information. For multiple files, concatenate them first so the numbering stays unique:
cat file1 file2 file3 | nl | gawk -F' ' '{print ">"$1"_"$3"\n"$2}'
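
For readers who would rather not use the terminal pipeline, the same conversion can be sketched in Python. This is a minimal sketch, not the poster's method. The function name `tag_counts_to_fasta` is hypothetical, and it assumes each line holds a sequence and a count separated by spaces or tabs. It processes one line at a time, so a file with several million lines does not have to fit in memory.

```python
import io


def tag_counts_to_fasta(lines, out):
    """Write one FASTA record per 'SEQUENCE COUNT' line.

    The record ID is '<running index>_<count>', mirroring the IDs
    produced by the shell pipeline. Returns the number of records written.
    """
    written = 0
    for line in lines:
        fields = line.split()  # splits on spaces or tabs
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        seq, count = fields
        written += 1
        out.write(f">{written}_{count}\n{seq}\n")
    return written


if __name__ == "__main__":
    sample = io.StringIO(
        "ACAAACGACTCTCGGCAACGGTTGT 2\n"
        "ATATGAAGACAAGTAGTGCAGCTCGGAGACGGG 1\n"
        "ATAATAGAGGTTTTGCAAAACAAT 1\n"
    )
    result = io.StringIO()
    tag_counts_to_fasta(sample, result)
    print(result.getvalue(), end="")
```

To run it on real files, pass open file objects instead of the in-memory sample, for example `open("input.txt")` as `lines` and `open("output.fasta", "w")` as `out` (both file names are placeholders). One difference from the shell version: the index counts only well-formed lines, so malformed lines do not consume an ID.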

P*e 2013-02-08 08:02

#9

re

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

c*l 2013-02-08 08:02

#10

Looks like microRNA or other siRNA. A script can do this job.

[Quoting c********b:]

g*z 2013-02-08 08:02

#11

chi

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

a*h 2013-02-08 08:02

#12

One more note: if the file is tab-delimited:
cat input.txt | nl | sed 's/ //g' | gawk -F'\t' '{print ">"$1"_"$3"\n"$2}'

q*d 2013-02-08 08:02

#13

j*p 2013-02-08 08:02

#14

Learn to write scripts~

S*b 2013-02-08 08:02

#15

re

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

c*b 2013-02-08 08:02

#16

Looks like that's what I'll have to do. Thank you all very much!

[Quoting j*p:]

: Learn to write scripts~

l*t 2013-02-08 08:02

#17

re

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

a*h 2013-02-08 08:02

#18

Have you tried the script I gave you?

[Quoting c********b:]

: Looks like that's what I'll have to do. Thank you all very much!

h*e 2013-02-08 08:02

#19

Congrats

t*a 2013-02-08 08:02

#20

What a clean and elegant solution, upvoted.
OP, stop hesitating; this one works.

[Quoting a********h:]

: This looks like a tag count file to me (see the GBS pipeline).
: A few Linux shell commands will do the conversion (assuming your file is space-delimited):
: cat input.txt | nl | gawk -F' ' '{print ">"$1"_"$3"\n"$2}'
: >1_2
: ACAAACGACTCTCGGCAACGGTTGT
: >2_1
: ATATGAAGACAAGTAGTGCAGCTCGGAGACGGG
: >3_1
: ATAATAGAGGTTTTGCAAAACAAT
: sequence ID will be unique if you only have one file and also include read

P*f 2013-02-08 08:02

#21

re

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

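c*b 2013-02-08 08:02

#22

I'm hopeless: I don't know Perl, and I can't even point to a file in the Mac terminal. I just asked a friend from the lab next door to run it for me, and it works very well. Sincere thanks!!

[Quoting a********h:]

: Have you tried the script I gave you?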

i*e 2013-02-08 08:02

#23

chi
Answers and summaries for some common interview questions -
http://www.ihas1337code.com

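a*h 2013-02-08 08:02

#24

You're welcome. For a simple job like this, Linux/Unix shell commands are enough.

[Quoting c********b:]

: I'm hopeless: I don't know Perl, and I can't even point to a file in the Mac terminal. I just asked a friend from the lab next door to run it for me, and it works very well. Sincere thanks!!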

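d*e 2013-02-08 08:02

#25

Congratulations!
Could the expert share the reason? On principle I would delete the post :)

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...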

c*e 2013-02-08 08:02

#26

BTW, what do you think about Python vs. Perl?
I use Python exclusively and have found that many traditional labs prefer Perl.

[Quoting a********h:]

: You're welcome. For a simple job like this, Linux/Unix shell commands are enough.

h*i 2013-02-08 08:02

#27

Queuing up

[Quoting x***y:]

: Donated it to the board; the moderator will distribute it to everyone once it's done...

a*h 2013-02-08 08:02

#28

I don't know Python. My feeling is that Perl is much easier to learn than Python and a bit more flexible and powerful than shell scripting, but it is getting old (no major updates for a few years!). Python is somewhat like Java: your code depends on how familiar you are with the functions others have already written. It is more powerful, though, if you are working on a complex problem or on large software (package) projects where the code is developed by a team or even multiple teams. For small informatics jobs, Perl or shell is enough.

[Quoting c********e:]

: BTW, what do you think about Python vs. Perl?
: I use Python exclusively and have found that many traditional labs prefer Perl.

s*9 2013-02-08 08:02

#29

Congrats, congrats! What's the good news? Do tell~

l*1 2013-02-08 08:02

#30

Plus, pick whatever suits your needs: C++, Perl, Python, R, etc., depending on what you are analyzing, the number of samples, and the goal of the project.
For example, if the OP's data were raw high-throughput NGS data, you could also try the Python-based Bcbio-nextgen.
Quoted summary:
Summary: Python scripts and modules for automated next gen sequencing analysis. These provide a fully automated pipeline for taking sequencing results from an Illumina sequencer, converting them to standard Fastq format, aligning to a reference genome, doing SNP calling, and producing a summary PDF of results
Web link:
http://seqanswers.com/wiki/Bcbio-nextgen
or alternatively,
https://bcbio-nextgen.readthedocs.org/en/latest/
For more RNA-seq software based on different programming platforms, please refer to:
http://seqanswers.com/wiki/Software/list

[Quoting c********e:]

: BTW, what do you think about Python vs. Perl?
: I use Python exclusively and have found that many traditional labs prefer Perl.