"for my TF, the data make sense to me but the core said it is trash/useless,
9-20% mappable reads (out of 9-11M, meant to get 20M) and peaks calling with
a FDR of 100%. "
A mouse sample with 20% x 11M = 2.2M mapped reads is useless for
publication, but it is still potentially usable for troubleshooting.
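To see the range you are actually working with, here is a tiny sketch of
the arithmetic using the numbers quoted above (9-11M total reads, 9-20%
mappable). The function name is just something I made up for illustration.

```python
def mapped_reads(total_reads, mappable_percent):
    """Reads expected to map, given a total and a whole-number percentage."""
    return total_reads * mappable_percent // 100

# Worst and best case from the numbers quoted in the post.
worst = mapped_reads(9_000_000, 9)
best = mapped_reads(11_000_000, 20)
print(f"{worst / 1e6:.2f}M to {best / 1e6:.2f}M mapped reads")
```

So the 2.2M figure is the best case; the worst case is well under 1M.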
Possible reasons (most likely to least likely):
1. The antibody doesn't work and did not pull down anything, so there is no
signal enrichment at the sites that are supposed to be TF-bound. The whole
signal should look no different from your input: you would see a flat line
(except at centromeres) across the chromosomes.
2. Library overloading. This gives you fewer output reads; 11M is kind of
low, since a GXII run usually generates 20-30M reads, with a raw .fastq file
size of 3-6G. If this is the only problem, you should still be able to see
ChIP-enriched sites, but the fold enrichment will be low. Upload your mapped
data to a genome browser or IGV, go to some of your positive controls, and
check whether their promoters/enhancers have signal.
3. Read mapping. If the raw reads are long, trimming them will give you
slightly better mapping, but it won't change the distribution of the
mappable reads. That means if you do not see enrichment with 2.2M reads, you
probably won't see enrichment with 4.4M reads either.
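If you want a number to go with the browser view from points 1 and 2, a
rough sketch like the one below compares ChIP and input per bin after
scaling both to reads per million. It assumes you already have read start
positions for one chromosome as plain lists of integers; the function
names, the 1 kb bin size and the pseudocount are my own choices, not what
any peak caller does.

```python
from collections import Counter

def bin_counts(positions, bin_size):
    """Count reads per fixed-size bin from a list of read start positions."""
    return Counter(pos // bin_size for pos in positions)

def fold_enrichment(chip_positions, input_positions, bin_size=1000, pseudocount=1.0):
    """Per-bin ChIP/input ratio after scaling each library to reads per million."""
    chip = bin_counts(chip_positions, bin_size)
    ctrl = bin_counts(input_positions, bin_size)
    chip_scale = 1e6 / max(len(chip_positions), 1)
    ctrl_scale = 1e6 / max(len(input_positions), 1)
    ratios = {}
    for b in sorted(set(chip) | set(ctrl)):
        chip_rpm = chip[b] * chip_scale   # Counter returns 0 for empty bins
        ctrl_rpm = ctrl[b] * ctrl_scale
        ratios[b] = (chip_rpm + pseudocount) / (ctrl_rpm + pseudocount)
    return ratios

# Toy data (made up): input is spread evenly, ChIP piles up in bin 5.
input_reads = [i * 1000 + 10 for i in range(10)]
chip_reads = [5000 + 10 * i for i in range(8)] + [1010, 8010]

enriched = fold_enrichment(chip_reads, input_reads)
flat = fold_enrichment(input_reads, input_reads)
print("bin 5 ratio:", round(enriched[5], 2))
print("flat ratios:", set(flat.values()))
```

A failed IP (reason 1) should look like the "flat" case, with ratios near 1
everywhere. With overloading only (reason 2), a few bins should still stand
out, just with lower ratios than you would like.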
"Luckily my TF has been chipped many many time and has very conserved
binding sites. "
If the same TF has been ChIP-seqed many times and has a conserved motif,
you can use its motif to reverse-search for binding sites using FIMO
(part of the MEME Suite).
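To show the idea behind such a motif scan, here is a minimal sketch that
scores every window of a sequence against a position weight matrix on both
strands. This is NOT FIMO and does not compute p-values the way FIMO does;
the count matrix (a perfect "GATAAG" consensus), the uniform background and
the score cutoff are all made up for illustration.

```python
import math

def log_odds_matrix(count_matrix, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds scores vs. a uniform background."""
    pwm = []
    for column in count_matrix:
        total = sum(column[b] for b in "ACGT") + 4 * pseudocount
        pwm.append({b: math.log2(((column[b] + pseudocount) / total) / background)
                    for b in "ACGT"})
    return pwm

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def scan(sequence, pwm, threshold):
    """Return (start, strand, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            if any(base not in "ACGT" for base in site):
                continue  # skip windows containing N or lowercase bases
            score = sum(pwm[i][base] for i, base in enumerate(site))
            if score >= threshold:
                hits.append((start, strand, score))
    return hits

# Hypothetical motif: 10 observations of the consensus base at each position.
consensus = "GATAAG"
counts = [{b: (10 if b == c else 0) for b in "ACGT"} for c in consensus]
pwm = log_odds_matrix(counts)

forward_hits = scan("CCGATAAGCC", pwm, threshold=8.0)
reverse_hits = scan("AACTTATCAA", pwm, threshold=8.0)
print(forward_hits)
print(reverse_hits)
```

Keep in mind that a motif like this will match at many places in the genome
by chance, so motif hits alone do not tell you a region is bound.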
"I randomly picked the mapped peaks, most of them with at least 1high
confident binding site...so biology tells me the data is very good..."
This doesn't make sense to me. I have never heard of "mapped peaks"; it
should be either "mapped reads" or "called peaks". If it is "mapped reads",
then of course if you BLAT them against the reference genome they will land
somewhere, but whether many of the reads pile up together is another
question. I don't think this can be used as evidence that your data are
very good.
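To put "pile up together" into numbers, here is a back-of-the-envelope
sketch: if reads landed uniformly at random, how likely is it to see k or
more reads in one window? The genome size (roughly 2.7 Gb for mouse), the
200 bp window and the simple Poisson model are my assumptions; real peak
callers use local background models, so treat this only as intuition.

```python
import math

def expected_reads_per_window(total_reads, window_size, genome_size):
    """Mean reads per window if reads were spread uniformly over the genome."""
    return total_reads * window_size / genome_size

def poisson_sf(k, lam, max_terms=1000):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail directly (lam > 0)."""
    if k <= 0:
        return 1.0
    # Start from P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    for _ in range(max_terms):
        total += term
        i += 1
        term *= lam / i          # P(X = i) from P(X = i - 1)
        if term < total * 1e-16:
            break
    return min(total, 1.0)

# Assumed values: 2.2M mapped reads, 200 bp windows, ~2.7 Gb genome.
lam = expected_reads_per_window(2_200_000, 200, 2_700_000_000)
p_one = poisson_sf(1, lam)
p_ten = poisson_sf(10, lam)
print(f"expected reads per window: {lam:.3f}")
print(f"P(>=1 read): {p_one:.3g}   P(>=10 reads): {p_ten:.3g}")
```

A single read landing near a known site happens by chance all the time; a
stack of reads at the same place is what needs explaining.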
"With that, I am thinking I would need to get a bit bioformatical myself...
So any guru on the field could make more comments on the data-mining/
analysis part...
Also, any recommendations on de nova motif finders from ChIP data? How
about duplicate reads, and repetitive regions?...Pretty frustrated as of now
,...either I haven't found the right places/softwares, or the data-mining/
analysis for NGS is still pretty much at a primitive stage..."
I wrote a brief introduction a few days ago, in case you are interested.
Suggestion: ask your core to show you the mapped reads in a browser (seeing
is believing), and take a close look at the called peaks (I guess most of
them have an FDR of 100%). You'll then understand whether you should trust
them.
BTW, the number of peaks means nothing, because one can always call more or
fewer peaks by manipulating the thresholds.
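A toy illustration of that last point, with completely made-up peak scores
and cutoffs: the same data give very different peak counts depending on
where you draw the line.

```python
def count_peaks(peak_scores, min_score):
    """Number of peaks whose score is at or above the cutoff."""
    return sum(1 for score in peak_scores if score >= min_score)

# Hypothetical peak scores, not from any real dataset.
scores = [2.1, 3.5, 4.0, 5.2, 8.9, 12.4, 30.1, 55.0]

for cutoff in (2, 5, 10, 50):
    print(f"cutoff {cutoff:>2}: {count_peaks(scores, cutoff)} peaks")
```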