ChIP-seq analysis. Morgane Thomas-Chollier, Samuel Collombet. Computational systems biology - IBENS


1 ChIP-seq analysis Morgane Thomas-Chollier, Samuel Collombet Computational systems biology - IBENS mthomas@biologie.ens.fr M2 Computational analysis of cis-regulatory sequences 2014/2015 Denis Thieffry, Jacques van Helden and Carl Herrmann kindly shared some of their slides.

2 The ChIP-seq era PubMed hits per year for "ChIP-Seq"

3 Aim of the course 1 - From reads to peaks (= primary analysis) 2 - Secondary analysis - functional annotation of peaks - motif discovery in peaks

4 In vivo experimental methods to identify binding sites ChIP (= Chromatin Immuno-Precipitation) differences in methods to detect the bound DNA - small-scale: PCR / qPCR - large-scale: - microarray = ChIP-on-chip - sequencing = ChIP-seq http:// antibodies.com/

5 ChIP-seq aim: find all regions bound by a specific transcription factor, or by histones bearing a specific modification, in a given experimental condition (cell type, developmental stage, ...) Mardis. Nat Methods (2007) and then what????

6 ChIP-seq Experimental approach and then what???? Bioinformatic approach

7 Different ChIP profiles Park, Nature reviews 2009

8 Modelling noise levels ChIP-seq dataset (= treatment) = signal + background noise. How do we estimate the noise?

9 Modelling noise levels noise is not uniform (chromatin conformation, local biases, mappability) input dataset is mandatory for reliable local estimation! (although some algorithms do not require it :-( ) treatment input?

10 From sequence reads to peaks experiment Input sequences (reads length 36 bp) from Illumina

11 FASTQ reads:
Oct4:5:1:871:340 GGCGCACTTACACCCTACATCCATTG +
Oct4:5:1:804:348 GTCTGCATTATCTACCAGCACTTCCC +
Oct4:5:1:767:334 GCTGTCTTCCCGCTGTTTTATCCCCC +
Oct4:5:1:805:329 GTAGTTTACCTGTTCATATGTTTCTG + IIIIIII9IIIIII?IIIIIIII7II
The same reads in FASTA:
>SRR Oct4:5:1:871:340 GGCGCACTTACACCCTACATCCATTG
>SRR Oct4:5:1:804:348 GTCTGCATTATCTACCAGCACTTCCC
>SRR Oct4:5:1:767:334 GCTGTCTTCCCGCTGTTTTATCCCCC
>SRR Oct4:5:1:805:329 GTAGTTTACCTGTTCATATGTTTCTG
Quality score encodings [diagram: range of ASCII characters, from ! to ~, used by each encoding]:
S - Sanger Phred+33, raw reads typically (0, 40)
X - Solexa Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
adapted from Wikipedia
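The encodings listed above differ only in the offset subtracted from each ASCII code. The following is a minimal sketch, assuming plain Phred-style scores; the function names `decode_quality` and `guess_offset` are hypothetical helpers, and the guessing rule is only a heuristic derived from the ranges on the slide (Solexa+64 goes down to -5, i.e. ASCII 59).

```python
def decode_quality(quality: str, offset: int) -> list[int]:
    """Convert an ASCII quality string into integer quality scores."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality: str) -> int:
    """Heuristic: characters below ASCII 59 cannot occur in any +64 encoding
    (Solexa+64 starts at -5 + 64 = 59), so their presence implies Phred+33.
    A string with only high characters stays ambiguous; 64 is assumed then."""
    return 33 if min(ord(ch) for ch in quality) < 59 else 64


quality = "IIIIIII9IIIIII?IIIIIIII7II"  # quality string shown on the slide
offset = guess_offset(quality)
scores = decode_quality(quality, offset)
print(offset, min(scores), max(scores))
```

With this string the characters `7` and `9` fall below the +64 range, so the sketch settles on the Sanger offset.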

12 From sequence reads to peaks experiment Input sequences (reads length 36 bp) from Illumina C quality check

13 http:// http://bioinfo-core.org/index.php/9th_discussion-28_October_2010 http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf

14 modencode Kni Drosophila

15 From sequence reads to peaks experiment Input sequences (reads length 30/34 bp) from Illumina C quality check if necessary only!!! cutadapt C remove adapter sequences http://code.google.com/p/cutadapt/ quality check

16 From sequence reads to peaks experiment Input C if necessary only!!! cutadapt C mapping Bowtie BED BAM SAM BED BAM SAM Langmead, Genome Biol 10:R25 (2009)

17 Mapping http://bifx-core.bio.ed.ac.uk:8080/galaxy/u/shaun%20webb/p/ngs-workshop Bowtie and Colourspace Bowtie BWA LastZ Tophat

18 From sequence reads to peaks experiment Input C if necessary only!!! cutadapt C mapping Bowtie BED BAM SAM BED BAM SAM Langmead, Genome Biol 10:R25 (2009) quality check Samstat Lassmann et al. Bioinformatics (2010)

19 From sequence reads to peaks experiment Input GR C if necessary only!!! cutadapt Input C mapping Bowtie BED BAM SAM Langmead, Genome Biol 10:R25 (2009) quality check visualization Samstat Lassmann et al. Bioinformatics (2010)

20 mapping peak-calling Valouev Nat Methods (2008), Jothi, NAR (2008)

21 From sequence reads to peaks experiment Input GR C if necessary only!!! cutadapt Input C mapping Bowtie BED BAM SAM Langmead, Genome Biol 10:R25 (2009) quality check visualization Samstat Lassmann et al. Bioinformatics (2010)

22 From sequence reads to peaks experiment Input MACS treatment vs control peaks C if necessary only!!! cutadapt peak calling MACS Cut-off FDR (2%) Zhang, Genome Biol (2008) C Bowtie BED BAM SAM Samstat visualization

23 From sequence reads to peaks experiment Input MACS treatment vs control peaks C if necessary only!!! cutadapt peak calling MACS Cut-off FDR (2%) Zhang, Genome Biol (2008) C Bowtie BED BAM SAM Samstat visualization

24 mapping peak-calling Valouev Nat Methods (2008), Jothi, NAR (2008)

25 From sequence reads to peaks: bimodal enrichment pattern, peak model
Two-step strategy: 1 - modelling the read shift size, 2 - peak calling
[Plot: percentage of forward tags, reverse tags and shifted tags vs distance to the middle; d = 119]
1: search high-quality paired peaks: MACS separates their forward and reverse reads and aligns them by the midpoint. The distance between the modes of the forward and reverse peaks in the alignment is defined as d, and MACS shifts all reads by d/2 toward the 3' ends to better locate the precise binding sites.
2: use the shift size to search for peaks, a Poisson distribution to measure the p-value of each peak, and a False Discovery Rate (FDR) calculation using the input data.
Feng, J., Liu, T., & Zhang, Y. (2011). Using MACS to Identify Peaks from ChIP-Seq Data, Current Protocols in Bioinformatics
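The Poisson step of the peak calling can be illustrated with a small sketch. This is not the MACS implementation: the functions `poisson_sf` and `window_pvalue`, the read counts, and the genome-wide rate are all hypothetical, and the expected count is simply taken as the larger of a genome-wide rate and the input count scaled to the treatment library size.

```python
import math


def poisson_sf(k: int, lam: float, max_terms: int = 1000) -> float:
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed term by term."""
    if k <= 0:
        return 1.0
    # Probability of exactly k events, computed in log space for stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + max_terms):
        total += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return min(total, 1.0)


def window_pvalue(chip_count: int, control_count: int,
                  chip_total: int, control_total: int,
                  genome_lambda: float) -> float:
    """p-value of a window's ChIP read count given the input (control) data."""
    scaled_control = control_count * chip_total / control_total
    lam = max(genome_lambda, scaled_control)  # never below the genome-wide rate
    return poisson_sf(chip_count, lam)


# Hypothetical window: 30 ChIP reads against 5 input reads, equal library sizes.
strong = window_pvalue(30, 5, 1_000_000, 1_000_000, genome_lambda=2.0)
weak = window_pvalue(6, 5, 1_000_000, 1_000_000, genome_lambda=2.0)
print(strong, weak)
```

The truncated sum assumes a small expected count per window; very large rates would need more terms or a library routine.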

26 ChIP-seq signal for transcription factors We expect to see a typical strand asymmetry in read densities ChIP peak recognition pattern

27 Tag shifting Each tag is shifted by d/2 (i.e. towards the middle of the IP fragment) where d represents the fragment length
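Tag shifting can be written in a few lines. The sketch below assumes each tag is described by the coordinate of its 5' end and its strand; the tag positions are hypothetical, while d = 119 is the value shown on the peak model slide.

```python
def shift_tag(five_prime: int, strand: str, d: int) -> int:
    """Shift a tag's 5' end by d/2 toward its 3' end (the fragment middle)."""
    half = d // 2
    # Forward-strand tags move right, reverse-strand tags move left.
    return five_prime + half if strand == "+" else five_prime - half


# Hypothetical tags flanking a binding site near position 1000.
tags = [(941, "+"), (945, "+"), (1059, "-"), (1055, "-")]
shifted = [shift_tag(pos, strand, 119) for pos, strand in tags]
print(shifted)
```

After shifting, forward and reverse tags pile up around the same position instead of forming two separate modes.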

28 From sequence reads to peaks experiment Input peak list BED C if necessary only!!! cutadapt peak calling MACS Zhang, Genome Biol (2008) C Bowtie BED BAM SAM Samstat visualization

29 Peak list (BED file) chr chr chr chr chr chr chr chr chr chr chr chr

30 Read mapping programs Bowtie (Bowtie2) BWA Generally not having a strong influence on the results» Parameters: retain uniquely mapped reads

31 Peak-calling programs Strong influence on the called peaks» Many different programs» They do not share the same «default» threshold to retain peaks» The top highest peaks are usually common, but the less obvious peaks are often not shared between different peak callers Mali Salmon-Divon et al, BMC Bioinformatics, 2010
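One simple way to see how much two peak callers agree is to count the peaks of one list that overlap any peak of the other. This is a minimal sketch with hypothetical peak coordinates and helper names; it uses a quadratic scan, which is fine for an illustration but too slow for tens of thousands of peaks.

```python
Peak = tuple[str, int, int]  # (chromosome, start, end), half-open interval


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks are on the same chromosome and share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def shared_fraction(peaks_a: list[Peak], peaks_b: list[Peak]) -> float:
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    shared = sum(any(overlaps(a, b) for b in peaks_b) for a in peaks_a)
    return shared / len(peaks_a)


# Hypothetical outputs of two peak callers.
caller_a = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
caller_b = [("chr1", 150, 350), ("chr2", 400, 600)]
print(shared_fraction(caller_a, caller_b), shared_fraction(caller_b, caller_a))
```

The measure is not symmetric, so it is worth reporting it in both directions.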

32 Peak calling programs To be chosen according to type of expected peaks» Transcription factors and «sharp» peaks» Chromatin marks and «broad peaks» Many new programs still being developed!» MACS is currently commonly used for sharp peaks» SICER is good with broad signal

33 Aim of the course 1 - From reads to peaks (= primary analysis) 2 - Secondary analysis

34 Visualizing in a genome browser Local tools (IGV)» Fast» Ideal for sensitive datasets Web-based tools (UCSC browser) with custom tracks» Integrated with much other information (conservation, ...)» Easy to share between collaborators File formats» BED => simply defines a region (start-end)» WIG, bedGraph => value assigned to each position
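The difference between the two kinds of files can be made concrete: a BED line only names a region, whereas a bedGraph line attaches a value to a run of positions. The sketch below assumes 0-based, half-open coordinates for both formats; the helper names and the coverage values are hypothetical.

```python
def bed_line(chrom: str, start: int, end: int) -> str:
    """A minimal 3-column BED record: just a region."""
    return f"{chrom}\t{start}\t{end}"


def to_bedgraph(chrom: str, values: list[int], start: int = 0) -> list[str]:
    """Collapse per-position values into bedGraph lines (chrom, start, end, value)."""
    lines = []
    run_start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[run_start]:
            lines.append(
                f"{chrom}\t{start + run_start}\t{start + i}\t{values[run_start]}"
            )
            run_start = i
    return lines


coverage = [0, 0, 3, 3, 3, 1]  # hypothetical read coverage at 6 positions
print(bed_line("chr1", 2, 5))
print("\n".join(to_bedgraph("chr1", coverage)))
```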

35 Visualizing in IGV You will manipulate these data.

36 Aim of the course 1 - From reads to peaks (= primary analysis) 2 - Secondary analysis - functional annotation of peaks - motif discovery in peaks

37 Distance to closest TSS Distance of the peaks to the closest TSS [Histogram: number of peaks vs distance to the closest TSS]
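The distances behind such a histogram can be computed with a binary search over sorted TSS positions. This is a sketch under simplifying assumptions: a single chromosome, peaks summarised by one position (for example the summit), and gene strand ignored; the coordinates and the function name are hypothetical.

```python
import bisect


def distance_to_closest_tss(peak_pos: int, sorted_tss: list[int]) -> int:
    """Signed distance (peak - TSS) to the nearest TSS in a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_tss, peak_pos)
    # Only the TSS just before and just after the peak can be the closest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    return min((peak_pos - tss for tss in candidates), key=abs)


tss_positions = [1000, 5000, 20000]  # hypothetical TSS coordinates
peaks = [1200, 4000, 500, 30000]     # hypothetical peak summits
distances = [distance_to_closest_tss(p, tss_positions) for p in peaks]
print(distances)
```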

38 Localisation of the peaks in the genome

39 Genes Regions Peaks
Idea: assign functional annotation to genomic regions; use statistics to avoid biases
- assign to each gene a regulatory domain: basal (-5kb/+1kb from TSS), extended (up to nearest basal region; max 1Mb)
- each domain is annotated to the functional terms of the corresponding gene ("Functional domains")
"GREAT improves functional interpretation of cis-regulatory regions" McLean et al. Nat. Biotech. (2010)
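The basal domain rule can be sketched as follows. The code assumes that "-5kb/+1kb" is counted relative to the gene orientation (5 kb upstream, 1 kb downstream), and it leaves out the extended domain entirely; the gene names, coordinates and function names are hypothetical.

```python
def basal_domain(tss: int, strand: str,
                 upstream: int = 5000, downstream: int = 1000) -> tuple[int, int]:
    """Basal regulatory domain around a TSS, oriented by the gene strand."""
    if strand == "+":
        return max(tss - upstream, 0), tss + downstream
    return max(tss - downstream, 0), tss + upstream


def genes_for_peak(peak_pos: int, genes: dict[str, tuple[int, str]]) -> list[str]:
    """Names of genes whose basal domain contains the peak position."""
    hits = []
    for name, (tss, strand) in genes.items():
        start, end = basal_domain(tss, strand)
        if start <= peak_pos < end:
            hits.append(name)
    return hits


# Hypothetical genes on one chromosome: name -> (TSS, strand).
genes = {"geneA": (10000, "+"), "geneB": (30000, "-")}
print(basal_domain(10000, "+"), basal_domain(30000, "-"))
print(genes_for_peak(6000, genes), genes_for_peak(34000, genes))
```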

40 Genes Regions Peaks term A term B Given that 60% of the genome is annotated to A, would I randomly expect 3 or more peaks to fall into region A? p > 0.5 Given that 15% of the genome is annotated to B, would I randomly expect 3 or more peaks to fall into region B? p = 0.07 "GREAT improves functional interpretation of cis-regulatory regions" McLean et al. Nat. Biotech. (2010)
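The question asked on this slide is a binomial tail probability. The slide does not state the total number of peaks, so the sketch below assumes 7 peaks purely for illustration; `binomial_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more peaks in the domain."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_peaks = 7  # assumption: not given on the slide
p_term_a = binomial_tail(3, n_peaks, 0.60)  # 60% of the genome annotated to A
p_term_b = binomial_tail(3, n_peaks, 0.15)  # 15% of the genome annotated to B
print(round(p_term_a, 3), round(p_term_b, 3))
```

With this assumed number of peaks, three hits in a term covering 60% of the genome are unsurprising, while the same three hits in a term covering 15% are much less likely by chance.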

41 Aim of the course 1 - From reads to peaks (= primary analysis) 2 - Secondary analysis - functional annotation of peaks - motif discovery in peaks

42 de novo motif discovery transcription factor target gene target gene Problem: How can we model/describe the binding specificity of a given TF? target gene cis-regulatory elements binding motif

43 de novo motif discovery Find exceptional motifs based on the sequence only (a priori no knowledge of the motif to look for) Criteria of exceptionality: - higher/lower frequency than expected by chance (over-/under-representation) - concentration at specific positions relative to some reference coordinate (positional bias) [plot axes: occurrences vs position]

44 de novo motif discovery Tools have existed for a long time! - MEME (1994) - RSAT oligo-analysis (1998) - AlignACE (2000) - Weeder (2001) - MotifSampler (2001) Why do we need new approaches for genome-wide datasets?

45 New approaches for ChIP-seq datasets Size, size, size - limited numbers of promoters and enhancers - tens of thousands of peaks!!!!!! http:// pages/odi-webinar-web.html

46 New approaches for ChIP-seq datasets Size, size, size - limited numbers of promoters and enhancers - tens of thousands of peaks!!!!!! the problem is slightly different - promoters: bp from co-regulated genes - peaks: 300bp, positional bias [plot axes: occurrences vs position] http:// pages/odi-webinar-web.html

47 New approaches for ChIP-seq datasets Size, size, size - limited numbers of promoters and enhancers - tens of thousands of peaks!!!!!! the problem is slightly different - promoters: bp from co-regulated genes - peaks: 300bp, positional bias motif analysis: not just for specialists anymore! - complete user-friendly workflows [plot axes: occurrences vs position] http:// pages/odi-webinar-web.html

48 Comparison of tools for ChIP-seq
Programs compared: peak-motifs, ChIPMunk, CompleteMotifs, MEME-ChIP, MICSA, GimmeMotifs
Web interface: yes yes yes yes no no
Size limitation: unrestricted (web site tested with 22 Mb); 100kb (web site); 500kb (web site); unrestricted, but motif discovery restricted to 600 peaks clipped to 100bp; motif discovery restricted to a few hundred base pairs; -
Stand-alone version: yes yes no yes yes yes
Tasks
- peak finding: no no no no yes no
- annotation of peak-flanking genes: no no yes no no
- sequence composition (mono- and di-nucleotides): yes no no no no
- motif discovery: yes yes yes yes yes yes
- enrichment in motifs from databases: no no yes yes no
- enrichment in discovered motifs: yes no no no no
- peak scoring: no no no yes yes no
- motif clustering: no no no no yes
- comparison discovered motifs / motif DB: yes no no yes yes
- sequence scanning for site prediction: yes no no yes no
- positional distribution of sites inside peaks: yes no yes no yes
- visualization in genome browsers: yes no yes no no
Motif discovery algorithms: RSAT oligo-analysis, RSAT dyad-analysis, RSAT position-analysis, RSAT local-word-analysis + in stand-alone version: MEME ChIPMunk; ChIPMunk; ChIPMunk MEME Weeder MEME DREME MEME MEME Weeder MotifSampler BioProspector Gadem Improbizer MDmodule Trawler MoAn
Pattern matching algorithms: RSAT matrix-scan-quick; no; patser; MAST + AME; no (enrichment)
Motif comparison algorithm: RSAT compare-motifs; no; STAMP; TOMTOM; STAMP
Motif clustering algorithm: STAMP
Comparison between discovered motifs: yes no yes no yes
Motif database comparisons: JASPAR UNIPROBE DMMPMM RegulonDB upload your own database; no; JASPAR TRANSFAC JASPAR TRANSFAC UNIPROBE FLYREG DPINTERACT SCPD DMMPMM and many others; no
Motif sizes: variable (multiple word assembly); user-specified; <=25 for MEME, <=12 for Weeder, <=13 for ChIPMunk; predefined ranges (small, medium, large, extra-large)
Multiple motifs: yes no yes yes yes
Ref (PMID): This article

49 Comparison of tools for ChIP-seq [same table as slide 48] Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012

50 Comparison of tools for ChIP-seq [same table as slide 48] Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012

51 RSAT peak-motifs fast and scalable treat full-size datasets [Plot: time (seconds) vs sequence size (Mb) for peak-motifs (oligo-analysis-7nt, dyad-analysis, position-analysis-7nt, local-words-7nt), dreme, chipmunk and meme; marked: 1h, size limit of other websites, typical ChIP-seq dataset] Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012

52 RSAT peak-motifs fast and scalable treat full-size datasets complete pipeline Thomas-Chollier, Herrmann, Defrance, Sand, Thieffry, van Helden Nucleic Acids Research, 2012

53 RSAT peak-motifs fast and scalable treat full-size datasets complete pipeline web interface within RSAT Jacques van Helden
Thomas-Chollier, Defrance, Medina-Rivera, Sand, Herrmann, Thieffry, van Helden Nucleic Acids Research, 2011
Medina-Rivera, Abreu-Goodger, Thomas-Chollier, Salgado, Collado-Vides, van Helden Nucleic Acids Research, 2011
Sand, Thomas-Chollier, van Helden Bioinformatics, 2009
Thomas-Chollier*, Sand*, Turatsinze, Janky, Defrance, Vervisch, van Helden Nucleic Acids Research, 2008
Sand, Thomas-Chollier, Vervisch, van Helden Nature Protocols, 2008
Thomas-Chollier*, Turatsinze*, Defrance, van Helden Nature Protocols, 2008
van Helden, Nucleic Acids Research, 2003

54

55 RSAT peak-motifs fast and scalable treat full-size datasets complete pipeline web interface accessible to non-specialists using 4 complementary algorithms - Global over-representation - oligo-analysis - dyad-analysis (spaced motifs) - Positional bias - position-analysis - local-words [Plots: time (seconds) vs sequence size (Mb) for peak-motifs (oligo-analysis-7nt, dyad-analysis, position-analysis-7nt, local-words-7nt), dreme, chipmunk and meme; occurrences vs position]

56 Motif discovery methods: frequency
Observed vs expected k-mer occurrences
- observed occurrences: computed from the test sequences
- expected occurrences: computed from background sequences, or from frequencies in the test sequences
- computation of P-value (binomial) and E-value (multi-testing correction)
Algorithms: oligo-analysis, dyad-analysis (spaced motifs)
Thomas-Chollier, Darbo, Herrmann, Defrance, Thieffry, van Helden Nature Protocols, 2012
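The frequency-based idea can be sketched as follows: count k-mers in the test sequences, estimate their prior frequency from background sequences, then score each k-mer with a binomial tail probability and an E-value (P-value times the number of k-mers tested). This is an illustration, not the RSAT oligo-analysis implementation: it counts overlapping occurrences on one strand only, uses a pseudocount of my choosing, and all sequences are invented.

```python
from collections import Counter
from math import comb


def count_kmers(seqs: list[str], k: int) -> Counter:
    """Count all (overlapping) k-mers on the given strand."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def binomial_tail(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))


def overrepresented(test_seqs: list[str], background_seqs: list[str], k: int):
    """Rows of (k-mer, observed, expected, P-value, E-value), best first."""
    obs = count_kmers(test_seqs, k)
    bg = count_kmers(background_seqs, k)
    n_obs, n_bg = sum(obs.values()), sum(bg.values())
    n_tests = len(obs)  # multi-testing correction factor
    rows = []
    for kmer, x in obs.items():
        prior = (bg[kmer] + 1) / (n_bg + 4**k)  # pseudocount: an assumption
        p_value = binomial_tail(x, n_obs, prior)
        rows.append((kmer, x, prior * n_obs, p_value, p_value * n_tests))
    return sorted(rows, key=lambda r: r[3])


# Invented peak sequences sharing a GATTACA site, and invented background.
test_seqs = ["ACGTGATTACAGG", "TTGATTACACC", "GGGATTACATT"]
background = ["ACGTACGTACGTAGCTAGCTAG", "TTTTGGGGCCCCAAAATCGATCGA"]
results = overrepresented(test_seqs, background, k=6)
for row in results[:2]:
    print(row)
```

The two best-scoring 6-mers are the two halves of the shared site, which a later assembly step would merge into one motif.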

57 Motif discovery methods: positional bias
- position-analysis: observed occurrences per window vs a homogeneous model [plot axes: occurrences vs position, split into windows]
- local-words: observed vs expected k-mer occurrences; observed in a window, expected from occurrences in whole sequences [plot axes: occurrences vs position]
Thomas-Chollier, Darbo, Herrmann, Defrance, Thieffry, van Helden Nature Protocols, 2012
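Positional bias can be illustrated by counting the occurrences of a word in successive windows and comparing them with a homogeneous model in which every position is equally likely. The sketch below uses a chi-square statistic as one possible measure; it is not presented as the exact RSAT position-analysis procedure, it assumes that all sequences have the same length, and the sequences are invented.

```python
def positional_chi2(seqs: list[str], word: str, window: int):
    """Occurrences of `word` per window, and chi-square vs a homogeneous model."""
    length = len(seqs[0])            # assumption: all sequences share this length
    n_pos = length - len(word) + 1   # number of possible start positions
    n_win = -(-n_pos // window)      # ceiling division
    observed = [0] * n_win
    for s in seqs:
        for i in range(n_pos):
            if s.startswith(word, i):
                observed[i // window] += 1
    total = sum(observed)
    chi2 = 0.0
    for w, obs in enumerate(observed):
        positions = min(window, n_pos - w * window)  # last window may be shorter
        expected = total * positions / n_pos         # homogeneous expectation
        if expected > 0:
            chi2 += (obs - expected) ** 2 / expected
    return observed, chi2


# Invented 20 bp peak sequences: GATA sits near the centre in three of four.
seqs = ["CCCCCCCCGATACCCCCCCC"] * 3 + ["GATACCCCCCCCCCCCCCCC"]
observed, chi2 = positional_chi2(seqs, "GATA", window=5)
print(observed, round(chi2, 2))
```

A high statistic indicates that occurrences concentrate in some windows; turning it into a P-value requires the chi-square distribution with the appropriate degrees of freedom.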

58 Direct versus indirect binding ChIP-seq does not necessarily reveal direct binding Direct binding Indirect binding The motif of the targeted TF is not always found in peaks!

59 To read further
ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions» Terrence S. Furey - Nature Reviews Genetics 13, (December 2012)
ChIP-Seq: advantages and challenges of a maturing technology» Peter J. Park - Nat Rev Genet October; 10(10):
Computation for ChIP-seq and RNA-seq studies» Shirley Pepke et al - Nature Methods 6, S22 - S32 (2009)