ChIP-seq guidelines used by the ENCODE and modENCODE consortia. 2012/11/15 DJEKIDEL Mohamed Nadhir


1 ChIP-seq guidelines used by the ENCODE and modENCODE consortia 2012/11/15 DJEKIDEL Mohamed Nadhir

2 Paper & authors

3 ENCODE and modENCODE performed: >1000 ChIP-seq experiments. 140 different factors. 100 cell lines. 4 different organisms.

4 Goals of the paper Provide guidelines to: Validate antibodies. Assess experiment quality. Score and evaluate data. Report data.

5 Where to download?

6 Brief overview of ChIP-seq

7 ChIP-seq workflow

8 ChIP-seq workflow ChIP-seq performed on human and mouse. ChIP-chip + ChIP-seq on small-genome organisms.

9 Antibody and IP specificity 2 types of antibody deficiency: Poor reactivity against the target. Cross-reactivity with other proteins

10 Antibody and IP specificity Two kinds of tests: Primary (easy, less effort to execute): Immunoblot (western blot). Immunostaining. Secondary (more effort needed): Factor knockdown by siRNA. ChIP-seq against >1 epitope or a different complex member. IP using an epitope-tagged factor. Enrichment → mass spectrometry. Binding-site motif analysis.

11 Antibody characterization workflow

12 Primary mode- Immunoblot The primary reactive band should contain >50% of the signal on the blot.

13 Primary mode- Immunoblot Some factors can deviate from the expected size because of: modifications, isoform differences, intrinsic properties of the factor. If the band differs by >20% from the expected size, check that: other studies confirm it; the signal is reduced by siRNA knockdown. If there are multiple bands, the factor can be identified by mass spectrometry.

14 Primary mode- Immunofluorescence Used when the immunoblot is not successful. Should be combined with a method that reduces the protein level.

15 Secondary mode- knockdown Performed using extracts from siRNA- or shRNA-treated cells. The original signal in all bands needs to decrease by >30%. Knockdown can also be measured by a ChIP experiment. Data can be submitted to ENCODE if the reduction is >50%.

16 Secondary mode- IP +MA

17 Secondary mode- Motif enrichment For data to be submitted, the motif should be: >4-fold enriched compared to all accessible regions. Present in >10% of analyzed peaks. Preferably applied if: The factor has a well-characterized motif. No paralogs are expressed. (Figure panels: motif enrichment, motif representation.)
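The two motif criteria on this slide can be sketched as a small check. This is only an illustration: the function name `motif_passes` and its arguments are hypothetical, and it assumes that "fold enrichment" means the fraction of peaks containing the motif divided by the fraction of accessible background regions containing it.

```python
def motif_passes(peaks_with_motif, total_peaks,
                 background_with_motif, total_background,
                 min_fold=4.0, min_fraction=0.10):
    """Return (fold_enrichment, peak_fraction, passes).

    Assumption: fold enrichment is the motif-containing fraction of peaks
    divided by the motif-containing fraction of accessible background regions.
    """
    if total_peaks <= 0 or total_background <= 0:
        raise ValueError("peak and background counts must be positive")

    peak_fraction = peaks_with_motif / total_peaks
    background_fraction = background_with_motif / total_background

    if background_fraction == 0:
        # Motif never seen in the background: enrichment is unbounded.
        fold = float("inf") if peak_fraction > 0 else 0.0
    else:
        fold = peak_fraction / background_fraction

    # Both slide criteria must hold: >4-fold enriched and in >10% of peaks.
    passes = fold > min_fold and peak_fraction > min_fraction
    return fold, peak_fraction, passes
```

For example, a motif found in 300 of 1000 peaks and 50 of 1000 background regions is about 6-fold enriched and present in 30% of peaks, so it would meet both criteria.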

18 Replication ChIP-seq experiments should be performed on 2 biological replicates. Replicate agreement is assessed using the IDR (irreproducible discovery rate) metric. >10 million uniquely mapped reads (UMR) per replicate (mammalian), with NRF (non-redundant fraction) >0.8. Library complexity: the fraction of DNA fragments that are non-redundant.
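As a rough illustration of the library-complexity measure, the non-redundant fraction can be computed as the number of distinct mapping positions divided by the total number of uniquely mapped reads. The sketch below assumes each read has already been reduced to a hypothetical `(chrom, position, strand)` tuple; reading BAM files is left out.

```python
def non_redundant_fraction(reads):
    """Fraction of uniquely mapped reads that fall on distinct positions.

    `reads` is an iterable of (chrom, position, strand) tuples; this input
    format is an assumption made for the sketch.
    """
    total = 0
    distinct = set()
    for read in reads:
        total += 1
        distinct.add(read)
    if total == 0:
        raise ValueError("no reads given")
    return len(distinct) / total


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the first read
    ("chr1", 100, "-"),  # same position, other strand: counted as distinct
    ("chr2", 5, "+"),
]
print(non_redundant_fraction(reads))  # 0.75
```

An NRF of 0.75 would fall below the 0.8 guideline on this slide.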

19 Sequencing depth The number of positive sites increases with the number of sequenced reads.

20 Sequencing depth

21 Sequencing depth

22 Control Sample Two basic methods to produce control DNA: Input DNA isolated from cells that have been cross-linked and fragmented under the same conditions as the ChIP sample. A mock ChIP reaction using a control antibody that reacts with an irrelevant, non-nuclear antigen (IgG control).

23 Peak calling After mapping to the genome, peak calling is performed. Many algorithms can be used: SPP, PeakSeq, MACS. Different algorithms use different statistical criteria. Generally, default settings in successful experiments give tens of thousands of peaks.

24 Evaluating data- Browser inspection Gives a first impression of the quality. Known binding locations can be examined. Check whether the read distribution is asymmetric between the two strands.

25 Evaluating data- Global enrichment Calculate the fraction of reads in peaks (FRiP). Typically, a minority of reads occur in peaks. FRiP correlates positively and linearly with the number of called regions.

26 Evaluating data- Global enrichment Most ENCODE data sets (787 of 1052) have a FRiP enrichment of 1% or more. Falling below 1% doesn't necessarily mean the experiment failed.
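A minimal sketch of the fraction-of-reads-in-peaks calculation follows. It assumes reads are reduced to hypothetical `(chrom, position)` pairs and peaks are non-overlapping, half-open `(chrom, start, end)` intervals; real pipelines work from BAM and peak files instead.

```python
from bisect import bisect_right


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    Assumptions for this sketch: reads are (chrom, pos) pairs and peaks are
    non-overlapping half-open intervals (chrom, start, end).
    """
    # Group and sort peak intervals per chromosome for binary search.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    total = 0
    in_peaks = 0
    for chrom, pos in read_positions:
        total += 1
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak starting at or before pos, if any.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    if total == 0:
        raise ValueError("no reads given")
    return in_peaks / total


peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [("chr1", 150), ("chr1", 250), ("chr1", 599), ("chr1", 600), ("chr2", 150)]
print(frip(reads, peaks))  # 0.4
```

The resulting value can then be compared with the 1% guideline, keeping in mind that a lower value does not by itself mean the experiment failed.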

27 Evaluating data- Cross-correlation A quality metric that is independent of peak calling. Based on the idea that sequence tag densities accumulate on the forward and reverse strands, centered around the binding site. Computed as follows: Shift the Watson strand by k bp. Compute the Pearson linear correlation between the 2 strands.
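The shift-and-correlate procedure can be sketched in plain Python. The inputs here are assumed to be per-base read-start counts for each strand over the same region (hypothetical names `watson` and `crick`); production tools work genome-wide and far more efficiently.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant track
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(watson, crick, max_shift):
    """Return {k: correlation} for shifts k = 0..max_shift.

    Shifting the Watson track by k bp pairs watson[i] with crick[i + k].
    """
    n = len(watson)
    if len(crick) != n or max_shift >= n - 1:
        raise ValueError("tracks must match and be longer than max_shift + 1")
    return {k: pearson(watson[:n - k], crick[k:]) for k in range(max_shift + 1)}


# Toy tracks: Crick-strand pileups sit 5 bp downstream of Watson-strand ones.
watson = [0] * 60
crick = [0] * 60
watson[10], watson[40] = 5, 3
crick[15], crick[45] = 5, 3

profile = strand_cross_correlation(watson, crick, max_shift=20)
best_shift = max(profile, key=profile.get)
print(best_shift)  # 5
```

In real data the shift with the highest correlation is expected to reflect the predominant fragment length.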

28 Evaluating data- Cross-correlation

29 Evaluating data- Cross-correlation NSC (normalized strand coefficient). RSC (relative strand correlation). It is recommended to repeat experiments with: NSC < 1.05, RSC < 0.8.
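The slide does not spell out how the two coefficients are derived. The sketch below assumes the usual definitions: NSC is the cross-correlation at the fragment-length shift divided by the minimum (background) cross-correlation, and RSC is the fragment-length value minus the minimum, divided by the read-length ("phantom" peak) value minus the minimum. The function name and the `profile` dictionary of shift to correlation are hypothetical.

```python
def nsc_rsc(profile, fragment_shift, read_length):
    """Compute (NSC, RSC) from a {shift: cross-correlation} profile.

    Assumed definitions (not stated on the slide):
      NSC = cc(fragment_shift) / min(cc)
      RSC = (cc(fragment_shift) - min(cc)) / (cc(read_length) - min(cc))
    """
    cc_min = min(profile.values())
    cc_fragment = profile[fragment_shift]
    cc_phantom = profile[read_length]
    if cc_min <= 0:
        raise ValueError("background cross-correlation must be positive")
    if cc_phantom == cc_min:
        raise ValueError("read-length peak equals background; RSC undefined")
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


# Toy profile: background 0.10, read-length peak 0.12, fragment peak 0.15.
profile = {0: 0.10, 36: 0.12, 150: 0.15, 300: 0.10}
nsc, rsc = nsc_rsc(profile, fragment_shift=150, read_length=36)
print(round(nsc, 3), round(rsc, 3))  # 1.5 2.5
```

Both toy values sit above the thresholds quoted on the slide (1.05 and 0.8).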

30 Evaluating data- Cross-correlation

31 Evaluating data- IDR Analysis IDR (irreproducible discovery rate): a statistic developed for ChIP-seq. Ranks peaks according to their significance (P-value, q-value, etc.). Detects which peaks have high consistency between the 2 replicates. An IDR score is provided for each peak. A transition between significant and insignificant peaks is expected.

32 Evaluating data- IDR Analysis

33 Evaluating data- IDR Analysis

34 Evaluating data- IDR Analysis

35 Evaluating data- IDR Analysis IDR is dominated by the weakest replicate. The numbers of significant binding regions identified using IDR on each individual replicate are recommended to be within a factor of 2 of each other.

36 Evaluating data- IDR Analysis

37 How good should a ChIP-seq experiment be? NSC > 1.05. RSC > 0.8. NSC/RSC 5-12. NRF > 0.8. N1/N2 <= 2. Np/Nt <= 2. FRiP >= 1%.
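Several of the thresholds quoted in this deck can be collected into a simple flagging helper. This is a sketch only: the names `CHECKS` and `flag_metrics` are hypothetical, only the NSC, RSC, NRF and FRiP cut-offs are encoded, and a flagged metric is a prompt to look closer rather than a verdict of failure.

```python
import operator

# Thresholds as quoted in the slides: NSC > 1.05, RSC > 0.8, NRF > 0.8,
# and a FRiP of 1% or more.
CHECKS = {
    "NSC": (operator.gt, 1.05),
    "RSC": (operator.gt, 0.8),
    "NRF": (operator.gt, 0.8),
    "FRiP": (operator.ge, 0.01),
}


def flag_metrics(metrics):
    """Return the names of metrics that are missing or miss their guideline."""
    flagged = []
    for name, (compare, threshold) in CHECKS.items():
        value = metrics.get(name)
        if value is None or not compare(value, threshold):
            flagged.append(name)
    return flagged


example = {"NSC": 1.2, "RSC": 0.7, "NRF": 0.9, "FRiP": 0.01}
print(flag_metrics(example))  # ['RSC']
```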

38 Conclusions Developed ChIP-seq best practices. Makes users aware of the quality of ENCODE data. The intention is not to force people to use these metrics.

39 THANKS