Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H

Similar documents
Introduction to ChIP- Seq data analyses

Introduc)on to ChIP- Seq data analyses

PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

The ChIP-Seq project. Giovanna Ambrosini, Philipp Bucher. April 19, 2010 Lausanne. EPFL-SV Bucher Group

Mentor: Dr. Bino John

ChIP-seq data analysis

Chapter 1 Analysis of ChIP-Seq Data with Partek Genomics Suite 6.6

Computation for ChIP-seq and RNA-seq studies

Decoding Chromatin States with Epigenome Data Advanced Topics in Computa8onal Genomics

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to genome biology

Leveraging biological replicates to improve analysis in ChIP-seq experiments

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Introduction to gene expression microarray data analysis

PIP-seq. Cells. Permanganate ChIP-Seq

Nature Biotechnology: doi: /nbt Supplementary Figure 1

Perm-seq: Mapping Protein-DNA Interactions in Segmental Duplication and Highly Repetitive Regions of Genomes with Prior- Enhanced Read Mapping

Gene Expression Data Analysis (I)

Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding.

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

Analysis of Microarray Data

: Genomic Regions Enrichment of Annotations Tool

EECS730: Introduction to Bioinformatics

Learning a Weighted Sequence Model of the Nucleosome Core and Linker Yields More Accurate Predictions in Saccharomyces cerevisiae and Homo sapiens

The Next Generation of Transcription Factor Binding Site Prediction

Introduction to Bioinformatics. Fabian Hoti 6.10.

Gene Expression Technology

Bioinformatics and Genomics: A New SP Frontier?

Le proteine regolative variano nei vari tipi cellulari e in funzione degli stimoli ambientali

Intro to Microarray Analysis. Courtesy of Professor Dan Nettleton Iowa State University (with some edits)

Regulation of eukaryotic transcription:

ChIPnorm: A Statistical Method for Normalizing and Identifying Differential Regions in Histone Modification ChIP-seq Libraries

Agilent Genomic Workbench 7.0

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

Intracellular receptors specify complex patterns of gene expression that are cell and gene

Ensembl Funcgen: A Database and API for Epigenomics and Gene Regulation Data.

Green Center Computational Core ChIP- Seq Pipeline, Just a Click Away

ChIP-seq analysis. adapted from J. van Helden, M. Defrance, C. Herrmann, D. Puthier, N. Servant

RNA-Sequencing analysis

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Microarrays: since we use probes we obviously must know the sequences we are looking at!

Computational Methods for Analyzing and Modeling Gene Regulation Dynamics

measuring gene expression December 5, 2017

Introduction to Next Generation Sequencing (NGS) Data Analysis and Pathway Analysis. Jenny Wu

Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6

Genome-Wide Localization of Protein-DNA Binding and Histone Modification by a Bayesian Change-Point Method with ChIP-seq Data

Data and Metadata Models Recommendations Version 1.2 Developed by the IHEC Metadata Standards Workgroup

Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.

Introduction to Bioinformatics and Gene Expression Technology

Predicting Microarray Signals by Physical Modeling. Josh Deutsch. University of California. Santa Cruz

Identifying and mitigating bias in next-generation sequencing methods for chromatin biology

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

Lecture 7: April 7, 2005

Figure S1: NUN preparation yields nascent, unadenylated RNA with a different profile from Total RNA.

SUPPLEMENTAL MATERIALS AND METHODS

Bioinformatics of Transcriptional Regulation

Multiple Testing in RNA-Seq experiments

Exploration, Normalization, Summaries, and Software for Affymetrix Probe Level Data

Sanger Sequencing: Troubleshooting Guide

Genomics xxx (2011) xxx xxx. Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage:

A pipeline for ChIP-seq data analysis (Prot 56)

The ENCODE Encyclopedia. & Variant Annotation Using RegulomeDB and HaploReg

Mapping strategies for sequence reads

Measuring transcriptomes with RNA-Seq

Performance characteristics of the High Sensitivity DNA kit for the Agilent 2100 Bioanalyzer

Multi-omics in biology: integration of omics techniques

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Quality Control Assessment in Genotyping Console

Figure S4 A-H : Initiation site properties and evolutionary changes

Microarray Analysis of Gene Expression in Huntington's Disease Peripheral Blood - a Platform Comparison. CodeLink compatible

Lecture #1. Introduction to microarray technology

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Microarray Gene Expression Analysis at CNIO

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Identifying the binding sites and regulatory targets of a transcription

A Distribution Free Summarization Method for Affymetrix GeneChip Arrays

DNA Microarray Data Oligonucleotide Arrays

qpcr Quantitative PCR or Real-time PCR Gives a measurement of PCR product at end of each cycle real time

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

ab ChIP Kit Magnetic One-Step

On the detection and refinement of transcription factor binding sites using ChIP-Seq data

Background Correction and Normalization. Lecture 3 Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

Myers Lab ChIP-seq Protocol v Modified January 10, 2014

Next-Generation Sequencing. Technologies

ChromaSig: A Probabilistic Approach to Finding Common Chromatin Signatures in the Human Genome

Many transcription factors! recognize DNA shape

Exploration and Analysis of DNA Microarray Data

RNA-Seq analysis using R: Differential expression and transcriptome assembly

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Modern Epigenomics. Histone Code

Supporting Information

ENCODE RBP Antibody Characterization Guidelines

Introduction to BioMEMS & Medical Microdevices DNA Microarrays and Lab-on-a-Chip Methods

Atelier Chip-Seq. Stéphanie Le Gras, IGBMC Strasbourg Violaine Saint-André, Institut Curie Paris Morgane Thomas-Chollier, ENS Paris

Serial Analysis of Gene Expression

Nima Hejazi. Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi. nimahejazi.org github/nhejazi

Validation Study of FUJIFILM QuickGene System for Affymetrix GeneChip

MulCom: a Multiple Comparison statistical test for microarray data in Bioconductor.

Transcription:

Introduction to ChIP Seq data analyses Acknowledgement: slides taken from Dr. H Wu @Emory

ChIP seq: Chromatin ImmunoPrecipitation it ti + sequencing Same biological motivation as ChIP chip: measure specific biological modifications along the genome: Detect binding sites of DNA binding proteins (transcription factors, pol2, etc.). quantify strengths of chromatin modifications q y g (e.g., histone modifications).

Experimental procedures SameasChIP chip as except the laststep: step: sequencing is used to replace microarray. Crosslink: fix proteins on Isolate genomic DNA. Sonication: cut DNA in small pieces of ~200bp. IP: use antibody to capture DNA segments with specific proteins. Reverse crosslink: remove protein from DNA. Sequence the DNA segments.

Hongkai Ji, Hopkins Biostat

Advantages of ChIP seq over ChIP chip Not limited by arraydesign design, especially useful for species without commercially available arrays. Higher spatialresolution. Better signal to noise ratio and dynamic ranges. Less starting materials.

ChIP seq data analyses Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis Hongkai Ji, Hopkins Biostat

ChIP seq peak detection Peaks: protein binding or histone modification sites. Data from ChIP seq: raw data: sequence reads. After alignments: genome coordinates (chromosome/position) of the start of each read. Usually, aligned reads are summarized into counts of equal sized bins genome wide: 1. segment genome into small bins of equal sizes (50bps). 2. Count number of reads started at each bin. The bin counts are inputs for many peak calling algorithms. Similar to probe signals in ChIP chip. Difference: discrete (seq) vs. continuous (microarray) data.

Normalization issues The most common normalization needed is to adjust for total counts. Divided by total counts is conservative, because ChIP sample contains reads mapped to background and peaks, but control sample have reads mapped to background only. Should normalize using the number of total reads in backgrounds. Two pass algorithm: Roughly find peaks, and exclude those regions. Computetotal total reads in theleftover regions andnormalizenormalize based on that. Other normalizations (GC contents, MA plot based) available, but don t seems to help much.

Before peak detection: what do we know about ChIP seq? SonicatedDNA sequences are selected so that the resulting sequence fragments are approximately 200bp. In the immunoprecipitated (IP) sample, more sequence reads should be presented around the TF binding sites. However, other artifacts need to be considered. DNA sequence: can affect amplification process or sequencing process Chromatin structure (e.g., open chromatin region or not): may affect the DNA sonication process. A control sample is necessary to correct artifacts.

Control sample is important

Reads aligned to different strands Number of Reads aligned to different strands form two distinct i peaks around the true binding sites. This information can be used to help peak detection. Valouev et al. (2008) Nature Method

Mappability For each basepair position in the genome, whether a 35 bp (or other length now) sequence tag starting from this position can be uniquely mapped to a genome location. Regions withlowmappability (highly repetitive) cannot have high counts, thus affect the ability to detect peaks.

Peak detection software MACS Cisgenome QuEST Hpeak PICS PeakSeq MOSAiCS

Peak detection methods Counts fromneighboring windows need to be combined to make inference. To combine counts: Smoothing based: moving average (MACS, Cisgeenome), HMM (Hpeak). Model clustering of reads starting position (PICS). Other consideration: mappability of the genome (PeakSeq, MOSAiCS).

MACS (Model based Analysis of ChIP Seq) Zhang et al. 2008, GB Estimate shift size of reads d from the distance of two modes from + and strands. Shift all reads toward 3 end by d/2. Use a dynamic Possion model to scan genome and score peaks. Counts in a window are assumed to following Poisson distribution with rate: local = max( BG, [ 1k,] 5k, 10k) The dynamic rate capture the local fluctuation of counts. FDR estimates from sample swapping: flip the IP and control samples and call peaks. Number of peaks detected under each p value cutoff will be used as null and used to compute FDR.

Cisgenome (Ji et al. 2008, NBT) ) Implemented with Windows GUI. Use a Binomial model to score peaks. k 1i k 2i n i =k 1i + k 2i k 1i n i ~ Binom(n i, p 0 )

FLJ20699 Consider mappability: PeakSeq Rozowsky et al. (2009) NBT First round analysis: detect possible peak regions by identifying threshold considering mappability: Cut genome into segment (L=1Mb). Within each segment, the same number of reads are permuted in a region of f Length, where f is the proportion of mappable bases in the segment. 1. Constructing signal maps Tags Extend mapped tags to DNA fr agment Map of number of DNA fragments at each nucleotide position Signal map 2. First pass: determining potential binding regions b y comparison to simulation f Length Length Simulate each segment Determine a threshold satisfying the desired initial false discovery rate Use the threshold to identify potential target sites Simulation Simulated target sites Threshold 100 80 60 40 20 0 1 0.6 0.4 0.2 0 Potential target sites Threshold Mappability map GTSE1 TRMU GRAMD4 TBC1D22A f CELSR1 CERK

Second round analysis: Normalize data by counts in background regions. Test significance of the peaks identified in first round by comparing the total count in peak region with control data, using binomial p value, with Benjamini Hochberg correction. ChIP-seq sam mple 3. Normalizing control to ChIP-seq sample 400 400 P f = 1 350 350 Slope = 0. 96 300 300 Correlation = 0.77 250 250 200 200 150 150 100 P f = 0 100 50 Slope = 1.24 50 Correlation = 0.71 0 0 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 Input DNA Input DNA 4. Second pass: scoring enriched target regions relativ e to control For potential binding sites calculate the f old enrichment Compute a P-value from the binomial distribution Correct for multiple hypothesis testing and deter mine enriched target sites ChIP-seq sam mple Select fraction of potential peaks to e xclude (parameter P f ) Count tags in bins along chromosome f or ChIP-seq sample and control Determine slope of least squares linear reg ression 100 80 60 40 20 0 100 80 60 40 20 0 Potential target sites ChIP-seq sample Normalized input DNA Enriched target sites

Bioconductor packages for ChIP seq There are some packages: chipseq, ChIPseqR, BayesPeak, PICS, etc., but not very popular. Most people p use command line driven software like MACS or CisGenome GUI.

Comparing ChIP seq and ChIP chip: Ji et al. (2008) Nature Biotechnology NRSF ChIP chip: hi 2 ChIP + 2 Mock IP in Jurkat cells, profiled using Affymetrix Human Tiling 2.0R arrays. NRSF ChIP seq: ChIP + Negative Control in Jurkat cells, sequenced with the next generation sequencer made by Illumina/Solexa.

Overlaps in peaks Before post processing After post processing

Correlation in signals

ChIP seq provide better signals

Review ChIP seq is used to detect interesting regions in the genome, same as tiling arrays. The enriched DNA segments is sequenced directly, instead of relying li on hbidi hybridization in tiling arrays. Number of aligned reads are input data. More reads in a region indicate stronger signals (for TF binding, etc.). Single dataset peak detection is similar to that in ChIP chip. Data in neighboring regions need to be combined. Joint analysis from multiple datasets if of great interests.