ChIP-seq and RNA-seq. Farhat Habib

1 ChIP-seq and RNA-seq Farhat Habib

2 Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIP: chromatin immunoprecipitation). Transcriptomes (RNA).

3 Background Large-scale ChIP and transcriptome studies previously used microarrays. Deep sequencing versions offer advantages in increased specificity, sensitivity and comprehensiveness.

4 Pepke, S. et al., Nat. Methods 6, S22-S32 (2009)

5 ChIP basics

6 ChIP-chip vs ChIP-seq

                 ChIP-chip                                   ChIP-seq
Resolution       Array-specific, generally bp                Single base pair
Noise source     Cross hybridization between probes          Some GC bias
                 and nonspecific targets
Coverage         Limited by array                            Limited by alignability of reads to genome
DNA required     High (few micrograms)                       Low (10-50 ng)
Amplification    More required                               Less required
Multiplexing     Not possible                                Possible

7 Alignment Many options for alignment (we have considered these previously): hash-based algorithms and BWT-based algorithms.

8 Peak Finding Peak finding can be conceptually divided into the following components: (i) a signal profile definition along each chromosome, (ii) a background model, (iii) peak call criteria, (iv) post-call filtering of artifactual peaks and (v) significance ranking of called peaks.

9 Signal from TFBSs RPM (reads per million) = number of reads aligning to a position per million aligned reads
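
The RPM definition above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's code: the function name `reads_per_million` and the input layout (one genomic position per aligned read) are assumptions made for the example.

```python
from collections import Counter


def reads_per_million(positions):
    """Return {position: RPM} from a list holding one position per aligned read.

    Hypothetical helper: RPM = reads at a position * 1e6 / total aligned reads.
    """
    total = len(positions)
    if total == 0:
        return {}
    counts = Counter(positions)          # raw reads per position
    scale = 1_000_000 / total            # per-million scaling factor
    return {pos: n * scale for pos, n in counts.items()}
```

Because the scaling factor depends only on the total number of aligned reads, RPM values from libraries of different depth become directly comparable.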

10 Signal from RNA Pol

11 Signal from Histone marks

12 Utilizing strand shift

13 Utilizing strand shift

14 Building a signal profile The signal profile is a smoothing of the tag counts to allow reliable region identification and better summit resolution. The simplest way to define a signal profile is to slide a window of fixed width across the genome, replacing the tag count at each site with the summed value within the window centered at the site. Consecutive windows exceeding a threshold value are merged. Some methods count tags within a window in a strand-specific fashion. An alternate approach is to extend the ChIP-seq tags along their strand direction (called an XSET) and to count overlaps above a threshold as peak regions. Tag extension before signal calculation serves the dual purpose of correcting for the assumed fragment length and smoothing over gaps that were not tagged because of low sampling or read mappability. Others perform a window scan but only after shifting the tag data in a strand-specific fashion to account for the fragment length.
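
The simplest approach described above (a fixed-width sliding window followed by merging of consecutive windows over a threshold) might look like the following sketch. The names `window_profile` and `merge_regions`, the per-base list of tag counts, and the assumption of an odd window width are all illustrative choices, not taken from any specific peak caller.

```python
def window_profile(tag_counts, width):
    """Replace each site's tag count with the sum over a window centered on it.

    Assumes `width` is odd so the window is symmetric; windows are truncated
    at the chromosome ends.
    """
    half = width // 2
    n = len(tag_counts)
    prefix = [0]                         # prefix sums give O(1) window sums
    for c in tag_counts:
        prefix.append(prefix[-1] + c)
    profile = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        profile.append(prefix[hi] - prefix[lo])
    return profile


def merge_regions(profile, threshold):
    """Merge consecutive sites whose signal exceeds `threshold`.

    Returns half-open (start, end) intervals.
    """
    regions = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = i                # open a new region
        elif start is not None:
            regions.append((start, i))   # close the current region
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions
```

A strand-specific variant would simply keep two count arrays, one per strand, and build a profile for each.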

15 Background model The background model is an assumed statistical noise distribution that guides the use of control data to filter out false positives in the treatment data. In the absence of control data, the background tag distribution is typically modeled with a Poisson or negative binomial distribution. When available, control data may be used to determine parameters for these distributions, or the control data may be subtracted from the signal along the genome, or the signal may be thresholded by its enrichment ratio relative to the control. Using experimental control data is considered important because it substantially reduces false positive regions that come from DNA shearing biases or sequencing artifacts.
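
As a rough illustration of the Poisson case, the sketch below computes the expected tag count per window under a uniform genome-wide background and the probability of seeing at least k tags by chance. The function names and the uniform-rate assumption are hypothetical simplifications; real callers often estimate local rates from control data instead.

```python
import math


def background_rate(total_tags, genome_length, window):
    """Expected tags per window if tags were spread uniformly (an assumption)."""
    return total_tags * window / genome_length


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Precision is limited to roughly 1e-16 because of the subtraction from 1.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)                # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term                      # add P(X = i)
        term *= lam / (i + 1)            # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)
```

For example, `poisson_sf(10, background_rate(1000, 1_000_000, 200))` asks how surprising 10 tags in a 200 bp window would be given 1,000 tags on a 1 Mb genome.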

16 Peak identification Locations where the signal satisfies certain quality criteria are considered candidate peaks. The main quality criterion is either an absolute signal threshold, a minimum enrichment relative to the background, or both.
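
The two criteria can be combined in a short filter such as the one below. It is only a sketch: `candidate_peaks`, its parameters, and the pseudocount used to avoid division by zero in the enrichment ratio are assumptions made for illustration.

```python
def candidate_peaks(signal, control, min_signal=None, min_enrichment=None,
                    pseudocount=1.0):
    """Return indices of sites passing the requested criteria.

    signal, control: per-site values from the treatment and control samples.
    min_signal:      absolute signal threshold (skipped if None).
    min_enrichment:  minimum (signal + pc) / (control + pc) (skipped if None).
    """
    hits = []
    for i, (s, c) in enumerate(zip(signal, control)):
        if min_signal is not None and s < min_signal:
            continue                     # fails the absolute threshold
        if min_enrichment is not None:
            ratio = (s + pseudocount) / (c + pseudocount)
            if ratio < min_enrichment:
                continue                 # fails the enrichment criterion
        hits.append(i)
    return hits
```

Passing both thresholds applies the "or both" case; passing only one reproduces either single criterion.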

20 Filtering artifacts Two popular filtering criteria are based on distribution of tags between DNA strands (directionality) and single-site duplicates. Directionality criteria include: fraction of plus and minus tags, fraction of plus (minus) tags occurring to the left (right) of the putative peak, and the presence of a partnered plus (minus) peak for each minus (plus) peak. Duplicate filters are straightforward and eliminate tags at single sites that exhibit counts much greater than that expected by chance.
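
Both kinds of filter are easy to prototype. The sketch below caps the number of tags kept at any single site and strand, and reports the fraction of plus-strand tags in a region. Capping at a fixed `max_per_site` is a simplification of "much greater than expected by chance", and the `(position, strand)` tuple layout and function names are assumptions for the example.

```python
from collections import Counter


def filter_duplicates(tags, max_per_site=1):
    """Keep at most `max_per_site` tags per (position, strand) site."""
    seen = Counter()
    kept = []
    for tag in tags:
        if seen[tag] < max_per_site:
            kept.append(tag)
            seen[tag] += 1
    return kept


def plus_fraction(tags):
    """Fraction of tags on the plus strand; None if there are no tags."""
    if not tags:
        return None
    plus = sum(1 for _, strand in tags if strand == "+")
    return plus / len(tags)
```

A region whose plus fraction is close to 0 or 1 lacks the paired plus/minus pattern expected around a true binding site and could be flagged for removal.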

21 Ranking peaks Called peak regions encompass a wide range of quantitative enrichments; thus an assessment of the relative confidence one should place in a given set of peaks is informative. A few callers do not provide P values, in which case the peak height or fold enrichment may be used to provide a peak ranking, though not statistical significance. From an end-user perspective, the false discovery rate (FDR) is more informative.
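
When a caller does report P values, one standard way to turn them into FDR estimates is the Benjamini-Hochberg procedure, sketched below. This is a generic statistical technique chosen for illustration; the slides do not say which FDR method any given peak caller uses, and many estimate FDR empirically from control data instead.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Peaks can then be ranked or thresholded on the adjusted values, for example keeping those at or below 0.05.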

22 Challenges Some ChIP-seq peak regions are spatial convolutions of multiple sources. In such cases, the highest density of reads does not always correspond to a source point. This is worse in gene-rich regions or smaller genomes with potentially higher densities of binding sites compressed in complicated modules. RNA polymerase and histone modifications are not as convenient to model as TFBSs.

23 Overview of ChIP-Seq P. J. Park, Nature Reviews Genetics 10, 669 (2009).

24 RNA-Seq RNA-seq refers to experimental procedures that generate DNA sequence reads derived from the entire RNA molecule. In theory, RNA-seq can be used to build a complete map of the transcriptome across all cell types, perturbations and states. Computational challenges for analysis of RNA-seq data fall into three main categories: (i) read mapping, (ii) transcriptome reconstruction and (iii) expression quantification.

25 Mapping RNA reads RNA-seq reads pose particular challenges because they are short (~ bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions.

26 Unspliced aligners There are two major algorithmic approaches to mapping RNA-seq reads to a reference. The first, unspliced read aligners, align reads to a reference without allowing any large gaps. They are ideal for mapping reads against a reference cDNA database for quantification purposes. They are limited to identifying known exons and junctions, and do not allow for the identification of splicing events involving new exons.

27 Spliced aligners Using these, reads can be aligned to the entire genome, including intron-spanning reads that require large gaps for proper placement. Several methods exist that fall into two main categories: exon-first and seed-and-extend.

28 Mapping RNA reads Exon-first: MapSplice, SpliceMap, TopHat. Seed-extend: GSNAP, QPALMA. M. Garber, M. G. Grabherr, M. Guttman, C. Trapnell, Nature Methods 8, 469 (2011).

29 Comparison Exon-first approaches are faster and require fewer computational resources than seed-extend methods. Exon-first approaches can miss spliced alignments for reads that also map to the genome contiguously, as can occur for genes that have retrotransposed pseudogenes. In contrast, seed-extend methods evaluate spliced and unspliced alignments in the same step, which reduces this bias toward unspliced alignments, yielding the best placement of each read.

30 Transcriptome reconstruction Building a map of all transcripts and isoforms that are expressed in a particular sample is referred to as transcriptome reconstruction. It is a computationally challenging task for three reasons. Gene expression spans several orders of magnitude, with some genes represented by only a few reads. Reads originate from the mature mRNA (exons only) as well as from the incompletely spliced precursor RNA (containing intronic sequences), making it difficult to identify the mature transcripts. Reads are short, and genes can have many isoforms, making it challenging to determine which isoform produced each read.

31 Reconstruction methods These fall into two main classes. Genome-guided methods rely on a reference genome: they map all the reads to the genome and assemble overlapping reads into transcripts. Genome-independent methods assemble the reads directly into transcripts without using a reference genome.

32 M. Garber, M. G. Grabherr, M. Guttman, C. Trapnell, Nature Methods 8, 469 (2011).

33 Genome-guided reconstruction Genome-guided assembly methods such as Cufflinks and Scripture use spliced reads to reconstruct the transcriptome. Scripture initially transforms the genome into a graph topology, which represents all possible connections of bases in the transcriptome either when they occur consecutively or when they are connected by a spliced read. This graph topology is then used to reduce the transcript reconstruction problem to a statistical segmentation problem of identifying significant transcript paths across the graph.

34 Cufflinks connects aligned reads into a graph based on the location of their spliced alignments. Scripture and Cufflinks build similar assembly graphs but differ in how they parse the graph into transcripts. Scripture reports all isoforms that are compatible with the read data (maximum sensitivity), whereas Cufflinks reports the minimal number of compatible isoforms (maximum precision). Specifically, Scripture enumerates all possible paths through the assembly graph that are consistent with the spliced reads and the fragment size distribution of the paired end reads. Cufflinks chooses a minimal set of paths through the graph such that all reads are included in at least one path.

35 Cufflinks uses read coverage across each path to decide which combination of paths is most likely to originate from the same RNA. M. Garber, M. G. Grabherr, M. Guttman, C. Trapnell, Nature Methods 8, 469 (2011).

36 Genome-independent reconstruction Genome-independent transcriptome reconstruction algorithms use the reads to directly build consensus transcripts. Consensus transcripts can then be mapped to a genome or aligned to a gene or protein database for annotation purposes. A commonly used strategy is to use a de Bruijn graph-based assembler such as Velvet or Trans-ABySS.
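
To make the de Bruijn idea concrete, the toy sketch below builds the graph that such assemblers start from: every k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is an illustration only, with a hypothetical function name, and it omits what production assemblers such as Velvet add on top (error correction, coverage tracking, reverse complements, graph simplification).

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix -> suffix
    return graph
```

Transcripts correspond to paths through this graph; overlapping reads contribute the same edges, which is how the graph stitches them together without a reference genome.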

37 Estimating transcript expression When using RNA-seq to estimate gene expression, read counts need to be normalized to extract meaningful expression estimates. There are two main sources of systematic variability that require normalization. First, RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample. Second, the variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples.

38 The reads per kilobase of transcript per million mapped reads (RPKM) metric normalizes a transcript's read count by both its length and the total number of mapped reads in the sample.
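
Written out, the metric is reads / (length in kb × mapped reads in millions). A minimal sketch follows; the function and parameter names are chosen here for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
```

For instance, 500 reads on a 2,000 bp transcript in a sample with 10 million mapped reads gives 500 / 2 kb / 10 M = 25 RPKM.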

39 Challenges As many genes have multiple isoforms, many of which share exons, and many gene families have close paralogs, some reads cannot be assigned unequivocally to a transcript. This read assignment uncertainty affects expression quantification accuracy. The number of potential isoforms greatly impacts the results, with incorrect or misassembled isoforms introducing uncertainty.

40 To summarize ChIP-seq and RNA-seq offer many possibilities for studying genetic regulation. Increasing read lengths allow better quantification of RNA expression and better reconstruction of transcripts. Growth in ChIP-seq and RNA-seq datasets will drive integrated computational analyses that address questions about how the chemical code of in vivo DNA binding for multiple factors relates to transcription output.