Modern Epigenomics. Histone Code

Ting Wang, Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University. Dragon Star 2012, Changchun, China, July 2, 2012

DNA methylation + Histone modification Chromatin

Chromatin: DNA plus protein in cells with nuclei. Nucleosome: 146 bp of DNA and 2 each of histones H2A, H2B, H3 and H4

The Nucleosome core particle Nucleosome H3 H4

Post-translational Histone Modifications http://www.nature.com/nsmb/journal/v14/n11/images/nsmb1337-F1.gif

Post-translational Histone Modifications: H3 tail. Modifications: acetylation, methylation. Enzymes: HATs, HDACs, KMTases. States: active, repressive

Li et al. (2007) Cell 128, 707

Histone Modifications in Relation to Gene Transcription. Li et al. (2007) Cell 128, 707

DNA methylation mediated repression

Repression independent of DNA methylation: H3K9 methylation, condensed chromatin

H3K27 methylation mediated repression 1. H3K27 methylation 2. DNA methylation

Mechanisms of Epigenetic Crosstalk

Epigenetic cancer therapy

DNA-methylation and HDAC inhibitors in clinical trials

Summary: Dnmt1, Dnmt3a, Dnmt3b are the mammalian DNMTs. Chromatin structure is influenced by covalent modification of histone tails. Multiple chromatin modification pathways are involved in silencing of genes, and these may show crosstalk with DNA methylation

Technologies for Interrogating Chromatin States: Histone Modifications. ChIP-chip and ChIP-seq (deep sequencing), each using an antibody specific to one type of histone modification

Chromatin-IP Sequencing: K4me1, K4me2, K4me3 (active); K27me3 (repressive)

Histone methylation and transcriptional state. Transcribed gene: K4me3 (FoxP1). Silent developmental gene: K27me3 (Olig1). Constitutive heterochromatin: K9me3, K20me3. Poised developmental gene: K4me3 and K27me3 (Olig1)
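
The mark combinations on this slide can be read as a small decision rule. The sketch below is only an illustration of that reading: the function name, the label strings, and the order in which the rules are checked are assumptions, and real data would need thresholds for deciding when a mark counts as present.

```python
def classify_state(marks):
    """Map a set of histone marks at a locus to a coarse state label.

    Mark names follow the slide (K4me3, K27me3, K9me3, K20me3).
    The rule order is an assumption made for this sketch.
    """
    marks = set(marks)
    if {"K4me3", "K27me3"} <= marks:
        return "poised"                        # both active and repressive marks
    if "K4me3" in marks:
        return "transcribed"
    if "K27me3" in marks:
        return "silent"
    if marks & {"K9me3", "K20me3"}:
        return "constitutive heterochromatin"
    return "unclassified"
```

For example, a locus carrying both K4me3 and K27me3 is labelled "poised" rather than "transcribed" because the combined rule is checked first.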

Predicting non-coding RNA? From sequence? Not clear which properties can be exploited; sequence features such as promoters are too weak. Histone modifications + conservation worked

Nucleosome Positioning from Histone ChIP-seq (Barski et al., Cell 2007). Nucleosome-resolution ChIP-seq of 21 histone marks in CD4+ T-cells. Total 185.7 M 25 nt tags sequenced. Analysis not at nucleosome resolution to map nucleosomes at specific regions. MNase digest. Antibody for

Combine Tags From All ChIP-Seq

Extend Tags 3' to 150 nt. Check Tag Count Across Genome

Take the middle 75 nt
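
The tag-processing steps on these slides (combine tags, extend each tag in the 3' direction to 150 nt, keep the middle 75 nt, count tags across the genome) can be sketched as below. This is a minimal illustration rather than the pipeline used in the study: the `(start, strand)` tag format, the function names, and the exact placement of the 75 nt window inside the fragment are assumptions.

```python
from collections import Counter

FRAGMENT_LEN = 150  # tags are extended 3' to this length (from the slide)
CORE_LEN = 75       # middle portion kept for counting (from the slide)

def core_interval(start, strand, tag_len=25):
    """Return the half-open [lo, hi) interval covering the middle 75 nt
    of the 150 nt fragment implied by one tag.

    start is the leftmost genomic coordinate of the tag (assumed format).
    """
    if strand == "+":
        frag_lo = start
    else:
        # A minus-strand tag marks the right end of the fragment.
        frag_lo = start + tag_len - FRAGMENT_LEN
    lo = frag_lo + (FRAGMENT_LEN - CORE_LEN) // 2
    return max(lo, 0), max(lo + CORE_LEN, 0)

def pileup(tags, tag_len=25):
    """Count, per genomic position, how many fragment cores cover it."""
    counts = Counter()
    for start, strand in tags:
        lo, hi = core_interval(start, strand, tag_len)
        for pos in range(lo, hi):
            counts[pos] += 1
    return counts
```

With these conventions, a plus-strand tag starting at 100 and a minus-strand tag starting at 225 describe the same 150 nt fragment, so their cores coincide and reinforce each other in the pileup.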

Digital DNaseI profiling: precise delineation of the accessible regulatory DNA compartment (accessible vs. inaccessible)

Digital DNaseI profiling: direct access to regulatory sequences

ChromHMM. Features along the DNA: Enhancer, Transcription Start Site, Transcribed Region. Observed chromatin marks (K4me1, K4me3, K36me3, K27ac) are called in 200 bp intervals based on a Poisson distribution. Most likely hidden state per interval: 1 2 3 4 6 6 6 6 6 5 5 5. [Figure: high-probability chromatin marks in each of states 1 to 6, drawn from K4me1, K4me3, K27ac and K36me3, with probabilities between 0.7 and 0.9.] All probabilities are learned from the data
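
The slide states that marks are called per 200 bp interval based on a Poisson distribution. A minimal sketch of that idea follows. It is not ChromHMM's own code: the function names, the use of the mean count across bins as the background rate, and the `1e-4` p-value threshold are assumptions chosen for illustration.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the exact CDF."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def binarize(counts, p_threshold=1e-4):
    """Call a mark present (1) in each bin whose tag count is unlikely
    under a Poisson background with the mean count as its rate."""
    lam = sum(counts) / len(counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in counts]
```

The resulting 0/1 vector per mark is the kind of observation a hidden Markov model could then explain with a sequence of hidden states.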

ChromHMM

Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski 07, Wang 08). State classes: Repetitive, Repressed, Active intergenic, Transcribed, Promoter. Chromatin marks from (Barski et al., Cell 2007; Wang et al., Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al., Cell 2008); expression data from (Su et al., PNAS 2004); lamina data from (Guelen et al., Nature 2008)

Next-gen Sequencing Technology

Forward Genetics Genotype Phenotype Hypothesis Test Hypothesis By Genetic Manipulation

Forward Genetics. Phenotype: two groups: 1. Develop colorectal cancer at young age; 2. Do not. Genotype: mutation in APC gene. Hypothesis: APC is a tumor suppressor gene. Test hypothesis by genetic manipulation: delete APC in mouse. Control: isogenic APC+

The Cycle of Forward Genetics Observation Phenotype?Sequencing? Genotype In 2005 $9 million/genome Not feasible Thinking Hypothesis Test Hypothesis By Genetic Manipulation Gene Deletion/Replacement Recombinant Technology

The Problem with Forward Genetics Sequencing Phenotype Sequencing Genotype Currently $40,000* /genome Cost is rapidly dropping Thinking Hypothesis Test Hypothesis By Genetic Manipulation Gene Deletion/Replacement Recombinant Technology

0 and 1st generation sequencing (* marks what was new in each generation)
Pre-1992, old fashioned way: S35 ddNTPs; gels; manual loading; manual base calling
1992-1999, ABI 373/377: fluorescent ddNTPs*; gels; manual loading; automated base calling*
1999, ABI 3700: fluorescent ddNTPs; capillaries*; robotic loading*; automated base calling; breaks down frequently
2003, ABI 3730XL: fluorescent ddNTPs; capillaries; robotic loading; automated base calling; reliable*

Next or 2nd-generation sequencing: 454/Roche GS-20/FLX (Oct 2005); ABI SOLiD (Oct 2007); Illumina/Solexa 1G Genetic Analyser (Feb 2007)

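A simple comparison of seq. tech.
3730XL (ABI): 96 reads/run; average read length 900-1200 bp; ~100,000 bp per run; data output 1-2 MB
454 (Roche): 400,000 reads/run; average read length 250-310 bp; 70 million bp per run; data output 20 GB
Illumina 1G (Solexa): 40 million reads/run; average read length 36 bp; 1 billion bp per run; data output 1.5 TB
SOLiD (ABI): 88-132 million reads/run (44-66 per slide); average read length 35 bp; 1 billion bp per run; data output 1.5-3.0 TB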

They can be applied to different areas
ABI 3730XL: routine sequencing; verify SNPs from next gen; 1X scaffold for novel genomes
Next gen short read instrument (Solexa), when quantity matters but length doesn't: expression tags; ChIP-Seq; re-sequencing
Next gen long read instrument (454), when length matters: novel genomes; metagenomics

Illumina Genome Analyzer

IGA Sequencing Pipeline: 1. Sample prep (1-5 days): ligate adapters. 2. Cluster generation on flow cell (1.5 day): clonal single molecule array. 3. Sequencing and imaging (2-3 days). 4. Data analysis (days-months)

8 channels (lanes) Cluster generation

Attach DNA to flow cell

Attach DNA to flow cell, then bridge amplification. Can we amplify epigenetic marks?

Cluster generation: clonal single molecule array

Clonal single molecule array (100 µm). Random array of clusters; ~1000 molecules per ~1 µm cluster; ~20-30,000 clusters per tile; ~40 M clusters per flowcell

Sequencing by synthesis. Cycle 1: add sequencing reagents; first base incorporated; remove unincorporated bases; detect signal. Cycle 2-n: add sequencing reagents and repeat. [Figure: template strands with bases added 5' to 3', one per cycle.]

Base calling from images. [Figure: one cluster imaged over cycles 1 to 9 and read as T G C T A C G A T.] The identity of each base of a cluster is read off from sequential images. Reversible terminator chemistry solves the homopolymer problem
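
Reading a cluster's sequence off sequential images amounts to picking, for each cycle, the base whose image is brightest at that cluster. The sketch below shows only that core step. The input layout (one tuple of four intensities per cycle, in A, C, G, T order) and the function name are assumptions, and real base callers also correct for effects such as channel cross-talk and phasing, which are ignored here.

```python
BASES = "ACGT"

def call_bases(intensities):
    """Return the read for one cluster.

    intensities: list with one (A, C, G, T) intensity tuple per cycle.
    The brightest channel in each cycle is taken as the base call.
    """
    read = []
    for cycle in intensities:
        best = max(range(4), key=lambda i: cycle[i])
        read.append(BASES[best])
    return "".join(read)
```

Because one base is added and imaged per cycle, a run of identical bases simply appears as the same channel being brightest in consecutive cycles.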

IGA without cover

Flow cell imaging

A flow cell contains eight lanes (Lane 1, Lane 2 ... Lane 8). Each lane/channel contains three columns of tiles (Column 1, Column 2, Column 3). Each column contains 100 tiles. Tile (350 x 350 µm): 20K-30K clusters. Each tile is imaged four times per cycle, one image per base: 345,600 images for a 36-cycle run
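
The image count quoted on this slide follows directly from the flow cell layout. The short calculation below reproduces it; the variable names are chosen for this sketch, and all numbers come from the slide.

```python
lanes = 8
columns_per_lane = 3
tiles_per_column = 100
images_per_tile_per_cycle = 4   # one image per base
cycles = 36

total_tiles = lanes * columns_per_lane * tiles_per_column
total_images = total_tiles * images_per_tile_per_cycle * cycles

print(total_tiles)    # number of tiles on the flow cell
print(total_images)   # number of images in a 36-cycle run
```

This gives 2,400 tiles and 345,600 images, matching the figure on the slide.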

Data analysis pipeline: tiff image files (345,600) -> Firecrest -> intensity files -> Bustard -> sequence files -> Eland (alignment to genome) -> additional data analysis

Applications: whole genome re-sequencing; gene expression; targeted re-sequencing; ChIP sequencing. Other applications: microRNA discovery

Read Length is Not As Important For Resequencing

Applications: genome re-sequencing; human exons (microarray capture/amplification); small (including miRNA) and long RNA profiling (including splicing); ChIP-Seq (transcription factors, histone modifications, effector proteins); DNA methylation; polysomal RNA; origins of replication/replicating DNA; whole genome association (rare, high impact SNPs); copy number/structural variation in DNA; ChIA-PET (transcription factor looping interactions); ???

Functional Genomics Data Analysis. Map reads to the genome. Available tools: MAQ, SOAP, MOSAIK, BWA, BOWTIE. Determine the target genome sequence (i.e., repeat classes). Mapping options: number of allowed mismatches (as function of position); number of mapped loci (e.g., 1 = unique read sequence). Generate consensus sequence and identify SNPs. Generate read enrichment profile (e.g., Wald Lab tool). Develop null model and calculate significantly enriched sites. High-level analysis: compare to annotations, other data sets, etc.
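
The two mapping options named on this slide, the number of allowed mismatches and the number of mapped loci, can be illustrated with a toy mapper. This is a deliberately naive sketch: it scans the genome linearly, whereas the listed tools use indexed data structures, and the function names and the default of two mismatches are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, genome, max_mismatches=2):
    """Return every position where the read matches the genome
    with at most max_mismatches mismatches (forward strand only)."""
    n = len(read)
    return [pos for pos in range(len(genome) - n + 1)
            if hamming(read, genome[pos:pos + n]) <= max_mismatches]

def unique_hit(read, genome, max_mismatches=2):
    """Return the mapping position only if the read maps to exactly one locus."""
    hits = map_read(read, genome, max_mismatches)
    return hits[0] if len(hits) == 1 else None
```

Keeping only reads for which `unique_hit` returns a position corresponds to requiring one mapped locus; reads from repeats map to several loci and are dropped.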

Limitations of short read technology. Need a genome: de novo assembly difficult. Can't sequence through repeats: 80% of the human genome is sequenceable. Need high coverage: 15-20X to detect polymorphisms; missed SNPs are likely due to low coverage; 300X for 1 in 20 event (1 heterozygous in 10 samples). Error rate increases past the first 30~50 bases
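
The coverage figures on this slide can be related with simple arithmetic: at 300X, an allele present in 1 of 20 chromosomes is expected in about 15 reads, which is comparable to the 15-20X quoted for a single sample. The sketch below works through this, plus the mean coverage from one run. The function names are assumptions, as is the ~3 Gb human genome size; the run figures (40 million reads of 36 bp) are taken from the technology comparison earlier in the deck.

```python
import math

def mean_coverage(n_reads, read_len, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_len / genome_size

def expected_variant_reads(coverage, allele_fraction):
    """Expected number of reads carrying a variant allele."""
    return coverage * allele_fraction

def prob_uncovered(coverage):
    """Chance a given base has zero reads, assuming Poisson-distributed depth."""
    return math.exp(-coverage)

one_run = mean_coverage(40_000_000, 36, 3_000_000_000)  # assumed 3 Gb genome
pooled = expected_variant_reads(300, 1 / 20)            # 1 het in 10 diploid samples
```

Under these assumptions a single run covers the human genome at roughly 0.5X, which shows why many runs were needed for whole-genome re-sequencing.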

Paired End Reads are Important! Read 1 and Read 2 are separated by a known distance. When one read falls in repetitive DNA and the other in unique DNA, the single read maps to multiple positions but the paired read maps uniquely
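
The way a known distance rescues a repetitive read can be sketched as a filter over candidate positions. This is an illustration only: the function name, the position lists, and the tolerance parameter are assumptions, and strand orientation is ignored for brevity.

```python
def resolve_pair(hits1, hits2, expected_distance, tolerance):
    """Pick the single (pos1, pos2) combination consistent with the
    known distance between the two reads, or None if ambiguous.

    hits1, hits2: candidate mapping positions for read 1 and read 2.
    """
    consistent = [(p1, p2)
                  for p1 in hits1
                  for p2 in hits2
                  if abs((p2 - p1) - expected_distance) <= tolerance]
    return consistent[0] if len(consistent) == 1 else None

# Read 1 is unique; read 2 lies in a repeat with three candidate loci.
placement = resolve_pair([1000], [1300, 50000, 91000],
                         expected_distance=300, tolerance=50)
```

Only the candidate at 1300 sits at the expected distance from the unique read, so the repetitive read is placed there.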

Paired Ends are Important, Part 2: deletion, insertion, inversion (Shendure et al. 2005)
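
The three signatures named here can be read off how a read pair from the donor maps onto the reference. The sketch below encodes the usual interpretation: a pair mapping farther apart than expected suggests a deletion in the donor, a pair mapping closer together suggests an insertion, and a pair with unexpected relative orientation suggests an inversion. The function name, the tolerance parameter, and the assumption that a normal pair maps to opposite strands are choices made for this illustration.

```python
def classify_pair(pos1, strand1, pos2, strand2, expected, tolerance):
    """Classify one read pair mapped to the reference.

    Assumes a concordant pair maps to opposite strands, separated by
    roughly `expected` bases.
    """
    if strand1 == strand2:
        return "inversion"
    span = abs(pos2 - pos1)
    if span > expected + tolerance:
        return "deletion"    # donor lacks sequence present in the reference
    if span < expected - tolerance:
        return "insertion"   # donor has extra sequence between the reads
    return "concordant"
```

A single discordant pair is weak evidence; in practice several pairs supporting the same event would be required before calling a variant.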

Paired-end mapping reveals structural variations: (a) basic insertion; (b) basic deletion; (c) basic inversion; (d) linking; (e) linked insertion; (f) everted duplication; (g) anchored split mapping (deletion); (h) anchored split mapping (insertion); (i) hanging insertion. [Figure: donor vs. reference read-pair layouts for each case.]

We need more genomes! Complete Genomics ($5000); ABI ($10,000); Illumina ($10,000); Intelligent Biosystems (<$1000)

3rd generation sequencing: Ion Torrent, PacBio, Nanopore

Ion Torrent Sensor, well and chip architecture. Wafer, die and chip packaging. JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242

Pros and Cons: fast (4 hour sequencing); cheap per run, but not per base*; homopolymers? (* yet)

Single-molecule, real-time (SMRT) sequencing PacBio

Nanopore sequencing