Introduction to genome biology

Lisa Stubbs

We've found most genes; but what about the rest of the genome?

Genome size*      12 Mb    95 Mb    170 Mb   1500 Mb   2700 Mb   3200 Mb
# coding genes    ~7000    ~20000   ~14000   ~26000    ~23000    ~21000
# transcripts     ~7000    ~50000   ~29000   ~53000    ~93000    ~200000
bp/gene           1714     4750     12143    57692     117391    152381

*Data taken from the ENSEMBL genome browser, www.ensembl.org

Most notably:
Coding gene number is relatively constant in metazoans, BUT
the number of alternative transcripts per gene and the gene density are not
  Each gene gives rise to many more isoforms: protein sequence diversity
  Much more non-coding DNA, including gene regulatory DNA
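The bp/gene and transcripts-per-gene trends in the table can be recomputed directly from the first three rows. The following is a minimal sketch; the function names are made up for illustration, and the lists simply follow the column order of the table, since the slide text does not label the species.

```python
# Values copied from the table above, in column order.
genome_mb = [12, 95, 170, 1500, 2700, 3200]
genes = [7000, 20000, 14000, 26000, 23000, 21000]
transcripts = [7000, 50000, 29000, 53000, 93000, 200000]


def bp_per_gene(size_mb, n_genes):
    """Average genomic span per coding gene in bp (1 Mb = 1,000,000 bp)."""
    return size_mb * 1_000_000 / n_genes


def transcripts_per_gene(n_transcripts, n_genes):
    """Average number of annotated transcripts per coding gene."""
    return n_transcripts / n_genes


for mb, g, t in zip(genome_mb, genes, transcripts):
    print(f"{mb:>5} Mb  {bp_per_gene(mb, g):>9,.0f} bp/gene  "
          f"{transcripts_per_gene(t, g):4.1f} transcripts/gene")
```

Gene density falls by roughly two orders of magnitude across the columns, while the coding gene count changes only about threefold.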

Most traditional studies have focused on promoters and nearby (proximal) enhancers
Promoter regions are most likely to be involved in recruiting RNA polymerase and related proteins
  TATA-binding protein (TBP) and TBP-associated factors (TAFs)
  General transcription factors (GTFs)
  Mediator complexes
Some transcription factors (TFs) are also more likely to be found at promoter sites
  SP1 and the E2F family are classical examples
BUT most other metazoan TFs are found preferentially at distant sites
  Introns, intergenic regions
  Some may be 100s or 1000s of bp from the target promoter, or even embedded within neighboring genes

Transcription factors and their binding sites
Most known TFs have short and variable binding sites, e.g. YY1, SP1, Mzf1
BUT the probability of finding a string such as the Yy1 core at any given position (even as a simple string, rather than a matrix) is (1/4)^4 = 1/256, i.e. about one chance match every 256 bp!
Most TFBS are not much more specific than this!
So, how to raise the probability that the site you find is functional?
1. Interspecies conservation: sites that are found in similar locations in diverse species are more likely to be functional
2. Site clustering: most TFs form homo- or heterodimers that significantly stabilize binding and influence function
3. Location within regions that are known to be in an open state in the cell type and conditions of interest
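The 1/256 figure can be checked with a few lines of code. This is a rough sketch that assumes independent, equally frequent bases (real genomes are not like this), and it uses the string "CCAT" only as an assumed stand-in for a 4-bp core; the helper names are hypothetical.

```python
import random


def match_probability(k):
    """Chance that a fixed k-mer matches at one position, assuming
    independent, equally frequent bases (a simplification)."""
    return 0.25 ** k


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_hits(seq, motif):
    """Count (possibly overlapping) motif occurrences on either strand."""
    targets = {motif, revcomp(motif)}
    k = len(motif)
    return sum(seq[i:i + k] in targets for i in range(len(seq) - k + 1))


core = "CCAT"  # assumed stand-in for a 4-bp core, not taken from the slides
print(match_probability(len(core)))  # 0.00390625, i.e. 1/256

# Scan 100 kb of random sequence to see how often the string turns up by chance.
random.seed(0)
random_seq = "".join(random.choice("ACGT") for _ in range(100_000))
print(count_hits(random_seq, core), "chance hits in 100 kb")
```

Scanning both strands roughly doubles the number of chance hits, which is part of why a string match alone says so little about function.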

How to find the regulatory needles in the haystack?
Vertebrate genomes are mostly non-coding
  ~2% coding; ~5% noncoding and evolutionarily conserved (at the DNA sequence alignment level)
Websites to view pre-aligned sequence conservation levels abound; e.g. the ECR browser, http://ecrbrowser.dcode.org/
zPicture and Mulan provide do-it-yourself tools for pairwise or multisequence alignments of up to 1 Mb; http://zpicture.dcode.org/, http://mulan.dcode.org/
All three tools allow detection of conserved TFBS from Transfac, Jaspar, and other databases
Conserved motifs are more likely to be functional
  As long as the biology you are interested in is also conserved
  Important to consider the appropriate species for comparisons
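The idea of a conserved motif can be illustrated on a toy pairwise alignment: a site counts as conserved only if the same string sits in the same alignment columns in both species. This is a simplified sketch, not how the ECR browser, zPicture, or Mulan work internally; the function name and the two aligned sequences are invented, and only the forward strand and exact string matches are considered.

```python
def conserved_motif_columns(aln_a, aln_b, motif):
    """Return alignment columns where `motif` occurs, ungapped, in both
    aligned sequences at the same columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    k = len(motif)
    hits = []
    for col in range(len(aln_a) - k + 1):
        if aln_a[col:col + k] == motif and aln_b[col:col + k] == motif:
            hits.append(col)
    return hits


# Toy alignment (invented); '-' marks a gap.
species_a = "TTCCATGA--CCATAA"
species_b = "TTCCATGAGGCTATAA"
print(conserved_motif_columns(species_a, species_b, "CCAT"))  # [2]
```

The second CCAT in species_a (column 10) is not reported because the aligned bases in species_b differ, which is the filtering effect the slide describes.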

ECR details: Step 2 Summary of conserved TFBS

Spatial display of conserved TFBS

Focusing on accessible chromatin
Even well-conserved motifs cannot be accessed in closed regions of chromatin
  Not accessible: e.g. H3K9Me3, H3K27Me3
  Accessible: e.g. H3K27Ac
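Restricting motif hits to accessible chromatin amounts to an interval-overlap filter. The sketch below assumes that open regions (for example, H3K27Ac-marked intervals) and motif sites are both available as (chrom, start, end) tuples with half-open coordinates; all coordinates and names are invented for illustration.

```python
def in_open_chromatin(site, open_regions):
    """True if a site (chrom, start, end) lies entirely inside an open region."""
    chrom, start, end = site
    return any(c == chrom and s <= start and end <= e
               for c, s, e in open_regions)


# Invented example intervals.
open_regions = [("chr1", 1000, 2000), ("chr2", 500, 900)]
motif_sites = [("chr1", 1500, 1504), ("chr1", 3000, 3004), ("chr2", 880, 884)]

accessible_sites = [s for s in motif_sites if in_open_chromatin(s, open_regions)]
print(accessible_sites)  # [('chr1', 1500, 1504), ('chr2', 880, 884)]
```

A linear scan like this is fine for a handful of regions; for genome-wide peak sets a sorted-interval or tree-based lookup would be the usual choice.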

How to find active elements?
Chromatin immunoprecipitation with TF and histone-modification antibodies
Chromatin and attendant proteins are chemically crosslinked (lightly) using formaldehyde
  Crosslinking will also attach proteins to each other, so detection of secondary (indirect) chromatin interactions is inevitable
Cross-linked chromatin is randomly sheared by sonication (average fragment size 200-500 bp)
Sonicated fragments in solution are exposed to a protein-specific antibody
Antibody is retrieved with DNA still attached
DNA is released with salt and heat (reverses the crosslinks)
Library is created for sequencing: ligation of tags and light PCR amplification
Sequenced directly, e.g. by Illumina sequencing

Sequence-based ChIP approaches
Harness ChIP, DNAse sensitivity, and other assays to Illumina sequencing
ChIP-enriched DNA is ligated to Illumina linkers and sequenced directly
If your experiment works, you've enriched a very small fraction of the genome:
  Requires a lot of input chromatin!
  Traditional methods need ~10^7 cells per experiment!!
Critical step is an efficient, selective antibody (and very few exist)

ChIP computational issues
Sequence is read from the randomly positioned ends of multiple, overlapping, randomly sheared fragments
Reads will be scattered over a distance of ~2X the shear fragment length; ChIP-seq reads surround but may not contain the DNA binding site
Computational tools (like MACS) need to join adjacent sets of read peaks and define a shift distance between read peaks to determine a summit
[Figure: seq reads, ChIP fragments, and binding site]

Analytical considerations
Genomic neighborhoods
  Shear efficiency is not really random
  Some genomic regions are fragile and sensitive; some are protected
  Chromatin-matched, co-sheared controls are essential
  Most peak-finders are strongly biased toward comparing control and experimental samples with similar numbers of reads
Repeatability is key
  Biological, or at least technical, replicates are also essential
  Artifactual peaks are very easy to generate!
Other ways to validate:
  Known targets
  Known motifs
  Similar targets in different cell types or tissues
Peak width
  Transcription factors typically yield sharp peaks; chromatin marks are sometimes broader and more diffuse
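The shift-distance idea mentioned under computational issues can be sketched on toy data: forward-strand reads pile up upstream of the binding site, reverse-strand reads downstream, and the offset that best aligns the two piles approximates the fragment length. This is a deliberately crude illustration, not the model MACS actually fits; the function names, the `max_shift` parameter, and the read positions are all assumptions.

```python
from collections import Counter


def estimate_shift(fwd_starts, rev_starts, max_shift=500):
    """Return the offset d (0..max_shift) at which the most forward-strand
    5' ends line up with reverse-strand 5' ends (crude cross-correlation)."""
    rev_counts = Counter(rev_starts)
    best_d, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(rev_counts[p + d] for p in fwd_starts)
        if score > best_score:
            best_d, best_score = d, score
    return best_d


def summit(fwd_starts, rev_starts, d):
    """Shift every read by d/2 toward its 3' end and return the position
    covered by the most shifted reads."""
    half = d // 2
    shifted = [p + half for p in fwd_starts] + [p - half for p in rev_starts]
    return Counter(shifted).most_common(1)[0][0]


# Invented 5' read ends around a binding site at position 1000,
# generated from 200-bp fragments.
fwd = [880, 900, 900, 920]
rev = [1080, 1100, 1100, 1120]

d = estimate_shift(fwd, rev)
print(d, summit(fwd, rev, d))  # 200 1000
```

In real data fragment lengths vary, so the cross-correlation peak is broad and the summit estimate is correspondingly less sharp than in this toy case.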

User-friendly tools
MACS: model-based peak detection; sensitive to peak enrichment and background
  Zhang et al., Genome Biology 2008; Feng et al. 2012, Nat Protocols, PMID: 22936215 (Xiaole Liu lab)
  MACS1 is best for sharp peaks (TFs); it will break diffuse peaks into smaller regions
  MACS2 is designed to allow broad- or sharp-peak detection
HOMER (http://homer.salk.edu/homer)
  Can be easily tweaked for more sensitive peak detection
  Comes packaged with a rich set of peak annotation tools
  Tools for DNAse-seq, Hi-C, differential ChIP analysis and many more
Both tools permit generation of wiggle files or similar that can be viewed in the UCSC browser
  Looking at your data is a very important step!
  Peak finders can miss peaks that you can easily see by eye!

Differential ChIP and connection to differential expression
Just as in other differential sequencing analyses, comparison requires rigorous normalization
Normalization is complicated for ChIP: peak height? Peak shape? Summit position? Read density? Local neighborhoods?
  Not as simple as an intensity score or a yes/no count
Chromatin dynamics and expression dynamics *might* or *might not* be temporally coordinated

[Figure: UCSC browser view (mm9, 76,304,000-76,313,000; 5 kb scale bar) of the Hsf1 locus with paired control (CK) and experimental (EX) frontal cortex ChIP tracks]
Track key:
  94-95: Frontal cortex, 120 min, control samples 1+2, 1M cells, H3K4me3 ChIP
  99-100: Frontal cortex, 120 min, experimental samples 1+2, 1M cells, H3K4me3 ChIP
  42-46: Frontal cortex, 30 min, control samples 1+2, 5M cells, H3K27ac ChIP
  41-45: Frontal cortex, 30 min, experimental samples 1+2, 5M cells, H3K27ac ChIP
  69-70: Frontal cortex, 120 min, control samples 1+2, 4M cells, H3K4me1 ChIP
  72-73: Frontal cortex, 120 min, experimental samples 1+2, 4M cells, H3K4me1 ChIP
  108+109: Frontal cortex, 120 min, experimental samples 1+2, 5M cells, H3K27me3 ChIP?
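As one small piece of the normalization problem raised for differential ChIP, read counts in a region can at least be scaled for sequencing depth before control and experimental samples are compared. The sketch below uses reads per million and a pseudocount; both are common conventions rather than anything prescribed in these slides, and they address only depth, not peak shape, summit position, or local neighborhood effects. All names and numbers are illustrative.

```python
import math


def rpm(count, total_reads):
    """Reads per million mapped reads for one region."""
    return count * 1_000_000 / total_reads


def log2_fold_change(exp_count, exp_total, ctl_count, ctl_total, pseudo=1.0):
    """log2 ratio of depth-normalized counts in one region; the pseudocount
    avoids division by zero when a region has no control reads."""
    return math.log2((rpm(exp_count, exp_total) + pseudo) /
                     (rpm(ctl_count, ctl_total) + pseudo))


# Invented example: 400 reads in a peak from a 20M-read experimental library,
# 100 reads in the same region from a 10M-read control library.
print(rpm(400, 20_000_000), rpm(100, 10_000_000))  # 20.0 10.0
print(round(log2_fold_change(400, 20_000_000, 100, 10_000_000), 2))
```

Without the depth scaling, the raw 400 versus 100 counts would suggest a fourfold difference rather than roughly twofold.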

Data from ChIP with TFs, modified histones, and other proteins are available for human (and to some degree, mouse and flies) as tables in the UCSC genome browser (www.genome.ucsc.edu)
From Hoffman et al., Nucleic Acids Res 41:827, 2013

Yet another example of why you should look at your data
[Figure: UCSC browser view (mm9, chr17:35,095,000-35,105,000; 5 kb scale bar) of the Hspa1a/Hspa1b locus with control (CK) and experimental (EX) frontal cortex ChIP tracks for H3K4me3, H3K27ac, H3K4me1, and H3K27me3, plus mouse mRNA and spliced EST tracks]

Transposon-based alternatives
These tools address an important issue:
  Library preps fail unless you start with significant ChIP input
  How to work with samples for which millions of cells are not available?
Solution
  Library prep without linker ligation
  A transposon brings in the essential Illumina (or other) primers
  Library prep is completed simply with PCR
  The need for substantial input DNA is removed
[Figure: Tn5 transposase loaded with oligos (e.g. Illumina library oligos); tagmentation; insertion; continued reaction; PCR; ready to sequence]

Regular ChIP prep vs. ChIP tagmentation
Treat with transposase and tag oligos while chromatin is still on the beads
Release after tagmentation, PCR, size-select and sequence (no library prep!)

Issues related to tagmentation
Illumina-owned kit is expensive
Ratio of DNA:transposase
  Has to be adjusted for each cell type and chromatin prep
  Need even fragmentation to avoid bias, and small enough fragments, in general, for Illumina
  Need to avoid making fragments too small
Bias observed in DNA: controls are complicated

Solution in ChIPmentation
Tagmentation while DNA is still protected by the antibody and cross-linked chromatin, still on the bead
Protects from over-tagmentation, thus allowing a full digestion without fear of losing the DNA
Allows the protocol to work over a 25X range of DNA:transposase ratios and lessens worries about time
Genome Res 24:2033-2040

Genome Biology Topic overview
Lectures
  Ross Hardison: Basics of gene regulation, epigenetics and ENCODE results
  David Hawkins: Chromatin states, biological applications
  James Taylor: Higher dimension chromatin structure
  Lisa Stubbs: Integrating data for biological inference: basics of expression correlation methods
Workshops
  Bowtie and MACS on Galaxy
  Peaks to features in Galaxy
  Bowtie and MACS / Tophat->Cuffdiff on the command line
Monday: student's choice
  How-to for ECR browser and zPicture (sequence alignments and conserved motifs)
  Simple methods for expression correlation: Cluster and Cytoscape
  ChIP peaks to MEME-ChIP (online connection to the MEME suite for large peak sets)
  DAVID functional clustering analysis (GO and pathway analysis tools online)