Introduction to genome biology


Lisa Stubbs

We've found most genes; but what about the rest of the genome?

Genome size*     12 Mb     95 Mb     170 Mb     1500 Mb     2700 Mb     3200 Mb
# coding genes   ~7000     ~20000    ~14000     ~26000      ~23000      ~21000
# transcripts    ~7000     ~50000    ~29000     ~53000      ~93000      ~200000
bp/gene          1714      4750      12143      57,692      117381      152381

*Data taken from the ENSEMBL genome browser, www.ensembl.org

Most notably:
- Coding gene number is relatively constant in metazoans, BUT
- The number of alternative transcripts per gene and the gene density are not
  - Each gene gives rise to many more isoforms: protein sequence diversity
  - Much more non-coding DNA, including gene regulatory DNA
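The "transcripts per gene" trend can be checked directly from the figures quoted above. The following is a minimal sketch, not part of the original slides: the lists simply copy the approximate gene and transcript counts in the order given, and the function name is illustrative. The transcript counts are approximate annotation totals, so the ratio is only a rough indicator.

```python
# Approximate values copied from the slide, in the order shown
# (smallest to largest genome). Species are not named in the source.
genome_size_mb = [12, 95, 170, 1500, 2700, 3200]
coding_genes = [7000, 20000, 14000, 26000, 23000, 21000]
transcripts = [7000, 50000, 29000, 53000, 93000, 200000]


def transcripts_per_gene(n_transcripts, n_genes):
    """Average number of annotated transcripts per coding gene."""
    return n_transcripts / n_genes


ratios = [transcripts_per_gene(t, g) for t, g in zip(transcripts, coding_genes)]

for size, ratio in zip(genome_size_mb, ratios):
    print(f"{size:>5} Mb genome: {ratio:.1f} transcripts per coding gene")
```

The ratio rises from about one transcript per gene in the smallest genome to several per gene in the largest, while the gene count itself stays within the same order of magnitude.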

Most traditional studies have focused on promoters and nearby (proximal) enhancers
- Promoter regions are most likely to be involved in recruiting RNA polymerase and related proteins
  - TATA-binding protein (TBP) and its associated factors (TAFs)
  - General transcription factors (GTFs)
  - Mediator complexes
- Some transcription factors (TFs) are also more likely to be found at promoter sites
  - SP1 and the E2F family are classical examples
- BUT, most other metazoan TFs are found preferentially at distant sites
  - Introns, intergenic regions
  - Some may be 100s or 1000s of bp from the target promoter, or even embedded within neighboring genes

Transcription factors and their binding sites
- Most known TFs have short and variable binding sites, e.g. YY1, SP1, Mzf1
- BUT the probability of finding a 4-bp string such as the Yy1 core by chance (even as a simple string, rather than a matrix) is (1/4)^4 = 1/256, i.e. about once every 256 bp!
- Most TFBS are not much more specific than this!

So, how to raise the probability that the site you find is functional?
1. Interspecies conservation: sites that are found in similar locations in diverse species are more likely to be functional
2. Site clustering: most TFs form homo- or heterodimers that significantly stabilize binding and influence function
3. Location within regions that are known to be in an open state in the cell type and conditions of interest
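To make the 1/256 figure concrete, here is a small sketch of the arithmetic. It assumes equal base frequencies (1/4 each) and independence between positions, which real genomes do not satisfy exactly; the function names and the strand handling are illustrative choices, not part of the original slides.

```python
def chance_match_probability(k, base_prob=0.25):
    """Probability that a fixed k-bp string matches at one given position,
    assuming independent bases with equal frequencies."""
    return base_prob ** k


def expected_chance_sites(k, genome_bp, both_strands=True):
    """Expected number of chance matches to a fixed k-bp string in a genome.
    Palindromic sites would be counted twice when both strands are scanned."""
    positions = genome_bp - k + 1
    strands = 2 if both_strands else 1
    return positions * strands * chance_match_probability(k)


p_core = chance_match_probability(4)                       # 1/256
human_hits = expected_chance_sites(4, 3_200_000_000)       # 3200 Mb genome

print(f"Chance of a 4-bp core at any position: 1/{round(1 / p_core)}")
print(f"Expected chance matches in a 3200 Mb genome: {human_hits:,.0f}")
```

Under these assumptions a 4-bp core is expected tens of millions of times in a mammalian-sized genome, which is why the three filters listed above are needed.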

How to find the regulatory needles in the haystack?
- Vertebrate genomes are mostly non-coding
  - ~2% coding; ~5% non-coding and evolutionarily conserved (at the DNA sequence alignment level)
- Websites to view pre-aligned sequence conservation levels abound; e.g. the ECR browser, http://ecrbrowser.dcode.org/
- zPicture and Mulan provide do-it-yourself tools for pairwise or multi-sequence alignments of up to 1 Mb; http://zpicture.dcode.org/, http://mulan.dcode.org/
- All three tools allow detection of conserved TFBS from Transfac, Jaspar, and other databases
- Conserved motifs are more likely to be functional
  - As long as the biology you are interested in is also conserved
  - Important to consider the appropriate species for comparisons
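The idea of "conserved at the sequence alignment level" can be illustrated with a sliding-window identity scan over a pairwise alignment. This is a minimal sketch and not how the ECR browser, zPicture, or Mulan are implemented; the function name, the window length, and the identity threshold are all illustrative assumptions.

```python
def conserved_windows(seq_a, seq_b, window=100, min_identity=0.7):
    """Scan two aligned sequences (equal length, '-' for gaps) and return
    (start, identity) for every window whose fraction of identical,
    non-gap columns is at least min_identity."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        cols = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for a, b in cols if a == b and a != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy alignment with two mismatched columns (hypothetical sequences).
species_1 = "ACGTACGTAC"
species_2 = "ACGTTCGTAA"
print(conserved_windows(species_1, species_2, window=5, min_identity=0.8))
```

A motif hit that falls inside a window passing the threshold would count as "conserved" in this simplified sense; the choice of comparison species determines how informative that is.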

ECR details: Step 2 Summary of conserved TFBS

Spatial display of conserved TFBS

Focusing on accessible chromatin
- Even well-conserved motifs cannot be accessed in closed regions of chromatin
  - Not accessible: e.g. H3K9me3, H3K27me3
  - Accessible: e.g. H3K27ac

How to find active elements? Chromatin immunoprecipitation with TF and histone-modification antibodies
1. Chromatin and attendant proteins are chemically crosslinked (lightly) using formaldehyde
   - Crosslinking will also attach proteins to each other, so that detection of secondary chromatin interactions is inevitable
2. Cross-linked chromatin is randomly sheared by sonication (average fragment size 200-500 bp)
3. Sonicated fragments in solution are exposed to a protein-specific antibody
4. Antibody is retrieved with DNA still attached
5. DNA is released with salt and heat (reverses the crosslinks)
6. Library is created for sequencing: ligation of tags and light PCR amplification
7. Sequenced directly, e.g. by Illumina sequencing

Sequence-based ChIP approaches
- Harness ChIP, DNase sensitivity, and other assays to Illumina sequencing
- ChIP-enriched DNA is ligated to Illumina linkers and sequenced directly
- If your experiment works, you've enriched a very small fraction of the genome:
  - Requires a lot of input chromatin!
  - Traditional methods need ~10^7 cells per experiment!!
- The critical step is an efficient, selective antibody (and very few exist)

ChIP computational issues
- Sequence is read from the randomly positioned ends of multiple, overlapping, randomly sheared fragments
- Reads will be scattered over a distance of ~2X the sheared fragment length; ChIP-seq reads surround but may not contain the DNA binding site
- Computational tools (like MACS) need to join adjacent sets of read peaks and define a shift distance between read peaks to determine a summit

[Figure: sequence reads and ChIP fragments surrounding a binding site]

Analytical considerations
- Genomic neighborhoods
  - Shear efficiency is not really random
  - Some genomic regions are fragile and sensitive; some are protected
  - Chromatin-matched, co-sheared controls are essential
  - Most peak-finders are strongly biased toward comparing control and experimental samples with similar numbers of reads
- Repeatability is key
  - Biological, or at least technical, replicates are also essential
  - Artifactual peaks are very easy to generate!
- Other ways to validate:
  - Known targets
  - Known motifs
  - Similar targets in different cell types or tissues
- Peak width
  - Transcription factors typically yield sharp peaks; chromatin marks are sometimes broader and more diffuse
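The "shift distance" idea can be sketched as follows: forward-strand reads pile up upstream of the binding site and reverse-strand reads downstream, so the offset that best aligns the two piles approximates the fragment length, and moving each read half that distance toward the other strand's pile centers the signal on the site. This is a simplified illustration in the spirit of that approach, not a reimplementation of MACS; the function names and toy coordinates are hypothetical.

```python
from collections import Counter


def estimate_shift(forward_starts, reverse_ends, max_shift=500):
    """Return the offset d (in bp) at which forward-strand 5' ends, moved
    downstream by d, overlap the largest number of reverse-strand 5' ends."""
    reverse_counts = Counter(reverse_ends)
    best_d, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(reverse_counts[pos + d] for pos in forward_starts)
        if score > best_score:
            best_d, best_score = d, score
    return best_d


def shift_reads(forward_starts, reverse_ends, d):
    """Move every read d/2 toward the fragment center so that both strands
    pile up over the presumed binding site."""
    half = d // 2
    return [p + half for p in forward_starts] + [p - half for p in reverse_ends]


# Toy example: 200-bp fragments centered near position 1005.
fwd = [900, 905, 910]
rev = [1100, 1105, 1110]
d = estimate_shift(fwd, rev)
print(d, sorted(shift_reads(fwd, rev, d)))
```

In real data the two strand piles are noisy and the control sample matters as much as the shift, which is the point of the analytical considerations listed above.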

User-friendly tools
- MACS: model-based peak detection; sensitive to peak enrichment and background
  - Zhang et al., Genome Biology 2008; Feng et al. 2012, Nat Protocols, PMID: 22936215 (Xiaole Liu lab)
  - MACS1 is best for sharp peaks (TFs); it will break diffuse peaks into smaller regions
  - MACS2 is designed to allow broad- or sharp-peak detection
- HOMER (http://homer.salk.edu/homer)
  - Can be easily tweaked for more sensitive peak detection
  - Comes packaged with a rich set of peak annotation tools
  - Tools for DNase-seq, Hi-C, differential ChIP analysis and many more
- Both tools permit generation of wiggle files or similar that can be viewed in the UCSC browser
  - Looking at your data is a very important step! Peak finders can miss peaks that you can easily see by eye!

Differential ChIP and connection to differential expression
- As with any differential analysis of sequencing data, comparison requires rigorous normalization
- Normalization is complicated for ChIP: peak height? Peak shape? Summit position? Read density? Local neighborhoods?
  - Not as simple as an intensity score or a yes/no count
- Chromatin dynamics and expression dynamics *might* or *might not* be temporally coordinated

[Figure: UCSC genome browser view (mm9, 5 kb scale bar, positions 76,304,000-76,313,000) of the Hsf1 locus, with ChIP tracks from frontal cortex (FCX), control (CK) versus experimental (EX) samples 1+2:
 - H3K4me3, 120 min, 1M cells (samples 94-95 control, 99-100 experimental)
 - H3K27ac, 30 min, 5M cells (samples 42-46 control, 41-45 experimental)
 - H3K4me1, 120 min, 4M cells (samples 69-70 control, 72-73 experimental)
 - H3K27me3, 120 min, 5M cells (samples 108+109)]
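As a starting point for comparing a region between two ChIP samples, read counts are often scaled by sequencing depth before any ratio is taken. The sketch below shows only that simplest step (reads per million plus a log2 ratio with a pseudocount); it does not address peak shape, summit position, or local background, and the function names and read counts are hypothetical.

```python
import math


def reads_per_million(reads_in_region, total_mapped_reads):
    """Scale a raw read count by library size (reads per million mapped)."""
    return reads_in_region * 1_000_000 / total_mapped_reads


def log2_fold_change(exp_rpm, ctrl_rpm, pseudocount=1.0):
    """Log2 ratio of depth-normalized signal; the pseudocount avoids division
    by zero and damps ratios between very small values."""
    return math.log2((exp_rpm + pseudocount) / (ctrl_rpm + pseudocount))


# Hypothetical counts for one peak region in two libraries of different depth.
ctrl = reads_per_million(150, 10_000_000)   # control library: 10M mapped reads
expt = reads_per_million(600, 20_000_000)   # experimental library: 20M mapped reads

print(f"control {ctrl:.1f} RPM, experimental {expt:.1f} RPM, "
      f"log2 fold change {log2_fold_change(expt, ctrl):.2f}")
```

Here a fourfold difference in raw counts shrinks to roughly twofold once library size is taken into account, which is why raw peak heights in a browser track should not be compared directly.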

Data from ChIP with TFs, modified histones, and other proteins are available for human (and to some degree, mouse and flies) as tables in the UCSC genome browser (www.genome.ucsc.edu)
From Hoffman et al., Nucl Acid Res 41:827, 2013

Yet another example of why you should look at your data

[Figure: UCSC genome browser view (mm9, chr17, 5 kb scale bar, positions 35,095,000-35,105,000) of the Hspa1b and Hspa1a loci, with the frontal cortex H3K4me3, H3K27ac, H3K4me1 and H3K27me3 ChIP tracks, control (CK) versus experimental (EX)]

Transposon-based alternatives
- These tools address an important issue:
  - Library preps fail unless you start with significant ChIP input
  - How to work with samples for which millions of cells are not available?
- Solution
  - Library prep without linker ligation
  - A transposon brings in the essential Illumina (or other) primers
  - Library prep is completed simply with PCR
  - The need for substantial input DNA is removed

[Figure: Tn5 transposase (e.g. loaded with Illumina library oligos): tagmentation, insertion, continued reaction, PCR, ready to sequence]

Regular ChIP prep versus ChIP tagmentation
- Treat with transposase and tag oligos while chromatin is still on the beads
- Release after tagmentation, PCR, size-select and sequence (no library prep!)

Issues related to tagmentation
- The Illumina-owned kit is expensive
- The ratio of DNA:transposase has to be adjusted for each cell type and chromatin prep
  - Need even fragmentation to avoid bias, and small enough fragments, in general, for Illumina
  - Need to avoid making fragments too small
- Bias observed in DNA; controls are complicated

Solution in ChIPmentation
- Tagmentation while DNA is still protected by the antibody and cross-linked chromatin, still on the bead
- Protects from over-tagmentation, thus allowing a full digestion without fear of losing the DNA
- Allows the protocol to work over a 25X range of DNA:transposon and lessens worries about time

Genome Res 24:2033-2040

Genome Biology Topic overview

Lectures
- Ross Hardison: Basics of gene regulation, epigenetics and ENCODE results
- David Hawkins: Chromatin states, biological applications
- James Taylor: Higher dimension chromatin structure
- Lisa Stubbs: Integrating data for biological inference: basics of expression correlation methods

Workshops
- Bowtie and MACS on Galaxy
- Peaks to features in Galaxy
- Bowtie and MACS / Tophat->Cuffdiff on the command line
- Monday: student's choice
  - How-to for ECR browser and zPicture (sequence alignments and conserved motifs)
  - Simple methods for expression correlation: Cluster and Cytoscape
  - ChIP peaks to MEME-ChIP (online connection to the MEME suite for large peak sets)
  - DAVID functional clustering analysis (GO and pathway analysis tools online)