Using GREAT.stanford.edu to interpret cis-regulatory rich datasets including ChIP-Seq, Epi markers, GWAS etc.



Gill Bejerano, Dept. of Developmental Biology & Dept. of Computer Science, Stanford University. http://bejerano.stanford.edu

Human Gene Regulation. 10^13 different cells in an adult human, of hundreds of different cell types. All these cells have the same genome. 20,000 genes encode how to make proteins. 1,000,000 genomic switches determine which proteins to make and how much of each.

Most non-coding elements likely work in cis. IRX1 is a member of the Iroquois homeobox gene family. Members of this family appear to play multiple roles during pattern formation of vertebrate embryos. [Figure: 9 Mb genomic view; labels: gene deserts, regulatory jungles.] Every orange tick mark is roughly 100-1,000 bp long, evolves under purifying selection, and does not code for protein.

Many non-coding elements tested are cis-regulatory.

Bejerano Lab: Human Cis-Regulation. Themes: cis regulation, development, evolution, disease. We build tools, predict, and test in house.

Combinatorial Regulatory Code. 2,000 different proteins can bind specific DNA sequences. A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins, the regulatory region turns on, and the nearby gene is activated to produce protein. [Figure labels: DNA, proteins, protein binding site, gene.]

ChIP-Seq: first glimpses of the regulatory genome in action. [Figure labels: peak calling, cis-regulatory peak.]

What is the transcription factor I just assayed doing?
Collect known literature of the form:
Function A: Gene1, Gene2, Gene3, ...
Function B: Gene1, Gene2, Gene3, ...
Function C: ...
Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above. Form a hypothesis and perform further experiments.
[Figure legend: cis-regulatory peak; gene transcription start site.]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile. ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]. SRF is known as a master regulator of the actin cytoskeleton. In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation. [Figure legend: gene transcription start site; SRF binding ChIP-seq peak.]
[1] Valouev A. et al., Nat. Methods, 2008

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile. Existing, gene-based method to analyze enrichment: ignore distal binding events and count affected genes.
N = 8 genes in genome
K = 3 genes annotated with the ontology term (e.g. actin cytoskeleton)
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with the term
Rank by enrichment hypergeometric p-value: $P = \Pr(k \ge 1 \mid n = 2, K = 3, N = 8)$
[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term.]
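The toy hypergeometric p-value on this slide can be checked by hand. The following is a minimal sketch using only the Python standard library; the function name `hypergeom_tail` is an illustrative assumption, not part of any GREAT code.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated with the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Slide example: N = 8 genes, K = 3 annotated, n = 2 selected, k = 1 annotated hit.
p_value = hypergeom_tail(k=1, n=2, K=3, N=8)
print(f"{p_value:.4f}")  # 18/28, i.e. 0.6429
```

With only two selected genes and three of eight genes annotated, seeing at least one annotated gene is unsurprising, so the toy p-value is large.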

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for? Pro: a lot of tools out there for the analysis of gene lists. Con: these tools are built for microarray analysis. Does it matter? [Figure: microarray data and deep sequencing data both fed into a microarray tool.]

SRF gene-based enrichment results. Original authors can only state: basic cellular processes, particularly those related to gene expression, are enriched [1]. SRF acts on genes both in nucleus and cytoplasm that are involved in transcription and various types of binding. Where's the signal? Top actin term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008

Associating only proximal peaks loses a lot of information. Restricting to proximal peaks often leads to complete loss of key enrichments.
[Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets: SRF (H: Jurkat), NRSF (H: Jurkat), GABP (H: Jurkat), Stat3 (M: ESC), p300 (M: ESC), p300 (M: limb), p300 (M: forebrain), p300 (M: midbrain). x-axis: distance to nearest transcription start site (kb), binned 0-2, 2-5, 5-50, 50-500, > 500; y-axis: fraction of all elements, 0 to 0.7.]
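A distance histogram of this kind can be sketched in a few lines. The code below is an illustrative assumption about how one might bin peaks by distance to the nearest transcription start site (TSS) on a single chromosome; the toy coordinates and all function names are invented for the example, and bin boundaries are treated as half-open (a distance of exactly 2 kb falls in the 2-5 bin).

```python
from bisect import bisect_right
from collections import Counter

# Bin edges in kb, matching the x-axis labels of the figure.
BIN_EDGES_KB = [2, 5, 50, 500]
BIN_LABELS = ["0-2", "2-5", "5-50", "50-500", ">500"]


def nearest_tss_distance(position, sorted_tss):
    """Distance in bp from a peak midpoint to the closest TSS (sorted_tss must be non-empty)."""
    i = bisect_right(sorted_tss, position)
    neighbours = sorted_tss[max(i - 1, 0):i + 1]  # the TSS on each side, if present
    return min(abs(position - tss) for tss in neighbours)


def distance_bin(distance_bp):
    """Label of the distance bin that a bp distance falls into."""
    return BIN_LABELS[bisect_right(BIN_EDGES_KB, distance_bp / 1000)]


def bin_fractions(peak_midpoints, tss_positions):
    """Fraction of peaks in each distance bin."""
    sorted_tss = sorted(tss_positions)
    counts = Counter(
        distance_bin(nearest_tss_distance(m, sorted_tss)) for m in peak_midpoints
    )
    return {label: counts[label] / len(peak_midpoints) for label in BIN_LABELS}


# Hypothetical single-chromosome data, for illustration only.
toy_tss = [10_000, 100_000, 2_000_000]
toy_peaks = [10_500, 13_000, 60_000, 300_000, 1_000_000, 1_400_000]
print(bin_fractions(toy_peaks, toy_tss))
```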

Bad Solution: Associating distal peaks brings in many false enrichments. Why bad? 14% of human genes are tagged multicellular organismal development, but 33% of base pairs have such a gene nearest upstream/downstream. The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term | Bonferroni corrected p-value
nervous system development | 5x10^-9
system development | 8x10^-9
anatomical structure development | 7x10^-8
multicellular organismal development | 1x10^-7
developmental process | 2x10^-6
Regulatory jungles are often next to key developmental genes.

Real Solution: Do not convert to gene list. Analyze the set of genomic regions. GREAT = Genomic Regions Enrichment of Annotations Tool.
p = 0.33 of genome annotated with the ontology term
n = 6 genomic regions
k = 5 genomic regions hit annotation
$P = \Pr_{\mathrm{binom}}(k \ge 5 \mid n = 6, p = 0.33)$
Since 33% of base pairs are near a multicellular organismal development gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at genome, get NO (false) enrichments.
[Figure legend: gene transcription start site; ontology term (actin cytoskeleton); gene regulatory domain; genomic region (ChIP-seq peak).]
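The toy binomial p-value on this slide can be evaluated directly. This is a minimal standard-library sketch; `binom_tail` is an illustrative name and not GREAT's own implementation.

```python
from math import comb


def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance that at least k of n
    regions land in the annotated fraction p of the genome."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Slide example: p = 0.33 of the genome annotated, n = 6 regions, k = 5 hits.
p_value = binom_tail(k=5, n=6, p=0.33)
print(f"{p_value:.4f}")  # about 0.0170
```

Because the null expectation is the annotated fraction of the genome rather than the annotated fraction of genes, a term covering a third of all base pairs needs many more hits before it looks enriched.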

How does GREAT know how to assign distal binding peaks to genes?
Future: high-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
Currently: flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
Default: each gene has a basal regulatory domain of 5 kb upstream and 1 kb downstream of its transcription start site, which extends to the basal domain of the nearest genes within 1 Mb.
Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks.
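The default rule described above can be sketched for a single chromosome as follows. This is not GREAT's code: the `Gene` class, the function names, the toy coordinates, and the choice to measure the 1 Mb limit from the transcription start site are all assumptions made for illustration, and chromosome ends are only clamped at position 0.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    tss: int      # transcription start site position
    strand: str   # "+" or "-"


def basal_domain(gene, upstream=5_000, downstream=1_000):
    """Basal domain: 5 kb upstream and 1 kb downstream of the TSS, strand-aware."""
    if gene.strand == "+":
        return gene.tss - upstream, gene.tss + downstream
    return gene.tss - downstream, gene.tss + upstream


def regulatory_domains(genes, max_extension=1_000_000):
    """Extend each basal domain to the neighbouring genes' basal domains,
    but (assumption) no further than max_extension from the TSS."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in genes]
    domains = {}
    for i, gene in enumerate(genes):
        lo, hi = basal[i]
        left_limit = basal[i - 1][1] if i > 0 else 0
        right_limit = basal[i + 1][0] if i + 1 < len(genes) else float("inf")
        # Never shrink below the basal domain, even if neighbours overlap it.
        start = min(lo, max(left_limit, gene.tss - max_extension))
        end = max(hi, min(right_limit, gene.tss + max_extension))
        domains[gene.name] = (max(start, 0), end)
    return domains


def genes_for_position(position, domains):
    """Names of all genes whose regulatory domain [start, end) contains the position."""
    return sorted(name for name, (s, e) in domains.items() if s <= position < e)


# Hypothetical genes on one chromosome.
toy_genes = [Gene("A", 100_000, "+"), Gene("B", 400_000, "-"), Gene("C", 3_000_000, "+")]
toy_domains = regulatory_domains(toy_genes)
print(toy_domains)
print(genes_for_position(250_000, toy_domains))    # between A and B: both genes
print(genes_for_position(1_700_000, toy_domains))  # more than 1 Mb from B and C: none
```

A region between two genes is associated with both, while a region more than 1 Mb from every transcription start site is associated with none, which matches the slide's caveat that some associations may be missed.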

GREAT infers many specific functions of SRF from its binding profile.
Top gene-based enrichments of SRF: top actin-related term is 28th in list.
Top GREAT enrichments of SRF:
Ontology | Term | # Genes | Binomial P-value | Experimental support *
Gene Ontology | actin cytoskeleton | 30 | 7x10^-9 | Miano et al. 2007
Gene Ontology | actin binding | 31 | 5x10^-5 | Miano et al. 2007
Pathway Commons | TRAIL signaling | 32 | 5x10^-7 | Bertolotto et al. 2000
Pathway Commons | Class I PI3K signaling | 26 | 2x10^-6 | Poser et al. 2000
TreeFam | FOS gene family | 5 | 1x10^-8 | Chai & Tarnawski 2002
TF Targets | Targets of SRF | 84 | 5x10^-76 | Positive control
TF Targets | Targets of GABP | 28 | 4x10^-9 | ChIP-Seq support
TF Targets | Targets of YY1 | 44 | 1x10^-6 | Natesan & Gilman 1995
TF Targets | Targets of EGR1 | 23 | 2x10^-4 |
* Known from literature, as in: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat. Biotechnol., 2010]

GREAT data integrated. Twenty ontologies spanning broad categories of biology; 44,832 total ontology terms tested in each GREAT run. Terms per ontology (ontology names appear only in the slide figure): 2,800; 5,215; 834; 6,700; 3,079; 911; 150; 1,253; 288; 706; 5,781; 427; 456; 6,857; 8,272; 238; 615; 19; 222; 9. (Michael Hiller)
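With this many terms tested per run, raw p-values need a multiple-testing correction. A minimal Bonferroni sketch follows; it assumes the correction is applied across all 44,832 terms at once, which the slides do not state (a per-ontology correction is equally plausible), and the raw p-value is made up for illustration.

```python
TOTAL_TERMS = 44_832  # total ontology terms tested in each run, from the slide


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical raw p-value, not taken from the talk.
raw_p = 1e-12
print(bonferroni(raw_p, TOTAL_TERMS))  # roughly 4.5e-08
```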

GREAT implementation. Can handle datasets of hundreds of thousands of genomic regions. Testing a single ontology term takes ~1 ms. Enables real-time calculation of enrichment results for all ontologies. (Cory McLean)

GREAT web app: input page. http://great.stanford.edu. Pick a genome assembly. Input BED regions of interest. (Dave Bristor)
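For readers unfamiliar with BED, the input is plain text with one region per line: chromosome, start, and end, separated by tabs, with a 0-based start and an exclusive end. A minimal sketch that formats regions this way is shown below; the coordinates are made up and `to_bed_lines` is an illustrative helper, not part of GREAT.

```python
def to_bed_lines(regions):
    """Format (chrom, start, end) tuples as minimal three-column BED lines."""
    lines = []
    for chrom, start, end in regions:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {chrom}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}")
    return lines


# Hypothetical peak coordinates, for illustration only.
toy_regions = [("chr1", 1_000_000, 1_000_250), ("chr5", 3_600_000, 3_600_400)]
print("\n".join(to_bed_lines(toy_regions)))
```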

As of February 11: added zebrafish.

GREAT web app (optional): alter association rules. http://great.stanford.edu. Three association rule choices. Literature-curated domains for a small subset of genes. [Figure: Lnp, Evx2, HoxD cluster; adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]

GREAT web app: output summary. Ontology-specific enrichments. Additional ontologies, term statistics, multiple hypothesis corrections, etc.

GREAT web app: term details page. Genes annotated as actin binding with associated genomic regions. Genomic regions annotated with actin binding. Drill down to explore how a particular peak regulates Plectin and its role in actin binding.

You can also submit any track straight from the UCSC Table Browser. A simple, well-documented programmatic interface allows any tool to submit directly to GREAT. (See our Help. Inquiries welcome!)

GREAT web app: export data. HTML output displays all user-selected rows and columns. Tab-separated values also available for additional postprocessing.

GREAT web stats: 40 jobs/day x 300 days up; 500 entries.

GREAT can be used with any cis-reg set.

Top 119 SNPs associated with diabetes from NIH GWAS Catalog

Human-specific loss of regulatory sequence. Human-specific deletions appear significantly often next to steroid hormone receptors and neural tumor suppressors. http://bejerano.stanford.edu [McLean et al., Nature, 2011]

Summary. The human genome is chock-full of regulatory sequences. GREAT accurately assesses functional enrichments of validated or putative cis-regulatory sequences using a novel genomic region-based approach [McLean et al., Nature Biotechnol., 2010]. The online tool, available at http://great.stanford.edu, has been embraced by the biomedical research community. http://bejerano.stanford.edu

Bejerano Lab: Developmental Genomics & Evolutionary Developmental Genomics (dry and wet lab).
Funding: NIH / NICHD, NHGRI; NSF / STC; Packard Foundation; HFSP Young Investigator Award; Searle Scholar Network; Microsoft Research; Mallinckrodt Foundation; A.P. Sloan Foundation; Okawa Foundation; Burroughs Wellcome.
Bejerano Lab: Bruce Schaar, Ph.D.; Geetu Tuteja, Ph.D. (Dean's Fellow); Michael Hiller, Ph.D. (HFSP Fellow); Andrew Doxey, Ph.D. (NSERC Fellow); Cory McLean (Bio-X Fellow); Shoa Clarke (HHMI Gilliam Fellow); Aaron Wenger (SiGF Fellow); Harendra Guturu (NSF Fellow); Jim Notwell (NSF Fellow); Tisha Chung; Jose Soltren; Saatvik Agarwal; Jenny Chen; Sushant Shankar.
Stanford: David Kingsley & lab.

http://bejerano.stanford.edu