Using GREAT.stanford.edu to interpret cis-regulatory rich datasets including ChIP-Seq, Epi markers, GWAS etc. Gill Bejerano Dept. of Developmental Biology & Dept. of Computer Science Stanford University http://bejerano.stanford.edu 1
Human Gene Regulation: 10^13 cells in an adult human, of hundreds of different cell types. All these cells have the same genome. 20,000 genes encode how to make proteins. 1,000,000 genomic switches determine which proteins to make, and how much. 2
Most non-coding elements likely work in cis. IRX1 is a member of the Iroquois homeobox gene family. Members of this family appear to play multiple roles during pattern formation of vertebrate embryos. Every orange tick mark is roughly 100-1,000 bp long, evolves under purifying selection, and does not code for protein. [Figure: 9 Mb genomic region, labeled "gene deserts" and "regulatory jungles"] 3
Many non-coding elements tested are cis-regulatory 4
Bejerano Lab : Human Cis-Regulation CIS REGULATION DEVELOPMENT EVOLUTION DISEASE We build tools, predict and test in house. 5
Combinatorial Regulatory Code: 2,000 different proteins can bind specific DNA sequences. A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins, the regulatory region turns on and the nearby gene is activated to produce protein. [Figure labels: DNA; proteins; protein binding site; gene] 6
ChIP-Seq: first glimpses of the regulatory genome in action Peak Calling Cis-regulatory peak 7
What is the transcription factor I just assayed doing? Collect known literature of the form: Function A: Gene1, Gene2, Gene3, ...; Function B: Gene1, Gene2, Gene3, ...; Function C: ... Ask whether the binding sites you discovered preferentially bind (regulate) genes of one or more of the functions listed above. Form a hypothesis and perform further experiments. [Figure legend: cis-regulatory peak; gene transcription start site] 8
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile. ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]. SRF is known as a master regulator of the actin cytoskeleton. In the ChIP-seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation. [Figure legend: gene transcription start site; SRF binding ChIP-seq peak] [1] Valouev A. et al., Nat. Methods, 2008 9
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile. Existing, gene-based method to analyze enrichment: ignore distal binding events, count affected genes, and rank by enrichment hypergeometric p-value. N = 8 genes in genome; K = 3 genes annotated with the term; n = 2 genes selected by proximal peaks; k = 1 selected gene annotated with the term. $P = \Pr(k \ge 1 \mid n = 2, K = 3, N = 8)$ [Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. "actin cytoskeleton")] 10
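The toy numbers on this slide can be checked directly. The following is a minimal sketch of the upper-tail hypergeometric p-value, assuming the usual "draw n genes without replacement" model; the function name `hypergeom_tail` is illustrative and is not taken from any tool mentioned in the talk.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): n genes drawn without replacement from N genes,
    K of which are annotated with the ontology term."""
    upper = min(n, K)
    # Count the ways to pick i annotated and n - i unannotated genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Slide example: N = 8 genes, K = 3 annotated, n = 2 selected, k = 1 annotated hit.
p_value = hypergeom_tail(k=1, n=2, K=3, N=8)
print(round(p_value, 4))  # 18/28, i.e. 0.6429
```

With only two selected genes, a single annotated hit is unremarkable, which is why the toy example yields a large p-value.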
We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for? Pro: a lot of tools out there for the analysis of gene lists. Con: these tools are built for microarray analysis. Does it matter? [Figure: microarray data fed to a microarray tool, versus deep sequencing data fed to a microarray tool] 11
SRF gene-based enrichment results. Original authors can only state: basic cellular processes, particularly those related to gene expression, are enriched [1]. Gene-based results say only that SRF acts on genes, both in the nucleus and the cytoplasm, that are involved in transcription and various types of binding. Where's the signal? The top actin term is ranked #28 in the list. [1] Valouev A. et al., Nat. Methods, 2008 12
Associating only proximal peaks loses a lot of information. Restricting to proximal peaks often leads to complete loss of key enrichments. [Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets: SRF (H: Jurkat), NRSF (H: Jurkat), GABP (H: Jurkat), Stat3 (M: ESC), p300 (M: ESC), p300 (M: limb), p300 (M: forebrain), p300 (M: midbrain). x-axis: distance to nearest transcription start site (kb), binned 0-2, 2-5, 5-50, 50-500, >500; y-axis: fraction of all elements, 0 to 0.7] 13
Bad Solution: Associating distal peaks brings in many false enrichments. Why bad? 14% of human genes are tagged "multicellular organismal development", but 33% of base pairs have such a gene nearest upstream/downstream. The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term                                    Bonferroni corrected p-value
nervous system development              5x10^-9
system development                      8x10^-9
anatomical structure development        7x10^-8
multicellular organismal development    1x10^-7
developmental process                   2x10^-6
Regulatory jungles are often next to key developmental genes. 14
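The bias described on this slide can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the analysis behind the slide's table: it assumes 20,000 genes, 14% of them annotated, and that each random region lands nearest an annotated gene with probability 0.33, with genes chosen uniformly within each class. All names and the random seed are illustrative.

```python
import random
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) for n genes drawn without replacement from N, K annotated."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)

N_GENES = 20_000       # assumed genome-wide gene count
N_ANNOTATED = 2_800    # 14% of genes carry the term (gene ids 0..2799)
BP_FRACTION = 0.33     # fraction of base pairs nearest an annotated gene
N_REGIONS = 2_000      # random regions thrown at the genome

rng = random.Random(0)
selected = set()
for _ in range(N_REGIONS):
    if rng.random() < BP_FRACTION:
        selected.add(rng.randrange(N_ANNOTATED))           # nearest gene is annotated
    else:
        selected.add(rng.randrange(N_ANNOTATED, N_GENES))  # nearest gene is not

n = len(selected)                                # distinct genes in the gene list
k = sum(1 for g in selected if g < N_ANNOTATED)  # annotated genes in the list
p_gene = hypergeom_tail(k, n, N_ANNOTATED, N_GENES)
print(n, k, p_gene)
```

Although the regions are random, roughly a third of the selected genes carry the term while the gene-based null expects 14%, so the hypergeometric test reports a spurious, very small p-value.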
Real Solution: Do not convert to gene list. Analyze the set of genomic regions. GREAT = Genomic Regions Enrichment of Annotations Tool. p = 0.33 of genome annotated with the term; n = 6 genomic regions; k = 5 genomic regions hit the annotation. $P = \Pr_{\mathrm{binom}}(k \ge 5 \mid n = 6, p = 0.33)$ Since 33% of base pairs are near a "multicellular organismal development" gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at the genome, get NO (false) enrichments. [Figure legend: gene transcription start site; ontology term ("actin cytoskeleton"); gene regulatory domain; genomic region (ChIP-seq peak)] 15
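The binomial test on this slide can be worked through numerically. The sketch below assumes each genomic region independently hits the annotated fraction of the genome with probability p; the function name `binom_tail` is illustrative and is not GREAT's own code.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) when each of n regions independently hits the
    annotated fraction of the genome with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Slide example: 5 of 6 regions fall in regulatory domains covering 33% of the genome.
p_value = binom_tail(k=5, n=6, p=0.33)
print(round(p_value, 4))  # 0.017
```

The null probability here is the fraction of base pairs covered by the term's regulatory domains, not the fraction of genes carrying the term, which is what removes the bias toward genes with large regulatory domains.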
How does GREAT know how to assign distal binding peaks to genes? Future: high-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms. Currently: flexible computational definitions allow assignment of peaks to the nearest gene, the nearest two genes, etc. Default: each gene has a basal regulatory domain of 5 kb upstream and 1 kb downstream of its transcription start site, which extends to the basal domains of the nearest genes, within 1 Mb. Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks. 16
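The default rule above can be sketched in a few lines. This is an illustration under assumptions, not GREAT's implementation: it handles one chromosome, measures the 1 Mb extension limit from the transcription start site, and uses a hypothetical `Gene` record and `chrom_length` parameter.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    tss: int      # transcription start site, in bp
    strand: str   # '+' or '-'

def basal_domain(gene, upstream=5_000, downstream=1_000):
    """Basal domain: 5 kb upstream and 1 kb downstream of the TSS, strand-aware."""
    if gene.strand == '+':
        return gene.tss - upstream, gene.tss + downstream
    return gene.tss - downstream, gene.tss + upstream

def regulatory_domains(genes, chrom_length, max_extension=1_000_000):
    """Extend each basal domain toward the neighbouring basal domains,
    but never further than max_extension from the gene's own TSS."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in genes]
    domains = {}
    for i, gene in enumerate(genes):
        lo, hi = basal[i]
        # Closest basal-domain edges on either side (chromosome ends if none).
        left_limit = max((b[1] for b in basal[:i]), default=0)
        right_limit = min((b[0] for b in basal[i + 1:]), default=chrom_length)
        start = min(lo, max(left_limit, gene.tss - max_extension))
        end = max(hi, min(right_limit, gene.tss + max_extension))
        domains[gene.name] = (max(0, start), min(chrom_length, end))
    return domains

genes = [Gene("A", 100_000, '+'), Gene("B", 150_000, '-'), Gene("C", 2_000_000, '+')]
domains = regulatory_domains(genes, chrom_length=3_000_000)
print(domains)
```

In this toy layout the domains of B and C overlap between 1.0 and 1.15 Mb, so a peak in that interval would be associated with both genes.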
Top gene-based enrichments of SRF: top actin-related term is 28th in list. GREAT infers many specific functions of SRF from its binding profile. Top GREAT enrichments of SRF:
Ontology          Term                      # Genes   Binomial P-value   Experimental support *
Gene Ontology     actin cytoskeleton        30        7x10^-9            Miano et al. 2007
Gene Ontology     actin binding             31        5x10^-5            Miano et al. 2007
Pathway Commons   TRAIL signaling           32        5x10^-7            Bertolotto et al. 2000
Pathway Commons   Class I PI3K signaling    26        2x10^-6            Poser et al. 2000
TreeFam           FOS gene family           5         1x10^-8            Chai & Tarnawski 2002
TF Targets        Targets of SRF            84        5x10^-76
TF Targets        Targets of GABP           28        4x10^-9
TF Targets        Targets of YY1            44        1x10^-6
TF Targets        Targets of EGR1           23        2x10^-4
Support listed for the TF Targets rows: positive control; ChIP-Seq support; Natesan & Gilman 1995.
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT. Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010] 17
GREAT data integrated: twenty ontologies spanning broad categories of biology; 44,832 total ontology terms tested in each GREAT run. Terms per ontology, as shown on the slide: 2,800; 5,215; 834; 6,700; 3,079; 911; 150; 1,253; 288; 706; 5,781; 427; 456; 6,857; 8,272; 238; 615; 19; 222; 9. Michael Hiller 18
GREAT implementation: can handle datasets of hundreds of thousands of genomic regions. Testing a single ontology term takes ~1 ms, which enables real-time calculation of enrichment results for all ontologies. Cory McLean 19
GREAT web app: input page http://great.stanford.edu Pick a genome assembly Input BED regions of interest Dave Bristor 20
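For readers preparing input for the page above, here is a small sketch that formats peaks as BED text. It assumes only the standard BED conventions (tab-separated chromosome, 0-based start, exclusive end, optional name); the helper `to_bed` and the example peaks are hypothetical and are not part of GREAT.

```python
def to_bed(peaks):
    """Format (chrom, start, end[, name]) tuples as BED lines.
    BED starts are 0-based and ends are exclusive."""
    lines = []
    for chrom, start, end, *name in peaks:
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
        fields = [chrom, str(start), str(end)] + [str(x) for x in name[:1]]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"

# Made-up peaks, for illustration only.
peaks = [("chr1", 1_000_500, 1_000_900, "peak1"), ("chr2", 250_000, 250_400)]
bed_text = to_bed(peaks)
print(bed_text, end="")
```

The resulting text can be saved to a file or pasted into the input box, after picking the genome assembly that matches the coordinates.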
As of February 11: Added Zebrafish 21
GREAT web app: (Optional): alter association rules http://great.stanford.edu Three association rule choices Lnp Evx2 HoxD cluster Literature-curated domains for a small subset of genes [adapted from Spitz, Gonzalez, & Duboule, Cell, 2003] 22
GREAT web app: output summary Additional ontologies, term statistics, multiple hypothesis corrections, etc. Ontology-specific enrichments 23
GREAT web app: term details page Genes annotated as actin binding with associated genomic regions Genomic regions annotated with actin binding Drill down to explore how a particular peak regulates Plectin and its role in actin binding 24
You can also submit any track straight from the UCSC Table Browser A simple, well documented programmatic interface allows any tool to submit directly to GREAT. (See our Help / Inquiries welcome!) 25
GREAT web app: export data HTML output displays all user selected rows and columns Tab-separated values also available for additional postprocessing 26
GREAT Web Stats: 40 jobs/day x 300 days up 500 entries 27
GREAT can be used with any cis-reg set 28
Top 119 SNPs associated with diabetes from NIH GWAS Catalog
Human specific loss of regulatory seq. Human specific deletions appear significantly often next to: Steroid hormone receptors Neural tumor suppressors http://bejerano.stanford.edu [McLean et al., Nature, 2011] 30
Summary: The human genome is chock-full of regulatory sequences. GREAT accurately assesses functional enrichments of validated or putative cis-regulatory sequences using a novel genomic region-based approach [McLean et al., Nature Biotechnol., 2010]. The online tool at http://great.stanford.edu has been embraced by the biomedical research community. http://bejerano.stanford.edu 31
Bejerano Lab: Developmental Genomics & Evolutionary Developmental Genomics dry wet Funding: NIH / NICHD, NHGRI NSF / STC Packard Foundation HFSP Young Investigator Award Searle Scholar Network, Microsoft Research Mallinckrodt Foundation, A.P. Sloan Foundation Okawa Foundation, Burroughs-Wellcome Bejerano Lab Bruce Schaar, Ph.D. Geetu Tuteja, Ph.D. (Dean's Fellow) Michael Hiller, Ph.D. (HFSP Fellow) Andrew Doxey, Ph.D. (NSERC Fellow) Cory McLean (Bio-X Fellow) Shoa Clarke (HHMI Gilliam Fellow) Aaron Wenger (SiGF Fellow) Harendra Guturu (NSF Fellow) Jim Notwell (NSF Fellow) Tisha Chung Jose Soltren Saatvik Agarwal Jenny Chen Sushant Shankar Stanford David Kingsley & lab 32