BTRY 7210: Topics in Quantitative Genomics and Genetics


1 BTRY 7210: Topics in Quantitative Genomics and Genetics. Jason Mezey, Biological Statistics and Computational Biology (BSCB), Department of Genetic Medicine. Spring 2015, Thurs., 12:20-1:10

2 Why you're here (= eQTL): Spring 2015 Course Announcement. BTRY 7210 Topics in Quantitative Genomics and Genetics. Professor: Jason Mezey, Biological Statistics and Computational Biology. Time: Thurs. 12:20-1:10 PM. Room: 224 Weill Hall.
COURSE DESCRIPTION: We will consider the problem of identifying and leveraging expression Quantitative Trait Loci (eQTL) when analyzing genome-wide data. The class will include a TBD ratio of lectures by the instructor to reading / discussion of papers. Students taking the class for a grade will be required to produce a single mini-critique report of current papers touching on topics covered in the class. General topic areas that will be considered include: probability and statistics necessary for understanding eQTL analysis; the basics of eQTL analysis; quality control and model checking in eQTL analysis; advanced eQTL analysis techniques including hidden factor analysis; extending analyses to xQTL; biological value and interpretation of eQTL; combining eQTL and other bioinformatics data for biological discovery; structure and application of probabilistic graphical models that make use of eQTL for network analysis and discovery.
GRADING: S/U or Audit.
CREDITS: 1
SUGGESTED PREREQUISITES: Quantitative Genomics and Genetics (BTRY 6830 / 4830) and/or background in statistics and/or background in eQTL, GWAS, or related genetic mapping analyses.

3 Today: Logistics (time/locations, listserv, website, format, requirements, registering, deciding on topics to cover). Intuitive introduction to expression Quantitative Trait Loci (eQTL) and why you should care. Basic concepts in biology, statistics, and eQTL analysis.

4 Logistics I. This class will take place in 224 Weill Hall, from 12:20-1:10 PM every Thurs., unless I announce otherwise (!!) Each week, I will be on campus or will join by video-conference. Format: a combination of lectures and possibly discussion of papers (that I will select) focused on specific subjects. Updated info will be on the class website (bottom of classes page). Make sure you are on the listserv (!!!) (email me to join / remove): mezey-groupm-l@cornell.edu

5 Logistics II. Who you are: anyone interested in eQTL + (minimally) a working understanding of genetics and statistics - ideally, you took my class last semester... May I sit in? Yes! Come to as many or as few classes as you wish. Taking the class for a grade (S/U or Audit): please register officially. If you Audit the class, there are no specific requirements. If you take the class for S/U, you must attend and produce a mini-report (requiring ~10-20 hours of time) by the end of the semester.

6 Deciding on topics to cover. Format: a combination of lectures and possibly discussion of papers (that I will select) focused on specific subjects. There are many topics that I think would be relevant and interesting, so I am going to let you all vote on the possible subjects, choosing from the following: statistical and analysis methodology for identifying eQTL; bioinformatic analyses using eQTL (= integrating with different data types and using them to make inferences about complex phenotypes); Probabilistic Graphical Models (PGMs) and how these can make use of eQTL for network discovery; another topic of interest to you... When you email me for the listserv (mezey-groupm-l@cornell.edu), also send me your top preference of a topic to focus on (relating to eQTL!!)

7 Questions about logistics?

8 What is an eQTL (= expression Quantitative Trait Locus)? (intuition) [Figure: an eQTL; ERAP2 expression plotted against rs27290 genotype (A/A, A/G, G/G)]

9 Why should I care about eQTL? (Part I) eQTL describe a fundamental aspect of biology: inherited allelic variants that impact gene expression. eQTL can be discovered from statistical analysis of the most basic types of genome-wide data: genotype and gene expression. eQTL are used to characterize gene expression regulatory elements, e.g. Brown et al. (ENCODE). eQTL are used to interpret GWAS hits, e.g. to narrow candidates. eQTL represent a natural perturbation and can be used to infer novel regulatory (network) relationships.

10 Why should I care about eQTL? (Part II) These represent sites of the genome that probably contribute to many phenotypes, i.e. they mark active sites of the genotype-phenotype map. Statistical (computational) approaches for genome-wide eQTL identification - when applied correctly - really do identify eQTL, i.e. this is not just a model-fitting exercise = we are inferring real biology. eQTL (and more broadly xQTL) will be the fundamental analysis and starting point when applying genome-wide data to understand biology.

11 Zero to eQTL in ~30 min. The molecular biology behind eQTL. Rigorous definition of an eQTL. Detecting eQTL from the analysis of genome-wide data. The importance of linkage disequilibrium (i.e. what we are really discovering in eQTL analyses). The statistical foundation. Typical outcomes and interpretation.

12 Central Dogma of Molecular Biology [Figure; credit: Wikipedia]

13 The eQTL concept I. If we could quantify the amount of RNA in a particular tissue or cell type, under a specific set of conditions, this might be informative (i.e. a proxy for gene expression). A case where different allelic states at a specific site (locus) in the genome alter a measured expression variable in a tissue / cell population under a given set of conditions is an eQTL. Note that an eQTL therefore describes a variable pair (genotype-expression association).

14 The eQTL concept II. expression - a quantifiable and (theoretically) repeatable measurement of the number of RNA molecules, deriving from a position in the genome, under specified conditions (we will use $Y$ to represent such a measurement). polymorphism - the presence of at least two allelic states ($A_1$ and $A_2$) in a population at a specific locus in the genome, where the existence of one of the alleles can be traced to a mutation event. expression Quantitative Trait Locus (eQTL) - a polymorphic locus where an experimental exchange of one allele for another produces a change in expression on average: $A_1 \rightarrow A_2 \Rightarrow \Delta Y$. Note that this definition includes "on average" and "under specified conditions", so the specific allele exchange need not cause a change in expression under every manipulation. The allelic states defined by the original mutation event define the causal polymorphism of the eQTL.

15 Detecting eQTL from the analysis of genome-wide data I. Since eQTL reflect a case where different allelic combinations (genotypes) lead to different levels of gene expression, we could in theory discover an eQTL by testing for an association between measured genotypes and gene expression levels. Most eQTL are discovered using this type of approach. A typical (human) eQTL experiment includes m (= ~10-30K) expression variables and N (= ~0.1-10 million) genotypes measured in n individuals sampled from a population. A typical analysis (most of them!) of such data proceeds by performing independent statistical tests of (a subset of) genotype-expression pairs, where tests that are significant after a multiple-test correction (e.g. Bonferroni) are assumed to indicate an eQTL.
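
To make the scale of the multiple-testing problem concrete, here is a minimal sketch of the Bonferroni arithmetic. The values of m and N are hypothetical picks from the ranges quoted above, and the study-wide error rate of 0.05 is an assumption, not a value given in the lecture.

```python
# Hypothetical experiment sizes taken from the ranges quoted on the slide.
m_expression = 20_000       # number of expression variables (m)
n_genotypes = 1_000_000     # number of measured genotypes (N)
alpha = 0.05                # assumed study-wide type I error


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff under a Bonferroni correction."""
    return alpha / n_tests


def is_significant(pval: float, threshold: float) -> bool:
    """A test is called significant only if it beats the corrected cutoff."""
    return pval < threshold


# Testing every genotype-expression pair gives m * N tests.
all_pairs = m_expression * n_genotypes
threshold_all = bonferroni_threshold(alpha, all_pairs)

print(f"tests: {all_pairs:,}  per-test cutoff: {threshold_all:.2e}")
```

With these assumed sizes there are 20 billion pairs, which is one reason analyses often test only a subset of pairs.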

16 Detecting eQTL from the analysis of genome-wide data II. This seems straightforward but there is a wrinkle: we often have not measured the causal polymorphism (genotypes) (!!) However, a rule of genetics is that genotypes that are physically close to each other in the genome tend to have high correlations (= linkage disequilibrium or LD), and the further away genotypes are from one another, the lower their correlations (in general). [Figure: markers A, B, C on Chr. 1 and marker D on Chr. 2, with relationships labeled "LD", "equilibrium, linkage", and "equilibrium, no linkage"] We take advantage of this and assume that significant genotypes are in LD (correlated) with causal polymorphisms and therefore indicate their genomic position (!!) This is why we consider measured genotypes to be markers or tags.
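
As a rough illustration of LD as correlation between genotypes, the sketch below computes the Pearson correlation between a causal genotype and two markers. All genotype vectors are made-up toy data, and the -1/0/1 numeric coding is an assumption made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy genotypes for 8 individuals, coded -1 / 0 / 1 (hypothetical data).
causal = [-1, -1, 0, 0, 0, 1, 1, 1]
nearby_marker = [-1, -1, 0, 0, 1, 1, 1, 1]    # differs in one individual
distant_marker = [0, 1, -1, 1, 0, -1, 0, 1]   # essentially unrelated

r_near = pearson_r(causal, nearby_marker)
r_far = pearson_r(causal, distant_marker)

print(f"nearby marker r^2 = {r_near ** 2:.2f}")
print(f"distant marker r^2 = {r_far ** 2:.2f}")
```

In this toy example the nearby marker would serve as a good tag for the unmeasured causal genotype, while the distant marker would not.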

17 Detecting eQTL from the analysis of genome-wide data III. That is, if we test a (non-causal) marker genotype that is correlated with the causal genotype AND if the only correlated genotypes are in the same position in the genome THEN we can identify the genomic position of the causal genotype (!!) For almost all human eQTL, we know the genomic position but not the identities of the causal genotypes responsible for the eQTL. [Figure copyright: Journal of Diabetes and its Complications; ScienceDirect; Vendramini et al.]

18 Statistical foundation I. We need to begin by defining our sample space for an eQTL experiment: $\Omega = \{\text{possible individuals}\}$. For each individual in our sample space, we are interested in pairs of sample outcomes (a single pair at a time!): $\Omega = \Omega_g \times \Omega_P$, where $\Omega_g$ is the set of possible genotype outcomes for an individual at a locus and $\Omega_P$ is the set of values of the expression variable for an individual. Note that for a diploid with two alleles (typical for humans!): $\Omega_g = \{A_1A_1, A_1A_2, A_2A_2\}$.

19 Statistical foundation II. Next, we need to define the probability model: $\Pr(\mathcal{F}_\Omega) = \Pr(\mathcal{F}_{g,P})$. We will define two (types of) random variables (* = state does not matter): $Y: (*, \Omega_P) \rightarrow \mathbb{R}$, where $Y$ = measurable expression value, and $X: (\Omega_g, *) \rightarrow \mathbb{R}$, with $X(A_1A_1) = -1$, $X(A_1A_2) = 0$, $X(A_2A_2) = 1$. Note that the probability model induces a (joint) probability distribution on these random variables: $\Pr(Y, X)$.
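
The genotype random variable can be sketched as a simple lookup, followed by a sample covariance between coded genotype and expression. This assumes the additive coding X(A1A1) = -1, X(A1A2) = 0, X(A2A2) = 1; the six individuals and their expression values are invented for illustration, and the function names are hypothetical.

```python
# Assumed additive coding of the three diploid genotypes.
GENOTYPE_CODE = {"A1A1": -1, "A1A2": 0, "A2A2": 1}


def code_genotype(genotype: str) -> int:
    """Map a genotype label to its numeric value X(genotype)."""
    return GENOTYPE_CODE[genotype]


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Toy sample of (genotype, expression) pairs (hypothetical data).
genotypes = ["A1A1", "A1A1", "A1A2", "A1A2", "A2A2", "A2A2"]
expression = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]

x = [code_genotype(g) for g in genotypes]
cov_yx = sample_covariance(x, expression)

print("coded genotypes:", x)
print("sample Cov(Y, X):", cov_yx)
```

A sample covariance away from zero, as in this toy data, is the kind of signal the hypothesis test is designed to assess.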

20 Statistical foundation III. To assess whether the marker genotype indicates an eQTL, we need to assess the following hypotheses: $H_0: \mathrm{Cov}(Y, X) = 0$ versus $H_A: \mathrm{Cov}(Y, X) \neq 0$. To do this, we will collect a sample of size n of expression and genotype pairs $(\mathbf{y}, \mathbf{x})$ and define a statistic $T(\mathbf{y}, \mathbf{x})$, for which we know the distribution under the null hypothesis, such that we can calculate a p-value: $\mathrm{pval} = \Pr(T \geq t \mid H_0 : \text{true})$. p-value - the probability of obtaining the observed value of a statistic, or one more extreme, conditional on $H_0$ being true. To analyze the data from a genome-wide eQTL experiment, we calculate a p-value for each of (a subset of) the total set of expression-genotype pairs, and for cases where we reject the null (at an appropriate multiple-test-corrected type I error), we assume that this indicates an eQTL. Note that we usually consider a run of contiguous genotypes for which we reject the null for the same expression variable to indicate the position of a single causal eQTL polymorphism.
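
One way to get a feel for the p-value definition is a permutation test, sketched below. This is a swapped-in technique, not necessarily the test used in the course: instead of a statistic with a known null distribution, it approximates the null distribution of |Cov(Y, X)| by shuffling expression values across individuals. The data, the number of permutations, and the seed are all illustrative assumptions.

```python
import random


def covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def permutation_pvalue(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: Cov(Y, X) = 0."""
    rng = random.Random(seed)
    observed = abs(covariance(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-expression link
        # Small tolerance guards against floating-point ties.
        if abs(covariance(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: 12 individuals, genotypes coded -1 / 0 / 1 (hypothetical).
x = [-1] * 4 + [0] * 4 + [1] * 4
y_assoc = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2, 3.0, 3.1, 2.9, 3.2]
y_null = [2.0, 1.0, 3.0, 2.0, 2.0, 3.0, 1.0, 2.0, 1.0, 3.0, 2.0, 2.0]

p_assoc = permutation_pvalue(x, y_assoc)
p_null = permutation_pvalue(x, y_null)

print("p-value, expression tracks genotype:", p_assoc)
print("p-value, no relationship:", p_null)
```

In a genome-wide setting each such p-value would then be compared against a multiple-test-corrected cutoff, which in practice requires far more permutations or a parametric test.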

21 Typical outcome I. [Figure, two panels: "eQTL (p < )", ERAP2 expression plotted against rs27290 genotype (A/A, A/G, G/G); "no eQTL (n.s.)", ERAP2 expression plotted against rs genotype (T/T, T/C, C/C)]

22 Typical outcome II. This is a cis-eQTL because the significant genotypes are in the same location as the expressed gene (otherwise, it would be a trans-eQTL). Most eQTL are cis-, which makes biological sense.

23 Typical outcome III. A typical genome-wide eQTL analysis with a relatively small sample size finds many eQTL. We have strong reason to believe that many of the eQTL reported are false positives (= future lecture, stay tuned!) This is a remarkably simple approach to finding eQTL; are there better analysis approaches? (= stay tuned!) The landscape of available eQTL data is changing with innovations in next-generation sequencing technologies that are providing many more genotypes and a variety of measurements of gene expression and other types of variables (= we will discuss!) How do we validate eQTL? How do we leverage eQTL to learn more biology? What is in the immediate future for eQTL? (= you get the picture...)

24 Leveraging eQTL I. The most significant SNPs in a linkage disequilibrium (LD) block tend to be enriched for ENCODE transcription factor motifs, suggesting a functional mechanism.

25 Leveraging eQTL II. eQTL co-localize with disease loci identified in GWAS, indicating a common genetic basis and a method for identifying candidate causal polymorphisms for disease risk.

26 Leveraging eQTL III. eQTL are used within Probabilistic Graphical Modeling (PGM) frameworks to discover new network / pathway / regulatory relationships.

27 That's it for today. Reminder (!!): if you are taking the class for a grade (S/U), please register and please email me to join the listserv AND let me know what topic you would be most interested in covering (!!)