BTRY 7210: Topics in Quantitative Genomics and Genetics

Jason Mezey
Biological Statistics and Computational Biology (BSCB)
Department of Genetic Medicine
jgm45@cornell.edu
January 29, 2015

Why you're here (= eQTL): Spring 2015 Course Announcement
BTRY 7210: Topics in Quantitative Genomics and Genetics
Professor: Jason Mezey, Biological Statistics and Computational Biology
Time: Thurs. 12:20-1:10 PM
Room: 224 Weill Hall
COURSE DESCRIPTION: We will consider the problem of identifying and leveraging expression Quantitative Trait Loci (eQTL) when analyzing genome-wide data. The class will include a TBD ratio of lectures by the instructor to reading / discussion of papers. Students taking the class for a grade will be required to produce a single mini-critique report of current papers touching on topics covered in the class. General topic areas that will be considered include: probability and statistics necessary for understanding eQTL analysis; the basics of eQTL analysis; quality control and model checking in eQTL analysis; advanced eQTL analysis techniques, including hidden factor analysis; extending analyses to xQTL; biological value and interpretation of eQTL; combining eQTL and other bioinformatics data for biological discovery; and the structure and application of probabilistic graphical models that make use of eQTL for network analysis and discovery.
GRADING: S/U or Audit.
CREDITS: 1
SUGGESTED PREREQUISITES: Quantitative Genomics and Genetics (BTRY 6830 / 4830) and/or background in statistics and/or background in eQTL, GWAS, or related genetic mapping analyses.

Today
Logistics reminders
Introduction to eQTL, part 2

Logistics
Format: a combination of lectures and possibly discussion of papers (that I will select) focused on specific subjects.
Updated info is on the class website (bottom of the classes page): http://mezeylab.cb.bscb.cornell.edu/
Make sure you are on the listserv (!!!) (email me to join or be removed): mezey-groupm-l@cornell.edu
If you can register for the class, please do (= Audit). You may also register S/U (either leading a discussion or a minor ~10-20 hour final report will be required).

The eQTL concept II
expression Quantitative Trait Locus (eQTL): a polymorphic locus where an experimental exchange of one allele for another produces a change in expression on average under specified conditions: $A_1 \rightarrow A_2 \Rightarrow \Delta Y$
The allelic states defined by the original mutation event define the causal polymorphism of the eQTL.
Intuitive example: if rs27290 were a causal allele, changing A -> G would change the measured expression of ERAP2.
[Figure: ERAP2 expression (y-axis, 3.5 to 6.0) by rs27290 genotype (A/A, A/G, G/G), labeled "eQTL"]

Detecting eQTL from the analysis of genome-wide data I
Since an eQTL reflects a case where different allelic combinations (genotypes) lead to different levels of gene expression, we could in theory discover an eQTL by testing for an association between measured genotypes and gene expression levels.
Most eQTL are discovered using this type of approach.
A typical (human) eQTL experiment includes m (= ~10-30K) expression variables and N (= ~0.1-10 million) genotypes measured in n individuals sampled from a population.
A typical analysis (most analyses!) of such data proceeds by performing independent statistical tests of (a subset of) genotype-expression pairs, where tests that are significant after a multiple-test correction (e.g. Bonferroni) are assumed to indicate an eQTL.
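The Bonferroni step mentioned above can be sketched as follows. This is a minimal illustration, not the course's analysis code: the function names, the gene label GENE_B, and all p-values are hypothetical, and alpha = 0.05 is an assumed family-wise error rate.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that controls the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_pairs(pvalues: dict, alpha: float = 0.05) -> list:
    """Return the (gene, snp) keys whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(pvalues))
    return sorted(pair for pair, p in pvalues.items() if p <= threshold)


# Hypothetical p-values for four genotype-expression pairs (illustrative only).
example_pvalues = {
    ("ERAP2", "rs27290"): 1e-31,
    ("ERAP2", "rs1908530"): 0.40,
    ("GENE_B", "rs27290"): 0.03,
    ("GENE_B", "rs1908530"): 0.70,
}

# With four tests the per-test threshold is 0.05 / 4 = 0.0125,
# so the pair with p = 0.03 is not declared significant.
hits = significant_pairs(example_pvalues)
```

At genome-wide scale the threshold becomes very small: testing every pair with, say, m = 20,000 expression variables and N = 1,000,000 genotypes gives 2 x 10^10 tests, so a family-wise alpha of 0.05 corresponds to a per-test threshold of 2.5 x 10^-12.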

Detecting eQTL from the analysis of genome-wide data II
We (almost) always detect eQTL by testing (non-causal) markers in LD with the causal polymorphism.
[Figure: two panels of ERAP2 expression (y-axis, 3.5 to 6.0). First panel: expression by rs27290 genotype (A/A, A/G, G/G), labeled "eQTL". Second panel: x-axis labels T/T, T/G, G/G (T45G) paired with A/A, A/G, G/G (rs27290). Copyright: Journal of Diabetes and its Complications; Science Direct; Vendramini et al]

Genome-wide scan for eQTL: typical outcome
[Figure: ERAP2 expression (y-axis, 3.5 to 6.0) by rs27290 genotype (A/A, A/G, G/G): eQTL ($p < 10^{-30}$)]
[Figure: ERAP2 expression (y-axis, 3.5 to 6.0) by rs1908530 genotype (T/T, T/C, C/C): no eQTL (n.s.)]

Statistical foundation I (see BTRY 6830 lectures on class site)
We need to begin by defining our sample space for an eQTL experiment: $\Omega = \{\text{possible individuals}\}$
For each individual in our sample space, we are interested in pairs of sample outcomes (a single pair at a time!): $\Omega = \{\Omega_g \times \Omega_P\}$
where $\Omega_g$ is the set of possible genotype outcomes for an individual at a locus and $\Omega_P$ is the set of values of the expression variable for an individual.
Note that for a diploid with two alleles (typical for humans!): $\Omega_g = \{A_1A_1, A_1A_2, A_2A_2\}$

Statistical foundation II
Next, we need to define the probability model: $\Pr(\mathcal{F}_{\Omega}) = \Pr(\mathcal{F}_{g,p})$
We will define two (types) of random variables (* = state does not matter):
$Y : (*, \Omega_P) \rightarrow \mathbb{R}$, where $Y$ = measurable expression value
$X : (\Omega_g, *) \rightarrow \mathbb{R}$, with $X(A_1A_1) = -1$, $X(A_1A_2) = 0$, $X(A_2A_2) = 1$
Note that the probability model induces a (joint) probability distribution on these random variables: $\Pr(Y, X)$
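As a small illustration of the random variable X, the additive genotype coding can be written as a lookup table. The names ADDITIVE_CODING and encode_genotypes are hypothetical, and treating "A2A1" as the same unordered genotype as "A1A2" is an assumption of this sketch.

```python
# Additive coding of the genotype random variable:
# X(A1A1) = -1, X(A1A2) = 0, X(A2A2) = 1.
ADDITIVE_CODING = {"A1A1": -1, "A1A2": 0, "A2A2": 1}


def encode_genotypes(genotypes):
    """Map genotype labels to the values taken by X."""
    encoded = []
    for g in genotypes:
        # Assumption: genotypes are unordered, so A2A1 is the heterozygote A1A2.
        key = "A1A2" if g == "A2A1" else g
        if key not in ADDITIVE_CODING:
            raise ValueError(f"unknown genotype: {g!r}")
        encoded.append(ADDITIVE_CODING[key])
    return encoded


x = encode_genotypes(["A1A1", "A1A2", "A2A1", "A2A2"])
```

With this coding, X increases by one for each copy of the A2 allele, so a nonzero covariance between Y and X reflects a trend in expression with A2 allele count.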

Statistical foundation III
To assess whether the marker genotype indicates an eQTL, we need to assess the following hypotheses:
$H_0 : \mathrm{Cov}(Y, X) = 0$
$H_A : \mathrm{Cov}(Y, X) \neq 0$
To do this, we will collect a sample of size n of expression and genotype pairs (y, x) and define a statistic T(y, x), for which we know the distribution under the null hypothesis, such that we can calculate a p-value: $pval = \Pr(T \geq t \mid H_0 \text{ true})$
p-value: the probability of obtaining the observed value of a statistic, or a more extreme one, conditional on $H_0$ being true.
To analyze the data from a genome-wide eQTL experiment, we calculate a p-value for each of (a subset of) the total set of expression-genotype pairs, and for cases where we reject the null (at an appropriate multiple-test-corrected type I error), we assume that this indicates an eQTL.
Note that we usually consider a run of contiguous genotypes for which we reject the null for the same expression variable to indicate the position of a single causal eQTL polymorphism.
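One way to make the hypothesis test concrete is a permutation test on the sample covariance. This is a swapped-in technique chosen because it needs only the standard library; it is not necessarily the statistic used in the course, where T has a known null distribution. The function names and the expression values below are hypothetical.

```python
import random


def sample_cov(y, x):
    """Unbiased sample covariance of paired observations."""
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n
    return sum((yi - mean_y) * (xi - mean_x) for yi, xi in zip(y, x)) / (n - 1)


def permutation_pvalue(y, x, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: Cov(Y, X) = 0.

    Shuffling y breaks any genotype-expression association, so the shuffled
    covariances approximate the null distribution of the statistic.
    """
    rng = random.Random(seed)
    observed = abs(sample_cov(y, x))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(sample_cov(shuffled, x)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical data: genotypes coded -1 / 0 / 1, four individuals per genotype.
x = [-1] * 4 + [0] * 4 + [1] * 4
y_eqtl = [3.6, 3.8, 3.7, 3.9, 4.6, 4.8, 4.7, 4.9, 5.6, 5.8, 5.7, 5.9]
y_null = [4, 5, 6, 7] * 3  # same expression values in every genotype class

p_eqtl = permutation_pvalue(y_eqtl, x)
p_null = permutation_pvalue(y_null, x)
```

Note that the smallest p-value this sketch can report is 1 / (n_perm + 1), so reaching genome-wide thresholds by permutation alone would require far more permutations than shown here.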

Typical outcome: zooming in and cis- vs. trans-
This is a cis-eQTL because the significant genotypes are in the same location as the expressed gene (otherwise, it would be a trans-eQTL).
Most eQTL are cis-, which makes biological sense.
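The cis- / trans- distinction can be sketched as a distance rule. The slide only says "same location", so the 1 Mb window below is an assumed convention rather than something stated in the lecture, and the names Locus and classify_eqtl as well as all coordinates are hypothetical.

```python
from dataclasses import dataclass

CIS_WINDOW_BP = 1_000_000  # assumed window size; not specified in the lecture


@dataclass(frozen=True)
class Locus:
    chrom: str
    position: int  # base-pair coordinate


def classify_eqtl(snp: Locus, gene: Locus, window_bp: int = CIS_WINDOW_BP) -> str:
    """Label a significant SNP-gene pair as 'cis' or 'trans'."""
    same_chrom = snp.chrom == gene.chrom
    if same_chrom and abs(snp.position - gene.position) <= window_bp:
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only.
gene = Locus("chr5", 96_000_000)
nearby_snp = Locus("chr5", 96_200_000)
distant_snp = Locus("chr5", 150_000_000)
other_chrom_snp = Locus("chr12", 96_000_000)

labels = [classify_eqtl(s, gene) for s in (nearby_snp, distant_snp, other_chrom_snp)]
```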

What we will discuss I: genome-wide identification of eQTL
one gene, one SNP; one gene, multiple SNPs; one gene, all SNPs; all genes, all SNPs

What we will discuss II: reducing eQTL false positives
Population structure and hidden factors can cause false positive associations (= correlations that don't represent true genetic effects).
These effects are visible on the p-value heatmap.
[Figure: p-value heatmap with regions labeled "population structure" and "hidden factor"]
We can usually remove these artifacts by including appropriate covariates in our analysis, and sometimes by using a mixed model (we will also consider binary phenotypes!) or a hidden factor analysis.
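A toy example of how a covariate can remove a false positive: two subpopulations differ in both allele frequency and mean expression, with no genotype effect inside either one. The sketch regresses a single known covariate out of both variables before computing the covariance, which is one simple way to account for a covariate; it does not show a mixed model or a hidden factor analysis. The function names and all data are hypothetical.

```python
def sample_cov(a, b):
    """Unbiased sample covariance of paired observations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def residualize(values, covariate):
    """Residuals from a least-squares regression of values on one covariate (with intercept)."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    slope = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values)) / sxx
    return [(v - mean_v) - slope * (c - mean_c) for v, c in zip(values, covariate)]


# Hypothetical data: population label (0 or 1), coded genotype, expression.
population = [0, 0, 0, 0, 1, 1, 1, 1]
x = [-1, -1, 0, 0, 0, 0, 1, 1]                  # allele frequency differs by population
y = [4.0, 4.2, 4.2, 4.0, 5.0, 5.2, 5.2, 5.0]    # mean expression differs by population

naive_cov = sample_cov(y, x)  # nonzero, driven entirely by population structure
adjusted_cov = sample_cov(residualize(y, population), residualize(x, population))
```

Here the naive covariance is clearly positive while the covariate-adjusted covariance is zero up to floating-point error, because within each subpopulation expression does not vary with genotype.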

What we will discuss III: can we identify causal genotypes (alleles)?

What we will discuss IV: leveraging eQTL for annotation?
eQTL SNPs tend to be in linkage disequilibrium (LD) blocks enriched for ENCODE Transcription Factor motifs, suggesting a functional mechanism.

What we will discuss V: leveraging eQTL for GWAS
eQTL co-localize with disease loci identified in GWAS, indicating a common genetic basis and a method for identifying candidate causal polymorphisms for disease risk.

What we will discuss VI: leveraging eQTL for network inference
eQTL are used within Probabilistic Graphical Modeling (PGM) frameworks to discover new network / pathway / regulatory relationships.

That's it for today
Reminder: please email me to join the listserv (!!)