Course Announcements

Size: px
Start display at page:

Download "Course Announcements"

Transcription

1 Statistical Methods for Quantitative Trait Loci (QTL) Mapping II Lectures 5 Oct 2, 2 SE 527 omputational Biology, Fall 2 Instructor Su-In Lee T hristopher Miles Monday & Wednesday 2-2 Johnson Hall (JHN) 22 ourse nnouncements HW # is out Project proposal Due next Wed paragraph describing what you d like to work on for the class project. 2

2 Why are we so different? Human genetic diversity TTTTTTTTT TTTTTTTTTT ny observable characteristic or trait Different phenotype ppearance Disease susceptibility Drug responses Different genotype Individual-specific DN 3 billion-long string TTTTTTTT TTTTTTTTT TTTTTTT TTTTTTTTTTTT TTTTTTTTTTTTT TTTTTTTTTTTT TTTTTTTTTTTTTTT TTTTTTTTTTTT TTTTTTTTTTTT TTTTTTTTTTTTT 3 TTT Motivation Which sequence variation affects a trait? Better understanding disease mechanisms Personalized medicine ppearance, Personality, Disease susceptibility, Drug responses, Sequence variations Different Instruction instruction T TTTTT XX XXX DN 3 billion long! different person person cell Obese? 5% Bold? 3% Diabetes? 6.2% Parkinson s disease?.3% Heart disease? 2.% olon cancer? 6.5% 4 2

3 QTL mapping Data Phenotypes y i = trait value for mouse i enotypes x ik = / (i.e. B/) of mouse i at marker k enetic map Locations of genetic markers oals Identify the genomic regions (QTLs) contributing to variation in the phenotype , enotype data 3 markers Phenotype data mouse individuals 5 Outline Statistical methods for mapping QTL What is QTL? Experimental animals nalysis of variance (marker regression) Interval mapping (EM) QTL? , mouse individuals 6 3

4 Interval mapping [Lander and Botstein, 989] onsider any one position in the genome as the location for a putative QTL. For a particular mouse, let z = / if (unobserved) genotype at QTL is B/. alculate P(z = marker data). Need only consider nearby genotyped markers. May allow for the presence of genotypic errors. 7 Interval mapping [Lander and Botstein, 989] onsider any one position in the genome as the location for a putative QTL. For a particular mouse, let z = / if (unobserved) genotype at QTL is B/. alculate P(z = marker data). Need only consider nearby genotyped markers. May allow for the presence of genotypic errors. iven genotype at the QTL, phenotype is distributed as N(µ+ z, σ 2 ). iven marker data, phenotype follows a mixture of normal distributions. 8 4

5 IM the mixture model Nearest flanking markers M /M 2 99% B M QTL M Let s say that the mice with QTL genotype have average phenotype µ while the mice with QTL genotype B have average phenotype µ B. The QTL has effect = µ B -µ. What are unknowns? µ and µ B enotype of QTL 65% B 35% 35% B 65% 99% 9 IM estimation and LOD scores Use a version of the EM algorithm to obtain estimates of µ, µ B, σ and expectation on z (an iterative algorithm). alculate the LOD score Repeat for all other genomic positions (in practice, at.5 cm steps along genome). 5

6 simulated example LOD score curves enetic markers Interval mapping dvantages Make proper account of missing data an allow for the presence of genotypic errors Pretty pictures High power in low-density scans Improved estimate of QTL location Disadvantages reater computational effort (doing EM for each position) Requires specialized software More difficult to include covariates Only considers one QTL at a time 2 6

7 Statistical significance P( D QTL at the position) log Large LOD score evidence for QTL P( D no QTL) Question How large is large? nswer onsider distribution of LOD score if there were no QTL. nswer 2 onsider distribution of maximum LOD score. Null hypothesis assuming that there are no QTLs segregating in the population. Null distribution of the LOD scores at a particular genomic position (solid curve) and of the maximum LOD score from a genome scan (dashed curve). Only ~3% of chance that the genomic position gets LOD score. 3 LOD thresholds To account for the genome-wide search, compare the observed LOD scores to the null distribution of the maximum LOD score, genome-wide, that would be obtained if there were no QTL anywhere. LOD threshold = 95 th percentile of the distribution of genome-wide max LOD, when there are no QTL anywhere. Methods for obtaining thresholds nalytical calculations (assuming dense map of markers) (Lander & Botstein, 989) omputer simulations Permutation/ randomized test (hurchill & Doerge, 994) 4 7

8 More on LOD thresholds ppropriate threshold depends on Size of genome Number of typed markers Pattern of missing data Stringency of significance threshold Type of cross (e.g. F2 intercross vs backcross) Etc 5 n example Permutation distribution for a trait 6 8

9 Modeling multiple QTLs dvantages Reduce the residual variation and obtain greater power to detect additional QTLs. Identification of (epistatic) interactions between QTLs requires the joint modeling of multiple QTLs. Interactions between two loci Trait variation that is not explained by a detected putative QTL. The effect of QTL is the same, irrespective of the genotype of QTL 2, and vice versa The effect of QTL depends on the genotype of QTL 2, and vice versa 7 Multiple marker model Let y = phenotype, x = genotype data. Imagine a small number of QTL with genotypes x,,x p 2 p or 3 p distinct genotypes for backcross and intercross, respectively We assume that E(y x) = µ(x,,x p ), var(y x) = σ 2 (x,,x p ) 8 9

10 Multiple marker model onstant variance σ 2 (x,,x p )=σ 2 ssuming normality y x ~ N(µ g, σ 2 ) dditivity µ(x,,x p ) = µ + j j x j Epistasis µ(x,,x p ) = µ + j j x j + j,k w j,k x j x k 9 omputational problem N backcross individuals, M markers in all with at most a handful expected to be near QTL x ij = genotype (/) of mouse i at marker j y i = phenotype (trait value) of mouse i ssuming addivitity, y i =µ+ j j x ij + e which j? Variable selection in linear regression models 2

11 Mapping QTL as model selection Select the class of models dditive models dditive with pairwise interactions Regression trees x x 2 x N w w 2 w N Phenotype (y) y = w x + +w N x N +ε minimize w (w x + w N x N -y) 2? 2 Linear Regression minimize w (w x + w N x N -y) 2 +model complexity x x 2 x N w w 2 w N parameters Phenotype (y) Y = w x + +w N x N +ε Search model space Forward selection (FS) Backward deletion (BE) FS followed by BE 22

12 Lasso* (L ) Regression L term minimize w (w x + w N x N -y) 2 + w i x x 2 x N L 2 L w w 2 w N parameters Phenotype (y) Induces sparsity in the solution w (many w i s set to zero) Provably selects right features when many features are irrelevant onvex optimization problem No combinatorial search Unique global optimum Efficient optimization 23 * Tibshirani, 996 Model selection ompare models Likelihood function + model complexity (eg # QTLs) ross validation test Sequential permutation tests ssess performance Maximize the number of QTL found ontrol the false positive rate 24 2

13 Outline Basic concepts Haplotype, haplotype frequency Recombination rate Linkage disequilibrium Haplotype reconstruction Parsimony-based approach EM-based approach 25 Review genetic variation Single nucleotide polymorphism (SNP) Each variant is called an allele; each allele has a frequency Hardy Weinberg equilibrium (HWE) Relationship between allele and genotype frequencies How about the relationship between alleles of neighboring SNPs? We need to know about linkage (dis)equilibrium 26 3

14 Let s consider the history of two neighboring alleles 27 History of two neighboring alleles lleles that exist today arose through ancient mutation events Before mutation fter mutation Mutation 28 4

15 History of two neighboring alleles One allele arose first, and then the other Before mutation fter mutation Mutation Haplotype combination of alleles present in a chromosome 29 Recombination can create more haplotypes No recombination (or 2n recombination events) Recombination 3 5

16 Without recombination With recombination Recombinant haplotype 3 Haplotype combination of alleles present in a chromosome Each haplotype has a frequency, which is the proportion of chromosomes of that type in the population onsider N binary SNPs in a genomic region There are 2 N possible haplotypes But in fact, far fewer are seen in human population 32 6