Introduction to QTL mapping

Size: px
Start display at page:

Download "Introduction to QTL mapping"

Transcription

1 Introduction to QL mapping in experimental crosses Karl W Broman Department of Biostatistics he Johns Hopkins niversity kbroman Outline Experiments and data Models ANOVA at marker loci Interval mapping LOD thresholds LOD support intervals Power to detect QLs How many markers/mice? election bias Errors in the map Genotyping errors elective genotyping Covariates Non-normal traits

2 Backcross experiment P P F A A B B A B BC BC BC BC Intercross experiment P P F A A B B A B F F F F

3 Phenotype distributions Within each of the parental and F strains individuals are genetically identical. Environmental variation may or may not be constant with genotype. he backcross generation exhibits genetic as well as environmental variation. Parental strains F generation Backcross generation Phenotype Data and Goals Phenotypes: Genotypes: = / if mouse for a backcross) Genetic map: Locations of markers = phenotype for mouse is BB/AB at marker Goals: Identify the or at least one) genomic regions QLs) that contribute to variation in the phenotype. Form confidence intervals for QL locations. Estimate QL effects.

4 X Chromosome Location cm) Genetic map Markers Individuals Missing genotypes X

5 Models: ecombination We assume: Mendel s rules No crossover interference Locations of crossovers are according to a Poisson process. form a Markov chain with transition probabilities: = recombination fraction = is the genetic distance in Morgans. Markov chain: " " " Models: Genotype # $Phenotype Let = phenotype % = whole genome genotype Imagine a small number of QLs with genotypes % " " " % &. & distinct genotypes) E % ) * * * + var % ) * * * + Homoscedasticity constant variance): - Normally distributed residual variation: %. /. Additivity: ) * * * + & % % or ) Epistasis: Any deviations from additivity.

6 he simplest method: ANOVA Also known as marker regression. plit mice into groups according to genotype at a marker. Do a t-test / ANOVA. epeat for each marker. 7 5 BB AB Genotype at DM BB AB Genotype at DM99 Effect at a marker Consider the case of a single QL with effect Consider a marker linked to the QL with Of individuals with marker genotype BB mean phenotype is: recomb. frac.. Difference: Of individuals with marker genotype AB mean phenotype is: Phenotype

7 = < ; : A = A = A? =?@ =? > < ; :? > : = = = < : < ; : < A > = > = = < / ; :? =. < 7? = = A? < ; : A =? ; : =? = < = = = : > > = ; = < : ; :?; > = < ; : 9? ANOVA at marker loci Advantages imple. Easily incorporates covariates. Easily extended to more complex models. Doesn t require a genetic map. Disadvantages Must exclude individuals with missing genotype data. Imperfect information about QL location. uffers in low density scans. Only considers one QL at a time. QL genotype BB AB Let / if the unobserved) QL genotype is BB/AB. Assume 7where. Given genotypes at linked markers.mixture of normal dist ns with mixing proportion marker data : Each position in the genome one at a time is posited as the putative QL. Interval mapping IM) BB BB BB AB AB BB AB AB Assume a single QL model. Lander & Botstein 99)

8 he normal mixtures 9 B 7 cm cm wo markers separated by cm with the QL closer to the left marker. he figure at right show the distributions of the phenotype conditional on the genotypes at the two markers. he dashed curves correspond to the components of the mixtures. Phenotype BB/BB BB/AB AB/BB AB/AB µ µ µ µ µ + µ + µ + µ + Interval mapping continued) Let C marker data. / marker data C D E C D E where D E F G H I J K L Log likelihood: M N O P marker data Maximum likelihood estimates MLEs) of : values for which M is maximized.

9 EM algorithm Dempster et al. 977) E step: Let Q marker data & V W X V Y Z [ \ ] ^ Z _ \ ] ^ Z ` \ ] ^ & V W X V Y Z [ \ ] ^ Z _ \ ] ^ Z ` \ ] ^ & V W X V Y Z [ \ ] ^ Z ` \ ] ^ M step: Let Q Q Q Q a bit complicated he algorithm: tart with Q C ; iterate the E & M steps until convergence. LOD scores he LOD score is a measure of the strength of evidence for the presence of a QL at a particular location. LOD N O P a likelihood ratio comparing the hypothesis of a QL at position versus that of no QL N O P b QL at c c c no QL d c c c are the MLEs assuming a single QL at position. No QL model: he phenotypes are independent and identically distributed iid) /.

10 An example LOD curve Chromosome position cm) Map position cm) lod LOD curves LOD

11 Interval mapping Advantages akes proper account of missing data. Allows examination of positions between markers. Gives improved estimates of QL effects. Provides pretty graphs. Disadvantages Increased computation time. equires specialized software. Difficult to generalize. Only considers one QL at a time. Multiple QL methods Why consider multiple QLs at once? educe residual variation. eparate linked QLs. Investigate interations between QLs epistasis).

12 LOD thresholds Large LOD scores indicate evidence for the presence of a QL. Q: How large is large? $We consider the distribution of the LOD score under the null hypothesis of no QL. Key point: We must make some adjustment for our examination of multiple putative QL locations. $We seek the distribution of the maximum LOD score genomewide. he 95th %ile of this distribution serves as a genome-wide LOD threshold. Estimating the threshold: simulations analytical calculations permutation randomization) tests. Null distribution of the LOD score Null distribution derived by computer simulation of backcross with genome of typical size. olid curve: distribution of LOD score at any one point. Dashed curve: distribution of maximum LOD score genome-wide. LOD score

13 ? f : g j i h e? f :? f : l e j i h g l G u t m < l s? f : l s? s p l : o n Permutation tests markers phenotypes s-lod support interval Chromosomal region for which the LOD score is within of its maximum. Generally I prefer Plot of LOD or ;.5. LOD depicts evidence for QL location mice genotype data 5 Chromosome position cm) LOD maxlod) Value: conditions on observed phenotypes marker density and pattern of missing data; doesn t rely on normality assumptions or asymptotics. We can t look at all possible permutations but a random set of is feasible and provides reasonable estimates of P-values and thresholds. klod l. LOD support intervals is a genome-wide LOD threshold. to the distribution of Permute/shuffle the phenotypes; keep the genotype data intact. a set of curves) LOD is a genome-wide P-value. klod rq he 95th %ile of We wish to compare the observed Calculate LOD

14 a a a a v a a v v v Power to detect QLs he power to detect a QL is the chance that its LOD score exceeds the genome-wide threshold. Power depends on ize of the QL effect. Number of progeny. ype of cross. Density of markers. tringency of the LOD threshold. 9 LOD score QL effect = 5 Power = % QL effect = Power = % At right: Dashed curve: dist n of max LOD under null hypothesis. olid curve: dist n of LOD score at QL with a. Dotted curve: dist n of LOD score at QL with a. 9 LOD score 9 LOD score QL effect = Power = 97% How many markers/mice? At right: mice op: Bottom: olid: cm spacing Dashed: cm spacing More mice:. More recombination breakpoints educed sampling variation. Chromosome position cm) More markers: mice More detailed genotype information.. Not necessarily increased precision depends on number of mice size of QL effect and luck). Note: he figures should be taken with a grain of salt Chromosome position cm) LOD maxlod) LOD maxlod)

15 election bias he estimated effect of a QL will vary somewhat from its true effect. Only when the estimated effect is large will the QL be detected. Among those experiments in which the QL is detected the estimated QL effect will be on average larger than its true effect. QL effect = 5 Bias = 79% 5 5 QL effect = Bias = % Estimated QL effect his is selection bias. election bias is largest in QLs with small or moderate effects. he true effects of QLs that we identify are likely smaller than was observed. 5 5 QL effect = Bias = % Estimated QL effect 5 5 Estimated QL effect he genetic map: effects of errors Marker order Causes wiggly LOD curves. houldn t completely eliminate a signal. Map distances Doesn t seem to make much difference. Map position cm) Makes a big difference in perceived length of LOD support intervals. Greater effects of errors with appreciable missing data lod Error in marker order 5 5 Map position cm) lod Error in marker spacing

16 he genetic map: finding problems Consider: Estimated genetic map Pairwise recombination fractions Misplaced markers cause: Chromosome Big gaps in the map Large recombination fractions Location cm) Comparison of genetic maps 5 Pairwise recombination fractions and LOD scores Markers Genotyping errors: effects With genotyping errors individuals are placed in the wrong genotype group. With widely spaced markers there is little effect. With dense markers errors make the LOD curve have more dips. lod Genotyping errors; cm spacing Map position cm) lod Map position cm) Genotyping errors; cm spacing Markers

17 ˆ w ~ } { z y Markers Individuals ƒƒ ~ Identifying genotyping errors x LOD = Look for tight double crossovers. marker data marker data Crossover interference is often strong.) Error LOD scores Chromosome 5 Lincoln & Lander 99) Individual Model for genotyping errors. Genotyping error LOD scores Assumed error rate. 5 Assumption of no interference. in error correct in mouse At marker elective genotyping ave effort by only typing the most informative individuals say top & bottom %). seful in context of a single inexpensive trait. ricky to estimate the effects of QLs: use IM with all phenotypes. 7 5 Can t get at interactions. Likely better to also genotype some random portion of the rest of the individuals. BB AB All genotypes BB AB op and botton % Phenotype Position cm) 5 7

18 Covariates Examples: treatment sex litter lab age. Control residual variation. Avoid confounding. Look for QL interactions environ t Adjust before interval mapping IM) versus adjust within IM. lod Intercross split by sex 7 Map position cm) lod Map position cm) everse intercross 7 Non-normal traits tandard interval mapping assumes normally distributed residual variation. hus the phenotype distribution is a mixture of normals.) In reality: we see dichotomous traits counts skewed distributions outliers and all sorts of odd things. Interval mapping with LOD thresholds derived from permutation tests generally performs just fine anyway. Alternatives to consider: Nonparametric approaches Kruglyak & Lander 995) ransformations e.g. log square root) pecially-tailored models e.g. a generalized linear model the Cox proportional hazard model and the model in Broman et al. )

19 ummary I ANOVA at marker loci aka marker regression) is simple and easily extended to include covariates or accommodate complex models. Interval mapping improves on ANOVA by allowing inference of QLs to positions between markers and taking proper account of missing genotype data. ANOVA and IM consider only single-ql models. Multiple QL methods allow the better separation of linked QLs and are necessary for the investigation of epistasis. tatistical significance of LOD peaks requires consideration of the maximum LOD score genome-wide under the null hypothesis of no QLs. Permutation tests are extremely useful for this..5-lod support intervals indicate the plausible location of a QL. A plot of the LOD curve re-centered so that its maximum is at is a valuable tool for depicting evidence for QL location. Once you ve achieved a cm marker spacing more mice will probably be more important than more markers. But this depends on the number of mice the size of the QL effect and luck. ummary II Estimates of QL effects are subject to selection bias. uch estimated effects are often too large. tudy your data. Look for errors in the genetic map genotyping errors and phenotype outliers. But don t worry about them too much. elective genotyping can save you time and money but proceed with caution. tudy your data. he consideration of covariates may reveal extremely interesting phenomena. Interval mapping works reasonably well even with non-normal traits. But consider transformations or specially-tailored models. If interval mapping software is not available for your preferred model start with some version of ANOVA.