Statistical Tools for Predicting Ancestry from Genetic Data

Size: px
Start display at page:

Download "Statistical Tools for Predicting Ancestry from Genetic Data"

Transcription

1 Statistical Tools for Predicting Ancestry from Genetic Data Timothy Thornton Department of Biostatistics University of Washington March 1, / 33

2 Basic Genetic Terminology A gene is the most fundamental unit of heredity that controls the transmission and expression of one or more traits. Genes are sequences of DNA (deoxyribonucleic acid) and can be found on chromosomes. The four bases of DNA are adenine, guanine, cytosine, and thymine (A, G, C, and T). We can think of DNA as a sequence of these four letters....tatacgtgcacctggattacagattacagattacagattacattgcatcgatcgaa... 2 / 33

3 Basic Genetic Terminology Chromosomes are long strands of DNA. In humans, each cell typically contains 23 pairs of chromosomes. For any chromosome pair, one chromosome is inherited from the father of the individual and the other chromosome is inherited from the mother of the individual. 3 / 33

4 Genetic Mutations DNA strands are formed by sequences of base-pairs 99.9% of the 3 billion base pairs in the human genome are identical for all humans 0.1% differs because of mutations/deletions/insertions 4 / 33

5 Genetic Mutations DNA strands are formed by sequences of base-pairs 99.9% of the 3 billion base pairs in the human genome are identical for all humans 0.1% differs because of mutations/deletions/insertions Most common mutation: a single base pair 5 / 33

6 Single Base Pair Mutations Single base pair mutations create Single Nucleotide Polymorphisms (SNPs) The two variations of a SNP are called the alleles of the SNP Original base pair - Major allele (d) Mutated base pair - Minor allele (D) For a chromosome pair, the two alleles at a SNP are called an individual s genotype, where one allele is passed down from the mother and the other allele from the father. The are generally three possible genotypes for a SNP in a population: DD, Dd, or dd 6 / 33

7 Single Nucleotide Polymorphisms (SNPS) SNPs occur randomly throughout the DNA More than 20 Million SNPs in the human genome have been mapped Most SNPs have no effect (Non-functional SNPs) A little less than 1% are believed to damage the function of a protein or regulatory element (Functional SNPs) 7 / 33

8 Introduction Genetic association studies are widely used for the identification of genes that influence human traits. To date, hundreds of thousands of individuals have been included in genome-wide association studies (GWAS) for the mapping of complex traits. Large-scale genomic studies often have high-dimensional data consisting of Tens of thousands of individuals Genotypes data on a million (or more!) SNPs for all individuals in the study Phenotype or Trait values of interest such as Height, BMI, HDL cholesterol, blood pressure, diabetes, etc. The vast majority of these studies, however, have been conducted in populations of European ancestry Non-European populations have largely been underrepresented in genetic studies, despite often bearing a disproportionately high burden for some common diseases. 8 / 33

9 Introduction Several recent and ongoing genetic studies have focused on recently admixed populations: populations who have experienced admixing among continentally divided ancestral populations within the past 200 to 500 hundred years. Admixed populations have largely arisen as a consequence of historical events such as the transatlantic slave trade, the colonization of the Americas and other long-distance migrations. Examples of admixed populations include African Americans and Hispanics in the U.S Latinos from throughout Latin America Uyghur population of Central Asia Cape Verdeans South African Coloured population 9 / 33

10 Ancestry Admixture Ancestral Pop. A Ancestral Pop. B Today The chromosomes of an admixed individual represent a mosaic of chromosomal blocks from the ancestral populations. Can be substantial genetic heterogeneity in admixed populations 10 / 33

11 Association Mapping in Admixed Populations Genetic association studies in recently admixed populations offer exciting opportunities for the identification of variants underlying phenotypic diversity in populations. Two ongoing genetic studies focusing on recently admixed populations: Women s Health Initiative: Minority cohort of 12,008 African Americans and Hispanics The Hispanic Community Heath Study / Study of Latinos (HCHS/SoL) is, to date, the largest genetic study of Hispanics/Latinos with more than 16,000 participants 11 / 33

12 Opportunities and Challenges Genetic association studies in recently admixed populations offer exciting opportunities for the identification of novel variants underlying phenotypic diversity in populations. There are several challenges with genetic association mapping in samples from admixed populations Heterogeneous genomes of individuals from admixed populations can be a confounder leading to spurious association Big Data! More than 1 million SNPs directly assayed (and with 20 million additional SNPs imputed to 1000 Genomes) 12 / 33

13 Confounding due to Ancestry Ethnicgroups (and subgroups) often share distinct dietary habits and other lifestyle characteristics that leads to many traits of interest being correlated with ancestry and/or ethnicity. 13 / 33

14 Case-Control Association Testing Below is a simple example to illustrate association testing at a genetic marker with two allelic types labeled A and a Statistics for identifying an association could compare allele frequencies (A or a) between the cases (affected individuals) and the controls (unaffected individuals). Genotype frequencies (AA, Aa, or aa) could also be compared between the two groups. AA AA Aa Aa Aa aa AA Aa AA Aa aa Aa Cases Controls 14 / 33

15 Spurious Association Case/Control association test Comparison of allele frequency between cases and controls. Consider a sample from 2 populations: Red population overrepresented among cases in the sample. Genetic markers that are not influencing the disease but with significant differences in allele frequencies between the populations = spurious association between disease and genetic marker 15 / 33

16 Spurious Association Quantitative trait association test Test for association between genotype and trait value Consider sampling from 2 populations: Population 1 Population 2 Histogram of Trait Values Blue population has higher trait values. Different allele frequency in each population = spurious association between trait and genetic marker for samples containing individuals from both populations 16 / 33

17 Identifying Genetic Ancestry Differences: Unsupervised Learning Suppose a genetic association study consists of a sample of N individuals Assume that genotype data is available at S SNPs in a genome-screen, where S can be very large (e.g. hundreds of thousands). For SNP s define G s = (G1 s,...gs n) T is n 1 vector of the genotypes, where Gi s = 0, 1 2, or 1, according to whether individual i has, respectively, 0, 1 or 2 copies of the reference allele at SNP s. We define Z to be an N S standardized matrix with (i,s)-th entry Z is = G s i ˆp s ˆps (1 ˆp s ) and ˆp s will typically be an allele frequency estimate for SNP s calculated using all individuals in the sample. 17 / 33

18 Identifying Genetic Ancestry Differences: Unsupervised Learning Can obtain a genetic similarity matrix (GSM) ˆΨ where ˆΨ = 1 S ZZT The (i,j)-th entry of ˆΨ is a measure of the average genetic similarity for individuals i and j in the sample. ˆΨ ij = 1 S S (Gi s ˆp s )(Gj s ˆp s ) s=1 ˆp s (1 ˆp s ) Principal Components Analysis (PCA) is a dimension reduction that can be applied to GSM to identify ancestry differences among sample individuals PCA is performed by obtaining the eigendecomposition of the GSM ˆΨ. 18 / 33

19 Identifying Genetic Ancestry Differences: Unsupervised Learning Orthogonal axes of variation, i.e. linear combinations of SNPs, that best explain the genotypic variability amongst the n sample individuals are identified. For the eigendecomposition we have ˆΨ = VDV T, where V = [V 1,V 2,...,V n ] is an n n matrix with orthogonal column vectors, and D corresponding to a diagonal matrix of the length n eigenvalue vector Λ = (λ 1,λ 2,...,λ n ) T The eigenvalues are in decreasing order, λ 1 > λ 2 >... > λ n. The d th principal component (eigenvector) corresponds to eigenvalue λ d, where λ d is proportional to the percentage of variability in the genome-screen data that is explained by V d. 19 / 33

20 PCA of Europeans The top principal components are viewed as continuous axes of variation that reflect genetic variation due to ancestry in the sample. Individuals with similar values for a particular top principal component will have similar ancestry for that axes. A application of principal components to genetic data from European samples (Novembre et al., Nature 2008) showed that among Europeans for whom all four grandparents originated in the same country, the first two principal components computed using 200,000 SNPs could map their country of origin quite accurately in the plane 20 / 33

21 PCA of Europeans 21 / 33

22 Ancestry Inference with PCA for HCHS/SoL European Central/South American Central/South American Caribbean East Asian African 1 22 / 33

23 0.01 PC PC Ancestry Inference with PCA for HCHS/SoL PC8 Dominican CentralAm Cuban Mexican PuertoRican SouthAm Multi/Other Unknown PC PC PC PC PC7 23 / 33

24 Supervised Learning for Ancestry Admixture The chromosomes of an admixed individual represent a mosaic of chromosomal blocks from the ancestral populations. 24 / 33

25 Supervised Learning for Ancestry Admixture Methods have recently been developed for supervised learning of ancestry proportions for an admixed individuals using high-density SNP data. Most use either a hidden Markov model (HMM) or an Expectation-Maximization (EM) algorithm to infer ancestry Example: Suppose we are interested in identifying the ancestry proportions for an admixed individual Observed sequence on a chromosome for an admixed individual:...tatacgtgcacctggattacagattacagattacagattacattgcatcgatcgaa... Observed sequence on a chromosome for samples selected from a homogenous reference population:...tgatcctgaacctagattacagattacagattacagattacaatgcttcgatggac......agatcctgaacctagattacagattacagattacagataccaatgcttcgatggac......cgatcctgaacctagattacagattacagatttgcgtatacaatgcttcgatggac / 33

26 Women s Health Initiative The Womens Health Initiative (WHI) is a national health study focusing on strategies for preventing chronic diseases in postmenopausal women. A total of 161,808 women aged yrs. old were recruited from 40 clinical centers in the US between 1993 and The WHI SNP Health Association Resource (SHARe) minority cohort includes 8421 self-identified African American women from and 3587 self-identified Hispanic women Genotype data for 1 million SNPs from across the genome 26 / 33

27 Supervised Ancestry Estimation: WHI-SHARe Genome-screen data was used to estimate the admixture proportions Estimated genome-wide ancestry proportions of every individual using FRAPPE (Tang et al., 2005) software Used genotype data from available reference panels for three ancestral populations HapMap YRI for West African ancestry HapMap CEU samples for northern and western European ancestry HGDP Native American samples for Native American ancestry. 27 / 33

28 Supervised Ancestry Inference Ancestry of WHI African Americans Ancestry European African Native American Individual 28 / 33

29 Table: Average Estimated Ancestry Proportions for Self-reported African Americans Estimated Ancestry Proportions (SD) Population European African Native American African Americans.209 (.153).767 (.155).024 (.023) 29 / 33

30 WHI Hispanic Subpopulations Of the total cohort of WHI Hispanics, 3,036 (or 84.6%) reported specific nationality of origin in a follow-up questionnaire (Form 41). 395 labeled themselves as Puerto Rican 1,548 as Mexican or Chicano 258 as Cuban The remaining 835 as Latino from a category other than those three. 30 / 33

31 WHI SHARE Ancestry Admixture: Puerto Ricans Cumulative Ancestry Puerto Rican Individuals WHI SHARE Ancestry Admixture: Mexicans Cumulative Ancestry Mexican Individuals WHI SHARE Ancestry Admixture: Cubans Cumulative Ancestry Cuban Individuals 31 / 33

32 Table: Average Estimated Ancestry Proportions within Self-reported Ethnic Groups Estimated Ancestry Proportions (SD) Population European African Native American Puerto Ricans.648 (.216).215 (.208).136 (.078) Mexicans.547 (.160).030 (.058).423 (.158) Cubans.840 (.197).098 (.179).062 (.099) 32 / 33

33 Acknowledgement of Research Funding Research was supported in part by the National Institutes of Health grants: NCI K01CA NIGMS P01GM NHLBI HHSN / 33

34 Contact Information Please feel free to contact me: Timothy Thornton, PhD phone: / 33

35 Using R for Ancestry Inference We now will use R to perform PCA on real genetic data for ancestry inference Access the short course website for the data files, exercises, and R script: 35 / 33