Genome-wide analyses in admixed populations: Challenges and opportunities

1 Genome-wide analyses in admixed populations: Challenges and opportunities Esteban J. Parra, Ph.D.

2 Admixed populations: an invaluable resource to study the genetics of complex diseases Populations resulting from recent admixture events between continental populations are a unique resource for genetic studies. Genome-wide studies in admixed populations have unique challenges and opportunities, as a result of the history of admixture. In this presentation, I will describe some of those challenges and opportunities, based on my research experience.

3 Genome-wide studies in admixed populations: the challenges One of the main challenges for the application of GWA studies in admixed populations is the presence of population structure (e.g. variation of ancestry proportions among individuals), which can dramatically increase the rate of false positives. I provide an example of the extent and consequences of population stratification in a sample from Mexico City, which was analyzed with the Affymetrix 5.0 microarray. The following slides show MDS representations of the Mexico City sample, the HapMap Mexican American LA sample, and seven reference parental samples (HapMap Yoruba, HapMap European, Spanish, Nahua, Maya, Aymara and Quechua).

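4 [Figure: MDS plot of the samples. Labels: AYM (Aymara), QUE (Quechua), MAY (Maya), NAH (Nahua), WAF (West African, HapMap Yoruba), EUR (HapMap European), SP (Spanish), DF (Mexico City), MAM-LA (HapMap Mexican American, LA)]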

5 Exploring the factors responsible for population structure Genetic structure can be created and maintained by factors such as continuous gene flow or assortative mating. Continuous gene flow is probably a factor because admixture has not been instantaneous. Rather, it has been a continuous process which is still ongoing. Exploring the relationship between ancestry and education in a previous sample from Mexico City, we found strong evidence of socioeconomic stratification.

6 More about socioeconomic stratification [Figure: % European ancestry (average, with upper and lower bounds) by education level: Primary or Secondary vs. Preparatory or University] Our data indicate that there is an important social issue in Mexico: not everyone has the same access to education. Using a logistic regression model with education as the outcome, people with 100% European ancestry are 2.4 times more likely to have preparatory/university education than people with 0% European ancestry. Mating is most probably not random with respect to socioeconomic status in Mexico, and socioeconomic status shows a strong association with ancestry. This is likely another major factor explaining the presence of genetic structure in this population.
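To make the logistic regression result concrete, here is a minimal sketch of how a logistic coefficient relates to the reported figure. It assumes that the 2.4 is an odds ratio and that ancestry enters the model as a proportion between 0 and 1; neither detail is stated on the slide, and the function name is illustrative.

```python
import math

def odds_ratio(beta: float, delta: float = 1.0) -> float:
    """Odds ratio implied by a logistic coefficient for a change `delta` in the predictor."""
    return math.exp(beta * delta)

# Assumption: 2.4 is the odds ratio for 100% vs. 0% European ancestry,
# with ancestry coded as a proportion (0.0 to 1.0).
beta_ancestry = math.log(2.4)

or_full = odds_ratio(beta_ancestry)               # 0% -> 100% European ancestry
or_ten_points = odds_ratio(beta_ancestry, 0.10)   # a 10-percentage-point increase

print(round(or_full, 2), round(or_ten_points, 3))
```

Under these assumptions, the odds ratio for any smaller ancestry difference follows directly from the same coefficient.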

7 Consequences of stratification Stratification will increase false positives in association studies. Therefore, it is critical to use strategies to control for the effects of stratification. Below is an example from our type 2 diabetes study in Mexico. The figures show QQ plots corresponding to the logistic regression analysis conditioning on (A) sex and (B) sex and European ancestry. [Figure: QQ plots, panels A and B]
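A QQ plot compares observed p-values with those expected under the null, and the genomic inflation factor summarizes the same comparison in one number. The sketch below shows one common way to compute both from a list of p-values, using only the Python standard library. The function names and the simulated p-values are illustrative and are not taken from the study.

```python
import math
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.

def chi2_from_p(p: float) -> float:
    """1-d.f. chi-square statistic corresponding to a two-sided p-value."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower tail, so very small p stays in range
    return z * z

def genomic_inflation(p_values) -> float:
    """Median observed chi-square divided by the null median (lambda)."""
    return median([chi2_from_p(p) for p in p_values]) / CHI2_1DF_MEDIAN

def qq_points(p_values):
    """(expected, observed) pairs of -log10(p), sorted in ascending order."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))

# Illustrative data: evenly spaced p-values mimic a well-calibrated test;
# squaring them shifts every p-value downward, mimicking inflation.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
inflated_p = [p * p for p in null_p]

lambda_null = genomic_inflation(null_p)
lambda_inflated = genomic_inflation(inflated_p)
points = qq_points(inflated_p)
```

A value of lambda close to 1 corresponds to points lying on the diagonal of the QQ plot; values well above 1 suggest uncorrected stratification.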

8 Consequences of stratification Socioeconomic stratification can be a source of confounding when exploring the effect of ancestry on disease risk. A good example of this is a logistic regression analysis by Florez et al. (2009) for type 2 diabetes in samples from Mexico and Colombia. [Table, modified from Florez et al Diabetologia 52: rows are Mexico and Colombia; columns are Beta - Eur and p value, Beta - SES and p value, Beta - Eur (SES as covariate) and p value, Beta - SES (Eur as covariate) and p value. The numerical values are not recoverable here.] Whenever available, it is advisable to include socioeconomic data in this type of study.

9 Genome-wide studies in admixed populations: the opportunities It is possible to exploit admixture to map disease genes using the admixture mapping (AM) method. The principle of AM: Admixture generates chromosomes made up of segments that have ancestry from different populations. Ancestry can be estimated using a genome-wide panel of Ancestry Informative Markers (AIMs). The application: Linkage of a trait to a chromosomal region can be detected by testing the association of the trait with ancestry, defined as the number of gene copies inherited from each parental population. [Figure: Location of AIMs. From Patterson et al., 2004]
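The idea of testing a trait against locus ancestry can be illustrated with a deliberately simplified affected-only statistic. For each case, it compares ancestry at one locus (gene copies from one parental population, divided by two) with that person's genome-wide average ancestry. This is a sketch under strong assumptions: locus ancestry is treated as known without error, the data are invented, and the statistic is a plain paired z-score rather than the test used by any particular AM program.

```python
import math
from statistics import mean, stdev

def case_only_z(locus_copies, genomewide_ancestry) -> float:
    """Z-score for excess ancestry at one locus among cases.

    locus_copies: gene copies (0, 1 or 2) from one parental population at the locus.
    genomewide_ancestry: each case's genome-wide proportion from that population.
    """
    diffs = [c / 2.0 - g for c, g in zip(locus_copies, genomewide_ancestry)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error

# Hypothetical cases: ancestry at the locus tends to exceed the genome-wide average.
copies = [2, 2, 1, 2, 1, 2, 2, 1]
genomewide = [0.6, 0.7, 0.5, 0.6, 0.4, 0.7, 0.6, 0.5]
z_locus = case_only_z(copies, genomewide)

# Hypothetical null locus: the deviations cancel out.
z_null = case_only_z([1, 1], [0.4, 0.6])
```

A large positive z-score at a locus would indicate that cases carry more ancestry from that population there than expected from their overall ancestry.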

10 Admixture mapping: Examples of success AM has been successfully used to identify genes involved in a growing number of traits, including:
Trait                                        Population          Authors
Hypertension                                 African-Americans   Zhu et al., 2005 & 2007
Multiple sclerosis                           African-Americans   Reich et al., 2005
Prostate cancer                              African-Americans   Freedman et al., 2006
Inflammatory markers                         African-Americans   Reich et al., 2007
Coronary artery calcification                African-Americans   Zhang et al., 2008
White cell count                             African-Americans   Nalls et al., 2008
Focal segmental glomerulosclerosis (FSGS)    African-Americans   Kopp et al., 2008
Asthma                                       Puerto-Ricans       Choudhry et al., 2008

11 Admixture mapping: Building a genome-wide panel of AIMs for Hispanic populations Most of the AM studies have been carried out in African American samples. The primary reason is that no panel of AIMs was available for populations resulting from the admixture process between European/Native American groups (e.g. Hispanic/Latino populations). In 2007 we developed a genome-wide panel of AIMs for Hispanic populations using data collected in four Native American samples using the Affy 500K chip (Mao et al., 2007). Independently, two other research groups also published genome-wide AIM panels for Hispanics (Tian et al., 2007; Price et al., 2007). These new panels will make it possible to apply AM to many other population groups in the Americas.

12 An example of our T2D admixture mapping results [Figure: admixture mapping results, shown for Cases and Controls]

13 Genome-wide analysis in admixed populations: AM vs. GWA AM is a useful method to identify genes involved in complex traits, but it is important to consider some advantages and disadvantages of this approach in relation to GWA studies. Advantages: (1) reduced genotyping effort and lower cost; (2) can be implemented in affected-only studies. Disadvantages: (1) phenotypes and risk alleles must be distributed differentially between populations; (2) low resolution, so fine-mapping is required. However, it is important to note that the costs of the high-density genotyping platforms (Affymetrix, Illumina), and therefore the differences in cost between AM and GWA approaches, are decreasing all the time.

14 Getting the best of both worlds: Applying AM and GWA strategies in admixed populations The reduction in cost of the high-density genotyping platforms and the availability of frequency data for the relevant parental populations have made it possible to implement both AM and GWA when using case-control approaches in admixed populations. For high-density data, ancestry can be modeled with a small subset of AIMs. AM can then be applied to test for association of the trait with ancestry. A high resolution analysis using GWA can also be carried out, ensuring that methods are implemented to control for population stratification.

15 Conditioning on locus ancestry to control for the effect of stratification in admixed populations Association conditioning on individual ancestry can be applied to control for stratification in GWA studies in admixed populations. However, the information provided by the panel of AIMs can be used to implement another strategy: association conditioning on locus ancestry. In this strategy, a panel of AIMs is used to infer locus ancestry at every position of the genome, and then association tests are carried out conditioning on locus ancestry. Benefits of this strategy are: (1) long-range signals of allelic association generated by admixture are eliminated; (2) it allows detection of population-specific associations (it is possible to test for association separately in gametes that have European and Native American ancestry, and also to obtain a summary test of association over the two populations). Our collaborator Paul McKeigue (University of Edinburgh) is currently implementing this method in his program ADMIXMAP.
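One simple way to picture a test that conditions on locus ancestry is a stratified allelic test: gametes are grouped by their ancestry at the locus, a 2x2 table (case/control by allele) is built within each group, and a Cochran-Mantel-Haenszel statistic summarizes the evidence over the groups. The sketch below swaps in this classical test for illustration only; it is not the method implemented in ADMIXMAP. It also assumes that gamete ancestry is known without error and treats gametes as independent. All counts are invented.

```python
import math

def table_terms(a, b, c, d):
    """Observed, expected and variance of cell `a` in the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    variance = (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return a, expected, variance

def chi2_1df_p(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def cmh(tables):
    """Cochran-Mantel-Haenszel chi-square (no continuity correction) and p-value."""
    terms = [table_terms(*t) for t in tables]
    numerator = sum(obs - exp for obs, exp, _ in terms) ** 2
    denominator = sum(var for _, _, var in terms)
    chi2 = numerator / denominator
    return chi2, chi2_1df_p(chi2)

# Hypothetical gamete counts per stratum:
# (case test allele, case other allele, control test allele, control other allele)
eur_gametes = (60, 140, 40, 160)
nam_gametes = (90, 110, 70, 130)

chi2_eur, p_eur = cmh([eur_gametes])                   # European-ancestry gametes only
chi2_nam, p_nam = cmh([nam_gametes])                   # Native American-ancestry gametes only
chi2_all, p_all = cmh([eur_gametes, nam_gametes])      # summary over both strata
```

Because every comparison is made within a stratum of identical locus ancestry, allele frequency differences between the parental populations cannot by themselves produce a signal.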

16 Other methods to infer locus-specific ancestry Price et al. (2009) just published a new method to infer locus ancestry based on fine-scale variation data. This method requires reference haplotype data from the parental populations and has been implemented in the software package HAPMIX. Methods of association conditioning on locus ancestry are still under development but there have been important advances. Ideally, such tests should incorporate uncertainty in the estimation of locus ancestry.

17 Imputation of untyped markers in admixed populations Typically, the microarrays used in GWA studies include 500,000 to 1M markers. However, it is possible to impute untyped markers based on the patterns of Linkage Disequilibrium observed in the HapMap samples (up to 4 million SNPs). This strategy is widely used in GWA studies. The problem: No Native American samples have been characterized in the HapMap project. For Mexican American GWA studies, a good reference sample is the HapMap Mexican American sample from LA, but this sample has been characterized for only 1.4M markers. The question: Are the Phase II HapMap samples (East Asian, European, West African) good reference samples for imputation in admixed populations throughout the Americas?

18 Applying IMPUTE v2 to the Mexico City sample The program IMPUTE v2 offers flexibility for imputation. It is possible to use a single haploid reference sample, or a combination of haploid and diploid reference samples. We used IMPUTE v2 to compare the results of imputation using the Phase III Mexican American LA sample as a reference sample with imputations based on two alternative strategies: (1) the HapMap Phase II combined sample (EAS, EUR and WAF) as haploid reference sample; (2) the HapMap Phase II combined sample as haploid reference sample and the Mexican American Phase III HapMap sample as diploid reference sample. The comparison is based on 10 megabases on chromosome 22. The concordance rate is very high: MAM Phase III vs. HapMap Phase II: 99.76%; MAM Phase III vs. HapMap Phase II + MAM Phase III: 99.90%.

19 Are the imputed genotypes accurate? In order to test the accuracy of the imputations, we excluded 100 random markers located on chromosomes 4, 9, 16 and 20 from the inference files, and compared the imputed genotypes with the original genotypes obtained with the Affymetrix 5.0 microarray. The concordance rate was 99.0%. Our results are in broad agreement with a recent analysis by Huang et al. (2009). Therefore, using the combined HapMap Phase II sample as a reference, in combination with the Mexican American HapMap Phase III sample, seems to be a reasonable strategy to impute untyped common marker genotypes in admixed populations in the Americas.
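The masking experiment described above comes down to a concordance calculation between typed and imputed genotypes. The following sketch shows one way to compute it, assuming genotypes are coded as allele counts (0, 1, 2) and that imputed probabilities have already been converted to best-guess calls. The data layout, marker names and sample names are hypothetical.

```python
def concordance_rate(typed, imputed) -> float:
    """Fraction of genotype calls that agree; calls missing on either side are skipped.

    typed, imputed: dicts mapping marker -> {sample -> genotype (0, 1, 2 or None)}.
    """
    compared = 0
    matches = 0
    for marker, calls in typed.items():
        for sample, genotype in calls.items():
            other = imputed.get(marker, {}).get(sample)
            if genotype is None or other is None:
                continue
            compared += 1
            if genotype == other:
                matches += 1
    if compared == 0:
        raise ValueError("no genotype pairs to compare")
    return matches / compared

# Hypothetical masked markers: genotypes from the array vs. best-guess imputed calls.
typed = {
    "marker_1": {"s1": 0, "s2": 1, "s3": 2},
    "marker_2": {"s1": 1, "s2": 1, "s3": None},   # s3 failed genotyping
}
imputed = {
    "marker_1": {"s1": 0, "s2": 1, "s3": 1},
    "marker_2": {"s1": 1, "s2": 1, "s3": 0},
}

rate = concordance_rate(typed, imputed)
```

The same function can be used to compare two imputation runs against each other, as in the comparison of reference panels on chromosome 22.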

20 Final thoughts Genome-wide association studies in admixed populations will greatly expand our understanding of the genetics of complex diseases. The availability of population data and new statistical methods have overcome many of the challenges, but further effort will be necessary to better characterize admixed populations throughout the Americas and their parental populations. Regarding implementation of GWA studies, the most efficient strategy is the creation of consortia to facilitate data sharing and meta-analysis. Ideally, consortia should be created before the initiation of the studies in order to coordinate issues regarding sampling, genotyping and statistical analysis. The impressive advances in next generation sequencing technologies will make it possible to explore the role of rare variants in complex diseases.

21 Major Collaborators University of Edinburgh: Paul M. McKeigue; Centro Medico Siglo XXI: Miguel Cruz; Penn State University: Mark D. Shriver. Funding Agencies Canada: Banting and Best Diabetes Centre, CIHR; Mexico: Fundación IMSS.

22 Samples and criteria for selection of AIMs Samples: Native Americans from Mesoamerica (Maya, Mexico; Nahua, Mexico) and South America (Aymara, Bolivia; Quechua, Peru); Europeans (HapMap). Selection criteria: Stage 1, from all autosomal SNPs to a preliminary list of candidate AIMs: check for 1/ genotyping quality, 2/ HW, 3/ f-values (f>0.3 Eur/NAm, f<0.1 MAm/SAm). Stage 2, from the preliminary list to a genome-wide list of AIMs: select markers based on 1/ no LD, 2/ intermarker distance >300Kb, 3/ the combination of markers maximizing ancestry information. Stage 3, from the genome-wide list to the final genome-wide AM map: fill the gaps using 1/ f>0.2 Eur/NAm, 2/ selection based on the criteria described in stage 2.
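The first two selection stages can be sketched as a filter followed by a spacing rule. The code below is a minimal illustration, assuming the f-values have already been computed for each SNP. It uses a simple greedy pass (highest Eur/NAm f-value first) as a stand-in for choosing the combination of markers that maximizes ancestry information, and it omits the LD, genotyping quality and HW checks. The SNP names, positions and values are invented.

```python
from typing import NamedTuple

class Snp(NamedTuple):
    name: str
    chrom: int
    pos: int            # base-pair position
    f_eur_nam: float    # f-value, European vs. Native American
    f_mam_sam: float    # f-value, Mesoamerican vs. South American

def select_aims(snps, min_f=0.3, max_f_within=0.1, min_gap=300_000):
    """Keep informative SNPs, then enforce a minimum spacing on each chromosome."""
    candidates = [s for s in snps
                  if s.f_eur_nam > min_f and s.f_mam_sam < max_f_within]
    chosen = []
    # Greedy pass: most informative markers claim their positions first.
    for snp in sorted(candidates, key=lambda s: s.f_eur_nam, reverse=True):
        if all(snp.chrom != c.chrom or abs(snp.pos - c.pos) > min_gap
               for c in chosen):
            chosen.append(snp)
    return sorted(chosen, key=lambda s: (s.chrom, s.pos))

# Hypothetical candidate SNPs.
snps = [
    Snp("A", 1, 100_000, 0.50, 0.05),
    Snp("B", 1, 250_000, 0.60, 0.02),   # beats A, which is only 150 Kb away
    Snp("C", 1, 600_000, 0.35, 0.08),
    Snp("D", 1, 900_000, 0.40, 0.20),   # differs too much between MAm and SAm
    Snp("E", 2, 100_000, 0.25, 0.01),   # not informative enough for Eur/NAm
]
panel = select_aims(snps)
selected_names = [s.name for s in panel]
```

Calling the same function with `min_f=0.2` on the regions left uncovered would correspond to the gap-filling stage.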

23 Characteristics of the AM panel
Number of loci: 2,120
Average inter-marker physical distance / S.D.: (value missing) Mb / 832 Kb
Average inter-marker genetic distance / S.D.: (values missing) cM / cM
f value
Average f value between South American and European American / S.D.: (values missing)
Average f value between MesoAmerican and European American / S.D.: (values missing)
Average f value between MesoAmerican and South American / S.D.: (values missing)
Delta (absolute allele frequency difference)
Average delta between South American and European American / S.D.: (values missing)
Average delta between MesoAmerican and European American / S.D.: (values missing)
Average delta between MesoAmerican and South American / S.D.: (value missing) / 0.045