Population structure, heritability, and polygenic risk

Population structure, heritability, and polygenic risk
Alicia Martin, Daly Lab
October 18, 2016
armartin@broadinstitute.org, @genetisaur

Project goals
- Call local ancestry in a large case/control PTSD cohort of African Americans
- Estimate heritability using local ancestry tracts. Compare and contrast this estimate with SNP-based heritability in this cohort and a European cohort (in progress)
- Perform admixture mapping
- Considerations: transferability of polygenic risk scores, cross-population heritability
(Work with Karestan Koenen, Mark Daly, Laramie Duncan, Caroline Nievergelt)

Data overview

| # | Study | PI | Analyst | N total | N AA | Data label |
| 1 | GTP (Grady Trauma Project) | Kerry Ressler | Lynn Almli | 4752 | 3492 | gt2y |
| 2 | Detroit (DNHS) | Monica Uddin | Guia Guffanti | 812 | 650 | dnhy |
| 3 | Genetics of Substance Dependence | Joel Gelernter | Pingxing Xie | 5451 | 3100 | gsdy |
| 4 | Marine Resilience Study | Caroline Nievergelt / Dewleen Baker | Adam Maihofer | 4036 | 226 | mrsy |
| 5 | Family Study of Cocaine Dependence | Laura Bierut | Louis Fox | 1271 | 653 | fscy |
| 6 | COGEND | Laura Bierut | Louis Fox | 2768 | 711 | cogy |
| 7 | Nurses Health Study | Karestan Koenen | Andrew Ratanatharathorn | 1378 | | |
| 8 | Stein South Africa | Dan Stein / Kerry Ressler | Lynn Almli | 434 | | |
| 9 | Ohio National Guard | Israel Liberzon | Tony King | 239 | | |
Summary statistics from imputed data:
| 10 | Duke | J. Beckham / M. Hauser / A. Ashley-Koch | Melanie Garrett | 1963 | | |
| 11 | National Center for PTSD (Boston) | Mark Miller / Mark Logue | Mark Logue | 652 | | |
| | Total | | | 23,756 | 8,832 | |

Local ancestry calling strategy
1. Merge intersecting genotyped SNPs (N=421,607 with MAF > 0.05)
2. Phase aggregated dataset with HAPI-UR 3x and take best combined phase
3. Split jointly phased haplotypes into reference + 50 sets of admixed samples for computational feasibility
4. Aggregate local ancestry calls across all runs
5. Collapse local ancestry output
[Workflow diagram: gt2y + dnhy + gsdy + mrsy + fscy + cogy + YRI + CEU (AA + reference genotypes) -> AA + reference jointly phased haplotypes -> local ancestry runs 1 to 50 -> combined local ancestry calls -> collapsed bed files, ancestry karyograms, and plink files]

Heritability estimates

| $h^2$ estimate | Kinship matrix | $\hat{h}^2$ | SE | N |
| $h^2_g$ | REAP | 0.018 | 0.046 | 7548 |
| $h^2_g$ | GCTA GRM | 0.02 | 0.048 | 7248 |
| $h^2_\gamma$ | local ancestry GRM | ? | ? | |

$h^2_\gamma$ = phenotypic variation described by variation in local ancestry
$\sigma^2_\gamma$ = phenotypic variation explained by variation in local ancestry
$\sigma^2_e$ = residual phenotypic variance
$$h^2_\gamma = \frac{\sigma^2_\gamma}{\sigma^2_\gamma + \sigma^2_e}$$
$F_{STC}$ = weighted allele frequency differences between ancestral populations at causal loci
$\theta$ = genome-wide ancestry proportions
$$h^2_\gamma = 2 F_{STC}\,\theta(1-\theta)\,h^2$$
Zaitlen, N., et al. (2014). Nat. Genet. 46, 1356-1362.
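The last relation on this slide can be rearranged to recover $h^2$ from a local-ancestry estimate, $h^2 = h^2_\gamma / (2 F_{STC}\,\theta(1-\theta))$. The sketch below only illustrates that arithmetic. The function name and all input values are hypothetical and are not taken from the slides or from the PTSD cohort.

```python
def h2_from_local_ancestry(h2_gamma, fst_c, theta):
    """Invert h2_gamma = 2 * F_STC * theta * (1 - theta) * h2 to solve for h2."""
    scale = 2.0 * fst_c * theta * (1.0 - theta)
    if scale <= 0.0:
        raise ValueError("F_STC and theta * (1 - theta) must be positive")
    return h2_gamma / scale

# Hypothetical inputs, chosen for illustration only
h2_gamma = 0.01   # variance explained by local ancestry
fst_c = 0.16      # F_ST at causal loci
theta = 0.8       # genome-wide ancestry proportion
h2 = h2_from_local_ancestry(h2_gamma, fst_c, theta)
print(round(h2, 4))
```

Because $\theta(1-\theta)$ and $F_{STC}$ are both small, a modest $h^2_\gamma$ corresponds to a much larger $h^2$, so uncertainty in $h^2_\gamma$ is amplified by the same factor.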

1000 Genomes phase 3 populations
Auton, A., et al. (2015). Nature 526, 68-74.

Substantial global genetic diversity in 1000 Genomes
[Figure: ancestry clustering at K=5, K=6, and K=7 across populations grouped as Europeans (TSI, FIN, CEU, GBR, IBS), East Asians (CDX, KHV, CHS, CHB, JPT), Africans (GWD, MSL, YRI, ESN, LWK), South Asians (STU, GIH, PJL, ITU, BEB), and Admixed Americas (ACB, ASW, PUR, CLM, MXL, PEL)]

Varying admixture proportions across populations in the Americas
[Figure: ancestry proportions (0.0 to 1.0) in three panels. Reference panel: NAT, CEU, YRI. African American: ACB, ASW. Hispanic/Latino: PUR, CLM, MXL, PEL.]
NAT = Mao et al. (2007). AJHG 80, 1171-1178.
African Americans: ACB = African Caribbean in Barbados, ASW = African Ancestry in SW US
Hispanic/Latinos: CLM = Colombians, MXL = Mexicans, PUR = Puerto Ricans, PEL = Peruvians

Admixed samples in the Americas

Admixture tracts inform subcontinental-level ancestral populations
[Figure: local ancestry karyogram for HG01893 (Peruvian)]
RFMix: Maples, B.K., et al. (2013). AJHG 93, 278-288.

Ancestry-specific PCA provides insight into subcontinental admixture origins
[Figure: PC1 vs. PC2. Reference: AFR, EUR, NAT. Admixed: ACB, ASW, CLM, MXL, PEL, PUR.]
ASPCA: Moreno-Estrada, A., et al. (2013). PLoS Genetics 9, e1003925.

African Americans have northern European tracts, Hispanics have southern European tracts
[Figure: PC1 vs. PC2. Reference: FIN, CEU, GBR, IBS, TSI. Admixed: ACB, ASW, CLM, MXL, PEL, PUR.]
ASPCA: Moreno-Estrada, A., et al. (2013). PLoS Genetics 9, e1003925.

African Americans have African tracts closest to Nigerian reference panel
[Figure: PC1 vs. PC2. Reference: ESN, GWD, LWK, MSL, YRI. Admixed: ACB, ASW.]
ASPCA: Moreno-Estrada, A., et al. (2013). PLoS Genetics 9, e1003925.

Africans have more genetic variation than out-of-Africa populations
[Figure: super populations AFR, AMR, EAS, EUR, SAS]
1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature 526, 68-74.

Biased genetic discoveries
[Figure: ancestry composition of the global population compared with PGC GWAS (SCZ, BIP, MDD, ADHD). Labels: East Asian, Latino, African, Middle Eastern, European, Oceanic, South Asian.]

Europeans (and Hispanic/Latinos) are overrepresented in disease databases
1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature 526, 68-74.

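Computing polygenic risk scores from summary statistics
$$X = \sum_{i=1}^{m} g_i \beta_i$$
where $g_i$ is an individual's genotype at variant $i$ and $\beta_i$ is the effect size estimate for that variant.
LD clumping for all variants with MAF ≥ 0.01:
- Apply p-value threshold (p = 0.01)
- Thin for LD within window ($R^2$ = 0.5, window = 250 kb)
-> (P+T in LDpred paper)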
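The score and the pruning-and-thresholding step above can be sketched in a few lines of NumPy. This is a simplified greedy illustration under stated assumptions, not the procedure used for the talk: the function names, the toy genotypes, effect sizes, p-values, positions, and the pairwise $r^2$ matrix are all hypothetical, and the MAF filter is assumed to have been applied beforehand.

```python
import numpy as np

def polygenic_score(genotypes, betas):
    """X = sum_i g_i * beta_i, one score per individual (row of `genotypes`)."""
    return np.asarray(genotypes, dtype=float) @ np.asarray(betas, dtype=float)

def threshold_and_clump(pvalues, positions, r2, p_max=0.01, r2_max=0.5, window=250_000):
    """Greedy P+T sketch: visit SNPs from smallest p-value upward and keep a SNP
    unless an already-kept SNP is within `window` bp with r^2 >= r2_max."""
    pvalues = np.asarray(pvalues, dtype=float)
    positions = np.asarray(positions)
    r2 = np.asarray(r2, dtype=float)
    kept = []
    for i in np.argsort(pvalues):
        if pvalues[i] > p_max:
            break  # all remaining SNPs fail the p-value threshold
        clash = any(
            abs(positions[i] - positions[j]) <= window and r2[i, j] >= r2_max
            for j in kept
        )
        if not clash:
            kept.append(int(i))
    return sorted(kept)

# Hypothetical toy data: 3 individuals, 4 SNPs
genotypes = np.array([[0, 1, 2, 1], [2, 0, 1, 0], [1, 1, 0, 2]])
betas = np.array([0.5, 0.4, -0.1, -0.2])
pvalues = [1e-5, 1e-3, 0.2, 5e-3]
positions = [1_000, 50_000, 90_000, 900_000]
r2 = np.eye(4)
r2[0, 1] = r2[1, 0] = 0.8  # SNPs 0 and 1 are in strong LD

kept = threshold_and_clump(pvalues, positions, r2)
scores = polygenic_score(genotypes[:, kept], betas[kept])
print(kept, scores)
```

In this toy case SNP 1 is dropped because it is in LD with the more significant SNP 0, and SNP 2 fails the p-value threshold, so only SNPs 0 and 3 contribute to the score.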

Polygenic risk score for height reflects adaptive event in Europeans and bias
[Figure: European height score. Density of polygenic risk score by region: N. Europe, S. Europe.]
Wood, A.R., et al. (2014). Nature Genetics 46, 1173-1186.

Polygenic risk score for height reflects adaptive event in Europeans and bias
[Figure, two panels. European height score: density of polygenic risk score by region (N. Europe, S. Europe). Global height score: density of polygenic score by super population (AFR, AMR, EAS, EUR, SAS).]
Wood, A.R., et al. (2014). Nature Genetics 46, 1173-1186.

Polygenic risk score for Type II diabetes highlights role of demography
[Figure, two panels: Global T2D (EUR) score and Global T2D (Multi-ethnic) score. Density of polygenic score by super population: AFR, AMR, EAS, EUR, SAS.]
European: Gaulton, K.J., et al. (2015). Nat. Genet. 47, 1415-1425.
Multi-ethnic: Mahajan, A., et al. (2014). Nat. Genet. 46, 234-244.

Coalescent model for simulation framework
Demographic model: Gravel, S., et al. (2011). Proc. Natl. Acad. Sci. U.S.A. 108, 11983-11988.
msprime: Kelleher, J., Etheridge, A.M., and McVean, G. (2015). PLoS Comput Biol 1-22.

Simulation steps
- Simulate genotypes for chr20 ($\mu$ = 2e-8 mutations/(bp*generation)) with HapMap recombination map for 200k individuals each: Africans, East Asians, Europeans
- Assign true causal effect sizes to $m$ evenly spaced variants as: $\beta \sim N(0, h^2/m)$
- As before, define $X$ as: $X = \sum_{i=1}^{m} g_i \beta_i$
- Normalize: $Z_X = \frac{X - \mu_X}{\sigma_X}$
- Compute true PRS as (such that total variance is $h^2$): $G = \sqrt{h^2}\, Z_X$
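The construction of the true PRS in the steps above can be sketched as follows. One loud assumption: independent binomial genotypes stand in for the coalescent-simulated chr20 data, so this toy has no LD or population structure. The sample size, allele frequency range, and seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h2 = 2_000, 200, 0.5  # individuals, causal variants, heritability (toy values)

# Assumption: independent binomial genotypes replace the coalescent simulation
freqs = rng.uniform(0.05, 0.5, size=m)
genos = rng.binomial(2, freqs, size=(n, m)).astype(float)

beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)  # beta_i ~ N(0, h2 / m)
X = genos @ beta                                 # X = sum_i g_i * beta_i
Z_X = (X - X.mean()) / X.std()                   # normalize to mean 0, variance 1
G = np.sqrt(h2) * Z_X                            # true PRS with variance h2

print(G.mean(), G.var())
```

Standardizing $X$ before scaling is what pins the variance of $G$ at exactly $h^2$, regardless of the allele frequencies or the particular draw of effect sizes.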

Simulation steps
- Compute the total liability for each individual ($Z_\epsilon$ is standard normal noise): $T = \sqrt{h^2}\, Z_X + \sqrt{1-h^2}\, Z_\epsilon$, such that $h^2 = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_\epsilon}$
- Assuming a 5% prevalence, assign case status to the 10,000 European individuals at the most extreme end of the liability distribution
- Randomly assign control status to a different 10,000 European individuals
- Run a simulated GWAS, computing Fisher's exact test for all sites with MAF ≥ 0.01
- Clump SNPs into LD blocks for all sites with p ≤ 1e-2, $R^2$ ≥ 0.5 in Europeans, and window size of 250 kb
- Compute inferred PRS from summary stats and compare with true PRS
- Evaluate over 50 simulations for $m$ = 200, 500, 1000 and $h^2$ = 0.33, 0.50, 0.67
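The liability and case/control steps can be illustrated with a small sketch. It assumes a true PRS $G$ with variance $h^2$ is already available, and here a standardized normal draw stands in for it. The function name, the cohort size of 20,000, and the 1,000 controls are hypothetical scaled-down stand-ins for the 200k individuals and 10,000 cases and controls on the slide. The GWAS and clumping steps are not shown.

```python
import numpy as np

def assign_case_control(G, h2, prevalence, n_controls, rng):
    """Liability-threshold assignment from a true PRS G whose variance is h2."""
    eps = rng.standard_normal(G.size)          # standard normal noise
    T = G + np.sqrt(1.0 - h2) * eps            # T = sqrt(h2) Z_X + sqrt(1 - h2) Z_eps
    n_cases = int(round(prevalence * G.size))
    order = np.argsort(T)[::-1]                # highest liability first
    cases = order[:n_cases]                    # most extreme tail become cases
    # Controls are drawn at random from everyone who is not a case
    controls = rng.choice(order[n_cases:], size=n_controls, replace=False)
    return T, cases, controls

rng = np.random.default_rng(1)
h2 = 0.5
z = rng.standard_normal(20_000)
G = np.sqrt(h2) * (z - z.mean()) / z.std()     # stand-in for the true PRS

T, cases, controls = assign_case_control(G, h2, prevalence=0.05, n_controls=1_000, rng=rng)
r = np.corrcoef(G, T)[0, 1]                    # Pearson correlation of PRS and liability
print(len(cases), len(controls), round(r, 3))
```

With this construction the correlation between the true PRS and liability is close to $\sqrt{h^2}$, which gives an upper reference point when judging how well an inferred PRS tracks the true one.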

True vs inferred PRS with same causal variants, different effect sizes are inconsistent
$h^2$ = 0.67, $m$ = 1000
[Figure: panels G, H, I]

Best performance in European study population
$h^2$ = 0.67, $m$ = 1000, 50 replicates
[Figure: Pearson's correlation (0.00 to 1.00) by super population: AFR, EAS, EUR, ALL]

http://biorxiv.org/content/early/2016/08/23/070797