Genome-Wide Association Studies (GWAS): Computational Themes and Caveats October 14, 2014

Many Issues in Genome-Wide Association Studies

We show that even for the simplest analysis, there is little consensus on the most appropriate statistical procedure. We will attempt to list the complicating factors.

Goal of Genome-Wide Association Studies: identify patterns of polymorphisms that vary systematically between individuals with different disease states (in particular, healthy and diseased) and could therefore represent the effect of risk-enhancing or protective alleles.

In classifying these patterns, we will deal mostly with biallelic SNPs, but in practice there are more complex patterns, including triallelic SNPs.

Association studies should address the following issues:
(A) Hardy-Weinberg equilibrium testing
(B) Inference of phase and missing data
(C) SNP tagging
(D) Single-SNP test of association
(E) Multi-SNP test of association
(F) Population stratification
(G) Multiple testing correction
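As a concrete illustration of issue (D), the sketch below runs one simple single-SNP test: a Pearson χ² test on the 2x2 table of allele counts in cases and controls. The slides do not prescribe a particular test, so the allelic test, the function name `allelic_chi_square`, and the example counts are all assumptions made for illustration.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    case_a, case_b: counts of alleles A and B among cases.
    ctrl_a, ctrl_b: counts of alleles A and B among controls.
    Returns (chi2, p_value). Hypothetical helper, not from the slides.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case = case_a + case_b
    row_ctrl = ctrl_a + ctrl_b
    col_a = case_a + ctrl_a
    col_b = case_b + ctrl_b
    denom = row_case * row_ctrl * col_a * col_b
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    # Shortcut formula for the Pearson statistic of a 2x2 table.
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts: allele A is more frequent in cases than in controls.
    chi2, p = allelic_chi_square(60, 40, 40, 60)
    print(f"chi2 = {chi2:.3f}, p = {p:.4g}")
```

The allelic test treats the two alleles of each individual as independent observations, which itself relies on Hardy-Weinberg equilibrium; a genotype-based or trend test would be an alternative under different assumptions.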

Different Types of Population Association Studies

Candidate Polymorphism Association Study
- Focuses on an individual polymorphism that is suspected of being implicated in disease causation.
- The purpose of the study is further validation and testing.

Candidate Gene Association Study
- Typically involves 5-50 SNPs within a gene, where the gene is defined to include the coding region, flanking regions, regulatory modules, and splicing regions
- Focuses on genes implicated in prior studies
- Utilizes information from linkage studies (pedigree information, such as sequences of parents, grandparents, etc.)
- Homology information: homology to a gene in a model organism (such as mouse)
- Can be applied repeatedly to test multiple genes

Fine Mapping
- Studies conducted in a candidate region (1-10 Mb) that might involve several hundred SNPs (spanning 5-50 genes)

Genome-Wide Association Studies
- Aim to identify common causal variants throughout the genome
- Require at least 300,000 well-chosen SNPs
- More are needed for certain populations, such as African populations, due to greater genetic diversity
- Made possible by the availability of SNPs from the HapMap project and high-throughput sequencing technology (Affymetrix, Illumina)

Rationale for Association Studies

For your final presentation, you should consider which type of association study characterizes your paper. You should also consider whether other types of polymorphisms are taken into account, such as:
- insertions
- deletions
- microsatellites
- copy number variation

Population:
- difficulty in defining
- biases in sampling
- biases due to population substructure

Population association compares unrelated individuals, which means that the relationships among them are unknown and presumed to be distant. Since phenotypes cannot be followed through population history, we instead focus on genotype information. The goal is to look for markers in the genotype. An increased density of markers corresponds to more information for use in association studies.

Linkage Disequilibrium

Difficulties in association studies arise because of linkage disequilibrium (LD) and recombination. Because of the number of SNPs, all we can hope for is to find indirect association. The hope is that by typing a dense set of markers, we will observe markers in direct association with an unobserved causal locus, and in indirect association with the disease phenotype.
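To make the notion of association between a marker and a causal locus concrete, the sketch below computes three commonly used pairwise LD summaries (D, D' and r²) for two biallelic loci from a haplotype frequency and two allele frequencies. The slides do not single out any one measure, so the choice of these three, the function name `ld_measures`, and the example frequencies are assumptions made for illustration.

```python
def ld_measures(p_ab, p_a, p_b):
    """Pairwise LD summaries for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2.
    p_a, p_b: frequencies of allele A and allele B.
    Returns (D, D_prime, r_squared). Hypothetical helper.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    # D: departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    # D' rescales D by its maximum attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2: squared correlation between the allele indicators.
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Made-up frequencies: a rare allele B that always travels with A.
    d, d_prime, r2 = ld_measures(p_ab=0.1, p_a=0.5, p_b=0.1)
    print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

In the example, D' is at its maximum while r² is small, which shows how two standard measures can give very different impressions of the same pair of loci.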

There are several major problems in the different stages of association studies. These include:
- Collection of data
- Computational methods for phasing genotypes
- Hardy-Weinberg equilibrium: significant deviation from HWE needs to be scrutinized (tested using the Pearson χ² test, or the Fisher exact test if counts are low)
- Inference of phase
- Missing data (can be as high as 10-20%): the EM method for inferring missing data requires assumptions about the probability distribution

By varying the techniques applied, one obtains different association studies.
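The Pearson χ² check for Hardy-Weinberg equilibrium mentioned above can be sketched as follows for a single biallelic SNP. This is a minimal illustration, assuming genotype counts are already available; the function name `hwe_chi_square` and the example counts are hypothetical, and for low counts the Fisher exact test noted in the slides would be the safer choice.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of Hardy-Weinberg equilibrium (1 df).

    n_aa, n_ab, n_bb: observed genotype counts for a biallelic SNP.
    Returns (chi2, p_value). Hypothetical helper, not from the slides.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts with a deficit of heterozygotes.
    chi2, p = hwe_chi_square(40, 20, 40)
    print(f"chi2 = {chi2:.3f}, p = {p:.4g}")
```

The test has one degree of freedom because the three genotype classes are constrained by the total count and by the estimated allele frequency.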

Most studies involve choices that are ad hoc. Therefore, it is important to address the robustness of the studies conducted.
(1) The genome is so large that patterns suggestive of a causal polymorphism could well arise by chance.
(2) Hardy-Weinberg testing, SNP selection, and phasing are as yet unresolved methods.
(3) Missing data (the missingness problem).
(4) Case samples are often collected differently from controls, giving different rates of missingness. This is often a problem even if the study is carried out blind to case-control status.
(5) LD is a nonquantitative phenomenon, and there is no natural scale for it.
(6) In practice, tagging is only effective in capturing common variants (tagging in one population performs poorly in another).
(7) Population structure can generate spurious phenotype associations.
    - The notion of a subpopulation is a theoretical construct that only imperfectly reflects reality.
    - The question of the correct number of subpopulations can never be fully resolved.
(8) To phase or not to phase.
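Point (1) can be made concrete with a small calculation: when hundreds of thousands of SNPs are each tested at a nominal level, many will look significant by chance alone. The sketch below uses the Bonferroni correction, which is only one of several possible multiple-testing corrections and is not named in the slides; the function names and the figure of 300,000 tests (borrowed from the SNP count quoted earlier) are assumptions for illustration.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of null tests with p < alpha (all hypotheses null)."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha (Bonferroni correction)."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


if __name__ == "__main__":
    n_snps = 300_000  # assumed number of tested SNPs
    alpha = 0.05
    print("expected chance hits at p < 0.05:",
          expected_false_positives(alpha, n_snps))
    print("Bonferroni per-SNP threshold:",
          bonferroni_threshold(alpha, n_snps))
```

Because neighbouring SNPs are correlated through LD, the tests are not independent and Bonferroni is conservative here; permutation-based or false-discovery-rate approaches are common alternatives.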