SNP calling and Genome Wide Association Study (GWAS) Trushar Shah

Types of Genetic Variation
- Single Nucleotide Aberrations
  - Single Nucleotide Polymorphisms (SNPs)
  - Single Nucleotide Variations (SNVs)
- Short Insertions or Deletions (indels)
- Larger Structural Variations (SVs)
9/12/2012 Variant Calling

Catalogs of human genetic variation
- The 1000 Genomes Project (http://www.1000genomes.org/): SNPs and structural variants; genomes of about 2500 unidentified people from about 25 populations around the world will be sequenced using NGS technologies
- HapMap (http://hapmap.ncbi.nlm.nih.gov/): identify and catalog genetic similarities and differences
- dbSNP (http://www.ncbi.nlm.nih.gov/snp/): database of SNPs and multiple small-scale variations that include indels, microsatellites, and nonpolymorphic variants
- COSMIC (http://www.sanger.ac.uk/genetics/cgp/cosmic/): Catalog of Somatic Mutations in Cancer

SNP Discovery: Goal. Distinguish true SNPs from sequencing errors.

SNP Discovery: Base Qualities [Figure: reads with high-quality and low-quality base calls]
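Base qualities are normally reported as Phred scores, where Q = -10 log10(p) and p is the probability that the base call is wrong. The sketch below decodes a FASTQ quality string and converts each score to an error probability. It assumes the common Phred+33 ASCII encoding, and the quality string is made up for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_fastq_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33) is an assumption; older data may use 64.
    """
    return [ord(ch) - offset for ch in quality_string]


# Hypothetical quality string: two high-quality bases, then decreasing quality.
for q in decode_fastq_qualities("II5+!"):
    print(f"Q={q:2d}  P(error)={phred_to_error_prob(q):.4f}")
```

A base with Q=20 has a 1 in 100 chance of being wrong, and Q=30 a 1 in 1000 chance, which is why low-quality bases are down-weighted or filtered before calling SNPs.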

A framework for variation discovery
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8 (2011). PMID: 21478889.

Variant calling methods
- > 15 different algorithms
- Three categories
  - Allele counting
  - Probabilistic methods, e.g. a Bayesian model, to quantify statistical uncertainty; assign priors based on the observed allele frequency of multiple samples
  - Heuristic approach, based on thresholds for read depth, base quality, variant allele frequency, statistical significance
SNP variant example:
  Ref   Ind1   Ind2
  A     G/G    A/G
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/software/list
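To make the probabilistic category concrete, here is a minimal sketch of a Bayesian diploid genotype call at a single site. It is not the model of any particular caller. It assumes independent reads, a biallelic site, errors spread evenly over the three other bases, and Hardy-Weinberg priors from an assumed alternate allele frequency (`alt_freq`, a hypothetical parameter). The read pileups mimic the Ref A, Ind1 G/G, Ind2 A/G example above.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, alt_freq=0.01):
    """Posterior probabilities of 0, 1 or 2 copies of the alt allele.

    bases: observed read bases at the site; quals: their Phred scores.
    """
    f = alt_freq
    priors = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]  # Hardy-Weinberg priors
    log_post = []
    for g, prior in enumerate(priors):
        ll = math.log(prior)
        for b, q in zip(bases, quals):
            # Cap the error rate so a Q=0 base is uninformative, not impossible.
            e = min(10 ** (-q / 10), 0.75)
            p_ref = 1 - e if b == ref else e / 3
            p_alt = 1 - e if b == alt else e / 3
            # A read samples either chromosome with probability 1/2.
            ll += math.log((1 - g / 2) * p_ref + (g / 2) * p_alt)
        log_post.append(ll)
    top = max(log_post)
    weights = [math.exp(x - top) for x in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical pileups at a site where the reference base is A.
ind1 = genotype_posteriors("GGGGGGGGGGGG", [30] * 12, ref="A", alt="G")
ind2 = genotype_posteriors("AGAGGAAG", [30] * 8, ref="A", alt="G")
print("Ind1 (A/A, A/G, G/G):", [round(p, 4) for p in ind1])
print("Ind2 (A/A, A/G, G/G):", [round(p, 4) for p in ind2])
```

With these inputs the highest posterior falls on G/G for Ind1 and on A/G for Ind2. With few reads the prior dominates, which is one reason low-coverage data benefit from multi-sample priors.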

Variant callers
Name                           | Category        | Tumor/Normal Pairs | Metric              | Reference
Bambino                        | Allele Counting | Yes                | SNP Score           | Edmonson, M.N. et al. (2011)
JointSNVMix (Fisher)           | Allele Counting | Yes                | Somatic probability | Roth, A. et al. (2012)
Somatic Sniper                 | Heuristic       | Yes                | Somatic Score       | Larson, D.E. et al. (2012)
VarScan 2                      | Heuristic       | Yes                | Somatic p-value     | Koboldt, D. et al. (2012)
Genome Analysis ToolKit (GATK) | Bayesian        | No                 | Phred QUAL          | DePristo, M.A. et al. (2011)
References
- Edmonson, M.N. et al. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics 27(6):865-866 (2011).
- Roth, A. et al. JointSNVMix: A Probabilistic Model For Accurate Detection Of Somatic Mutations In Normal/Tumour Paired Next Generation Sequencing Data. Bioinformatics (2012).
- Larson, D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3):311-7 (2012).
- Koboldt, D. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, DOI: 10.1101/gr.129684.111 (2012).
- DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8 (2011). PMID: 21478889.

Variant Annotation
- SeattleSeq (http://snp.gs.washington.edu/seattleseqannotation/): annotation of known and novel SNPs; includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g. missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
- Annovar (http://www.openbioinformatics.org/annovar/)
  - Gene-based annotation
  - Region-based annotations
  - Filter-based annotation

GWAS: Definition Association of molecular markers (usually SNPs detected across the whole genome), with a trait of interest, scored across a wide collection of individuals

GWAS: Definition (example workflow)
1000 unrelated finger millet plants with diverse response to blast disease
-> Genome-wide SNP analysis of each genotype; phenotype across different environments
-> Association analysis
-> Correlated loci
-> Obvious candidate genes?

Association Mapping vs Family Mapping
- A more natural experiment
- Relatively easy and cost-effective
  - Clonally propagated plants
  - Trees (long life cycle)
  - Impossible to cross
- Accounts for more phenotypic diversity
- Exploits more recombination events within a species
However
- Difficult to establish when and where recombination occurred
- False signals highly likely

Linkage mapping (or) Family mapping
- Generation of mapping population (RILs, NILs, DH, BC, F2)
- Genotyping polymorphic markers
- Phenotyping trait of interest
Limitations
- Resolution power is low (10-30 cM)
- Small population size
- Modest degree of recombination within the population
- Linkage mapping is limited to sampling only two alleles at a given locus in any given bi-parental population

Recombination: the key to genetic variation and to the success of breeding

Strategy-1. QTL Mapping
- QTL: genomic region responsible for a phenotypic trait that shows continuous distribution
- QTL Mapping: process of finding and estimating associations between a set of markers and a continuously distributed trait

Strategy-1. QTL Mapping
Mostly used for oligogenic traits
1. Decide on the trait
2. Select contrasting parents
3. Identify polymorphic markers
4. Cross and develop suitable mapping population (F2/RIL/NIL etc.)
5. Genotype the population
6. Measure the phenotype with precision
7. Association of genotypes with phenotype reveals QTL location, effect etc.

Strategy-1. QTL Mapping: populations and statistical methodologies
- Mapping populations: F2, F2:3, BCnF1, RIL, NIL, large F2, AILs, NAM, MAGIC
- Statistical methodologies: single marker regression, interval mapping (IM), composite interval mapping (CIM), inclusive composite interval mapping (ICIM), multiple interval mapping (MIM), Bayesian QTL mapping etc.
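Single marker regression is the simplest of the methodologies listed: the trait is regressed on the genotype at one marker at a time. The sketch below assumes an additive 0/1/2 coding of genotypes and uses made-up phenotype values. It reports the slope (additive effect), the proportion of variance explained, and the F statistic with 1 and n-2 degrees of freedom. Converting F to a p-value needs an F distribution and is left out.

```python
def single_marker_regression(genotypes, phenotypes):
    """Ordinary least squares of phenotype on genotype (coded 0/1/2).

    Returns (slope, intercept, r_squared, f_statistic).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("marker or trait is monomorphic; nothing to regress")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    r_squared = sxy ** 2 / (sxx * syy)
    # F test of slope = 0 with (1, n - 2) degrees of freedom.
    if r_squared < 1:
        f_statistic = r_squared * (n - 2) / (1 - r_squared)
    else:
        f_statistic = float("inf")
    return slope, intercept, r_squared, f_statistic


# Hypothetical marker scores and trait values for six lines.
slope, intercept, r2, f_stat = single_marker_regression(
    [0, 1, 2, 0, 1, 2], [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
)
print(f"effect per allele = {slope:.2f}, R^2 = {r2:.3f}, F = {f_stat:.1f}")
```

Interval mapping and its extensions refine this idea by testing positions between markers and by including other markers as cofactors.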

Strategy-1. QTL Mapping: suggested readings
- Kearsey, M.J. and Pooni, H.S. 1996. The genetical analysis of quantitative traits. Chapter 7.
- Beavis, W. 1998. QTL analyses: Power, precision, and accuracy. pp. 145-162. In: Paterson, A.H. (ed.), Molecular Dissection of Complex Traits. CRC Press, Boca Raton.
- Bernardo, R. 2008. Molecular Markers and Selection for Complex Traits in Plants: Learning from the Last 20 Years. Crop Sci. 48:1649-1664.
- IRRI's e-learning course: http://www.knowledgebank.irri.org/ricebreedingcourse/index.htm

ASSOCIATION MAPPING
- Currently existing natural populations are used vs. generating a population via a biparental cross
- No need to develop a mapping population
- A potentially large number of alleles per locus, as opposed to only two, can be surveyed simultaneously
- Resolution can be dramatically increased (e.g. 2000 bp in diverse maize inbred lines): fine mapping
- Reduces time
- Takes the history of recombination/evolution into account
AM is a multi-disciplinary field
- Genomics
- Genetics
- Molecular Biology
- Statistical Genetics
- Bioinformatics

Association Mapping vs Family Mapping Yu and Buckler 2006

GWAS Types
Success of either method depends on population size and degree of LD
1. Genome-wide scanning or AM
   - Markers spanned across the genome
   - Moderate to extensive LD
2. Candidate gene scanning or AM
   - Sequencing only candidate gene
   - Low LD

GENOME-WIDE ASSOCIATION MAPPING (GWA)
- Species
  - Self-fertile: Arabidopsis, rice
  - Clonally propagated: switchgrass, grape
- If LD is high, GWA is useful with low-resolution mapping
- Number of markers to screen is determined by sample size and extent of LD, e.g.:
  - Human: 70,000 markers
  - Arabidopsis: 2,000 markers
  - Diverse maize landraces: 750,000 markers
  - Elite maize lines: 50,000 markers
  - Sorghum: 556,000 markers

CANDIDATE GENE APPROACH
A multi-disciplinary approach to positional candidates or candidate genes:
- Mutagenesis
- Biochemical analysis
- Expression profiling
- Comparative genome mapping
- Bioinformatics
- Linkage mapping

Pre-requisites
- Linkage disequilibrium
- Diverse genotypes
- Establishing the relatedness
  - STRUCTURE
  - Principal Component Analysis (PCA)
  - Kinship matrix
- Distribution of phenotype in the population
- Robust numbers of markers covering the whole genome
- Reliable and reproducible phenotypic data
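Linkage disequilibrium between two markers is often summarised as r^2. For unphased genotypes coded as 0/1/2 allele counts, a common approximation is the squared Pearson correlation between the two genotype vectors. The sketch below uses that approximation; the genotype vectors are invented for illustration.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two markers coded as 0/1/2.

    Returns NaN when either marker is monomorphic in the sample.
    """
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov ** 2 / (var_a * var_b)


# Hypothetical genotypes of six individuals at three markers.
marker1 = [0, 1, 2, 0, 1, 2]
marker2 = [0, 1, 2, 0, 1, 2]   # always inherited together with marker1
marker3 = [2, 0, 1, 1, 2, 0]   # largely independent of marker1
print(ld_r2(marker1, marker2), ld_r2(marker1, marker3))
```

Plotting r^2 against the physical distance between marker pairs gives the LD decay curve, which indicates how densely markers must cover the genome.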

Planning a GWA Study
1. Population size
2. Experimental design
3. Phenotyping approach
4. Genotyping method
5. Analysis methods
6. Validation of detected loci

Population Size
- The larger the number, the higher the power and precision
  - A minimum of 100
  - A study in barley recommended at least 384 (Wang et al. 2012)
- Depends on
  - Trait to be examined
  - Resources available
- Options
  - Examine population structure, select from representative groups

Experimental Design
- Should be replicated
- Different seasons, environments taken into account
- Can be one stage, or multiple stages
  - One-stage: many individuals, all genotyped and phenotyped
  - Two-stage: few individuals with traits of interest selected and genotyped; associated markers used in a wider population

Phenotyping Approach
- Quantitative rather than qualitative datasets
- Avoid Yes/No
- Score from 0-9, rather than 0-5
[Figure: phenotype distribution histograms, panels "Overall pest score (1-9)_1" and "Grain yield per panicle_combined"]

Genotyping Approach
- Genome-wide SNP detection
  - Genotyping-by-sequencing
  - RADseq
  - DArTseq
  - Etc.
- SNP-chip analysis
  - Depends on availability of arrays
  - Not economical, but may be the only option in some crops
- SSRs

4. POPULATION STRUCTURE
Statistical methods for calculating population structure
- Structured association (SA): uses a set of random markers to estimate population structure (Q) and then incorporates this estimate into further statistical analysis
- Mixed model approach: random markers are used to estimate Q and a relative kinship matrix (K), which are then fit into a mixed-model framework to test for marker-trait associations
- Principal component analysis (PCA): summarizes variation observed across all markers into a smaller number of underlying component variables
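As an illustration of what a kinship matrix K contains, the sketch below computes a marker-based relationship matrix. It uses VanRaden's first estimator, K = ZZ' / (2 * sum of p(1-p)), where Z holds genotypes centred by twice the allele frequency. That choice is an assumption made for illustration; software such as TASSEL or STRUCTURE may use different estimators. The genotype matrix is invented.

```python
def vanraden_kinship(genotypes):
    """Marker-based kinship from a list of individuals, each a list of 0/1/2 counts."""
    n_ind = len(genotypes)
    n_mark = len(genotypes[0])
    # Allele frequency at each marker, estimated from the sample itself.
    freqs = [sum(ind[j] for ind in genotypes) / (2 * n_ind) for j in range(n_mark)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    # Centre each genotype by its expected value 2p.
    centred = [[ind[j] - 2 * freqs[j] for j in range(n_mark)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / scale for k in range(n_ind)]
        for i in range(n_ind)
    ]


# Hypothetical panel: two pairs of identical inbred lines from two distinct groups.
panel = [
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]
for row in vanraden_kinship(panel):
    print([round(x, 2) for x in row])
```

Lines from the same group get large positive values and lines from different groups negative ones. PCA on the same centred genotype matrix exposes the same grouping as separation along the first component.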

5. Statistical Analysis
[Workflow diagram with the elements: Germplasm, Phenotyping, Genotyping, STRUCTURE, Q-matrix, PCA, TASSEL, K-matrix, LD, Dendrogram, Marker-trait association (Association Mapping)]
TASSEL = Trait Analysis by aSSociation, Evolution and Linkage

Data Analysis
- Depends on data generated
- Combining genotype and phenotype data
- Two main methods
  - General linear model (GLM): does not account for relatedness
  - Mixed linear model (MLM): accounts for population structure and kinship
  - Both: combine results and only present consensus

Interpreting GWAS Results
- P-value
  - Should it be < 10^-7 or < 5 x 10^-8?
  - P-value alone says very little about the results
- R^2 value
  - An estimation of LD decay
  - Correlation between a pair of loci
  - What is the cut-off?
- False Discovery Rate (Q-value)
  - Ratio of false rejections to total rejections
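The two kinds of threshold mentioned above can be sketched in a few lines. A Bonferroni threshold divides the significance level by the number of tests, so 0.05 over one million tests gives 5 x 10^-8. The Benjamini-Hochberg procedure, used here as one common way to obtain FDR-adjusted values, ranks the p-values and rescales them. The p-values in the example are invented.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(bonferroni_threshold(0.05, 1_000_000))
# Hypothetical p-values for four markers.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Bonferroni is conservative when markers are in LD, because the tests are not independent. An FDR cut-off accepts that a stated fraction of the reported associations may be false.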

What Next?
- Identifying potential candidate genes
  - Whole genome sequence available?
  - Searching against public databases
- Validating SNPs in larger/bi-parental populations
  - Depends on availability

Practical Session

Download data Import hapmap files (Genotypic data)

Plink

Phenotypic Data Format

Filter Datasets

GLM: Join genotypic and phenotypic data (intersect join)
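An intersect join keeps only the taxa present in both the genotype and the phenotype table. The sketch below shows the idea with plain dictionaries. It does not reproduce how TASSEL performs the join internally, and the taxon names and values are hypothetical.

```python
def intersect_join(genotypes, phenotypes):
    """Keep taxa found in both tables and pair their genotype and phenotype."""
    shared = sorted(set(genotypes) & set(phenotypes))
    return {taxon: (genotypes[taxon], phenotypes[taxon]) for taxon in shared}


# Hypothetical data: three genotyped lines, three phenotyped lines, two in common.
genotypes = {"line_A": [0, 2, 1], "line_B": [2, 2, 0], "line_C": [1, 0, 0]}
phenotypes = {"line_B": 41.5, "line_C": 37.2, "line_D": 52.0}

joined = intersect_join(genotypes, phenotypes)
print(sorted(joined))  # taxa that go into the GLM
dropped = sorted(set(genotypes) ^ set(phenotypes))
print(dropped)         # taxa missing from one of the tables
```

Checking the dropped taxa is worthwhile: a large number usually points to inconsistent naming between the two files, not to missing data.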

GLM

GLM: QQ plots

GLM: QQ/Manhattan plots

GLM: QQ/Manhattan plots

MLM: Kinship

MLM: Run

MLM: Run

MLM: QQ/Manhattan plots

MLM: QQ/Manhattan plots

Compare GLM/MLM: QQ plots [Figure panels: GLM, MLM]
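A QQ plot compares observed -log10(p) values with those expected under the null hypothesis of no association, and the comparison between models can be summarised by the genomic inflation factor lambda. The sketch below computes both from a list of p-values using only the standard library. It assumes 1-degree-of-freedom tests and p-values in the range (0, 1]; the uniform p-values in the example stand in for a well-behaved model.

```python
import math
from statistics import NormalDist, median


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, sorted ascending."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


def genomic_inflation(pvalues):
    """Median observed chi-square (1 df) divided by its expected median."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to chi-square = z^2 with z = inv_cdf(p / 2).
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in pvalues]
    expected_median = nd.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median


# Hypothetical, perfectly uniform p-values: points lie on the diagonal, lambda = 1.
uniform = [(i - 0.5) / 101 for i in range(1, 102)]
print(round(genomic_inflation(uniform), 3))
print(qq_points(uniform)[-1])
```

A lambda well above 1, as is typical for a GLM fitted without structure or kinship terms, suggests inflated test statistics. An MLM that models relatedness would be expected to bring it closer to 1.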

OWN DATA

Thanks! Acknowledgement: slides adapted from Nair (SNP calling), Odeny (GWAS) and Babu (QTL Mapping)

MLM: PCA