SNP calling and Genome-Wide Association Study (GWAS)
Trushar Shah
Types of Genetic Variation
- Single Nucleotide Aberrations
  - Single Nucleotide Polymorphisms (SNPs)
  - Single Nucleotide Variations (SNVs)
- Short Insertions or Deletions (indels)
- Larger Structural Variations (SVs)
9/12/2012 Variant Calling
Catalogs of human genetic variation
- The 1000 Genomes Project (http://www.1000genomes.org/)
  - SNPs and structural variants
  - Genomes of about 2500 unidentified people from about 25 populations around the world will be sequenced using NGS technologies
- HapMap (http://hapmap.ncbi.nlm.nih.gov/)
  - Identify and catalog genetic similarities and differences
- dbSNP (http://www.ncbi.nlm.nih.gov/snp/)
  - Database of SNPs and multiple small-scale variations that include indels, microsatellites, and nonpolymorphic variants
- COSMIC (http://www.sanger.ac.uk/genetics/cgp/cosmic/)
  - Catalog of Somatic Mutations in Cancer
SNP Discovery: Goal
[Figure labels: sequencing errors, SNP]
SNP Discovery: Base Qualities
[Figure labels: High quality, Low quality]
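Base qualities are usually reported on the Phred scale, Q = -10 log10(P_error). The sketch below converts between the two and decodes a FASTQ-style quality string. It assumes an ASCII offset of 33 (the Sanger / Illumina 1.8+ convention; older Illumina files used 64). The function names and the example string are illustrative, not from the slides.

```python
def phred_to_error(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual_string]


# Hypothetical quality string: two high-quality bases, one medium, one very poor.
scores = decode_qualities("II5!")
errors = [phred_to_error(q) for q in scores]
```

A base with Q30 therefore has roughly a 1 in 1000 chance of being a sequencing error, which is why low-quality mismatches are poor evidence for a SNP.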
A framework for variation discovery
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).
Variant calling methods
- More than 15 different algorithms
- Three categories
  - Allele counting
  - Probabilistic methods, e.g. Bayesian model
    - To quantify statistical uncertainty
    - Assign priors based on observed allele frequency of multiple samples
  - Heuristic approach
    - Based on thresholds for read depth, base quality, variant allele frequency, statistical significance

            | Ref | Ind1 | Ind2
SNP variant | A   | G/G  | A/G

Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/software/list
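To make the probabilistic category concrete, here is a minimal sketch of a Bayesian genotype call at a single biallelic site. It is not the model of any specific caller listed in these slides. Assumptions: reads are independent, a miscalled base is equally likely to be any of the other three bases, and the genotype priors are fixed illustrative values rather than estimated from multiple samples.

```python
def genotype_posteriors(bases, quals, ref, alt, prior_het=1e-3, prior_hom_alt=5e-4):
    """Posterior probability of ref/ref, ref/alt and alt/alt given read bases and Phred qualities."""
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    priors = {
        "ref/ref": 1.0 - prior_het - prior_hom_alt,  # illustrative priors, not estimated from data
        "ref/alt": prior_het,
        "alt/alt": prior_hom_alt,
    }

    def base_likelihood(base, allele, err):
        # Probability of observing `base` when the read truly carries `allele`.
        return 1.0 - err if base == allele else err / 3.0

    unnormalised = {}
    for name, (a1, a2) in genotypes.items():
        likelihood = 1.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)
            # Each read samples one of the two chromosomes with equal probability.
            likelihood *= 0.5 * base_likelihood(base, a1, err) + 0.5 * base_likelihood(base, a2, err)
        unnormalised[name] = priors[name] * likelihood

    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


# Ten Q30 reads, half A and half G, at a site whose reference base is A.
posteriors = genotype_posteriors("AAAAAGGGGG", [30] * 10, ref="A", alt="G")
call = max(posteriors, key=posteriors.get)
```

With balanced A and G reads the heterozygous genotype dominates despite its small prior. A single low-quality G among many A reads would instead be absorbed as a sequencing error.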
Variant callers

Name                           | Category        | Tumor/Normal Pairs | Metric              | Reference
Bambino                        | Allele Counting | Yes                | SNP Score           | Edmonson, M.N. et al. (2011)
JointSNVMix (Fisher)           | Allele Counting | Yes                | Somatic probability | Roth, A. et al. (2012)
Somatic Sniper                 | Heuristic       | Yes                | Somatic Score       | Larson, D.E. et al. (2012)
VarScan 2                      | Heuristic       | Yes                | Somatic p-value     | Koboldt, D. et al. (2012)
Genome Analysis ToolKit (GATK) | Bayesian        | No                 | Phred QUAL          | DePristo, M.A. et al. (2011)

References
- Edmonson, M.N. et al. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics 27(6):865-866 (2011).
- Roth, A. et al. JointSNVMix: A Probabilistic Model For Accurate Detection Of Somatic Mutations In Normal/Tumour Paired Next Generation Sequencing Data. Bioinformatics (2012).
- Larson, D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3):311-7 (2012).
- Koboldt, D. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research. DOI: 10.1101/gr.129684.111 (2012).
- DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8. PMID: 21478889 (2011).
Variant Annotation
- SeattleSeq (http://snp.gs.washington.edu/seattleseqannotation/)
  - Annotation of known and novel SNPs
  - Includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g. missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
- ANNOVAR (http://www.openbioinformatics.org/annovar/)
  - Gene-based annotation
  - Region-based annotation
  - Filter-based annotation
GWAS: Definition
Association of molecular markers (usually SNPs detected across the whole genome) with a trait of interest, scored across a wide collection of individuals
GWAS: Definition
Example workflow:
- 1000 unrelated finger millet plants with diverse response to blast disease
- Genome-wide SNP analysis of each genotype
- Phenotype across different environments
- Association analysis
- Correlated loci!
- Obvious candidate genes?
Association Mapping vs Family Mapping
- A more natural experiment
- Relatively easy and cost-effective for
  - Clonally propagated plants
  - Trees (long life cycle)
  - Species that are impossible to cross
- Accounts for more phenotypic diversity
- Exploits more recombination events within a species
However
- Difficult to establish when and where recombination occurred
- False signals highly likely
Linkage mapping (or) Family mapping
- Generation of mapping population (RILs, NILs, DH, BC, F2)
- Genotyping polymorphic markers
- Phenotyping trait of interest
Limitations
- Resolution power is low (10-30 cM)
- Small population size
- Modest degree of recombination within the population
- Linkage mapping is limited to sampling only two alleles at a given locus in any given bi-parental population
Recombination: the key to genetic variation and to the success of breeding
Strategy-1. QTL Mapping
- QTL: genomic region responsible for a phenotypic trait that shows continuous distribution
- QTL Mapping: process of finding and estimating associations between a set of markers and a continuously distributed trait
Strategy-1. QTL Mapping
Mostly used for oligogenic traits
1. Decide on the trait
2. Select contrasting parents
3. Identify polymorphic markers
4. Cross and develop suitable mapping population (F2/RIL/NIL etc.)
5. Genotype the population
6. Measure the phenotype with precision
7. Association of genotypes with phenotype reveals QTL location, effect etc.
Strategy-1. QTL Mapping
QTL mapping populations and statistical methodologies
- Mapping populations: F2, F2:3, BCnF1, RIL, NIL, large F2, AILs, NAM, MAGIC
- Statistical methodologies: single marker regression, interval mapping (IM), composite interval mapping (CIM), inclusive composite interval mapping (ICIM), multiple interval mapping (MIM), Bayesian QTL mapping etc.
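Single marker regression is the simplest of the methods listed: regress the phenotype on the genotype code at one marker and see how much variance it explains. A minimal sketch follows. Assumptions: genotypes are coded as the number of copies of one parental allele (0 or 2 in RILs, with 1 for heterozygotes), the data are hypothetical, and no significance test is included because that would need a t or F distribution from a statistics library.

```python
def single_marker_regression(genotypes, phenotypes):
    """Least-squares regression of phenotype on genotype code; returns (slope, r_squared)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("marker or phenotype has no variation")
    slope = sxy / sxx                 # estimated effect per allele copy
    r_squared = sxy ** 2 / (sxx * syy)  # share of phenotypic variance explained by the marker
    return slope, r_squared


# Hypothetical data: four RILs, homozygous for parent A (0) or parent B (2).
slope, r2 = single_marker_regression([0, 0, 2, 2], [10.0, 12.0, 20.0, 22.0])
```

Repeating this for every marker and plotting the test statistic along the chromosome gives the crude QTL scan that interval mapping methods refine.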
Suggested readings
Strategy-1. QTL Mapping
- Kearsey, M.J. and Pooni, H.S. 1996. The genetical analysis of quantitative traits. Chapter 7.
- Beavis, W. 1998. QTL analyses: Power, precision, and accuracy. p. 145-162. In: Paterson, A.H. (ed.), Molecular Dissection of Complex Traits. CRC Press, Boca Raton.
- Bernardo, R. 2008. Molecular Markers and Selection for Complex Traits in Plants: Learning from the Last 20 Years. Crop Sci. 48:1649-1664.
- IRRI's e-learning course: http://www.knowledgebank.irri.org/ricebreedingcourse/index.htm
ASSOCIATION MAPPING
- Currently existing natural populations are used, vs generating a population via a biparental cross
- No need to develop a mapping population
- A potentially large number of alleles per locus, as opposed to only two, can be surveyed simultaneously
- Resolution can be dramatically increased (e.g. 2000 bp in diverse maize inbred lines)
  - Fine mapping
- Reduces time
- Takes the history of recombination/evolution into account
AM is a multi-disciplinary field
- Genomics
- Genetics
- Molecular Biology
- Statistical Genetics
- Bioinformatics
Association Mapping vs Family Mapping
[Figure: Yu and Buckler 2006]
GWAS Types
Success of either method depends on population size and degree of LD
1. Genome-wide scanning or AM
   - Markers spanning the genome
   - Moderate to extensive LD
2. Candidate gene scanning or AM
   - Sequencing only the candidate gene
   - Low LD
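Since both strategies hinge on the degree of LD, it helps to see how LD between two markers is commonly quantified. The sketch below computes r^2 as the squared Pearson correlation between genotype dosages (0/1/2 copies of one allele). This genotypic r^2 is a common approximation to the haplotype-based statistic when phase is unknown. The function name and the example data are illustrative.

```python
def ld_r2(dosages_a, dosages_b):
    """Squared correlation between two markers' dosages; pairs containing None (missing) are skipped."""
    pairs = [(a, b) for a, b in zip(dosages_a, dosages_b) if a is not None and b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    if var_a == 0 or var_b == 0:
        return float("nan")  # r^2 is undefined for a monomorphic marker
    return cov ** 2 / (var_a * var_b)


# Hypothetical markers: the first pair is in complete LD, the second pair is independent.
complete = ld_r2([0, 1, 2, 2], [0, 1, 2, 2])
independent = ld_r2([0, 0, 2, 2], [0, 2, 0, 2])
```

Plotting r^2 against the physical distance between marker pairs shows how quickly LD decays, which in turn indicates how densely markers must be spaced.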
GENOME-WIDE ASSOCIATION MAPPING (GWA)
- Species
  - Self-fertile: Arabidopsis, rice
  - Clonally propagated: switchgrass, grape
- If LD is high, GWA is useful, but with low mapping resolution
- Number of markers to screen is determined by sample size and extent of LD, e.g.:

Species / panel         | Markers
Human                   | 70,000
Arabidopsis             | 2,000
Diverse maize landraces | 750,000
Elite maize lines       | 50,000
Sorghum                 | 556,000
CANDIDATE GENE APPROACH
Multi-disciplinary approach
- Mutagenesis
- Biochemical analysis
- Expression profiling
- Comparative genome mapping
- Bioinformatics
- Linkage mapping
Positional candidates or candidate genes
Pre-requisites
- Linkage disequilibrium
- Diverse genotypes
- Establishing the relatedness
  - STRUCTURE
  - Principal Component Analysis (PCA)
  - Kinship matrix
- Distribution of phenotype in the population
- Robust numbers of markers covering the whole genome
- Reliable and reproducible phenotypic data
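A kinship matrix summarises pairwise relatedness from the marker data. The sketch below uses one widely used estimator, the first method of VanRaden (2008): centre each marker by twice its allele frequency and scale by 2 * sum(p * (1 - p)). This is an assumption chosen for illustration and is not necessarily the estimator used by TASSEL or any other tool named in these slides. Input is a samples x markers table of 0/1/2 dosages with no missing data.

```python
def vanraden_kinship(dosages):
    """Genomic relationship matrix from a samples x markers list of 0/1/2 allele dosages."""
    n = len(dosages)
    m = len(dosages[0])
    # Allele frequency of the counted allele at each marker.
    freqs = [sum(row[j] for row in dosages) / (2 * n) for j in range(m)]
    # Centre each dosage by its expected value, 2p.
    centred = [[row[j] - 2 * freqs[j] for j in range(m)] for row in dosages]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    return [[sum(x * y for x, y in zip(ri, rj)) / scale for rj in centred] for ri in centred]


# Hypothetical panel: three individuals, two markers.
kinship = vanraden_kinship([[0, 2], [2, 0], [1, 1]])
```

Values well above zero off the diagonal flag pairs of closely related individuals, which is the structure a mixed model is meant to absorb.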
Planning a GWA Study
1. Population size
2. Experimental design
3. Phenotyping approach
4. Genotyping method
5. Analysis methods
6. Validation of detected loci
Population Size
- The larger the number, the higher the power and precision
- A minimum of 100
  - A study in barley recommended at least 384 (Wang et al. 2012)
- Depends on
  - Trait to be examined
  - Resources available
- Options
  - Examine population structure, select from representative groups
Experimental Design
- Should be replicated
- Different seasons, environments taken into account
- Can be one stage, or multiple stages
  - One-stage
    - Many individuals, all genotyped and phenotyped
  - Two-stage
    - Few individuals with traits of interest selected and genotyped
    - Associated markers used in a wider population
Phenotyping Approach
- Quantitative rather than qualitative datasets
  - Avoid Yes/No
  - Score from 0-9, rather than 0-5
[Figure: histograms of "Overall pest score (1-9)" and "Grain yield per panicle (combined)"]
Genotyping Approach
- Genome-wide SNP detection
  - Genotyping-by-sequencing
  - RADseq
  - DArTseq
  - Etc.
- SNP-chip analysis
  - Depends on availability of arrays
  - Not economical but maybe the only option in some crops
- SSRs
4. POPULATION STRUCTURE
Statistical methods for calculating population structure
- Structured association (SA): uses a set of random markers to estimate population structure (Q) and then incorporates this estimate into further statistical analysis
- Mixed model approach: random markers are used to estimate Q and a relative kinship matrix (K), which are then fit into a mixed-model framework to test for marker-trait associations
- Principal component analysis (PCA): summarizes variation observed across all markers into a smaller number of underlying component variables
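As an illustration of the PCA option, the sketch below extracts the first principal component of a small genotype table by power iteration, using only the standard library. In practice a linear algebra package would be used and several components kept. Assumptions: dosages are coded 0/1/2 with no missing data, and the function name and example panel are hypothetical.

```python
from math import sqrt


def first_principal_component(dosages, iterations=200):
    """Score of each sample on the first principal component (unit-length vector, arbitrary sign)."""
    n = len(dosages)
    m = len(dosages[0])
    means = [sum(row[j] for row in dosages) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in dosages]
    # n x n matrix of sample-by-sample inner products of the centred genotypes.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centred] for ri in centred]
    # Start from a ramp, not a constant vector: constants are annihilated by the centred matrix.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no genetic variation left to summarise
        v = [x / norm for x in w]
    return v


# Hypothetical panel: samples 0-1 and samples 2-3 come from two distinct subpopulations.
pc1 = first_principal_component([[0, 0, 0, 1], [0, 0, 0, 0], [2, 2, 2, 1], [2, 2, 2, 0]])
```

Samples from the same subpopulation end up with scores of the same sign, and it is these scores that are later used as covariates in the association model.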
5. Statistical Analysis
[Workflow diagram, elements: Germplasm, STRUCTURE, Phenotyping, Genotyping, Q-matrix, PCA, TASSEL, K-matrix, LD, Marker-trait association (Association Mapping), Dendrogram]
TASSEL = Trait Analysis by aSSociation, Evolution and Linkage
Data Analysis
- Depends on data generated
- Combining genotype and phenotype data
- Two main methods
  - General linear model (GLM)
    - Does not account for relatedness
  - Mixed linear model (MLM)
    - Accounts for population structure and kinship
  - Both
    - Combine results and only present consensus
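The difference between the two models is easiest to see when they are written out. The block below gives a commonly used formulation. The notation is an assumption, not taken from the slides: y is the phenotype vector, beta the fixed effects other than the marker, alpha the effect of the marker being tested, v the population-structure effects attached to the Q matrix, u the random polygenic effects whose covariance is set by the kinship matrix K, and e the residual.

```latex
\begin{aligned}
\text{GLM:}\quad & y = X\beta + S\alpha + e \\
\text{MLM:}\quad & y = X\beta + S\alpha + Qv + Zu + e,
  \qquad u \sim N\!\left(0,\; 2K\sigma_g^2\right),
  \quad e \sim N\!\left(0,\; I\sigma_e^2\right)
\end{aligned}
```

Under this reading, the Q term absorbs broad population structure and the Zu term absorbs finer-scale relatedness, both of which the plain GLM leaves in the residual and which can therefore inflate its test statistics.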
Interpreting GWAS Results
- P-value
  - Should it be < 10^-7 or < 5 x 10^-8?
  - P-value alone says very little about the results
- R^2 value
  - An estimation of LD decay
  - Correlation between a pair of loci
  - What is the cut-off?
- False Discovery Rate (Q-value)
  - Ratio of false rejections to total rejections
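Both kinds of threshold can be computed directly from the list of p-values. The sketch below shows a Bonferroni cut-off, which gives 5 x 10^-8 when alpha = 0.05 is divided over one million tests, and Benjamini-Hochberg adjusted p-values, which are often reported as q-values. The one-million test count and the example p-values are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone in rank.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


threshold = bonferroni_threshold(0.05, 1_000_000)         # assumed one million tests
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])   # hypothetical marker p-values
```

Markers whose adjusted value falls below, say, 0.05 form a set in which about 5% of the reported associations are expected to be false.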
What Next?
- Identifying potential candidate genes
  - Whole genome sequence available?
  - Searching against public databases
- Validating SNPs in larger/bi-parental populations
  - Depends on availability
Practical Session
Download data
Import HapMap files (genotypic data)
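For readers who want to inspect a HapMap file outside the GUI, here is a minimal parsing sketch. It assumes the usual tab-delimited layout: one header row, then eleven fixed columns (rs#, alleles, chrom, pos, strand, assembly#, center, protLSID, assayLSID, panelLSID, QCcode) followed by one genotype column per sample. The marker and sample names in the example are hypothetical.

```python
HAPMAP_FIXED_COLUMNS = 11  # rs#, alleles, chrom, pos, strand, assembly#, center, protLSID, assayLSID, panelLSID, QCcode


def parse_hapmap(lines):
    """Parse HapMap-format text lines into (sample_names, list of marker records)."""
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    samples = header[HAPMAP_FIXED_COLUMNS:]
    markers = []
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < HAPMAP_FIXED_COLUMNS:
            continue  # skip blank or truncated lines
        markers.append({
            "id": fields[0],
            "alleles": fields[1],
            "chrom": fields[2],
            "pos": int(fields[3]),
            "genotypes": dict(zip(samples, fields[HAPMAP_FIXED_COLUMNS:])),
        })
    return samples, markers


# Hypothetical two-sample file held in memory; a real file would be read with open(path).
example = [
    "rs#\talleles\tchrom\tpos\tstrand\tassembly#\tcenter\tprotLSID\tassayLSID\tpanelLSID\tQCcode\tline1\tline2\n",
    "snp1\tA/G\t1\t1500\t+\tNA\tNA\tNA\tNA\tNA\tNA\tAA\tAG\n",
]
samples, markers = parse_hapmap(example)
```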
Plink
Phenotypic Data Format
Filter Datasets
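Filtering usually means dropping markers with too much missing data or a very rare minor allele. The sketch below shows that logic for one marker. Assumptions: genotypes are two-letter strings such as "AG" with "NN" for missing, markers are biallelic, and the default thresholds (minor allele frequency 0.05, call rate 0.8) are illustrative values, not recommendations from the slides.

```python
def passes_filters(genotypes, min_maf=0.05, min_call_rate=0.8, missing="NN"):
    """Return True if a biallelic marker meets the call-rate and minor-allele-frequency thresholds."""
    called = [g for g in genotypes if g != missing]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    # Count every allele across all called genotypes.
    counts = {}
    for genotype in called:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    if len(counts) < 2:
        maf = 0.0  # monomorphic marker
    else:
        maf = min(counts.values()) / sum(counts.values())
    return call_rate >= min_call_rate and maf >= min_maf


# Hypothetical marker scored on five lines, one of them missing.
keep = passes_filters(["AA", "AA", "AG", "GG", "NN"])
```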
GLM
Join genotypic and phenotypic data: intersect join
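An intersect join keeps only the taxa that appear in both tables, so lines that were genotyped but not phenotyped (or the reverse) drop out before the model is fitted. A minimal sketch with dictionaries keyed by taxon name follows. The names and values are hypothetical.

```python
def intersect_join(genotypes, phenotypes):
    """Keep only taxa present in both tables; returns {taxon: (genotype, phenotype)}."""
    shared = sorted(set(genotypes) & set(phenotypes))
    return {taxon: (genotypes[taxon], phenotypes[taxon]) for taxon in shared}


genotype_table = {"line1": "AA", "line2": "AG", "line3": "GG"}   # hypothetical genotypes
phenotype_table = {"line2": 4.2, "line3": 5.1, "line4": 3.3}     # hypothetical trait values
joined = intersect_join(genotype_table, phenotype_table)
```

It is worth checking how many taxa survive the join: a large drop usually points to inconsistent naming between the two files.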
GLM
GLM: QQ plots
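A QQ plot compares the observed -log10(p) values with those expected if no marker were associated with the trait, in which case p-values are uniform on (0, 1). The sketch below computes the plotted points only. The plotting itself is left out, and the (i - 0.5) / n plotting positions are one common convention, used here as an assumption.

```python
from math import log10


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, sorted from smallest to largest."""
    n = len(pvalues)
    observed = sorted(-log10(p) for p in pvalues)  # p-values must be > 0
    # Expected quantiles of a uniform distribution at positions (i - 0.5) / n.
    expected = sorted(-log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


points = qq_points([0.1, 0.01])  # hypothetical p-values
```

Points that follow the diagonal indicate a well-calibrated test. An early, genome-wide departure above the diagonal suggests uncorrected population structure or relatedness, not a genuine excess of associations.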
GLM: QQ/Manhattan plots
MLM: Kinship
MLM: Run
MLM: QQ/Manhattan plots
Compare GLM/MLM: QQ plots
[Figure panels: GLM, MLM]
OWN DATA
Thanks!
Acknowledgement: slides adapted from Nair (SNP calling), Odeny (GWAS) and Babu (QTL Mapping)
MLM: PCA