Why can GBS be complicated? Tools for filtering, error correction and imputation.

Similar documents
Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University

Applying Genotyping by Sequencing (GBS) to Corn Genetics and Breeding. Peter Bradbury USDA/Cornell University

GBS Usage Cases: Non-model Organisms. Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University

GBS Usage Cases: Examples from Maize

Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were

Usage Cases of GBS. Jeff Glaubitz Senior Research Associate, Buckler Lab, Cornell University Panzea Project Manager

Crop design with genomics and natural diversity. Edward Buckler USDA-ARS Cornell University

SNP calling and VCF format

Whole Genome Sequencing. Biostatistics 666

SUPPLEMENTARY INFORMATION

SNP calling and Genome Wide Association Study (GWAS) Trushar Shah

The Diploid Genome Sequence of an Individual Human

Mapping and Mapping Populations

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Genome-Wide Association Studies (GWAS): Computational Them

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

H3A - Genome-Wide Association testing SOP

Genomic resources. for non-model systems

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

Construction of the third-generation Zea mays haplotype map

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

DNBseq TM SERVICE OVERVIEW Plant and Animal Whole Genome Re-Sequencing

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es

Genome-wide association studies (GWAS) Part 1

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Prostate Cancer Genetics: Today and tomorrow

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Next Generation Genetics: Using deep sequencing to connect phenotype to genotype

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Oral Cleft Targeted Sequencing Project

Improving genome assemblies and capturing genome variation data for applied crop improvement.

Bulked Segregant Analysis For Fine Mapping Of Genes. Cheng Zou, Qi Sun Bioinformatics Facility Cornell University

Introduc)on to GBS. Hueber Yann, Alexis Dereeper, Gau)er Sarah, François Sabot, Vincent Ranwez, Jean- François Dufayard 02/11/2015

Computational Workflows for Genome-Wide Association Study: I

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Analysis of genome-wide genotype data

Identifying the functional bases of trait variation in Brassica napus using Associative Transcriptomics

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Domestic animal genomes meet animal breeding translational biology in the ag-biotech sector. Jerry Taylor University of Missouri-Columbia

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Human Genetics and Gene Mapping of Complex Traits

Understanding genetic association studies. Peter Kamerman

Targeted Sequencing Reveals Large-Scale Sequence Polymorphism in Maize Candidate Genes for Biomass Production and Composition

Estimating the rates and modes of creation of new genetic variation in plants using NGS technologies

Genotypic data densification for genomic selection in the black poplar. Thesis director : Léopoldo SANCHEZ, Supervisor : Véronique JORGE

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Comparison and Evaluation of Cotton SNPs Developed by Transcriptome, Genome Reduction on Restriction Site Conservation and RAD-based Sequencing

Harnessing the power of RADseq for ecological and evolutionary genomics

The Basics of Understanding Whole Genome Next Generation Sequence Data

POLYMORPHISM AND VARIANT ANALYSIS. Matt Hudson Crop Sciences NCSA HPCBio IGB University of Illinois

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

Roadmap: genotyping studies in the post-1kgp era. Alex Helm Product Manager Genotyping Applications

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Lecture 10 : Whole genome sequencing and analysis. Introduction to Computational Biology Teresa Przytycka, PhD

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Structure, Measurement & Analysis of Genetic Variation

Variant Callers. J Fass 24 August 2017

SUPPLEMENTARY INFORMATION

latestdevelopments relevant for the Ag sector André Eggen Agriculture Segment Manager, Europe

Add 2016 GBS Poster As Slide One

De novo whole genome assembly

Accelerate High Throughput Analysis for Genome Sequencing with GPU

Axiom mydesign Custom Array design guide for human genotyping applications

Application of Genotyping-By-Sequencing and Genome-Wide Association Analysis in Tetraploid Potato

Next-Generation Sequencing. Technologies

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Maize Inbreds Exhibit High Levels of Copy Number Variation (CNV) and Presence/Absence Variation (PAV) in Genome Content

Progress in genomics applications in investigating abiotic stresses influencing perennial forage and biomass grasses

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Using the Potato Genome Sequence! Robin Buell! Michigan State University! Department of Plant Biology! August 15, 2010!

Population Genetics Sequence Diversity Molecular Evolution. Physiology Quantitative Traits Human Diseases

Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ)

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

Bioinformatic Analysis of SNP Data for Genetic Association Studies EPI573

LD Mapping and the Coalescent

Linkage Disequilibrium

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Proceedings of the World Congress on Genetics Applied to Livestock Production,

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience

Supplementary Methods Illumina Genome-Wide Genotyping Single SNP and Microsatellite Genotyping. Supplementary Table 4a Supplementary Table 4b

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

Using the Association Workflow in Partek Genomics Suite

De novo whole genome assembly

BIOINFORMATICS ORIGINAL PAPER

Association mapping of Sclerotinia stalk rot resistance in domesticated sunflower plant introductions

Statistical Methods, Analyses and Applications for Next-Generation Sequencing Studies

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

What is genetic variation?

Why do we need statistics to study genetics and evolution?

Population Genetics II. Bio

GENOTYPING-BY-SEQUENCING USING CUSTOM ION AMPLISEQ TECHNOLOGY AS A TOOL FOR GENOMIC SELECTION IN ATLANTIC SALMON

HISTORICAL LINGUISTICS AND MOLECULAR ANTHROPOLOGY

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2015

Transcription:

Why can GBS be complicated? Tools for filtering, error correction and imputation. Edward Buckler USDA-ARS Cornell University http://www.maizegenetics.net

Many Organisms Are Diverse Humans are at the lower end of diversity, which results in most genomic and bioinformatics being optimized to this situation What is diverse: Grapes, Flies, Arabidopsis, and most dominant species on the planet

Maize has more molecular diversity than humans and apes combined 1.34% 0.09% 1.42% Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)

Maize genetic variation has been evolving for 5 million years Warm Pliocene 5mya 4mya Modern Variation Begins Evolving Sister Genus Diverges Divergence from Chimps Ardipithecus 3mya Australopithecus Cold Pleistocene 2mya 1mya Zea species begin diverging Maize domesticated Homo erectus Modern Variation Begins Modern Humans

Features of TASSEL GBS Bioinformatic Pipeline Designed for large numbers of samples Modest computing requirements Species or Genus level SNP calling & filtering Filtering to deal with extensive paralogy Glaubitz et al 2014 PLOSone. Available in TASSEL at maizegenetics.net

Reference based GBS bioinformatics pipeline Discovery Production FASTQ FASTQ Tags by Taxa Tag Counts Tags on Physical Map Tags on Physical Map SNP Caller Genotypes Filtered Genotypes Genotypes

What are our expectations with GBS?

High Diversity Ensures High Return on Sequencing Proportion of informative markers Highly repetitive 15% not easily informative Half the genome is not shared between two maize line Potentially all of these are informative with a large enough database Low copy shared proportion (1% diversity) Bi-parental information = (1-0.01)^64bp = 48% informative Association information = (1-0.05)^64bp= 97% informative

Expectation of marker distribution Biallelic, 17% Presense /Absense, 50% Too Repetitiv e, 15% Presense /Absense, 50% Multialleli c, 34% Nonpolymor phic; 18% Too Repetitiv e, 15% Biparental population Nonpolymorp hic; 1% Across the species

Sequencing Error

Illumina Basic Error Rate is ~1% Error rates are associated with distance from start of sequence Bad GBS puts these all at the same position Good Reverse reads can correct Good Error are consistent and modelable

Reads with errors Perfect sequences: 0.99 64 =52.5% of the 64bp sequences are perfect 47.5 are NOT perfect The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.

Do we see these errors? Assume 10,000 lines genotyped at 0.5X coverage Base Type Read # (no SNP) Read # (w/ SNP) A Major 4950 4900 C Minor 17 67 (50 real) G Error 17 17 T Error 17 17

Do Errors Matter? Yes Imputation, Haplotype reconstruction Maybe GWAS for low frequency SNPs No GS, genetic distance, mapping on biparental populations

Expectations of Real SNPs Vast majority are biallelic Homozygosity is predicted by inbreeding coefficient Allele frequency is constrained in structured populations In linkage disequilibrium with neighboring SNPs

Studying Errors in Biparental populations Limited range of alleles, expected allele frequencies, high LD

Maize RIL population expectations Allele frequency 0% or 50% Nearby sites should be in very high LD (r 2 >50%) Most sites can be tested if multiple populations are available

Bi-parental populations allow identification of error, and non-mendelian segregation Non-segregating Error Segregating

Bi-parental populations allow identification of error, and non-mendelian segregation Error

Median error rate is 0, but there is a long tail of some high error sites Median

GBS error rates vs. Maize 50K SNP Chip 7,254 SNPs in common 279 maize inbreds in common ( Maize282 panel) Comparison to 50K SNPs Filtered GBS genotypes Mean Error Rate (per SNP) Median Error Rate (per SNP) All genotypic comparisons: 1.18% 0.93% Homozygotes only: 0.58% 0.42% Internal GBS recall error rate with 1X coverage is ~0.2%, half of 50K chip errors are paralogy issues between the systems

MergeDuplicateSNPsPlugin When restriction sites are less than 128bp apart, we may read SNP from both directions (strands) ~13% of all sites Fusing increases coverage Fixes errors -mismat = set maximum mismatch rate -callhets = mismatch set to hets or not

Use the biology of your system to filter SNP Hardy-Weinberg Disequilibrium For inbreeding crops use the expected inbreeding coefficient to filter SNPs How to deal with the low coverage Although many individuals only have 1X coverage (27% of the samples will have 2X or more). Use these individuals to evaluate the quality of segregation.

Product of Filtering After filters, in maize we find 0.0018 error rate AA<>aa = < 0.0018 AA<>Aa = 0.8 at low coverage SNPs in wrong location <~1%. Lower in other species.

Only a limited proportion of the best alignments are shared between aligners Bowtie2 31.2% 17.4% 51.5% BWA BWA 9.5% 12.8% Blast 12.4% 1.4% 1.2% 0.8% 10.9% 2.5% 0.6% 0.8% Bowtie2 16.9% 13.2% 45.4% 4.7% 3.0% 14.4% Bowtie2 BWA 1.3% 32.9% 1.6% 3.3% 3.9% 6.5% BWA-MEM Which alignment is real BWA-MEM

Genetic Mapping of All Unique Tags Test for association between between taxa with tags and an anchor map Apply in both an association and linkage context Fei Lu Trillions of tests, so computational speed is key Fei Lu in review

What do YOU want to filter on? MAF Mapping accuracy Support from Paired-End Support from multiple aligners Heterozygosity Coverage Agreement with WGS Allele frequency within certain populations LD with other SNPs Stats within a particular population

Finding the Good SNPs Discovery TOPM SNP & Features Train Filter TOPM Production TOPM

Unstable Genomes

Using the Presence/Absence Variants In species like maize, this is the majority of the data Less subject to sequencing error Need imputation methods to differentiate between missing from sampling and biologically missing

Only 50% of the maize genome is shared between two varieties Plant 1 Person 1 50% 99% Plant 2 Plant 3 Person 2 Person 3 Maize Humans Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005 Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010

Developed by Fei Lu Identify structural variations GWAS and Joint linkage mapping GWAS Tags Joint Linkage If Map Y If Align N PAVs Insertion Read depth (B73 vs Non-B73) PAV (Deletion) B73 CNV (Duplication) Non B73 Chromosome Bin 1 Bin 2 Bin 3

Model training for mapping accuracy Dependent variable, distance (abs(genetic position physical position)) Attributes: P-value, likelihood ratio, tag count, etc. Machine learning models: decision tree, SVM, M5Rules, etc. 26M tags GWAS 8.6 M 6.4 M G GJ Y Y 4.5 M mapped tags Joint linkage 0.5 M J Y Model training and prediction using M5Rules

High-resolution mapped tags 95% 50% 4.5 million mapped tags in all maize lines 100 bp 10 Kb 1 Mb 101 Kb Gb 10 Gb Median resolution Framework of maize pan genome

Paired-End Sequencing Most of the tags are less than 500bp in size. Full contigs can be developed by sequencing them on a MiSeq with paired-end 250bp reads Strategies physically bridge from the mapped end, genetically anchor one end.

Phylogenetic (Cross Species) Analysis When divergence <10% between taxa, GBS makes sense as 40% of the cut site pairs are preserved. If focused a genic regions, can go higher. Currently pipeline tools not great for very high levels of divergence. Standard nextgen or Paired-End approaches might be a good strategy.

Future Clean output and input into machine learning algorithms Developing a community to explore and develop filters

Become a power user: Play with the code to add your features Documentation at maizegenetics.net Source code at SourceForge Guide to programming TASSEL: http://bit.ly/kzli8l Google Group https://groups.google.com/forum/#!forum/ta ssel