Why can GBS be complicated? Tools for filtering, error correction and imputation.

Why can GBS be complicated? Tools for filtering, error correction and imputation. Edward Buckler USDA-ARS Cornell University http://www.maizegenetics.net

Many Organisms Are Diverse Humans are at the lower end of diversity, which results in most genomic and bioinformatics being optimized to this situation What is diverse: Grapes, Flies, Arabidopsis, and most dominant species on the planet

Maize has more molecular diversity than humans and apes combined 1.34% 0.09% 1.42% Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)

Maize genetic variation has been evolving for 5 million years Warm Pliocene 5mya 4mya Modern Variation Begins Evolving Sister Genus Diverges Divergence from Chimps Ardipithecus 3mya Australopithecus Cold Pleistocene 2mya 1mya Zea species begin diverging Maize domesticated Homo erectus Modern Variation Begins Modern Humans

Features of TASSEL GBS Bioinformatic Pipeline Designed for large numbers of samples Modest computing requirements Species or Genus level SNP calling & filtering Filtering to deal with extensive paralogy Glaubitz et al 2014 PLOSone. Available in TASSEL at maizegenetics.net

Reference based GBS bioinformatics pipeline Discovery Production FASTQ FASTQ Tags by Taxa Tag Counts Tags on Physical Map Tags on Physical Map SNP Caller Genotypes Filtered Genotypes Genotypes

What are our expectations with GBS?

High Diversity Ensures High Return on Sequencing Proportion of informative markers Highly repetitive 15% not easily informative Half the genome is not shared between two maize line Potentially all of these are informative with a large enough database Low copy shared proportion (1% diversity) Bi-parental information = (1-0.01)^64bp = 48% informative Association information = (1-0.05)^64bp= 97% informative

Expectation of marker distribution Biallelic, 17% Presense /Absense, 50% Too Repetitiv e, 15% Presense /Absense, 50% Multialleli c, 34% Nonpolymor phic; 18% Too Repetitiv e, 15% Biparental population Nonpolymorp hic; 1% Across the species

Sequencing Error

Illumina Basic Error Rate is ~1% Error rates are associated with distance from start of sequence Bad GBS puts these all at the same position Good Reverse reads can correct Good Error are consistent and modelable

Reads with errors Perfect sequences: 0.99 64 =52.5% of the 64bp sequences are perfect 47.5 are NOT perfect The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.

Do we see these errors? Assume 10,000 lines genotyped at 0.5X coverage Base Type Read # (no SNP) Read # (w/ SNP) A Major 4950 4900 C Minor 17 67 (50 real) G Error 17 17 T Error 17 17

Do Errors Matter? Yes Imputation, Haplotype reconstruction Maybe GWAS for low frequency SNPs No GS, genetic distance, mapping on biparental populations

Expectations of Real SNPs Vast majority are biallelic Homozygosity is predicted by inbreeding coefficient Allele frequency is constrained in structured populations In linkage disequilibrium with neighboring SNPs

Studying Errors in Biparental populations Limited range of alleles, expected allele frequencies, high LD

Maize RIL population expectations Allele frequency 0% or 50% Nearby sites should be in very high LD (r 2 >50%) Most sites can be tested if multiple populations are available

Bi-parental populations allow identification of error, and non-mendelian segregation Non-segregating Error Segregating

Bi-parental populations allow identification of error, and non-mendelian segregation Error

Median error rate is 0, but there is a long tail of some high error sites Median

GBS error rates vs. Maize 50K SNP Chip 7,254 SNPs in common 279 maize inbreds in common ( Maize282 panel) Comparison to 50K SNPs Filtered GBS genotypes Mean Error Rate (per SNP) Median Error Rate (per SNP) All genotypic comparisons: 1.18% 0.93% Homozygotes only: 0.58% 0.42% Internal GBS recall error rate with 1X coverage is ~0.2%, half of 50K chip errors are paralogy issues between the systems

MergeDuplicateSNPsPlugin When restriction sites are less than 128bp apart, we may read SNP from both directions (strands) ~13% of all sites Fusing increases coverage Fixes errors -mismat = set maximum mismatch rate -callhets = mismatch set to hets or not

Use the biology of your system to filter SNP Hardy-Weinberg Disequilibrium For inbreeding crops use the expected inbreeding coefficient to filter SNPs How to deal with the low coverage Although many individuals only have 1X coverage (27% of the samples will have 2X or more). Use these individuals to evaluate the quality of segregation.

Product of Filtering After filters, in maize we find 0.0018 error rate AA<>aa = < 0.0018 AA<>Aa = 0.8 at low coverage SNPs in wrong location <~1%. Lower in other species.

Only a limited proportion of the best alignments are shared between aligners Bowtie2 31.2% 17.4% 51.5% BWA BWA 9.5% 12.8% Blast 12.4% 1.4% 1.2% 0.8% 10.9% 2.5% 0.6% 0.8% Bowtie2 16.9% 13.2% 45.4% 4.7% 3.0% 14.4% Bowtie2 BWA 1.3% 32.9% 1.6% 3.3% 3.9% 6.5% BWA-MEM Which alignment is real BWA-MEM

Genetic Mapping of All Unique Tags Test for association between between taxa with tags and an anchor map Apply in both an association and linkage context Fei Lu Trillions of tests, so computational speed is key Fei Lu in review

What do YOU want to filter on? MAF Mapping accuracy Support from Paired-End Support from multiple aligners Heterozygosity Coverage Agreement with WGS Allele frequency within certain populations LD with other SNPs Stats within a particular population

Finding the Good SNPs Discovery TOPM SNP & Features Train Filter TOPM Production TOPM

Unstable Genomes

Using the Presence/Absence Variants In species like maize, this is the majority of the data Less subject to sequencing error Need imputation methods to differentiate between missing from sampling and biologically missing

Only 50% of the maize genome is shared between two varieties Plant 1 Person 1 50% 99% Plant 2 Plant 3 Person 2 Person 3 Maize Humans Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005 Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010

Developed by Fei Lu Identify structural variations GWAS and Joint linkage mapping GWAS Tags Joint Linkage If Map Y If Align N PAVs Insertion Read depth (B73 vs Non-B73) PAV (Deletion) B73 CNV (Duplication) Non B73 Chromosome Bin 1 Bin 2 Bin 3

Model training for mapping accuracy Dependent variable, distance (abs(genetic position physical position)) Attributes: P-value, likelihood ratio, tag count, etc. Machine learning models: decision tree, SVM, M5Rules, etc. 26M tags GWAS 8.6 M 6.4 M G GJ Y Y 4.5 M mapped tags Joint linkage 0.5 M J Y Model training and prediction using M5Rules

High-resolution mapped tags 95% 50% 4.5 million mapped tags in all maize lines 100 bp 10 Kb 1 Mb 101 Kb Gb 10 Gb Median resolution Framework of maize pan genome

Paired-End Sequencing Most of the tags are less than 500bp in size. Full contigs can be developed by sequencing them on a MiSeq with paired-end 250bp reads Strategies physically bridge from the mapped end, genetically anchor one end.

Phylogenetic (Cross Species) Analysis When divergence <10% between taxa, GBS makes sense as 40% of the cut site pairs are preserved. If focused a genic regions, can go higher. Currently pipeline tools not great for very high levels of divergence. Standard nextgen or Paired-End approaches might be a good strategy.

Future Clean output and input into machine learning algorithms Developing a community to explore and develop filters

Become a power user: Play with the code to add your features Documentation at maizegenetics.net Source code at SourceForge Guide to programming TASSEL: http://bit.ly/kzli8l Google Group https://groups.google.com/forum/#!forum/ta ssel