Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University

http://www.maizegenetics.net

Maize has more molecular diversity than humans and apes combined
[Figure: silent diversity; plotted values 1.34%, 0.09%, 1.42%]
(Zhao PNAS 2000; Tenaillon et al., PNAS 2001)

What are our expectations with GBS?

High Diversity Ensures High Return on Sequencing
Proportion of informative markers:
- Highly repetitive: 15%, not easily informative
- Half the genome is not shared between two maize lines. Potentially, all of the presence/absence tags are informative with a large enough database.
- Low-copy shared proportion (1% diversity), where a tag is informative if it carries at least one difference:
  - Bi-parental information = 1 - (1 - 0.01)^64 = about 47% informative
  - Association information = 1 - (1 - 0.05)^64 = about 96% informative
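The informative-marker figures on this slide come from the chance that a 64-base tag contains at least one difference. A minimal sketch of that arithmetic, assuming every base differs independently at the stated per-base diversity (the independence assumption and the function name are illustrative, not from the slide):

```python
def informative_fraction(per_base_diversity: float, tag_length: int = 64) -> float:
    """Probability that a tag carries at least one polymorphism.

    Assumes sites are independent, which is a simplification.
    """
    return 1.0 - (1.0 - per_base_diversity) ** tag_length


# Per-base diversity values taken from the slide.
biparental = informative_fraction(0.01)   # two parents, ~1% diversity
association = informative_fraction(0.05)  # diverse panel, ~5% diversity

print(f"bi-parental: {biparental:.1%}, association: {association:.1%}")
```

Under these assumptions the results land close to the slide's rounded percentages.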

Expectation of marker distribution (64-base tags)
Biparental population: Presence/Absence 50%; Nonpolymorphic 18%; Biallelic 17%; Too repetitive 15%
Across the species: Presence/Absence 50%; Multiallelic 34%; Too repetitive 15%; Nonpolymorphic 1%

Sequencing Error

Illumina basic error rate is ~1%
Error rates are associated with distance from the start of the sequence
- Bad: GBS puts these all at the same position
- Good: Reverse reads can correct
- Good: Errors are consistent and modelable

Reads with errors
Perfect sequences:
- 0.99^64 = 52.5% of the 64 bp sequences are perfect
- 47.5% are NOT perfect
The errors are autocorrelated, so the proportion of perfect sequences is a little higher, and the proportion with 2 or more errors is also higher.
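As a baseline for the perfect-read figure, the number of errors per read can be sketched with a binomial model. This assumes independent errors at a flat 1% per base, which the slide itself says is not quite true (errors are autocorrelated), so treat the numbers as a reference point only. The function name is illustrative.

```python
from math import comb


def error_count_probability(k: int, read_length: int = 64, error_rate: float = 0.01) -> float:
    """Binomial probability of exactly k errors in one read (independent errors assumed)."""
    return comb(read_length, k) * error_rate**k * (1 - error_rate) ** (read_length - k)


perfect = error_count_probability(0)       # reads with no errors
one_error = error_count_probability(1)     # reads with exactly one error
two_or_more = 1.0 - perfect - one_error    # reads with two or more errors

print(f"perfect: {perfect:.1%}, one error: {one_error:.1%}, two or more: {two_or_more:.1%}")
```

With autocorrelated errors, both the perfect fraction and the two-or-more fraction would sit above these independent-model values, at the expense of the single-error class.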

Do we see these errors?
Assume 10,000 lines genotyped at 0.5X coverage

Base  Type   Read # (no SNP)  Read # (w/ SNP)
A     Major  4950             4900
C     Minor  17               67 (50 real)
G     Error  17               17
T     Error  17               17
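The read counts in this example can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes errors are spread evenly over the three wrong bases and that the real minor allele accounts for 1% of reads (inferred from the "50 real" entry, not stated directly). The function and parameter names are illustrative.

```python
def expected_read_counts(n_lines: int, coverage: float, error_rate: float,
                         minor_allele_freq: float = 0.0) -> dict:
    """Expected reads per base at one site; A is the major allele, C the minor allele."""
    total_reads = n_lines * coverage
    per_wrong_base = total_reads * error_rate / 3   # errors split evenly over 3 bases
    real_minor = total_reads * minor_allele_freq    # genuine minor-allele reads
    return {
        "A": total_reads - real_minor - 3 * per_wrong_base,
        "C": real_minor + per_wrong_base,
        "G": per_wrong_base,
        "T": per_wrong_base,
    }


no_snp = expected_read_counts(10_000, 0.5, 0.01)
with_snp = expected_read_counts(10_000, 0.5, 0.01, minor_allele_freq=0.01)

for label, counts in (("no SNP", no_snp), ("with SNP", with_snp)):
    print(label, {base: round(n) for base, n in counts.items()})
```

The point of the comparison is that a real low-frequency allele (C) stands out only modestly above the error background seen at G and T.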

Do Errors Matter?
- Yes: Imputation, haplotype reconstruction
- Maybe: GWAS for low-frequency SNPs
- No: GS, genetic distance, mapping on biparental populations

Expectations of Real SNPs
- Vast majority are biallelic
- Homozygosity is predicted by inbreeding coefficient
- Allele frequency is constrained in structured populations
- In linkage disequilibrium with neighboring SNPs

Studying Errors in Biparental Populations
Limited range of alleles, expected allele frequencies, high LD

Maize RIL population expectations
- Allele frequency 0% or 50%
- Nearby sites should be in very high LD (r^2 > 50%)
- Most sites can be tested if multiple populations are available
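The LD expectation can be checked per pair of sites. A minimal sketch, assuming RIL genotypes are coded 0 and 1 for the two parental alleles with None for missing calls (this coding and the function name are assumptions for illustration, not the TASSEL data structure):

```python
def r_squared(site_a, site_b):
    """Squared correlation between two sites, skipping lines with a missing call."""
    pairs = [(a, b) for a, b in zip(site_a, site_b) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    if var_a == 0 or var_b == 0:
        return 0.0  # a non-segregating site carries no LD information
    return cov * cov / (var_a * var_b)


# Toy genotypes for eight RILs at two neighboring sites (made-up values).
site_a = [0, 0, 1, 1, 0, 1, None, 1]
site_b = [0, 0, 1, 1, 0, 1, 1, 0]
r2 = r_squared(site_a, site_b)
print(f"r^2 = {r2:.4f}")
```

A site whose r^2 with its neighbors falls well below the expectation for the population is a candidate for error or misplacement.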

Bi-parental populations allow identification of error and non-Mendelian segregation
[Figure: sites classified as non-segregating, error, or segregating]


Median error rate is 0, but there is a long tail of some high-error sites
[Figure: distribution of per-site error rates, with the median marked]

Clean Up and Imputation (TASSEL3)
Input: HapMap
- MergeDuplicateSNPsPlugin: merge reads from opposite sides
- GBSHapMapFiltersPlugin: site coverage, taxa coverage, inbreeding coefficient, LD
- BiParentalErrorCorrectionPlugin: error rate estimation, LD filters
- MergeIdenticalTaxaPlugin: error rate estimation, LD filters
Output: HapMap
Downstream: Imputation & Phasing, then Kinship, Distance, Phylogeny, LD, GS, Imputation, GWAS
Legend: Process; File (data structure)

Use the biology of your system to filter SNPs
Hardy-Weinberg disequilibrium
- For inbreeding crops, use the expected inbreeding coefficient to filter SNPs
How to deal with the low coverage
- Many individuals have only 1X coverage, but 27% of the samples will have 2X or more. Use these individuals to evaluate the quality of segregation.
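Both ideas on this slide reduce to short calculations. The sketch below estimates a per-site inbreeding coefficient as F = 1 - H_obs / H_exp from genotype counts, and the share of samples with two or more reads under Poisson sampling. The Poisson model, the function names, and the example counts are assumptions for illustration, not the logic of GBSHapMapFiltersPlugin.

```python
from math import exp


def inbreeding_coefficient(n_AA: int, n_Aa: int, n_aa: int):
    """F = 1 - observed heterozygosity / expected heterozygosity (2pq)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    expected_het = 2 * p * (1 - p)
    if expected_het == 0:
        return None  # monomorphic site: F is undefined
    return 1 - (n_Aa / n) / expected_het


def fraction_with_two_or_more_reads(mean_coverage: float) -> float:
    """P(reads >= 2) when reads per sample follow a Poisson distribution."""
    return 1 - exp(-mean_coverage) * (1 + mean_coverage)


# Made-up genotype counts for one site in an inbred panel.
f_site = inbreeding_coefficient(n_AA=45, n_Aa=2, n_aa=53)
deep_enough = fraction_with_two_or_more_reads(1.0)

print(f"F = {f_site:.3f}, samples with >= 2 reads at 1X: {deep_enough:.1%}")
```

A site whose F falls far below the value expected for the crop is suspect, for example paralogs collapsing into one locus and appearing as excess heterozygotes. Because single reads cannot reveal heterozygotes, counting genotypes only from samples with two or more reads gives a fairer estimate.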

Product of Filtering
After filters, in maize we find:
- Homozygous error rate (AA <-> aa): < 0.0018
- Aa -> AA = 0.8 at low coverage
- SNPs in wrong location: < ~1%
Lower in other species.

GBS error rates vs. Maize 50K SNP Chip
7,254 SNPs in common; 279 maize inbreds in common (Maize282 panel)

Comparison to 50K SNPs        Mean Error Rate  Median Error Rate
(filtered GBS genotypes)      (per SNP)        (per SNP)
All genotypic comparisons     1.18%            0.93%
Homozygotes only              0.58%            0.42%

Internal GBS recall error rate with 1X coverage is ~0.2%; half of the 50K chip errors are paralogy issues between the systems.

Only a limited proportion of the best alignments are shared between aligners
[Figure: Venn diagrams of the overlap in best alignments among Bowtie2, BWA, BWA-MEM, and Blast]
Which alignment is real?

Genetic Mapping of All Unique Tags
- Test for association between taxa with tags and an anchor map
- Apply in both an association and a linkage context
- Trillions of tests, so computational speed is key
Fei Lu, in review

Finding the Good SNPs
[Figure: workflow with the stages Discovery TOPM; SNP & Features; Train; Filter TOPM; Production TOPM]

What do YOU want to filter on?
- MAF
- Mapping accuracy
- Support from paired-end reads
- Support from multiple aligners
- Heterozygosity
- Coverage
- Agreement with WGS
- Allele frequency within certain populations
- LD with other SNPs
- Stats within a particular population

Unstable Genomes

Using the Presence/Absence Variants
- In species like maize, this is the majority of the data
- Less subject to sequencing error
- Need imputation methods to differentiate between missing from sampling and biologically missing

Only 50% of the maize genome is shared between two varieties
[Figure: genome sharing among three individuals; maize (Plant 1, Plant 2, Plant 3) 50%, humans (Person 1, Person 2, Person 3) 99%]
Fu & Dooner 2002; Morgante et al. 2005; Brunner et al. 2005
Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010

Most tags can be mapped as individual alleles
- In a biparental cross such as maize IBM (B73 x Mo17)
- Provided that they are polymorphic between the parents
[Figure labels: ApeKI site (GCWGC); 64-base sequence tag; B73; Mo17; loss of cut site; < 450 bp]

Genetically mapping individual GBS alleles
- SNPs (e.g., from Illumina 50K chip) scored on RILs (e.g., from IBM); genotype classes: B73, Mo17, heterozygote
- 64-base sequence tag (GBS coverage ~0.4x): does the GBS tag map here?
- Binomial test for linkage:
  - prob. of success: segregation ratio of the SNP being tested (~0.5)
  - n trials: number of RILs with the GBS tag (10)
  - n successes: number of co-occurrences with the presumed parental allele at the SNP being tested (co-segregation)
  - p-value: 0.00098 (< 10^-3)
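The p-value on this slide is consistent with a one-sided binomial test in which all 10 RILs carrying the tag also carry the presumed parental allele. A minimal sketch under that reading (the count of 10 successes is inferred from the p-value, and the function name is illustrative):

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p_success: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p_success)."""
    return sum(
        comb(trials, k) * p_success**k * (1 - p_success) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 10 RILs carry the tag; assume all 10 co-occur with the presumed parental allele.
p_value = binomial_upper_tail(successes=10, trials=10, p_success=0.5)
print(f"p-value = {p_value:.5f}")
```

Because the test needs only counts, it is cheap enough to repeat for every tag against every anchor SNP, which matters when the number of tests is very large.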

Identifying structural variation (developed by Fei Lu)
- Tags are mapped by GWAS and joint linkage mapping
- Tags that map (Y) but do not align (N): PAVs (insertion)
- Read depth (B73 vs. non-B73) across chromosome bins (Bin 1, Bin 2, Bin 3): PAV (deletion), CNV (duplication)

Machine learning model for determining which tags are accurately mapped genetically
- B73 tags used as a positive control
- Dependent variable: distance, abs(genetic position - physical position)
- Attributes: P-value, likelihood ratio, tag count, etc.
- Machine learning models: decision tree, SVM, M5Rules, etc.
- Model training and prediction using M5Rules
[Figure: 26 M tags mapped by GWAS (G), joint linkage (J), or both (GJ); panel counts 8.6 M, 6.4 M, and 0.5 M; 4.5 M mapped tags]

High-resolution mapped tags
4.5 million mapped tags in all maize lines
[Figure: distribution of mapping resolution on a log scale from 100 bp to 10 Gb, with the median resolution and the 50% and 95% levels marked]
Framework of maize pan genome

Become a power user: play with the code to add your features
- Documentation at maizegenetics.net
- Source code at SourceForge
- Guide to programming TASSEL: http://bit.ly/kzli8l
- Google Group: https://groups.google.com/forum/#!forum/tassel