Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University

Size: px
Start display at page:

Download "Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University"

Transcription

1 Why can GBS be complicated? Tools for filtering & error correction Edward Buckler USDA-ARS Cornell University

2 Maize has more molecular diversity than humans and apes combined 1.34% 0.09% 1.42% Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)

3 What are our expectations with GBS?

4 High Diversity Ensures High Return on Sequencing Proportion of informative markers Highly repetitive 15% not easily informative Half the genome is not shared between two maize lines Potentially, all of the presence/absence tags are informative with a large enough database Low copy shared proportion (1% diversity) Bi-parental information = (1-0.01)^64bp = 48% informative Association information = (1-0.05)^64bp= 97% informative

5 Expectation of marker distribution (64 base tags) Biallelic, 17% Presense / Absense, 50% Too Repetitiv e, 15% Presense / Absense, 50% Multialleli c, 34% Nonpolymor phic; 18% Too Repetitiv e, 15% Biparental population Nonpolymorp hic; 1% Across the species

6 Sequencing Error

7 Illumina Basic Error Rate is ~1% Error rates are associated with distance from start of sequence Bad GBS puts these all at the same position Good Reverse reads can correct Good Error are consistent and modelable

8 Reads with errors Perfect sequences: = 52.5% of the 64bp sequences are perfect are NOT perfect The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.

9 Do we see these errors? Assume 10,000 lines genotyped at 0.5X coverage Base Type Read # (no SNP) Read # (w/ SNP) A Major C Minor (50 real) G Error T Error 17 17

10 Do Errors Matter? Yes Imputation, Haplotype reconstruction Maybe GWAS for low frequency SNPs No GS, genetic distance, mapping on biparental populations

11 Expectations of Real SNPs Vast majority are biallelic Homozygosity is predicted by inbreeding coefficient Allele frequency is constrained in structured populations In linkage disequilibrium with neighboring SNPs

12 Studying Errors in Biparental populations Limited range of alleles, expected allele frequencies, high LD

13 Maize RIL population expectations Allele frequency 0% or 50% Nearby sites should be in very high LD (r 2 >50%) Most sites can be tested if multiple populations are available

14 Bi-parental populations allow identification of error, and non-mendelian segregation Non-segregating Error Segregating

15 Bi-parental populations allow identification of error, and non-mendelian segregation Error

16 Median error rate is 0, but there is a long tail of some high error sites Median

17 Clean Up and Imputation HapMap MergeDuplicateSNPsPlugin Merge reads from opposite sides GBSHapMapFiltersPlugin Site Coverage, Taxa Coverage, Inbreeding Coefficient, LD TASSEL3 BiParentalErrorCorrectionPlugin Error rate estimation, LD filters MergeIdenticalTaxaPlugin Error rate estimation, LD filters HapMap Kinship Distance Phylogeny LD GS Imputation GWAS Imputation & Phasing Process File (data structure)

18 Use the biology of your system to filter SNPs Hardy-Weinberg Disequilibrium For inbreeding crops use the expected inbreeding coefficient to filter SNPs How to deal with the low coverage Although many individuals only have 1X coverage (27% of the samples will have 2X or more). Use these individuals to evaluate the quality of segregation.

19 Product of Filtering After filters, in maize we find homozygous error rate AA<>aa = < Aa AA = 0.8 at low coverage SNPs in wrong location <~1%. Lower in other species.

20 GBS error rates vs. Maize 50K SNP Chip 7,254 SNPs in common 279 maize inbreds in common ( Maize282 panel) Comparison to 50K SNPs Filtered GBS genotypes Mean Error Rate (per SNP) Median Error Rate (per SNP) All genotypic comparisons: 1.18% 0.93% Homozygotes only: 0.58% 0.42% Internal GBS recall error rate with 1X coverage is ~0.2%, half of 50K chip errors are paralogy issues between the systems

21 Only a limited proportion of the best alignments are shared between aligners Bowtie2 31.2% 51.5% 17.4% BWA BWA Blast 9.5% 12.8% 12.4% 1.4% 1.2% 0.8% 10.9% Bowtie2 2.5% 16.9% 13.2% 45.4% BWA Bowtie2 1.3% 0.6% 3.3% 3.9% 6.5% 0.8% 32.9% 1.6% BWA-MEM 4.7% 3.0% 14.4% Which alignment is real? BWA-MEM

22 Genetic Mapping of All Unique Tags Test for association between between taxa with tags and an anchor map Apply in both an association and linkage context Fei Lu Trillions of tests, so computational speed is key Fei Lu in review

23 Finding the Good SNPs Discovery TOPM SNP & Features Train Filter TOPM Production TOPM

24 What do YOU want to filter on? MAF Mapping accuracy Support from Paired-End Support from multiple aligners Heterozygosity Coverage Agreement with WGS Allele frequency within certain populations LD with other SNPs Stats within a particular population

25 Unstable Genomes

26 Using the Presence/Absence Variants In species like maize, this is the majority of the data Less subject to sequencing error Need imputation methods to differentiate between missing from sampling and biologically missing

27 Only 50% of the maize genome is shared between two varieties Plant 1 Person 1 50% 99% Plant 2 Plant 3 Person 2 Person 3 Maize Humans Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005 Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010

28 Most tags can be mapped as individual alleles In a biparental cross such as maize IBM (B73 x Mo17) Provided that they are polymorphic between the parents ApeKI site (GCWGC) ( ) 64- base sequence tag B73 Loss of cut site < 450 bp Mo17

29 Gene[cally mapping individual GBS alleles SNPs (e.g., from Illumina 50K chip) RILs (e.g., from IBM) B73 Mo17 Heterozygote

30 Gene[cally mapping individual GBS alleles SNPs (e.g., from Illumina 50K chip) RILs (e.g., from IBM) does GBS tag map here? B73 Mo17 Heterozygote

31 Gene[cally mapping individual GBS alleles SNPs (e.g., from Illumina 50K chip) ( ) 64- base sequence tag (GBS coverage ~0.4x) RILs (e.g., from IBM) does GBS tag map here? B73 Mo17 Heterozygote

32 Gene[cally mapping individual GBS alleles SNPs (e.g., from Illumina 50K chip) ( ) 64- base sequence tag (GBS coverage ~0.4x) RILs (e.g., from IBM) Binomial Test for linkage prob. success: segrega4on ra4o of the SNP being tested (~0.5) n trials: n RILs with GBS tag (10) n successes: n co- occurrences with presumed parental allele at SNP being tested (co- segrega4on) p- value: (<10-3 ) does GBS tag map here? B73 Mo17 Heterozygote

33 Gene[cally mapping individual GBS alleles SNPs (e.g., from Illumina 50K chip) ( ) 64- base sequence tag (GBS coverage ~0.4x) RILs (e.g., from IBM) Binomial Test for linkage prob. success: segrega4on ra4o of the SNP being tested (~0.5) n trials: n RILs with GBS tag (10) n successes: n co- occurrences with presumed parental allele at SNP being tested (co- segrega4on) p- value: (<10-3 ) These 10 SNPs all tie (p = 9.8 x 10-4 ) B73 Mo17 Heterozygote

34 Identifying structural variation Developed by Fei Lu GWAS and Joint linkage mapping GWAS Tags Joint Linkage If Map Y If Align N PAVs Inser4on Read depth (B73 vs Non-B73) PAV (Dele4on) CNV (Duplica4on) B73 Non- B73 Chromosome Bin 1 Bin 2 Bin 3

35 Machine learning model for determining which tags are accurately mapped genetically B73 tags used as a positive control Dependent variable, distance (abs(genetic position physical position)) Attributes: P-value, likelihood ratio, tag count, etc. Machine learning models: decision tree, SVM, M5Rules, etc. 26M tags GWAS 8.6 M 6.4 M G GJ Y Y 4.5 M mapped tags Joint linkage 0.5 M J Y Model training and predic[on using M5Rules

36 High-resolution mapped tags 95% 50% 4.5 million mapped tags in all maize lines 100 bp 10 Kb 1 Mb 101 Kb Gb 10 Gb Median resolu[on Framework of maize pan genome

37 Become a power user: Play with the code to add your features Documentation at maizegenetics.net Source code at SourceForge Guide to programming TASSEL: Google Group tassel