GBS Usage Cases: Non-model Organisms. Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University

Similar documents
Transcription:

GBS Usage Cases: Non-model Organisms Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University

Q: How many SNPs will I get? A: 42. What question do you really want to ask?

Q: How many SNPs do I need? A: 42. What question are you trying to answer?

The Best Advice I can Give You:

The Best Advice I can Give You: Does my Experimental Design make Sense? Do I have a Plan for Analysis? Do my Results make Biological Sense??

Why can GBS be complicated? Sequencing Error SNP calling Missing data Large data sets Population type Biological Reality

Is GBS more complicated for non-model organisms? I don t have a reference genome We don t know anything about population structure or diversity My reference genome is in a preassembly state, or the physical ordering is inaccurate My species is outcrossing

Roadmap No reference genome (UNEAK pipeline) SNP CALLING reference pipeline Examples and usage cases

The non-model problem: No Reference Genome

Universal Network Enabled Analysis Kit (UNEAK) A reference free SNP calling pipeline Fei Lu, Cornell University

Overview of UNEAK A Genome is digested, sequenced using GBS Reads are trimmed to 64 bp B Identical reads = tag slide courtesy Fei Lu, Cornell University

Overview of UNEAK Network filter C D E F count Pairwise alignment to find tag pairs with 1 bp mismatch Build tag networks Topology of tag networks Keep common reciprocal tags slide courtesy Fei Lu, Cornell University real tags error

Topology of tag networks Networks of 2496 tags Tag Error Plastid & Highly repetitive tags Moderately repetitive tags, Paralogs & SNPs slide courtesy Fei Lu, Cornell University

Details about network filter Error tolerance SNP slide courtesy Fei Lu, Cornell University

Proportion of SNPs 0 0.08 0.16 0.24 0.32 0.4 0.48 0.56 0.64 0.72 0.8 0.88 0.96 Proportion of SNPs 0 0.07 0.14 0.21 0.28 0.35 0.42 0.49 0.56 0.63 0.7 0.77 0.84 0.91 0.98 Pipeline validated with maize inbred linkage population network filter Evaluation criteria Step 1 Pairwise alignment of tags Step 2 Network filter Single-site rate (Blast to maize) 23.3% 85.0% Allele frequency distribution 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 0.05 0.045 0.04 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 Allele frequency Allele frequency slide courtesy Fei Lu, Cornell University

0 0.04 0.08 0.12 0.16 0.2 0.24 0.28 0.32 0.36 0.4 0.44 0.48 0.52 0.56 0.6 0.64 0.68 0.72 0.76 0.8 0.84 0.88 0.92 0.96 Proportion of SNPs Pipeline validated with maize inbred linkage population SNP validation LD distribution (SNPs against 1106 markers) 0.08 0.07 0.2 0.06 0.05 0.04 92.2% 0.03 0.02 0.01 0 LD (r 2 ) Alignment (B97 tags against B97 shotgun genome from Maize HapMap2 data) 93.2% SNPs are polymorphic slide courtesy Fei Lu, Cornell University

GBS UNEAK Pipeline Fork Discovery Sequence Tags by Taxa Tag Counts Network Filter TASSEL 3.0 1) UTagCountToTagPairPlugin 2) UExportTagPairPlugin SNP Caller Genotypes

No Reference Genome SNPs can be called with the UNEAK pipeline if there is no reference genome. Be careful with your sampling strategy Optimized for single species Allele frequency can matter! Decide whether your downstream analysis will be qualitatively influenced by biases in SNP calling Are you doing association mapping? Estimating population genetic parameters?

No Reference Genome Tag pairs are maximum 1 bp different / 64 bp sequenced Examples when sampling strategy matters: The same locus may be identified as two separate loci when there is population structure and overall divergence is > 1.5% With highly divergent sub-populations some loci that are monomorphic within a subpopulation but diverged > 1bp/tag between species will not be detected

No Reference Genome Allele frequency can matter for SNP calling in the UNEAK pipeline and is influenced by your population/ sampling strategy Undersampled diversity (and true rare alleles) can look like error, and are even harder to detect with the non-reference pipeline

Sequence based SNP calling: What s in a SNP? Total Reads # minor allele reads to call heterozygote (1% error rate) 2 1 3 1 4 1 5 1 6 1 7 2 8 2 9 2 10 2 11 2 12 2 13 2 14 3 Lookup table for GBS pipeline is calculated based on binomial distribution

Sequence based SNP calling: What s in a SNP? Total Reads # minor allele reads to call heterozygote (1% error rate) # minor allele reads to call heterozygote (0.33% error rate) 2 1 1 3 1 1 4 1 1 5 1 1 6 1 1 7 2 1 8 2 1 9 2 2 10 2 2 11 2 2 12 2 2 13 2 2 14 3 2 Lookup table for GBS pipeline is calculated based on binomial distribution

Sequence based SNP calling: What s in a SNP? Total read depth: 7 Major allele ( G ) count: 4 Minor allele ( C ) count: 3 SNP call: S (heterozygous G/C)

Sequence based SNP calling: What s in a SNP? Total read depth: 7 Major allele ( G ) count: 6 Minor allele ( A ) count: 1 SNP call: G (invariant)

Sequence based SNP calling: What s in a SNP? Total read depth: 3 Major allele ( G ) count: 3 Minor allele ( C ) count: 0 SNP call: G (homozygote)

Sequence based SNP calling: What s in a SNP? Total read depth: 3 Major allele ( G ) count: 2 Minor allele ( A ) count: 1 SNP call: R (A/G heterozygote)

Sequence based SNP calling: What s in a SNP? Depth Depth

Heterozygote undercalling in Vitis spp. Without filters Aa > AA error rate (heterozygote undercalling) in Vitis is 30-50% 50,951 SNPs Mean depth 5.7x

Heterozygote Undercalling Solutions 1. Focus on the high coverage genotypes (through VCF plugin) Ex: Vitisgen project 2. Use genetics (ex. Maize project) Each project is different, and different solutions are appropriate for each situation

SNP calling in Vitis (heterozygous) Read Depth of Each allele is captured for SNP calling in Vitis using the Variant Call format (VCF) Qi Sun Jeff Glaubitz Jon Zhang Terry Casstevens 0/0:3,0,0:3:88:0,0,108 Plugins available in TASSEL 3 to output VCF format tbt2vcfplugin MergeDuplicateSNP_vcf_Plugin Does NOT work with HDF5 TBT/TOPM VCF integration in TASSEL 4 in now implemented

Sequenced based SNP calling: What s in a SNP? VCF pipeline SNP calling algorithm Likelihood score calculation. No allele frequency based prior probability was used due to mixed population structures. Hohenlohe PA, Cresko WA et al. PLOS Genetics 2010 Feb 26;6(2):e1000862.

Heterozygote undercalling in Vitis spp. 120% 100% 80% 60% 40% 20% 0% 100% 82% 67% 56% 30% 15% 8% 3% - - - no filter depth >3 depth >5 GQ>98 Filtering criteria Call rate % Errors

Working With VCF Files VCFtools http://vcftools.sourceforge.net/ VCFtools output formats: plink *** can be used as TASSEL input 012 IMPUTE ldhat BEAGLE-GL VCFtools is installed on CBSU BioHPC Computing Lab http://cbsu.tc.cornell.edu/lab/lab.aspx

How to Get VCF Output Tassel 3.0 1. QseqToTagCountPlugin/FastqToTagCountPlugin 2. MergeMultipleTagCountPlugin 3. TagCountToFastqPlugin 4. Align 5. SAMConverterPlugin 6. QseqToTBTPlugin/FastqToTBTPlugin (use y to output in TBTByte format) 7. MergeTagsByTaxaFilesPlugin 8. tbt2vcfplugin 9. MergeDuplicateSNP_vcf_Plugin running each plugin without any options will output a list of available options on the console there is no plugin to merge taxa for VCF files. Merge samples by name at the TBT stage if desired using the flag x with the MergeTagsByTaxaFilesPlugin. TASSEL 4.0 From TagsToSNPByAlignmentPlugin

The non-model problem: Less-than-perfect reference genome

How less than perfect is it? Even relatively short contigs are suitable for anchoring 64bp GBS tags! Assembly errors can affect alignment and SNP calling. Common with short read assemblies Reference allele does not match GBS alleles Reference genome assembly Biological Reality Paralogous regions

Less than Perfect Reference SNPs can be called with the reference pipeline even on genomes that are not fully assembled Hint: concatenate contigs into pseudochromosomes, but pad with Ns in between contigs to avoid aligning to junctions for 64 bp tags pad with > 64 Ns

A non-model case study: Vitis a highly heterozygous outcrossing crop species with a genus-level breeding program

Mapping the way to a better grape 37 Mapping populations Genotyping-by-Sequencing (GBS) and SSR based approach to identify QTL and candidate genes biotic stress resistance, abiotic stress resistance, and fruit quality. Inclusion of 32 different wild and cultivated species to help in map building and genome assembly.

Outcrosses Use pseudo-testcross markers (Aa x aa) Leverage deep sequencing of parental genotyping to phase markers Filter by Mendelian errors and/or Linkage Disequilibrium Methods development in Vitis for genetic mapping with GBS data

Dense maps from GBS data 4 way cross Dense Mapping of 4-way Cross GBS Data Three pipelines: Synteny Synteny with genetic ordering De-novo linkage group formation and genetic ordering

Physical position Dense maps from GBS data 4 way cross Chromosome 1 markers from male parent in a VitisGen family (phased) and ordered by physical position Progeny Number (index) ~26,000 markers for a single biparental family with ~ 200 progeny Separate maps for male and female parents No parental genotypes Up to 50% missing data per SNP locus

The non-model problem: Take home messages

Designing your GBS study Know what you need for your analyses before planning your project GBS is a powerful reduced representation approach, with two distinct components molecular methods bioinformatics pipelines The GBS bioinformatics pipelines were developed and optimized for specific applications in mind They can work for you, but be mindful of your needs

Keep Calm and Filter expect to filter up to 90% of SNPs when imputation is not practical Pedigree independent filtering: Read depth/genotype quality Percentage of absent calls (missingness) Minor allele frequency For inbred species, inbreeding coefficient filters out most paralogous sites Likely to require customized bioinformatics especially without a reference genome Be ready for post-processing of data into other formats! Have a plan!

The Best Advice I can Give You:

The Best Advice I can Give You: Does my Experimental Design make Sense? Do I have a Plan for Analysis? Do my Results make Biological Sense??