talks Callset Evalua,on Comparing sta,s,cs between your callset and a truth set

Size: px
Start display at page:

Download "talks Callset Evalua,on Comparing sta,s,cs between your callset and a truth set"

Transcription

1 talks Callset Evalua,on Comparing sta,s,cs between your callset and a truth set

2 You are here in the GATK Best Prac,ces workflow for germline variant discovery Data Pre-processing >> Variant Discovery >> Callset Refinement Raw Reads 111 Analysis-Ready Reads Var. Calling HC in ERC mode 111 Analysis-Ready Variants SNPs & Indels Non-GATK Map to Reference BWA mem Mark Duplicates & Sort (Picard) Genotype Likelihoods Joint Genotyping Variant Annotation Genotype Refinement Indel Realignment Base Recalibration Analysis-Ready Reads Raw Variants SNPs Indels Variant Recalibration separately per variant type Analysis-Ready SNPs Indels Variants Variant Evaluation look good? troubleshoot use in project

3 Where are you on this spectrum? IDEAL OKAY TERRIBLE Your variant calls perfectly match the underlying biological truth You found many real variants and called few false posi,ves You didn t find any real variants and only called ar,facts! = What callset evalua,on methods aim to determine (not veracity of individual variant calls)

4 How do I figure out how good/bad my callset is? MyCallset.vcf Specs: # SNPs # Indels TiTv Ratio Indel Ratio Concordance... Extract key sta@s@cs and compare to truth set stats MyCallset Specs vs Truth Specs Guiding principle: divergence is indica@ve of error

5 Key assump,on: truth set is representa,ve / comparable Important to match dataset proper,es! Popula,on ethnicity (European, African, etc.) Sequencing / exp. design (WGS vs. WES) Cohort size hup://

6 Ethnicity affects many variant call metrics SNP count 24K 21K 18K Monkol Lek, G ATV Bipolar BUP ESP OUawa NFBC SCZ T2D-GENES GoT2D SIGMA TAT

7 Older popula,ons tend to display more heterogeneity

8 If possible, use truth sets generated with orthogonal methods Sequencing Probe/Array-based T G C A GeneChip Sanger sequencing Other HTS technologies GeneChip Microarrays

9 Commonly used truth sets dbsnp All previously reported varia,on (lots of junk!) Sample-matched genotyping chip Awesome! But adds cost & limited to known variants HapMap Highly validated common human variants OMNI Common varia,on validated by array NIST Genomes in a BoSle (single sample evalua@on) Consensus callsets from common benchmarking samples

10 Recommended metrics for callset evalua,on Number of Indels & SNPs ref Indel ref CATG CA CCA TG CA G Genotype Concordance my calls TiTv Ra@o A Ti G truth Tv C T

11 Number of Indels & SNPs ref Sequencing Type # of Variants WGS WES ~4.4 M ~41 k Variants = Indels + SNPs Useful for order-of-magnitude sanity check

12 TiTv Ra,o A Ti G C Tv T Sequencing Type TiTv Ra@o WGS WES If random: expect ra,o of 0.5 Twice as many possible transversions vs transi,ons! Low TiTv ra,o indicates high rate of false posi,ves

13 Indel Ra,o ref CATG CA CCA TG CA G Variant Associa@on Study type Indel Ra@o Common ~1 Rare Ra,o of inser,ons to dele,ons Varies by type of study e.g. rare variant associa,on vs common variant associa,on

14 Genotype Concordance my calls truth Most appropriate truth set is genotyping chip for same sample % Genotype calls in callset matching GT calls in truth set Unmatched variants considered false posi,ves

15 Cheat sheet of concordance metrics

16 So how do I get these metrics? Variant Level Evalua@on VariantEval Genotype Level Evalua@on GenotypeConcordance GATK java -jar GenomeAnalysisTK.jar \ -T VariantEval \ -R reference.b37.fasta \ -eval callset.vcf \ -D truthset.vcf \ -o results.eval.grp java -jar GenomeAnalysisTK.jar \ -T GenotypeConcordance \ -R reference.b37.fasta \ --comp truthset.vcf \ --eval callset.vcf \ -o results.grp CollectVariantCallingMetrics GenotypeConcordance Picard java -jar picard.jar \ CollectVariantCallingMetrics INPUT=callset.vcf \ DBSNP=truthset.vcf \ OUTPUT=results java -jar picard.jar \ GenotypeConcordance \ CALL_VCF=callset.vcf \ TRUTH_VCF=truthset.vcf \ CALL_SAMPLE=sampleName \ TRUTH_SAMPLE=sampleName \ OUTPUT=results

17 Which variant-level evaluator should I use? GATK VariantEval Picard CollectVariantCallingMetrics More detailed analysis More op,ons for stra,fica,on Ability to compare to mul,ple truth sets Best performance & speed on very large callsets Few op,ons beyond the metrics discussed here

18 You are here in the GATK Best Prac,ces workflow for germline variant discovery Data Pre-processing >> Variant Discovery >> Callset Refinement Raw Reads 111 Analysis-Ready Reads Var. Calling HC in ERC mode 111 Analysis-Ready Variants SNPs & Indels Non-GATK Map to Reference BWA mem Mark Duplicates & Sort (Picard) Genotype Likelihoods Joint Genotyping Variant Annotation Genotype Refinement Indel Realignment Base Recalibration Analysis-Ready Reads Raw Variants SNPs Indels Variant Recalibration separately per variant type Analysis-Ready SNPs Indels Variants Variant Evaluation look good? troubleshoot use in project

19 talks Further reading hup:// hup://