Variant Finding. UCD Genome Center Bioinformatics Core Wednesday 30 August 2016


1 Variant Finding UCD Genome Center Bioinformatics Core Wednesday 30 August 2016

2 Types of Variants Adapted from Alkan et al, Nature Reviews Genetics 2011

3 Why Look For Variants?
- Genotyping
- Correlation with Traits
- Breeding (Agriculture)
- Disease Susceptibility
- Disease Progression
- Population Structure
- Identification of changes to protein sequences

4 Variant Calling Tools
A few of the many SNP/indel calling tools include:
- GATK: a suite of tools including a local realigner, a quality score recalibrator, and a SNP/indel caller.
- Samtools: for working with SNPs and short indels.
- Freebayes (github.com/ekg/freebayes): finds SNPs, indels, MNPs (multi-nucleotide polymorphisms), and complex events (composite insertions and substitutions).

5 Variant Calling Tools
Larger-scale variants need different software, and there are fewer choices, including:
- Breakdancer (github.com/genome/breakdancer): predicts insertions, deletions, inversions, and inter- and intra-chromosomal translocations.
- Delly2 (github.com/tobiasrausch/delly): discovers and genotypes deletions, tandem duplications, inversions, and translocations; includes the visualization software Delly-maze and Delly-suave.

6 A Comparison of Tools
Venn diagrams showing the number of identified variants for tested (A) germline, (B) somatic, (C) CNV, and (D) exome CNV tools.
Source: Stephan Pabinger et al., Brief Bioinform 2014;15. The Author 2013.

7 Variant Call Format (VCF) The general specifications for most of today's file formats are at github.com/samtools/hts-specs. The specs tend to be minimum requirements. Different software tools can produce different file versions that may (or may not) completely follow the spec, and they often add tool-specific info. This can lead to compatibility issues between tools in a workflow.

8 Variant Call Format (VCF) A good tutorial (with examples) can be found at faculty.washington.edu/browning/beagle/intro-to-vcf.html VCF poster

9 Variant Call Format (VCF)
##fileformat=VCFv4.1
##fileDate=
##source=freebayes v gfbf46fc-dirty
##reference=../results/8/8.fa
##phasing=none
##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction min-mapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --use-mapping-quality --min-alternate-fraction min-alternate-count 1 ../results/8/8.bam"
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB1 26. TGTTACGCG GCTTTTGC,TGTTTCTAC AO=1,2;RO=0;TYPE=complex,complex GT:DP:RO:QR:AO:QA:GL 2:3:0:0:1,2:31,70:-4.46,-1.65,0
8_PB1 38. TCA ACG,TA,AGA AO=1,1,1;RO=3;TYPE=complex,del,mnp GT:DP:RO:QR:AO:QA:GL 2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,
_PB1 42. G A e-14. AO=8;RO=128;TYPE=snp GT:DP:RO:QR:AO:QA:GL

11 Variant Call Format (VCF)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB A G AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL =
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
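The field-by-field breakdown above can be reproduced with a few lines of Python. This is a minimal sketch, not a full VCF parser: it assumes tab-delimited columns, and because the QUAL value did not survive on the slide, the example line uses "." (the VCF missing-value marker) as a stand-in. The function name parse_vcf_record is hypothetical.

```python
def parse_vcf_record(line, sample_names):
    """Split one tab-delimited VCF data line into fixed fields, INFO, and samples."""
    cols = line.rstrip("\n").split("\t")
    fixed_names = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]
    fixed = dict(zip(fixed_names, cols[:7]))
    fixed["POS"] = int(fixed["POS"])
    fixed["ALT"] = fixed["ALT"].split(",")  # several ALT alleles are comma-separated

    # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags).
    info = {}
    for item in cols[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True

    # FORMAT names the colon-separated sub-fields of every sample column.
    keys = cols[8].split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, cols[9:])
    }
    return fixed, info, samples


# Record from the slide; QUAL is "." here because the real value is not shown.
line = "\t".join([
    "8_PB2", "407", ".", "A", "G", ".", ".",
    "AO=149;RO=21;TYPE=snp",
    "GT:DP:RO:QR:AO:QA:GL",
    "1:170:21:788:149:5579:-5,0",
])
fixed, info, samples = parse_vcf_record(line, ["8"])
print(fixed["CHROM"], fixed["POS"], info["TYPE"], samples["8"]["DP"])
# prints: 8_PB2 407 snp 170
```

Values are left as strings except POS; a real workflow would usually rely on an existing VCF library and the types declared in the header.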

12 Variant Call Format (VCF)

13 Variant Call Format (VCF)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB A G AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL =
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

14 Variant Call Format (VCF)
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">

15 Variant Call Format (VCF)
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

16 Variant Call Format (VCF)
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
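The ##INFO and ##FORMAT header lines are what tell a program how to interpret each tag in a record. As a rough sketch (the function name parse_meta_line and the regular expressions are illustrative choices, not part of any VCF library), one such line can be turned into a dictionary like this:

```python
import re

# Matches only structured ##INFO=<...> and ##FORMAT=<...> header lines.
META_RE = re.compile(r'^##(INFO|FORMAT)=<(.*)>$')
# KEY=VALUE pairs; a quoted value may itself contain commas.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return the key/value pairs of an INFO or FORMAT header line, or None."""
    match = META_RE.match(line.strip())
    if match is None:
        return None
    kind, body = match.groups()
    fields = {key: value.strip('"') for key, value in PAIR_RE.findall(body)}
    fields["kind"] = kind
    return fields


header = '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">'
print(parse_meta_line(header))
```

The Number entry says how many values to expect (1, one per alternate allele for A, one per genotype for G), which is why AO and QA hold comma-separated lists in multi-allelic records.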

17 Variant Effect Prediction Tools
- SnpEff (snpeff.sourceforge.net/)
- Variant Effect Predictor - EMBL
- SIFT (sift.jcvi.org)

18 VCF after Effect Prediction
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB A G AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW SILENT gaa/gag E PB2 CODING Tr_PB2 1 1) GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL =
FILTER = .
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW SILENT gaa/gag E PB2 CODING Tr_PB2 1 1)
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0

19 VCF after Effect Prediction
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change | Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )' ">
INFO = AO=149;RO=21;TYPE=snp; EFF=SYNONYMOUS_CODING(LOW SILENT gaa/gag E PB2 CODING Tr_PB2 1 1)
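The EFF tag packs one predicted effect per comma-separated entry, with the sub-fields listed in the header description inside parentheses. The sketch below assumes SnpEff's usual pipe-delimited layout for those sub-fields. The example value is hypothetical: the amino acid position (E100) and protein length (759) are made up for illustration, and the remaining fields follow the slide. The function name parse_eff is also hypothetical.

```python
# Sub-field names taken from the EFF header description on the slide.
EFF_FIELDS = [
    "Effect_Impact", "Functional_Class", "Codon_Change", "Amino_Acid_change",
    "Amino_Acid_length", "Gene_Name", "Transcript_BioType", "Gene_Coding",
    "Transcript_ID", "Exon", "GenotypeNum",
]


def parse_eff(value):
    """Parse an EFF value into one dict per predicted effect.

    Assumes entries look like Effect(field|field|...) and are comma-separated.
    """
    effects = []
    for entry in value.split(","):
        effect, _, rest = entry.partition("(")
        parts = rest.rstrip(")").split("|")
        record = {"Effect": effect}
        record.update(zip(EFF_FIELDS, parts))
        effects.append(record)
    return effects


# Hypothetical value: E100 and 759 are invented for this example.
example = "SYNONYMOUS_CODING(LOW|SILENT|gaa/gaG|E100|759|PB2||CODING|Tr_PB2|1|1)"
for eff in parse_eff(example):
    print(eff["Effect"], eff["Effect_Impact"], eff["Gene_Name"], eff["Codon_Change"])
```

Empty sub-fields (here Transcript_BioType) come back as empty strings, so positions stay aligned with the field names.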

20

21 Why Duplicates Are Bad

22 The Need for Indel Realignment

23 Information Used For Indel Realignment
- Known sites (dbSNP, 1000 Genomes)
- Indels present in original alignments (in CIGARs)
- Sites where evidence suggests a hidden indel

24 After Local Realignment - One Indel Remains

25 Base Quality Score Recalibration
- Critical for downstream analysis
- Scores assigned by sequencers are inaccurate and biased
- Recalibration information is obtained by analyzing covariation among several features of a base, including:
  - Reported quality score
  - Position within the read (machine cycle)
  - Preceding and current nucleotide (sequencing chemistry effect)
- Known variants are used to discount most of the real genetic variation present in the sample
- All other differences from the reference are assumed to be sequencing errors
- Indel realignment is run first to reduce noise from misalignments
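The idea behind recalibration can be illustrated with a toy calculation. This is a simplified sketch, not GATK's actual model: it bins base observations by the covariates listed above, skips known variant sites, counts mismatches against the reference as errors, and converts each bin's error rate back to a Phred score. The input tuple layout and the +1/+2 smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def error_to_phred(p):
    """Convert an error probability to a Phred-scaled quality."""
    return -10 * math.log10(p)


def empirical_qualities(observations):
    """Estimate an empirical quality for every covariate bin.

    observations: iterable of
        (reported_q, cycle, context, is_mismatch, at_known_site)
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, cycle, context, is_mismatch, at_known_site in observations:
        if at_known_site:
            continue  # likely real variation, so not counted as an error
        bin_key = (reported_q, cycle, context)
        counts[bin_key][0] += int(is_mismatch)
        counts[bin_key][1] += 1
    # +1 / +2 smoothing keeps the rate away from zero in sparse bins.
    return {
        key: error_to_phred((errors + 1) / (total + 2))
        for key, (errors, total) in counts.items()
    }


# 1000 bases reported as Q30 at cycle 5 after an "AC" context, 9 mismatching,
# plus one mismatch at a known variant site that is ignored.
obs = [(30, 5, "AC", True, False)] * 9 + [(30, 5, "AC", False, False)] * 991
obs.append((30, 5, "AC", True, True))
table = empirical_qualities(obs)
print(round(table[(30, 5, "AC")], 1))
```

In this example the sequencer reported Q30, but the observed mismatch rate corresponds to roughly Q20, which is the kind of overconfidence recalibration is meant to correct.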

26 Base Quality Score Recalibration

27 Read Compression
- Discard redundant information
- Only keep the essential information for variant calling

28 Read Compression: Full vs. Reduced BAM

29 Haplotype Caller - Initial Variant Calling
- Calls SNPs, indels, and some structural variants simultaneously by performing a local de novo assembly
- Distinguishes genetic variants from random machine noise
- Uses active regions for variant calling, based on significant evidence for variation
- Determines likelihoods of the haplotypes given the read data
- Assigns sample genotypes based on Bayesian likelihoods
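The last step, going from likelihoods to a genotype call, is Bayes' rule applied per sample. The following is a minimal sketch under simplifying assumptions (a flat prior unless one is supplied, likelihoods given as log10 values like the GL field shown earlier); it is not the HaplotypeCaller implementation, and the function names are hypothetical.

```python
import math


def genotype_posteriors(log10_likelihoods, priors=None):
    """Turn log10 genotype likelihoods into normalized posterior probabilities."""
    n = len(log10_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n  # flat prior: an assumption for this sketch
    top = max(log10_likelihoods)  # rescale to avoid underflow
    weighted = [10 ** (gl - top) * p for gl, p in zip(log10_likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def call_genotype(log10_likelihoods, priors=None):
    """Return (index of best genotype, its posterior, Phred-scaled quality)."""
    post = genotype_posteriors(log10_likelihoods, priors)
    best = max(range(len(post)), key=post.__getitem__)
    error = sum(p for i, p in enumerate(post) if i != best)
    quality = float("inf") if error <= 0 else -10 * math.log10(error)
    return best, post[best], quality


# GL values from the haploid example record earlier in the deck: -5,0
best, posterior, quality = call_genotype([-5.0, 0.0])
print(best, round(posterior, 5), round(quality, 1))
```

With the GL values -5,0 the alternate allele (index 1) wins, matching the GT of 1 in that record.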

30 Variant Quality Score Recalibration (VQSR)
- An alternative to hard filtering
- Initial variant calling produces a very large call set that is full of false positives
- Hand-tuned filtering requires time and expertise
- A statistical model can be used to recalibrate variants instead
- Each variant has a set of statistics associated with it, called variant annotations
- Real variants tend to cluster together via these statistics
- SNPs and indels must be recalibrated separately
- Training resources: SNP (HapMap, Omni, 1000G, dbSNP), INDEL (Mills)

31 Variant Quality Score Recalibration (VQSR)

32 Variant Filtering
Based on some criteria relevant to your research
Useful for:
- Small data sets, in terms of either number of samples or size of targeted regions
- Cases where no database of high-confidence known variants is available
Example:
For SNPs:
- QD < 2.0
- MQ < 40.0
- FS > 60.0
- MQRankSum <
- ReadPosRankSum < -8.0
For indels:
- QD < 2.0
- ReadPosRankSum <
- InbreedingCoeff < -0.8
- FS > 200.0
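Hard filtering of this kind amounts to comparing each variant's annotations against fixed thresholds. The sketch below uses only the thresholds that are legible above (the MQRankSum cutoff for SNPs and the ReadPosRankSum cutoff for indels are missing from the transcription and are therefore left out). The function name failed_filters and the decision to skip absent annotations are assumptions for illustration.

```python
import operator

# (annotation, comparison, threshold): a variant FAILS if the comparison is true.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("InbreedingCoeff", operator.lt, -0.8),
    ("FS", operator.gt, 200.0),
]


def failed_filters(annotations, filters):
    """Return the names of the annotations that trip a filter.

    Annotations missing from the record are skipped rather than failed.
    """
    failed = []
    for name, compare, threshold in filters:
        value = annotations.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


snp = {"QD": 1.5, "MQ": 55.0, "FS": 12.0, "ReadPosRankSum": -9.1}
indel = {"QD": 14.2, "FS": 3.4}
print(failed_filters(snp, SNP_FILTERS))      # ['QD', 'ReadPosRankSum']
print(failed_filters(indel, INDEL_FILTERS))  # []
```

A variant with an empty result passes; anything else would typically be marked in the FILTER column rather than deleted, so the decision can be revisited later.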

33 Genotype gVCF Files
- gVCFs are VCF files with information for every position in the genome, regardless of variant calls
- Used by GATK to perform variant discovery in a way that enables joint analysis of multiple samples but is decoupled from the initial per-sample variant calling step, i.e. you don't have to call variants on all your samples together to perform a joint analysis
- Drastically reduces run time and allows for easy incorporation of additional samples into the pipeline
- Part of GATK 3.0, but NOT in our Galaxy AMI because wrappers have not been written yet