SNP calling and VCF format
Laurent Falquet, Oct 12

SNP? What is this?
A type of genetic variation, among others:
- Family of single-nucleotide aberrations
  - Single Nucleotide Polymorphisms (SNPs)
  - Single Nucleotide Variations (SNVs)
- Short insertions or deletions (indels), less than 50 bp
- Larger Structural Variations (SVs)
  - large indels
  - inversions
  - translocations
  - CNVs...

SNPs vs SNVs
Both are aberrations at a single nucleotide, but they differ in their frequency of occurrence.
- SNP
  - Aberration expected at the position for any member of the species
  - Occurs in the population at some frequency (usually > 1%)
  - Validated in the population
  - Catalogued in dbSNP (http://www.ncbi.nlm.nih.gov/snp)
- SNV
  - Aberration seen in only one individual
  - Occurs at low frequency
  - Not validated in the population

SNP example
Comparison of 2 diploid individuals vs a reference genome:
  Sample   SNP genotype
  Ref      A
  Ind1     G/G
  Ind2     A/G
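
The frequency criterion can be made concrete with a small calculation. This is a minimal Python sketch, assuming diploid genotypes written as strings such as `A/G` and the 1% rule of thumb quoted above. The function names are illustrative only, and two individuals are of course far too few for a real population frequency estimate.

```python
def alt_allele_frequency(genotypes, ref):
    """Fraction of all sampled alleles that differ from the reference base."""
    alleles = [allele for gt in genotypes for allele in gt.split("/")]
    non_ref = sum(1 for allele in alleles if allele != ref)
    return non_ref / len(alleles)


def looks_like_snp(frequency, threshold=0.01):
    """Apply the (assumed) 1% population-frequency rule of thumb."""
    return frequency > threshold


# The two diploid individuals of the example: Ind1 = G/G, Ind2 = A/G, reference = A.
freq = alt_allele_frequency(["G/G", "A/G"], ref="A")
print(freq)  # 3 non-reference alleles out of 4 -> 0.75
```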

SNP real life example

Why look for SNPs/SNVs?
- SNPs may lead to a change in the function or expression of a gene.
  - A non-synonymous SNP has an impact on the protein sequence, for example:
    - premature stop codon
    - different fold of the protein
- Genetic markers: a SNP may be linked to a gene for a given trait
  - response to a pathogen (susceptible or resistant)
  - a phenotype

Types of SNPs/SNVs
The effect of a SNP varies depending on its location.
- Intergenic regions
  - may alter the sequence of regulatory RNAs
- Non-coding regions
  - alteration of promoter and enhancer sequences may change the expression of a gene
- Coding regions
  - Substitutions
    - synonymous: no change in the amino acid
    - non-synonymous: change in the amino acid

Other variants: Indels
- Insertion/deletion
  - Sometimes a matter of perspective: does the reference have an insertion, or does the query (e.g. a read sequence) have a deletion?
- Differ from SNPs by having at least one nucleotide extra or missing when compared to a reference sequence.
- Can cause frame-shifts: the codons shift by one, creating a different protein sequence after the indel.

  Reference        ATCGCGTTTGCCATGGCC     I A F A M A
  1-bp insertion   ATCGCGTTTCGCCATGGCC    I A F R H G

  Reference        ATCGCGTTTGCCATGGCC     I A F A M A
  1-bp deletion    ATCGCGTTGCCATGGCC      I A L P W P

Note: Indels of a length divisible by 3 cause whole amino acid insertions/deletions, not frame-shifts.
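
The frame-shift effect can be checked by translating the example sequences. This is a minimal Python sketch, assuming the standard genetic code and translation starting at the first base. `translate` is a name chosen for illustration. It drops a trailing incomplete codon, so the last residue that the slide reads from the partial codon `CC` is not printed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


reference = "ATCGCGTTTGCCATGGCC"
one_bp_insertion = "ATCGCGTTTCGCCATGGCC"   # extra C after TTT
one_bp_deletion = "ATCGCGTTGCCATGGCC"      # one T removed from TTT
three_bp_deletion = "ATCGCGGCCATGGCC"      # whole TTT codon removed

print(translate(reference))          # IAFAMA
print(translate(one_bp_insertion))   # IAFRHG  (frame-shift)
print(translate(one_bp_deletion))    # IALPW   (frame-shift)
print(translate(three_bp_deletion))  # IAAMA   (one amino acid lost, frame kept)
```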

Variant Call Format (VCF 4.3, Oct 2015)
http://samtools.github.io/hts-specs/vcfv4.3.pdf
VCF is a tab-delimited text file format:
- ## Meta-information lines
- # Header line
- Data lines, each with information about a position in the genome

Variant Call Format in more detail
The format can also encode genotype information for each sample.
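
The three kinds of lines can be told apart by their prefix. This is a minimal Python sketch of a reader, assuming a well-formed file with the eight fixed columns followed by FORMAT and sample columns. The function name `parse_vcf` and the example record are made up for illustration. For real data a dedicated library is the safer choice.

```python
def parse_vcf(text):
    """Split VCF text into meta lines, sample names and data records."""
    meta, samples, records = [], [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("##"):            # meta-information line
            meta.append(line[2:])
        elif line.startswith("#"):           # header line: fixed columns, FORMAT, samples
            samples = line[1:].split("\t")[9:]
        else:                                # data line: one position in the genome
            fields = line.split("\t")
            record = {
                "chrom": fields[0], "pos": int(fields[1]), "id": fields[2],
                "ref": fields[3], "alt": fields[4].split(","),
                "qual": fields[5], "filter": fields[6], "info": fields[7],
            }
            if len(fields) > 9:              # per-sample genotype columns
                keys = fields[8].split(":")
                record["genotypes"] = {
                    sample: dict(zip(keys, value.split(":")))
                    for sample, value in zip(samples, fields[9:])
                }
            records.append(record)
    return meta, samples, records


# Hypothetical two-sample file, built with explicit tabs.
example = "\n".join([
    "##fileformat=VCFv4.3",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "Ind1", "Ind2"]),
    "\t".join(["chr1", "1042", ".", "A", "G", "50", "PASS", "DP=30",
               "GT:DP", "1/1:14", "0/1:16"]),
])

meta, samples, records = parse_vcf(example)
print(samples)                               # ['Ind1', 'Ind2']
print(records[0]["genotypes"]["Ind2"]["GT"])  # 0/1
```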

IGV visualization of VCF

Tools for SNP calling (non-exhaustive list)
- samtools
- VarScan2
- GATK (Picard tools required)
- strelka
- FreeBayes
These tools generally take a BAM/SAM file as input and produce a VCF-like file as output.

Variant Calling Methods and Tools
More than 15 different algorithms, but three main categories:
- Allele counting with simple cutoff rules
- Probabilistic methods, e.g. a Bayesian model
  - to quantify statistical uncertainty
  - assign priors based on the observed allele frequency of multiple samples
- Heuristic approach
  - based on thresholds for read depth, base quality, variant allele frequency, statistical significance

Identifying SNPs
Filter SNPs based on some rules:
- Coverage: minimum depth of coverage required?
- Genotype: genotypes, or combination of genotypes?
- Alternative allele frequency: 0.5? 0.33?
- Absent in dbSNP or other databases
- Exclude LOH (loss of heterozygosity) events
- Retain non-synonymous SNVs present in a given number of reads
- High mapping and SNV quality
- SNV density in a given bp window
- SNV more than a given number of bp from a predicted indel
- Strand balance/bias
- Concordance across various SNV callers
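
A few of these rules can be combined into a simple threshold filter. This is a minimal Python sketch covering only coverage, alternative allele frequency, quality and strand bias. The record layout (`ref_reads`, `alt_fwd`, `alt_rev`, `qual`) and all default thresholds are assumptions made for illustration, not values recommended by any particular caller.

```python
def passes_filters(call, min_depth=10, min_alt_fraction=0.33,
                   min_qual=30.0, max_strand_ratio=0.9):
    """Return True if a candidate SNV survives four simple threshold rules."""
    alt_reads = call["alt_fwd"] + call["alt_rev"]
    depth = call["ref_reads"] + alt_reads

    # Coverage: require a minimum number of reads at the position.
    if depth == 0 or depth < min_depth:
        return False
    # Alternative allele frequency: fraction of reads supporting the variant.
    if alt_reads / depth < min_alt_fraction:
        return False
    # Variant quality reported by the caller.
    if call["qual"] < min_qual:
        return False
    # Strand bias: reject if nearly all supporting reads come from one strand.
    if alt_reads and max(call["alt_fwd"], call["alt_rev"]) / alt_reads > max_strand_ratio:
        return False
    return True


# Hypothetical candidate calls.
balanced = {"ref_reads": 12, "alt_fwd": 7, "alt_rev": 6, "qual": 60.0}
strand_biased = {"ref_reads": 12, "alt_fwd": 13, "alt_rev": 0, "qual": 60.0}
shallow = {"ref_reads": 2, "alt_fwd": 2, "alt_rev": 2, "qual": 60.0}

print(passes_filters(balanced))       # True
print(passes_filters(strand_biased))  # False
print(passes_filters(shallow))        # False
```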

GATK recommended pipeline

Mapping and deduplication
- Mapping was covered previously (BWA, Bowtie2, etc.).
- PCR duplicates can be removed or marked with samtools or Picard tools.
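
The idea behind duplicate marking can be sketched in a few lines. This is a simplified Python illustration, not the algorithm of samtools or Picard: it assumes single-end reads described by hypothetical dictionaries, groups them by chromosome, start position and strand, and keeps the read with the highest summed base quality in each group. Real tools also take mate positions and clipping into account.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Flag all but the best-quality read among reads sharing chrom, pos and strand."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    for group in groups.values():
        best = max(group, key=lambda r: r["base_qual_sum"])
        for read in group:
            # Reads are marked rather than removed, so downstream tools can decide.
            read["duplicate"] = read is not best
    return reads


# Hypothetical reads: the first two start at the same place on the same strand.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "base_qual_sum": 900},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "base_qual_sum": 950},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "base_qual_sum": 800},
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "base_qual_sum": 700},
]

flags = [r["duplicate"] for r in mark_duplicates(reads)]
print(flags)  # [True, False, False, False]
```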

Indel realignment
Hidden indel realignment (strand-discordant locus)

FreeBayes does without realignment!
FreeBayes is haplotype-based: it calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment. This method avoids one of the core problems with alignment-based variant detection: identical sequences may have multiple possible alignments.
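
The alignment ambiguity is easy to reproduce. This is a small Python sketch, reusing the reference sequence from the indel example: deleting any one of the three consecutive T bases yields exactly the same read sequence, so an aligner may report the deletion at any of three positions. `delete_base` is a helper name introduced here for illustration.

```python
def delete_base(seq, index):
    """Return seq with the base at the given 0-based index removed."""
    return seq[:index] + seq[index + 1:]


reference = "ATCGCGTTTGCCATGGCC"

# The TTT run occupies 0-based positions 6, 7 and 8.
reads = {position: delete_base(reference, position) for position in (6, 7, 8)}

for position, read in reads.items():
    print(position, read)

# Three different alignments (deletion placements), one literal read sequence.
print(len(set(reads.values())))  # 1
```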

Base Quality Score Recalibration
The base quality scores in the BAM file are taken from the FASTQ scores, so they reflect the quality of the reads as reported by the sequencing machine rather than the quality of the mapping location. Recalibrating the scores using the mapping information corrects errors in the base quality scores.
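
The principle can be illustrated with the Phred definition, Q = -10 log10(p), where p is the error probability. The Python sketch below is a strong simplification and not GATK's actual model: it bins aligned bases by reported quality only, counts mismatches against the reference at sites assumed not to be true variants, and converts the observed error rate back to a Phred score. The pseudocounts and function names are assumptions of this sketch.

```python
import math


def phred(error_probability):
    """Phred quality score: Q = -10 * log10(p)."""
    return -10 * math.log10(error_probability)


def empirical_quality(mismatches, observations):
    """Observed error rate as a Phred score; +1/+2 pseudocounts avoid log10(0)."""
    return phred((mismatches + 1) / (observations + 2))


def recalibration_table(bases):
    """Map each reported quality to the empirical quality seen in the alignments.

    bases: iterable of (reported_quality, is_mismatch) pairs, taken at sites
    that are assumed not to be real variants.
    """
    counts = {}
    for reported, is_mismatch in bases:
        mismatches, total = counts.get(reported, (0, 0))
        counts[reported] = (mismatches + int(is_mismatch), total + 1)
    return {q: round(empirical_quality(m, n)) for q, (m, n) in counts.items()}


# Hypothetical data: 998 bases reported as Q30, of which 9 mismatch the reference.
bases = [(30, True)] * 9 + [(30, False)] * 989
table = recalibration_table(bases)
print(table)  # {30: 20}: the machine was over-confident for this bin
```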

GATK methods for SNP calling
Recommended but slow

Haploid vs Diploid genomes
Warning: many SNV callers are designed for diploid genomes. They call both homozygous and heterozygous variants. For haploid genomes, only homozygous variants are of interest; the heterozygous calls can be filtered out.
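
Filtering out heterozygous calls only requires looking at the GT field of each sample. This is a minimal Python sketch, assuming VCF-style genotype strings with `/` or `|` separators and `.` for a missing allele. The function name `keep_for_haploid` is illustrative.

```python
def keep_for_haploid(gt):
    """Keep a call only if it is homozygous for a non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    called = {allele for allele in alleles if allele != "."}
    # Exactly one distinct allele, and it is not the reference allele "0".
    return len(called) == 1 and called != {"0"}


for genotype in ["1/1", "0/1", "0/0", "1|1", "1/2", "./."]:
    print(genotype, keep_for_haploid(genotype))
```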