Single Nucleotide Variant Analysis H3ABioNet May 14, 2014
Outline What are SNPs and SNVs? How do we identify them? How do we call them? SAMTools, GATK. VCF File Format. Let's call variants!
Single Nucleotide Polymorphisms (SNPs) A single-nucleotide polymorphism (SNP, pronounced "snip") is the variation in a single base of DNA that is present in at least 1% of the population. http://www.springerreference.com/docs/html/chapterdbid/334682.html
Single Nucleotide Variants (SNVs) Single-nucleotide variants (SNVs) include both rare (<1%) and common (≥1%, i.e. SNP) variants of a single base pair. http://www.springerreference.com/docs/html/chapterdbid/334682.html
Subtypes of SNVs Coding SNV: synonymous; non-synonymous (missense, nonsense). Non-coding SNV: non-coding regions of genes (e.g. introns); intergenic regions (regions between genes).
Synonymous SNV Type: single base pair change. Where: coding region (exon). Feature: no amino acid change.
Non-synonymous SNV Type: single base pair change. Where: coding region (exon). Feature: amino acid change. Two subtypes: missense and nonsense.
Missense Mutation Type: single base pair change. Where: coding region (exon). Feature: one amino acid change.
Nonsense Mutation Type: single base pair change. Where: coding region (exon). Feature: premature stop (nonsense) codon -> protein truncation.
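The three coding subtypes above can be told apart by translating the affected codon before and after the base change. The following is a minimal sketch, assuming the standard genetic code and codons written on the coding strand; `classify_coding_snv` is a hypothetical helper name, not part of any tool covered in these slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon, position, alt_base):
    """Classify a single-base change inside a codon (hypothetical helper).

    position is the 0-based index of the changed base within the codon.
    Changes that remove an existing stop codon are not treated separately
    in this sketch and fall through to "missense".
    """
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


print(classify_coding_snv("CTT", 2, "C"))  # Leu -> Leu
print(classify_coding_snv("GAG", 1, "T"))  # Glu -> Val
print(classify_coding_snv("TAC", 2, "A"))  # Tyr -> stop
```

Real annotation tools also need the transcript, strand, and reading frame; the sketch assumes those are already resolved.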
Why are SNVs important? Human disease: associations between SNPs and diseases (OMIM; HGMD, the Human Gene Mutation Database). Cancer: normal vs. tumor samples. Response to drugs, chemicals, and pathogens. GWAS (Genome-Wide Association Studies): GWAS Central.
SNV Databases dbSNP (Single Nucleotide Polymorphism database): up until v138; only common variants (not disease-related). COSMIC (Catalogue of Somatic Mutations in Cancer).
Outline What are SNVs and SNPs? How do we identify them? How do we call them? SAMTools, GATK. VCF File Format. Let's call variants!
Identifying SNVs Identifying SNVs can be challenging, and there are many tools available to help with this. Let's first look at what an SNV may look like. Assume the samples are from a human (diploid).
Inheritance You inherit genetic material from your mother and your father, so you have 2 copies of each chromosome. At any given base position, the genotype should therefore be either homozygous (AA) or heterozygous (AB).
Allelic Fractions We can keep allelic fractions in mind when looking at SNVs. Homozygous samples should have 100% of the bases showing one allele. Heterozygous samples should be ~50/50. This gets complicated with cancer genomes.
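The expected fractions for a diploid sample (about 0, 0.5, or 1.0 of reads showing the variant allele) can be turned into a rough genotype guess. This is a minimal sketch; the function names and the tolerance of 0.2 are illustrative assumptions, not values from the slides or from any caller.

```python
def allelic_fraction(variant_reads, total_reads):
    """Fraction of reads at a position that carry the variant allele."""
    if total_reads == 0:
        raise ValueError("no reads cover this position")
    return variant_reads / total_reads


def guess_genotype(fraction, tolerance=0.2):
    """Map an allelic fraction to a diploid genotype label.

    The tolerance is an assumed, illustrative value.
    """
    if fraction <= tolerance:
        return "homozygous reference"
    if fraction >= 1 - tolerance:
        return "homozygous variant"
    if abs(fraction - 0.5) <= tolerance:
        return "heterozygous"
    return "ambiguous"


print(guess_genotype(allelic_fraction(8, 8)))
print(guess_genotype(allelic_fraction(4, 8)))
print(guess_genotype(allelic_fraction(1, 8)))
```

Fixed cutoffs like these break down for tumor samples, where purity and copy number shift the fractions away from 0, 0.5, and 1.0.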
Calling SNVs Example 1
         AACTACGGTCCGAGATAGAG
        GAACTACGGTCCGAGATAGA
       AGAACTACGGTCCGAGATAG
      TAGAACTACGGTCCGAGATA
     ATAGAACTACGGTCCGAGAT
    AATAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Homozygous C->A SNV.
Calling SNVs Example 2
         AACTCCGGTCCGAGATAGAG
        GAACTCCGGTCCGAGATAGA
       AGAACTCCGGTCCGAGATAG
      TAGAACTCCGGTCCGAGATA
     ATAGAACTACGGTCCGAGAT
    AATAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Heterozygous C->A SNV.
Calling SNVs Example 3
         AACTCCGGTCCGAGATAGAG
        GAACTCCGGTCCGAGATAGA
       AGAACTCCGGTCCGAGATAG
      TAGAACTCCGGTCCGAGATA
     ATAGAACTCCGGTCCGAGAT
    AATAGAACTCCGGTCCGAGA
   TAATAGAACTCCGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Is this an SNV?
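Judging a candidate SNV starts with counting which bases the reads show at one reference position. The sketch below does this for the reads of Example 3. The 0-based offsets are an assumption: they were derived by matching each read against the reference string, since the slide itself only shows the picture. `column_counts` is a hypothetical helper name.

```python
from collections import Counter

REFERENCE = "TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC"

# (assumed 0-based offset on the reference, read sequence) for Example 3
READS = [
    (9, "AACTCCGGTCCGAGATAGAG"),
    (8, "GAACTCCGGTCCGAGATAGA"),
    (7, "AGAACTCCGGTCCGAGATAG"),
    (6, "TAGAACTCCGGTCCGAGATA"),
    (5, "ATAGAACTCCGGTCCGAGAT"),
    (4, "AATAGAACTCCGGTCCGAGA"),
    (3, "TAATAGAACTCCGGTCCGAG"),
    (2, "GTAATAGAACTACGGTCCGA"),
]


def column_counts(reads, position):
    """Count the bases that aligned reads show at a 0-based reference position."""
    counts = Counter()
    for offset, sequence in reads:
        index = position - offset
        if 0 <= index < len(sequence):
            counts[sequence[index]] += 1
    return counts


counts = column_counts(READS, 13)
print(REFERENCE[13], dict(counts))  # C {'C': 7, 'A': 1}
```

Only one of the eight reads carries the A, which is far from the roughly 50% or 100% expected for a germline variant in a diploid sample.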
Calling Variants To call variants you need to think about: read depth (coverage), base quality, mapping quality, sequencing errors, distribution within a read, strand bias, and distinguishing real variants from PCR bias.
Read Depth (Coverage) Read depth (coverage) refers to how many reads cover a given base in the reference.
                 GAACTACGGTCCGAGATAGA
              ATAGAACTACGGTCCGAGAT
            TAATAGAACTACGGTCCGAG
           GTAATAGAACTACGGTCCGA
CCATACCAGTCGTAATAGAACTACGGTCCGAGATAGAGGATACACAGATTAGATAGGGATACCG
Marked positions: Read Depth = 4, Read Depth = 1, Read Depth = 0
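Per-base depth can be computed by adding one to every reference position that a read covers. This is a minimal sketch using the four reads from this slide; the 0-based offsets are an assumption obtained by matching each read against the reference string, and `read_depth` is a hypothetical helper name.

```python
REFERENCE = "CCATACCAGTCGTAATAGAACTACGGTCCGAGATAGAGGATACACAGATTAGATAGGGATACCG"

# (assumed 0-based offset on the reference, read sequence)
READS = [
    (17, "GAACTACGGTCCGAGATAGA"),
    (14, "ATAGAACTACGGTCCGAGAT"),
    (12, "TAATAGAACTACGGTCCGAG"),
    (11, "GTAATAGAACTACGGTCCGA"),
]


def read_depth(reference_length, reads):
    """Return a list with the number of reads covering each reference base."""
    depth = [0] * reference_length
    for offset, sequence in reads:
        end = min(offset + len(sequence), reference_length)
        for position in range(offset, end):
            depth[position] += 1
    return depth


depth = read_depth(len(REFERENCE), READS)
print(depth[20], depth[11], depth[0])  # 4 1 0
```

The three printed positions reproduce the depths of 4, 1, and 0 labelled on the slide, although the exact columns the original arrows pointed at are not recoverable.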
Depth
         AACTCCGGTCCGAGATAGAG
        GAACTCCGGTCCGAGATAGA
       AGAACTCCGGTCCGAGATAG
      TAGAACTCCGGTCCGAGATA
     ATAGAACTACGGTCCGAGAT
    AATAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Do you call this an SNV? Why or why not?
Depth
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Do you call this an SNV? Why or why not?
Depth Researchers typically want a depth of >8x at a given base position before attempting to make an SNV call. At least 3 reads must carry the variant to call it an SNV. Make sure to consider the average coverage of your sample, and set the threshold carefully.
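The two rules of thumb above translate directly into a filter. This is a minimal sketch; the function and constant names are hypothetical, and the defaults simply encode ">8x" and "at least 3 reads" from the slide. As the slide says, the right values depend on the average coverage of the sample.

```python
MIN_DEPTH = 9          # "depth of >8x" means at least 9 reads
MIN_VARIANT_READS = 3  # "at least 3 reads with a variant"


def passes_depth_filter(depth, variant_reads,
                        min_depth=MIN_DEPTH,
                        min_variant_reads=MIN_VARIANT_READS):
    """Return True if a position has enough total and variant-supporting reads."""
    return depth >= min_depth and variant_reads >= min_variant_reads


print(passes_depth_filter(depth=20, variant_reads=10))  # True
print(passes_depth_filter(depth=8, variant_reads=4))    # False: depth too low
print(passes_depth_filter(depth=30, variant_reads=2))   # False: too few variant reads
```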
Sequencing Errors Sequencing errors are random, but: look at the base quality trend; there are usually more errors towards the ends of reads; we would rarely see them at the same position in different reads.
       AGAACTACGGTCCGAGACAG
      TAGAACTACGGTCTGAGATA
     ATAGAACTACGGTCCGAGAT
    AAAAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCCAG
  GTAATAGAACTCCGGTCCGA
TCGTAATAGAACTACGGTCCGAGATAGAGGATAC  Reference
Distribution within a Read
  GTAATATAACTACGCTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
What is happening here? Multiple mismatches within a read are a sign of possible misalignment.
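Counting the mismatches inside a single read gives a simple warning sign for misalignment. The sketch below uses the read and reference from this slide. The 0-based offset of 2 is an assumption obtained by matching the read against the reference, and both `mismatch_positions` and the `MAX_MISMATCHES` cutoff are hypothetical.

```python
REFERENCE = "TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC"
MAX_MISMATCHES = 2  # assumed, illustrative cutoff


def mismatch_positions(reference, offset, read):
    """Return the 0-based reference positions where the read differs.

    Assumes the read lies entirely within the reference.
    """
    return [
        offset + index
        for index, base in enumerate(read)
        if reference[offset + index] != base
    ]


mismatches = mismatch_positions(REFERENCE, 2, "GTAATATAACTACGCTCCGA")
print(mismatches)                        # [8, 13, 16]
print(len(mismatches) > MAX_MISMATCHES)  # True: treat this read with suspicion
```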
Why do Mis-alignments occur? We are forcing the aligner to compare our sequence reads against a known reference. The aligner tries to find the best alignment position against the reference provided. Contamination may still align if the originating organisms are similar enough.
Mis-alignments
     ATATAACTACGCTCCGAGAT
    AATATAACTACGCTCCGAGA
   TAATATAACTACGCTCCGAG
  GTAATATAACTACGCTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
It is possible that all 3 mutations are true. More likely, though, this is a problematic region of the genome that has mis-alignment issues.
Gene Families A set of several similar genes, formed by duplication of a single original gene, and generally with similar biochemical functions. Similar sequences. Gene families are typically problematic across a single genome.
Mis-alignments Always align against the whole genome of an organism, even if we do targeted sequencing. This will reduce the chances of mis-alignments. Major human reference genomes contain Chr1-22, X and Y, plus many fragmented chromosomes.
Strand Bias All evidence for a variant comes from only the forward or only the reverse strand. This implies a problematic area in the genome or biases in the capture technology.
Strand Bias T T T T T T T T Variant
Strand Bias T T T Variant? T T T
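A crude check for strand bias is to ask what share of the variant-supporting reads comes from the less represented strand. This is a minimal sketch; the function names and the 0.1 cutoff are illustrative assumptions. Production callers generally rely on a statistical test, such as Fisher's exact test on per-strand reference and variant counts, rather than a fixed cutoff.

```python
def minor_strand_fraction(forward_reads, reverse_reads):
    """Share of variant-supporting reads that come from the less common strand."""
    total = forward_reads + reverse_reads
    if total == 0:
        raise ValueError("no reads support the variant")
    return min(forward_reads, reverse_reads) / total


def has_strand_bias(forward_reads, reverse_reads, min_minor_fraction=0.1):
    """Flag a variant whose support comes almost entirely from one strand.

    The cutoff is an assumed, illustrative value.
    """
    return minor_strand_fraction(forward_reads, reverse_reads) < min_minor_fraction


print(has_strand_bias(forward_reads=4, reverse_reads=4))  # False
print(has_strand_bias(forward_reads=8, reverse_reads=0))  # True
```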
Duplicated Reads PCR amplification results in the sequencing of duplicate reads. We cannot distinguish between multiple original fragments from the DNA and copies produced by PCR amplification. To deal with PCR amplification, we collapse our data.
Collapsed Data Would you call this a variant?
 CATACCAGTC-------ACTACCATGT
 CATACCAGTC-------ACTACCATGT
 CATACCAGTC-------ACTACCATGT
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT--------ATGTTAGATA
CCATACCAGTCGTAATGAACTACCATGTTAGATACACAGATTAGATA
Now?
 CATACCAGTC-------ACTACCATGT
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT--------ATGTTAGATA
CCATACCAGTCGTAATGAACTACCATGTTAGATACACAGATTAGATA
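Collapsing can be sketched as keeping one copy of every read pair that has the same start position and the same sequence. The example uses the read pairs from this slide, with dashes standing for the unsequenced gap between mates. The 0-based offsets are an assumption derived by matching the reads against the reference, and the helper names are hypothetical. Dedicated duplicate-marking tools typically identify duplicates from alignment coordinates rather than from sequence identity.

```python
# (assumed 0-based offset on the reference, read pair with '-' for the gap)
READS = (
    [(1, "CATACCAGTC-------ACTACCATGT")] * 3
    + [(6, "CATTCGTAAT-----ACCATGATAG")] * 4
    + [(6, "CATTCGTAAT--------ATGTTAGATA")]
)


def collapse_duplicates(reads):
    """Keep one read per (offset, sequence) pair, preserving first-seen order."""
    seen = set()
    collapsed = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            collapsed.append(read)
    return collapsed


def variant_support(reads, position, alt_base):
    """Count reads showing alt_base at a 0-based reference position."""
    return sum(
        1
        for offset, sequence in reads
        if 0 <= position - offset < len(sequence)
        and sequence[position - offset] == alt_base
    )


collapsed = collapse_duplicates(READS)
print(len(READS), variant_support(READS, 27, "A"))          # 8 4
print(len(collapsed), variant_support(collapsed, 27, "A"))  # 3 1
```

Before collapsing, 4 of 8 read pairs show an A at reference position 27; afterwards only 1 of 3 does, which is the point of the "Now?" question.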
Ideal Case Lots of staggered reads. Multiple reads supporting a variant, ideally in the expected ratios (i.e. 1.0, 0.5, 0). Support from both strands. A low number of variants in any one read.