Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014



Outline
What are SNPs and SNVs?
How do we identify them?
How do we call them?
SAMTools
GATK
VCF File Format
Let's call variants!

Single Nucleotide Polymorphisms (SNPs)
A single-nucleotide polymorphism (SNP, pronounced "snip") is a variation in a single base of DNA that is present in at least 1% of the population.
http://www.springerreference.com/docs/html/chapterdbid/334682.html

Single Nucleotide Variants (SNVs)
Single-nucleotide variants (SNVs) include both rare (<1%) and common (≥1%) variants of a single base pair; the common ones are the SNPs.
http://www.springerreference.com/docs/html/chapterdbid/334682.html
[Figure labels: SNV, SNP]
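The 1% cutoff from the two definitions above can be written as a small check. This is only a sketch: the function name and the idea of passing a population allele frequency directly are assumptions made for illustration, not part of the slides.

```python
def classify_by_frequency(allele_frequency, snp_threshold=0.01):
    """Label a single-base variant as a common SNP or a rare SNV.

    allele_frequency: fraction of the population carrying the variant (0..1).
    snp_threshold: the 1% cutoff quoted on the slides.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    if allele_frequency >= snp_threshold:
        return "SNP (common SNV)"
    return "rare SNV"


print(classify_by_frequency(0.20))   # a common variant
print(classify_by_frequency(0.002))  # a rare variant
```

Every SNP is an SNV under this definition; the label only records which side of the cutoff the variant falls on.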

Subtypes of SNVs
Coding SNV
  Synonymous
  Non-synonymous
    Missense
    Nonsense
Non-coding SNV
  Non-coding regions of genes (e.g. introns)
  Intergenic regions (regions between genes)

Synonymous SNV
Type: Single base pair change
Where: Coding region (exon)
Feature: No amino acid change

Non-synonymous SNV
Type: Single base pair change
Where: Coding region (exon)
Feature: Amino acid change
Two subtypes: missense and nonsense

Missense Mutation
Type: Single base pair change
Where: Coding region (exon)
Feature: One amino acid change

Nonsense Mutation
Type: Single base pair change
Where: Coding region (exon)
Feature: Premature stop (nonsense) codon -> protein truncation
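The three coding categories (synonymous, missense, nonsense) can be told apart by translating the codon before and after the change. The sketch below uses the standard genetic code; the function name and the codon-plus-position interface are assumptions for illustration, and a change that removes an existing stop codon is not handled as its own category.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon, position, alt_base):
    """Classify a single-base change inside one codon.

    ref_codon: three-letter reference codon, e.g. "GAA".
    position: 0, 1 or 2, the changed base within the codon.
    alt_base: the observed base at that position.
    """
    if alt_base == ref_codon[position]:
        raise ValueError("alt_base is identical to the reference base")
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"   # no amino acid change
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    return "missense"         # one amino acid changed


print(classify_coding_snv("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_coding_snv("GAA", 1, "T"))  # GAA -> GTA, glutamate to valine
print(classify_coding_snv("TAC", 2, "A"))  # TAC -> TAA, tyrosine to stop
```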

Why are SNVs important?
Human disease
  Associations between SNPs and diseases
  OMIM, HGMD (Human Gene Mutation Database)
Cancer
  Normal vs tumor sample
Response to drugs, chemicals, and pathogens
GWAS (Genome Wide Association Studies)
  GWAS Central

SNV Databases
dbSNP (Single Nucleotide Polymorphism database)
  Up until v138
  Only common variants (not disease related)
COSMIC (Catalogue of Somatic Mutations in Cancer)

Outline
What are SNVs and SNPs?
How do we identify them?
How do we call them?
SAMTools
GATK
VCF File Format
Let's call variants!

Identifying SNVs
Identifying SNVs can be challenging, and there are many tools available to help with this.
Let's first look at what an SNV may look like.
Assume the samples are human (diploid).

Inheritance
You inherit genetic material from your mother and your father.
You have 2 copies of each chromosome.
At any given base position, then, the genotype should be either homozygous or heterozygous.
AA: homozygous
AB: heterozygous

Allelic Fractions
We can keep allelic fractions in mind when looking at SNVs.
Homozygous samples should have 100% of the bases showing one allele.
Heterozygous samples should be ~50/50.
This gets complicated with cancer genomes.
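The expected fractions (100% for homozygous, about 50/50 for heterozygous) can be turned into a rough genotype guess from read counts. This is a minimal sketch for a diploid, non-cancer sample; the function names and the tolerance of 0.2 are assumptions chosen for illustration, not recommended values.

```python
def variant_allele_fraction(ref_count, alt_count):
    """Fraction of reads at a position that show the alternate allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_count / total


def guess_genotype(vaf, tolerance=0.2):
    """Map an allele fraction to the nearest expected diploid ratio.

    tolerance is a hypothetical margin around the ideal 0, 0.5 and 1.0.
    """
    if vaf >= 1.0 - tolerance:
        return "homozygous variant"
    if abs(vaf - 0.5) <= tolerance:
        return "heterozygous"
    if vaf <= tolerance:
        return "homozygous reference"
    return "ambiguous"


# Counts in the style of the pileup examples: 8 reads per position.
print(guess_genotype(variant_allele_fraction(0, 8)))  # all reads show the variant
print(guess_genotype(variant_allele_fraction(4, 4)))  # half the reads show it
print(guess_genotype(variant_allele_fraction(7, 1)))  # a single variant read
```

Fractions that fall between the expected ratios are reported as ambiguous rather than forced into a genotype, which is where tumor samples tend to end up.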

Calling SNVs Example 1
         AACTACGGTCCGAGATAGAG
        GAACTACGGTCCGAGATAGA
       AGAACTACGGTCCGAGATAG
      TAGAACTACGGTCCGAGATA
     ATAGAACTACGGTCCGAGAT
    AATAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Homozygous C->A SNV.

Calling SNVs Example 2
         AACTCCGGTCCGAGATAGAG
        GAACTCCGGTCCGAGATAGA
       AGAACTCCGGTCCGAGATAG
      TAGAACTCCGGTCCGAGATA
     ATAGAACTACGGTCCGAGAT
    AATAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Heterozygous C->A SNV.

Calling SNVs Example 3
         AACTCCGGTCCGAGATAGAG
        GAACTCCGGTCCGAGATAGA
       AGAACTCCGGTCCGAGATAG
      TAGAACTCCGGTCCGAGATA
     ATAGAACTCCGGTCCGAGAT
    AATAGAACTCCGGTCCGAGA
   TAATAGAACTCCGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC  Reference
Is this an SNV?

Calling Variants
To call variants you need to think about:
  Read depth (coverage)
  Base quality
  Mapping quality
  Sequencing errors
  Distribution within a read
  Strand bias
  Distinguishing what we would see from PCR bias

Read Depth (Coverage)
Read depth (coverage) refers to how many reads cover a given base in the reference.
                 GAACTACGGTCCGAGATAGA
              ATAGAACTACGGTCCGAGAT
            TAATAGAACTACGGTCCGAG
           GTAATAGAACTACGGTCCGA
CCATACCAGTCGTAATAGAACTACGGTCCGAGATAGAGGATACACAGATTAGATAGGGATACCG
Read Depth = 4, Read Depth = 1, Read Depth = 0 (labels marking reference positions covered by four reads, one read, and no reads)
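Read depth can be counted directly for the reference and four reads on this slide. The sketch below assumes every read matches the reference exactly, so its position can be found with a plain string search; with real data the aligner reports the positions. The function name is made up for illustration.

```python
def depth_per_base(reference, reads):
    """Return a list holding the number of reads covering each reference base.

    Assumes each read matches the reference exactly (true for this slide),
    so str.find is enough to place it.
    """
    depth = [0] * len(reference)
    for read in reads:
        start = reference.find(read)
        if start == -1:
            raise ValueError(f"read does not match the reference: {read}")
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth


REFERENCE = "CCATACCAGTCGTAATAGAACTACGGTCCGAGATAGAGGATACACAGATTAGATAGGGATACCG"
READS = [
    "GAACTACGGTCCGAGATAGA",
    "ATAGAACTACGGTCCGAGAT",
    "TAATAGAACTACGGTCCGAG",
    "GTAATAGAACTACGGTCCGA",
]

DEPTH = depth_per_base(REFERENCE, READS)
print(max(DEPTH), DEPTH[11], DEPTH[0])  # prints: 4 1 0
```

The three printed values correspond to the three depth labels on the slide: the middle of the stack, the first base of the leftmost read, and an uncovered base at the start of the reference.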

Depth
         AACTCCGGTCCGAGATAGAG
        GAACTCCGGTCCGAGATAGA
       AGAACTCCGGTCCGAGATAG
      TAGAACTCCGGTCCGAGATA
     ATAGAACTACGGTCCGAGAT
    AATAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCGAG
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC
Do you call this an SNV? Why or why not?

Depth
  GTAATAGAACTACGGTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC
Do you call this an SNV? Why or why not?

Depth
Researchers typically want a depth of >8x at a given base position before attempting to make an SNV call.
At least 3 reads must show the variant to call it an SNV.
Make sure to consider the average coverage of your sample.
Carefully set a threshold.
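The two rules of thumb on this slide (depth above 8x, at least 3 reads with the variant) can be combined into a simple filter over the bases observed at one position. This is a sketch of those thresholds only, not how SAMtools or GATK decide; the function and parameter names are assumptions.

```python
from collections import Counter


def call_snv(ref_base, pileup_bases, depth_cutoff=8, min_alt_reads=3):
    """Return (alt_base, alt_reads, depth) if the position passes, else None.

    pileup_bases: the base each read shows at this position, e.g. "CCCAA".
    depth_cutoff: depth must be strictly greater, following the ">8x" rule.
    min_alt_reads: minimum number of reads that must show the variant.
    """
    depth = len(pileup_bases)
    if depth <= depth_cutoff:
        return None
    alt_counts = Counter(base for base in pileup_bases if base != ref_base)
    if not alt_counts:
        return None
    alt_base, alt_reads = alt_counts.most_common(1)[0]
    if alt_reads < min_alt_reads:
        return None
    return alt_base, alt_reads, depth


print(call_snv("C", "CCCCCAAAAA"))  # 10 reads, 5 with A
print(call_snv("C", "CCCCAAAA"))    # 8 reads: depth is not above 8
print(call_snv("C", "CCCCCCCCAA"))  # 10 reads, only 2 with A
```

Both thresholds are parameters because, as the slide says, they should be set with the sample's average coverage in mind.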

Sequencing Errors
Sequencing errors are random, but they follow the base quality trend:
  Usually more errors towards the end of reads
  We would rarely see them at the same position in different reads
       AGAACTACGGTCCGAGACAG
      TAGAACTACGGTCTGAGATA
     ATAGAACTACGGTCCGAGAT
    AAAAGAACTACGGTCCGAGA
   TAATAGAACTACGGTCCCAG
  GTAATAGAACTCCGGTCCGA
TCGTAATAGAACTACGGTCCGAGATAGAGGATAC
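Base quality is usually reported on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The slides do not spell this out, so the sketch below is background rather than part of the deck; it assumes the common Phred+33 ASCII encoding of FASTQ quality strings, and the example quality string is invented.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability the base is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


# Invented quality string whose scores drop towards the end of the read.
QUALITIES = decode_phred33("IIIIIIII55")
for position, q in enumerate(QUALITIES):
    print(position, q, phred_to_error_probability(q))
```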

Distribution within a Read
  GTAATATAACTACGCTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC
What is happening here?
Multiple mismatches within a read are a sign of possible misalignment.
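Counting mismatches per read makes this check concrete. The sketch below compares the read on this slide with the reference, taking the read to start at the third reference base (0-based offset 2), which is the placement that gives the fewest mismatches. The function name and the cutoff of 2 mismatches are assumptions for illustration.

```python
def find_mismatches(reference, read, start):
    """List (reference_position, reference_base, read_base) per mismatch.

    start: 0-based position in the reference where the read is aligned.
    """
    window = reference[start:start + len(read)]
    return [
        (start + i, ref_base, read_base)
        for i, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]


REFERENCE = "TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC"
READ = "GTAATATAACTACGCTCCGA"

MISMATCHES = find_mismatches(REFERENCE, READ, start=2)
MAX_MISMATCHES = 2  # hypothetical cutoff, not a recommended value
SUSPICIOUS = len(MISMATCHES) > MAX_MISMATCHES

print(MISMATCHES)
print("possible misalignment" if SUSPICIOUS else "looks fine")
```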

Why do Mis-alignments occur?
We are forcing the aligner to compare our sequence reads against a known reference.
The aligner tries to find the best alignment position against the reference provided.
Contaminating reads may still align if the originating organisms are similar enough.

Mis-alignments
     ATATAACTACGCTCCGAGAT
    AATATAACTACGCTCCGAGA
   TAATATAACTACGCTCCGAG
  GTAATATAACTACGCTCCGA
TCGTAATAGAACTCCGGTCCGAGATAGAGGATAC
It is possible that all 3 mutations are true.
More likely, though, this is a problematic region of the genome that has mis-alignment issues.

Gene Families
A set of several similar genes, formed by duplication of a single original gene, and generally with similar biochemical functions.
Similar sequences
Gene families are typically problematic across a single genome.

Mis-alignments
Always align against the whole genome of an organism, even if we do targeted sequencing.
This will reduce the chances of mis-alignments.
Major human reference genomes contain:
  Chr1-22, X and Y
  Many fragmented chromosomes

Strand Bias
All evidence for a variant comes from only the forward or only the reverse strand.
This implies a problematic area in the genome or biases in the capture technology.

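Strand Bias
[Figure: eight reads showing T at the variant position; label: Variant]

Strand Bias
[Figure: six reads showing T at the variant position, in two groups of three; label: Variant?]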
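A basic strand bias check counts how many variant-supporting reads come from each strand. The sketch below flags a variant when one strand contributes almost none of the evidence; the function name and the 10% cutoff are assumptions for illustration, and variant callers typically use statistical tests instead.

```python
def has_strand_bias(forward_alt_reads, reverse_alt_reads, min_minor_fraction=0.1):
    """Return True if nearly all variant evidence comes from one strand.

    forward_alt_reads / reverse_alt_reads: counts of reads showing the
    variant on each strand.
    min_minor_fraction: hypothetical minimum share expected from the
    less-represented strand.
    """
    total = forward_alt_reads + reverse_alt_reads
    if total == 0:
        raise ValueError("no reads support the variant")
    minor_fraction = min(forward_alt_reads, reverse_alt_reads) / total
    return minor_fraction < min_minor_fraction


print(has_strand_bias(4, 4))  # evidence on both strands
print(has_strand_bias(6, 0))  # all evidence on the forward strand
```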

Duplicated Reads
PCR amplification results in the sequencing of duplicate reads.
We cannot distinguish whether identical reads came from multiple fragments of the DNA or from PCR amplification.
To deal with PCR amplification, we collapse our data.

Collapsed Data
 CATACCAGTC-------ACTACCATGT
 CATACCAGTC-------ACTACCATGT
 CATACCAGTC-------ACTACCATGT
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT--------ATGTTAGATA
CCATACCAGTCGTAATGAACTACCATGTTAGATACACAGATTAGATA
Would you call this a variant?

 CATACCAGTC-------ACTACCATGT
      CATTCGTAAT-----ACCATGATAG
      CATTCGTAAT--------ATGTTAGATA
CCATACCAGTCGTAATGAACTACCATGTTAGATACACAGATTAGATA
Now?
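Collapsing can be sketched as keeping one copy of every read that has the same start position and the same sequence. The read strings below are taken from this slide (dashes mark the gap inside each read); the 0-based start positions are a reading of how the reads line up against the reference and should be treated as assumptions, as should the function name. Dedicated tools work from the alignment coordinates in the BAM file rather than from raw strings.

```python
def collapse_duplicates(reads):
    """Keep the first read for each (start, sequence) key, preserving order.

    reads: list of (start_position, sequence) tuples.
    """
    seen = set()
    unique = []
    for start, sequence in reads:
        key = (start, sequence)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


READS = (
    [(1, "CATACCAGTC-------ACTACCATGT")] * 3
    + [(6, "CATTCGTAAT-----ACCATGATAG")] * 4
    + [(6, "CATTCGTAAT--------ATGTTAGATA")]
)

COLLAPSED = collapse_duplicates(READS)
print(len(READS), "reads before,", len(COLLAPSED), "after collapsing")
for start, sequence in COLLAPSED:
    print(" " * start + sequence)
```

After collapsing, each distinct fragment counts once, so the allelic fraction at a position reflects independent fragments rather than amplification copies.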

Ideal Case
Lots of staggered reads
Multiple reads supporting a variant
  Ideally in expected ratios (i.e. 1.0, 0.5, 0)
  Both strands
Low number of variants in any single read