SNP calling and VCF format


Laurent Falquet, Oct 12

SNP? What is this?

A type of genetic variation, among others:
- Family of single nucleotide aberrations
  - Single Nucleotide Polymorphisms (SNPs)
  - Single Nucleotide Variations (SNVs)
- Short insertions or deletions (indels), less than 50 bp
- Larger structural variations (SVs)
  - large indels
  - inversions
  - translocations
  - CNVs
  - ...

SNPs vs SNVs

Both are aberrations at a single nucleotide, but they differ in their frequency of occurrence.

SNP
- Aberration expected at the position for any member of the species
- Occurs in the population at some frequency (usually > 1%)
- Validated in the population
- Catalogued in dbSNP (http://www.ncbi.nlm.nih.gov/snp)

SNV
- Aberration seen in only one individual
- Occurs at low frequency
- Not validated in the population

SNP example: comparison of 2 diploid individuals vs a reference genome

Sample   SNP genotype
Ref      A
Ind1     G/G
Ind2     A/G

[Figure: SNP real-life example]

Why look for SNPs/SNVs?

SNPs may lead to a change in the function or expression of a gene.
- A non-synonymous SNP has an impact on the protein sequence, for example:
  - premature stop codon
  - different fold of the protein

Genetic markers: a SNP may be linked to a gene for a given trait
- response to a pathogen (susceptible or resistant)
- a phenotype

Types of SNPs/SNVs

The effect of a SNP varies depending on its location.
- Intergenic regions: may alter the sequence of regulatory RNAs
- Non-coding regions: alteration of promoter and enhancer sequences may change the expression of a gene
- Coding regions: substitutions
  - synonymous: no change in the amino acid
  - non-synonymous: change in the amino acid

Other variants: indels

Insertion/deletion
- Sometimes a matter of perspective: does the reference have an insertion, or does the query (e.g. a read sequence) have a deletion?
- Differ from SNPs by having at least one nucleotide extra or missing when compared to a reference sequence.
- Can cause frame-shifts: the codons shift by one, creating a different protein sequence after the indel.

Reference        ATCGCGTTTGCCATGGCC     I A F A M A
1-bp insertion   ATCGCGTTTCGCCATGGCC    I A F R H G
1-bp deletion    ATCGCGTTGCCATGGCC      I A L P W P

Note: indels of a length divisible by 3 cause whole amino acid insertions/deletions, not frame-shifts.
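The frame-shift example can be checked with a few lines of Python. This is a minimal sketch: it builds the standard genetic code, translates only complete codons from the first base, and ignores any trailing partial codon (so the last residue of the 1-bp deletion is not reported). The `in_frame_deletion` sequence is a hypothetical addition, not from the slides, that removes the whole TTT codon to illustrate the note about lengths divisible by 3.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons starting at position 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


reference = "ATCGCGTTTGCCATGGCC"
insertion = "ATCGCGTTTCGCCATGGCC"    # extra C after TTT (from the slide)
deletion = "ATCGCGTTGCCATGGCC"       # one T removed (from the slide)
in_frame_deletion = "ATCGCGGCCATGGCC"  # hypothetical: whole TTT codon removed

for name, seq in [
    ("reference", reference),
    ("1-bp insertion", insertion),
    ("1-bp deletion", deletion),
    ("3-bp deletion", in_frame_deletion),
]:
    print(f"{name:15s} {translate(seq)}")
```

The 1-bp indels change every residue downstream of the event, whereas the 3-bp deletion only drops one residue and leaves the rest of the protein unchanged.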

Variant Call Format (VCF 4.3, Oct 2015)
http://samtools.github.io/hts-specs/vcfv4.3.pdf

VCF is a tab-delimited text file format:
- Meta-information lines, starting with ##
- Header line, starting with #
- Data lines, each with information about a position in the genome

Variant Call Format in more detail

The format can also encode genotype information for each sample.
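The three kinds of lines can be told apart by their prefix. The sketch below parses a tiny VCF held in a string; the record itself (position, quality, depth, sample names) is made up for illustration, and the parser deliberately ignores most of the specification (INFO parsing, missing values, multi-line validation). For real files a dedicated library is a better choice.

```python
# Hypothetical two-sample VCF: Ind1 is homozygous G/G, Ind2 is heterozygous A/G.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tInd1\tInd2",
    "chr1\t1234\t.\tA\tG\t50\tPASS\tDP=30\tGT\t1/1\t0/1",
])


def parse_vcf(text):
    """Split VCF text into meta lines, sample names and data records."""
    meta, samples, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):          # meta-information line
            meta.append(line[2:])
        elif line.startswith("#"):         # header line: 9 fixed columns, then samples
            samples = line[1:].split("\t")[9:]
        elif line:                          # data line
            fields = line.split("\t")
            format_keys = fields[8].split(":")
            genotypes = {
                sample: dict(zip(format_keys, value.split(":")))
                for sample, value in zip(samples, fields[9:])
            }
            records.append({
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4].split(","),
                "genotypes": genotypes,
            })
    return meta, samples, records


meta, samples, records = parse_vcf(EXAMPLE_VCF)
for record in records:
    print(record["chrom"], record["pos"], record["ref"], record["alt"], record["genotypes"])
```

In the GT field, allele 0 is the reference allele and 1 is the first alternative allele, so `1/1` corresponds to G/G and `0/1` to A/G here.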

[Figure: IGV visualization of a VCF file]

Tools for SNP calling (non-exhaustive list)
- samtools
- VarScan2
- GATK (Picard tools required)
- Strelka
- FreeBayes

These tools generally take a BAM/SAM file as input and produce a VCF-like file as output.

Variant Calling Methods and Tools

More than 15 different algorithms, but three main categories:
- Allele counting with simple cutoff rules
- Probabilistic methods, e.g. a Bayesian model
  - quantify statistical uncertainty
  - assign priors based on the observed allele frequency of multiple samples
- Heuristic approach
  - based on thresholds for read depth, base quality, variant allele frequency, statistical significance

Identifying SNPs

Filter SNPs based on some rules:
- Coverage: minimum depth of coverage required?
- Genotype: genotypes, or combination of genotypes?
- Alternative allele frequency: 0.5? 0.33?
- Absent in dbSNP or other databases
- Exclude LOH (loss of heterozygosity) events
- Retain non-synonymous SNVs present in a given number of reads
- High mapping and SNV quality
- SNV density in a given bp window
- SNV greater than a given bp from a predicted indel
- Strand balance/bias
- Concordance across various SNV callers
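To make the probabilistic category concrete, here is a minimal sketch of a Bayesian genotype call at one diploid site. It is not the model of any particular caller: it assumes a single fixed per-read error rate, a binomial likelihood for the number of alternative-allele reads, and flat priors unless others are supplied. The function name and defaults are illustrative.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype given read counts.

    Assumed model: every read independently shows the alternative allele with
    probability error_rate (0/0), 0.5 (0/1) or 1 - error_rate (1/1).
    """
    depth = ref_count + alt_count
    p_alt_read = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    if priors is None:
        priors = {genotype: 1.0 / 3.0 for genotype in p_alt_read}  # flat prior (assumption)

    weighted = {
        genotype: priors[genotype]
        * comb(depth, alt_count)
        * p ** alt_count
        * (1.0 - p) ** ref_count
        for genotype, p in p_alt_read.items()
    }
    total = sum(weighted.values())
    return {genotype: value / total for genotype, value in weighted.items()}


def call_genotype(ref_count, alt_count, **kwargs):
    """Return the most probable genotype and its posterior probability."""
    posteriors = genotype_posteriors(ref_count, alt_count, **kwargs)
    best = max(posteriors, key=posteriors.get)
    return best, posteriors[best]


for ref_reads, alt_reads in [(30, 0), (14, 16), (0, 30), (9, 1)]:
    print(ref_reads, alt_reads, call_genotype(ref_reads, alt_reads))
```

Replacing the flat prior with one derived from population allele frequencies corresponds to the "assign priors" point above; the posterior of the best genotype is what a caller would turn into a genotype quality.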

GATK recommended pipeline

Mapping and deduplicate
- Mapping was seen previously (BWA, Bowtie2, etc.)
- Removing or marking PCR duplicates can be achieved with samtools or Picard tools
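The idea behind duplicate marking can be sketched in Python. This is a deliberately simplified illustration and not the algorithm of samtools or Picard: it assumes single-end reads, treats reads with the same chromosome, start position and strand as duplicates of each other, and keeps the one with the highest summed base quality. The dictionary keys and the example reads are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads considered duplicates.

    Simplifying assumption: reads sharing (chrom, pos, strand) come from the
    same original fragment; the read with the highest qual_sum is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda read: read["qual_sum"])
        duplicates.update(read["name"] for read in group if read is not best)
    return duplicates


# Hypothetical reads: r1 and r2 start at the same place on the same strand.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 900},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 950},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 800},
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "qual_sum": 700},
]
print(mark_duplicates(reads))
```

Marking rather than removing keeps the reads in the BAM file with a flag set, so downstream tools can decide whether to ignore them.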

Indel realignment

Hidden indel realignment (strand-discordant locus)

FreeBayes works without realignment. FreeBayes is haplotype-based: it calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment. This method avoids one of the core problems with alignment-based variant detection: identical sequences may have multiple possible alignments.
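The ambiguity mentioned above is easy to demonstrate. The sketch below does not reproduce FreeBayes; it only lists every position at which a single-base deletion from the reference yields the same read sequence, using the reference and 1-bp deletion sequences from the indel example earlier in the slides.

```python
def equivalent_deletion_positions(reference, haplotype):
    """Return every 0-based reference position whose deletion produces the haplotype."""
    return [
        i
        for i in range(len(reference))
        if reference[:i] + reference[i + 1:] == haplotype
    ]


reference = "ATCGCGTTTGCCATGGCC"
read = "ATCGCGTTGCCATGGCC"  # one T missing from the TTT run

positions = equivalent_deletion_positions(reference, read)
print(positions)
for i in positions:
    # Show the gapped alignment implied by each choice.
    print(read[:i] + "-" + read[i:])
```

All three alignments describe the same sequence, so an aligner may place the gap differently in different reads; comparing the literal haplotype sequences sidesteps that choice.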

Base Quality Score Recalibration

The base quality scores in the BAM file are copied from the FASTQ file, so they reflect the quality of the reads as estimated by the sequencing machine rather than anything learned from the mapping. Recalibrating the scores using the mapping information makes it possible to correct systematic errors in the base quality scores.
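The core arithmetic of a recalibration can be sketched as follows. This is a simplified illustration, not GATK's implementation: it assumes the only covariate is the reported quality score (real recalibration also conditions on covariates such as read group, machine cycle and sequence context), that mismatches at known variant sites have already been excluded, and it uses add-one smoothing, which is a choice made here to avoid taking the logarithm of zero.

```python
from collections import defaultdict
from math import log10


def empirical_qualities(observations):
    """Map each reported Phred score to the quality actually observed.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per
    aligned base outside known variant sites.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total bases]
    for reported_quality, is_mismatch in observations:
        counts[reported_quality][0] += int(is_mismatch)
        counts[reported_quality][1] += 1

    # Phred scale: Q = -10 * log10(error probability), with add-one smoothing.
    return {
        quality: round(-10 * log10((mismatches + 1) / (total + 2)), 1)
        for quality, (mismatches, total) in counts.items()
    }


# Hypothetical data: 998 bases reported as Q30, of which 9 mismatch the reference.
observations = [(30, True)] * 9 + [(30, False)] * 989
recalibrated = empirical_qualities(observations)
print(recalibrated)
```

In this made-up example the machine claimed an error rate of 1 in 1000 (Q30), while the alignments suggest roughly 1 in 100, so the recalibrated score is about Q20.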

GATK methods for SNP calling

Recommended but slow

Haploid vs diploid genomes

Warning: many SNV callers are designed for diploid genomes. They call both homozygous and heterozygous variants. For haploid genomes only homozygous variants are of interest, so the heterozygous calls can be filtered out.
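Filtering out heterozygous calls amounts to inspecting the GT field of each record. The sketch below works on VCF lines held in a list; the records are hypothetical, only one sample column is examined, and reference-only (`0/0`) or missing (`./.`) genotypes are dropped as well since they are not variants.

```python
def is_homozygous_alt(genotype):
    """True for genotypes such as 1/1 or 2|2; False for 0/1, 0/0 or ./."""
    alleles = genotype.replace("|", "/").split("/")
    return len(set(alleles)) == 1 and alleles[0] not in ("0", ".")


def keep_for_haploid(vcf_lines, sample_index=0):
    """Keep header lines and records whose chosen sample is homozygous for an ALT allele."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")
        sample_values = fields[9 + sample_index].split(":")
        genotype = sample_values[format_keys.index("GT")]
        if is_homozygous_alt(genotype):
            kept.append(line)
    return kept


# Hypothetical calls from a diploid-mode caller run on a haploid genome.
vcf_lines = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t100\t.\tA\tG\t60\tPASS\t.\tGT:DP\t1/1:30",
    "chr1\t200\t.\tC\tT\t40\tPASS\t.\tGT:DP\t0/1:28",
]
for line in keep_for_haploid(vcf_lines):
    print(line)
```

Heterozygous calls in a haploid genome often point to mapping problems or repeats, so it can be worth inspecting them before discarding them.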