Read Mapping and Variant Calling. Johannes Starlinger

Similar documents
SNP calling and VCF format

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Prioritization: from vcf to finding the causative gene

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

NGS in Pathology Webinar

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es

Introduction to Bioinformatics Finish. Johannes Starlinger

Comparing a few SNP calling algorithms using low-coverage sequencing data

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Variant calling in NGS experiments

Variant Finding. UCD Genome Center Bioinformatics Core Wednesday 30 August 2016

Next-Generation Sequencing. Technologies

14 March, 2016: Introduction to Genomics

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012

Variant Discovery. Jie (Jessie) Li PhD Bioinformatics Analyst Bioinformatics Core, UCD

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

Normal-Tumor Comparison using Next-Generation Sequencing Data

Variant Analysis. CB2-201 Computational Biology and Bioinformatics! February 27, Emidio Capriotti!

CITATION FILE CONTENT / FORMAT

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Variant Callers. J Fass 24 August 2017

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014

Analytics Behind Genomic Testing

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing:

About Strand NGS. Strand Genomics, Inc All rights reserved.

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Review Article A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

SUPPLEMENTARY INFORMATION

Annotating your variants: Ensembl Variant Effect Predictor (VEP) Helen Sparrow Ensembl EMBL-EBI 2nd November 2016

Processing Ion AmpliSeq Data using NextGENe Software v2.3.0

How to use Variant Effects Report

Next Generation Sequencing: Data analysis for genetic profiling

Fast and Accurate Variant Calling in Strand NGS

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

Using VarSeq to Improve Variant Analysis Research

Variant prioritization in NGS studies: Annotation and Filtering "

IDENTIFYING A DISEASE CAUSING MUTATION

Introduction to Short Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Assignment 9: Genetic Variation

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Compute- and Data-Intensive Analyses in Bioinformatics"

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

PeCan Data Portal. rnal/v48/n1/full/ng.3466.html

Setting Standards and Raising Quality for Clinical Bioinformatics. Joo Wook Ahn, Guy s & St Thomas 04/07/ ACGS summer scientific meeting

Sanger vs Next-Gen Sequencing

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION

Genomics: Human variation

Course Presentation. Ignacio Medina Presentation

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

Novel Variant Discovery Tutorial

Supplementary Figures and Data

Functional Annotation and Prioritization of Whole Exome and Whole Genome Sequencing Variants. Mulin Jun Li

SNP calling and Genome Wide Association Study (GWAS) Trushar Shah

HiSeq Whole Exome Sequencing Report. BGI Co., Ltd.

NEXT GENERATION SEQUENCING. Farhat Habib

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

Lecture 7. Next-generation sequencing technologies

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June Introduction

INTRODUCTION TO BIOINFORMATICS. SAINTS GENETICS Ian Bosdet

SNP detection in allopolyploid crops

Introduction to Next Generation Sequencing

Genomes contain all of the information needed for an organism to grow and survive.

RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data.

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte

DATA FORMATS AND QUALITY CONTROL

Supplemental Methods. Exome Enrichment and Sequencing

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

Published online 15 May 2014 Nucleic Acids Research, 2014, Vol. 42, No. 12 e101 doi: /nar/gku392

Analysis of RNA-seq Data. Feb 8, 2017 Peikai CHEN (PHD)

Sample to Insight. Dr. Bhagyashree S. Birla NGS Field Application Scientist

Introduction to RNA-Seq in GeneSpring NGS Software

Towards Personal Genomics

G E N OM I C S S E RV I C ES

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

NGS part 2: applications. Tobias Österlund

The Sentieon Genomic Tools Improved Best Practices Pipelines for Analysis of Germline and Tumor-Normal Samples

Accelerate precision medicine with Microsoft Genomics

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Variant detection analysis in the BRCA1/2 genes from Ion torrent PGM data

Next Generation Sequencing. Tobias Österlund

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

Bioinformatics in next generation sequencing projects

Browsing Genes and Genomes with Ensembl

Introduction to Bioinformatics

Introduction to RNA sequencing

Targeted Sequencing Reveals Large-Scale Sequence Polymorphism in Maize Candidate Genes for Biomass Production and Composition

Strand NGS Variant Caller

The Diploid Genome Sequence of an Individual Human

Data Mining for Biological Data Analysis

Transcription:

Read Mapping and Variant Calling Johannes Starlinger

Application Scenario: Personalized Cancer Therapy Different mutations require different therapy Collins, Meredith A., and Marina Pasca di Magliano. "Kras as a key oncogene and therapeutic target in pancreatic cancer." Frontiers in physiology 4 (2013). Johannes Starlinger: Bioinformatics, Summer Semester 2017 2

Application Scenario: Personalized Cancer Therapy Madeleine Kittner @ WBI Johannes Starlinger: Bioinformatics, Summer Semester 2017 3

Application Scenario: Personalized Cancer Therapy Madeleine Kittner @ WBI Johannes Starlinger: Bioinformatics, Summer Semester 2017 4

Application Scenario: Personalized Cancer Therapy Madeleine Kittner @ WBI Johannes Starlinger: Bioinformatics, Summer Semester 2017 5

Molecular Profiling Workflow Orange India Johannes Starlinger: Bioinformatics, Summer Semester 2017 6

Molecular Profiling Workflow Sequencing Reads ` Read Mapping Genome, Exome, Panel Variant Calling Genes & Mutations Orange India Filter & Rank Filtered List w/ Top-X Candidates Find known Targets & Evidences Target Candidates Discuss at Molecular Tumor Board Johannes Starlinger: Bioinformatics, Summer Semester 2017 7

Molecular Profiling Workflow Sequencing Reads ` Target candidates: Mutated genes, for which a specific drug exists* Read Mapping Genome, Exome, Panel Variant Calling Genes & Mutations Filter & Rank Filtered List w/ Top-X Candidates * Approved, in clinical/preclinical evaluation, known to work in other cancer type, Find known Targets & Evidences Target Candidates Discuss at Molecular Tumor Board Johannes Starlinger: Bioinformatics, Summer Semester 2017 8

Molecular Profiling Workflow Sequencing Reads ` Read Mapping Genome, Exome, Panel This lecture Variant Calling Genes & Mutations Filter & Rank Filtered List w/ Top-X Candidates Find known Targets & Evidences Target Candidates Discuss at Molecular Tumor Board Johannes Starlinger: Bioinformatics, Summer Semester 2017 9

This Lecture Read Mapping Variant Calling Johannes Starlinger: Bioinformatics, Summer Semester 2017 10

Read Mapping A sequencing machine outputs short sequence reads Not whole genome or chromosome as one long sequence Need to reconstruct to whole sequence from the reads General Approach: Given a reference build of the whole genome Given the reads from the sequencing machine Find the best alignment for each read within the reference sequence Together with information about matches, mismatches indels Similar to an edit script Marc Bux @ WBI Johannes Starlinger: Bioinformatics, Summer Semester 2017 11

Read Mapping A sequencing machine outputs short sequence reads Not whole genome or chromosome as one long sequence Need to reconstruct to whole sequence from the reads General Approach: Given a reference build of the whole genome Given the reads from the sequencing machine With a certain depth of coverage of each base Find the best alignment for each read within the reference sequence Together with information about matches, mismatches indels Similar to an edit script Marc Bux @ WBI Johannes Starlinger: Bioinformatics, Summer Semester 2017 12

Read Mapping: the Problem Wanted: fast and accurate method of mapping all reads to the reference - Performing an approximate matching for up to 3 billion reads - BLAST is fast but still too slow here Need to account for different sources of errors Sequencing errors in the reads Errors introduced by the cloning process (PCR) Errors in the reference assembly Need strategy to handle multiple best mappings for a read Generating a quality score for each read including information other than just the alignment score Johannes Starlinger: Bioinformatics, Summer Semester 2017 13

Many Read Mappers available Key element: indexing of reference Similar to BLAST q-gram index > 100 tools for read mapping available (cmp SEQwiki) e.g., Blat, BWA, Bowtie, GSNAP, Maq, RMAP, Stampy, SHRiMP Some faster, some more accurate, some more memory efficient, some better to parallelize, some better for long/short reads etc BWA (Burrows-Wheeler Aligner) ist very popular Uses Burrows-Wheeler transform for indexing and compression (also used by bzip2 and by many other read mappers) 3 algorithms: BWA-backtrack for reads 100 bp BWA-SW for reads 70 bp BWA-MEM for reads 70 bp, faster & more accurate than BWA-SW Johannes Starlinger: Bioinformatics, Summer Semester 2017 14

Different Sequencing Techniques and Read Lengths Quelle: Wikipedia/DNA Sequencing 5/17 Johannes Starlinger: Bioinformatics, Summer Semester 2017 15

Different Types of Genomic Sequencing and Variants Different types of sequencing substrates: Whole genome Whole exome: all protein coding DNA regions Panel: Sequence only a defined list of genomic sites / genes Various different panels exist, e.g., for different tumor types RNASeq: Sequence the mrna present in a cell at a given point in time Different types of genomic variantions single nucleotide variants (SNV) / polymorphisms (SNP) Multi nucleotide polymorphisms (MNP) Fusion, deletion, copy number variation, translocation, inversion Johannes Starlinger: Bioinformatics, Summer Semester 2017 16

This Lecture Read Mapping Variant Calling Johannes Starlinger: Bioinformatics, Summer Semester 2017 17

Variant Calling: Problem definition Input for each position A column of bases (cmp coverage) Mapping quality score for each read Base call quality for each position in the read (from sequencing) Output for each position: Whether the genomic position is Homozygous wildtype (as per reference) Heterozygous Homozygous variant Quelle: Wikipedia Johannes Starlinger: Bioinformatics, Summer Semester 2017 18

Sources of Mismatches True variants Errors introduced by the cloning process (PCR) Errors from sequencing in the reads Errors in the reference assembly Errors in the read mapping Higher depth of coverage helps deal with the errors in the variant calling process Johannes Starlinger: Bioinformatics, Summer Semester 2017 19

PHRED Quality Score Q Originally used by the PHRED variant calling program Now widely used for characterizing error probability for bases PHRED score Q is logarithmically related to error probability P at which a base is wrong Error rate P = 10 (PHRED score Q / 10) Q = -10 log 10 P Example: PHRED score Q30 means error rate P = 10-3 means 1 in 1000 bases will be wrong Johannes Starlinger: Bioinformatics, Summer Semester 2017 20

Error Estimation Given k wildtype bases a n reads n k bases b with a, b {A,C,G,T} a b Observations: True genotype Homozygous a,a Homozygous b,b Heterozygous a,b Number of Errors n k k From: http://www.mi.fu-berlin.de/wiki/pub/abi/genomics12/varcall.pdf Johannes Starlinger: Bioinformatics, Summer Semester 2017 21

Naïve Approach Filter base calls by quality scores Apply frequency filter by column Use frequency thresholds f(b) for variant base b (PHRED Q20): The heterozygous region is the tricky one to characterize Works well with high coverage Johannes Starlinger: Bioinformatics, Summer Semester 2017 22

Naïve Approach Problems Quality threshold leads to loss of information on individual read/base qualities Low sequencing depth (low coverage) leads to undercalling of heterozygous genotypes Does not provide a measure of confidence in the call No way to jugde the quality of a specific call Has been replaced by probabilistic methods MAQ (integrates base and mapping quality scores, provides measure of reliability of the call, uses a Bayesian model) SNVmix, SAMtools, GATK, VarScan2, Freebayes others Johannes Starlinger: Bioinformatics, Summer Semester 2017 23

Data Formats FASTA For sequence data FASTQ For sequence data with quality information SAM Sequence Alignment Map Alignment details for every read VCF Variant Call Format Variants found wrt reference Johannes Starlinger: Bioinformatics, Summer Semester 2017 24

Last Step: Variant Annotation For each identified variant from the VCF file, find Coding region? Which gene? Protein domain affected Pathway affected Population statistics (1000 genomes, TCGA, etc) Predicted effect (pathogenic?) Druggable? Evidence Level? Using numerous public databases dbsnp, CIViC, ClinVar, COSMIC, Using specialized tools (effect) SIFT, PolyPhen, ClinGen, Johannes Starlinger: Bioinformatics, Summer Semester 2017 25

Variant Annotation Tools exist for subsets of information ANNOVAR is very popular Typical data integration problem Get data from many different public databases Map data elements onto each other Different schemata, different names for elements Map identification schemes onto each other Genes can have IDs from RefSeq, HGVS, Entres, HUGO, UniProt, Ensembl etc Map between reference builds! Information constantly updated Johannes Starlinger: Bioinformatics, Summer Semester 2017 26