Statistical method for Next Generation Sequencing pipeline comparison

Pascal Roy, MD PhD
EPICLIN 2016, Strasbourg, 25-27 May 2016

MH Elsensohn (1-4)*, N Leblay (1-4), S Dimassi (5,6), A Campan-Fournier (5,6), A Labalme (5), D Sanlaville (5,6), G Lesca (5,6), C Bardel (1-4), P Roy (1-4)

1 Service de Biostatistique, Hospices Civils de Lyon, Lyon, France
2 Université de Lyon, Lyon, France
3 Université Lyon 1, Villeurbanne, France
4 CNRS UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Equipe Biostatistique-Santé, Villeurbanne, France
5 Service de Génétique, Hospices Civils de Lyon, Lyon, France
6 Centre de recherche en Neurosciences de Lyon (CNRS UMR 5292, INSERM U1028, Université Claude Bernard Lyon 1, Université Jean Monnet Saint-Etienne, and Hospices Civils de Lyon), Lyon, France

DNA sequencing
- The Sanger method (1977) is considered the gold standard
- New sequencers have been designed since 2000
- Next Generation Sequencing technologies have been available since 2010

Pipeline
- A pipeline is a combination of software programs covering the successive steps of NGS data analysis
- Both academic and commercial software are available
- How to compare two NGS pipelines with one another?
- How to compare each of them with the Sanger gold standard?

DNA sequencing
Genetics Department, Hospices Civils de Lyon; whole blood; informed consent

NGS sequencing:
- 43 epileptic patients
- 41 genes (epilepsy / mental retardation)
- Ion Torrent PGM
- BWA-GATK and TMAP-NextGENe pipelines

Sanger technique:
- 30 epileptic patients
- 1 to 3 genes according to clinical signs

- Statistical unit = chromosomal position on the reference sequence hg19
- Each patient = a single study
- All patients = a meta-analysis

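2-by-2 pairwise table for agreement
- p = 1, ..., P: patient
- z = A, B: pipeline
- k = 1, ..., K: chromosomal position

$$X_{pzk} = \begin{cases} 1 & \text{if pipeline } z \text{ disagrees with hg19 at position } k \text{ in patient } p \\ 0 & \text{otherwise} \end{cases}$$

$$n\pi_{pab} = \sum_{k=1}^{K} I(X_{pAk} = a)\, I(X_{pBk} = b), \qquad a \in \{0, 1\},\ b \in \{0, 1\}$$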
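As a minimal sketch of how one patient's 2-by-2 table could be tabulated, assume each pipeline's calls are stored as a 0/1 list over the same K positions. The function name `pairwise_table` and the toy data are illustrative assumptions, not taken from the slides.

```python
def pairwise_table(calls_a, calls_b):
    """Count positions by (pipeline A call, pipeline B call).

    calls_a, calls_b: equal-length sequences of 0/1, where 1 means the
    pipeline reports a disagreement with hg19 (a variant) at that position.
    Returns a dict keyed by (a, b) with a, b in {0, 1}.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both pipelines must cover the same positions")
    table = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(calls_a, calls_b):
        table[(a, b)] += 1
    return table


# Toy data for one hypothetical patient with K = 8 positions.
pipeline_a = [0, 1, 1, 0, 0, 1, 0, 0]
pipeline_b = [0, 1, 0, 0, 0, 1, 1, 0]
table = pairwise_table(pipeline_a, pipeline_b)
print(table)
```

The diagonal cells (0, 0) and (1, 1) hold the positions on which the two pipelines agree; the off-diagonal cells hold the discordant positions.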

- Agreement on position
- Agreement on position and nature
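The difference between the two levels of agreement can be sketched as follows, assuming each pipeline's variant calls for one patient are stored as a dictionary from chromosomal position to called allele. The function name `agreement`, the data layout, and the toy calls are hypothetical. The ratio printed at the end is only a crude empirical counterpart of a conditional probability of identity; the slides estimate that quantity from a model.

```python
def agreement(calls_a, calls_b):
    """Compare two pipelines' variant calls for one patient.

    calls_a, calls_b: dicts mapping a chromosomal position to the called
    alternate allele (a string); only variant positions are listed.
    Returns (positions called by both, positions where alleles also match).
    """
    shared = set(calls_a) & set(calls_b)
    identical = {pos for pos in shared if calls_a[pos] == calls_b[pos]}
    return shared, identical


# Toy calls: both pipelines flag position 250 but report different alleles.
calls_a = {101: "T", 250: "G", 377: "AT", 512: "C"}
calls_b = {101: "T", 250: "A", 377: "AT", 640: "G"}

shared, identical = agreement(calls_a, calls_b)
print("agreement on position:", sorted(shared))
print("agreement on position and nature:", sorted(identical))
print("share of identical calls among shared positions:",
      len(identical) / len(shared))
```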

With Gold Standard
- Sanger variants: sensitivity comparison
- Sanger non-variants: specificity comparison
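A minimal sketch of the two comparisons for a single pipeline, assuming Sanger and pipeline calls are aligned 0/1 lists over the Sanger-sequenced positions. The function name and toy data are illustrative, and scaling the false-positive rate per 10 000 Sanger non-variants is an assumption about the reporting unit, not something this slide states.

```python
def sensitivity_and_fp_rate(sanger, pipeline, per=10_000):
    """Compare one pipeline with the Sanger gold standard.

    sanger, pipeline: equal-length 0/1 sequences (1 = variant called).
    Returns (sensitivity in %, false positives per `per` Sanger non-variants).
    """
    if len(sanger) != len(pipeline):
        raise ValueError("calls must cover the same positions")
    pairs = list(zip(sanger, pipeline))
    tp = sum(1 for s, c in pairs if s == 1 and c == 1)  # true positives
    fn = sum(1 for s, c in pairs if s == 1 and c == 0)  # missed variants
    fp = sum(1 for s, c in pairs if s == 0 and c == 1)  # false positives
    tn = sum(1 for s, c in pairs if s == 0 and c == 0)  # true negatives
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one Sanger variant and one non-variant")
    sensitivity = 100 * tp / (tp + fn)
    fp_rate = per * fp / (fp + tn)
    return sensitivity, fp_rate


# Toy data: 4 Sanger variants followed by 16 Sanger non-variants.
sanger = [1, 1, 1, 1] + [0] * 16
pipeline = [1, 1, 1, 0] + [1] + [0] * 15
sens, fp_rate = sensitivity_and_fp_rate(sanger, pipeline)
print(sens, fp_rate)
```

Specificity follows directly as one minus the false-positive proportion, so the two quantities carry the same information.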

Alternative approach
- Fitting mixed-effects log-linear models
- Random effects for biological variability

Saturated model for variant identity:

$$\log(n\pi_{pab}) = \mu_p + a\lambda_p^{A} + b\lambda_p^{B} + ab\,(\theta_p + I\theta_{ps})$$
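In a saturated log-linear model for a single 2-by-2 table, written as log(n π_ab) = μ + aλ^A + bλ^B + abθ, the interaction term θ equals the log odds ratio log(π_11 π_00 / (π_10 π_01)). The sketch below uses that relation to compute one log odds ratio per patient and then pools them with fixed-effect inverse-variance weights. This is a deliberately simpler stand-in for the mixed-effects models named on the slide, not the authors' method; the 0.5 continuity correction, the function names, and the toy tables are assumptions.

```python
import math


def log_odds_ratio(n00, n01, n10, n11, correction=0.5):
    """Log odds ratio for agreement and its approximate (Woolf) variance.

    The correction is added to every cell to avoid division by zero.
    """
    cells = [n + correction for n in (n00, n01, n10, n11)]
    c00, c01, c10, c11 = cells
    log_or = math.log((c11 * c00) / (c10 * c01))
    variance = sum(1.0 / c for c in cells)
    return log_or, variance


def pooled_odds_ratio(tables):
    """Fixed-effect inverse-variance pooling of per-patient log odds ratios."""
    weighted_sum = 0.0
    total_weight = 0.0
    for cells in tables:
        log_or, variance = log_odds_ratio(*cells)
        weighted_sum += log_or / variance
        total_weight += 1.0 / variance
    return math.exp(weighted_sum / total_weight)


# Toy per-patient tables as (n00, n01, n10, n11); values are made up.
tables = [(950, 20, 15, 15), (940, 25, 10, 25), (960, 10, 20, 10)]
pooled = pooled_odds_ratio(tables)
print(round(pooled, 1))
```

A fixed-effect pool assumes one common odds ratio across patients; random effects, as on the slide, relax that assumption to allow for biological variability between patients.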

Number of variants identified by the pipelines and by the gold standard: all types of variants*

| Variants and pipelines | N | Mean ± SD | Min. | Max. |
|---|---|---|---|---|
| Regions sequenced by MPS only (390 339 base-pairs per patient) | | | | |
| BWA-GATK | 43 | 1871 ± 225.08 | 1198 | 2360 |
| TMAP-NextGENe | 43 | 2280 ± 339.72 | 1214 | 3094 |
| Region sequenced by MPS + Sanger (1 to 3 genes and 1 085 to 16 570 base-pairs per patient) | | | | |
| Sanger | 30 | 2.67 ± 2.88 | 0 | 10 |
| BWA-GATK | 30 | 27.40 ± 20.54 | 3 | 92 |
| TMAP-NextGENe | 30 | 22.77 ± 18.71 | 3 | 75 |

* Single Nucleotide Variants (SNVs), insertions, and deletions

Number of variants identified by the pipelines and by the gold standard: only SNVs

| Variants and pipelines | N | Mean ± SD | Min. | Max. |
|---|---|---|---|---|
| Regions sequenced by MPS only (390 339 base-pairs per patient) | | | | |
| BWA-GATK | 43 | 267 ± 22.04 | 204 | 318 |
| TMAP-NextGENe | 43 | 315 ± 28.32 | 215 | 384 |
| Region sequenced by MPS + Sanger (1 085 to 16 570 base-pairs per patient) | | | | |
| Sanger | 30 | 2.30 ± 2.79 | 0 | 9 |
| BWA-GATK | 30 | 2.77 ± 2.81 | 0 | 10 |
| TMAP-NextGENe | 30 | 2.77 ± 2.81 | 0 | 10 |

Pipeline comparisons: all types of variants
Estimation of parameters

| Variants, pipelines, parameters | Value | 95% CI | 95% BVI |
|---|---|---|---|
| Without Gold Standard * | | | |
| Number of variants for BWA-GATK | 1857.11 | 1789.25; 1927.54 | 1368.17; 2520.7 |
| Number of variants for TMAP-NextGENe | 2253.44 | 2149.93; 2361.94 | 1653.91; 3070.3 |
| OR for agreement | 64.59 | 60.36; 69.11 | 42.02; 99.29 |
| Conditional probability of identity ** | 0.24 | 0.23; 0.25 | 0.20; 0.28 |
| With Gold Standard | | | |
| Sensitivity of BWA-GATK (%) | 63.47 | 45.98; 87.60 | 47.85; 84.18 |
| Sensitivity of TMAP-NextGENe (%) | 63.42 | 45.98; 87.48 | 44.68; 90.02 |
| FP rate for BWA-GATK | 43.03 | 41.22; 45.20 | NA |
| FP rate for TMAP-NextGENe | 35.25 | 33.59; 36.80 | NA |

* Heterogeneous margins and odds-ratios.
** Heterogeneous variant identity parameter.
Heterogeneous margins and odds-ratios for specificity, extended to the sensitivity analysis.
For 10 000 Sanger non-variants (NV).

Pipeline comparisons: only SNVs
Estimation of parameters

| Variants, pipelines, parameters | Value | 95% CI | 95% BVI |
|---|---|---|---|
| Without Gold Standard * | | | |
| Number of SNVs for BWA-GATK | 266.41 | 259.17; 273.84 | 229.84; 308.80 |
| Number of SNVs for TMAP-NextGENe | 314.24 | 305.52; 323.21 | 269.33; 366.65 |
| OR for agreement | 3165.26 | 2955.53; 3389.88 | 1373.08; 7296.64 |
| Conditional probability of identity ** | 0.9987 | 0.9984; 0.9989 | NA |
| With Gold Standard | | | |
| Sensitivity of BWA-GATK (%) | 76.81 | 63.50; 92.92 | NA |
| Sensitivity of TMAP-NextGENe (%) | 76.81 | 63.50; 92.92 | NA |
| FP rate for BWA-GATK | 2.01 | 1.82; 2.24 | NA |
| FP rate for TMAP-NextGENe | 2.01 | 1.82; 2.24 | NA |

* Heterogeneous margins and odds-ratios.
** Homogeneous variant identity parameter.
Homogeneous margins and odds-ratios for specificity, extended to the sensitivity analysis.
For 10 000 Sanger non-variants (NV).

Discussion
- Several standard statistical methods can be adapted to NGS analyses
- More sophisticated experimental designs are needed to analyze the various components of experimental variability, as distinct from biological variability
- More sophisticated decision rules are needed to select appropriate pipelines, taking into account the respective costs of false positives (FP) and false negatives (FN)
- When all variants were analyzed, the performance of the 2 pipelines was low in terms of sensitivity and specificity
- In the substitution (SNV) analyses, better performance was observed, leading to a very high value of the odds-ratio for agreement
- Extensions of these models are needed to evaluate the performance of NGS
