Statistical method for Next Generation Sequencing pipeline comparison

Pascal Roy, MD PhD
EPICLIN 2016, Strasbourg, 25-27 May 2016

MH Elsensohn 1-4*, N Leblay 1-4, S Dimassi 5,6, A Campan-Fournier 5,6, A Labalme 5, D Sanlaville 5,6, G Lesca 5,6, C Bardel 1-4, P Roy 1-4.

1 Service de Biostatistique, Hospices Civils de Lyon, Lyon, France.
2 Université de Lyon, Lyon, France.
3 Université Lyon 1, Villeurbanne, France.
4 CNRS UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Equipe Biostatistique-Santé, Villeurbanne, France.
5 Service de Génétique, Hospices Civils de Lyon, Lyon, France.
6 Centre de recherche en Neurosciences de Lyon (CNRS UMR 5292, INSERM U1028, Université Claude Bernard Lyon 1, Université Jean Monnet Saint-Etienne, and Hospices Civils de Lyon), Lyon, France.

DNA sequencing
The Sanger method (1977) is considered the gold standard
New sequencers have been designed since 2000
Next Generation Sequencing technologies have been available since 2010

Pipeline
An association of software programs for the various steps of NGS data analysis
Academic and commercial software packages are available
How to compare two NGS pipelines with one another? Each with the Sanger gold standard?

DNA sequencing
Genetics Department, Hospices Civils de Lyon; whole blood; informed consent
NGS sequencing: 43 epileptic patients; 41 genes (epilepsy / mental retardation); Ion Torrent PGM; BWA-GATK and TMAP-NextGENe pipelines
Sanger technique: 30 epileptic patients; 1 to 3 genes according to clinical signs

Statistical unit = chromosomal position on the reference sequence hg19
Each patient = a single study
All patients = a meta-analysis
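The "each patient = a single study" view can be illustrated with a generic pooling step. The sketch below uses inverse-variance fixed-effect pooling, which is a simpler stand-in and not the mixed-effects log-linear approach named later in these slides. The function name `pool_fixed_effect` and the toy numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pool_fixed_effect(estimates: Sequence[float],
                      variances: Sequence[float]) -> Tuple[float, float]:
    """Pool per-patient estimates with inverse-variance weights.

    Returns the pooled estimate and its standard error.
    Assumption: a fixed-effect model, i.e. no between-patient variability.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate, and at least one patient")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical toy data: two patients, each contributing one estimate
# (for example a log odds-ratio) and its variance.
pooled, se = pool_fixed_effect([2.0, 4.0], [1.0, 0.25])
print(pooled, se)
```

The more precise patient (variance 0.25) receives four times the weight of the other, so the pooled value sits closer to 4.0 than to 2.0.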

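2-by-2 pairwise table for agreement
p = 1, ..., P: patient
z = A, B: pipeline
k = 1, ..., K: chromosomal position
$$X_{pzk} = \begin{cases} 1 & \text{if disagreement with hg19} \\ 0 & \text{otherwise} \end{cases}$$
$$n\pi_{pab} = \sum_{k=1}^{K} I(X_{pAk} = a)\, I(X_{pBk} = b), \quad a \in \{0, 1\},\ b \in \{0, 1\}$$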
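A minimal sketch of how the 2-by-2 table could be tallied for one patient, assuming each pipeline's calls are available as a 0/1 vector over the same K positions. The function name `agreement_table` and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def agreement_table(x_a: Sequence[int], x_b: Sequence[int]) -> List[List[int]]:
    """Return the 2-by-2 table n[a][b] for one patient.

    x_a[k] (resp. x_b[k]) is 1 when pipeline A (resp. B) reports a
    disagreement with hg19 at position k, and 0 otherwise.
    """
    if len(x_a) != len(x_b):
        raise ValueError("both pipelines must cover the same K positions")
    table = [[0, 0], [0, 0]]
    for a, b in zip(x_a, x_b):
        table[a][b] += 1  # one position adds 1 to exactly one cell
    return table


# Hypothetical toy data: K = 8 positions for a single patient.
pipeline_a = [0, 1, 1, 0, 0, 1, 0, 0]
pipeline_b = [0, 1, 0, 0, 0, 1, 1, 0]
print(agreement_table(pipeline_a, pipeline_b))  # → [[4, 1], [1, 2]]
```

The diagonal cells count positions where the two pipelines agree. The off-diagonal cells count positions called by only one of them.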

[Figure: two panels, "Agreement on position" and "Agreement on position and nature"]

With Gold Standard
Sanger variants: sensitivity comparison
Sanger non-variants: specificity comparison
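As an illustration of the two comparisons, the sketch below computes one pipeline's sensitivity on Sanger variants and its false-positive rate on Sanger non-variants (specificity is one minus that rate). These are plain per-patient proportions, not the model-based estimates reported in the results. The function name and toy vectors are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_and_fp_rate(gold: Sequence[int],
                            calls: Sequence[int]) -> Tuple[float, float]:
    """Compare pipeline calls with the gold standard, position by position.

    gold[k] is 1 for a Sanger variant and 0 for a Sanger non-variant;
    calls[k] is the pipeline's 0/1 call at the same position.
    """
    if len(gold) != len(calls):
        raise ValueError("gold standard and pipeline must cover the same positions")
    tp = fn = fp = tn = 0
    for g, c in zip(gold, calls):
        if g == 1 and c == 1:
            tp += 1
        elif g == 1:
            fn += 1
        elif c == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        # Undefined, e.g. for a patient with no Sanger variant.
        raise ValueError("need at least one Sanger variant and one non-variant")
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy data: 3 Sanger variants among 10 positions.
sanger = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
pipeline = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
sens, fp_rate = sensitivity_and_fp_rate(sanger, pipeline)
print(sens, fp_rate, 1 - fp_rate)
```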

[Figure: panels labelled 1, 2, ..., P-1, P]

Saturated model for variant identity
$$\log(n\pi_{pab}) = \mu_p + a\lambda_p^A + b\lambda_p^B + ab(\theta_p + I\theta_{ps})$$
Alternative approach
Fitting mixed-effects log-linear models
Random effects for biological variability
[Figure: panels labelled 1, 2, ..., P-1, P]
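For a single patient, and ignoring the variant-identity term (an assumption made here to keep the sketch short), the saturated log-linear model reproduces the four cells of the 2-by-2 table exactly. Its interaction parameter is then the log odds-ratio for agreement. The sketch below solves for the parameters in closed form. The function names and the toy table are hypothetical.

```python
import math
from typing import List, Tuple

Params = Tuple[float, float, float, float]


def saturated_parameters(table: List[List[int]]) -> Params:
    """Solve log n[a][b] = mu + a*lam_a + b*lam_b + a*b*theta for one patient.

    table[a][b] counts positions with pipeline A = a and pipeline B = b.
    All four cells must be positive for the logarithms to exist.
    """
    n00, n01 = table[0]
    n10, n11 = table[1]
    if min(n00, n01, n10, n11) <= 0:
        raise ValueError("all four cells must be positive")
    mu = math.log(n00)
    lam_a = math.log(n10 / n00)
    lam_b = math.log(n01 / n00)
    theta = math.log((n11 * n00) / (n10 * n01))  # log odds-ratio
    return mu, lam_a, lam_b, theta


def fitted_count(params: Params, a: int, b: int) -> float:
    """Cell count implied by the model for a given (a, b)."""
    mu, lam_a, lam_b, theta = params
    return math.exp(mu + a * lam_a + b * lam_b + a * b * theta)


# Hypothetical toy table for one patient.
toy = [[4, 1], [1, 2]]
params = saturated_parameters(toy)
print(math.exp(params[3]))  # odds-ratio for agreement
```

Because the model is saturated, `fitted_count` returns the observed counts. A mixed-effects version would let these per-patient parameters vary around common means.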

Number of variants identified by the pipelines and by the gold standard: all types of variants*

Variants and pipelines              N    Mean ± SD         Min.    Max.
Regions sequenced by MPS only (390 339 base-pairs per patient)
  BWA-GATK                         43    1871 ± 225.08     1198    2360
  TMAP-NextGENe                    43    2280 ± 339.72     1214    3094
Region sequenced by MPS + Sanger (1 to 3 genes and 1 085 to 16 570 base-pairs per patient)
  Sanger                           30    2.67 ± 2.88          0      10
  BWA-GATK                         30    27.40 ± 20.54        3      92
  TMAP-NextGENe                    30    22.77 ± 18.71        3      75

* Single Nucleotide Variants (SNVs), insertions, and deletions

Number of variants identified by the pipelines and by the gold standard: only SNVs

Variants and pipelines              N    Mean ± SD         Min.    Max.
Regions sequenced by MPS only (390 339 base-pairs per patient)
  BWA-GATK                         43    267 ± 22.04        204     318
  TMAP-NextGENe                    43    315 ± 28.32        215     384
Region sequenced by MPS + Sanger (1 085 to 16 570 base-pairs per patient)
  Sanger                           30    2.30 ± 2.79          0       9
  BWA-GATK                         30    2.77 ± 2.81          0      10
  TMAP-NextGENe                    30    2.77 ± 2.81          0      10

Pipeline comparisons - All types of variants
Estimation of parameters

Variants, pipelines, parameters             Value      95% CI                95% BVI
Without Gold Standard*
  # of variants for BWA-GATK                1857.11    1789.25; 1927.54      1368.17; 2520.7
  # of variants for TMAP-NextGENe           2253.44    2149.93; 2361.94      1653.91; 3070.3
  OR for agreement                          64.59      60.36; 69.11          42.02; 99.29
  Conditional probability of identity**     0.24       0.23; 0.25            0.20; 0.28
With Gold Standard
  Sensitivity of BWA-GATK (%)               63.47      45.98; 87.60          47.85; 84.18
  Sensitivity of TMAP-NextGENe (%)          63.42      45.98; 87.48          44.68; 90.02
  FP rate for BWA-GATK                      43.03      41.22; 45.20          NA
  FP rate for TMAP-NextGENe                 35.25      33.59; 36.80          NA

* Heterogeneous margins and odds-ratios.
** Heterogeneous variant identity parameter.
Heterogeneous margins and odds-ratios for specificity extended to sensitivity analysis; for 10 000 Sanger NV.

Pipeline comparisons - Only SNVs
Estimation of parameters

Variants, pipelines, parameters             Value      95% CI                95% BVI
Without Gold Standard*
  Number of SNVs for BWA-GATK               266.41     259.17; 273.84        229.84; 308.80
  Number of SNVs for TMAP-NextGENe          314.24     305.52; 323.21        269.33; 366.65
  OR for agreement                          3165.26    2955.53; 3389.88      1373.08; 7296.64
  Conditional probability of identity**     0.9987     0.9984; 0.9989        NA
With Gold Standard
  Sensitivity of BWA-GATK (%)               76.81      63.50; 92.92          NA
  Sensitivity of TMAP-NextGENe (%)          76.81      63.50; 92.92          NA
  FP rate for BWA-GATK                      2.01       1.82; 2.24            NA
  FP rate for TMAP-NextGENe                 2.01       1.82; 2.24            NA

* Heterogeneous margins and odds-ratios.
** Homogeneous variant identity parameter.
Homogeneous margins and odds-ratios for specificity extended to sensitivity analysis; for 10 000 Sanger NV.

Discussion
Several usual statistical methods may be adapted to NGS analyses
More sophisticated experimental designs are needed to analyze the various components of experimental variability, as distinct from biological variability
More sophisticated decision rules are needed to select appropriate pipelines, taking into account the respective costs of false positives (FP) and false negatives (FN)
When all variants were analyzed, the performance of the 2 pipelines was low in terms of sensitivity and specificity
In substitution analyses, performance was better, leading to a very high value of the odds-ratio for agreement
Extensions of these models are needed to evaluate the performance of NGS
