Downloading PrecisionFDA Challenge Datasets 1. Consistency challenge (https://precision.fda.gov/challenges/consistency)


Supplementary Notes for Strelka2: Fast and accurate variant calling for clinical sequencing applications

Supplementary Note 1
Command lines to run analyses

Downloading PrecisionFDA Challenge Datasets

1. Consistency challenge (https://precision.fda.gov/challenges/consistency)

    Dataset        Garvan                              HLI
    Files          NA12878-Garvan-Vial1_R1.fastq.gz    TSNano_1lane_L008_13801_NA12878_R1_001.fastq.gz
                   NA12878-Garvan-Vial1_R2.fastq.gz    TSNano_1lane_L008_13801_NA12878_R1_002.fastq.gz
    Library Prep   TruSeq Nano DNA Library Prep kit    TruSeq Nano DNA Library Prep kit
    Read Length    2x150bp                             2x150bp
    Coverage       ~40x                                ~35x
    Instrument     HiSeq X                             HiSeq X

2. Truth Challenge (https://precision.fda.gov/challenges/truth)

    Dataset        HG001                               HG002
    Files          HG001-NA12878-50x_1.fastq.gz        HG002-NA24385-50x_1.fastq.gz
                   HG001-NA12878-50x_2.fastq.gz        HG002-NA24385-50x_1.fastq.gz
    Library Prep   TruSeq DNA PCR-Free
    Read Length    2x148bp                             2x148bp
    Coverage       ~50x                                ~50x
    Instrument     HiSeq 2500                          HiSeq 2500

Read alignment

1. Run bwa-mem (version 0.7.12)

    bwa mem -M -t 28 -R "@RG\tID:${sample}_1\tSM:$sample\tLB:$library\tPL:ILLUMINA" $hg19 $fastq_1 $fastq_2 | samtools view -b -o $unsorted_bam -

2. Sort reads (samtools version 1.3)

    samtools sort $unsorted_bam -O bam -o $sorted_bam -@ 28

3. Remove duplicates

    samtools rmdup <(samtools view -F 0x100 -u $sorted_bam) $sample_bam

4. Index

    samtools index $sample_bam
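The read-group argument of step 1 is easy to misquote, because bwa expects the two-character sequence backslash-t between fields rather than a real tab. The Python sketch below assembles the step 1 pipeline as argument lists so the quoting is explicit. The function names and the example file names are hypothetical, and the commands are only printed, not executed.

```python
import shlex


def read_group(sample: str, library: str) -> str:
    """Build the @RG string passed to `bwa mem -R`.

    Fields are joined with a literal backslash followed by 't'; bwa turns
    that escape into a tab when it writes the SAM header.
    """
    fields = ["@RG", f"ID:{sample}_1", f"SM:{sample}", f"LB:{library}", "PL:ILLUMINA"]
    return "\\t".join(fields)


def alignment_pipeline(sample, library, ref, fastq_1, fastq_2, unsorted_bam, threads=28):
    """Return the two commands of step 1: bwa mem piped into samtools view."""
    bwa = ["bwa", "mem", "-M", "-t", str(threads), "-R", read_group(sample, library),
           ref, fastq_1, fastq_2]
    view = ["samtools", "view", "-b", "-o", unsorted_bam, "-"]
    return bwa, view


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    bwa_cmd, view_cmd = alignment_pipeline(
        "NA12878", "lib1", "hg19.fa", "reads_1.fastq.gz", "reads_2.fastq.gz", "unsorted.bam")
    print(shlex.join(bwa_cmd) + " | " + shlex.join(view_cmd))
```

Keeping each command as a list avoids a second layer of shell quoting if the pipeline is later launched from Python.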

The hg19 reference file is available at https://s3.amazonaws.com/strelka-public/hg19/hg19.fa.

Germline variant calling

Strelka2 (version 2.8.3)

1. Configuration

    python ${strelka_install_path}/bin/configureStrelkaGermlineWorkflow.py --ref $hg19 --bam $sample_bam --runDir $strelka_analysis_path

2. Run germline calling

    python $strelka_analysis_path/runWorkflow.py -m local -j 28

Sentieon DNAseq Haplotyper (version sentieon-genomics-201704, equivalent to GATK v3.7)

We followed the Sentieon DNAseq Haplotyper pipeline as written in the software manual for Sentieon Genomics pipeline tools (version 201704). This pipeline is equivalent to the GATK best practices pipeline.

1. Indel realignment

    sentieon driver --temp_dir $temp_dir -r $hg19_fasta -t 28 -i $sample_bam \
        --algo Realigner -k $Mills ${sample}.realigned.bam

2. Base quality score recalibration

    sentieon driver -r $hg19 -t 28 -i ${sample}.realigned.bam \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo QualCal -k $dbsnp -k $Mills ${sample}.recal.table

3. Haplotype calling

    sentieon driver --temp_dir $temp_dir -r $hg19 -t 28 -i ${sample}.realigned.bam -q ${sample}.recal.table \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo Haplotyper -d $dbsnp --emit_conf=10 --call_conf=30 --prune_factor=3 ${sample}.vcf

4. Calculating SNV variant quality score recalibration (VQSR)

    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo VarCal -v ${sample}.vcf \
        --resource 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
        --resource_param 1000G,known=false,training=true,truth=false,prior=10.0 \
        --resource 1000G_omni2.5.hg19.sites.vcf \
        --resource_param omni,known=false,training=true,truth=true,prior=12.0 \
        --resource $dbsnp \
        --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 \
        --resource hapmap_3.3.hg19.sites.vcf \
        --resource_param hapmap,known=false,training=true,truth=true,prior=15.0 \
        --annotation QD --annotation MQ --annotation MQRankSum --annotation ReadPosRankSum --annotation FS \
        --var_type SNP --plot_file ${sample}.snv.vqsr.csv --max_gaussians 8 \
        --tranches_file ${sample}.snv.tranches.csv ${sample}.snv.recal

5. Applying SNV VQSR

    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo ApplyVarCal -v ${sample}.vcf --var_type SNP --recal ${sample}.snv.recal \
        --tranches_file ${sample}.snv.tranches.csv --sensitivity 99.5 ${sample}.snv.vqsr.vcf

6. Calculating indel VQSR

    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo VarCal -v ${sample}.snv.vqsr.vcf \
        --resource 1000G_phase1.indels.hg19.sites.vcf \
        --resource_param 1000G,known=false,training=true,truth=true,prior=10.0 \
        --resource Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
        --resource_param mills,known=false,training=true,truth=true,prior=12.0 \
        --resource $dbsnp \
        --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 \
        --annotation QD --annotation MQRankSum --annotation ReadPosRankSum --annotation FS \
        --var_type INDEL --plot_file ${sample}.indel.vqsr.csv --max_gaussians 4 \
        --tranches_file ${sample}.indel.tranches.csv ${sample}.indel.recal

7. Applying indel VQSR

    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo ApplyVarCal -v ${sample}.snv.vqsr.vcf --var_type INDEL --recal ${sample}.indel.recal \
        --tranches_file ${sample}.indel.tranches.csv --sensitivity 99.5 ${sample}.all.vqsr.vcf

Germline calling evaluation

Germline calling accuracy was measured using hap.py (version 0.3.7) against the latest NIST Genome in a Bottle truth set (version 3.3.2).
Hap.py command for Strelka2 results

    hap.py -X --roc GQX --roc-filter LowGQX --fixchr --force-interactive --threads 4 \
        -f $high_confidence_regions_bed -o $sample $giab_truth_vcf $vcf -V --engine vcfeval

Hap.py command for Sentieon DNAseq Haplotyper and PrecisionFDA submissions

    hap.py -X --fixchr --force-interactive --threads 4 \
        -f $high_confidence_regions_bed -o $sample $giab_truth_vcf $vcf -V --engine vcfeval

Creating in silico mixtures for somatic calling benchmark

We created 4 in silico tumor/normal pairs by down-sampling and mixing 8 HiSeq X WGS samples, 4 from NA12877 (normal) and the other 4 from NA12878 (tumor), as in the table below:

    Dataset                       Tumor sample (~110x coverage)        Normal sample (~37x coverage)
    (TumorPurity / NormalPurity)  InputSample   DownSampleRatio        InputSample   DownSampleRatio

    20% / 100%                    NA12877_2     0.8                    NA12877_1     1
                                  NA12877_3     0.8
                                  NA12877_4     0.8
                                  NA12878_1     0.2
                                  NA12878_2     0.2
                                  NA12878_3     0.2
                                  File: InSilicoTumor1_Purity20.bam    File: Normal1_Purity100.bam

    50% / 100%                    NA12877_1     0.75                   NA12877_3     1
                                  NA12877_4     0.75
                                  NA12878_1     0.75
                                  NA12878_2     0.75
                                  File: InSilicoTumor2_Purity50.bam    File: Normal2_Purity100.bam

    80% / 100%                    NA12877_1     0.3                    NA12877_3     1
                                  NA12877_4     0.3
                                  NA12878_1     0.8
                                  NA12878_2     0.8
                                  NA12878_3     0.8
                                  File: InSilicoTumor3_Purity80.bam    File: Normal2_Purity100.bam

    80% / 90%                     NA12877_1     0.3                    NA12877_3     0.9
                                  NA12877_4     0.3                    NA12878_4     0.1
                                  NA12878_1     0.8
                                  NA12878_2     0.8
                                  NA12878_3     0.8
                                  File: InSilicoTumor3_Purity80.bam    File: InSilicoNormal3_Purity90.bam

All the bam files are available at https://s3.amazonaws.com/strelka-public/bams/${bamfilename}. The GRCh38Decoy reference file is available at https://s3.amazonaws.com/strelka-public/grch38decoy/grch38decoy.fa.

Below is the procedure to create in silico mixture data using sambamba [1] (version 0.5.8) and samtools [2] (version 1.3).

1. Down sample (sambamba version 0.5.8)

    sambamba_v0.5.8 view -h -t 10 -s $down_sample_ratio -f bam --subsampling-seed=10 $bam_i -o ${downsampled_i}.bam

2. Merge (samtools version 1.3)

    samtools merge -@ 20 merged.bam ${downsampled_1}.bam ${downsampled_n}.bam

3. Index merged

    samtools index ${merged}.bam

4. Replace SM_0

    sambamba_v0.5.8 view -H before_reheader.bam | sed 's/^\(\@RG.*\)SM:[a-zA-Z0-9_]*/\1SM:admix/g' - > $new_header

5. Reheader

    samtools reheader $new_header ${reheader}.bam > ${mixture}.bam

6. Index

    samtools index ${mixture}.bam

Somatic variant calling

Strelka2 (version 2.8.3)

1. Configuration

    python ${strelka_install_path}/bin/configureStrelkaSomaticWorkflow.py --ref $GRCh38Decoy \
        --normalBam $normal_bam --tumorBam $tumor_bam --callRegions callable.bed.gz --runDir $strelka_analysis_path

(Note: callable.bed.gz lists complete chromosomes so as to exclude by omission a number of smaller decoy contigs from hg38, which are currently problematic for Strelka2. The bed file lists all autosomes together with chromosomes X, Y and M, as well as all random and unplaced contigs. This file is available at https://s3.amazonaws.com/strelka-public/grch38decoy/callable.bed.gz)

2. Run somatic calling

    python ${strelka_analysis_path}/runWorkflow.py -m local -j 28

Sentieon TNseq TNhaplotyper (version sentieon-genomics-201704, equivalent to MuTect2)

We followed the TNhaplotyper pipeline as written in the software manual for Sentieon Genomics pipeline tools (version 201704). This pipeline is equivalent to the GATK best practices pipeline using MuTect2.

1. Tumor indel realignment

    sentieon driver -r $GRCh38Decoy -t 28 -i $tumor_bam \
        --algo Realigner -k $Mills -k $known_indels ${sample}.tumour.realigned.bam

2. Normal indel realignment

    sentieon driver -r $GRCh38Decoy -t 28 -i $normal_bam \
        --algo Realigner -k $Mills -k $known_indels ${sample}.normal.realigned.bam

3. Tumor BQSR

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.tumour.realigned.bam \
        --algo QualCal -k $dbsnp -k $Mills -k $known_indels ${sample}.tumour.recal.table

4. Normal BQSR

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.normal.realigned.bam \
        --algo QualCal -k $dbsnp -k $Mills -k $known_indels ${sample}.normal.recal.table

5. Corealign

    sentieon driver -r $GRCh38Decoy -t 28 \
        -i ${sample}.tumour.realigned.bam -i ${sample}.normal.realigned.bam \
        -q ${sample}.tumour.recal.table -q ${sample}.normal.recal.table \
        --algo Realigner -k $Mills -k $known_indels ${sample}.corealigned.bam

6. Variant calling

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.corealigned.bam \
        --algo TNhaplotyper --tumor_sample TUMOR --normal_sample NORMAL --dbsnp $dbsnp ${sample}.somatic.vcf

Somatic calling evaluation

Somatic calling accuracy was measured using som.py (version 0.3.7) within the hap.py package against the truth set consisting of the variant calls in NA12878 where the corresponding NA12877 genotype is homozygous reference. To measure the accuracy separately for indels and SNVs, the VCF file was split into two so that each file contains either indel or SNV calls.

Som.py command for indels

    som.py InSilicoMix_indels.vcf.gz $vcf -f InSilicoMix_indels.bed -o ${sample}.indels --roc $score

Som.py command for SNVs

    som.py InSilicoMix_snvs.vcf.gz $vcf -f InSilicoMix_snvs.bed -o ${sample}.snvs --roc $score

$score is strelka.indel.evs for Strelka, mutect.indel for TNhaplotyper indels, and mutect.snv for TNhaplotyper SNVs. All the truth files (InSilicoMix_indels.vcf.gz, InSilicoMix_indels.bed, InSilicoMix_snvs.vcf.gz, and InSilicoMix_snvs.bed) are available at https://s3.amazonaws.com/strelka-public/truthset/${filename}.
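The purity labels in the in silico mixture table of this note follow from the down-sampling ratios, provided every input BAM has roughly the same depth. The Python sketch below reproduces that arithmetic for two of the mixtures. The per-input depth of 37x is an assumption taken from the "~37x" normal-sample coverage quoted with the table, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Assumption: every input BAM has about the same depth. 37x is taken from the
# "~37x" normal-sample coverage quoted with the table; it is not a per-file
# measurement.
INPUT_DEPTH = 37.0


def mixture_summary(ratios: Dict[str, float],
                    input_depth: float = INPUT_DEPTH) -> Tuple[float, float]:
    """Return (expected coverage, NA12878 read fraction) for one merged BAM.

    `ratios` maps an input sample name to its down-sampling ratio. Inputs
    whose name starts with "NA12878" are counted as tumor reads.
    """
    total = sum(ratios.values())
    na12878 = sum(r for name, r in ratios.items() if name.startswith("NA12878"))
    return total * input_depth, na12878 / total


# Down-sampling ratios copied from the first and last rows of the table.
TUMOR1 = {"NA12877_2": 0.8, "NA12877_3": 0.8, "NA12877_4": 0.8,
          "NA12878_1": 0.2, "NA12878_2": 0.2, "NA12878_3": 0.2}
NORMAL3 = {"NA12877_3": 0.9, "NA12878_4": 0.1}

if __name__ == "__main__":
    for label, ratios in (("InSilicoTumor1_Purity20", TUMOR1),
                          ("InSilicoNormal3_Purity90", NORMAL3)):
        coverage, tumor_fraction = mixture_summary(ratios)
        print(f"{label}: ~{coverage:.0f}x, NA12878 fraction {tumor_fraction:.2f}")
```

Under the equal-depth assumption the first tumor mixture comes out at about 111x with a 0.20 NA12878 fraction, in line with the ~110x and 20% labels, and the mixed normal contains 10% NA12878 reads, which corresponds to 90% normal purity.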

Supplementary Note 2
EVS model and hard filters

EVS model

Empirical variant scoring (EVS) in Strelka2 uses pre-trained random forest models taking a set of features as input to produce the probability of an erroneous variant call. For each of germline, RNA-seq, and somatic variant calling, there are two separate trained random forest models and feature sets for the two high-level variant categories, SNVs and indels. Strelka2 is intended to run with models which have been pre-trained on a combined training data set representing a wide variety of sample-prep and sequencing assays. Note that although scripts are provided to recreate the EVS model training procedure, there is no intention for the models to be retrained for each input sequencing dataset to be analyzed (this is in contrast to dynamic re-scoring systems such as the GATK VQSR procedure [3]).

The EVS models are trained using the random forest learning procedures implemented in the scikit-learn package [4], trained on a set of candidate calls with truth labels assigned as described below. For the germline and RNA-seq EVS models, each random forest uses 50 decision trees with a maximum depth of 12, a minimum of 50 samples per leaf, and no limit on the maximum number of features. For the somatic EVS models, each random forest uses 100 decision trees with a maximum depth of 6. The remaining options are set to scikit-learn defaults.

The training data are compiled from a collection of sequencing runs using different sample prep, sequencing platforms and chemistries. All germline and RNA-seq datasets are from an individual for which a gold standard truth set is available from the Platinum Genomes project [5]. Candidate variants that correspond to the high-confidence regions of the truth set are labeled as true or false using the hap.py haplotype comparison tool [6]. Variants that exist in the truth set but were called with incorrect genotype are treated as false variants.
In the case of germline calling, it is believed that the vast majority of candidate SNVs (but not indels) outside of the high-confidence regions (and classified by hap.py as unknown) are false; for this reason, these SNV variants are added to the set of false variants that are presented to the model during training, downweighted so as to have a total weight which is half the total weight of the known false variants.

Somatic datasets include simulated tumor-normal pairs from the Platinum Genomes project as well as tumor-normal data from real tumor cell lines for which curated (but generally noisier) truth sets have been constructed. Labeling of somatic datasets is done by means of a script included in the Strelka2 distribution.

The output scores produced by the random forest classifier are transformed to phred-scale and calibrated by passing the resulting quality values through a linear transform estimated by regressing binned empirical quality onto predicted quality. Finally, variant filter labels are assigned based on thresholds that have been selected to achieve a reasonable tradeoff across multiple datasets.

Features used by the EVS model

Features used in each model are listed here and definitions are provided further below.
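Two numeric details of the training procedure described above, the down-weighting of unlabeled SNVs and the phred-scale calibration, can be sketched in Python as follows. This is an illustration under stated assumptions, not the Strelka2 training code: known false variants are assumed to carry unit weight, the linear transform is fitted by unweighted ordinary least squares, the probability floor that avoids log(0) is arbitrary, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def unknown_variant_weight(n_known_false: int, n_unknown: int) -> float:
    """Per-variant weight for 'unknown' SNVs.

    Chosen so that all unknown SNVs together weigh half as much as the
    known false SNVs, assuming each known false SNV has weight 1.0.
    """
    if n_unknown == 0:
        return 0.0
    return 0.5 * n_known_false / n_unknown


def phred(p_error: float, floor: float = 1e-10) -> float:
    """Phred-scaled quality of an error probability (floor is an assumption)."""
    return -10.0 * math.log10(max(p_error, floor))


def fit_linear_calibration(predicted_q: Sequence[float],
                           empirical_q: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of empirical quality on predicted quality.

    Each pair is one bin: the mean predicted quality of the calls in the bin
    and the quality implied by the observed error rate of those calls.
    """
    n = len(predicted_q)
    mean_x = sum(predicted_q) / n
    mean_y = sum(empirical_q) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted_q)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted_q, empirical_q))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(q: float, slope: float, intercept: float) -> float:
    """Apply the fitted linear transform to a raw phred-scaled score."""
    return slope * q + intercept
```

For example, with 100 known false SNVs and 400 unknown SNVs, each unknown SNV would get weight 0.125, so the unknown set totals 50, half of the known false total.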

Germline SNV features: GenotypeCategory, SampleRMSMappingQuality, SiteHomopolymerLength, SampleStrandBias, SampleRMSMappingQualityRankSum, SampleReadPosRankSum, RelativeTotalLocusDepth, SampleUsedDepthFraction, ConservativeGenotypeQuality, NormalizedAltHaplotypeCountRatio.

Germline Indel features: GenotypeCategory, SampleIndelRepeatCount, SampleIndelRepeatUnitSize, SampleIndelAlleleBiasLower, SampleIndelAlleleBias, SampleProxyRMSMappingQuality, RelativeTotalLocusDepth, SamplePrimaryAltAlleleDepthFraction, ConservativeGenotypeQuality, InterruptedHomopolymerLength, ContextCompressability, IndelCategory, NormalizedAltHaplotypeCountRatio.

Somatic SNV features: SomaticSNVQualityAndHomRefGermlineGenotype, NormalSampleRelativeTotalLocusDepth, TumorSampleAltAlleleFraction, RMSMappingQuality, ZeroMappingQualityFraction, TumorSampleStrandBias, TumorSampleReadPosRankSum, AlleleCountLogOddsRatio, NormalSampleFilteredDepthFraction, TumorSampleFilteredDepthFraction.

Somatic Indel features: SomaticIndelQualityAndHomRefGermlineGenotype, TumorSampleReadPosRankSum, TumorSampleLogSymmetricStrandOddsRatio, RepeatUnitLength, IndelRepeatCount, RefRepeatCount, InterruptedHomopolymerLength, TumorSampleIndelNoiseLogOdds, TumorNormalIndelAlleleLogOdds, AlleleCountLogOddsRatio.

Hard filters

The hard filter model is used whenever EVS cannot be; in particular, this is the method used to filter germline homozygous reference calls. The EVS model may be turned off for all variants whenever the assay conditions are suspected to poorly match the EVS model's training conditions, in which case the hard filters will be used instead. When hard filters are used, each filter is triggered when a single feature exceeds some critical value. Filtration thresholds are set to remove variants which are very likely to be incorrect. The LowDepth filter described below is applied to all germline and somatic calls. Additionally, the following filters are used depending on the variant type.
Hard filter thresholds for germline model

Shared filter conditions: Variant is filtered if ConservativeGenotypeQuality < 15 or RelativeTotalLocusDepth > 3.
SNV-specific filter conditions: SNV is filtered if SampleStrandBias > 10.

Hard filter thresholds for somatic model

SNV-specific filter conditions: SNV is filtered if SomaticSNVQualityAndHomRefGermlineGenotype < 15, SiteFilteredBasecallFrac >= 0.4, or SpanningDeletionFraction > 0.75.
Indel-specific filter conditions: Indel is filtered if SomaticIndelQualityAndHomRefGermlineGenotype < 40 or IndelWindowFilteredBasecallFrac >= 0.3.

Hard filters that are also applied when the EVS model is used

When the EVS model is used, variant filtering is primarily based on the score computed by the random forest model. In some cases this score is supplemented with additional hard filters.
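The germline and somatic threshold filters listed above amount to a handful of comparisons per call. The Python sketch below is only an illustration of that logic. The dictionary-based record and the function names are hypothetical, the returned strings are feature names rather than the filter labels Strelka2 writes to its VCF output, and the two filtered-basecall thresholds are interpreted as the fractions 0.4 and 0.3 on the assumption that these features are fractions between 0 and 1.

```python
from typing import Dict, List


def germline_hard_filters(features: Dict[str, float], is_snv: bool) -> List[str]:
    """Return the names of the features that trigger a germline hard filter."""
    failed = []
    if features["ConservativeGenotypeQuality"] < 15:
        failed.append("ConservativeGenotypeQuality")
    if features["RelativeTotalLocusDepth"] > 3:
        failed.append("RelativeTotalLocusDepth")
    # The strand-bias condition applies to SNVs only.
    if is_snv and features["SampleStrandBias"] > 10:
        failed.append("SampleStrandBias")
    return failed


def somatic_hard_filters(features: Dict[str, float], is_snv: bool) -> List[str]:
    """Return the names of the features that trigger a somatic hard filter."""
    failed = []
    if is_snv:
        if features["SomaticSNVQualityAndHomRefGermlineGenotype"] < 15:
            failed.append("SomaticSNVQualityAndHomRefGermlineGenotype")
        if features["SiteFilteredBasecallFrac"] >= 0.4:  # assumed fraction
            failed.append("SiteFilteredBasecallFrac")
        if features["SpanningDeletionFraction"] > 0.75:
            failed.append("SpanningDeletionFraction")
    else:
        if features["SomaticIndelQualityAndHomRefGermlineGenotype"] < 40:
            failed.append("SomaticIndelQualityAndHomRefGermlineGenotype")
        if features["IndelWindowFilteredBasecallFrac"] >= 0.3:  # assumed fraction
            failed.append("IndelWindowFilteredBasecallFrac")
    return failed
```

A call passes when the returned list is empty. Because each condition tests a single feature, the list also records which feature caused a call to be filtered.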

LowDepth: This filter is applied to all germline and somatic calls. For germline site calls, read depths are calculated from base calls used for site genotyping. For germline indel loci, read depths are taken from the depths of the sites preceding the indels. If read depths are below 3, genotype calls are filtered out and tagged as LowDepth. For variant calls in particular, allelic depths for the reference and alternative alleles are additionally considered. If the sum of the allelic depths is below 3, the associated variant calls are supplemented with a LowDepth filter. For somatic variants, tumor sample read depths are calculated for SNVs and indels in a similar way. If tumor depths are below 2, associated somatic variant calls are supplemented with a LowDepth filter.
NormalSampleRelativeTotalLocusDepth: This filter is applied to all somatic calls; a variant is filtered if this value is greater than 3.

Germline and RNA-seq feature descriptions

GenotypeCategory: A categorical variable reflecting the most likely genotype as heterozygous (0), homozygous (1) or alt-heterozygous (2).
SampleRMSMappingQuality: RMS mapping quality of all reads spanning the variant in one sample. This feature matches SAMPLE/MQ in the VCF spec.
SiteHomopolymerLength: Length of the longest homopolymer containing the current position if this position can be treated as any base.
InterruptedHomopolymerLength: One less than the length of the longest interrupted homopolymer in the reference sequence containing the current position. An interrupted homopolymer is a string that has edit distance 1 to a homopolymer.
SampleStrandBias: Log ratio of the sample's genotype likelihood computed assuming the alternate allele occurs on only one strand vs both strands (thus positive values indicate bias).
SampleRMSMappingQualityRankSum: Z-score of Mann-Whitney U test for reference vs alternate allele mapping quality values in one sample.
SampleReadPosRankSum: Z-score of Mann-Whitney U test for reference vs alternate allele read positions in one sample.
RelativeTotalLocusDepth: Locus depth relative to expectation: this is the ratio of total read depth at the variant locus in all samples over the total expected depth in all samples. Depth at the variant locus includes reads at any mapping quality. Expected depth is taken from the preliminary depth estimation step. This value is set to 1 in exome and targeted analyses, because it is problematic to define expected depth in this case.
SampleUsedDepthFraction: The ratio of reads used to genotype the locus over the total number of reads at the variant locus in one sample. Reads are not used if the mapping quality is less than the minimum threshold, if the local read alignment fails the mismatch density filter or if the basecall is ambiguous.
ConservativeGenotypeQuality: The model-based ConservativeGenotypeQuality (GQX) value for one sample, reflecting the conservative confidence of the called genotype.
NormalizedAltHaplotypeCountRatio: For variants in an active region, the proportion of reads supporting the top 2 haplotypes, or 0 if haplotyping failed due to this proportion being below threshold. For heterozygous variants with only one non-reference allele, the proportion is doubled so that its value is expected to be close to 1.0 regardless of genotype. The feature is set to -1 for variants not in an active region.
SampleIndelRepeatCount: The number of times the primary indel allele's repeat unit occurs in a haplotype containing the indel allele. The primary indel allele's repeat unit is the smallest possible sequence such that the inserted/deleted sequence can be formed by concatenating multiple copies of it. The primary indel allele is the best supported allele among all overlapping indel alleles at the locus of interest in one sample.
SampleIndelRepeatUnitSize: Length of the primary indel allele's repeat unit, as defined for feature SampleIndelRepeatCount.
SampleIndelAlleleBiasLower: The negative log probability of seeing N or fewer observations of one allele in a heterozygous variant out of the total observations from both alleles in one sample. N is typically the observation count of the reference allele. If the heterozygous variant does not include the reference allele, the first indel allele is used instead.
SampleIndelAlleleBias: Similar to SampleIndelAlleleBiasLower, except the count used is twice the count of the least frequently observed allele.
SampleProxyRMSMappingQuality: RMS mapping quality of all reads spanning the position immediately preceding the indel in one sample. This feature approximates the SAMPLE/MQ value defined in the VCF spec.
SamplePrimaryAltAlleleDepthFraction: The ratio of the confident observation count of the best-supported non-reference allele at the variant locus, over all confident allele observation counts in one sample.
ContextCompressability: The length of the upstream or downstream reference context (whichever is greater) that can be represented using 5 Ziv-Lempel keywords [7,8]. The Ziv-Lempel keywords are obtained using the scheme of Ziv and Lempel 1977 [8], by traversing the sequence and successively selecting the shortest subsequence that has not yet been encountered.
IndelCategory: A binary variable set to 1 if the indel allele is a primitive deletion or 0 otherwise.
SamplePrimaryAltAlleleDepth: The confident observation count of the best-supported non-reference allele at the variant locus.
VariantAlleleQuality: The model-based variant quality value reflecting confidence that the called variant is present in at least one sample, regardless of genotype. This feature matches QUAL in the VCF spec.
SampleMeanDistanceFromReadEdge: For all non-reference basecall observations in one sample at a candidate SNV site, report the mean distance to the closest edge of each alternate basecall's read. Distance is measured in read-coordinates, zero-indexed, and is allowed to have a maximum value of 20.

SampleRefAlleleDepth: The confident observation count of the reference allele at the variant locus.
SampleIndelMeanDistanceFromReadEdge: For all indel allele observations in one sample at a candidate indel locus, report the mean distance to the closest edge of each indel allele's read. Distance is measured in read-coordinates, zero-indexed, and is allowed to have a maximum value of 20. The left or right side of the indel may be used to provide the shortest distance, but the indel will only be considered in its left-aligned position.
SampleRefRepeatCount: The number of times the primary indel allele's repeat unit occurs in the reference sequence.

Somatic feature descriptions

Note that for somatic features "all samples" refers to the tumor and matched normal samples together.

SomaticSNVQualityAndHomRefGermlineGenotype: Posterior probability of a somatic SNV conditioned on a homozygous reference germline genotype. When INFO/NT is "ref", this feature matches INFO/QSS_NT in the VCF output.
NormalSampleRelativeTotalLocusDepth: This feature matches the germline RelativeTotalLocusDepth feature, except that it reflects the depth of only the matched normal sample.
TumorSampleAltAlleleFraction: Fraction of the tumor sample's observations which are not the reference allele. This is restricted to a maximum of 0.5 to prevent the model from overtraining against high somatic allele frequencies (these might be common e.g. for loss of heterozygosity regions from liquid tumors).
RMSMappingQuality: Root mean square read mapping quality of all reads spanning the variant in all samples. This feature matches INFO/MQ in the VCF spec.
ZeroMappingQualityFraction: Fraction of read mapping qualities equal to zero, for all reads spanning the variant in all samples.
InterruptedHomopolymerLength: One less than the length of the longest interrupted homopolymer in the reference sequence containing the current position. An interrupted homopolymer is a string that has edit distance 1 to a homopolymer.
TumorSampleStrandBias: Log ratio of the tumor-sample somatic allele likelihood computed assuming the somatic allele occurs on only one strand vs both strands (thus higher values indicate greater bias).
TumorSampleReadPosRankSum: Z-score of Mann-Whitney U test for reference vs non-reference allele read positions in the tumor sample's observations.
AlleleCountLogOddsRatio: The log odds ratio of allele counts $\log \frac{r_t a_n}{r_n a_t}$, given reference $(r_t, r_n)$ and non-reference $(a_t, a_n)$ allele counts for the tumor and normal sample pair.

NormalSampleFilteredDepthFraction: The fraction of reads that were filtered out of the normal sample before calling the variant locus.
TumorSampleFilteredDepthFraction: The fraction of reads that were filtered out of the tumor sample before calling the variant locus.
SomaticIndelQualityAndHomRefGermlineGenotype: Posterior probability of a somatic indel conditioned on a homozygous reference germline genotype. When INFO/NT is "ref", this feature matches INFO/QSI_NT in the VCF output.
TumorSampleLogSymmetricStrandOddsRatio: Log of the symmetric strand odds ratio of allele counts $\log\left(\frac{r_{fwd} a_{rev}}{r_{rev} a_{fwd}} + \frac{r_{rev} a_{fwd}}{r_{fwd} a_{rev}}\right)$, given reference $(r_{fwd}, r_{rev})$ and non-reference $(a_{fwd}, a_{rev})$ confident counts of the tumor sample's observations.
RepeatUnitLength: The length of the somatic indel allele's repeat unit. The repeat unit is the smallest possible sequence such that the inserted/deleted sequence can be formed by concatenating multiple copies of it.
IndelRepeatCount: The number of times the somatic indel allele's repeat unit occurs in a haplotype containing the indel allele.
RefRepeatCount: The number of times the somatic indel allele's repeat unit occurs in the reference sequence.
TumorSampleIndelNoiseLogOdds: Log ratio of the frequency of the candidate indel vs all other indels at the same locus in the tumor sample. The frequencies are computed from reads which confidently support a single allele at the locus.
TumorNormalIndelAlleleLogOdds: Log ratio of the frequency of the candidate indel in the tumor vs normal samples. The frequencies are computed from reads which confidently support a single allele at the locus.
SiteFilteredBasecallFrac: The maximum value over all samples of SampleSiteFilteredBasecallFrac, which is the fraction of basecalls at a site which have been removed by the mismatch density filter in a given sample.
IndelWindowFilteredBasecallFrac: The maximum value over all samples of SampleIndelWindowFilteredBasecallFrac, which is the fraction of basecalls in a window extending 50 bases to each side of the candidate indel's call position which have been removed by the mismatch density filter in a given sample.
SpanningDeletionFraction: The maximum value over all samples of SampleSpanningDeletionFraction, which is the fraction of reads crossing a candidate SNV site with spanning deletions in a given sample.
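Two of the somatic features described above are simple functions of allele counts: AlleleCountLogOddsRatio, the log of (r_t a_n) / (r_n a_t), and TumorSampleLogSymmetricStrandOddsRatio, the log of the strand odds ratio plus its reciprocal. The Python sketch below evaluates both. The pseudocount of 0.5 added to every count is an assumption made here so that zero counts do not cause a division by zero, and the natural logarithm is likewise assumed; the text does not state how Strelka2 handles either point, and the function names are hypothetical.

```python
import math

# Assumption: a pseudocount keeps the ratios finite when a count is zero.
PSEUDOCOUNT = 0.5


def allele_count_log_odds_ratio(r_t: float, r_n: float, a_t: float, a_n: float,
                                pseudo: float = PSEUDOCOUNT) -> float:
    """log((r_t * a_n) / (r_n * a_t)) for reference (r) and non-reference (a)
    allele counts in the tumor (t) and normal (n) samples."""
    r_t, r_n, a_t, a_n = (count + pseudo for count in (r_t, r_n, a_t, a_n))
    return math.log((r_t * a_n) / (r_n * a_t))


def log_symmetric_strand_odds_ratio(r_fwd: float, r_rev: float,
                                    a_fwd: float, a_rev: float,
                                    pseudo: float = PSEUDOCOUNT) -> float:
    """log(R + 1/R) with R = (r_fwd * a_rev) / (r_rev * a_fwd), computed from
    the tumor sample's forward- and reverse-strand counts."""
    r_fwd, r_rev, a_fwd, a_rev = (count + pseudo for count in (r_fwd, r_rev, a_fwd, a_rev))
    odds = (r_fwd * a_rev) / (r_rev * a_fwd)
    return math.log(odds + 1.0 / odds)


if __name__ == "__main__":
    # Hypothetical counts: alternate allele seen only in the tumor sample.
    print(allele_count_log_odds_ratio(r_t=20, r_n=30, a_t=10, a_n=0))
    # Hypothetical counts: alternate allele seen only on the forward strand.
    print(log_symmetric_strand_odds_ratio(r_fwd=15, r_rev=15, a_fwd=10, a_rev=0))
```

Because the strand feature sums the odds ratio and its reciprocal, swapping the forward and reverse strands leaves it unchanged, and its minimum, log 2, is reached when the alternate allele is split between strands in the same proportion as the reference allele.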

References

1. Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: Fast processing of NGS alignment formats. Bioinformatics 31, 2032-2034 (2015).
2. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
3. Van der Auwera, G. A. et al. From fastq data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43, 11.10.1-11.10.33 (2013).
4. Pedregosa, F. & Varoquaux, G. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, (2011).
5. Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157-164 (2017).
6. Krusche, P. Haplotype Comparison Tools. https://github.com/illumina/hap.py
7. Lesne, A., Blanc, J. L. & Pezard, L. Entropy estimation of very short symbolic sequences. Phys. Rev. E - Stat. Nonlinear, Soft Matter Phys. 79, (2009).
8. Ziv, J. & Lempel, A. A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory 23, 337-343 (1977).