
Supplementary Notes for Strelka2: Fast and accurate variant calling for clinical sequencing applications

Supplementary Note 1
Command lines to run analyses

Downloading PrecisionFDA Challenge Datasets

1. Consistency challenge (https://precision.fda.gov/challenges/consistency)

    Dataset        Garvan                              HLI
    Files          NA12878-Garvan-Vial1_R1.fastq.gz    TSNano_1lane_L008_13801_NA12878_R1_001.fastq.gz
                   NA12878-Garvan-Vial1_R2.fastq.gz    TSNano_1lane_L008_13801_NA12878_R1_002.fastq.gz
    Library prep   TruSeq Nano DNA Library Prep kit    TruSeq Nano DNA Library Prep kit
    Read length    2x150bp                             2x150bp
    Coverage       ~40x                                ~35x
    Instrument     HiSeq X                             HiSeq X

2. Truth Challenge

    Dataset        HG001                     HG002
    Files          HG001-NA x_1.fastq.gz     HG002-NA x_1.fastq.gz
                   HG001-NA x_2.fastq.gz     HG002-NA x_1.fastq.gz
    Library prep   TruSeq DNA PCR-Free       TruSeq DNA PCR-Free
    Read length    2x148bp                   2x148bp
    Coverage       ~50x                      ~50x
    Instrument     HiSeq 2500                HiSeq 2500

Read alignment

1. Run bwa-mem (version )
    bwa mem -M -t 28 -R $hg19 $fastq_1 $fastq_2 | samtools view -b -o $unsorted_bam -

2. Sort reads (samtools version 1.3)
    samtools sort $unsorted_bam -O bam -o $sorted_bam -@

3. Remove duplicates
    samtools rmdup <(samtools view -F 0x100 -u $sorted_bam) $sample_bam

4. Index
    samtools index $sample_bam
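The four read-alignment steps above can be chained from a script. The following is a minimal sketch, not the authors' tooling: the function name, the intermediate file names, and the `read_group` and `threads` parameters are assumptions made for illustration. In particular, `bwa mem -R` expects a read-group header line, and the value used in the original analysis is not preserved in the command above, so `read_group` is a placeholder. The thread count passed to `samtools sort -@` is likewise an assumption.

```python
import shlex


def alignment_commands(ref, fastq_1, fastq_2, sample, read_group, threads=28):
    """Build the four shell commands of the read-alignment procedure.

    Returns command strings only; nothing is executed here.
    File naming (<sample>.unsorted.bam etc.) is hypothetical.
    """
    q = shlex.quote
    unsorted_bam = f"{sample}.unsorted.bam"
    sorted_bam = f"{sample}.sorted.bam"
    sample_bam = f"{sample}.bam"
    return [
        # 1. Align with bwa mem and convert the SAM stream to BAM.
        f"bwa mem -M -t {threads} -R {q(read_group)} {q(ref)} "
        f"{q(fastq_1)} {q(fastq_2)} | samtools view -b -o {q(unsorted_bam)} -",
        # 2. Coordinate-sort the alignments (thread count is an assumption).
        f"samtools sort {q(unsorted_bam)} -O bam -o {q(sorted_bam)} -@ {threads}",
        # 3. Remove duplicates, skipping secondary alignments (flag 0x100).
        f"samtools rmdup <(samtools view -F 0x100 -u {q(sorted_bam)}) {q(sample_bam)}",
        # 4. Index the final BAM.
        f"samtools index {q(sample_bam)}",
    ]
```

Step 3 relies on process substitution, so each command would need to be run through bash, for example with `subprocess.run(["bash", "-c", cmd], check=True)`.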

The hg19 reference file is available at

Germline variant calling

Strelka2 (version 2.8.3)

1. Configuration
    python ${strelka_install_path}/bin/configureStrelkaGermlineWorkflow.py --ref $hg19 --bam $sample_bam --runDir $strelka_analysis_path

2. Run germline calling
    python $strelka_analysis_path/runWorkflow.py -m local -j 28

Sentieon DNAseq Haplotyper (version sentieon-genomics , equivalent to GATK v3.7)

We followed the Sentieon DNAseq Haplotyper pipeline as written in the software manual for Sentieon Genomics pipeline tools (version ). This pipeline is equivalent to the GATK best practices pipeline.

1. Indel realignment
    sentieon driver --temp_dir $temp_dir -r $hg19_fasta -t 28 -i $sample_bam \
        --algo Realigner -k $Mills ${sample}.realigned.bam

2. Base quality score recalibration
    sentieon driver -r $hg19 -t 28 -i ${sample}.realigned.bam \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo QualCal -k $dbsnp -k $Mills ${sample}.recal.table

3. Haplotype calling
    sentieon driver --temp_dir $temp_dir -r $hg19 -t 28 -i ${sample}.realigned.bam -q ${sample}.recal.table \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo Haplotyper -d $dbsnp --emit_conf=10 --call_conf=30 --prune_factor=3 ${sample}.vcf

4. Calculating SNV variant quality score recalibration (VQSR)
    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo VarCal -v ${sample}.vcf \
        --resource 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
        --resource_param 1000G,known=false,training=true,truth=false,prior= \
        --resource 1000G_omni2.5.hg19.sites.vcf \
        --resource_param omni,known=false,training=true,truth=true,prior= \
        --resource $dbsnp \
        --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 \
        --resource hapmap_3.3.hg19.sites.vcf \
        --resource_param hapmap,known=false,training=true,truth=true,prior= \
        --annotation QD --annotation MQ --annotation MQRankSum \
        --annotation ReadPosRankSum --annotation FS \
        --var_type SNP --plot_file ${sample}.snv.vqsr.csv \
        --max_gaussians 8 --tranches_file ${sample}.snv.tranches.csv \
        ${sample}.snv.recal

5. Applying SNV VQSR
    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo ApplyVarCal -v ${sample}.vcf --var_type SNP \
        --recal ${sample}.snv.recal --tranches_file ${sample}.snv.tranches.csv \
        --sensitivity 99.5 ${sample}.snv.vqsr.vcf

6. Calculating indel VQSR
    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo VarCal -v ${sample}.snv.vqsr.vcf \
        --resource 1000G_phase1.indels.hg19.sites.vcf \
        --resource_param 1000G,known=false,training=true,truth=true,prior= \
        --resource Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
        --resource_param mills,known=false,training=true,truth=true,prior= \
        --resource $dbsnp \
        --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 \
        --annotation QD --annotation MQRankSum \
        --annotation ReadPosRankSum --annotation FS \
        --var_type INDEL --plot_file ${sample}.indel.vqsr.csv \
        --max_gaussians 4 --tranches_file ${sample}.indel.tranches.csv \
        ${sample}.indel.recal

7. Applying indel VQSR
    sentieon driver -r $hg19 -t 28 \
        --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY \
        --algo ApplyVarCal -v ${sample}.snv.vqsr.vcf --var_type INDEL \
        --recal ${sample}.indel.recal --tranches_file ${sample}.indel.tranches.csv \
        --sensitivity 99.5 ${sample}.all.vqsr.vcf

Germline calling evaluation

Germline calling accuracy was measured using hap.py (version 0.3.7) against the latest NIST Genome in a Bottle truth set (version 3.3.2).
Hap.py command for Strelka2 results
    hap.py -X --roc GQX --roc-filter LowGQX --fixchr --force-interactive --threads 4 \
        -f $high_confidence_regions_bed -o $sample $giab_truth_vcf $vcf -V --engine vcfeval

Hap.py command for Sentieon DNAseq Haplotyper and PrecisionFDA submissions
    hap.py -X --fixchr --force-interactive --threads 4 \
        -f $high_confidence_regions_bed -o $sample $giab_truth_vcf $vcf -V --engine vcfeval

Creating in silico mixtures for somatic calling benchmark

We created 4 in silico tumor/normal pairs by downsampling and mixing 8 HiSeq X WGS samples, 4 from NA12877 (normal) and 4 from NA12878 (tumor), as in the table below. Each dataset lists its input samples and downsampling ratios for the tumor sample (~110x coverage) and the normal sample (~37x coverage).

Dataset 1: TumorPurity / NormalPurity = 20% / 100%
    Tumor sample                           Normal sample
    InputSample   DownSampleRatio          InputSample   DownSampleRatio
    NA12877_2     0.8                      NA12877_1     1
    NA12877_3     0.8
    NA12877_
    NA12878_1     0.2
    NA12878_2     0.2
    NA12878_3     0.2
    File: InSilicoTumor1_Purity20.bam      File: Normal1_Purity100.bam

Dataset 2: TumorPurity / NormalPurity = 50% / 100%
    Tumor sample                           Normal sample
    InputSample   DownSampleRatio          InputSample   DownSampleRatio
    NA12877_                               NA12877_3     1
    NA12877_
    NA12878_
    NA12878_
    File: InSilicoTumor2_Purity50.bam      File: Normal2_Purity100.bam

Dataset 3: TumorPurity / NormalPurity = 80% / 100%
    Tumor sample                           Normal sample
    InputSample   DownSampleRatio          InputSample   DownSampleRatio
    NA12877_1     0.3                      NA12877_3     1
    NA12877_
    NA12878_1     0.8
    NA12878_2     0.8
    NA12878_3     0.8
    File: InSilicoTumor3_Purity80.bam      File: Normal2_Purity100.bam

Dataset 4: TumorPurity / NormalPurity = 80% / 90%
    Tumor sample                           Normal sample
    InputSample   DownSampleRatio          InputSample   DownSampleRatio
    NA12877_1     0.3                      NA12877_3     0.9
    NA12877_4     0.3                      NA12878_
    NA12878_1     0.8
    NA12878_2     0.8
    NA12878_3     0.8
    File: InSilicoTumor3_Purity80.bam      File: InSilicoNormal3_Purity90.bam

All the bam files are available at

The GRCh38Decoy reference file is available at

Below is the procedure to create in silico mixture data using sambamba [1] (version 0.5.8) and samtools [2] (version 1.3).

1. Down sample (sambamba version 0.5.8)
    sambamba_v0.5.8 view -h -t 10 -s $down_sample_ratio -f bam --subsampling-seed=10 $bam_i -o ${downsampled_i}.bam

2. Merge (samtools version 1.3)
    samtools merge -@ 20 merged.bam ${downsampled_1}.bam ... ${downsampled_n}.bam

3. Index merged
    samtools index ${merged}.bam

4. Replace SM_0
    sambamba_v0.5.8 view -H before_reheader.bam | sed 's/^\(@RG.*\)SM:[a-zA-Z0-9_]*/\1SM:admix/g' - > $new_header

5. Reheader

    samtools reheader $new_header ${reheader}.bam > ${mixture}.bam

6. Index
    samtools index ${mixture}.bam

Somatic variant calling

Strelka2 (version 2.8.3)

1. Configuration
    python ${strelka_install_path}/bin/configureStrelkaSomaticWorkflow.py --ref $GRCh38Decoy \
        --normalBam $normal_bam --tumorBam $tumor_bam --callRegions callable.bed.gz \
        --runDir $strelka_analysis_path

   Note: callable.bed.gz lists complete chromosomes so as to exclude, by omission, a number of smaller decoy contigs from hg38, which are currently problematic for Strelka2. The bed file lists all autosomes together with chromosomes X, Y and M, as well as all random and unplaced contigs. This file is available at

2. Run somatic calling
    python ${strelka_analysis_path}/runWorkflow.py -m local -j 28

Sentieon TNseq TNhaplotyper (version sentieon-genomics , equivalent to MuTect2)

We followed the TNhaplotyper pipeline as written in the software manual for Sentieon Genomics pipeline tools (version ). This pipeline is equivalent to the GATK best practices pipeline using MuTect2.

1. Tumor indel realignment
    sentieon driver -r $GRCh38Decoy -t 28 -i $tumor_bam \
        --algo Realigner -k $Mills -k $known_indels ${sample}.tumour.realigned.bam

2. Normal indel realignment
    sentieon driver -r $GRCh38Decoy -t 28 -i $normal_bam \
        --algo Realigner -k $Mills -k $known_indels ${sample}.normal.realigned.bam

3. Tumor BQSR
    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.tumour.realigned.bam \
        --algo QualCal -k $dbsnp -k $Mills -k $known_indels ${sample}.tumour.recal.table

4. Normal BQSR
    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.normal.realigned.bam \
        --algo QualCal -k $dbsnp -k $Mills -k $known_indels ${sample}.normal.recal.table

5. Corealign
    sentieon driver -r $GRCh38Decoy -t 28 \
        -i ${sample}.tumour.realigned.bam -i ${sample}.normal.realigned.bam \
        -q ${sample}.tumour.recal.table -q ${sample}.normal.recal.table \
        --algo Realigner -k $Mills -k $known_indels ${sample}.corealigned.bam

6. Variant calling
    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.corealigned.bam \
        --algo TNhaplotyper --tumor_sample TUMOR --normal_sample NORMAL \
        --dbsnp $dbsnp ${sample}.somatic.vcf

Somatic calling evaluation

Somatic calling accuracy was measured using som.py (version 0.3.7) within the hap.py package against the truth set consisting of the variant calls in NA12878 where the corresponding NA12877 genotype is homozygous reference. To measure the accuracy separately for indels and SNVs, the VCF file was split into two so that each file contains either indel or SNV calls.

Som.py command for indels
    som.py InSilicoMix_indels.vcf.gz $vcf -f InSilicoMix_indels.bed -o ${sample}.indels --roc $score

Som.py command for SNVs
    som.py InSilicoMix_snvs.vcf.gz $vcf -f InSilicoMix_snvs.bed -o ${sample}.snvs --roc $score

$score is strelka.indel.evs for Strelka, mutect.indel for TNhaplotyper indels, and mutect.snv for TNhaplotyper SNVs.

All the truth files (InSilicoMix_indels.vcf.gz, InSilicoMix_indels.bed, InSilicoMix_snvs.vcf.gz, and InSilicoMix_snvs.bed) are available at
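The evaluation above relies on splitting each VCF into an SNV-only file and an indel-only file. The tool used for that split is not stated, so the following is only a sketch of one way to do it, under these assumptions: records are classified purely by REF and ALT allele lengths, header lines are copied to both outputs, and records that are neither SNVs nor indels (for example equal-length multi-base substitutions) are dropped. The function name is hypothetical.

```python
def split_snvs_and_indels(vcf_lines):
    """Split uncompressed VCF lines into (snv_lines, indel_lines).

    Assumption: an SNV has a single-base REF and single-base ALT alleles;
    an indel has at least one ALT allele whose length differs from REF.
    """
    snv_lines, indel_lines = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            # Header lines belong in both output files.
            snv_lines.append(line)
            indel_lines.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]
        # Ignore missing (".") and spanning-deletion ("*") ALT placeholders.
        alts = [a for a in fields[4].split(",") if a not in (".", "*")]
        if not alts:
            continue
        if len(ref) == 1 and all(len(a) == 1 for a in alts):
            snv_lines.append(line)
        elif any(len(a) != len(ref) for a in alts):
            indel_lines.append(line)
    return snv_lines, indel_lines
```

For bgzip-compressed input, the lines could be read with `gzip.open(path, "rt")`; the outputs would still need to be compressed and indexed before being passed to som.py.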

Supplementary Note 2
EVS model and hard filters

EVS model

Empirical variant scoring (EVS) in Strelka2 uses pre-trained random forest models that take a set of features as input and produce the probability of an erroneous variant call. For each of germline, RNA-seq, and somatic variant calling, there are two separate trained random forest models and feature sets for the two high-level variant categories, SNVs and indels. Strelka2 is intended to run with models that have been pre-trained on a combined training data set representing a wide variety of sample-prep and sequencing assays. Although scripts are provided to recreate the EVS model training procedure, the models are not intended to be retrained for each input sequencing dataset to be analyzed (this is in contrast to dynamic re-scoring systems such as the GATK VQSR procedure [3]).

The EVS models are trained using the random forest learning procedures implemented in the scikit-learn package [4], on a set of candidate calls with truth labels assigned as described below. For the germline and RNA-seq EVS models, each random forest uses 50 decision trees with a maximum depth of 12, a minimum of 50 samples per leaf, and no limit on the maximum number of features. For the somatic EVS models, each random forest uses 100 decision trees with a maximum depth of 6. The remaining options are set to scikit-learn defaults.

The training data are compiled from a collection of sequencing runs using different sample prep, sequencing platforms and chemistries. All germline and RNA-seq datasets are from an individual for which a gold standard truth set is available from the Platinum Genomes project [5]. Candidate variants that correspond to the high-confidence regions of the truth set are labeled as true or false using the hap.py haplotype comparison tool [6]. Variants that exist in the truth set but were called with an incorrect genotype are treated as false variants. In the case of germline calling, it is believed that the vast majority of candidate SNVs (but not indels) outside of the high-confidence regions (and classified by hap.py as unknown) are false; for this reason, these SNV variants are added to the set of false variants presented to the model during training, downweighted so that their total weight is half the total weight of the known false variants. Somatic datasets include simulated tumor-normal pairs from the Platinum Genomes project as well as tumor-normal data from real tumor cell lines for which curated (but generally noisier) truth sets have been constructed. Labeling of somatic datasets is done by means of a script included in the Strelka2 distribution.

The output scores produced by the random forest classifier are transformed to phred scale and calibrated by passing the resulting quality values through a linear transform estimated by regressing binned empirical quality onto predicted quality. Finally, variant filter labels are assigned based on thresholds that have been selected to achieve a reasonable tradeoff across multiple datasets.

Features used by the EVS model

Features used in each model are listed here and definitions are provided further below.
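Two numeric steps in the EVS model description above lend themselves to a short worked sketch: the downweighting of unknown SNVs during training, and the phred-scale transform followed by linear calibration. The code below is illustrative only and is not taken from the Strelka2 distribution. The function names, the quality cap `q_cap`, and the use of ordinary least squares over bin pairs are assumptions.

```python
import math


def unknown_snv_weight(total_false_weight, n_unknown):
    """Per-variant weight so that all unknown SNVs together weigh
    half as much as the known false variants."""
    return 0.5 * total_false_weight / n_unknown


def phred(p_error, q_cap=60.0):
    """Convert an error probability to a phred-scaled quality.

    q_cap (an assumption) bounds the result so that p_error == 0
    does not produce an infinite quality.
    """
    floor = 10.0 ** (-q_cap / 10.0)
    return -10.0 * math.log10(max(p_error, floor))


def fit_linear_calibration(predicted_q, empirical_q):
    """Least-squares line regressing binned empirical quality (y)
    onto predicted quality (x). Returns (slope, intercept)."""
    n = len(predicted_q)
    mean_x = sum(predicted_q) / n
    mean_y = sum(empirical_q) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted_q)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(predicted_q, empirical_q))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrated_quality(p_error, slope, intercept):
    """Phred-transform a classifier error probability, then calibrate."""
    return slope * phred(p_error) + intercept
```

For example, with 1000 known false variants of unit weight and 4000 unknown SNVs, each unknown SNV would receive weight 0.125.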

Germline SNV features: GenotypeCategory, SampleRMSMappingQuality, SiteHomopolymerLength, SampleStrandBias, SampleRMSMappingQualityRankSum, SampleReadPosRankSum, RelativeTotalLocusDepth, SampleUsedDepthFraction, ConservativeGenotypeQuality, NormalizedAltHaplotypeCountRatio.

Germline indel features: GenotypeCategory, SampleIndelRepeatCount, SampleIndelRepeatUnitSize, SampleIndelAlleleBiasLower, SampleIndelAlleleBias, SampleProxyRMSMappingQuality, RelativeTotalLocusDepth, SamplePrimaryAltAlleleDepthFraction, ConservativeGenotypeQuality, InterruptedHomopolymerLength, ContextCompressability, IndelCategory, NormalizedAltHaplotypeCountRatio.

Somatic SNV features: SomaticSNVQualityAndHomRefGermlineGenotype, NormalSampleRelativeTotalLocusDepth, TumorSampleAltAlleleFraction, RMSMappingQuality, ZeroMappingQualityFraction, TumorSampleStrandBias, TumorSampleReadPosRankSum, AlleleCountLogOddsRatio, NormalSampleFilteredDepthFraction, TumorSampleFilteredDepthFraction.

Somatic indel features: SomaticIndelQualityAndHomRefGermlineGenotype, TumorSampleReadPosRankSum, TumorSampleLogSymmetricStrandOddsRatio, RepeatUnitLength, IndelRepeatCount, RefRepeatCount, InterruptedHomopolymerLength, TumorSampleIndelNoiseLogOdds, TumorNormalIndelAlleleLogOdds, AlleleCountLogOddsRatio.

Hard filters

The hard filter model is used whenever EVS cannot be; in particular, this is the method used to filter germline homozygous reference calls. The EVS model may be turned off for all variants whenever the assay conditions are suspected to poorly match the EVS model's training conditions, in which case the hard filters are used instead. When hard filters are used, each filter is triggered when a single feature crosses a critical value. Filtration thresholds are set to remove variants that are very likely to be incorrect. The LowDepth filter described below is applied to all germline and somatic calls. Additionally, the following filters are used depending on the variant type.

Hard filter thresholds for germline model

Shared filter conditions: a variant is filtered if ConservativeGenotypeQuality < 15 or RelativeTotalLocusDepth > 3.
SNV-specific filter conditions: an SNV is filtered if SampleStrandBias > 10.

Hard filter thresholds for somatic model

SNV-specific filter conditions: an SNV is filtered if SomaticSNVQualityAndHomRefGermlineGenotype < 15, SiteFilteredBasecallFrac >= 0.4, or SpanningDeletionFraction >
Indel-specific filter conditions: an indel is filtered if SomaticIndelQualityAndHomRefGermlineGenotype < 40 or IndelWindowFilteredBasecallFrac >= 0.3.

Hard filters that are also applied when the EVS model is used

When the EVS model is used, variant filtering is primarily based on the score computed by the random forest model. In some cases this score is supplemented with additional hard filters.
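The germline hard-filter thresholds listed above amount to a few independent comparisons, each on a single feature. A minimal sketch follows, assuming the feature values have already been computed for a call. The function name is hypothetical, and the filter labels other than LowGQX (which appears in the hap.py command in Supplementary Note 1) are illustrative rather than confirmed by this text.

```python
def germline_hard_filters(is_snv, genotype_quality, relative_depth,
                          strand_bias=None):
    """Return the list of hard-filter labels triggered for one germline call.

    genotype_quality : ConservativeGenotypeQuality (GQX)
    relative_depth   : RelativeTotalLocusDepth
    strand_bias      : SampleStrandBias (only consulted for SNVs)
    """
    filters = []
    # Shared conditions, applied to every variant type.
    if genotype_quality < 15:
        filters.append("LowGQX")
    if relative_depth > 3:
        filters.append("HighDepth")      # illustrative label
    # SNV-specific condition.
    if is_snv and strand_bias is not None and strand_bias > 10:
        filters.append("HighSNVSB")      # illustrative label
    return filters or ["PASS"]
```

Because each filter looks at one feature in isolation, a call can carry several labels at once, for example `germline_hard_filters(True, 10, 4.0, 12.0)` triggers all three.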

LowDepth: This filter is applied to all germline and somatic calls. For germline site calls, read depths are calculated from the base calls used for site genotyping. For germline indel loci, read depths are taken from the depths of the sites preceding the indels. If read depths are below 3, genotype calls are filtered out and tagged as LowDepth. For variant calls in particular, allelic depths for the reference and alternative alleles are additionally considered: if the sum of the allelic depths is below 3, the associated variant calls are supplemented with a LowDepth filter. For somatic variants, tumor sample read depths are calculated for SNVs and indels in a similar way. If tumor depths are below 2, the associated somatic variant calls are supplemented with a LowDepth filter.

NormalSampleRelativeTotalLocusDepth: This filter is applied to all somatic calls. A variant is filtered if this value is greater than 3.

Germline and RNA-seq feature descriptions

GenotypeCategory
    A category variable reflecting the most likely genotype as heterozygous (0), homozygous (1) or alt-heterozygous (2).

SampleRMSMappingQuality
    RMS mapping quality of all reads spanning the variant in one sample. This feature matches SAMPLE/MQ in the VCF spec.

SiteHomopolymerLength
    Length of the longest homopolymer containing the current position if this position can be treated as any base.

InterruptedHomopolymerLength
    One less than the length of the longest interrupted homopolymer in the reference sequence containing the current position. An interrupted homopolymer is a string that has edit distance 1 to a homopolymer.

SampleStrandBias
    Log ratio of the sample's genotype likelihood computed assuming the alternate allele occurs on only one strand vs both strands (thus positive values indicate bias).

SampleRMSMappingQualityRankSum
    Z-score of Mann-Whitney U test for reference vs alternate allele mapping quality values in one sample.
SampleReadPosRankSum
    Z-score of Mann-Whitney U test for reference vs alternate allele read positions in one sample.

RelativeTotalLocusDepth
    Locus depth relative to expectation: this is the ratio of total read depth at the variant locus in all samples over the total expected depth in all samples. Depth at the variant locus includes reads at any mapping quality. Expected depth is taken from the preliminary depth estimation step. This value is set to 1 in exome and targeted analyses, because it is problematic to define expected depth in this case.

SampleUsedDepthFraction
    The ratio of reads used to genotype the locus over the total number of reads at the variant locus in one sample. Reads are not used if the mapping quality is less than the minimum threshold, if the local read alignment fails the mismatch density filter or if the basecall is ambiguous.

ConservativeGenotypeQuality
    The model-based ConservativeGenotypeQuality (GQX) value for one sample, reflecting the conservative confidence of the called genotype.

NormalizedAltHaplotypeCountRatio
    For variants in an active region, the proportion of reads supporting the top 2 haplotypes, or 0 if haplotyping failed due to this proportion being below threshold. For heterozygous variants with only one non-reference allele, the proportion is doubled so that its value is expected to be close to 1.0 regardless of genotype. The feature is set to -1 for variants not in an active region.

SampleIndelRepeatCount
    The number of times the primary indel allele's repeat unit occurs in a haplotype containing the indel allele. The primary indel allele's repeat unit is the smallest possible sequence such that the inserted/deleted sequence can be formed by concatenating multiple copies of it. The primary indel allele is the best supported allele among all overlapping indel alleles at the locus of interest in one sample.

SampleIndelRepeatUnitSize
    Length of the primary indel allele's repeat unit, as defined for feature SampleIndelRepeatCount.

SampleIndelAlleleBiasLower
    The negative log probability of seeing N or fewer observations of one allele in a heterozygous variant out of the total observations from both alleles in one sample. N is typically the observation count of the reference allele. If the heterozygous variant does not include the reference allele, the first indel allele is used instead.

SampleIndelAlleleBias
    Similar to SampleIndelAlleleBiasLower, except the count used is twice the count of the least frequently observed allele.

SampleProxyRMSMappingQuality
    RMS mapping quality of all reads spanning the position immediately preceding the indel in one sample. This feature approximates the SAMPLE/MQ value defined in the VCF spec.

SamplePrimaryAltAlleleDepthFraction
    The ratio of the confident observation count of the best-supported non-reference allele at the variant locus, over all confident allele observation counts in one sample.

ContextCompressability
    The length of the upstream or downstream reference context (whichever is greater) that can be represented using 5 Ziv-Lempel keywords [7,8]. The Ziv-Lempel keywords are obtained using the scheme of Ziv and Lempel [8], by traversing the sequence and successively selecting the shortest subsequence that has not yet been encountered.

IndelCategory
    A binary variable set to 1 if the indel allele is a primitive deletion or 0 otherwise.

SamplePrimaryAltAlleleDepth
    The confident observation count of the best-supported non-reference allele at the variant locus.

VariantAlleleQuality
    The model-based variant quality value reflecting confidence that the called variant is present in at least one sample, regardless of genotype. This feature matches QUAL in the VCF spec.

SampleMeanDistanceFromReadEdge
    For all non-reference basecall observations in one sample at a candidate SNV site, report the mean distance to the closest edge of each alternate basecall's read. Distance is measured in read-coordinates, zero-indexed, and is allowed to have a maximum value of 20.

SampleRefAlleleDepth
    The confident observation count of the reference allele at the variant locus.

SampleIndelMeanDistanceFromReadEdge
    For all indel allele observations in one sample at a candidate indel locus, report the mean distance to the closest edge of each indel allele's read. Distance is measured in read-coordinates, zero-indexed, and is allowed to have a maximum value of 20. The left or right side of the indel may be used to provide the shortest distance, but the indel will only be considered in its left-aligned position.

SampleRefRepeatCount
    The number of times the primary indel allele's repeat unit occurs in the reference sequence.

Somatic feature descriptions

Note that for somatic features "all samples" refers to the tumor and matched normal samples together.

SomaticSNVQualityAndHomRefGermlineGenotype
    Posterior probability of a somatic SNV conditioned on a homozygous reference germline genotype. When INFO/NT is "ref", this feature matches INFO/QSS_NT in the VCF output.

NormalSampleRelativeTotalLocusDepth
    This feature matches the germline RelativeTotalLocusDepth feature, except that it reflects the depth of only the matched normal sample.

TumorSampleAltAlleleFraction
    Fraction of the tumor sample's observations which are not the reference allele. This is restricted to a maximum of 0.5 to prevent the model from overtraining against high somatic allele frequencies (these might be common e.g. for loss of heterozygosity regions from liquid tumors).

RMSMappingQuality
    Root mean square read mapping quality of all reads spanning the variant in all samples. This feature matches INFO/MQ in the VCF spec.

ZeroMappingQualityFraction
    Fraction of read mapping qualities equal to zero, for all reads spanning the variant in all samples.

InterruptedHomopolymerLength
    One less than the length of the longest interrupted homopolymer in the reference sequence containing the current position. An interrupted homopolymer is a string that has edit distance 1 to a homopolymer.
TumorSampleStrandBias
    Log ratio of the tumor-sample somatic allele likelihood computed assuming the somatic allele occurs on only one strand vs both strands (thus higher values indicate greater bias).

TumorSampleReadPosRankSum
    Z-score of Mann-Whitney U test for reference vs non-reference allele read positions in the tumor sample's observations.

AlleleCountLogOddsRatio
    The log odds ratio of allele counts $\log \frac{r_t a_n}{r_n a_t}$, given reference $(r_t, r_n)$ and non-reference $(a_t, a_n)$ allele counts for the tumor and normal sample pair.

NormalSampleFilteredDepthFraction
    The fraction of reads that were filtered out of the normal sample before calling the variant locus.

TumorSampleFilteredDepthFraction
    The fraction of reads that were filtered out of the tumor sample before calling the variant locus.

SomaticIndelQualityAndHomRefGermlineGenotype
    Posterior probability of a somatic indel conditioned on a homozygous reference germline genotype. When INFO/NT is "ref", this feature matches INFO/QSI_NT in the VCF output.

TumorSampleLogSymmetricStrandOddsRatio
    Log of the symmetric strand odds ratio of allele counts $\log \left( \frac{r_{fwd} a_{rev}}{r_{rev} a_{fwd}} + \frac{r_{rev} a_{fwd}}{r_{fwd} a_{rev}} \right)$, given reference $(r_{fwd}, r_{rev})$ and non-reference $(a_{fwd}, a_{rev})$ confident counts of the tumor sample's observations.

RepeatUnitLength
    The length of the somatic indel allele's repeat unit. The repeat unit is the smallest possible sequence such that the inserted/deleted sequence can be formed by concatenating multiple copies of it.

IndelRepeatCount
    The number of times the somatic indel allele's repeat unit occurs in a haplotype containing the indel allele.

RefRepeatCount
    The number of times the somatic indel allele's repeat unit occurs in the reference sequence.

TumorSampleIndelNoiseLogOdds
    Log ratio of the frequency of the candidate indel vs all other indels at the same locus in the tumor sample. The frequencies are computed from reads which confidently support a single allele at the locus.

TumorNormalIndelAlleleLogOdds
    Log ratio of the frequency of the candidate indel in the tumor vs normal samples. The frequencies are computed from reads which confidently support a single allele at the locus.

SiteFilteredBasecallFrac
    The maximum value over all samples of SampleSiteFilteredBasecallFrac, which is the fraction of basecalls at a site which have been removed by the mismatch density filter in a given sample.

IndelWindowFilteredBasecallFrac
    The maximum value over all samples of SampleIndelWindowFilteredBasecallFrac, which is the fraction of basecalls in a window extending 50 bases to each side of the candidate indel's call position which have been removed by the mismatch density filter in a given sample.

SpanningDeletionFraction
    The maximum value over all samples of SampleSpanningDeletionFraction, which is the fraction of reads crossing a candidate SNV site with spanning deletions in a given sample.
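Three of the feature definitions above are compact enough to sketch directly: AlleleCountLogOddsRatio, TumorSampleLogSymmetricStrandOddsRatio, and the Ziv-Lempel keyword parse behind ContextCompressability. This is an illustration under stated assumptions, not the Strelka2 implementation. The function names are hypothetical, the `pseudocount` added to every count (to avoid division by zero) is an assumption, and the handling of a trailing fragment that repeats an earlier keyword is a simplification.

```python
import math


def allele_count_log_odds_ratio(r_t, a_t, r_n, a_n, pseudocount=0.5):
    """log((r_t * a_n) / (r_n * a_t)) for tumor (t) and normal (n)
    reference (r) and non-reference (a) allele counts."""
    r_t, a_t, r_n, a_n = (c + pseudocount for c in (r_t, a_t, r_n, a_n))
    return math.log((r_t * a_n) / (r_n * a_t))


def log_symmetric_strand_odds_ratio(r_fwd, r_rev, a_fwd, a_rev,
                                    pseudocount=0.5):
    """log(R + 1/R) with R = (r_fwd * a_rev) / (r_rev * a_fwd)."""
    r_fwd, r_rev, a_fwd, a_rev = (
        c + pseudocount for c in (r_fwd, r_rev, a_fwd, a_rev))
    ratio = (r_fwd * a_rev) / (r_rev * a_fwd)
    return math.log(ratio + 1.0 / ratio)


def lz_keyword_span(context, max_keywords=5):
    """Length of the prefix of `context` covered by the first
    `max_keywords` Ziv-Lempel keywords.

    Each keyword is the shortest substring, starting at the current
    position, that has not been selected as a keyword before.
    """
    seen = set()
    pos = 0
    for _ in range(max_keywords):
        if pos >= len(context):
            break
        end = pos + 1
        # Extend the candidate until it is new or the context runs out.
        while end < len(context) and context[pos:end] in seen:
            end += 1
        seen.add(context[pos:end])
        pos = end
    return pos
```

A highly repetitive context is covered much further by five keywords than a diverse one: twenty consecutive A bases parse as A, AA, AAA, AAAA, AAAAA (span 15), whereas ACGTACGT parses as A, C, G, T, AC (span 6). The feature itself would take the larger of the upstream and downstream spans; the direction in which the upstream context is traversed is not specified in the text.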

References

1. Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: Fast processing of NGS alignment formats. Bioinformatics 31 (2015).
2. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 (2009).
3. Van der Auwera, G. A. et al. From fastq data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43 (2013).
4. Pedregosa, F. & Varoquaux, G. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011).
5. Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27 (2017).
6. Krusche, P. Haplotype Comparison Tools.
7. Lesne, A., Blanc, J. L. & Pezard, L. Entropy estimation of very short symbolic sequences. Phys. Rev. E - Stat. Nonlinear, Soft Matter Phys. 79 (2009).
8. Ziv, J. & Lempel, A. A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory 23 (1977).
