Supplementary Notes for Strelka2: Fast and accurate variant calling for clinical sequencing applications

Supplementary Note 1

Command lines to run analyses

Downloading PrecisionFDA Challenge Datasets

1. Consistency challenge (https://precision.fda.gov/challenges/consistency)

Dataset: Garvan
    Files: NA12878-Garvan-Vial1_R1.fastq.gz, NA12878-Garvan-Vial1_R2.fastq.gz
    Library prep: TruSeq Nano DNA Library Prep kit
    Read length: 2x150bp
    Coverage: ~40x
    Instrument: HiSeq X

Dataset: HLI
    Files: TSNano_1lane_L008_13801_NA12878_R1_001.fastq.gz, TSNano_1lane_L008_13801_NA12878_R1_002.fastq.gz
    Library prep: TruSeq Nano DNA Library Prep kit
    Read length: 2x150bp
    Coverage: ~35x
    Instrument: HiSeq X

2. Truth Challenge ([URL missing])

Library prep: TruSeq DNA PCR-Free

Dataset: HG001
    Files: HG001-NA x_1.fastq.gz, HG001-NA x_2.fastq.gz
    Read length: 2x148bp
    Coverage: ~50x
    Instrument: HiSeq 2500

Dataset: HG002
    Files: HG002-NA x_1.fastq.gz, HG002-NA x_1.fastq.gz
    Read length: 2x148bp
    Coverage: ~50x
    Instrument: HiSeq 2500

Read alignment

1. Run bwa-mem (version [missing])

    bwa mem -M -t 28 -R [read group missing] $hg19 $fastq_1 $fastq_2 | samtools view -b -o $unsorted_bam -

2. Sort reads (samtools version 1.3)

    samtools sort $unsorted_bam -O bam -o $sorted_bam -@ [thread count missing]

3. Remove duplicates

    samtools rmdup <(samtools view -F 0x100 -u $sorted_bam) $sample_bam

4. Index

    samtools index $sample_bam
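As a rough illustration, the four alignment steps above can be assembled as shell command strings from Python. This is a dry-run sketch only: the function name `alignment_commands`, the intermediate file names, and the reuse of the thread count for `samtools sort -@` are assumptions made for illustration. The flags themselves are copied from the commands above, and the `-R` read-group argument is left out because its value is not preserved here.

```python
import shlex


def alignment_commands(ref, fastq_1, fastq_2, sample, threads=28):
    """Return the alignment steps as shell command strings (nothing is executed)."""
    q = shlex.quote
    # Hypothetical intermediate file names, derived from the sample name.
    unsorted_bam = f"{sample}.unsorted.bam"
    sorted_bam = f"{sample}.sorted.bam"
    sample_bam = f"{sample}.bam"
    return [
        # 1. Align with bwa mem and convert the SAM stream to BAM.
        f"bwa mem -M -t {threads} {q(ref)} {q(fastq_1)} {q(fastq_2)} "
        f"| samtools view -b -o {q(unsorted_bam)} -",
        # 2. Sort reads (thread count for -@ is an assumption).
        f"samtools sort {q(unsorted_bam)} -O bam -o {q(sorted_bam)} -@ {threads}",
        # 3. Skip secondary alignments (flag 0x100), then remove duplicates.
        #    The <(...) process substitution requires bash.
        f"samtools rmdup <(samtools view -F 0x100 -u {q(sorted_bam)}) {q(sample_bam)}",
        # 4. Index the final BAM.
        f"samtools index {q(sample_bam)}",
    ]
```

Printing the returned strings gives a script that can be reviewed before anything is run, which is convenient when the same steps are repeated for several samples.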

The hg19 reference file is available at [URL missing].

Germline variant calling

Strelka2 (version 2.8.3)

1. Configuration

    python ${strelka_install_path}/bin/configureStrelkaGermlineWorkflow.py --ref $hg19 --bam $sample_bam --runDir $strelka_analysis_path

2. Run germline calling

    python $strelka_analysis_path/runWorkflow.py -m local -j 28

Sentieon DNAseq Haplotyper (version sentieon-genomics [missing], equivalent to GATK v3.7)

We followed the Sentieon DNAseq Haplotyper pipeline as written in the software manual for Sentieon Genomics pipeline tools (version [missing]). This pipeline is equivalent to the GATK best practices pipeline.

1. Indel realignment

    sentieon driver --temp_dir $temp_dir -r $hg19_fasta -t 28 -i $sample_bam --algo Realigner -k $Mills ${sample}.realigned.bam

2. Base quality score recalibration

    sentieon driver -r $hg19 -t 28 -i ${sample}.realigned.bam --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY --algo QualCal -k $dbsnp -k $Mills ${sample}.recal.table

3. Haplotype calling

    sentieon driver --temp_dir $temp_dir -r $hg19 -t 28 -i ${sample}.realigned.bam -q ${sample}.recal.table --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY --algo Haplotyper -d $dbsnp --emit_conf=10 --call_conf=30 --prune_factor=3 ${sample}.vcf

4. Calculating SNV variant quality score recalibration (VQSR)

    sentieon driver -r $hg19 -t 28 --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY --algo VarCal -v ${sample}.vcf --resource 1000G_phase1.snps.high_confidence.hg19.sites.vcf --resource_param 1000G,known=false,training=true,truth=false,prior=[missing] --resource 1000G_omni2.5.hg19.sites.vcf --resource_param omni,known=false,training=true,truth=true,prior=[missing] --resource $dbsnp --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 --resource hapmap_3.3.hg19.sites.vcf --resource_param hapmap,known=false,training=true,truth=true,prior=[missing] --annotation QD --annotation MQ --annotation MQRankSum --annotation ReadPosRankSum --annotation FS --var_type SNP --plot_file ${sample}.snv.vqsr.csv --max_gaussians 8 --tranches_file ${sample}.snv.tranches.csv ${sample}.snv.recal

5. Applying SNV VQSR

    sentieon driver -r $hg19 -t 28 --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY --algo ApplyVarCal -v ${sample}.vcf --var_type SNP --recal ${sample}.snv.recal --tranches_file ${sample}.snv.tranches.csv --sensitivity 99.5 ${sample}.snv.vqsr.vcf

6. Calculating indel VQSR

    sentieon driver -r $hg19 -t 28 --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY --algo VarCal -v ${sample}.snv.vqsr.vcf --resource 1000G_phase1.indels.hg19.sites.vcf --resource_param 1000G,known=false,training=true,truth=true,prior=[missing] --resource Mills_and_1000G_gold_standard.indels.hg19.sites.vcf --resource_param mills,known=false,training=true,truth=true,prior=[missing] --resource $dbsnp --resource_param dbsnp,known=true,training=false,truth=false,prior=2.0 --annotation QD --annotation MQRankSum --annotation ReadPosRankSum --annotation FS --var_type INDEL --plot_file ${sample}.indel.vqsr.csv --max_gaussians 4 --tranches_file ${sample}.indel.tranches.csv ${sample}.indel.recal

7. Applying indel VQSR

    sentieon driver -r $hg19 -t 28 --interval chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY --algo ApplyVarCal -v ${sample}.snv.vqsr.vcf --var_type INDEL --recal ${sample}.indel.recal --tranches_file ${sample}.indel.tranches.csv --sensitivity 99.5 ${sample}.all.vqsr.vcf

Germline calling evaluation

Germline calling accuracy was measured using hap.py (version 0.3.7) against the latest NIST Genome in a Bottle truth set (version 3.3.2).
Hap.py command for Strelka2 results

    hap.py -X --roc GQX --roc-filter LowGQX --fixchr --force-interactive --threads 4 -f $high_confidence_regions_bed -o $sample $giab_truth_vcf $vcf -V --engine vcfeval

Hap.py command for Sentieon DNAseq Haplotyper and PrecisionFDA submissions

    hap.py -X --fixchr --force-interactive --threads 4 -f $high_confidence_regions_bed -o $sample $giab_truth_vcf $vcf -V --engine vcfeval

Creating in silico mixtures for somatic calling benchmark

We created 4 in silico tumor/normal pairs by down-sampling and mixing 8 HiSeqX WGS samples, 4 from NA12877 (normal) and the other 4 from NA12878 (tumor), as in the table below. Each entry gives an input sample followed by its down-sample ratio; tumor samples have ~110x coverage and normal samples ~37x coverage.

Dataset 1 (tumor purity / normal purity: [missing]% / 100%)
    Tumor sample inputs:
        NA12877_2   0.8
        NA12877_3   0.8
        NA12877_[sample number and ratio missing]
        NA12878_1   0.2
        NA12878_2   0.2
        NA12878_3   0.2
    Tumor file: InSilicoTumor1_Purity20.bam
    Normal sample inputs:
        NA12877_1   1
    Normal file: Normal1_Purity100.bam

Dataset 2 (tumor purity / normal purity: [missing]% / 100%)
    Tumor sample inputs:
        NA12877_[sample number and ratio missing]
        NA12877_[sample number and ratio missing]
        NA12878_[sample number and ratio missing]
        NA12878_[sample number and ratio missing]
    Tumor file: InSilicoTumor2_Purity50.bam
    Normal sample inputs:
        NA12877_3   1
    Normal file: Normal2_Purity100.bam

Dataset 3 (tumor purity / normal purity: [missing]% / 100%)
    Tumor sample inputs:
        NA12877_1   0.3
        NA12877_[sample number and ratio missing]
        NA12878_1   0.8
        NA12878_2   0.8
        NA12878_3   0.8
    Tumor file: InSilicoTumor3_Purity80.bam
    Normal sample inputs:
        NA12877_3   1
    Normal file: Normal2_Purity100.bam

Dataset 4 (tumor purity / normal purity: [missing]% / 90%)
    Tumor sample inputs:
        NA12877_1   0.3
        NA12877_4   0.3
        NA12878_1   0.8
        NA12878_2   0.8
        NA12878_3   0.8
    Tumor file: InSilicoTumor3_Purity80.bam
    Normal sample inputs:
        NA12877_3   0.9
        NA12878_[sample number and ratio missing]
    Normal file: InSilicoNormal3_Purity90.bam

All the bam files are available at [URL missing].

The GRCh38Decoy reference file is available at [URL missing].

Below is the procedure to create in silico mixture data using sambamba [1] (version 0.5.8) and samtools [2] (version 1.3).

1. Down sample (sambamba version 0.5.8)

    sambamba_v0.5.8 view -h -t 10 -s $down_sample_ratio -f bam --subsampling-seed=10 $bam_i -o ${downsampled_i}.bam

2. Merge (samtools version 1.3)

    samtools merge -@ 20 merged.bam ${downsampled_1}.bam ${downsampled_n}.bam

3. Index merged

    samtools index ${merged}.bam

4. Replace SM_0

    sambamba_v0.5.8 view -H before_reheader.bam | sed 's/^\(\@RG.*\)SM:[a-zA-Z0-9_]*/\1SM:admix/g' - > $new_header

5. Reheader

    samtools reheader $new_header ${reheader}.bam > ${mixture}.bam

6. Index

    samtools index ${mixture}.bam

Somatic variant calling

Strelka2 (version 2.8.3)

1. Configuration

    python ${strelka_install_path}/bin/configureStrelkaSomaticWorkflow.py --ref $GRCh38Decoy --normalBam $normal_bam --tumorBam $tumor_bam --callRegions callable.bed.gz --runDir $strelka_analysis_path

(Note: callable.bed.gz lists complete chromosomes so as to exclude by omission a number of smaller decoy contigs from hg38, which are currently problematic for Strelka2. The bed file lists all autosomes together with chromosomes X, Y and M, as well as all random and unplaced contigs. This file is available at [URL missing].)

2. Run somatic calling

    python ${strelka_analysis_path}/runWorkflow.py -m local -j 28

Sentieon TNseq TNhaplotyper (version sentieon-genomics [missing], equivalent to MuTect2)

We followed the TNhaplotyper pipeline as written in the software manual for Sentieon Genomics pipeline tools (version [missing]). This pipeline is equivalent to the GATK best practices pipeline using MuTect2.

1. Tumor indel realignment

    sentieon driver -r $GRCh38Decoy -t 28 -i $tumor_bam --algo Realigner -k $Mills -k $known_indels ${sample}.tumour.realigned.bam

2. Normal indel realignment

    sentieon driver -r $GRCh38Decoy -t 28 -i $normal_bam --algo Realigner -k $Mills -k $known_indels ${sample}.normal.realigned.bam

3. Tumor BQSR

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.tumour.realigned.bam --algo QualCal -k $dbsnp -k $Mills -k $known_indels ${sample}.tumour.recal.table

4. Normal BQSR

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.normal.realigned.bam --algo QualCal -k $dbsnp -k $Mills -k $known_indels ${sample}.normal.recal.table

5. Corealign

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.tumour.realigned.bam -i ${sample}.normal.realigned.bam -q ${sample}.tumour.recal.table -q ${sample}.normal.recal.table --algo Realigner -k $Mills -k $known_indels ${sample}.corealigned.bam

6. Variant calling

    sentieon driver -r $GRCh38Decoy -t 28 -i ${sample}.corealigned.bam --algo TNhaplotyper --tumor_sample TUMOR --normal_sample NORMAL --dbsnp $dbsnp ${sample}.somatic.vcf

Somatic calling evaluation

Somatic calling accuracy was measured using som.py (version 0.3.7) within the hap.py package against the truth set consisting of the variant calls in NA12878 where the corresponding NA12877 genotype is homozygous reference. To measure the accuracy separately for indels and SNVs, the VCF file was split into two so that each file contains either indel or SNV calls.

Som.py command for indels

    som.py InSilicoMix_indels.vcf.gz $vcf -f InSilicoMix_indels.bed -o ${sample}.indels --roc $score

Som.py command for SNVs

    som.py InSilicoMix_snvs.vcf.gz $vcf -f InSilicoMix_snvs.bed -o ${sample}.snvs --roc $score

Here $score is strelka.indel.evs for Strelka, mutect.indel for TNhaplotyper indels, and mutect.snv for TNhaplotyper SNVs.

All the truth files (InSilicoMix_indels.vcf.gz, InSilicoMix_indels.bed, InSilicoMix_snvs.vcf.gz, and InSilicoMix_snvs.bed) are available at [URL missing].
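The truth-set definition above (variant calls in NA12878 where the NA12877 genotype is homozygous reference, split into SNVs and indels) can be illustrated with a small sketch. The dictionary-based genotype representation, the function names, and the treatment of a missing normal genotype as homozygous reference are simplifying assumptions made for illustration; the actual evaluation works on VCF and BED files through som.py.

```python
HOM_REF = ("0/0", "0|0")


def is_snv(ref, alt):
    """An SNV replaces exactly one reference base with one alternate base."""
    return len(ref) == 1 and len(alt) == 1


def somatic_truth(tumor_calls, normal_genotypes):
    """Split tumor-only variants into SNV and indel truth lists.

    tumor_calls: dict mapping (chrom, pos, ref, alt) to a genotype string such as "0/1"
    normal_genotypes: dict with the same keys for the normal sample; a missing
        key is treated as homozygous reference (an assumption of this sketch)
    """
    snvs, indels = [], []
    for key in sorted(tumor_calls):
        if tumor_calls[key] in HOM_REF:
            continue  # no variant in the tumor stand-in sample
        if normal_genotypes.get(key, "0/0") not in HOM_REF:
            continue  # variant also present in the normal sample
        _chrom, _pos, ref, alt = key
        (snvs if is_snv(ref, alt) else indels).append(key)
    return snvs, indels
```

Keeping the SNV and indel lists separate mirrors the two som.py runs above, each of which is scored against its own truth file.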

Supplementary Note 2

EVS model and hard filters

EVS model

Empirical variant scoring (EVS) in Strelka2 uses pre-trained random forest models taking a set of features as input to produce the probability of an erroneous variant call. For each of germline, RNA-seq, and somatic variant calling, there are two separate trained random forest models and feature sets for the two high-level variant categories, SNVs and indels. Strelka2 is intended to run with models which have been pre-trained on a combined training data set representing a wide variety of sample-prep and sequencing assays. Note that although scripts are provided to recreate the EVS model training procedure, there is no intention for the models to be retrained for each input sequencing dataset to be analyzed (this is in contrast to dynamic re-scoring systems such as the GATK VQSR procedure [3]).

The EVS models are trained using the random forest learning procedures implemented in the scikit-learn package [4], trained on a set of candidate calls with truth labels assigned as described below. For the germline and RNA-seq EVS models, each random forest uses 50 decision trees with a maximum depth of 12, a minimum of 50 samples per leaf, and no limit on the maximum number of features. For the somatic EVS models, each random forest uses 100 decision trees with a maximum depth of 6. The remaining options are set to scikit-learn defaults.

The training data are compiled from a collection of sequencing runs using different sample prep, sequencing platforms and chemistries. All germline and RNA-seq datasets are from an individual for which a gold standard truth set is available from the Platinum Genomes project [5]. Candidate variants that correspond to the high-confidence regions of the truth set are labeled as true or false using the hap.py haplotype comparison tool [6]. Variants that exist in the truth set but were called with incorrect genotype are treated as false variants.
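The random forest settings stated above can be written down as scikit-learn keyword arguments. This is a sketch of the stated hyperparameters only, not the training script shipped with Strelka2: the names `EVS_FOREST_PARAMS` and `build_forest` are hypothetical, and everything not listed is left at the scikit-learn default, as the text describes.

```python
# Keys are parameter names of sklearn.ensemble.RandomForestClassifier.
_GERMLINE_LIKE = {
    "n_estimators": 50,      # 50 decision trees
    "max_depth": 12,         # maximum depth of 12
    "min_samples_leaf": 50,  # minimum of 50 samples per leaf
    "max_features": None,    # no limit on the number of features per split
}

EVS_FOREST_PARAMS = {
    "germline": dict(_GERMLINE_LIKE),
    "rnaseq": dict(_GERMLINE_LIKE),
    "somatic": {
        "n_estimators": 100,  # 100 decision trees
        "max_depth": 6,       # maximum depth of 6
    },
}


def build_forest(kind):
    """Create an untrained forest for 'germline', 'rnaseq' or 'somatic'.

    scikit-learn is imported lazily so the parameter table above can be
    inspected without it being installed.
    """
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(**EVS_FOREST_PARAMS[kind])
```

One forest of each kind would be built per variant category (SNVs and indels), since the text describes separate models and feature sets for the two.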
In the case of germline calling, it is believed that the vast majority of candidate SNVs (but not indels) outside of the high-confidence regions (and classified by hap.py as unknown) are false; for this reason, these SNV variants are added to the set of false variants that are presented to the model during training, down-weighted so as to have a total weight which is half the total weight of the known false variants.

Somatic datasets include simulated tumor-normal pairs from the Platinum Genomes project as well as tumor-normal data from real tumor cell lines for which curated (but generally noisier) truth sets have been constructed. Labeling of somatic datasets is done by means of a script included in the Strelka2 distribution.

The output scores produced by the random forest classifier are transformed to phred scale and calibrated by passing the resulting quality values through a linear transform estimated by regressing binned empirical quality onto predicted quality. Finally, variant filter labels are assigned based on thresholds that have been selected to achieve a reasonable tradeoff across multiple datasets.

Features used by the EVS model

Features used in each model are listed below; definitions are provided further below.
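The down-weighting of unknown SNVs and the phred-scale calibration described in this section can be sketched in a few lines. This is an illustrative reading of the text, not the Strelka2 implementation: the function names, the unit weight for known false variants, and the probability floor are assumptions.

```python
import math


def unknown_snv_weight(n_known_false, n_unknown, known_weight=1.0):
    """Per-variant weight such that all unknown SNVs together carry half
    the total weight of the known false variants."""
    return 0.5 * n_known_false * known_weight / n_unknown


def phred(p_error, floor=1e-10):
    """Convert an error probability to a phred-scaled quality value.

    The floor (an assumption) avoids taking the logarithm of zero.
    """
    return -10.0 * math.log10(max(p_error, floor))


def fit_line(predicted_q, empirical_q):
    """Least-squares slope and intercept for empirical_q ~ slope * predicted_q + intercept,
    i.e. binned empirical quality regressed onto predicted quality."""
    n = len(predicted_q)
    mean_x = sum(predicted_q) / n
    mean_y = sum(empirical_q) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted_q)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted_q, empirical_q))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(p_error, slope, intercept):
    """Apply the fitted linear transform to a phred-scaled classifier output."""
    return slope * phred(p_error) + intercept
```

For example, with 1000 known false variants and 4000 unknown SNVs, each unknown SNV would receive weight 0.125, so the unknown set contributes a total weight of 500.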

Germline SNV features: GenotypeCategory, SampleRMSMappingQuality, SiteHomopolymerLength, SampleStrandBias, SampleRMSMappingQualityRankSum, SampleReadPosRankSum, RelativeTotalLocusDepth, SampleUsedDepthFraction, ConservativeGenotypeQuality, NormalizedAltHaplotypeCountRatio.

Germline indel features: GenotypeCategory, SampleIndelRepeatCount, SampleIndelRepeatUnitSize, SampleIndelAlleleBiasLower, SampleIndelAlleleBias, SampleProxyRMSMappingQuality, RelativeTotalLocusDepth, SamplePrimaryAltAlleleDepthFraction, ConservativeGenotypeQuality, InterruptedHomopolymerLength, ContextCompressability, IndelCategory, NormalizedAltHaplotypeCountRatio.

Somatic SNV features: SomaticSNVQualityAndHomRefGermlineGenotype, NormalSampleRelativeTotalLocusDepth, TumorSampleAltAlleleFraction, RMSMappingQuality, ZeroMappingQualityFraction, TumorSampleStrandBias, TumorSampleReadPosRankSum, AlleleCountLogOddsRatio, NormalSampleFilteredDepthFraction, TumorSampleFilteredDepthFraction.

Somatic indel features: SomaticIndelQualityAndHomRefGermlineGenotype, TumorSampleReadPosRankSum, TumorSampleLogSymmetricStrandOddsRatio, RepeatUnitLength, IndelRepeatCount, RefRepeatCount, InterruptedHomopolymerLength, TumorSampleIndelNoiseLogOdds, TumorNormalIndelAlleleLogOdds, AlleleCountLogOddsRatio.

Hard filters

The hard filter model is used whenever EVS cannot be; in particular, this is the method used to filter germline homozygous reference calls. The EVS model may be turned off for all variants whenever the assay conditions are suspected to poorly match the EVS model's training conditions, in which case the hard filters will be used instead. When hard filters are used, each filter is triggered when a single feature exceeds some critical value. Filtration thresholds are set to remove variants which are very likely to be incorrect. The LowDepth filter described below is applied to all germline and somatic calls. Additionally, the following filters are used depending on the variant type.

Hard filter thresholds for germline model

Shared filter conditions: Variant is filtered if ConservativeGenotypeQuality < 15 or RelativeTotalLocusDepth > 3.

SNV-specific filter conditions: SNV is filtered if SampleStrandBias > 10.

Hard filter thresholds for somatic model

SNV-specific filter conditions: SNV is filtered if SomaticSNVQualityAndHomRefGermlineGenotype < 15, SiteFilteredBasecallFrac >= 4, or SpanningDeletionFraction > [value missing].

Indel-specific filter conditions: Indel is filtered if SomaticIndelQualityAndHomRefGermlineGenotype < 40 or IndelWindowFilteredBasecallFrac >= 3.

Hard filters that are also applied when the EVS model is used

When the EVS model is used, variant filtering is primarily based on the score computed by the random forest model. In some cases this score is supplemented with additional hard filters.
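The germline threshold rules listed earlier in this section can be sketched as a small function in which each filter fires when a single feature crosses its critical value. Only the germline thresholds are shown because they are fully stated in the text. The function name and the choice to return feature names rather than VCF filter labels are assumptions made for illustration.

```python
def germline_hard_filter(features, is_snv):
    """Return the names of the features whose hard-filter threshold is crossed.

    features: dict mapping feature name to value for one variant call
    is_snv: True for SNVs, False for indels
    An empty list means the call passes the germline hard filters shown here.
    """
    failed = []
    # Shared conditions for SNVs and indels.
    if features["ConservativeGenotypeQuality"] < 15:
        failed.append("ConservativeGenotypeQuality")
    if features["RelativeTotalLocusDepth"] > 3:
        failed.append("RelativeTotalLocusDepth")
    # SNV-specific condition.
    if is_snv and features["SampleStrandBias"] > 10:
        failed.append("SampleStrandBias")
    return failed
```

Returning every triggered feature, rather than stopping at the first, makes it easy to see which thresholds drive filtering on a given dataset.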

LowDepth: This filter is applied to all germline and somatic calls. For germline site calls, read depths are calculated from base calls used for site genotyping. For germline indel loci, read depths are taken from the depths of the sites preceding the indels. If read depths are below 3, genotype calls are filtered out and tagged as LowDepth. For variant calls in particular, allelic depths for the reference and alternative alleles are additionally considered. If the sum of the allelic depths is below 3, the associated variant calls are supplemented with a LowDepth filter. For somatic variants, tumor sample read depths are calculated for SNVs and indels in a similar way. If tumor depths are below 2, associated somatic variant calls are supplemented with a LowDepth filter.

NormalSampleRelativeTotalLocusDepth: This filter is applied to all somatic calls: a variant is filtered if this value is greater than 3.

Germline and RNA-seq feature descriptions

GenotypeCategory: A category variable reflecting the most likely genotype as heterozygous (0), homozygous (1) or alt-heterozygous (2).

SampleRMSMappingQuality: RMS mapping quality of all reads spanning the variant in one sample. This feature matches SAMPLE/MQ in the VCF spec.

SiteHomopolymerLength: Length of the longest homopolymer containing the current position if this position can be treated as any base.

InterruptedHomopolymerLength: One less than the length of the longest interrupted homopolymer in the reference sequence containing the current position. An interrupted homopolymer is a string that has edit distance 1 to a homopolymer.

SampleStrandBias: Log ratio of the sample's genotype likelihood computed assuming the alternate allele occurs on only one strand vs both strands (thus positive values indicate bias).

SampleRMSMappingQualityRankSum: Z-score of Mann-Whitney U test for reference vs alternate allele mapping quality values in one sample.

SampleReadPosRankSum: Z-score of Mann-Whitney U test for reference vs alternate allele read positions in one sample.

RelativeTotalLocusDepth: Locus depth relative to expectation: this is the ratio of total read depth at the variant locus in all samples over the total expected depth in all samples. Depth at the variant locus includes reads at any mapping quality. Expected depth is taken from the preliminary depth estimation step. This value is set to 1 in exome and targeted analyses, because it is problematic to define expected depth in this case.

SampleUsedDepthFraction: The ratio of reads used to genotype the locus over the total number of reads at the variant locus in one sample. Reads are not used if the mapping quality is less than the minimum threshold, if the local read alignment fails the mismatch density filter or if the basecall is ambiguous.

ConservativeGenotypeQuality: The model-based ConservativeGenotypeQuality (GQX) value for one sample, reflecting the conservative confidence of the called genotype.

NormalizedAltHaplotypeCountRatio: For variants in an active region, the proportion of reads supporting the top 2 haplotypes, or 0 if haplotyping failed due to this proportion being below threshold. For heterozygous variants with only one non-reference allele, the proportion is doubled so that its value is expected to be close to 1.0 regardless of genotype. The feature is set to -1 for variants not in an active region.

SampleIndelRepeatCount: The number of times the primary indel allele's repeat unit occurs in a haplotype containing the indel allele. The primary indel allele's repeat unit is the smallest possible sequence such that the inserted/deleted sequence can be formed by concatenating multiple copies of it. The primary indel allele is the best supported allele among all overlapping indel alleles at the locus of interest in one sample.

SampleIndelRepeatUnitSize: Length of the primary indel allele's repeat unit, as defined for feature SampleIndelRepeatCount.

SampleIndelAlleleBiasLower: The negative log probability of seeing N or fewer observations of one allele in a heterozygous variant out of the total observations from both alleles in one sample. N is typically the observation count of the reference allele. If the heterozygous variant does not include the reference allele, the first indel allele is used instead.

SampleIndelAlleleBias: Similar to SampleIndelAlleleBiasLower, except the count used is twice the count of the least frequently observed allele.

SampleProxyRMSMappingQuality: RMS mapping quality of all reads spanning the position immediately preceding the indel in one sample. This feature approximates the SAMPLE/MQ value defined in the VCF spec.

SamplePrimaryAltAlleleDepthFraction: The ratio of the confident observation count of the best-supported non-reference allele at the variant locus, over all confident allele observation counts in one sample.

ContextCompressability: The length of the upstream or downstream reference context (whichever is greater) that can be represented using 5 Ziv-Lempel keywords [7,8]. The Ziv-Lempel keywords are obtained using the scheme of Ziv and Lempel [8], by traversing the sequence and successively selecting the shortest subsequence that has not yet been encountered.

IndelCategory: A binary variable set to 1 if the indel allele is a primitive deletion or 0 otherwise.

SamplePrimaryAltAlleleDepth: The confident observation count of the best-supported non-reference allele at the variant locus.

VariantAlleleQuality: The model-based variant quality value reflecting confidence that the called variant is present in at least one sample, regardless of genotype. This feature matches QUAL in the VCF spec.

SampleMeanDistanceFromReadEdge: For all non-reference basecall observations in one sample at a candidate SNV site, report the mean distance to the closest edge of each alternate basecall's read. Distance is measured in read coordinates, zero-indexed, and is allowed to have a maximum value of 20.

SampleRefAlleleDepth: The confident observation count of the reference allele at the variant locus.

SampleIndelMeanDistanceFromReadEdge: For all indel allele observations in one sample at a candidate indel locus, report the mean distance to the closest edge of each indel allele's read. Distance is measured in read coordinates, zero-indexed, and is allowed to have a maximum value of 20. The left or right side of the indel may be used to provide the shortest distance, but the indel will only be considered in its left-aligned position.

SampleRefRepeatCount: The number of times the primary indel allele's repeat unit occurs in the reference sequence.

Somatic feature descriptions

Note that for somatic features "all samples" refers to the tumor and matched normal samples together.

SomaticSNVQualityAndHomRefGermlineGenotype: Posterior probability of a somatic SNV conditioned on a homozygous reference germline genotype. When INFO/NT is "ref", this feature matches INFO/QSS_NT in the VCF output.

NormalSampleRelativeTotalLocusDepth: This feature matches the germline RelativeTotalLocusDepth feature, except that it reflects the depth of only the matched normal sample.

TumorSampleAltAlleleFraction: Fraction of the tumor sample's observations which are not the reference allele. This is restricted to a maximum of 0.5 to prevent the model from overtraining against high somatic allele frequencies (these might be common e.g. for loss of heterozygosity regions from liquid tumors).

RMSMappingQuality: Root mean square read mapping quality of all reads spanning the variant in all samples. This feature matches INFO/MQ in the VCF spec.

ZeroMappingQualityFraction: Fraction of read mapping qualities equal to zero, for all reads spanning the variant in all samples.

InterruptedHomopolymerLength: One less than the length of the longest interrupted homopolymer in the reference sequence containing the current position. An interrupted homopolymer is a string that has edit distance 1 to a homopolymer.

TumorSampleStrandBias: Log ratio of the tumor-sample somatic allele likelihood computed assuming the somatic allele occurs on only one strand vs both strands (thus higher values indicate greater bias).

TumorSampleReadPosRankSum: Z-score of Mann-Whitney U test for reference vs non-reference allele read positions in the tumor sample's observations.

AlleleCountLogOddsRatio: The log odds ratio of allele counts $\log \frac{r_t a_n}{r_n a_t}$, given reference $(r_t, r_n)$ and non-reference $(a_t, a_n)$ allele counts for the tumor and normal sample pair.

NormalSampleFilteredDepthFraction: The fraction of reads that were filtered out of the normal sample before calling the variant locus.

TumorSampleFilteredDepthFraction: The fraction of reads that were filtered out of the tumor sample before calling the variant locus.

SomaticIndelQualityAndHomRefGermlineGenotype: Posterior probability of a somatic indel conditioned on a homozygous reference germline genotype. When INFO/NT is "ref", this feature matches INFO/QSI_NT in the VCF output.

TumorSampleLogSymmetricStrandOddsRatio: Log of the symmetric strand odds ratio of allele counts $\log \left( \frac{r_{fwd} a_{rev}}{r_{rev} a_{fwd}} + \frac{r_{rev} a_{fwd}}{r_{fwd} a_{rev}} \right)$, given reference $(r_{fwd}, r_{rev})$ and non-reference $(a_{fwd}, a_{rev})$ confident counts of the tumor sample's observations.

RepeatUnitLength: The length of the somatic indel allele's repeat unit. The repeat unit is the smallest possible sequence such that the inserted/deleted sequence can be formed by concatenating multiple copies of it.

IndelRepeatCount: The number of times the somatic indel allele's repeat unit occurs in a haplotype containing the indel allele.

RefRepeatCount: The number of times the somatic indel allele's repeat unit occurs in the reference sequence.

TumorSampleIndelNoiseLogOdds: Log ratio of the frequency of the candidate indel vs all other indels at the same locus in the tumor sample. The frequencies are computed from reads which confidently support a single allele at the locus.

TumorNormalIndelAlleleLogOdds: Log ratio of the frequency of the candidate indel in the tumor vs normal samples. The frequencies are computed from reads which confidently support a single allele at the locus.

SiteFilteredBasecallFrac: The maximum value over all samples of SampleSiteFilteredBasecallFrac, which is the fraction of basecalls at a site which have been removed by the mismatch density filter in a given sample.

IndelWindowFilteredBasecallFrac: The maximum value over all samples of SampleIndelWindowFilteredBasecallFrac, which is the fraction of basecalls in a window extending 50 bases to each side of the candidate indel's call position which have been removed by the mismatch density filter in a given sample.

SpanningDeletionFraction: The maximum value over all samples of SampleSpanningDeletionFraction, which is the fraction of reads crossing a candidate SNV site with spanning deletions in a given sample.
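A few of the feature definitions above are simple enough to sketch directly. The code below is an illustrative reading of those definitions, not code from Strelka2: the function names are hypothetical, the logarithms are assumed to be natural logs, the pseudocount that guards against zero counts is an assumption, and the keyword parser implements one plausible interpretation of "the shortest subsequence that has not yet been encountered".

```python
import math


def repeat_unit(seq):
    """Smallest sequence whose concatenated copies reproduce seq."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return seq


def ziv_lempel_keywords(seq):
    """Traverse seq left to right, closing a keyword as soon as the current
    substring is one that has not been seen as a keyword before.

    A trailing fragment that only repeats an earlier keyword is dropped
    (an assumption of this sketch).
    """
    seen, keywords, start = set(), [], 0
    for end in range(1, len(seq) + 1):
        word = seq[start:end]
        if word not in seen:
            seen.add(word)
            keywords.append(word)
            start = end
    return keywords


def allele_count_log_odds_ratio(r_t, r_n, a_t, a_n, pseudocount=0.5):
    """log(r_t * a_n / (r_n * a_t)) for tumor (t) and normal (n) counts."""
    r_t, r_n, a_t, a_n = (c + pseudocount for c in (r_t, r_n, a_t, a_n))
    return math.log((r_t * a_n) / (r_n * a_t))


def log_symmetric_strand_odds_ratio(r_fwd, r_rev, a_fwd, a_rev, pseudocount=0.5):
    """log(x + 1/x) with x = r_fwd * a_rev / (r_rev * a_fwd)."""
    r_fwd, r_rev, a_fwd, a_rev = (c + pseudocount for c in (r_fwd, r_rev, a_fwd, a_rev))
    ratio = (r_fwd * a_rev) / (r_rev * a_fwd)
    return math.log(ratio + 1.0 / ratio)
```

For instance, `repeat_unit("ACACAC")` gives `"AC"`, so RepeatUnitLength would be 2 for that inserted or deleted sequence, and a perfectly strand-balanced site gives a symmetric strand odds ratio of log 2, its minimum value.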

References

1. Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: Fast processing of NGS alignment formats. Bioinformatics 31 (2015).
2. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 (2009).
3. Van der Auwera, G. A. et al. From FastQ data to high-confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43 (2013).
4. Pedregosa, F. & Varoquaux, G. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011).
5. Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27 (2017).
6. Krusche, P. Haplotype Comparison Tools.
7. Lesne, A., Blanc, J. L. & Pezard, L. Entropy estimation of very short symbolic sequences. Phys. Rev. E - Stat. Nonlinear, Soft Matter Phys. 79 (2009).
8. Ziv, J. & Lempel, A. A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory 23 (1977).


More information

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014 Single Nucleotide Variant Analysis H3ABioNet May 14, 2014 Outline What are SNPs and SNVs? How do we identify them? How do we call them? SAMTools GATK VCF File Format Let s call variants! Single Nucleotide

More information

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es SNP calling Jose Blanca COMAV institute bioinf.comav.upv.es SNP calling Genotype matrix Genotype matrix: Samples x SNPs SNPs and errors A change in a read may due to: Sample contamination Cloning or PCR

More information

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte BICF Variant Analysis Tools Using the BioHPC Workflow Launching Tool Astrocyte Prioritization of Variants SNP INDEL SV Astrocyte BioHPC Workflow Platform Allows groups to give easy-access to their analysis

More information

Assignment 9: Genetic Variation

Assignment 9: Genetic Variation Assignment 9: Genetic Variation Due Date: Friday, March 30 th, 2018, 10 am In this assignment, you will profile genome variation information and attempt to answer biologically relevant questions. The variant

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION doi:10.1038/nature24473 1. Supplementary Information Computational identification of neoantigens Neoantigens from the three datasets were inferred using a consistent pipeline

More information

HiSeq Whole Exome Sequencing Report. BGI Co., Ltd.

HiSeq Whole Exome Sequencing Report. BGI Co., Ltd. HiSeq Whole Exome Sequencing Report BGI Co., Ltd. Friday, 11th Nov., 2016 Table of Contents Results 1 Data Production 2 Summary Statistics of Alignment on Target Regions 3 Data Quality Control 4 SNP Results

More information

MPG NGS workshop I: SNP calling

MPG NGS workshop I: SNP calling MPG NGS workshop I: SNP calling Mark DePristo Manager, Medical and Popula

More information

Analytics Behind Genomic Testing

Analytics Behind Genomic Testing A Quick Guide to the Analytics Behind Genomic Testing Elaine Gee, PhD Director, Bioinformatics ARUP Laboratories 1 Learning Objectives Catalogue various types of bioinformatics analyses that support clinical

More information

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer Project XX Customer Detail Table of Contents. Bioinformatics analysis pipeline...3.. Read quality check. 3.2. Read alignment...3.3.

More information

The Sentieon Genomics Tools A fast and accurate solution to variant calling from next-generation sequence data

The Sentieon Genomics Tools A fast and accurate solution to variant calling from next-generation sequence data The Sentieon Genomics Tools A fast and accurate solution to variant calling from next-generation sequence data Donald Freed 1*, Rafael Aldana 1, Jessica A. Weber 2, Jeremy S. Edwards 3,4,5 1 Sentieon Inc,

More information

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary Executive Summary Helix is a personal genomics platform company with a simple but powerful mission: to empower every person to improve their life through DNA. Our platform includes saliva sample collection,

More information

Germline variant calling and joint genotyping

Germline variant calling and joint genotyping talks Germline variant calling and joint genotyping Applying the joint discovery workflow with HaplotypeCaller + GenotypeGVCFs You are here in the GATK Best PracDces workflow for germline variant discovery

More information

Mapping errors require re- alignment

Mapping errors require re- alignment RE- ALIGNMENT Mapping errors require re- alignment Source: Heng Li, presenta8on at GSA workshop 2011 Alignment Key component of alignment algorithm is the scoring nega8ve contribu8on to score opening a

More information

Supplementary Text for Manta: Rapid detection of structural variants and indels for clinical sequencing applications.

Supplementary Text for Manta: Rapid detection of structural variants and indels for clinical sequencing applications. Supplementary Text for Manta: Rapid detection of structural variants and indels for clinical sequencing applications. Xiaoyu Chen, Ole Schulz-Trieglaff, Richard Shaw, Bret Barnes, Felix Schlesinger, Anthony

More information

Read Mapping and Variant Calling. Johannes Starlinger

Read Mapping and Variant Calling. Johannes Starlinger Read Mapping and Variant Calling Johannes Starlinger Application Scenario: Personalized Cancer Therapy Different mutations require different therapy Collins, Meredith A., and Marina Pasca di Magliano.

More information

Dipping into Guacamole. Tim O Donnell & Ryan Williams NYC Big Data Genetics Meetup Aug 11, 2016

Dipping into Guacamole. Tim O Donnell & Ryan Williams NYC Big Data Genetics Meetup Aug 11, 2016 Dipping into uacamole Tim O Donnell & Ryan Williams NYC Big Data enetics Meetup ug 11, 2016 Who we are: Hammer Lab Computational lab in the department of enetics and enomic Sciences at Mount Sinai Principal

More information

Supplementary information ATLAS

Supplementary information ATLAS Supplementary information ATLAS Vivian Link, Athanasios Kousathanas, Krishna Veeramah, Christian Sell, Amelie Scheu and Daniel Wegmann Section 1: Complete list of functionalities Sequence data processing

More information

G E N OM I C S S E RV I C ES

G E N OM I C S S E RV I C ES GENOMICS SERVICES ABOUT T H E N E W YOR K G E NOM E C E N T E R NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. Through

More information

Supplementary Materials for

Supplementary Materials for www.sciencetranslationalmedicine.org/cgi/content/full/10/457/eaar7939/dc1 Supplementary Materials for A machine learning approach for somatic mutation discovery Derrick E. Wood, James R. White, Andrew

More information

Strand NGS Variant Caller

Strand NGS Variant Caller STRAND LIFE SCIENCES WHITE PAPER Strand NGS Variant Caller A Benchmarking Study Rohit Gupta, Pallavi Gupta, Aishwarya Narayanan, Somak Aditya, Shanmukh Katragadda, Vamsi Veeramachaneni, and Ramesh Hariharan

More information

QIAseq Targeted Panel Analysis Plugin USER MANUAL

QIAseq Targeted Panel Analysis Plugin USER MANUAL QIAseq Targeted Panel Analysis Plugin USER MANUAL User manual for QIAseq Targeted Panel Analysis 1.1 Windows, macos and Linux June 18, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej

More information

Processing Ion AmpliSeq Data using NextGENe Software v2.3.0

Processing Ion AmpliSeq Data using NextGENe Software v2.3.0 Processing Ion AmpliSeq Data using NextGENe Software v2.3.0 July 2012 John McGuigan, Megan Manion, Kevin LeVan, CS Jonathan Liu Introduction The Ion AmpliSeq Panels use highly multiplexed PCR in order

More information

1. Detailed SomaticSeq Results Table 1 summarizes what training data were used for each study in the paper.

1. Detailed SomaticSeq Results Table 1 summarizes what training data were used for each study in the paper. 1. Detailed SomaticSeq Results Table 1 summarizes what training data were used for each study in the paper. 1.1. DREAM Challenge. The cross validation results for DREAM Challenge are presented in Table

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature26136 We reexamined the available whole data from different cave and surface populations (McGaugh et al, unpublished) to investigate whether insra exhibited any indication that it has

More information

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl Next Generation Sequencing Bioinformatics small variants Data Analysis Guidelines genomescan.nl GenomeScan s Guidelines for Small Variant Analysis on NGS Data Using our own proprietary data analysis pipelines

More information

Normal-Tumor Comparison using Next-Generation Sequencing Data

Normal-Tumor Comparison using Next-Generation Sequencing Data Normal-Tumor Comparison using Next-Generation Sequencing Data Chun Li Vanderbilt University Taichung, March 16, 2011 Next-Generation Sequencing First-generation (Sanger sequencing): 115 kb per day per

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Contents De novo assembly... 2 Assembly statistics for all 150 individuals... 2 HHV6b integration... 2 Comparison of assemblers... 4 Variant calling and genotyping... 4 Protein truncating variants (PTV)...

More information

2017 HTS-CSRS COMMUNITY PUBLIC WORKSHOP

2017 HTS-CSRS COMMUNITY PUBLIC WORKSHOP 2017 HTS-CSRS COMMUNITY PUBLIC WORKSHOP GenomeNext Overview Olympus Platform The Olympus Platform provides a continuous workflow and data management solution from the sequencing instrument through analysis,

More information

Supplementary Information

Supplementary Information Supplementary Figures Supplementary Information Supplementary Figure 1: DeepVariant accuracy, consistency, and calibration relative to the GATK. Panel A Panel B Panel C (A) Precision-recall plot for DeepVariant

More information

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity Supplementary Figure 1 Read Complexity A) Density plot showing the percentage of read length masked by the dust program, which identifies low-complexity sequence (simple repeats). Scrappie outputs a significantly

More information

White Paper GENALICE MAP: Variant Calling in a Matter of Minutes. Bas Tolhuis, PhD - GENALICE B.V.

White Paper GENALICE MAP: Variant Calling in a Matter of Minutes. Bas Tolhuis, PhD - GENALICE B.V. White Paper GENALICE MAP: Variant Calling in a Matter of Minutes Bas Tolhuis, PhD - GENALICE B.V. White Paper GENALICE MAP Variant Calling GENALICE BV May 2014 White Paper GENALICE MAP Variant Calling

More information

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data BST227 Introduction to Statistical Genetics Lecture 8: Variant calling from high-throughput sequencing data 1 PC recap typical genome Differs from the reference genome at 4-5 million sites ~85% SNPs ~15%

More information

Supplementary Figures and Data

Supplementary Figures and Data Supplementary Figures and Data Whole Exome Screening Identifies Novel and Recurrent WISP3 Mutations Causing Progressive Pseudorheumatoid Dysplasia in Jammu and Kashmir India Ekta Rai 1, Ankit Mahajan 2,

More information

Course Presentation. Ignacio Medina Presentation

Course Presentation. Ignacio Medina Presentation Course Index Introduction Agenda Analysis pipeline Some considerations Introduction Who we are Teachers: Marta Bleda: Computational Biologist and Data Analyst at Department of Medicine, Addenbrooke's Hospital

More information

Published online 15 May 2014 Nucleic Acids Research, 2014, Vol. 42, No. 12 e101 doi: /nar/gku392

Published online 15 May 2014 Nucleic Acids Research, 2014, Vol. 42, No. 12 e101 doi: /nar/gku392 Published online 15 May 2014 Nucleic Acids Research, 2014, Vol. 42, No. 12 e101 doi: 10.1093/nar/gku392 Performance comparison of SNP detection tools with illumina exome sequencing data an assessment using

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Software Version 2.4 Related Documents... 1 Changes to Version 2.4... 2 Changes to Version 2.2... 4 Changes to Version 2.0... 5 Changes

More information

Quality assurance in NGS (diagnostics)

Quality assurance in NGS (diagnostics) Quality assurance in NGS (diagnostics) Chris Mattocks National Genetics Reference Laboratory (Wessex) Research Diagnostics Quality assurance Any systematic process of checking to see whether a product

More information

Setting Standards and Raising Quality for Clinical Bioinformatics. Joo Wook Ahn, Guy s & St Thomas 04/07/ ACGS summer scientific meeting

Setting Standards and Raising Quality for Clinical Bioinformatics. Joo Wook Ahn, Guy s & St Thomas 04/07/ ACGS summer scientific meeting Setting Standards and Raising Quality for Clinical Bioinformatics Joo Wook Ahn, Guy s & St Thomas 04/07/2016 - ACGS summer scientific meeting 1. Best Practice Guidelines Draft guidelines circulated to

More information

From raw reads to variants

From raw reads to variants From raw reads to variants Sebastian DiLorenzo Sebastian.DiLorenzo@NBIS.se Talk Overview Concepts Reference genome Variants Paired-end data NGS Workflow Quality control & Trimming Alignment Local realignment

More information

Novel Variant Discovery Tutorial

Novel Variant Discovery Tutorial Novel Variant Discovery Tutorial Release 8.4.0 Golden Helix, Inc. August 12, 2015 Contents Requirements 2 Download Annotation Data Sources...................................... 2 1. Overview...................................................

More information

Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls

Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls The Harvard community has made this article openly available. Please share how this access benefits you.

More information

Welcome to the NGS webinar series

Welcome to the NGS webinar series Welcome to the NGS webinar series Webinar 1 NGS: Introduction to technology, and applications NGS Technology Webinar 2 Targeted NGS for Cancer Research NGS in cancer Webinar 3 NGS: Data analysis for genetic

More information

OHSU Digital Commons. Oregon Health & Science University. Benjamin Cordier. Scholar Archive

OHSU Digital Commons. Oregon Health & Science University. Benjamin Cordier. Scholar Archive Oregon Health & Science University OHSU Digital Commons Scholar Archive 5-19-2017 Evaluation Of Background Prediction For Variant Detection In A Clinical Context: Towards Improved Ngs Monitoring Of Minimal

More information

Bulked Segregant Analysis For Fine Mapping Of Genes. Cheng Zou, Qi Sun Bioinformatics Facility Cornell University

Bulked Segregant Analysis For Fine Mapping Of Genes. Cheng Zou, Qi Sun Bioinformatics Facility Cornell University Bulked Segregant Analysis For Fine Mapping Of enes heng Zou, Qi Sun Bioinformatics Facility ornell University Outline What is BSA? Keys for a successful BSA study Pipeline of BSA extended reading ompare

More information

Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing (HaploSeq)

Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing (HaploSeq) Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing (HaploSeq) Lyon Lab Journal Clubs Han Fang 01/28/2014 Lyon Lab Journal Clubs 1 Lyon Lab Journal Clubs 2 Existing genome

More information

Introduction to RNA-Seq in GeneSpring NGS Software

Introduction to RNA-Seq in GeneSpring NGS Software Introduction to RNA-Seq in GeneSpring NGS Software Dipa Roy Choudhury, Ph.D. Strand Scientific Intelligence and Agilent Technologies Learn more at www.genespring.com Introduction to RNA-Seq In a few years,

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Software Version 2.2 Related Documents... 1 Changes to Version 2.2... 2 Changes to Version 2.0... 3 Changes to Version 1.12.0... 11

More information

Next-Generation Sequencing. Technologies

Next-Generation Sequencing. Technologies Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

More information

Genomic resources. for non-model systems

Genomic resources. for non-model systems Genomic resources for non-model systems 1 Genomic resources Whole genome sequencing reference genome sequence comparisons across species identify signatures of natural selection population-level resequencing

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Version 2.0 Related Documents... 1 Changes to Version 2.0... 2 Changes to Version 1.12.0... 10 Changes to Version 1.11.0... 12 Changes

More information

Discretized Gaussian Mixture for Genotyping of microsatellite loci containing homopolymer runs

Discretized Gaussian Mixture for Genotyping of microsatellite loci containing homopolymer runs Bioinformatics Advance Access published October 17, 2013 Sequence analysis Discretized Gaussian Mixture for Genotyping of microsatellite loci containing homopolymer runs Hongseok Tae 1, Dong-Yun Kim 2,

More information

CNV and variant detection for human genome resequencing data - for biomedical researchers (II)

CNV and variant detection for human genome resequencing data - for biomedical researchers (II) CNV and variant detection for human genome resequencing data - for biomedical researchers (II) Chuan-Kun Liu 劉傳崑 Senior Maneger National Center for Genome Medican bioit@ncgm.sinica.edu.tw Abstract Common

More information

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts.

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts. Supplementary Figure 1 H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts. (a) Schematic of chromatin contacts captured in H3K27ac HiChIP. (b) Loop call overlap for cohesin HiChIP

More information

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service Dr. Ruth Burton Product Manager Today s agenda Introduction CytoSure arrays and analysis

More information

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics Genomic Technologies Michael Schatz Feb 1, 2018 Lecture 2: Applied Comparative Genomics Welcome! The primary goal of the course is for students to be grounded in theory and leave the course empowered to

More information

Bionano Solve Theory of Operation: Variant Annotation Pipeline

Bionano Solve Theory of Operation: Variant Annotation Pipeline Bionano Solve Theory of Operation: Variant Annotation Pipeline Document Number: 30190 Document Revision: B For Research Use Only. Not for use in diagnostic procedures. Copyright 2018 Bionano Genomics,

More information

Next Generation Sequencing: Data analysis for genetic profiling

Next Generation Sequencing: Data analysis for genetic profiling Next Generation Sequencing: Data analysis for genetic profiling Raed Samara, Ph.D. Global Product Manager Raed.Samara@QIAGEN.com Welcome to the NGS webinar series - 2015 NGS Technology Webinar 1 NGS: Introduction

More information

Bionano Access : Assembly Report Guidelines

Bionano Access : Assembly Report Guidelines Bionano Access : Assembly Report Guidelines Document Number: 30255 Document Revision: A For Research Use Only. Not for use in diagnostic procedures. Copyright 2018 Bionano Genomics Inc. All Rights Reserved

More information

Addressing Challenges of Ancient DNA Sequence Data Obtained with Next Generation Methods

Addressing Challenges of Ancient DNA Sequence Data Obtained with Next Generation Methods DISSERTATION Addressing Challenges of Ancient DNA Sequence Data Obtained with Next Generation Methods submitted in fulfillment of the requirements for the degree Doctorate of natural science doctor rerum

More information

Oral Cleft Targeted Sequencing Project

Oral Cleft Targeted Sequencing Project Oral Cleft Targeted Sequencing Project Oral Cleft Group January, 2013 Contents I Quality Control 3 1 Summary of Multi-Family vcf File, Jan. 11, 2013 3 2 Analysis Group Quality Control (Proposed Protocol)

More information

Supplementary Figures

Supplementary Figures Supplementary Figures A B Supplementary Figure 1. Examples of discrepancies in predicted and validated breakpoint coordinates. A) Most frequently, predicted breakpoints were shifted relative to those derived

More information

Supplementary Material for Extremely low-coverage whole genome sequencing in South Asians captures population genomics information

Supplementary Material for Extremely low-coverage whole genome sequencing in South Asians captures population genomics information Supplementary Material for Extremely low-coverage whole genome sequencing in South Asians captures population genomics information Navin Rustagi, Anbo Zhou, W. Scott Watkins, Erika Gedvilaite, Shuoguo

More information

Whole Human Genome Sequencing Report This is a technical summary report for PG DNA

Whole Human Genome Sequencing Report This is a technical summary report for PG DNA Whole Human Genome Sequencing Report This is a technical summary report for PG0002601-DNA Physician and Patient Information Physician name: Vinodh Naraynan Address: Suite 406 222 West Thomas Road Phoenix

More information

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI Variation detection based on second generation sequencing data Xin LIU Department of Science and Technology, BGI liuxin@genomics.org.cn 2013.11.21 Outline Summary of sequencing techniques Data quality

More information

Runs of Homozygosity Analysis Tutorial

Runs of Homozygosity Analysis Tutorial Runs of Homozygosity Analysis Tutorial Release 8.7.0 Golden Helix, Inc. March 22, 2017 Contents 1. Overview of the Project 2 2. Identify Runs of Homozygosity 6 Illustrative Example...............................................

More information

Shaare Zedek Medical Center (SZMC) Gaucher Clinic. Peripheral blood samples were collected from each

Shaare Zedek Medical Center (SZMC) Gaucher Clinic. Peripheral blood samples were collected from each SUPPLEMENTAL METHODS Sample collection and DNA extraction Pregnant Ashkenazi Jewish (AJ) couples, carrying mutation/s in the GBA gene, were recruited at the Shaare Zedek Medical Center (SZMC) Gaucher Clinic.

More information

Reference genomes and common file formats

Reference genomes and common file formats Reference genomes and common file formats Dóra Bihary MRC Cancer Unit, University of Cambridge CRUK Functional Genomics Workshop September 2017 Overview Reference genomes and GRC Fasta and FastQ (unaligned

More information

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Transcriptomics analysis with RNA seq: an overview Frederik Coppens Transcriptomics analysis with RNA seq: an overview Frederik Coppens Platforms Applications Analysis Quantification RNA content Platforms Platforms Short (few hundred bases) Long reads (multiple kilobases)

More information

Variant detection analysis in the BRCA1/2 genes from Ion torrent PGM data

Variant detection analysis in the BRCA1/2 genes from Ion torrent PGM data Variant detection analysis in the BRCA1/2 genes from Ion torrent PGM data Bruno Zeitouni Bionformatics department of the Institut Curie Inserm U900 Mines ParisTech Ion Torrent User Meeting 2012, October

More information

Almac Diagnostics. NGS Panels: From Patient Selection to CDx. Dr Katarina Wikstrom Head of US Operations Almac Diagnostics

Almac Diagnostics. NGS Panels: From Patient Selection to CDx. Dr Katarina Wikstrom Head of US Operations Almac Diagnostics Almac Diagnostics NGS Panels: From Patient Selection to CDx Dr Katarina Wikstrom Head of US Operations Almac Diagnostics Overview Almac Diagnostics Overview Benefits and Challenges of NGS Panels for Subject

More information

De novo meta-assembly of ultra-deep sequencing data

De novo meta-assembly of ultra-deep sequencing data De novo meta-assembly of ultra-deep sequencing data Hamid Mirebrahim 1, Timothy J. Close 2 and Stefano Lonardi 1 1 Department of Computer Science and Engineering 2 Department of Botany and Plant Sciences

More information

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014 Alignment & Variant Discovery J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Version 1.11.0 Related Documents... 1 Changes to Version 1.11.0... 2 Changes to Version 1.10.0... 6 Changes to Version 1.9.0... 10 Changes

More information

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing:

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing: Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing: Patented, Anti-Correlation Technology Provides 99.5% Accuracy & Sensitivity to 5% Variant Knowledge Base and External Annotation

More information

SNP detection in allopolyploid crops

SNP detection in allopolyploid crops SNP detection in allopolyploid crops using NGS data Abstract Homologous SNP detection in polyploid organisms is complicated due to the presence of subgenome polymorphisms, i.e. homeologous SNPs. Several

More information

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Genome 373: Mapping Short Sequence Reads II. Doug Fowler Genome 373: Mapping Short Sequence Reads II Doug Fowler The final Will be in this room on June 6 th at 8:30a Will be focused on the second half of the course, but will include material from the first half

More information

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009 UHT Sequencing Course Large-scale genotyping Christian Iseli January 2009 Overview Introduction Examples Base calling method and parameters Reads filtering Reads classification Detailed alignment Alignments

More information

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 Alignment J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG

More information

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq. Reads to Discovery RNA-Seq Small DNA-Seq ChIP-Seq Methyl-Seq RNA-Seq MeDIP-Seq www.strand-ngs.com Analyze Visualize Annotate Discover Data Import Alignment Vendor Platforms: Illumina Ion Torrent Roche

More information

Reference genomes and common file formats

Reference genomes and common file formats Reference genomes and common file formats Overview Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features BED (genomic intervals) GFF/GTF

More information

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz Table of Contents Supplementary Note 1: Unique Anchor Filtering Supplementary Figure

More information

GINDEL: Accurate genotype calling of insertions and deletions from low coverage population sequence reads

GINDEL: Accurate genotype calling of insertions and deletions from low coverage population sequence reads Washington University School of Medicine Digital Commons@Becker Open Access Publications 2014 GINDEL: Accurate genotype calling of insertions and deletions from low coverage population sequence reads Chong

More information

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech GALAXY INITIATION A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech How does Next- Gen sequencing work? DNA fragmentation Size selection and clonal amplification Massive parallel sequencing ACCGTTTGCCG

More information

Axiom mydesign Custom Array design guide for human genotyping applications

Axiom mydesign Custom Array design guide for human genotyping applications TECHNICAL NOTE Axiom mydesign Custom Genotyping Arrays Axiom mydesign Custom Array design guide for human genotyping applications Overview In the past, custom genotyping arrays were expensive, required

More information