White Paper GENALICE MAP: Variant Calling in a Matter of Minutes. Bas Tolhuis, PhD - GENALICE B.V.

White Paper GENALICE MAP Variant Calling, GENALICE BV, May 2014

Index

Abstract
Introduction
    Size Matters
    GENALICE MAP Variant Caller: A High Level Overview
    Summary
From FASTQ to VCF
    More Than a Hundred Times Faster
    Great Usability
Variant Calling with Better Accuracy
    Benchmarking with SNP Microarray Standard
    Comparison with Genome In A Bottle (GIB) Data
    GCAT: A Comparison in the Public Domain
Conclusions
    Revolutionary Fast Alignment and Variant Calling
    Highly Accurate Variant Detection
    What Happens Next?
Materials and Methods
    Compute Resources
    Data Sets
    External Software
Literature References
More Information

Abstract

Processing Next Generation Sequencing (NGS) data is complicated and time-consuming because of the large volume and complexity of this type of data. Two essential and tedious processing steps are read alignment and variant calling. Recently, GENALICE introduced a new high performance short read alignment solution called GENALICE MAP [Tolhuis 2013]. In this white paper, we present GENALICE MAP's high performance variant caller. Using high coverage whole genome sequencing data, we observed that GENALICE MAP processes raw input sequence reads into high quality variants more than a hundred times faster than BWA-MEM/GATK. To do so, it uses only two processing steps: read alignment and variant calling. Three validation studies show that GENALICE MAP has better accuracy, excellent sensitivity and specificity, excels on lower quality NGS data, and detects more high quality variants. With GENALICE MAP you can transform your NGS data from FASTQ to VCF in minutes rather than days.

1.0 Introduction

In this white paper GENALICE presents the beta version of GENALICE MAP's variant caller. Together with GENALICE MAP's short read aligner, the variant caller provides a revolutionary acceleration in Next Generation Sequencing (NGS) data processing. GENALICE MAP delivers high quality variants. It outperforms other NGS data processing tools in accuracy, excels on lower quality NGS data and will provide new insights into NGS data.

1.1 Size Matters

An important aspect of GENALICE MAP's speed gain is the underlying data structure of the proprietary GENALICE Aligned Reads (GAR) format. This novel file format has a very small data storage footprint. For human whole genome sequencing with 37x coverage depth, GAR is approximately 75 times smaller than unaligned reads stored in FASTQ format and about 25 times smaller than an aligned reads collection in BAM format. Unlike other aligned read storage formats (e.g. BAM, CRAM), the GAR file is not compressed, but encoded. Finally, the GAR format stores aligned sequence reads sorted by genome coordinate while maintaining read pair information, which enables realignment directly from GAR without reversion to FASTQ. The GAR's encoding and reduced file size enable high performance writing and reading of aligned reads and minimize time-consuming data traffic. Moreover, the GAR file's "sorted storage" feature avoids time-consuming preprocessing before variant calling.

1.2 GENALICE MAP Variant Caller: A High Level Overview

In common with the GENALICE MAP aligner, the variant caller is designed to optimize CPU usage [Karten 2014], achieving its performance on commodity hardware (Paragraph 5.1). The main features of the variant caller are listed below.

Feature List for the GENALICE MAP Variant Caller

- As input it uses aligned sequence reads stored in GAR file format.
- A single parameter setting can remove potential PCR duplicates on the fly before variant calling. Hence, there is no added time delay or storage consumption, and no intermediate aligned read files are produced;
- A Joined Frequency Weight mechanism matches and combines variant loci that are closely spaced and represent an identical variant. This mechanism serves the same purpose as local indel realignment in other variant calling software solutions;
- A statistical framework discriminates variants from noise and errors;
- Extra parameter settings can be added to filter out low quality variants, resulting in high quality variants;
- Variants are stored in VCF format (v4.1) and can be used directly by downstream tools (e.g. annotation).

1.3 Summary

This white paper demonstrates the revolutionary change in alignment and variant calling speed that is achieved by GENALICE MAP (Chapter 2). The simple workflow and small storage requirements of the GAR format underpin the usability of GENALICE MAP. In addition, this white paper focuses on the accuracy of the variants detected by GENALICE MAP (Chapter 3). In short, GENALICE MAP is extremely fast, easy to use and highly accurate.

2.0 From FASTQ to VCF

An NGS data processing workflow transforms raw sequence reads into high quality sequence variants. Raw sequence reads are often stored in FASTQ format and identified variants in VCF format. To demonstrate the acceleration in FASTQ to VCF transformation achieved by GENALICE MAP, its alignment and variant calling speed was compared to that of a workflow consisting of several open source software packages (Figure 1), including BWA-MEM (version 0.7.8) and GATK (version 3.1). Both workflows processed the same set of raw sequence reads. This data set is a whole genome sequencing (WGS) data set of a single human sample (NA12878 from Illumina's Platinum Genomes samples). It was sequenced at 50x coverage depth using paired end technology and, as such, consists of almost 790 million read pairs with a read length of 101 bases per read. To allow a fair comparison of processing times, the GENALICE MAP and BWA-MEM/GATK workflows were both run on an identical hardware configuration (see Paragraph 5.1).

Figure 1. Next Generation Sequencing Workflows. (A) The GENALICE MAP workflow consists of two processing steps and produces two output files: GENALICE Aligned Reads (GAR) and Variant Call Format (VCF). Yet, it incorporates all key functionalities of alignment and variant calling. (B) The BWA-MEM/GATK workflow uses six processing steps and four software packages (BWA-MEM v0.7.8; SAMtools v0.1.19; Picard v1.83; and GATK v3.1) and produces 5 output files.
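As a rough plausibility check on the data set description above, the stated read count and read length can be turned into an approximate coverage depth. This is only a sketch: the genome size used below is an assumed round figure for human build 37 and does not come from the white paper.

```python
# Rough sanity check of the stated coverage depth.
# GENOME_SIZE_BASES is an assumption (approximate length of the human
# build 37 reference), not a figure taken from the white paper.
READ_PAIRS = 790_000_000      # "almost 790 million read pairs"
READS_PER_PAIR = 2            # paired end sequencing
READ_LENGTH = 101             # bases per read
GENOME_SIZE_BASES = 3.1e9     # assumed reference length


def mean_coverage(read_pairs, read_length, genome_size):
    """Average number of sequenced bases per reference base."""
    total_bases = read_pairs * READS_PER_PAIR * read_length
    return total_bases / genome_size


coverage = mean_coverage(READ_PAIRS, READ_LENGTH, GENOME_SIZE_BASES)
print(f"approximate mean coverage: {coverage:.1f}x")
```

Under these assumptions the result lands close to the 50x quoted for the sample; the exact value depends on the reference length used and on how many reads actually align.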

2.1 More Than a Hundred Times Faster

GENALICE MAP

On this 50x coverage human WGS data set, GENALICE MAP requires less than 45 minutes to process raw sequence reads into high quality variants (Table 1). Aligning raw sequence reads is the most time-consuming step, with more than 36 minutes processing time. Raw sequence reads were streamed from a storage server across the network into the alignment server to avoid tedious copying of large FASTQ files (> 380GB for this data set). Variant calling requires a little over 8 minutes, which includes the key steps for retrieving high quality variants (Table 1, Figure 1), namely: PCR duplicate marking, Joined Frequency Weight consolidation (i.e. local indel realignment), variant calling using a statistical framework and removal of low quality variants.

Table 1. Turnaround time of GENALICE MAP workflow and storage footprint of output files

Program     Processes            Time (hh:mm:ss)   File Size (type)
gamap       Align reads          00:36:17          5.4GB (GAR)
            Write GAR
gavariant   Mark duplicates      00:08:09          0.5GB (VCF)
            Indel realignment
            Variant calling
            Filter low quality
Total       -                    00:44:26          5.9GB

BWA-MEM/GATK

The BWA-MEM/GATK workflow is far more time-consuming and requires approximately 79 hours to transform FASTQ into VCF (Table 2). An NFS mount was used to connect the alignment server to the storage server to avoid copying the large input FASTQ files. BWA-MEM aligned reads were piped into SAMtools to store them in BAM format. This BAM file was further processed (sorted, indexed and PCR duplicates marked) before variants were called using GATK's HaplotypeCaller. Finally, variants discovered by the HaplotypeCaller were refined using GATK's variant recalibration to obtain high quality variants.

Table 2. Turnaround time of BWA-MEM/GATK workflow and storage footprint of output files

Program             Processes               Time (hh:mm:ss)   File Size (type)
BWA-MEM, SAMtools   Align reads, Write BAM  20:42:46          113GB (BAM)
SAMtools            Sort reads              10:43:56          113GB (BAM)
SAMtools            Index BAM               00:30:56          8MB (BAI)
Picard              Mark duplicates         12:15:08          113GB (BAM)
GATK                HaplotypeCaller         34:13:00          1GB (VCF)
GATK                Variant recalibration   00:35:02          1GB (VCF)
Total               -                       79:00:48          341GB

What Makes GENALICE MAP So Fast?

This comparative analysis shows that GENALICE MAP reduces total processing time more than a hundredfold compared to BWA-MEM/GATK. How is this tremendous acceleration achieved on commodity hardware (Paragraph 5.1)? The design of GENALICE MAP optimizes CPU usage and reduces time-consuming data traffic [Karten 2014]. The proprietary, underlying data structure of the GAR format further contributes to this speed gain by minimizing read and write times. The GATK software and BAM file format lack these two features.

2.2 Great Usability

GENALICE MAP has a short and simple workflow, consisting of only two processing steps: alignment and variant calling (Figure 1A). In sharp contrast, the BWA-MEM/GATK workflow is lengthy, complex and multi-staged, involving several software packages (Figure 1B).

For this data set, GENALICE MAP alignment is more than 50 times faster than BWA-MEM. The output of alignment is a coordinate sorted GAR file, which has a small storage footprint (Table 1). BWA-MEM aligned reads are stored in the BAM format, which is about 20 times larger than the GAR (Table 2). These reads still need to be sorted (Figure 1), which is time-consuming and generates a second BAM file of similar size. Moreover, downstream analysis software requires an index to access this large BAM file, which adds another processing step and more time to the BWA-MEM workflow.

GENALICE MAP variant calling is over 350 times faster than GATK for this 50x coverage human WGS data. GENALICE MAP variant calling includes removal of potential PCR duplicates, joining and matching closely spaced variants (i.e. similar to local indel realignment), a statistical framework to call variants and filtering of low quality variants (Figure 1). Discovered variants are stored in VCF format (version 4.1), which is compatible with downstream VCF processing tools. GATK uses a three step approach. PCR duplicates are marked in the BAM file, resulting in another BAM file. This final BAM file serves as input for the genotyper (HaplotypeCaller of GATK), which performs de novo assembly around call sites and uses Bayesian statistics to call variants. Recalibration of variants is needed to remove false positives and retrieve high quality variants.

The two stage workflow, easy file handling and short processing times contribute to the great usability of GENALICE MAP.
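The overall speed-up quoted above follows directly from the totals in Tables 1 and 2. The sketch below redoes that arithmetic from the per-step times; the dictionary names and the helper function are illustrative choices, not part of either workflow.

```python
# Recompute the total turnaround times from Tables 1 and 2.
# Step labels are taken from the tables; the variable and function
# names are hypothetical and only serve this illustration.
def to_seconds(hhmmss):
    """Convert an 'hh:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


GENALICE_STEPS = {
    "gamap (align, write GAR)": "00:36:17",
    "gavariant (variant calling)": "00:08:09",
}

BWA_GATK_STEPS = {
    "BWA-MEM + SAMtools (align, write BAM)": "20:42:46",
    "SAMtools (sort)": "10:43:56",
    "SAMtools (index)": "00:30:56",
    "Picard (mark duplicates)": "12:15:08",
    "GATK HaplotypeCaller": "34:13:00",
    "GATK variant recalibration": "00:35:02",
}

genalice_total = sum(to_seconds(t) for t in GENALICE_STEPS.values())
bwa_gatk_total = sum(to_seconds(t) for t in BWA_GATK_STEPS.values())
speedup = bwa_gatk_total / genalice_total

print(f"GENALICE MAP total: {genalice_total} s")
print(f"BWA-MEM/GATK total: {bwa_gatk_total} s")
print(f"overall speed-up:   {speedup:.1f}x")
```

The per-stage factors quoted in the text depend on which steps are attributed to alignment and which to variant calling, so only the end-to-end ratio is computed here.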

3.0 Variant Calling with Better Accuracy

It does not matter how fast NGS data is processed if the resulting list of variants is of poor quality. Therefore, GENALICE takes great care in validating and optimizing the accuracy of alignment and variant calling by GENALICE MAP. This chapter describes three of those validation studies.

3.1 Benchmarking with SNP Microarray Standard

In this benchmark study, SNP detection accuracy was compared between GENALICE MAP and BWA-SW/GATK. Six Whole Exome Sequencing (WES) samples were analyzed by both workflows. Independent genotypes were obtained from SNP microarray data for the same samples, which served as the gold standard. Approximately 10 thousand SNPs can be compared between the SNP microarray gold standard and the WES data sets. A strict comparison strategy distinguished:

1. concordant calls: gold standard SNP detected in WES call set with exact nucleotide and genotype match;
2. discordant calls: gold standard SNP detected in WES call set, but with either nucleotide or genotype mismatch;
3. not called: gold standard SNP not detected in WES call set.

Accuracy is defined as the number of concordant calls divided by the total number of SNPs in the gold standard.

Whole Exome Sequencing Data

The WES data sets and microarray SNP genotypes were kindly provided by Prof. André Uitterlinden (Erasmus MC, Rotterdam).

GENALICE MAP: Better Accuracy and Fewer Missed Calls

For all six WES samples, GENALICE MAP has a better accuracy (Figure 2A). When excluding outlier sample 2, GENALICE MAP has an average accuracy of 98.6%, while the average accuracy of BWA-MEM/GATK is 97.8%. Both workflows show reduced accuracy for sample 2, but at 89.0% GENALICE MAP has considerably higher accuracy than BWA-SW/GATK (81.2%). Sample 2 appears to be an outlier and is discussed further later in this chapter.

The superior accuracy of GENALICE MAP compared to BWA-SW/GATK is achieved by detecting a higher number of gold standard SNPs. This becomes evident when no call rates are compared (Figure 2B). In every WES sample the proportion of gold standard SNPs that were not called is lower for GENALICE MAP. On average, BWA-MEM misses 2.1% and GENALICE MAP 1.1% of the gold standard SNPs.
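The three-way classification and the accuracy definition given above can be sketched in a few lines. This is a minimal illustration under stated assumptions: genotypes are represented as unordered pairs of alleles keyed by (chromosome, position), and the toy data is invented; it is not the comparison pipeline used in the study.

```python
from collections import Counter


def classify_calls(gold, calls):
    """Classify each gold standard SNP as concordant, discordant or not called.

    Both arguments map a locus, here (chrom, pos), to a genotype given as a
    pair of alleles such as ("A", "G"). This representation is an assumption
    made for the sketch.
    """
    counts = Counter()
    for locus, gold_genotype in gold.items():
        called_genotype = calls.get(locus)
        if called_genotype is None:
            counts["not_called"] += 1
        elif sorted(called_genotype) == sorted(gold_genotype):
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    return counts


def accuracy(counts):
    """Concordant calls divided by all gold standard SNPs."""
    return counts["concordant"] / sum(counts.values())


# Invented toy data, for illustration only.
gold_standard = {
    ("chr1", 100): ("A", "G"),
    ("chr1", 200): ("C", "C"),
    ("chr2", 300): ("T", "G"),
    ("chr2", 400): ("A", "A"),
}
wes_calls = {
    ("chr1", 100): ("G", "A"),   # same genotype, allele order differs
    ("chr1", 200): ("C", "C"),
    ("chr2", 300): ("T", "T"),   # genotype mismatch
}

counts = classify_calls(gold_standard, wes_calls)
print(dict(counts), f"accuracy = {accuracy(counts):.2f}")
```

Because the denominator is the full gold standard, a missed SNP lowers accuracy just as a wrong genotype does, which is why the no call rate matters in the comparison.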

Figure 2. Accuracy Performance of GENALICE MAP and BWA-SW/GATK. (A) For each sample the proportion of accurately discovered gold standard SNPs is shown for GENALICE MAP (orange) and BWA-SW/GATK (gray). (B) Proportion of gold standard SNPs that were not discovered by the WES workflows.

Coverage Depth Influences Accuracy

Low coverage depth at gold standard SNP loci is the major cause of discordant and missed calls (Figure 3). In almost all samples, concordant gold standard calls have a significantly higher coverage depth than discordant and missed calls. Both GENALICE MAP and BWA-SW/GATK show the same trend, indicating that this effect is independent of data analysis and a consequence of the coverage depth of the input WES data.

Figure 3. Coverage Depth at Gold Standard SNP Loci. (A) For each sample, coverage depth at gold standard SNP loci is shown for GENALICE MAP. Gold standard SNPs are grouped into concordant (i.e. discovered by GENALICE MAP), discordant (i.e. genotype mismatch) and no call (i.e. missed by GENALICE MAP). (B) Coverage depth at gold standard SNP loci is shown for BWA-SW/GATK.

GENALICE MAP Detects More High Quality SNPs

The accuracy comparison above showed that GENALICE MAP detects more gold standard SNPs. This observation raises the question: does GENALICE MAP identify more variants in general? Detecting more variants is only valuable if the quality of the entire call set is high. Therefore, the complete SNP call sets of GENALICE MAP and BWA-SW/GATK were also compared. In the comparison below outlier sample 2 is ignored; this sample is discussed later in this chapter.

Compared to BWA-SW/GATK, GENALICE MAP detects slightly more SNPs: between 2.4% and 4.6% more per sample. Approximately 96.6% of SNPs discovered by GENALICE MAP are described in the dbSNP database (v137) and the remaining 3.4% are novel SNP discoveries (Table 3). The BWA-SW/GATK workflow has a slightly larger proportion of SNPs in common with dbSNP (~97.5%).

Transition-transversion ratios (Ti/Tv) are a quality measure for detected SNPs. For human WES data, a call set with a Ti/Tv ratio of 2.8 is considered to contain a maximum proportion of true positives, whilst a ratio of 0.5 indicates a call set of false positives caused by random sequencing errors. Values between 0.5 and 2.8 suggest the inclusion of a certain degree of false positives. SNPs in common with dbSNP (v137) have similar Ti/Tv ratios for both GENALICE MAP and BWA-SW/GATK (Tables 3 and 4). Ti/Tv ratios of novel SNP discoveries differ greatly between GENALICE MAP and BWA-SW/GATK (Tables 3 and 4). Novel SNPs identified by GENALICE MAP have significantly higher Ti/Tv ratios, suggesting that these calls contain relatively fewer false positives than the novel BWA-SW/GATK SNPs.

Table 3. GENALICE MAP SNP Discovery (Exome Sequencing Samples)
Sample SNP calls dbsnp (%) Ti/Tv Novel (%) Ti/Tv
1 52,119 50, , ,260 36, , ,675 51, , ,969 50, , ,240 51, , ,361 50, ,

Table 4. BWA-SW/GATK (UnifiedGenotyper) SNP Discovery (Exome Sequencing Samples)
Sample SNP calls dbsnp (%) Ti/Tv Novel (%) Ti/Tv
1 50,870 48, , ,485 35, , ,327 50, , ,359 48, , ,995 50, ,352 49,
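The Ti/Tv quality measure described above is simple to compute from a list of SNPs. The sketch below uses invented example data and illustrative function names. It also shows where the value 0.5 for purely random errors comes from: of the twelve possible single-base substitutions, four are transitions and eight are transversions.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps a purine for a purine or a pyrimidine for a pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    snps = list(snps)
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("Ti/Tv is undefined without transversions")
    return transitions / transversions


# Every possible substitution exactly once: what uniformly random errors
# would produce on average.
all_substitutions = list(permutations("ACGT", 2))
print(ti_tv_ratio(all_substitutions))

# Invented toy call set: three transitions, two transversions.
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(toy_calls))
```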

Figure 4. Intersection between GENALICE MAP and BWA-SW/GATK SNP call sets. Venn diagrams (A-F) show overlapping SNP calls between GENALICE MAP (orange) and BWA-SW/GATK (gray) for all six WES samples.

Concordance Between GENALICE MAP and BWA-SW/GATK

The intersection between the GENALICE MAP and BWA-SW/GATK call sets is between 80% and 85% (Figure 4), demonstrating that the two workflows are highly concordant. There are, however, a considerable number of calls that are unique to a single workflow, indicating that differences in alignment and variant calling strategies between the two workflows will reveal distinct biological insights.

Better Accuracy with Lower Quality Data

Sample 2 is an outlier in terms of accuracy (Figure 2A), no call rate (Figure 2B), coverage depth at gold standard SNP loci (Figure 3) and number of discovered SNPs (Tables 3 and 4). For this sample, lower coverage depth is an important factor that negatively influences accuracy in both the GENALICE MAP and BWA-SW/GATK workflows. Yet, GENALICE MAP outperforms BWA-SW/GATK with an accuracy that is 7.8 percentage points higher (89.0% versus 81.2%).

In this sample, GENALICE MAP has higher coverage depth at gold standard loci. The median coverage depth across all gold standard SNPs is 26 for GENALICE MAP and only 16 for BWA-SW/GATK. Since coverage depth at gold standard SNPs is measured using aligned reads as input, it is fair to state that this difference in coverage depth is caused by distinct read alignment strategies. Clearly, GENALICE MAP's alignment strategy works better for this sample than BWA-SW.
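The call set intersection shown in Figure 4 can be illustrated with plain set operations. The white paper does not state which denominator underlies the reported percentages, so the sketch below assumes the intersection is expressed relative to the union of both call sets; the record layout and the toy calls are also assumptions.

```python
def compare_call_sets(calls_a, calls_b):
    """Summarise the overlap between two SNP call sets.

    Each call set is a set of (chrom, pos, ref, alt) tuples. The percentage
    is taken relative to the union of both sets, which is an assumption and
    not necessarily the definition used for Figure 4.
    """
    shared = calls_a & calls_b
    union = calls_a | calls_b
    return {
        "shared": len(shared),
        "only_a": len(calls_a - calls_b),
        "only_b": len(calls_b - calls_a),
        "intersection_pct": 100.0 * len(shared) / len(union) if union else 0.0,
    }


# Invented toy call sets, for illustration only.
workflow_a = {("chr1", pos, "A", "G") for pos in (10, 20, 30, 40, 50)}
workflow_b = {("chr1", pos, "A", "G") for pos in (20, 30, 40, 50, 60)}

overlap = compare_call_sets(workflow_a, workflow_b)
print(overlap)
```

Keying each call on position and alleles means that two workflows reporting different alternate alleles at the same position count as distinct calls, which matches the strict comparison style used elsewhere in this chapter.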

In the complete SNP call sets for sample 2 (Tables 3 and 4) there is a substantially lower number of SNPs compared to the other samples. BWA-SW/GATK calls 3.1% more SNPs than GENALICE MAP. This increase mostly comes from novel SNPs that have not yet been described in dbSNP (v137). Importantly, these novel SNPs have a very low Ti/Tv ratio of 0.93, suggesting a high degree of false positive calls. Novel SNPs detected by GENALICE MAP most likely also contain false positives, but based on a Ti/Tv ratio of 1.43 and a lower absolute number, the total number of false positive calls will be lower than in the BWA-SW/GATK call set.

The SNP call sets from GENALICE MAP and BWA-SW/GATK are more divergent for this sample than for the other five WES samples. At 64.6%, the intersection between the two call sets is significantly lower than observed for the other samples (Figure 4). Given the better accuracy against the gold standard, higher coverage depth at gold standard loci and better Ti/Tv ratios, it is fair to conclude that GENALICE MAP delivers a more reliable SNP call set than BWA-SW/GATK.

Less Is More

How does GENALICE MAP obtain such an improved SNP call set on a relatively low quality WES data set such as sample 2? Part of the answer lies in GENALICE MAP's alignment strategy, which results in better coverage at gold standard loci. Simply aligning more reads, however, is not the entire answer. Another important aspect of alignment by GENALICE MAP is filtering out reads that are mapped with relatively low confidence. Distributions of GENALICE MAP alignment results show that sample 2 has 9.5% low confidence mapped reads that were filtered out, while the proportion of low confidence reads in the other five samples ranges between 3.4% and 5.8% (Figure 5). In-house validation studies have revealed that filtering out low confidence mappings results in better accuracy. It seems that low confidence mappings introduce noise into the variant calling process, which lowers accuracy and increases the chance of retrieving false positives.

Figure 5. GENALICE MAP read alignment result distribution for all six WES samples. Pie charts (A-F) show the mapping result for each sample. GENALICE MAP discriminates six result categories. Perfect (dark green) and Partial (light green) are reads that have been mapped to unique reference positions. These reads make up the largest proportion of processed reads and contribute to downstream variant calling. Repeats (blue) are reads mapped to non-unique reference positions. Low confidence reads (yellow) have been mapped, but a large proportion of the read does not match the reference sequence and therefore they are filtered out. Bad reads (gray) do not pass read quality control prior to alignment. Unmappable (red) reads could not be placed on the reference sequence.

3.2 Comparison with Genome In A Bottle (GIB) Data

In addition to WES data we also validated GENALICE MAP's performance on high coverage whole genome sequencing (WGS) data. The Genome In A Bottle (GIB) consortium has published a high confidence variant call set for sample NA12878 [Zook et al. 2014]. For this sample, 14 NGS sequence data sets from 5 sequencing technologies were integrated. Data analysis was based on 7 different read aligners and 3 variant callers. In this white paper, one of the 14 NGS sequence data sets was used in the runtime benchmark study (Chapter 2). Here, the GENALICE MAP and BWA-MEM/GATK variant call sets from these analyses were compared to version 2.18 of GIB's high confidence set. Despite the efforts of GIB to minimize bias towards any technology used in NGS data processing, we believe that variant calling in the high confidence set is biased in favour of GATK's UnifiedGenotyper and HaplotypeCaller, because those two genotypers were used on the bulk of the data sets.

GENALICE MAP Has Better Concordance with High Confidence SNPs

Table 5 summarizes SNP concordance with GIB's high confidence SNPs for GENALICE MAP and BWA-MEM/GATK. GENALICE MAP reports more concordant calls (98.4%) than BWA-MEM/GATK (95.2%). As observed with the WES samples (Paragraph 3.1), GENALICE MAP is capable of making more calls at high confidence SNP loci than BWA-MEM/GATK.

Table 5. Concordance with GIB's high confidence SNP call set (NA12878 full genome; 50x)
Workflow Concordant (%) Discordant (%) No Call (%)
GENALICE MAP 2,696, , , BWA- MEM/GATK 2,608, ,

GENALICE MAP detects significantly more SNPs than BWA-MEM/GATK (Table 6).
A large proportion of SNPs in the GENALICE MAP call set are listed in dbSNP (v137) and have a Ti/Tv ratio close to 2.1, which is expected for high quality SNPs derived from whole genome sequencing data. The novel calls have a Ti/Tv ratio of 1.53, which indicates a certain degree of false positives. This Ti/Tv ratio is, however, better than the Ti/Tv ratio observed for novel calls discovered by BWA-MEM/GATK.

Table 6. NA x coverage full genome SNP call sets
Workflow SNP calls dbsnp (%) Ti/Tv Novel (%) Ti/Tv
GENALICE MAP 3,954,083 3,766, , BWA- MEM/GATK 3,367,028 3,367, ,

Despite a bias in favour of GATK in GIB's high confidence SNP call set, GENALICE MAP has a better concordance than BWA-MEM/GATK. GENALICE MAP's total SNP call set is larger and has a generally higher Ti/Tv ratio than that of BWA-MEM/GATK.

Indel Comparison: a Work in Progress

At this moment, GENALICE MAP has a concordance of more than 85% with GIB's high confidence indel set. A large proportion of high confidence indels is not detected because GIB's reporting conventions differ from those of GENALICE MAP. Since GIB's high confidence set is biased in favour of GATK's HaplotypeCaller, the latter does not suffer from these differences. In many cases, GENALICE MAP uses an alternative reporting convention that describes an identical sequence change [Tolhuis 2014].
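The point about reporting conventions can be made concrete: inside a repeat, the same deletion can be written at more than one position, and a naive record-by-record comparison then counts the two as different variants. The sketch below uses an invented reference and invented records and applies each record to the reference to show that both yield the same sequence. It illustrates the general issue only, not the method used by GENALICE MAP or GIB.

```python
def apply_variant(reference, pos, ref, alt):
    """Apply one VCF-style record to a reference string.

    pos is 1-based, as in VCF. Raises ValueError if the REF allele does not
    match the reference at that position.
    """
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference")
    return reference[:start] + alt + reference[start + len(ref):]


# Invented reference containing a short CA repeat.
REFERENCE = "GCACACAT"

# Two ways of reporting the deletion of one CA unit (hypothetical records).
left_aligned = (1, "GCA", "G")
shifted_right = (3, "ACA", "A")

haplotype_1 = apply_variant(REFERENCE, *left_aligned)
haplotype_2 = apply_variant(REFERENCE, *shifted_right)

print(haplotype_1, haplotype_2, haplotype_1 == haplotype_2)
```

Comparing the resulting haplotypes rather than the raw records, or normalizing both call sets to one convention first, is a common way to avoid penalizing equivalent representations.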

3.3 GCAT: A Comparison in the Public Domain

We further benchmarked GENALICE MAP using the public Genome Comparison & Analytic Testing (GCAT) website. On this site, combinations of variant calling workflows are compared across different data sets. Public facing reports show how various aligners and variant callers perform against each other. GCAT's variant call reports present a series of metrics, including comparison to the Genome In A Bottle (GIB) high confidence call set, consistency with genotyping microarray data and concordance with other workflows. This paragraph summarizes GENALICE MAP's performance. Two Illumina exome sequencing data sets with different coverage depths (150x and 30x) were compared. Full summary reports are available through the GCAT website:

150x: qmsrmiqaqe/variant- calls/illumina- 100bp- pe- exome- 150x/genalice- genalice/compare /on- target/group- read- depth
30x: wlzrlsdrjw/variant- calls/illumina- 100bp- pe- exome- 30x/genalice- genalice/compare elbhaavyuj snxrdrztpi/group- read- depth

Please note that GCAT requires registration and login before these links will work. Since GCAT provides exome sequencing data for benchmarking, GENALICE MAP variant calling was limited to the targeted exome regions. Therefore, results presented below are also restricted to exome target regions.

SNPs: Excellent Precision Rates, Sensitivity and Specificity

Comparison against the GIB high confidence SNP call set shows that GENALICE MAP produces a call set with excellent precision rate (98.5%), sensitivity (97.3%) and specificity (99.9%) for the high coverage data (Figure 6A). These metrics are comparable to those of the other presented workflows (Novoalign/GATK, BWA-MEM/GATK and Isaac Aligner/Variant Caller). At low coverage depth GENALICE MAP has a better precision rate, and its sensitivity is comparable to Novoalign/GATK (Figure 6B). BWA-MEM/GATK UnifiedGenotyper has the best sensitivity, while the same aligner in combination with GATK HaplotypeCaller has a significantly lower sensitivity.

GCAT further compares SNP call sets to an NGS independent genotype source, namely the Illumina OMNI SNP array. In the high coverage data, GENALICE MAP performs slightly better than the other three presented workflows on precision rate, sensitivity and specificity (Figure 7A). As observed for the comparison with GIB high confidence SNPs, the low coverage data shows a better precision rate for GENALICE MAP, and its sensitivity is comparable to Novoalign/GATK (Figure 7B). Moreover, GENALICE MAP has slightly better specificity in this low coverage data.
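For readers less familiar with the three metrics reported above, the sketch below gives their standard textbook definitions in terms of true and false positives and negatives. Whether GCAT computes them in exactly this way is an assumption, and the counts in the example are invented.

```python
def precision(true_pos, false_pos):
    """Fraction of reported calls that are correct."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of true variants that were reported (also called recall)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return true_neg / (true_neg + false_pos)


# Invented counts, for illustration only.
TP, FP, FN, TN = 90, 10, 30, 870

print(f"precision   = {precision(TP, FP):.3f}")
print(f"sensitivity = {sensitivity(TP, FN):.3f}")
print(f"specificity = {specificity(TN, FP):.3f}")
```

Because non-variant sites vastly outnumber variant sites in real data, specificity tends to sit very close to 100% for every workflow, so precision and sensitivity usually discriminate better between call sets.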

Figure 6. Screenshot from GCAT website comparing against GIB high confidence SNP call set. (A) High coverage exome sequencing data. Precision rate, sensitivity and specificity values are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and Isaac Aligner/Variant Caller (red). (B) Low coverage exome sequencing data. Precision rate, sensitivity and specificity values are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and BWA-MEM/GATK HaplotypeCaller (red).

Figure 7. Screenshot from GCAT website comparing against Illumina OMNI SNP array. (A) High coverage exome sequencing data. Precision rate, sensitivity and specificity values are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and Isaac Aligner/Variant Caller (red). (B) Low coverage exome sequencing data. Precision rate, sensitivity and specificity values are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and BWA-MEM/GATK HaplotypeCaller (red).

SNPs: Outstanding Transition Transversion (Ti/Tv) Ratios

Examination of Ti/Tv ratios shows that novel SNPs detected by GENALICE MAP have a far better ratio than those of any of the other presented workflows (Figure 8). This suggests that novel SNPs detected by GENALICE MAP contain relatively fewer false positives. In the high coverage data, the total number of novel SNPs is comparable between GENALICE MAP (n=2,585) and Isaac Aligner/Variant Caller (n=2,324). The other two workflows detect significantly fewer novel SNPs (Novoalign/GATK UnifiedGenotyper n=665; and BWA-MEM/GATK UnifiedGenotyper n=881). Based on these metrics it seems likely that GENALICE MAP discovers more high quality novel SNPs and, as such, new biological insights.

Figure 8. Screenshot from GCAT website comparing Ti/Tv ratios. (A) High coverage exome sequencing data. Ti/Tv ratios for novel SNPs and SNPs in common with the dbSNP database are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and Isaac Aligner/Variant Caller (red). (B) Low coverage exome sequencing data. Ti/Tv ratios for novel SNPs and SNPs in common with the dbSNP database are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and BWA-MEM/GATK HaplotypeCaller (red).

Indels: High Quality Calls

GCAT also allows comparison against the GIB high confidence indel call set. With low coverage data, GENALICE MAP has a better precision rate (90.8%) and sensitivity (77.7%) than any of the other three presented workflows (Figure 9B). In high coverage data, the sensitivity of GENALICE MAP is comparable to BWA-MEM/GATK UnifiedGenotyper and Novoalign/GATK UnifiedGenotyper and considerably better than Isaac Aligner/Variant Caller (Figure 9A). The precision rate of GENALICE MAP is somewhat lower than that of BWA-MEM/GATK UnifiedGenotyper and Novoalign/GATK UnifiedGenotyper, but substantially better than that of Isaac Aligner/Variant Caller.

Figure 9. Screenshot from GCAT website comparing against GIB high confidence indel call set. (A) High coverage exome sequencing data. Precision rate, sensitivity and specificity values are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and Isaac Aligner/Variant Caller (red). (B) Low coverage exome sequencing data. Precision rate, sensitivity and specificity values are shown for GENALICE MAP (blue), BWA-MEM/GATK UnifiedGenotyper (orange), Novoalign/GATK UnifiedGenotyper (green) and BWA-MEM/GATK HaplotypeCaller (red).

4.0 Conclusions

4.1 Revolutionary Fast Alignment and Variant Calling

GENALICE MAP aligns short paired-end reads (101 bases) about 50 times faster than BWA-MEM. The alignment speed of BWA-MEM on short paired-end reads (70 bases) is similar to that of GEM and Bowtie2 [Li, 2013]. Hence, short read alignment using GENALICE MAP is substantially faster than other widely used software solutions.

The software design of GENALICE MAP and the data structure of the GAR format provide a revolutionary acceleration of variant calling: it is over 350 times faster than GATK and its associated processing tools. GENALICE MAP reduces processing time to minutes and turns variant calling into an iterative process in which parameters can easily be optimized for tailor-made, high quality variant detection.

Another attractive aspect of GENALICE MAP is its ease of use. Generating high quality variants from raw sequence reads requires only two processing steps and produces only two output files: 1) a GENALICE Aligned Reads (GAR) file to store the aligned reads; and 2) a VCF file to store high quality variants.

4.2 Highly Accurate Variant Detection

Accuracy of variant detection is of utmost importance for NGS data analysis. The three validation studies (Chapter 3) demonstrate that:

1. GENALICE MAP detects variants with better accuracy;
2. GENALICE MAP generally has a lower no-call rate;
3. GENALICE MAP calls more high quality variants;
4. SNPs detected by GENALICE MAP generally have higher Ti/Tv ratios and, by extrapolation, relatively fewer false positives;
5. GENALICE MAP excels on lower quality data;
6. GENALICE MAP's precision rate, sensitivity and specificity are on par or better.

In conclusion, the GENALICE MAP aligner and the beta version of the variant caller produce highly accurate variant call sets.

4.3 What Happens Next?

GENALICE will continue to develop the GENALICE MAP aligner (v1.1.3) and variant caller (beta version). Ongoing work focuses on further improving sensitivity and specificity in SNP and indel detection. The GENALICE MAP aligner is capable of splitting a sequence read and mapping its parts to distinct regions of the reference sequence [Tolhuis 2014]. These so-called breaks will be used for structural variant calling in future releases of GENALICE MAP. GENALICE will report these developments in future white papers.
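The Ti/Tv argument in Section 4.2 rests on simple counting: each base has one possible transition (purine to purine or pyrimidine to pyrimidine) and two possible transversions, so purely random substitution errors would give a ratio near 0.5, whereas genuine human variation is enriched for transitions. A minimal sketch of the calculation follows. The function names and the (REF, ALT) tuple input are assumptions made for illustration; this is not how GENALICE MAP computes the ratio.

```python
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over (REF, ALT) pairs.

    Records that are not single-base substitutions (indels, ambiguous
    bases, REF equal to ALT) are skipped.
    """
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


if __name__ == "__main__":
    # Toy input: four transitions, two transversions, one indel (skipped).
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
             ("A", "C"), ("G", "T"), ("A", "AT")]
    print(titv_ratio(calls))  # 2.0
```

A call set whose ratio drifts toward 0.5 is therefore suspected of containing more random errors, which is the extrapolation made in point 4.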

5.0 Materials and Methods

5.1 Compute Resources

Benchmarking Hardware

All data analyses were performed on a server with the following hardware configuration:

Dual Intel Xeon E V2 ( MHz) CPU with a total of 12 cores and 24 threads
128GB Memory (8x16GB, 1333 MHz)
2x256GB Solid State Disks (read/write: 270 MB/second)

The server was connected to storage servers via an Infiniband adapter. NGS reads were streamed over the network from FASTQ files (located on one of the storage servers) into GENALICE MAP. The Infiniband and storage server configurations are:

Infiniband adapter of 20 Gigabit/second
2x9 Terabytes disk space
RAID 5 disk with 1 Gigabyte/second streaming speed

Operating System

GENALICE MAP runs on SUSE Linux Enterprise Server 11 Service Pack 2 on a standard server board.

5.2 Data Sets

WGS NA12878 Illumina's Platinum Genome Series (50x coverage)

The NA12878 data set from Illumina's Platinum Genomes was downloaded from the Illumina website. This sample was sequenced to 50x depth on a HiSeq 2000 system and is part of a 17-member pedigree (CEPH Family 1463). We downloaded archival BAM files from the European Nucleotide Archive (Study ERP001960) and converted them back to FASTQ files using the bamutils package. The resulting FASTQ files served as input for the benchmark studies.

WES data sets

Targeted sequencing data sets were kindly provided by Prof. André Uitterlinden (Erasmus MC, Rotterdam, the Netherlands). The data consist of 101-base paired-end reads.

Reference Sequence

GENALICE MAP aligns next generation sequence reads against a reference sequence. In this study we used build 37 (hg19) of the human genome sequence as the reference.

5.3 External Software

BWA-MEM and BWA-SW

In this study BWA version was applied to all data. BWA-MEM and BWA-SW were run with default parameter settings.

GATK

GATK version 3.1 was used throughout this study. Indel realignment, UnifiedGenotyper and VariantFiltration were applied to the WES samples in order to detect SNPs. HaplotypeCaller was used for the NA12878 WGS data. For this sample, GATK's Best Practices parameter settings were used.
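Section 5.2 describes FASTQ files regenerated from archival BAM files and WES data made up of 101-base paired-end reads. A quick sanity check on such input is to tally read lengths before benchmarking. The sketch below is an illustration, not part of the study's workflow. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names are hypothetical.

```python
import gzip
import io
from collections import Counter
from typing import IO, Iterator


def open_fastq(path: str) -> IO[str]:
    """Open a FASTQ file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_lengths(handle: IO[str]) -> Iterator[int]:
    """Yield the sequence length of every record.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    for index, line in enumerate(handle):
        if index % 4 == 1:  # second line of each record holds the bases
            yield len(line.strip())


def length_histogram(handle: IO[str]) -> Counter:
    """Count how many reads occur at each length."""
    return Counter(read_lengths(handle))


if __name__ == "__main__":
    # Tiny in-memory FASTQ used in place of a real file.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n")
    for length, count in sorted(length_histogram(demo).items()):
        print(f"{length} bases: {count} read(s)")
```

For a real file, `length_histogram(open_fastq(path))` would be expected to show a single dominant length if the reads are untrimmed, for example 101 for the WES data described above.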

6.0 Literature References

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: [q-bio.GN]

Karten, H. (2014). GENALICE MAP on Intel - A Perfect Match. White paper: map/

Tolhuis, B. (2013). GENALICE MAP: An Introduction to a New High Performance Read Alignment Solution. White paper: map/

Tolhuis, B. (2014). GENALICE MAP: Insertion and deletion detection goes at great length. White paper: map/

Zook, J.M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., and Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotech 32,

7.0 More Information

For more information please contact us:

Phone:
Website:
Twitter: GenaliceDNA
YouTube: GenaliceDNA
Facebook: GenaliceDNA

GENALICE BV
Deventerweg 9d
3843 GA Harderwijk
The Netherlands