Whole Genome Sequence Data Quality Control and Validation

GoSeqIt ApS / Ved Klædebo 9 / 2970 Hørsholm / VAT No. DK / Phone / Web: / mail: mail@goseqit.com

Table of Contents

1. INTRODUCTION
2. ORGANISMS TO BE INCLUDED
3. QUALITY CONTROL METRICS
   3.1. Raw Data Quality Control
      3.1.1. Average Read Length after Trimming
      3.1.2. Depth of Coverage
      3.1.3. Uniformity of Coverage
      3.1.4. Contamination
   3.2. Data Analysis Quality Control
      3.2.1. SNPs Between Sample and Reference Genome
      3.2.2. Breadth of Coverage
      3.2.3. N50 Value
      3.2.4. Species Identification
      3.2.5. Multilocus Sequence Typing
      3.2.6. Identification of Genes for Antimicrobial Resistance or Virulence
      3.2.7. Establishing the Phylogenetic Relationship of Sample Organisms
   3.3. Sequencing Run Quality Control
4. VALIDATION PERFORMANCE CHARACTERISTICS
5. ADDITIONAL PERFORMANCE CHARACTERISTICS
6. REFERENCES

The GoSeqIt system for Whole Genome Sequence (WGS) Validation and Quality Control has been developed on the basis of the GMI proficiency tests ( workgroups/about-the-gmi-proficiency-tests) and the publication "Validation and Implementation of CLIA-Compliant Whole Genome Sequencing (WGS) in Public Health Laboratory" by Kozyreva VK et al. (1).

1. INTRODUCTION

Whole Genome Sequencing (WGS) has for years been an integrated part of the workflow of leading research laboratories and holds tremendous promise for pathogen identification, antibiotic resistance profiling, and outbreak detection. For the technology to become part of routine diagnostics in clinical and public microbiology laboratories, standardisation and validation are of pivotal importance.

GoSeqIt has developed a system for quality control and validation of bacterial WGS data, which we offer to anyone who needs to be able to trust the data produced by their sequencing machine. The validation panel comprises 6 organisms (2 Escherichia coli, 1 Salmonella enterica, 1 Staphylococcus aureus, 1 Enterococcus faecalis, and 1 Aeromonas hydrophila). You may choose to have all 6 strains included in the proficiency test or any subset thereof.

How does it work?

1) You order the strains you wish to have included in the test via ATCC or a similar culture collection.
2) You extract DNA using the procedure you normally use and perform whole genome sequencing on your Illumina sequencer. Include PhiX as a positive control in each run.
   Note: While we will identify all quality parameters regardless of the library preparation procedure and sequencing kit used, we recommend using Nextera XT library preparation and 2x300 cycle MiSeq sequencing kits. If you do, thresholds for all quality parameters will also be available.
3) You share the output from the sequencing run with GoSeqIt via the Illumina BaseSpace Sequence Hub.
4) Within 3 working days of sharing your sequence data, you receive a comprehensive report providing a complete picture of the quality of your sequence data. If any of the quality parameters do not meet the expected thresholds, the report will additionally contain advice on how to troubleshoot your procedures, consumables, and equipment to ensure optimal quality of your sequence data.

2. ORGANISMS TO BE INCLUDED

The validation panel comprises a diverse set of 6 bacterial organisms (Table 1) representing genome sizes from MB and a wide range of GC content (from 33% to 62%). You may choose to have all 6 strains included in the proficiency test or any subset thereof.

ID    Species                                                    ATCC strain ID   GC content (%)   Size (bp)   Division
C2    Aeromonas hydrophila                                       ATCC                                          gram-negative
C3    Escherichia coli                                           ATCC                                          gram-negative
C5    Staphylococcus aureus                                      ATCC                                          gram-positive
C6    Salmonella enterica subsp. enterica serovar Typhimurium    ATCC                                          gram-negative
C46   Enterococcus faecalis                                      ATCC                                          gram-positive
C55   Escherichia coli                                           ATCC                                          gram-negative

Table 1: Organisms that can be included in the proficiency test.

We are continuously working to add more organisms to the panel. Please contact us (mail@goseqit.com) if you are interested in a species that is not well represented by any of the organisms currently in the panel.

3. QUALITY CONTROL METRICS

In the following, we describe the quality control metrics that we calculate on the basis of your raw and processed sequence data.

3.1. Raw Data Quality Control

3.1.1. Average Read Length after Trimming

Illumina sequencers determine the bases of single-stranded DNA templates as fluorescently labelled nucleotides are added to the complementary DNA strands by DNA polymerases (sequencing by synthesis). The output from the process is sequences that are a few hundred bases long, the so-called reads. Each base in a read is associated with a quality score (Phred score) that indicates how trustworthy the base call is. A high quality score means that the base is likely to be correct, while the converse is true for a low score. A quality score of 30 (Q30) means that, on average, 1 out of 1000 bases with this quality score is expected to be wrong (the error probability is 10^-3). A quality score of at least 30 is often used as the threshold for accepting bases for downstream analyses. Illumina reads typically have high quality scores for the bases in the first part of the reads (the 5' end), with lower scores for the bases towards the 3' end of the reads (Figure 1A).
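The relationship between a Phred score and the error probability can be illustrated with a short Python sketch. The function names are illustrative only, and the decoding step assumes the Phred+33 ASCII encoding used in current Illumina FASTQ files.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability back into a Phred quality score."""
    return -10 * math.log10(p)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
    # '?' encodes Q30 and 'I' encodes Q40 under Phred+33.
    print(decode_phred33("II?#"))
```

Under this definition, every increase of 10 in the quality score corresponds to a tenfold lower error probability.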

Figure 1: The per base sequence quality of reads that were generated when sequencing an organism of the species A. hydrophila. The horizontal axis denotes the position in the reads, while the vertical axis shows the average quality score of the base calls across all reads at a particular position. A: The per base sequence quality before quality trimming. Note the typical drop in average quality scores towards the 3' end of the reads. B: The per base sequence quality after trimming to a minimum quality of Q30 from the 3' end.
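The 3' end trimming illustrated in Figure 1, and the average read length that results from it, can be sketched as follows. This is a minimal illustration under one assumption: bases are removed from the 3' end until a base with a quality of at least Q30 is reached. The trimming tool and exact rule used for the report are not specified here, and the example quality values are made up.

```python
def trimmed_length(qualities: list[int], min_quality: int = 30) -> int:
    """Return the read length that remains after removing 3' bases below min_quality."""
    end = len(qualities)
    # Walk backwards from the 3' end while the last remaining base is below the threshold.
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return end


def average_trimmed_length(reads: list[list[int]], min_quality: int = 30) -> float:
    """Average read length after 3' end trimming; fully trimmed reads count as length 0."""
    if not reads:
        return 0.0
    lengths = [trimmed_length(q, min_quality) for q in reads]
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    # Hypothetical per-base quality scores for three short reads (5' end first).
    example_reads = [
        [38, 37, 35, 31, 22, 12],
        [36, 34, 30, 30, 29],
        [20, 15, 10],
    ]
    print(average_trimmed_length(example_reads))
```

Comparing the result with the untrimmed average read length indicates how much sequence is lost to low-quality 3' ends.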

The first Raw Data QC metric included in the report is the average read length following 3' end trimming of the reads to Q30. This measure tells you how much of your data has to be discarded due to poor quality scores.

3.1.2. Depth of Coverage

When sequencing, each position in the original bacterial genome is typically sequenced several times. The number of times a position is sequenced equals the number of reads that cover it (Figure 2); the average of this number across all positions in the genome is called the depth of coverage, or just coverage. While it is desirable that the depth of coverage exceeds a certain threshold, particularly if the aim is to identify Single Nucleotide Polymorphisms (SNPs), too high a depth of coverage can cause other downstream analyses to fail. The second Raw Data QC metric included in the report is the average depth of coverage of the genome.

Figure 2: Depth of coverage describes how many times on average each position in the genome (shown in red) is covered by a read (shown in blue).
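As a rough illustration of the depth of coverage, the sketch below counts how many reads cover each position and averages the counts over the genome. It assumes that reads have already been mapped and are given as hypothetical (start, end) intervals on the genome, 0-based with the end excluded; the function names and the toy data are not part of the GoSeqIt pipeline.

```python
def per_position_depth(read_intervals: list[tuple[int, int]], genome_size: int) -> list[int]:
    """Count how many mapped reads cover each genome position."""
    depth = [0] * genome_size
    for start, end in read_intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def average_depth(depth: list[int]) -> float:
    """Average depth of coverage across all positions of the genome."""
    return sum(depth) / len(depth)


if __name__ == "__main__":
    # Toy genome of 10 positions covered by three mapped reads.
    reads = [(0, 5), (3, 8), (4, 10)]
    depth = per_position_depth(reads, genome_size=10)
    print(depth)
    print(average_depth(depth))
```

When all reads map to the genome, the same average can be obtained without mapping as the total number of sequenced bases divided by the genome size.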

3.1.3. Uniformity of Coverage

Some areas of bacterial genomes are typically sequenced with a lower depth of coverage than others, e.g., areas with a high GC content, since these areas are often poorly amplified during the template amplification step. The third and fourth Raw Data QC metrics are the uniformity of coverage at 10X and 50X coverage, respectively; in other words, the percentage of positions in the sample genome that have a depth of coverage of at least 10 or at least 50.

3.1.4. Contamination

Contamination can be caused by insufficient cleaning of lab working areas or of the Illumina instrument. To draw correct conclusions from downstream analyses, contamination must be kept to a minimum. Reads generated from contaminating DNA will typically not resemble any sequence in the genome of the test organism, hence these reads will not map to the genome. The fifth Raw Data QC metric is the percentage of unmapped reads.

3.2. Data Analysis Quality Control

The Data Analysis QC metrics are calculated on the basis of processed sequence data and are described in detail below.

3.2.1. SNPs Between Sample and Reference Genome

For each of the organisms in the test panel, the complete genome has previously been established. This makes it possible to determine the accuracy of the base calls by comparing the bases of the reference genome with the bases of the trimmed reads mapped to the reference genome. As the first Data Analysis QC metric, we identify the number of SNPs between the sample and the reference genome.

3.2.2. Breadth of Coverage

The process of putting together short reads into longer, continuous stretches of DNA is called assembly. Rarely will the assembly process generate the entire bacterial chromosome as one continuous stretch of DNA. A more common scenario is that the assembly step produces several fragments, called contigs, for each chromosome and each plasmid, resulting in a so-called draft genome. The second Data Analysis QC metric is the breadth of coverage. This metric describes the proportion of positions in the reference genome that are covered by the draft genome (Figure 3).

Figure 3: Breadth of coverage describes the proportion of a reference genome (shown in red) that is covered by a position in the draft genome (shown in blue).

3.2.3. N50 Value

The most widely used metric for describing the quality of a draft genome is the N50 value. It is defined as the length (in base pairs) of the shortest contig in the set of longest contigs that together make up at least half of the size of the draft genome (Figure 4). The third Data Analysis QC metric is the N50 value.
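The two draft genome metrics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation behind the report: the contig lengths are hypothetical, and the contig-to-reference alignments are assumed to be available as (start, end) intervals on the reference, 0-based with the end excluded.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length of the shortest contig among the longest contigs covering at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def breadth_of_coverage(aligned_intervals: list[tuple[int, int]], reference_length: int) -> float:
    """Proportion of reference positions covered by at least one aligned contig interval."""
    covered = 0
    current_end = 0
    for start, end in sorted(aligned_intervals):
        # Only count the part of the interval that extends past what is already covered.
        start = max(start, current_end)
        if end > start:
            covered += end - start
            current_end = end
    return covered / reference_length


if __name__ == "__main__":
    # Hypothetical draft genome of 6 contigs (lengths in bp); total 290, half 145.
    print(n50([80, 70, 50, 40, 30, 20]))
    # Hypothetical alignments of contigs to a reference of 20 positions.
    print(breadth_of_coverage([(0, 10), (2, 5), (8, 15)], reference_length=20))
```

In the N50 example, the two longest contigs (80 + 70 = 150 bp) are the smallest set of longest contigs reaching half of the 290 bp assembly, so the N50 is 70 bp.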

Figure 4: Identification of the draft genome N50 value. The example draft genome consists of 6 contigs. The N50 value is defined as the length of the shortest contig among the set of longest contigs that together make up at least half the assembly size.

3.2.4. Species Identification

For several decades, bacterial molecular taxonomy has been dominated by the 16S rRNA gene. With the advent of WGS it has, however, become possible to determine the bacterial species based on a larger proportion of the genome. We have previously shown that a k-mer-based approach (k-mers are oligonucleotides of length k) that samples k-mers dispersed across the entire bacterial genome is superior to 16S rRNA-based species identification (2). As the fourth Data Analysis QC metric, the species of the sample organism is determined using a k-mer-based approach.

3.2.5. Multilocus Sequence Typing

Once the species of a bacterium has been determined, Multilocus Sequence Typing (MLST) can be used to further subtype the organism. Since the first MLST scheme was developed for Neisseria meningitidis in 1998, schemes for more than 130 bacterial species have been developed. When performing MLST, the sequences of internal regions of typically seven genes are first identified. The exact sequences correspond to alleles as specified in one of the public MLST databases (e.g., ), and the combination of specific alleles corresponds to a Sequence Type (ST). We determine the Multilocus Sequence Type of the sample organisms as the fifth Data Analysis QC metric.

3.2.6. Identification of Genes for Antimicrobial Resistance or Virulence

The aim of Whole Genome Sequencing of bacteria is often to determine whether the isolate contains particular genes that could, for instance, make the isolate resistant to certain antimicrobials or particularly virulent. Using a pre-compiled database of antimicrobial resistance genes and virulence factors, we examine whether the genes previously established to be present in the sample organisms can be identified using the processed sequence data. This is the sixth Data Analysis QC metric.

3.2.7. Establishing the Phylogenetic Relationship of Sample Organisms

Whole Genome Sequencing provides an unprecedented level of resolution for outbreak detection, allowing discrimination of strains that are indistinguishable using traditional laboratory methods. Using WGS data, a difference of as little as 1 SNP between two isolates can be used to differentiate them. For the seventh Data Analysis QC metric, we generate phylogenetic trees to test the genetic relatedness of isolates from the same strain (accuracy of clustering, related; Figure 5A). Further, we generate phylogenetic trees to test the absence of clustering of isolates from the same species but a different strain (accuracy of clustering, unrelated; Figure 5B).
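SNP-based phylogenetic trees are typically built from the pairwise SNP differences between isolates. The sketch below shows only that counting step, not the tree construction, and it is not the method used for the report. It assumes the genomes have already been reduced to aligned sequences of equal length (for example, consensus sequences called against the same reference); the isolate names and sequences are invented for illustration.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences carry different unambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in valid and b in valid  # skip gaps and ambiguous calls such as N
    )


def pairwise_snp_distances(genomes: dict[str, str]) -> dict[tuple[str, str], int]:
    """SNP distance for every pair of isolates, keyed by the sorted pair of names."""
    return {
        (name_a, name_b): snp_distance(genomes[name_a], genomes[name_b])
        for name_a, name_b in combinations(sorted(genomes), 2)
    }


if __name__ == "__main__":
    # Hypothetical aligned sequences: two replicate samples of one strain and an unrelated strain.
    genomes = {
        "reference": "ACGTACGTAC",
        "sample_1": "ACGTACGTAT",
        "sample_2": "ACGTACGTAT",
        "unrelated": "TCGAACCTAG",
    }
    for pair, distance in pairwise_snp_distances(genomes).items():
        print(pair, distance)
```

Isolates of the same strain are expected to show few or no SNP differences and to cluster together, while isolates of a different strain are separated by many SNPs.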

Figure 5: Phylogenetic relationship established on the basis of SNPs. A: There are just 2 SNPs between the reference genome NZ_CP and the three sample genomes SRR , SRR , and SRR . There are no SNPs between the three sample genomes. All organisms are E. coli EDL 933. B: SRR and the reference genome NZ_CP form a cluster in the tree, while the other 3 sample organisms are all several thousand SNPs apart. SRR is E. coli ATCC 8739, SRR is E. coli O121:H19, and SRR is E. coli ATCC.

3.3. Sequencing Run Quality Control

Besides the above metrics, the final report will include Sequencing Run QC metrics. These metrics include:

- Percentage of bases with quality score Q > 30 for the run. This parameter is related to the base calling accuracy of the sequencer.
- Cluster density for the run (density of the clusters formed by clonally amplified library fragments on the flow cell surface).
- Clusters passing filter for the run (percentage of clusters that pass the quality filter for the purity of the signal).
- PhiX error rate (quality metric of the spiked-in positive PhiX control sequence). This parameter is related to the base calling accuracy of the sequencer.

4. VALIDATION PERFORMANCE CHARACTERISTICS

The described QC metrics are used to calculate accuracy of the platform, assay accuracy, analytic sensitivity, and analytic specificity, as summarised in the table.


5. ADDITIONAL PERFORMANCE CHARACTERISTICS

The described metrics can be supplemented with values for repeatability (precision within a run) by sequencing the same sample 3 times under the same conditions and evaluating the concordance of the assay results and performance. Further, the reproducibility (precision between runs) can be determined by sequencing the same sample at 3 different time points. It is also recommended to sequence pure water as a negative control to detect reagent contamination.

6. REFERENCES

1: Kozyreva VK, Truong CL, Greninger AL, Crandall J, Mukhopadhyay R et al. Validation and Implementation of Clinical Laboratory Improvements Act (CLIA)-Compliant Whole Genome Sequencing in Public Health Microbiology Laboratory. J Clin Microbiol (8).
2: Larsen MV, Cosentino S, Lukjancenko O, Saputra D, Rasmussen S et al. Benchmarking of methods for genomic taxonomy. J Clin Microbiol (5).