Next Generation Sequencing
Bioinformatics: Small Variants Data Analysis Guidelines
genomescan.nl

GenomeScan's Guidelines for Small Variant Analysis on NGS Data
Using our own proprietary data analysis pipelines

Dear customer,

As of the beginning of 2015, ServiceXS became a trademark of GenomeScan B.V. GenomeScan focuses exclusively on Molecular Diagnostics, whereas our ServiceXS trademark is intended for your R&D projects.

GenomeScan is dedicated to helping you design and perform Next Generation Sequencing (NGS) experiments that generate high quality results. This guide provides information on our data analysis services, and on resources and tools for further analysis of your sequencing data.

NGS experiments result in vast amounts of data, and data analysis can therefore be challenging. Our ability to assist in the analysis of your results can be the key factor leading to a successful project. Our experience in the past years is that even state-of-the-art NGS software is not always able to fulfill the data analysis needs of our customers. To alleviate this problem, our experienced team of bioinformaticians and molecular biologists can provide standard or custom bioinformatics solutions to get the most out of your project.

GenomeScan provides a comprehensive package of bioinformatics services for our NGS customers, which enables them to utilise all the applications that are possible with billions of bases of sequence data per run. GenomeScan can advise and assist you in every step of the data analysis.

Do not hesitate to contact us if you have any questions after reading this guideline!

On behalf of the Bioinformatics team,
Thomas Chin-A-Woeng
Project Manager

GenomeScan Guidelines- Page 2 of 14

Document Outline

1 Introduction
2 Application Description
  2.1 Quality Filtering and Trimming
  2.2 Alignments
  2.3 SNP Detection
  2.4 SNP Filtering
  2.5 Indel Detection
  2.6 Export Files
  2.7 Consensus Sequence (optional)
  2.8 SNP Effect Analysis
3 Analysis Results
  3.1 Raw Sequencing Files
  3.2 Alignment Files
  3.3 Main SNP File
  3.4 Human Readable SNP File
  3.5 Genotype Summary
  3.6 Assay Design File
  3.7 Combined.tab
  3.8 IUPAC and variant references
  3.9 Visualisation
4 File Formats
  4.1 Variant Analysis
  4.2 Structural Variation
  4.3 Reference Genomes
  4.4 Assay Design

Changes to Previous Version (2.0)
- Lay-out changes

Chapter 1 Introduction

Most organisms within a particular species differ very little in their genomic structure. The small differences that do exist are referred to as allele changes. A single nucleotide polymorphism or SNP is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between members of a species (or between paired chromosomes in an individual). Each individual has many single nucleotide polymorphisms that together create a unique DNA pattern for that individual. Typically, SNPs commonly observed in a population exhibit two alleles: a major allele, which is more prevalent, and a relatively rarely occurring minor allele. The study of single nucleotide polymorphisms is also important for genotyping in crop and livestock breeding.

Single nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions, or the intergenic regions between genes. SNPs can have very deleterious effects: a change in only one nucleotide can cause a codon to be misread, so that an incorrect protein is formed. SNPs within a coding sequence will not necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of the genetic code. A SNP in which both forms lead to the same polypeptide sequence is termed synonymous; if a different polypeptide sequence is produced, the SNP is termed non-synonymous. SNPs that are not in protein coding regions may still have consequences for gene splicing, transcription factor binding, or the sequence of non-coding RNA. SNPs located in regulatory regions (promoters, UTRs) may have a significant influence on the expression level of a gene.

Next-generation sequencing (NGS) allows SNP identification without prior target information. The high coverage possible in NGS also facilitates the discovery of rare alleles in population studies.

SNP detection algorithms compare the nucleotides present on aligned reads against the reference at each position (Fig. 1). Based on the distribution of As, Ts, Gs, and Cs at that position and the likelihood of a sequencing error, a judgement is made as to the existence of a SNP. Further downstream in the analysis, the potential effects of the SNPs on the DNA sequence can be evaluated.

Fig. 1. Alignment against a reference sequence

This guideline describes the workflow for the detection of small variants in a sample genome in comparison to a reference genome. The main steps are (1) quality filtering and adapter trimming, (2) alignment, (3) SNP detection, (4) filtering of significant SNPs, and (5) optionally SNP effect analysis and clustering.
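The per-position comparison described above can be illustrated with a deliberately simplified sketch. It only counts bases and applies two thresholds; it does not model sequencing error the way a real caller does. The function name `call_position` and the default thresholds are illustrative assumptions, not part of the GenomeScan pipeline.

```python
from collections import Counter

def call_position(ref_base, read_bases, min_depth=4, min_var_freq=0.3):
    """Return (alt_base, frequency) if a non-reference base is frequent enough, else None.

    ref_base:   reference nucleotide at this position
    read_bases: the bases that the aligned reads show at this position
    """
    counts = Counter(b.upper() for b in read_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # too few reads to make a judgement
    # Collect (count, base) pairs for every non-reference base observed
    alts = [(n, b) for b, n in counts.items() if b != ref_base.upper()]
    if not alts:
        return None
    n, alt = max(alts)  # most frequent non-reference base
    freq = n / depth
    return (alt, freq) if freq >= min_var_freq else None
```

For example, `call_position("A", "AAGGGG")` reports `G` as a candidate variant, whereas a single `G` among ten reads stays below the illustrative frequency threshold and is not reported.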

Chapter 2 Sequencing Applications

The following section describes the main steps for SNP/indel analysis (Fig. 2). The most common workflow step in preparation for SNP analysis is to filter the reads and retain only those with high mapping and base qualities. After calling SNPs and choosing the appropriate thresholds for filtering, a VCF file is generated. From this VCF file, various export formats that can be interpreted by the customer are derived.

Fig. 2. SNP detection workflow.

2.1 Quality Filtering and Trimming

The SNP/indel pipeline starts with quality filtering and trimming of the sequence reads. For filtering, a set of standard thresholds is used, which are optimised for the SNP/indel analysis pipeline. The main parameter defaults are:

Table 1. Read filtering

Filter               Default  Description
Adapters             On       Illumina sequencing adapters are removed
Minimal Q-score      22       All bases in the read should have a Q-score of at least 22 (corresponding to a chance of approximately one error in 160 bases); bases with lower qualities are trimmed off
Minimal read length  36       After trimming, reads should be at least 36 bp long to be kept in the data set
Treat paired-end     On       For paired-end reads, both reads of a pair are either kept or removed together
5' or 3' trim        Off      The 5' and 3' ends of reads can optionally be trimmed to remove adapter sequences or other unwanted bases indicated by the customer

Presumed adapter sequences are removed from the read when the bases match a sequence in the adapter sequence set (Illumina TruSeq adapters) with two or fewer mismatches and an alignment quality of at least 12.

To remove noise introduced by sequencing errors, reads are filtered and clipped by quality. By default, the reads are filtered using a phred score of Q22 as a minimum threshold. Bases with phred scores below this level are removed, and as a consequence reads are split.
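The relation between the Q22 threshold and the error rate quoted in Table 1 follows from the standard phred definition, Q = -10 log10(p). The sketch below shows that conversion together with a simplified reading of the trimming rule (split at low-quality bases, keep fragments of at least the minimal read length). The function names are illustrative, and the way the actual pipeline handles split reads is not specified in this guideline, so this is an assumption.

```python
def phred_to_error_prob(q):
    """Convert a phred quality score to the probability of a base-calling error."""
    return 10 ** (-q / 10)

def split_by_quality(seq, quals, min_q=22, min_len=36):
    """Split a read at bases below min_q and keep fragments of at least min_len bases.

    seq:   read sequence as a string
    quals: list of integer phred scores, one per base
    """
    fragments, current = [], []
    for base, q in zip(seq, quals):
        if q >= min_q:
            current.append(base)
        else:
            # A low-quality base ends the current fragment
            if len(current) >= min_len:
                fragments.append("".join(current))
            current = []
    if len(current) >= min_len:
        fragments.append("".join(current))
    return fragments
```

With these definitions, `1 / phred_to_error_prob(22)` is about 158, which is the "one error in 160 bases" figure rounded.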
If the resulting reads are shorter than the minimal read length (36 bp by default), they are removed altogether; when paired-end mode is enforced, both reads of the pair are removed. The filtered reads are written in FASTQ format, and filtering statistics are calculated and reported. The filtered reads are used for the next stage of the pipeline.

2.2 Alignments

The next step of the pipeline consists of aligning the filtered reads to the genome reference provided by the customer or generated using de novo assembly. The filtered reads are aligned to the reference sequence with a short read aligner based on the Burrows-Wheeler Transform. A mismatch rate of 4% (4 mismatches in a read of 100 bases) is used by default. This step lays the foundation for finding the SNPs and variations. The alignment files (BAM files, sorted and indexed with the samtools v0.1.18 package) containing the mapped read information are provided on the hard disk in the Alignments folder.

2.3 SNP Detection

The pipeline performs SNP/indel identification using Bayesian statistics, similar to other commonly used software tools for SNP detection. It uses the nucleotide observed in each read covering a location, together with the associated base quality, to calculate a consensus genotype. Issues that a SNP caller has to be able to consider are quality of reads, mapping quality, coverage, homopolymeric tracts, and ploidy. The caller takes the following factors into consideration:

- Base quality: A sequencer outputs a sequence of nucleotides corresponding to each read and assigns a quality value based on the confidence with which a particular base is called. The base quality values add weight to the called nucleotides.
- Mapping quality: Misaligned reads create false positive SNPs or incorrect frequencies. Most alignment algorithms assign quality scores to a mapping based on the read alignment with the reference. These mapping scores indicate the likelihood of a read originating from the suggested position on the reference. The mapping quality score takes into account the insertions, deletions, and substitutions necessary for alignment at a particular position.
- Coverage: The number of reads at a genomic position also determines the confidence of a found SNP. Greater sequencing depth leads to higher SNP calling accuracy.
- Ploidy: The ploidy of the sample determines the number of nucleotide inferences necessary to conclude the underlying genotype. For a haploid sample, the algorithm does not consider the possibility of a heterozygote.
- Homopolymers: Some sequencers exhibit inaccurate representations of homopolymers (e.g. AAAAAA) and their immediate neighbors due to limitations in the technology. Such regions are also handled by the SNP detection algorithm.
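To make the Bayesian idea concrete, the following is a textbook-style sketch of a diploid genotype call at a single biallelic site. It uses only the base qualities and a uniform prior; mapping quality, homopolymer handling, and multi-sample information are left out. It is not the proprietary GenomeScan caller, and the function name `call_genotype` and the error model (an error yields each of the three other bases with equal probability) are assumptions made for illustration.

```python
import math

def call_genotype(bases, quals, ref, alt):
    """Return (genotype, phred_quality) for a diploid site with alleles ref and alt.

    bases: bases observed in the reads covering the site
    quals: phred base qualities, one per observed base
    """
    genotypes = [(ref, ref), (ref, alt), (alt, alt)]
    log_liks = []
    for a1, a2 in genotypes:
        ll = 0.0
        for b, q in zip(bases, quals):
            e = 10 ** (-q / 10)  # base error probability from the phred score
            p1 = 1 - e if b == a1 else e / 3
            p2 = 1 - e if b == a2 else e / 3
            # Each read is drawn from one of the two chromosomes with equal chance
            ll += math.log(0.5 * p1 + 0.5 * p2)
        log_liks.append(ll)
    # Uniform prior: the posterior is the normalised likelihood
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    best = weights.index(max(weights))
    p_wrong = sum(w for i, w in enumerate(weights) if i != best) / total
    qual = -10 * math.log10(max(p_wrong, 1e-30))  # cap to avoid log10(0)
    return "".join(genotypes[best]), round(qual)
```

Called as `call_genotype("AAAAGGGG", [30] * 8, "A", "G")`, the sketch favours the heterozygous genotype; with ten `G` reads it favours the homozygous variant genotype, and deeper coverage raises the reported quality.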
The SNP/indel pipeline is capable of detecting three types of variants: substitutions (mutations), deletions, and insertions.

- Substitutions consist of one or more nucleotide substitutions occurring at certain genomic positions.
- Deletions are one or more nucleotide deletions occurring at a given location. A deletion event is represented as a change from one or more consecutive nucleotides to a gap (no bases).
- Insertions are one or more consecutive nucleotide insertions occurring at a given location.

The pipeline can process data in single- or multi-sample mode. In the default multi-sample mode, low-confidence calls occurring in multiple samples increase the confidence of the SNP call. An associated phred quality is output along with the consensus genotype; this score represents the confidence in the variant call. High scores correspond to a lower probability of error in the call.

2.4 SNP Filtering

Genomic positions are reported as potential SNP sites if they satisfy a set of predefined criteria that may be set by the customer or the bioinformatician. These criteria may depend on the experimental setup. They include the minimal read depth, minimal quality score, and minimal variant frequency. For each of these criteria, the observed value must exceed the defined threshold. A VCF file is generated from the positions passing the filter. The results are reported in filtered.snps.vcf in VCF v4.1 format in the Variants directory. From this VCF file, various export formats that can be interpreted by the customer are derived.

The following filters can be applied to the SNP list:

- Read depth: The deeper the sequencing, the more reliably the SNP detection can determine whether a candidate is a true SNP. A minimal threshold can be set to ensure a minimal coverage before a SNP is reported.
- Quality score: SNPs are filtered based on their quality score. All SNPs with quality scores below the defined threshold are filtered out. This ensures that SNPs with low quality are discarded; if these should also be included, the threshold can be lowered.
- Variant frequency: The variant frequency threshold is set according to prior expectations about the data set, among which are the ploidy and whether a pool of samples was analysed.

2.5 Indel Detection

Small insertions and deletions (indels) of up to 30 bases are detected by the indel caller using in-read information (in contrast to mate or pair information). To improve mapping, aligners typically introduce gaps into reads; these gaps may represent deletions. Similar to a base, a gap (deletion) is significant when the missing base(s) meet the filter criteria. Since deletions do not have an associated quality score, the surrounding base qualities are used to compute a confidence score. The indels are provided in VCF format and in tab-delimited format.

2.6 Human-readable files

The SNP list is stored in snps.tab in tab-delimited format (in the Export folder). This file can be opened directly in a spreadsheet application such as MS Excel or LibreOffice, provided the number of rows does not exceed the limitations of the application. From this file, the genotype columns are extracted into the summary.tab file. A SNP assay design file is generated for the SNPs reported in snps.tab and is written to design.tab. This file contains the contig sequence 75 bases upstream and downstream of the identified SNP position. Optionally, a file with the combined information of snps.tab, summary.tab, and design.tab is provided as combined.tab. This file also includes additional columns with the distance to the closest previous and next SNP and the average sequence depth for all samples. Small indels are output in the indels.tab file in the Export folder.

2.7 Consensus Sequence (optional)

Based on the consensus call and the reference sequence, a new reference sequence may be derived which includes the found SNPs and genotypes. The resulting file is in FastA format and may be coded in different ways.

2.8 SNP Effect Analysis (optional)

SNP Effect Analysis processes the list of SNPs and reports the effect that these SNPs have on the genes in a given context. The SNPs are classified using the genome feature information. The following classifications are detected and reported.

Table 2. SNP effect classification

Classification         Description
Intergenic             A variant that does not fall within the neighborhood of any gene in the annotation
Synonymous             Variant in an exon; the mutation has no effect on the final amino acid sequence
Non-synonymous         Variant in an exon; the mutation has an effect on the final amino acid sequence
Stop gain              Results in a STOP codon
Stop loss              STOP codon lost
Intronic               A mutation occurring in intronic regions
Upstream               A variant occurring upstream of the transcript
Downstream             A variant occurring downstream of the transcript
Essential splice site  Mutations to the donor and acceptor sites of the intron
Splice site            Mutations to locations of splicing signals (i.e. 3-8 bases into the intron from either side, 1-3 bases into the neighboring exon)
5' UTR                 A variant in the 5' UTR region
3' UTR                 A variant in the 3' UTR region
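The four coding classes in Table 2 can be told apart by translating the reference and the variant codon. The sketch below does this with the standard genetic code; it assumes the codon and the reading frame are already known from the annotation, and it does not cover the non-coding classes. The function name `classify_coding_snp` is illustrative and is not taken from the pipeline.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_coding_snp(ref_codon, alt_codon):
    """Classify a single-codon change as in Table 2 ('*' denotes a stop codon)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "Synonymous"
    if alt_aa == "*":
        return "Stop gain"
    if ref_aa == "*":
        return "Stop loss"
    return "Non-synonymous"
```

For instance, CTT and CTC both encode leucine, so that change is synonymous, while TGG to TGA turns a tryptophan codon into a stop codon.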

Chapter 3 Analysis Results

3.1 Raw Sequence Files

The raw sequence files output by the Illumina pipeline are used as input for the SNP detection. These sequence files are provided to the customer in FASTQ format in the 'Raw data' directory. The quality-filtered reads produced in the first step of the pipeline are optionally provided to the customer.

3.2 Alignment files

The alignment files are provided in sorted BAM format with an accompanying index file. See our Next-generation data analysis guideline for a full description of BAM files.

3.3 Main SNP file (snps.vcf)

The main output of the SNP/indel pipeline is a text file formatted according to the VCF 4.1 specification. VCF stands for Variant Call Format, and was originally used by the 1000 Genomes project to encode structural genetic variants. A short overview is given in Section 4.1.1.

3.4 Human readable SNP file (snps.tab)

This text file contains information in tab-delimited format. It is both human- and machine-readable.

Fig. 3. Layout of snps.tab and summary.tab files.

The format specification of this file is defined in Section 4.1.3. Columns 1 to 4 are general columns applicable to all samples. Columns 5 to 7 contain SNP information for individual samples. Columns 8 to 10 contain genotype information. Columns 11 to 15 provide raw statistics on coverage and base composition. The layout of the columns is described in Section 4.1.1 (Table 4, Fig. 3).
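Because snps.tab is machine-readable, a row can be split into the four general columns followed by one block of eleven columns per sample, following the layout of Table 4 in Chapter 4. The sketch below is a minimal reader for one data row. The field names are hypothetical labels chosen here, values are kept as text except for the position, and whether the file starts with a header row is not stated in this guideline.

```python
# Hypothetical labels for the columns listed in Table 4
GENERAL = ["chrom", "pos", "ref", "alleles"]
PER_SAMPLE = ["snp", "var_freq", "snp_qual", "genotype", "genotype_qual",
              "depth", "frac_a", "frac_c", "frac_t", "frac_g", "total_depth"]

def parse_snps_row(line):
    """Split one tab-delimited snps.tab row into general and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GENERAL, fields[:4]))
    record["pos"] = int(record["pos"])  # 1-based genomic position
    rest = fields[4:]
    # Columns 5-15 belong to sample 1, columns 16-26 to sample 2, and so on
    step = len(PER_SAMPLE)
    record["samples"] = [dict(zip(PER_SAMPLE, rest[i:i + step]))
                         for i in range(0, len(rest), step)]
    return record
```

A row with 26 columns therefore yields two entries in `record["samples"]`, each holding the SNP, genotype, and base composition values for one sample.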

3.5 Genotype summary (summary.tab)

This tab-delimited file contains the consensus columns of snps.tab. It is both human- and machine-readable (Fig. 3 inset).

3.6 Assay design file (design.tab)

This tab-delimited file shows the flanking sequences of each position in the SNP file. It is both human- and machine-readable. Neighbouring SNPs, which may be of importance for the design of follow-up assays, are indicated in the flanking regions.

3.7 Combined.tab (optional)

This tab-delimited file combines the information from snps.tab, summary.tab, and design.tab, and includes additional information about the distances to neighbouring SNPs and the total coverage over all samples.

3.8 IUPAC and variant references

The construction of the IUPAC reference is depicted in Fig. 4.

Fig. 4. Generation of an IUPAC or variant reference

A new IUPAC reference or a reference with variant alleles is generated using the original reference, read information, and variant tables. After alignment of the reads onto the reference sequence, each base position is evaluated for its variants, coverage, and quality. Regions or bases with no coverage are flagged in the new references with 'n'. Regions with coverage below a preset read depth (default <=2) or doubtful alignment quality are flagged with lowercase bases to indicate low quality. Variant alleles are depicted by their IUPAC codes in the IUPAC reference, or by the variant allele in the variant reference.

The IUPAC reference in FASTQ format has the additional advantage that the genotype call score is encoded as a quality score, similar to the Sanger phred score encoding. An offset of 33 is used when translating the ASCII encoding to the numerical score. The genotype call score is calculated as $Q = -10 \log_{10} P$, where P represents the probability that a polymorphism exists at the given location.
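The flagging rules and the offset-33 encoding described above can be sketched as follows. The IUPAC table is the standard nucleotide ambiguity code; the function names, the treatment of depth exactly as "no coverage gives 'n', depth at or below 2 gives lowercase", and the cap of 93 on the encoded score (the usual Sanger maximum) are assumptions made for illustration. Alignment quality is not considered here.

```python
# Standard IUPAC nucleotide codes, keyed by the sorted set of alleles
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def consensus_base(alleles, depth, low_depth=2):
    """Return the reference symbol for one position given its alleles and read depth."""
    if depth == 0:
        return "n"  # no coverage
    code = IUPAC["".join(sorted(set(alleles)))]
    return code.lower() if depth <= low_depth else code

def quality_char(score):
    """Encode a numerical score as a FASTQ quality character (offset 33)."""
    return chr(min(score, 93) + 33)

def quality_score(char):
    """Decode a FASTQ quality character back to its numerical score."""
    return ord(char) - 33
```

A heterozygous A/G position with sufficient depth is thus written as `R`, and a score of 40 is stored as the character `I`.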
Whether or not a variant allele is reported in the derived reference depends on a set of key threshold values, including variant frequency (default 30% for heterozygous diploid organisms or 80% for haploid genomes), coverage or read depth (default 20), and mapping quality.

3.9 Visualisation

Aligned reads, pileups, and SNPs can be viewed in numerous software packages for NGS. Using the reference file and alignment files, this can easily be done in the IGV browser. See our Next-generation data analysis guideline for how this can be performed.

Chapter 4 File Formats

This chapter describes the file formats specifically used for SNP and indel analysis. For other common formats such as sequence and alignment files, please refer to our NGS data analysis guideline.

4.1 Variant Analysis

The FASTQ sequence files output by the Illumina sequencers are saved compressed in the commonly used GNU zip format. This is indicated by the .gz file extension. Most downstream data analysis tools automatically decompress the files when they are used as input, and most decompression software packages can inflate this format.

VCF files

The Variant Call Format (VCF) is a flexible format used to store any type of DNA polymorphism data such as SNPs, insertions, deletions, and structural variants, together with rich annotations, by listing both the reference haplotype (the REF column) and the alternate haplotypes (the ALT column). The format was developed for the 1000 Genomes Project, and has been widely adopted by many scientists and software tools. VCF is a text file format which contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The specification for the format can be found at http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcfvariant-call-format-version-41 and has been published (Danecek et al. 2011. The variant call format and VCFtools. Bioinformatics 27:2156-2158). The full VCF specification also includes a set of recommended practices for describing complex variants.

The header contains an arbitrary number of meta-information lines, each starting with the characters ##, and a tab-delimited field definition line, starting with a single # character. The meta-information header lines provide a standardised description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can also be used to provide information about the means of file creation, date of creation, version of the reference sequence, software used, and any other information relevant to the history of the file.

The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma-separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER), and a semicolon-separated list of additional, user-extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon-separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Fig. 5 indicates that the subsequent entries contain information regarding the genotype, genotype quality, and read depth for each sample. All data lines are tab-delimited, and the number of fields in each data line must match the number of fields in the header line.
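The column structure described above can be read with a few lines of plain Python. The sketch below parses a single data line into the eight mandatory columns and pairs the FORMAT keys with each sample column. It is meant as an illustration of the layout rather than a full VCF parser: meta-information lines, INFO parsing, and missing values are not handled, and the function name `parse_vcf_line` and the sample IDs in the usage note are made up for the example.

```python
# The eight mandatory VCF columns, in order
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line, sample_ids):
    """Parse one tab-delimited VCF data line.

    sample_ids: sample names taken from the field definition (#CHROM) line
    """
    fields = line.rstrip("\n").split("\t")
    rec = dict(zip(FIXED, fields[:8]))
    rec["POS"] = int(rec["POS"])          # 1-based start position
    rec["ALT"] = rec["ALT"].split(",")    # comma-separated alternate alleles
    # FORMAT (column 9) names the colon-separated fields of each sample column
    keys = fields[8].split(":") if len(fields) > 8 else []
    rec["samples"] = {sid: dict(zip(keys, col.split(":")))
                      for sid, col in zip(sample_ids, fields[9:])}
    return rec
```

Given an invented line with FORMAT `GT:GQ:DP` and the sample column `0/1:40:14`, the record for that sample becomes `{"GT": "0/1", "GQ": "40", "DP": "14"}`.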

Fig. 5. VCF file.

The VCF specification includes several common keywords with standardised meaning. The following table gives some examples of the reserved tags.

Table 3. Reserved VCF tags

Abbreviation  Description

Genotype columns
GT            Genotype; encodes alleles as numbers: 0 for the reference allele, 1 for the first allele listed in the ALT column, 2 for the second allele listed in ALT, and so on. The number of alleles suggests the ploidy of the sample, and the separator indicates whether the alleles are phased (|) or unphased (/) with respect to other data lines.
PS            Phase set; indicates that the alleles of genotypes with the same PS value are listed in the same order.
DP            Read depth at this position.
GL            Genotype likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
GQ            Genotype quality; probability that the genotype call is wrong under the condition that the site is variant. Note that the QUAL column gives an overall quality score for the assertion made in ALT that the site is variant or not variant.

INFO column
DB            dbSNP membership.
H3            Membership in HapMap3.
VALIDATED     Validated by follow-up experiment.
AN            Total number of alleles in called genotypes.
AC            Allele count in genotypes, for each ALT allele, in the same order as listed.
SVTYPE        Type of structural variant (DEL for deletion, DUP for duplication, INV for inversion, etc., as described in the specification).
END           End position of the variant.
IMPRECISE     Indicates that the position of the variant is not known accurately.
CIPOS/CIEND   Confidence interval around POS and END positions for imprecise variants.

Missing values are represented with a dot. For practical reasons, the VCF specification requires that the data lines appear in their chromosomal order. VCF files can be stored in compressed form using bgzip, a program which utilizes the zlib-compatible BGZF library (Li et al., 2009).
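The GT encoding in Table 3 can be translated back into bases with a short helper. The sketch below handles the allele numbering, the phased and unphased separators, and the dot used for missing values; the function name `decode_genotype` is illustrative.

```python
def decode_genotype(gt, ref, alts):
    """Translate a GT string such as '0/1' or '1|2' into allele bases.

    ref:  the REF allele
    alts: list of ALT alleles in the order given in the ALT column
    Returns (alleles, phased); a missing allele ('.') is returned as None.
    """
    phased = "|" in gt
    alleles = [ref] + list(alts)  # index 0 is REF, 1 the first ALT, and so on
    indices = gt.replace("|", "/").split("/")
    called = [None if i == "." else alleles[int(i)] for i in indices]
    return called, phased
```

For a site with REF `G` and ALT `A,T`, the genotype `0/1` decodes to the unphased alleles G and A, and `1|2` to the phased alleles A and T. The number of returned alleles reflects the ploidy, so a haploid call such as `1` yields a single allele.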
Files compressed by bgzip can be decompressed by the standard gunzip and zcat utilities. Fast random access retrieval of variants from a range of positions on the reference genome can be achieved by indexing the genomic positions using tabix, a generic indexer for tab-delimited files. Both programs, bgzip and tabix, are part of the samtools software package and can be downloaded from the SAMtools web site (http://samtools.sourceforge.net).

BCF (Binary Call Format)

Binary format used by samtools/bcftools for efficient storing and parsing of genotype likelihoods. A description can be found at http://vcftools.sourceforge.net/bcf.pdf

SNP/Genotype

The snps.tab file is a proprietary human-readable file with all information regarding SNPs and genotypes. The file is also machine-readable. Columns 1 to 4 are general columns applicable to all samples. Columns 5 to 7 contain SNP information for individual samples. Columns 8 to 10 contain genotype information. Columns 11 to 15 provide raw statistics on coverage and base composition. The layout of the columns is as follows (Table 4, Fig. 3):

Table 4. SNP/genotype file

Column  Format           Description
1       Text             Chromosome or contig
2       Numerical        1-based genomic position within chromosome or contig
3       Nucleotide base  Reference allele
4       Nucleotide base  Detected alleles
5       Nucleotide base  Detected SNP; empty if no SNP or below significance
6       Numerical        Variant frequency (%)
7       Numerical        Quality score for SNP
8       IUPAC base       Genotype
9       Numerical        Genotype quality score
10      Numerical        Depth to calculate genotype or SNP
11      Numerical        Fraction of A
12      Numerical        Fraction of C
13      Numerical        Fraction of T
14      Numerical        Fraction of G
15      Numerical        Total depth (alignment)
16-26                    Same as 5-15 for sample 2, etc.

Summary file

The summary.tab file contains only the consensus genotypes of the samples.

Table 5. Summary file

Column  Format           Description
1       Text             Chromosome or contig
2       Numerical        1-based genomic position within chromosome or contig
3       Nucleotide base  Reference allele
4..n    IUPAC call       Consensus genotype for sample

Insertions/deletions file

The indels.tab file contains short indel information in the following format:

Table 6. Indel file format

Column  Name       Format              Description
1       chr1       Text                Chromosome or contig
2       pos1       Numerical           1-based genomic position within chromosome or contig chr1
3       reference  Nucleotide base     Reference allele
4       sequence   Nucleotide base(s)  Detected variation
5       chr2       Text                Not used
6       pos2       Numerical           Not used
7       type       Text                Variation class: INS (insertion) or DEL (deletion)
8       size       Numerical           Size of the variation
9       varfreq    Numerical           Frequency at which the variation is observed (%)
10      score      Numerical           Quality score
11      depth      Numerical           Coverage at the indicated position
12..n                                  Same as 8-11 for sample 2, etc.

4.2 Structural Variation

SV file

The filtered.sv.tab file contains structural variation data (large insertions, deletions, duplications, and interchromosomal and intrachromosomal translocations).

Table 7. Structural variation file format

Column  Name       Format              Description
1       chr1       Text                Chromosome or contig
2       pos1       Numerical           1-based genomic position within chromosome or contig chr1
3       reference  Nucleotide base     Reference allele
4       sequence   Nucleotide base(s)  Detected variation
5       chr2       Text                Chromosome or contig
6       pos2       Numerical           Second position in genome; 1-based genomic position within chromosome or contig chr2
7       type       Text                Variation class: INS (insertion), DEL (deletion), CTX (interchromosomal translocation), ITX (intrachromosomal translocation)
8       size       Numerical           Size of the variation
9       varfreq    Numerical           Frequency at which the variation is observed (%)
10      score      Numerical           Quality score
11      depth      Numerical           Coverage at the indicated position
12      sample     Text                Optional sample id

4.3 Reference genomes

IUPAC references

An IUPAC reference describes a heterozygous genome in which the alleles are indicated using the standard IUPAC codes for DNA. The file may be in a sequence file format such as FastA or FASTQ.

4.4 Assay Design

Assay Design File

The design.csv file contains all information required to design follow-up assays such as qPCR and genotyping assays.

Table 8. Assay design file

Column  Name       Format           Description
1       contig     Text             Chromosome or contig
2       position   Numerical        1-based genomic position within chromosome or contig
3       reference  Nucleotide base  Reference allele
4       sequence   Sequence         DNA sequences flanking the variant position on the left and right. Any neighbouring SNVs are encoded in IUPAC. The actual SNV position is indicated using bracket notation ([A/T]).
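The bracket notation in the sequence column can be reproduced from a contig sequence and a 1-based SNV position. The sketch below uses the 75-base flank mentioned in Section 2.6 as its default; it does not encode neighbouring SNVs in IUPAC, which the real design file does. The function name `design_sequence` is an illustrative assumption.

```python
def design_sequence(contig_seq, pos, ref, alt, flank=75):
    """Return the flanking sequence with the SNV in bracket notation.

    contig_seq: sequence of the chromosome or contig
    pos:        1-based position of the SNV
    """
    i = pos - 1  # convert the 1-based position to a 0-based string index
    left = contig_seq[max(0, i - flank):i]      # up to `flank` bases upstream
    right = contig_seq[i + 1:i + 1 + flank]     # up to `flank` bases downstream
    return f"{left}[{ref}/{alt}]{right}"
```

Near the ends of a contig the flanks are simply shorter, since fewer than 75 bases are available on that side.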