From Variants to Pathways: Agilent GeneSpring GX's Variant Analysis Workflow

Technical Overview

Introduction

Next-generation sequencing (NGS) studies have created unanticipated challenges in data mining and data storage, because a single sequencing project reports a large number of genetic variants. The scientific community has access to a plethora of tools for analyzing these data, but combining them to obtain biologically meaningful results remains challenging. While primary and secondary analysis can be automated, tertiary data exploration is largely done manually by a researcher (Figure 1). The starting point for tertiary analysis is the list of mutations identified during secondary analysis. This information is usually stored in a Variant Call Format (VCF) file. VCF has become an important standard in modern biology because it is widely used to report variants. VCF files are flexible and can store all variant types, including single nucleotide variants, insertions and deletions, copy number variants, and structural variants.

Figure 1. NGS analysis can broadly be categorized into three parts. Primary and secondary analysis are computationally intensive and are usually automated. Tertiary analysis is the exploration of biologically relevant data.
Primary analysis: Production of sequence data and reads.
Secondary analysis: Alignment QC; Variant calling on aligned data.
Tertiary analysis: Annotation and filtering of variants; Genome browser-driven exploration; Biological contextualization.

Figure 2. The variant analysis workflow in Agilent GeneSpring GX allows users to import a list of SNPs for tertiary data analysis.
Steps shown (Agilent GeneSpring GX): Import VCF; Filter and sort variants; Annotate and compare regions; Translate regions to genes; Identify genic regions; Gene ontology analysis; Pathway analysis.

GeneSpring GX now includes a variant analysis workflow that allows users to sort and compare VCF files, identify genes affected by a variation, and perform pathway analysis on affected genes. The workflow includes the steps in Figure 2.

Key Functionalities and Benefits

- Supports processed NGS data with variant call information in VCF format.
- Enables simultaneous filtering of variants based on the variant-associated information in the VCF file.
- Supports public and commercial databases including ClinVar, COSMIC, dbNSFP, and 1000 Genomes. This information can be used for visualization and further analysis.
- Provides powerful visualization options, including an elastic genome browser for interactive querying of specific variants.
- Performs multi-omic and inter-genomic analysis using tools such as pathway analysis and correlation analysis.

Importing and Viewing VCF Data

This workflow supports VCF files exported from tools and portals such as 1000 Genomes (http://www.1000genomes.org/home), Agilent SureCall, and Strand NGS. VCF files can be compared to identify unique or common variants, and the results can be viewed in the genome browser. The Variant Analysis workflow in GeneSpring GX allows users to perform tertiary analysis by translating the effect of SNPs onto biological pathways and overlaying the data in a multi-omics experiment. The user can determine the effect of variants (SNPs, insertions, deletions, copy number variations, or structural variants) on genes, transcripts, and regulatory regions.

VCF files imported into GeneSpring GX are stored within the tool for analysis. On import, each VCF file is stored as a Region List. Each Region List can be viewed individually in the Genome Browser or as a spreadsheet with its corresponding annotations. The drag-and-drop feature of the tool allows viewing of results as well as annotations. Figure 3 shows the default view in a SNP analysis workflow. Analyses can readily identify variants that are common between VCFs, variants that are unique to a given VCF, and variants that are detected in all samples. Mutations are color-coded by subtype for easy visualization.
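
The comparison of VCF files for common and unique variants described above can be illustrated outside the tool. The following is a minimal sketch in Python, not GeneSpring GX's implementation: it assumes tab-delimited VCF text with the standard fixed columns (CHROM, POS, ID, REF, ALT), identifies a variant by chromosome, position, reference allele, and alternate allele, and uses hypothetical function names and sample records.

```python
import io


def read_variants(handle):
    """Return a set of (chrom, pos, ref, alt) keys from a VCF text stream."""
    variants = set()
    for line in handle:
        # Skip meta-information lines (##), the header line (#CHROM), and blanks.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several comma-separated ALT alleles; key each one.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def compare_variant_sets(first, second):
    """Split two variant sets into common and file-specific variants."""
    return {
        "common": first & second,
        "unique_to_first": first - second,
        "unique_to_second": second - first,
    }


# Hypothetical two-record VCF files, for illustration only.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
SAMPLE_A = HEADER + "chr1\t100\t.\tA\tG\t50\tPASS\t.\nchr2\t200\t.\tC\tT,G\t30\tPASS\t.\n"
SAMPLE_B = HEADER + "chr1\t100\t.\tA\tG\t45\tPASS\t.\nchr3\t300\t.\tG\tGA\t60\tPASS\t.\n"

result = compare_variant_sets(
    read_variants(io.StringIO(SAMPLE_A)),
    read_variants(io.StringIO(SAMPLE_B)),
)
for label, variants in result.items():
    print(label, sorted(variants))
```

Keying on the allele as well as the position means that two files reporting different changes at the same site are not counted as sharing a variant; a position-only key would be the looser alternative.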

Figure 3. Agilent GeneSpring GX main view, showing the genome browser with its data and annotation tracks. Any track can be selected to display data as a spreadsheet.
Callouts: Data derived from the VCF analysis can be visualized as separate or merged tracks. Read coverage is plotted on the Y-axis. Annotation files (for example, TargetScan and CpG Islands) help in understanding the effect of a mutation on transcripts. The spreadsheet view of the VCF file can be sorted and copied to the clipboard.

Variant Filtering

The Region List Operations workflow offers the ability to filter variants and their associated data. These options include or exclude specific sites or genotypes from any analysis performed by the program. For example, users can remove poor-quality variants and common polymorphisms, and categorize SNPs into smaller lists that can be saved as region lists in the experiment navigator. GeneSpring GX also allows users to cluster a list of filtered regions. Filtered regions can be exported as a text, Browser Extensible Data (BED), or reference file.

Genomic information is increasingly used in prognosis and research, which requires visualizing and analyzing thousands of individuals and millions of variants. The variant analysis workflow in GeneSpring GX allows users to cluster variants by zygosity score, allelic frequency, or any other value or tag that the VCF carries, across samples or VCF files. Figure 4 shows an example of a hierarchical tree created to group regions by a column value derived from the VCF file.

Figure 4. A) Hierarchical tree showing 39,912 clustered regions; B) a zoomed-in view. Columns are labeled using the default VCF file columns on the left, and the labels on the top show the variant types. The figure legend shows the color code used for the labels. The color range is determined by the column used to cluster the regions.
Legend: color range -126, -63, 0, 63, 126; region color by variant type (Deletion, Insertion).
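
To make the filtering and BED export steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the Region List Operations implementation: the record layout, the function names, the sample records, and the QUAL threshold of 30 are all hypothetical. The one format fact it relies on is that VCF positions are 1-based while BED intervals are 0-based and half-open.

```python
def filter_by_quality(records, min_qual=30.0):
    """Keep records that meet the QUAL threshold and passed the caller's filters."""
    return [
        rec for rec in records
        if rec["qual"] >= min_qual and rec["filter"] in ("PASS", ".")
    ]


def to_bed_line(rec):
    """Format one variant as a BED line spanning its reference allele."""
    start = rec["pos"] - 1          # VCF is 1-based; BED start is 0-based
    end = start + len(rec["ref"])   # BED end is exclusive
    name = f'{rec["ref"]}>{rec["alt"]}'
    return f'{rec["chrom"]}\t{start}\t{end}\t{name}'


# Hypothetical variant records, for illustration only.
RECORDS = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "qual": 50.0, "filter": "PASS"},
    {"chrom": "chr1", "pos": 250, "ref": "CTT", "alt": "C", "qual": 12.0, "filter": "PASS"},
    {"chrom": "chr2", "pos": 300, "ref": "G", "alt": "GA", "qual": 80.0, "filter": "q10"},
    {"chrom": "chr2", "pos": 400, "ref": "TAC", "alt": "T", "qual": 60.0, "filter": "PASS"},
]

kept = filter_by_quality(RECORDS)
for rec in kept:
    print(to_bed_line(rec))
```

The same pattern extends to other VCF columns or INFO tags, such as allele frequency, by adding further predicates to the filter.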

Adding and Updating Publicly Available Annotations

Public annotation databases are available for download from Annotations Manager, as shown in Figure 5. VCF and BED files that list filtered and ranked variants can be saved as part of the Annotations Manager for a specific model organism. Data can be obtained either from the Agilent server or from the local desktop. This information can then be used to compare lists of mutations with annotated mutations derived from public sources (Figure 6), and viewed in the Genome Browser. Annotate Region List appends additional information from another Region List in the experiment, or from annotation databases such as DNase clusters and GENCODE genes. The Import Region List utility allows the user to import region-based annotations that can be curated to obtain filtered regions for downstream processing.

Figure 5. Annotations Manager can store multiple builds for a given organism. Annotations for more than 30 different model organisms are available on the Agilent server for download, and custom annotations can be added for a specific build of a model organism.

Figure 6. In the variant analysis workflow, Agilent GeneSpring GX allows comparison of a source region list with a region list of choice in two ways: by finding overlaps, or by specifying the maximum distance X (in bp) at which two regions are still considered close to each other.
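
The two comparison modes described for Figure 6 (overlap, or closeness within a maximum distance X) can be sketched as follows. This is an illustrative Python sketch, not the tool's algorithm: it assumes regions are (chrom, start, end) tuples with 0-based, half-open coordinates, and the function names and sample regions are hypothetical.

```python
def overlaps(a, b):
    """True if two regions on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def within_distance(a, b, max_gap):
    """True if the regions overlap or the gap between them is at most max_gap bp."""
    if a[0] != b[0]:
        return False
    gap = max(a[1], b[1]) - min(a[2], b[2])  # negative when the regions overlap
    return gap <= max_gap


def compare_regions(source, target, max_gap=None):
    """Return source regions that overlap (or lie close to) any target region."""
    if max_gap is None:
        def match(a, b):
            return overlaps(a, b)
    else:
        def match(a, b):
            return within_distance(a, b, max_gap)
    # Simple all-pairs scan; adequate for small lists, not tuned for genome scale.
    return [a for a in source if any(match(a, b) for b in target)]


# Hypothetical source variants and annotation regions, for illustration only.
SOURCE = [("chr1", 100, 101), ("chr1", 5000, 5001), ("chr2", 100, 101)]
TARGET = [("chr1", 90, 200), ("chr1", 5500, 5600)]

print(compare_regions(SOURCE, TARGET))               # overlap mode
print(compare_regions(SOURCE, TARGET, max_gap=500))  # distance mode, X = 500 bp
```

For large region lists, sorting both lists or using an interval tree would replace the all-pairs scan.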

To identify genes and transcripts in a genomic region, GeneSpring GX takes a set of genome coordinates and retrieves a list of genes using Translate Regions To Genes. A desired flanking region can be set in the workflow. The result of this analysis is a list of genes that lie near the selected Region List, within a set distance (5,000 bp by default). For each gene, Find Genic Parts identifies exonic, intronic, upstream, and downstream regions based on a user-selected transcript model (RefSeq, Ensembl, or UCSC).

Figure 7. Histogram plot showing a translated gene list of regions with a specific variant. The colors represent the genic part that contains a specific variant such as an insertion, deletion, and so forth.
Legend: Upstream, Intronic, Exonic, Downstream. Axes: counts per chromosome.

Results Interpretation

For biological interpretation and contextualization of results, GeneSpring GX provides the following options:

Gene Ontology (GO)
The translated gene list can be used as input to Gene Ontology analysis to identify each gene's molecular function, biological processes, or cellular localization.

Pathway Analysis
Users can query the list of genes against several pathway databases such as KEGG, BioCyc, and WikiPathways to identify statistically significant pathways that might be impacted by the variants identified in the study [2].

Multi-Omic Analysis
To explore the underlying mechanism by which DNA variants affect a biological process, GeneSpring GX offers an overlay of translated genes on pathways in single-omic as well as multi-omic analysis (Figure 8). Multi-omic analysis in the GeneSpring suite is discussed in detail elsewhere [1].
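
The logic behind Translate Regions To Genes and Find Genic Parts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GeneSpring GX's implementation: the Gene structure, the function names, and the two gene models are hypothetical, coordinates are 0-based and half-open, and upstream and downstream are assigned relative to the gene's strand. Only the 5,000 bp default flank comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int      # 0-based, inclusive
    end: int        # exclusive
    strand: str     # "+" or "-"
    exons: tuple    # tuple of (start, end) pairs, same coordinate convention


def genic_part(gene, chrom, pos, flank=5000):
    """Classify a position against one gene, or return None if it is too far away."""
    if chrom != gene.chrom or not (gene.start - flank <= pos < gene.end + flank):
        return None
    if gene.start <= pos < gene.end:
        in_exon = any(s <= pos < e for s, e in gene.exons)
        return "Exonic" if in_exon else "Intronic"
    # Outside the gene body but inside the flank: orientation depends on strand.
    before_gene = pos < gene.start
    if gene.strand == "-":
        before_gene = not before_gene
    return "Upstream" if before_gene else "Downstream"


def translate_regions_to_genes(positions, genes, flank=5000):
    """Return (chrom, pos, gene name, genic part) for every gene near each position."""
    hits = []
    for chrom, pos in positions:
        for gene in genes:
            part = genic_part(gene, chrom, pos, flank)
            if part is not None:
                hits.append((chrom, pos, gene.name, part))
    return hits


# Hypothetical gene models and variant positions, for illustration only.
GENES = [
    Gene("GENE_A", "chr1", 10000, 20000, "+",
         ((10000, 10500), (15000, 15400), (19500, 20000))),
    Gene("GENE_B", "chr1", 30000, 32000, "-",
         ((30000, 30800), (31500, 32000))),
]
POSITIONS = [("chr1", 10200), ("chr1", 12000), ("chr1", 7000),
             ("chr1", 22000), ("chr1", 33000), ("chr1", 50000)]

for hit in translate_regions_to_genes(POSITIONS, GENES):
    print(hit)
```

Counting the returned genic parts per chromosome gives the kind of tally that Figure 7 plots.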

Figure 8. MAP kinase pathway found to be significantly affected by mutations.
Legend: enriched genes with mutations from 1000 Genomes VCF data; differentially expressed genes from transcriptome experiment; enriched genes from both experiments.
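
This overview does not state which statistic GeneSpring GX uses to call a pathway significant. As an assumption for illustration, the sketch below uses a one-sided hypergeometric test, a common choice for over-representation analysis: it asks how likely it is to see at least k pathway genes among n variant-affected genes drawn from a background of N genes, of which K belong to the pathway. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(background, pathway_size, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    background   -- N, total genes considered
    pathway_size -- K, genes annotated to the pathway
    selected     -- n, genes in the variant-affected list
    overlap      -- k, genes in both the list and the pathway
    """
    upper = min(pathway_size, selected)
    # Sum the exact integer counts of favorable draws, then divide once.
    favorable = sum(
        comb(pathway_size, i) * comb(background - pathway_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(background, selected)


# Hypothetical counts: 20,000 background genes, a 250-gene pathway,
# 400 variant-affected genes, 15 of which fall in the pathway.
p = enrichment_p_value(20000, 250, 400, 15)
print(f"p = {p:.3g}")
```

When many pathways are tested at once, the resulting p-values would normally be adjusted for multiple testing before any pathway is reported as significant.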

Conclusion

Agilent GeneSpring GX software is a powerful exploratory tool for the identification, filtering, and curation of variants affecting a biological function. It offers high-resolution interactive browsing of reference genomes, as well as different types of genomic annotations derived from a variety of public databases, across complex datasets. The intuitive and easy-to-use pathway analysis utility allows merging variant data with proteomics and metabolomics data in a multi-omic setting, as well as inter-genomic analysis.

References

1. Molecular Subtypes in Glioblastoma Multiforme: Integrated Analysis Using Agilent GeneSpring and Mass Profiler Professional Multi-Omics Software, Agilent Technologies, publication number 5991-5505EN.
2. Correlation Analysis in Agilent GeneSpring and Mass Profiler Professional, Agilent Technologies, publication number 5991-5165EN.

www.agilent.com/chem

For Research Use Only. Not for use in diagnostic procedures.
This information is subject to change without notice.
© Agilent Technologies, Inc., 2017
Published in the USA, September 25, 2017
5991-8301EN