VEXOR: an integrative environment for prioritization of functional variants in fine-mapping analysis

Size: px
Start display at page:

Download "VEXOR: an integrative environment for prioritization of functional variants in fine-mapping analysis"

Transcription

1 VEXOR: an integrative environment for prioritization of functional variants in fine-mapping analysis Supplementary Material : VEXOR User Manual Audrey Lemaçon1[1], Charles Joly Beauparlant[1], Penny Soucy[1], Jamie Allen[2], Douglas Easton[2,3], Peter Kraft[4,5], Jacques Simard[1] and Arnaud Droit[1] 1.Genomics Center, Centre Hospitalier Universitaire de Québec - Université Laval Research Center, Quebec, Quebec, Canada 2.Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK 3.Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK 4.Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, MA, USA 5.Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA Supplementary Material : VEXOR Quick start Table of Contents Introduction VEXOR home page Data loading VEXOR features Annotations mapping Data exploration External Browsers Allele frequencies Linkage desequilibrium VEP RegulomeDB HaploReg CADD Manual scoring Experiments Differential expression Statistical frameworks for variants prioritization References 1

2 Introduction VEXOR from Variant EXplOreR is a Web interface coded in R (R Core Team, 2013) and based on the Shiny framework (Chang et al., 2016). The website and its server are hosted by the Compute Canada infrastructure and provide a high-performance computing environment. VEXOR significantly facilitates the visualization, exploration and interpretation of the outputs generated by genome-wide genotyping arrays and custom array such as icogs ( It also conveniently associates a selected set of variants with publicly available functional annotations. VEXOR clearly displays the overlaps between genetic variants and potential genomic predictors. Such overlaps emphasize potential functional variants, predict target genes or regulatory pathways, and allow the prioritization of variants for future functional assessment. 2

3 VEXOR home page Go to VEXOR Web interface at : 3

4 Data loading VEXOR supports as input a comma-delimited, tabulation-delimited or Excel file containing various information for each genetic variant such as position and alleles. The file must have a header, which will be used by VEXOR to build a query form. The minimum content of the input file is [Chr, Pos, End_pos, RsID, Ref, Alt]. However, some integrated tools require additional information such as the sample size or the allele frequency in the experiment. Uploaded file size is currently limited to 150 megaoctets (approximately 1 x 10 6 lines). For larger files, users will need to contact us to carry out data file entry, and private access to their data can be provided. Chr Pos End_pos RsID Ref Alt NULL T C rs T C rs G A NULL G A rs T C rs AAAC A rs C T First, define the genome build and coordinates of your association file. Then select your file through the file browser. 4

5 VEXOR features VEXOR provides a single point-of-entry to several useful tools and resources. No computational skills or file format editing between the tools is required. Users can execute all analyses needed within the VEXOR environment. Once the data has been successfully uploaded, the user can simultaneously study hundreds of variants of interest. Annotations mapping You can select annotations among some basic genomic annotations or map some custom annotations. To do so you have to provide your data in the form of a BED file (extension.bed). You can either map the annotation data in a dichotomic way (0 or 1), or map them as quantitative values (expression levels for example). Basic annotations The basic annotations are coming from UCSC Tables and have been selected because of their cell type nonspecificity. 5

6 Custom annotations You can use your own annotation files or access UCSC (Rosenbloom et al., 2015) Tables here : UCSC Tables. To retrieve an annotation : 1. Select the following options : clade : Mammal ; genome : Human and assemby : GRCh37/hg19 2. Select the track of interest 3. Select the genomic region(s) of interest or select whole genome 4. Select the output format : BED - browser extensible data 5. Fill the output file section with your output file name with.bed extension and press get output to download the annotation file on your computer. You can then load valid BED files and map these annotations through VEXOR 6

7 Data exploration Once the data has been successfully uploaded - and eventually annotated -, you can select a subset of variants according to a selection filters. These filters correspond to the input file columns as well as the name of mapped the annotations. Once all the filters are set, press the Query button to query your input file. The results are presented under the form of 1) a searchable table... 7

8 ... and 2) an interactive map if your file contains genomic annotations created with R package d3heatmap (Cheng and Galili, 2016) that can be saved as an htlm interactive file. After the filtering, you can also perform a subset of the previous selection by variants identifier RsID using a dropdown list (if less than 200 RsIDs to display) or by entering manually the RsID of interest. As in the previous section, if your file contains genomic annotations an interactive map will be displayed for you subselection of variants. 8

9 External Browsers VEXOR is connected to Ensembl (Yates et al., 2015) and UCSC (2015) Browsers. They are available from the Data selection section. You can easily visualise your variants under the form of a custom track within the browsers and then benefit from all their features. By default all the variants in the results table will be display unless some variants are selected in the table. See all variants in UCSC See all variants in Ensembl 9

10 You can select a subset of variant to display in the browsers by highlighting the in the results table. See selected variants in UCSC 10

11 See selected variants in Ensembl The following tools are available in the left panel in the Tools section. 11

12 Allele frequencies Go to the Allele frequencies subsection and press Run button to get the minor allele frequencies of your selected variants for the 6 populations of 1000 Genome Project (1000 Genomes Project Consortium et al., 2015). Press the Download button to save the table (csv format). 12

13 Linkage desequilibrium Go to the Linkage desequilibrium (LD) subsection, choose one of the 1000 Genome Project (2015) subpopulations and press Run LD Calculation button to get linkage information of your selected variants. The LD calculation and the Haploview-like representation are generated by the R package LDheatmap (Shin et al., 2006). Press the Save LD Matrix button to save the LD results under the form of a pairwise matrix (space delimited text file). 13

14 VEP VeP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. Go to the VeP (McLaren et al., 2016) subsection and press Run button to get the effect prediction computed by the VeP for your selected variants. The results are presented in two tables presenting regulatory and transcription consequences. The Transcript consequences panel presents information about genes potentially impacted by your selected variants and contains a link to GeneMANIA (Warde-Farley et al., 2010). GeneMANIA searches many large, publicly available biological datasets to find related genes. These include protein-protein, protein-dna and genetic interactions, pathways, reactions, gene and protein expression data, protein domains and phenotypic screening profiles. 14

15 Each table can be saved by pressing the Download button (csv format). 15

16 RegulomeDB RegulomeDB is a database that annotates SNPs with known and predicted regulatory elements in the intergenic regions of the H. sapiens genome. Go to the RegulomeDB (Boyle et al., 2012) subsection and press Run button to get the functional annotations of your selected variants. You can access to complementary information through the Info links. Press the Download button to save the table (csv format). 16

17 HaploReg HaploReg is designed for researchers developing mechanistic hypotheses of the impact of non-coding variants on clinical phenotypes and normal variation. Go to the HaploReg (Ward and Kellis, 2012) subsection and press Run button to get the functional annotations of your selected variants. You can access to complementary information through the Info links. 17

18 Press the Download button to save the table (csv format). The results table presents information about genes potentially impacted by your selected variants and contains a link to GeneMANIA (2010). GeneMANIA searches many large, publicly available biological datasets to find related genes. These include protein-protein, protein-dna and genetic interactions, pathways, reactions, gene and protein expression data, protein domains and phenotypic screening profiles. 18

19 19

20 CADD Go to the CADD (Kircher et al., 2014) subsection and press Run button to get CADD deleteriousness scores of your selected variants. Press the Download button to save the table (csv format). 20

21 Manual scoring This section enables to compute a local scoring for selected variants. The weights can be chosen arbitrary or can be based on enrichment results obtained from prediction algorithms. You can manually weight the features and change the thresholds. To create your manual scoring : 1. Select the feature you want to use 2. Select the value associated with the features 3. Select the weight 4. Add the weighted features 5. Repeat 1-4 steps for each feature of your model 6. Press Run Scoring button You can also save your model in text file so you can load it for further use. The weighting should perfectly match the loaded input file. The results are presented under the form of an interactive map created with R package d3heatmap (2016). 21

22 22

23 Experiments Taking advantage of R graphical abilities and particularly ggplot2 package, we have included a section dedicated to the visualisation of annotation data such as chromatin states (9-ENCODE cell types (Ernst et al., 2011)), enhancers (FANTOM5 (Lizio et al., 2015)), super-enhancers (Hnisz (Hnisz et al., 2015)), predicted enhancers targets with IM-PET (4DGenome (Teng et al., 2015)), computationally derived long poly-adenylated RNA transcripts referenced by MiTranscriptome catalog (Iyer et al., 2015) and relevant experimental data such as chromosomal interactions (Hi-C, 3C, 5C, ChIA-PET from 4DGenome (2015)) To visualize experiments : 1. Select the region 2. Select the annotations 3. Select the tissues or the cell types of interest... Optionally, the variant representation can be switched-off at will and is adapted to the number of variants to display. 4. Press Draw button 23

24 Finally, we add the possibility to colored highlight interactions based on a subset list of genes of interest. 24

25 By checking the See only overlapping features checkbox (and pressing Draw button again), you will visualize the overlapping annotations. 25

26 The overlaps are summarize in a table below the figure. Press the Download button to save the table (csv format). 26

27 Differential expression Go to the Differential expression subsection and press Run button to access GTEx (The GTEx Consortium and al., 2015) data for your selected variants. Press the Download button to save the table (csv format). 27

28 Statistical frameworks for variants prioritization Due to the algorithms complexity, the prior probabilities estimation can be time and resource consuming. Depending on the number of annotations and variants to analyse, the software will run instantaneously or be submitted to the server for calculation. In this case, the result archive will be send to the user by . fgwas fgwas is a command line tool for integrating functional genomic information into a genome-wide association study (GWAS). The underlying model is described in: Pickrell JK (2014) Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. - (Pickrell, 2014) To run, fgwas needs : - annotations which can be the basic annotations provided by VEXOR or user s own data - the observed allele frequency of one of the alleles (must be included in the original matrix file) - the Z-score association data (must be included in the original matrix file) - the sample size used in the association study at this SNP (this can vary from SNP to SNP due to, e.g., missing data) By default, fgwas will be run on your variants selection, but you can also use the multi loci mode. In this case, the matrix has to contain an additional column identifying the region of belonging of each variants by a unique identification number. The default parameters used to run fgwas are the following : fgwas -i [input] -o [output] -w [all_the_selected_annotations] -k [max between 5000 and the number of SNPs in region] -print. You can explicitely change k value through the Window size field. The -fine argument will be added if the user selects the multi-loci analysis. A lot of other arguments enable to modify fgwas behavior. You can add your tuning parameters by filling the fgwas tuning parameters area (see fgwas user manual for more details). 28

29 29

30 CAVIAR-BF CAVIAR-BF (Chen et al., 2015) is a fine-mapping method using marginal test statistics in the Bayesian framework. An advantage of the Bayesian framework is that it can answer both association and fine-mapping questions. To run, CAVIAR-BF needs : - annotations which can be the basic annotations provided by VEXOR or user s own data - linkage desequilibrium information which can be upload separately or computed through VEXOR - the Z-score association data (must be included in the original matrix file) By default, CAVIAR-BF will be run on your variants selection, but you can also use the multi loci mode. In this case, the matrix has to contain an additional column identifying the region of belonging of each variants by a unique identification number. A lot of other arguments enable to modify CAVIAR-BF behavior. For most of the parameters, we chose to run the software with the default values. caviarbf::caviarbffinemapping(locilistfile = "locilistfile", maxcausal = "provide by the user", nsample = "provide by user", exact = FALSE, #For binary traits, # only approximated Bayes factors are available eps = "0 if the user provide its own LD matrix, else 0.2", hyperparamselection = "topk", outputprefix = "outputprefix", LDSuffix = "LD.p1", keepbf = TRUE, priortype01 = 0, # 0 DEFAULT priorvalue = c(0.1, 0.2, 0.4), #GWAS option useidentitymatrix = FALSE, #FALSE DEFAULT BFFileListFile = NA, #NA DEFAULT priorprob = 0, # 0 DEFAULT, fittingmethod = "glmnetlassomin", # DEFAULT rthreshold = 0.2, # DEFAULT pvaluethreshold = 0.05, # DEFAULT K = "number of selected features", lambda = c(2^seq(from = -15, by = 2, to = 5), 100, 1000, 10000, 1e+05, 1e+06), # DEFAULT alpha = c(0, 0.2, 0.3, 0.5, 0.7, 0.8, 1), # DEFAULT annotationindices = -1, # DEFAULT useparallel = FALSE, # DEFAULT ncores = 1, # DEFAULT verbose = F, # DEFAULT overwriteexistingresults = T) 30

31 31

32 fastpaintor (PAINTOR 3.0) fastpaintor is a framework enabling to combine external functional annotations (sets of variants that localize within certain genomic features, (e.g introns, enhancers, transcription binding sites) with genetic association data (the strength of association between genetic variants and the phenotype) to improve the prioritization of causal variants. To run, fastpaintor (Kichaev et al., 2016) needs : - annotations which can be the basic annotations provided by VEXOR or user s own data - linkage desequilibrium information which can be upload separately or computed through VEXOR - the Z-score association data (must be included in the original matrix file) By default, fastpaintor will be run on your variants selection, but you can also use the multi loci mode. In this case, the matrix has to contain an additional column identifying the region of belonging of each variants by a unique identification number. 32

33 References 1000 Genomes Project Consortium et al. (2015) A global reference for human genetic variation. Nature, 526, Boyle,A.P. et al. (2012) Annotation of functional variation in personal genomes using RegulomeDB. Genome Res., 22, Chang,W. et al. (2016) shiny: Web Application Framework for R. Chen,W. et al. (2015) Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics. Genetics, 200, Cheng,J. and Galili,T. (2016) d3heatmap: Interactive Heat Maps Using htmlwidgets and D3.js. Ernst,J. et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473, Hnisz,D. et al. (2015) Convergence of developmental and oncogenic signaling pathways at transcriptional super-enhancers. Mol. Cell, 58, Iyer,M.K. et al. (2015) The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet., 47, Kichaev,G. et al. (2016) Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics. Kircher,M. et al. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet., 46, Lizio,M. et al. (2015) Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol., 16, 22. McLaren,W. et al. (2016) The Ensembl Variant Effect Predictor. Genome Biology, 17, 122. Pickrell,J.K. (2014) Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. American Journal of Human Genetics, 94, R Core Team (2013) R: A Language and Environment for Statistical Computing R Foundation for Statistical Computing, Vienna, Austria. Rosenbloom,K.R. et al. (2015) The UCSC Genome Browser database: 2015 update. Nucleic Acids Res., 43, D Shin,J.-H. et al. (2006) LDheatmap : An R Function for Graphical Display of Pairwise Linkage Disequilibria Between Single Nucleotide Polymorphisms. J. Stat. Softw., 16. Teng,L. et al. (2015) 4DGenome: a comprehensive database of chromatin interactions. Bioinformatics, 31, The GTEx Consortium and al. (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348, Ward,L.D. and Kellis,M. (2012) HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res., 40, D Warde-Farley,D. et al. (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res., 38, W Yates,A. et al. (2015) Ensembl Nucleic Acids Research, 44, D