USER MANUAL for the use of the human Genome Clinical Annotation Tool (h-gcat)
Authors: Klaas J. Wierenga, MD & Zhijie Jiang, PhD
First edition, May 2013

Introduction

The Human Genome Clinical Annotation Tool, or h-gcat, is the product of a collaboration between Dr. Zhijie Jiang, PhD (ZJiang@med.miami.edu) at the Center for Computational Science (CCS) of the University of Miami, and Dr. Klaas Wierenga, MD (klaaswierenga@ouhsc.edu) at the Division of Genetics, Department of Pediatrics at the Oklahoma University Health Sciences Center. h-gcat is a server-based tool designed for the analysis of whole exome/genome sequencing data obtained from families affected by genetic disorder(s). The objective of h-gcat is to provide the user with a relatively simple and intuitive interface to proceed through the analysis. Behind this deceptively simple front-end, however, is a powerful analysis integrated with several clinically relevant databases.

The user is taken through the analysis in a few steps. On the first page (Login Page), the user can register and, after registration, access the site. On the second page (Upload Page), a data file in variant call format (VCF) can be uploaded to the tool. On the third page (Quality Control Page), initial filtering parameters are set to remove the sites (or calls) that do not require further analysis. On the fourth page (Sample Page), the user can provide clinical, genetic and pedigree information, and select the genetic location(s) and type(s) of mutation deemed of highest interest. If the user thinks that the disorder may already have a clinical annotation in the Human Gene Mutation Database (HGMD), Disease Ontology (DO) and/or Online Mendelian Inheritance in Man (OMIM), this clinical information can be used to refine the search. Within OMIM, the user can also search for OMIM-derived clinical keywords or, alternatively, use Human Phenotype Ontology (HPO) codes.
Once this is accomplished, the user has the option of entering coordinates of regions of homozygosity (ROH) to focus the analysis on these regions, if homozygous recessive conditions within these ROH are deemed likely. Lastly, the user has the option of analyzing a limited number of genes (e.g., a favorite gene list, based on genomic location or on association with disorders with genetic heterogeneity).

On the fifth page (Result Page), the user can review the sites (calls) that 'survived' these iterative filtering activities. The Result Page provides the user with relevant information on the location of the sites (calls), the dbSNP status, the gene-based cDNA location of the sites (calls) for the various isoforms, as well as the effect of the mutation on the protein. Links out to the National Center for Biotechnology Information (NCBI), UCSC Genome Browser and OMIM entries are provided for review. To the very right of the page, the user can review the data for the various family members studied, and review the quality of the data underlying genotype status (ref/ref, ref/alt, alt/alt) by mousing over the various genotypes.

On the last 2 pages (Sample Page and Result Page) the user can review the filter parameters selected by clicking Review your filtering selection. At the end of the analysis, the entire activity can be downloaded to the desktop in Excel spreadsheet format for record keeping.

Our tool is still being developed. We will focus shortly on annotating sites for splice site mutation potential, and on building a missense mutation taster to provide an informed prediction as to whether a missense mutation is damaging or not.

Login Page

The user needs to register for the site using full name, email address and affiliation (the institute or company the user is associated with). Once the user has registered, a password is sent to this email address. The password can be changed by the user at any time. After registration has been completed successfully, and on any occasion thereafter, the user enters the email address and password to get access to the Upload Page.

Upload Page

Currently, the only file format that can be uploaded to the h-gcat tool is a variant call format (VCF) file. The file should be located on a hard drive, memory stick, or cloud storage. Using the Browse button, the file to be analyzed can be identified and selected for upload to the tool. The name of the file to be uploaded should be visible in the box beside the Browse button. Keep the radio button at the default selection (VCF), and click the Upload button. Upload may take a few minutes, depending on the size of the file and the upload speed of your internet connection. We suggest that the size of the VCF file should not exceed 1000 MB. Once the upload has completed, you will be taken to the Quality Control Page.
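For orientation, a VCF data line is tab-separated: eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), followed by a FORMAT column and one column per sample. The sketch below shows how such a line can be read. It is an illustration only and not part of h-gcat; the function name, the sample names and the example line (including the placeholder ID) are made up.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into its fixed fields and per-sample values."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, site_id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")  # e.g. GT:DP:GQ
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        # Pair each FORMAT key with the value in this sample's column.
        samples[name] = dict(zip(format_keys, column.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": site_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternative alleles are comma-separated
        "samples": samples,
    }


# Hypothetical line with two samples; "rs_placeholder" is not a real dbSNP ID.
example = "1\t12345\trs_placeholder\tA\tG\t50\tPASS\tDP=58\tGT:DP:GQ\t0/1:30:99\t0/0:28:99"
record = parse_vcf_line(example, ["child", "mother"])
print(record["samples"]["child"]["GT"])  # 0/1
```

In VCF notation, 0 stands for the reference allele and 1 for the first alternative allele, so 0/1 corresponds to (ref/alt) in this manual.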

Quality Control Page

Typically, a variant call format (VCF) file contains 40,000-80,000 sites (calls). This means that there are 40,000-80,000 sites that differ from the reference sequence. However, many of these sites do not need to be considered as the cause of the Mendelian disorder being studied: some have poor sequencing quality or poor genotype calling quality, while others are major (common) alleles in some populations. On the Quality Control Page, we can filter quickly by removing:

1. calls for which the data quality is poor
2. common SNPs, since a common SNP is not likely the cause of a rare disorder
3. (calls on) chromosomes that we are not interested in

ad 1. The sequencing quality of a site is to a large extent measured by the sequencing depth (30X, 50X, etc.), and the genotype calling quality is measured by the transformed probability. Not every call, however, is of high quality. When a VCF file is prepared, not all sites with poor quality are removed; they are filtered out here, as there is no point in spending a lot of time analyzing poor quality data. Typically, the default values of the tool can be used. They can be changed as needed, based on information available to you concerning the platform used to obtain the VCF file.

ad 2. There are 2 standard allele frequency datasets, derived from multiple populations: the 1000 Genomes dataset, or 1kG, and the 5600 NHLBI dataset, or ESP. The assumption is that it is unlikely that these common SNPs/variants are the cause of a rare Mendelian disorder. The user can decide which dataset(s) to use to filter out these common SNPs, given the known ethnicity of the patient/family being studied, and then decide on the minor allele frequency for filtering (i.e., any SNP/variant with a minor allele frequency greater than the user-defined cut-off will be removed).

ad 3. Typically, the user will have informed assumptions about the type of inheritance that is likely underlying the disorder in the family studied. If the condition is assumed to be X-linked, the user can remove the sites (calls) from all other chromosomes, and just enter X. In autosomal recessive or autosomal dominant conditions, the user may want to discard sites on chromosomes X and Y, and enter 1-22. In all other cases, the user should use the default, all.

There is one more option, to be used selectively. It provides the user with information on the genes that were poorly covered by exome sequencing. This option may be useful at a later time, if data analysis fails to come up with reasonable hits. If used, be prepared to wait a while, as this feature takes time to complete.

Once the selections are made, the Apply button can be clicked, and the h-gcat tool will filter the variants/calls using the parameters selected on this page. The program removes those sites lying outside of the set parameters, and retains a more limited dataset for further scrutiny.
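The three filters described above can be pictured as one pass over the sites. The following is a minimal sketch under stated assumptions, not the h-gcat implementation: the Site fields, the function name and the threshold values are hypothetical and are not the tool's defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    chrom: str
    depth: int                # sequencing depth at the site
    genotype_quality: float   # genotype calling quality
    maf: Optional[float]      # minor allele frequency in the chosen dataset, None if not listed


def passes_qc(site, min_depth, min_gq, max_maf, chromosomes=None):
    """Apply the quality, common-variant and chromosome filters to one site."""
    if site.depth < min_depth or site.genotype_quality < min_gq:
        return False  # filter 1: poor data quality
    if site.maf is not None and site.maf > max_maf:
        return False  # filter 2: common SNP/variant
    if chromosomes is not None and site.chrom not in chromosomes:
        return False  # filter 3: chromosome not of interest
    return True


sites = [
    Site("1", 35, 99, None),    # kept
    Site("2", 4, 99, None),     # removed: low depth
    Site("3", 40, 99, 0.25),    # removed: common variant
    Site("X", 40, 99, 0.001),   # removed: not an autosome
    Site("7", 40, 99, 0.001),   # kept
]
autosomes = {str(n) for n in range(1, 23)}  # corresponds to entering 1-22
kept = [s for s in sites
        if passes_qc(s, min_depth=10, min_gq=20, max_maf=0.01, chromosomes=autosomes)]
print([s.chrom for s in kept])  # ['1', '7']
```

Passing chromosomes=None corresponds to the default, all.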

Sample Page

On the Sample Page the user can click on the header Review your filtering selections to quickly review the filtering criteria and the outcome of the filtering process, i.e., the number of sites removed and the number of sites retained for further scrutiny. In the cartoon above, one can see that the VCF file contained 75,917 sites, and that 4,962 were of poor quality, and hence removed. One can also see that after removing common variants using 1kG for Global and ESP for All Populations, and after removing sites on chromosomes X and Y, of the 64,104 sites 'only' 8,586 remain.

The Sample Page is the heart of the program. First, it allows the user to specify the inheritance pattern. Commonly assumed inheritance patterns are:

1. de novo dominant - the proband has a dominant disorder, but neither parent has this condition.
2. dominant inheritance - the proband has a dominant disorder, and one of the parents, and possibly other relatives, have the same condition.
3. recessive inheritance, homozygous - the proband, and possibly some siblings as well, have a recessive disorder due to homozygous mutations, while the parents are carriers of this mutation - there may be (remote) consanguinity or inbreeding.
4. recessive inheritance, compound heterozygous - the proband, and possibly some siblings as well, have a recessive disorder caused by the presence of 2 mutations in trans. Each parent is a carrier for one of the mutations.
5. X-linked recessive - the (male) patient typically inherits the mutation from his mother.
6. de novo X-linked recessive - similar to de novo dominant, the parents (here: the mother) do not carry the mutation causing the disorder in the son.

In the next few pages you will see the Sample Page filled out by inheritance pattern. Of course, it is not always known what the inheritance pattern is, and the analysis can be done more than once. For example, a single affected male in a family could have an X-linked de novo, an X-linked inherited, a homozygous recessive, a compound heterozygous, or even a de novo dominant disorder, implying that the analysis may have to be done using 5 different templates.

Template for de novo dominant disorders

All unaffected individuals are (ref/ref, ref = the reference allele), while the single affected person would be (ref/alt), the alt being a minor allele, in this case due to a spontaneous mutation.
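A template can be thought of as an expected genotype (or set of acceptable genotypes) per family member, and a site survives if every listed member matches. The sketch below illustrates that idea only; the data layout, the function name and the site names are hypothetical and do not describe how h-gcat stores its data.

```python
def matches_template(genotypes, template):
    """Return True if every family member listed in the template has an allowed genotype."""
    return all(genotypes.get(member) in allowed for member, allowed in template.items())


# De novo dominant: affected proband is ref/alt, both unaffected parents are ref/ref.
de_novo_dominant = {
    "proband": {"ref/alt"},
    "mother": {"ref/ref"},
    "father": {"ref/ref"},
}

sites = {
    "site_a": {"proband": "ref/alt", "mother": "ref/ref", "father": "ref/ref"},
    "site_b": {"proband": "ref/alt", "mother": "ref/alt", "father": "ref/ref"},  # inherited
}
survivors = [name for name, gts in sites.items() if matches_template(gts, de_novo_dominant)]
print(survivors)  # ['site_a']
```

A family member whose genotype is left blank is simply omitted from the template dictionary, and a member for whom more than one genotype is acceptable gets a larger set, for example {"ref/ref", "ref/alt"}.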

Template for dominantly inherited disorders

All unaffected individuals are (ref/ref), while the affected individuals are (ref/alt), in this case passed on from one affected individual to other(s).

Template for homozygous recessive inheritance

In a setting where the risk of homozygosity is increased, it is useful to identify any homozygous mutations in affected individuals (alt/alt). In such cases, the parents should both be carriers (ref/alt), and the healthy siblings can be (ref/ref) or (ref/alt). In some cases, a SNP array may have been done, and in that case the regions of homozygosity are known (and short ROH can be obtained from the laboratory where the SNP array was done). The various ROH can be copied and pasted for further filtering, selecting only sites that map to the various ROH. For now, the coordinates of the various ROH need to be in the hg19 version of the human genome assembly.
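Restricting the analysis to ROH amounts to keeping only the sites whose position falls inside one of the pasted intervals. A minimal sketch of that interval check follows; the coordinates are invented for illustration, and the function name and tuple layout are assumptions rather than the tool's input format.

```python
def in_any_roh(chrom, pos, roh_regions):
    """Return True if the position lies inside at least one ROH (bounds inclusive)."""
    return any(c == chrom and start <= pos <= end for c, start, end in roh_regions)


# Hypothetical ROH as (chromosome, start, end), to be read as hg19 coordinates.
roh_regions = [("2", 1_000_000, 5_000_000), ("11", 20_000_000, 32_000_000)]

sites = [("2", 1_500_000), ("2", 7_000_000), ("11", 31_999_999), ("12", 25_000_000)]
in_roh = [site for site in sites if in_any_roh(site[0], site[1], roh_regions)]
print(in_roh)  # [('2', 1500000), ('11', 31999999)]
```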

Template for compound heterozygous recessive inheritance

In cases to be evaluated for compound heterozygosity, the situation is more complicated. Two mutations in one gene can be in cis or in trans; this is also called phase. Phase cannot be assessed from the patient's data alone, and parental information is therefore required. In order to properly evaluate for compound heterozygosity, the data of at least one affected individual and the 2 unaffected parents should be available. Filling out the expected genotypes is non-intuitive (and not needed); instead, we have provided the user with a compound heterozygosity option, see black arrow.
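To make the role of the parents concrete: a gene is a compound heterozygous candidate when the affected child carries at least one heterozygous site inherited only from the mother and at least one inherited only from the father, which places the two mutations in trans. The sketch below is a simplified illustration of that reasoning under assumed field names and invented gene names; it is not a description of the algorithm behind the h-gcat option.

```python
from collections import defaultdict


def compound_het_genes(sites):
    """Return {gene: (maternal_positions, paternal_positions)} for genes having both kinds."""
    maternal = defaultdict(list)
    paternal = defaultdict(list)
    for site in sites:
        if site["child"] != "ref/alt":
            continue
        if site["mother"] == "ref/alt" and site["father"] == "ref/ref":
            maternal[site["gene"]].append(site["pos"])
        elif site["father"] == "ref/alt" and site["mother"] == "ref/ref":
            paternal[site["gene"]].append(site["pos"])
    # Keep only genes with at least one maternal and one paternal site.
    return {gene: (maternal[gene], paternal[gene]) for gene in maternal if gene in paternal}


sites = [
    {"gene": "GENE_A", "pos": 100, "child": "ref/alt", "mother": "ref/alt", "father": "ref/ref"},
    {"gene": "GENE_A", "pos": 250, "child": "ref/alt", "mother": "ref/ref", "father": "ref/alt"},
    {"gene": "GENE_B", "pos": 900, "child": "ref/alt", "mother": "ref/alt", "father": "ref/ref"},
    {"gene": "GENE_B", "pos": 950, "child": "ref/alt", "mother": "ref/alt", "father": "ref/ref"},
]
candidates = compound_het_genes(sites)
print(candidates)  # {'GENE_A': ([100], [250])}
```

GENE_B is not reported because both of its sites come from the mother, so there is no evidence of one mutation on each parental copy.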

Template for X-linked recessive inheritance

In an X-linked disorder, the affected male is hemizygous (alt), but in the site annotation this appears as (alt/alt). In many instances the mother is a carrier (ref/alt), while the genotype of the father is irrelevant, and can be left blank.

Template for de novo X-linked recessive

In an X-linked disorder, the affected male is hemizygous (alt), but in the site annotation this appears as (alt/alt). In this scenario, the mother is not a carrier, but wild type (ref/ref), while the genotype of the father is irrelevant, and can be left blank.

Filters by Mutation(s)

The next step on the Sample Page is the filter by mutation type. The user will have to decide whether to search for all sites, to select only annotated sites, or alternatively to select only non-annotated (novel) sites. Next, the user will have to select where the sites to be considered may be located: within the gene, or outside the gene. If within the gene, the user can select exons or introns, and if exons are selected, decide to evaluate only untranslated regions (UTR) or only coding sequences (CDS). Then, if CDS is selected, the user can decide to select all mutations, only synonymous sites, or only non-synonymous sites. Currently, we cannot evaluate SNPs for splice mutation potential, but expect this to be available in the next few weeks. On first run, it may be useful to select novel SNPs, CDS and non-synonymous sites.
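The choices on this part of the page narrow the site list step by step: annotation status, then location, then effect. A minimal sketch of such a nested filter is given below, using the first-run selection as the example. The field names and labels are hypothetical, and treating the presence of a SNP ID as the meaning of "annotated" is an assumption made for the example.

```python
def passes_mutation_filter(site, status="all", region=None, effect=None):
    """Filter one site by annotation status, location and coding effect."""
    if status == "annotated" and site["snp_id"] is None:
        return False
    if status == "novel" and site["snp_id"] is not None:
        return False
    if region is not None and site["region"] != region:
        return False
    if effect is not None and site["effect"] != effect:
        return False
    return True


# Made-up sites; "rs_placeholder" is not a real dbSNP ID.
sites = [
    {"name": "s1", "snp_id": None, "region": "CDS", "effect": "non-synonymous"},
    {"name": "s2", "snp_id": "rs_placeholder", "region": "CDS", "effect": "non-synonymous"},
    {"name": "s3", "snp_id": None, "region": "UTR", "effect": None},
    {"name": "s4", "snp_id": None, "region": "CDS", "effect": "synonymous"},
]

# First-run selection: novel SNPs, CDS, non-synonymous.
first_run = [s["name"] for s in sites
             if passes_mutation_filter(s, status="novel", region="CDS", effect="non-synonymous")]
print(first_run)  # ['s1']
```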

Filter by Annotations

This option is useful if the user would like to know whether the patient may have a known (annotated), but unrecognized, genetic condition. The Human Gene Mutation Database (HGMD) can be searched for mutations in genes annotated as causing a disorder, as can Disease Ontology (DO). Alternatively, OMIM keywords or HPO search terms can be used.

[Figures: OMIM search; HPO search]

Filter by Preferred Genes

Lastly, the user has the option of checking on sites within a favorite gene list, for example a list of genes involved in a certain pathway, or found in conditions with genetic heterogeneity, e.g., as seen in retinitis pigmentosa, channelopathies, etc.

Result Page

The Result Page reports the sites that survived the various filtering activities. The sites are organized by chromosome, and ordered from pter to qter. The position is clickable, and links out to the UCSC Genome Browser. If this position is a known SNP, a clickable entry can be found in the SNP ID column, which links out to dbSNP. Next listed are the reference site(s) (by isoforms) and alternative site(s). The gene within which the sites are located is listed in the Gene column, which links out to NCBI Gene. Next listed are the various isoforms found for the gene, with the corresponding NM_xxxxxx identifier, which is also clickable. The SNP function column provides information on the type of mutation, followed by the residue (amino acid) change in case of a missense mutation. Then, the user can review the linked-out OMIM entries (gene / associated disorders), as well as HGMD and DO entries. Lastly, the user can review the genotype data by individual; by mousing over the genotype, the user can review the reads, to see if the genotype was called correctly.

In the result page below, as an example, the user selected for compound heterozygosity, with the affected child being compound heterozygous, and each parent a carrier of one of the sites.

The result page below shows, as an example, a family with consanguineous parents and 3 children, 1 of whom is affected. The search was done looking for homozygous mutations in the affected child, with the parents assumed to be carriers, and the healthy siblings carriers or wild type.

Download / Saving to desktop

Lastly, the user can save the conducted search and results to the desktop in Excel spreadsheet format, for later review and comparison.

Under development

We aim to develop two additions to the tool:

1. Checking sites for potential for causing splice mutations
2. Tasting missense mutations, using various available tools (SIFT, Provean, Polyphen2, etc.)