Novel Variant Discovery Tutorial


Novel Variant Discovery Tutorial Release Golden Helix, Inc. August 12, 2015


Contents
Requirements
Download Annotation Data Sources
1. Overview
2. Import VCF Files
3. Investigate Quality Metrics and Determine Filtering Thresholds
4. Remove Low Quality Genotypes
5. Apply Family Inheritance Filter
6. Variant Classification
7. Genomic Annotations
8. Conclusion


Updated: August 12th, 2015
Level: Advanced
Version: or higher
Product: SVS

This tutorial covers an advanced workflow that aims to find the causal variant in the X-linked fatal disorder, Ogden Syndrome. Several filtering procedures are applied to the family data that narrow down the variant set based on our knowledge of the syndrome.

Requirements

Since the data used in this tutorial is not publicly available, accessing the data requires permission from the study's principal investigator, Dr. Gholson J. Lyon. Use the following link to request access.

Download: Request Access

Download Annotation Data Sources

Before you begin this tutorial, you will need to download the annotation sources used in the tutorial.

The NHLBI ESP6500 Exomes Variant Frequencies annotation source contains minor allele frequencies from the NHLBI Exome Sequencing Project.

dbNSFP is an integrated database of functional annotations from multiple sources for the comprehensive collection of human non-synonymous SNPs (NSs). Its current version includes a total of 87,361,054 NSs. (Since beta 3, an additional 2.2 million splicing site mutations have been added.) It compiles prediction scores from ten prediction algorithms (SIFT, PolyPhen2, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, PROVEAN), eight conservation scores (phyloP46way_primate, phyloP46way_placental, phyloP100way_vertebrate, phastCons46way_primate, phastCons46way_placental, phastCons100way_vertebrate, GERP++ and SiPhy) and other functional annotations.

For the purpose of this tutorial we will use the subset version of the dbNSFP data. dbNSFP Functional Predictions 2.9 is a subset of the full dbNSFP Functional Predictions and Scores 2.9 database. This subset contains only the 5 predictions for SIFT, PolyPhen2 HumVar (HVAR), MutationTaster, MutationAssessor, and FATHMM. It is recommended that this annotation source be used to follow the tutorial or in cases where it is not desirable to download the full database.

You will need to download these sources via the Data Source Manager. These are large files and the downloads may take a while depending on your internet connection.

From SVS, choose Tools > Manage Data Sources. Click Public Annotations to bring up a list of sources available on our data server.

Navigate to and check the following sources (make sure to check the version corresponding to the GRCh37 build):

  Reference Sequence GRCh37 g1k, 1000Genomes
  NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies , GHI
  dbNSFP Functional Predictions 2.9, GHI

Click Download.

After the downloads have finished, the annotation sources are automatically placed in the correct directory and you may proceed with the tutorial.

1. Overview

This tutorial aims to reproduce the results of a study published in the American Journal of Human Genetics. In the study, five members of a family were sequenced with an X chromosome Exon Capture kit. Over the past 30 years the family has lost five affected male children to the previously undescribed syndrome, none of whom lived past 18 months. The data collected includes one affected male child, an unaffected male child, an unaffected uncle, and the grandmother and mother, who are both carriers of the disease.

Using prior knowledge about the inheritance pattern of this disease as well as its rare (or rather undiscovered) status, this tutorial demonstrates how to use SVS to filter out variants that do not meet our study assumptions, using genomic annotations and inheritance pattern specification.

2. Import VCF Files

First create a new project and import the five VCF files that correspond to the five samples in the study. The VCF files were generated by a GATK pipeline and include several variant-level metadata fields and sample-variant-level quality metrics. You will use the sample-variant-level Genotype Quality and Read Depth spreadsheets for filtering purposes, so be sure to check these options during import.

Open SVS and click Create New Project. Name the project Ogden Reproduction and save it to an appropriate directory. The default genome assembly should be the human build GRCh37 g1k. Click OK.

Next, import the VCF files into the currently empty project. Since this data was generated from an X Chromosome Exon Capture kit, you can limit the import to only include data on the X chromosome. Not all sequences that align to X are unique, so some captured reads come from homologous sequences elsewhere in the genome and produce data outside the X chromosome. Limiting the import to only include X will clean these sequences up.

Select Import > Import VCFs and Variant Files. Click Add Files and navigate to the directory that contains the appropriate VCF files. Shift-select all five VCF files and click Open. The window should look like Figure 2-1. Click Next >. If needed, the VCF files will be compressed and indexed to improve import performance.

Now select the Family Samples relationship and click Next >.

Now enter the pedigree structure of the data. The window should look like Figure 2-4. Click Next >.

As shown in Figure 2-5, the VCF files contain a lot of information. In this tutorial, you will only use the Genotype (G_T), Read Depth (DP) and Genotype Quality (GQ) fields, so check these fields for import. Enter Ogden as the Sheet Base Name:. Check Specify Genomic Regions to Import: and enter chrX in the text box. The dialog should look like Figure 2-6. Click Finished.

The project should now contain three mapped spreadsheets corresponding to the previous output options selected.
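To make the import options concrete, the following is a minimal sketch (outside SVS, and not how SVS itself performs the import) of restricting VCF records to the X chromosome and keeping only the genotype, read depth and genotype quality fields. It relies only on the standard tab-separated VCF layout, where the ninth column lists the per-sample FORMAT keys such as GT, DP and GQ. The function name `read_x_genotypes`, the accepted chromosome labels and the two example records are assumptions made up for illustration; they are not taken from the study data.

```python
import io

# Per-sample fields kept during import (VCF FORMAT keys).
WANTED_FIELDS = ("GT", "DP", "GQ")


def read_x_genotypes(handle, chrom_names=("X", "chrX")):
    """Yield (chrom, pos, ref, alt, per_sample) for X-chromosome records only.

    `per_sample` maps each sample name to a dict with the GT, DP and GQ
    strings; a field absent from the FORMAT column is reported as ".".
    """
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        if cols[0] not in chrom_names:
            continue  # drop calls that landed outside the X chromosome
        keys = cols[8].split(":")
        per_sample = {}
        for name, raw in zip(samples, cols[9:]):
            values = dict(zip(keys, raw.split(":")))
            per_sample[name] = {f: values.get(f, ".") for f in WANTED_FIELDS}
        yield cols[0], int(cols[1]), cols[3], cols[4], per_sample


# Made-up records for illustration only (not from the Ogden data set).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAffectedMaleChild",
    "X\t100\t.\tT\tC\t50\tPASS\t.\tGT:AD:DP:GQ\t1/1:0,30:30:90",
    "7\t200\t.\tG\tA\t50\tPASS\t.\tGT:DP:GQ\t0/1:12:40",
]) + "\n"

records = list(read_x_genotypes(io.StringIO(EXAMPLE_VCF)))
for chrom, pos, ref, alt, per_sample in records:
    print(chrom, pos, ref, alt, per_sample)
```

Only the record on X survives, which mirrors the effect of the Specify Genomic Regions to Import option. Accepting both `X` and `chrX` is a convenience, since chromosome naming differs between reference builds.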

Figure 2-1. VCF Import - Step 1

Figure 2-2. VCF Import - Step 2

Figure 2-3. VCF Import - Step 3

Figure 2-4. VCF Import - Step 4

Figure 2-5. VCF Import - Step 5

Figure 2-6. VCF Import - Step 6

3. Investigate Quality Metrics and Determine Filtering Thresholds

Open the Ogden - Genotypes(G_T) - Sheet 1 spreadsheet. The five previously mentioned samples are in the row labels, and about 5100 variants were called (in at least one sample) in the X chromosome exon regions. The other two spreadsheets, Ogden - Read Depths (DP) - Sheet 1 and Ogden - Genotype Qualities (GQ) - Sheet 1, contain the read depths and quality scores for the same samples and called variants.

Investigate Quality Metric Distributions

First investigate the distribution of the quality metrics for the affected child. To do this, create a Sample Collated Spreadsheet, which is essentially a transposed, merged and sorted combination of all input spreadsheets.

From the project navigator, select Tools > Build Sample Collated Spreadsheet. Enter Ogden as the Base Dataset Name. Click Add Spreadsheet and choose the three spreadsheets in the project. The window should look like Figure 3-1. Click Next >.

Figure 3-1. Build Sample Collated (Step 1)

In the next dialog, change the first suffix (after Ogden Genotypes) to GT. The window should look like Figure 3-2. Click OK.

The collated spreadsheet contains a column for each sample-spreadsheet combination, sorted by sample. The first three columns should contain the Genotype (_GT), Read Depth (_DP) and Genotype Qualities (_GQ) for the affected male child. Plot histograms of the read depth and quality score columns to investigate the distributions.

From Ogden - Sample Collated Spreadsheet - Sheet 1, select Plot > Histograms. Check AffectedMaleChild_DP, Mother_DP and UnaffectedBrother_DP. The window should look like Figure 3-3. Click Plot.

In the histogram plots, you can adjust the bin count to view a finer resolution. Click on Graph 1 in the Graph Control Interface. Increase the Bin Count in the Graph tab below.

Now repeat the above process in order to examine quality scores. From Ogden - Sample Collated Spreadsheet - Sheet 1, select Plot > Histograms.

Figure 3-2. Build Sample Collated (Step 2)

Figure 3-3. Plot Histograms of Read Depths

Check AffectedMaleChild_GQ, Mother_GQ and UnaffectedBrother_GQ. The window should look like Figure 3-4. Click Plot.

Figure 3-4. Plot Histograms of Quality Scores

The plots should look similar to Figure 3-5 and Figure 3-6. The first set of plots shows the coverage distribution for the affected sample and immediate family. Many of the variants have very low coverage and some have high coverage, up to 140+ reads. The second set of plots shows the distribution of quality scores, a measure of the confidence behind the variant caller's decision. Little trust should be placed in genotypes that correspond to low GQ values.

Choosing the threshold cutoffs is a matter of personal preference. For this tutorial, genotypes with DP <= 10 or GQ <= 20 will be set to missing.

4. Remove Low Quality Genotypes

Now use the determined threshold values to remove low-quality genotypes.

From Ogden - Genotypes(G_T) - Sheet 1, select DNA-Seq > Set Genotypes to No-Call based on Additional Spreadsheets. Click Add Spreadsheets, and select both Ogden - Read Depths (DP) - Sheet 1 and Ogden - Genotype Qualities (GQ) - Sheet 1. Click OK. Click Next >. On the tab corresponding to Read Depths, enter 10 for the threshold. On the tab corresponding to Genotype Qualities, enter 20 for the threshold. The window should look like Figures 4-1 and 4-2. Click OK.

The resulting spreadsheet has all genotypes that fell below the specified threshold for at least one metric set to missing. Also, any column that now contains all-missing values has been inactivated. About 3000 variant columns remain active, meaning that at least one sample contains a high-quality variant call.
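The no-call step can be sketched in a few lines of Python. This is an illustration of the logic only, not the SVS implementation: a genotype is set to missing when either metric fails, and a variant column stays active only if at least one sample keeps a call. Treating the thresholds as inclusive follows the DP <= 10 and GQ <= 20 wording above and is an assumption about how the tool interprets the entered values. The function names, the `?_?` missing label and the example numbers are made up for illustration.

```python
DP_THRESHOLD = 10   # read depth cutoff from the tutorial
GQ_THRESHOLD = 20   # genotype quality cutoff from the tutorial
MISSING = "?_?"     # label used here for a no-call genotype


def mask_low_quality(genotypes, depths, qualities):
    """Return a copy of `genotypes` with low-quality calls set to MISSING.

    Each argument maps a sample name to a list of per-variant values,
    all in the same variant order.
    """
    masked = {}
    for sample, calls in genotypes.items():
        masked[sample] = [
            # A call is dropped if EITHER metric is at or below its cutoff.
            MISSING if dp <= DP_THRESHOLD or gq <= GQ_THRESHOLD else gt
            for gt, dp, gq in zip(calls, depths[sample], qualities[sample])
        ]
    return masked


def active_columns(masked):
    """Indices of variants where at least one sample still has a call."""
    n_variants = len(next(iter(masked.values())))
    return [
        i for i in range(n_variants)
        if any(calls[i] != MISSING for calls in masked.values())
    ]


# Made-up values for two samples and three variants (illustration only).
genotypes = {
    "AffectedMaleChild": ["Alt_Alt", "Alt_Alt", "Ref_Ref"],
    "Mother": ["Alt_Ref", "Alt_Ref", "Ref_Ref"],
}
depths = {"AffectedMaleChild": [35, 8, 10], "Mother": [40, 25, 5]}
qualities = {"AffectedMaleChild": [99, 60, 45], "Mother": [99, 15, 30]}

masked = mask_low_quality(genotypes, depths, qualities)
print(masked)
print(active_columns(masked))
```

In this toy input only the first variant keeps a call in any sample, so it is the only column that would remain active.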

Figure 3-5. Histograms of Read Depth.

Figure 3-6. Histograms of Quality Score.

Figure 4-1. Set Genotypes to No-Call based on Additional Spreadsheets (Tab 1)

Figure 4-2. Set Genotypes to No-Call based on Additional Spreadsheets (Tab 2)

Also, a subset spreadsheet, Ogden - Genotypes(G_T) - Genotypes Filtered to No-call - Column Subset, is created containing only those variants that remain active. Rename the subset Ogden - Genotypes(G_T) - Filtered to High Quality by right-clicking on the column subset node and choosing Rename Node.

5. Apply Family Inheritance Filter

Certain characteristics of this syndrome and family structure allow for expectations regarding the sample genotypes. Since the disorder is linked to the X chromosome and only male children are affected, we should expect the affected male child's causal variant to be called homozygous alternate (males carry a single X chromosome, so the only allele present is the alternate). Similarly, we know that the mother and grandmother are carriers of the disorder but not affected, so we should expect the causal variant in these samples to be heterozygous: each child is passed either the non-damaging reference allele or the causal alternate allele. The latter results in male children having the disease and female children being carriers (heterozygous).

Figure 5-1. Ogden Family Pedigree. Ref_Alt circles represent female carriers in the family. Ref squares represent the unaffected males who inherited the non-damaging reference allele from their mother's X chromosome. Alt squares represent affected males who inherited the damaging variant.

Finally, we should expect unaffected males to have a homozygous reference or missing genotype for the causal variant. Since Missing was selected to fill holes during import, all of these values should be missing.

Next, create a subset that only includes columns that fit the genotype inheritance pattern described above.

From Ogden - Genotypes(G_T) - Filtered to High Quality, choose DNA-Seq > Activate Variants by Sample Genotypes. Under AffectedMaleChild check Alt_Alt. Under Grandmother and Mother check Alt_Ref. Under UnaffectedBrother and UnaffectedUncle check Ref_Ref and ?_?. The dialog should look like Figure 5-2. Click OK.
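The selections made in the dialog amount to a per-sample set of allowed genotypes, and a variant is kept only when every sample matches. The sketch below expresses that rule in Python as an illustration; it is not the SVS tool itself. The genotype labels and sample names follow the tutorial, while `fits_pattern`, `filter_variants`, the variant identifiers and the example genotype calls are hypothetical.

```python
MISSING = "?_?"

# Allowed genotypes per sample, matching the boxes checked in the dialog.
EXPECTED = {
    "AffectedMaleChild": {"Alt_Alt"},
    "Grandmother": {"Alt_Ref"},
    "Mother": {"Alt_Ref"},
    "UnaffectedBrother": {"Ref_Ref", MISSING},
    "UnaffectedUncle": {"Ref_Ref", MISSING},
}


def fits_pattern(calls, expected=EXPECTED):
    """True when every sample's genotype is one of its allowed values.

    A sample absent from `calls` is treated as a missing genotype.
    """
    return all(
        calls.get(sample, MISSING) in allowed
        for sample, allowed in expected.items()
    )


def filter_variants(variants, expected=EXPECTED):
    """Return the ids of variants that follow the inheritance pattern."""
    return [vid for vid, calls in variants.items() if fits_pattern(calls, expected)]


# Made-up genotype calls for three variants (illustration only).
variants = {
    "variant_1": {
        "AffectedMaleChild": "Alt_Alt", "Grandmother": "Alt_Ref",
        "Mother": "Alt_Ref", "UnaffectedBrother": MISSING,
        "UnaffectedUncle": "Ref_Ref",
    },
    "variant_2": {  # grandmother is homozygous alternate, not a carrier
        "AffectedMaleChild": "Alt_Alt", "Grandmother": "Alt_Alt",
        "Mother": "Alt_Ref", "UnaffectedBrother": MISSING,
        "UnaffectedUncle": MISSING,
    },
    "variant_3": {  # unaffected brother also carries the alternate allele
        "AffectedMaleChild": "Alt_Alt", "Grandmother": "Alt_Ref",
        "Mother": "Alt_Ref", "UnaffectedBrother": "Alt_Alt",
        "UnaffectedUncle": MISSING,
    },
}

candidates = filter_variants(variants)
print(candidates)
```

Allowing a missing genotype for the unaffected males but not for the affected child or the carriers keeps the filter strict where the evidence matters most, at the cost of dropping any true causal variant whose call was masked in one of those three samples.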

Figure 5-2. Activate Variants by Sample Genotypes dialog

This tool quickly finds that only 67 variant columns follow the specified inheritance pattern. Create a column subset.

Choose Select > Column > Column Subset Spreadsheet. Rename the spreadsheet Ogden - Genotypes(G_T) - Inheritance pattern.

6. Variant Classification

The current variant set includes high-quality variants that follow the specified inheritance pattern. Next, reduce the set to only include those classified as nonsynonymous (i.e., those that alter the amino acid sequence of a protein and could therefore produce a biological change in the child). First run Variant Classification to determine which of the 67 variants are in coding regions and, of those, which have been classified as nonsynonymous.

From Ogden - Genotypes(G_T) - Inheritance pattern, choose DNA-Seq > Variant Classification. Uncheck Variant Classification Counts by Gene and Variant Classification Report, leaving only Coding Variant Classification checked. The dialog should look like Figure 6-1. Click OK.

On the next dialog check only Remove Non-coding variants and Nonsyn SNV. The dialog should look like Figure 6-2. Click OK.

A new report spreadsheet, Coding Variant Classification, is created that contains the variants found within coding regions as row labels. The columns contain various information regarding the variants, including the Classification in the first column. This spreadsheet is then used by the tool to activate only variants classified as a nonsynonymous SNV, and Ogden - Genotypes(G_T) - Inheritance pattern - Coding Classification Filter Applied is created. Only 8 of the 67 variants should be active at this point, for a total of 13 columns. Create a column subset.
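What "nonsynonymous" means at the codon level can be shown with a small sketch. It translates the reference and altered codons with the standard genetic code and compares the resulting amino acids. This is only the final comparison step: a real classifier such as the SVS tool also needs gene and transcript models, strand handling and reading-frame lookup, none of which appear here. The function name `classify_coding_snv` and all labels other than Nonsyn SNV are assumptions chosen for illustration, and the codons in the examples are arbitrary rather than taken from the study.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon, offset, alt_base):
    """Classify a single-base change inside one coding-strand codon.

    `offset` is the 0-based position of the changed base within the codon.
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "Synonymous"   # same amino acid, protein unchanged
    if alt_aa == "*":
        return "Stopgain"     # change introduces a stop codon
    if ref_aa == "*":
        return "Stoploss"     # change removes a stop codon
    return "Nonsyn SNV"       # amino acid substitution


# A serine codon changed to a proline codon (TCC -> CCC).
print(classify_coding_snv("TCC", 0, "C"))
# The same serine codon with a silent third-base change (TCC -> TCT).
print(classify_coding_snv("TCC", 2, "T"))
# A tryptophan codon changed to a stop codon (TGG -> TGA).
print(classify_coding_snv("TGG", 2, "A"))
```

Only the first kind of change would pass the Nonsyn SNV filter selected in the dialog.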

Figure 6-1. Variant Classification Dialog

Figure 6-2. Apply Filter to Spreadsheet dialog

From Ogden - Genotypes(G_T) - Inheritance pattern - Coding Classification Filter Applied, click Select > Column > Column Subset Spreadsheet. Rename the spreadsheet to NS Candidate Variants.

7. Genomic Annotations

SVS also allows you to compare variants against genomic annotations, including NHLBI Exome Sequencing Project Allele Frequencies and dbNSFP Functional Predictions. Since this disorder had not yet been discovered and has a fatal outcome, we do not expect to find the causal variant at any frequency in the NHLBI ESP database. We also expect the causal variant to be predicted as damaging by the functional prediction methods in dbNSFP.

Next, generate the two annotation reports to investigate whether any of the 8 remaining variants meet the expectations described above.

From NS Candidate Variants, choose DNA-Seq > Annotate and Filter Variants. Click Add Track(s), then check NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies , GHI and dbNSFP Functional Predictions 2.9, GHI. Click Select, then Next >. On the NHLBI dialog uncheck Filter. The dialog should look like Figure 7-1. Click Next >. On the dbNSFP dialog also uncheck Filter. The dialog should look like Figure 7-2. Click Next >.

Figure 7-1. Annotate by Variant Frequency Catalog

The resulting NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies , GHI Variant Matches and Filters spreadsheet contains frequency information for any variant found within the annotation source. In some studies, the researcher may expect these frequencies to be low for rare disorders. Since the disease in this study is novel, we do not expect the causal variant to be cataloged. Notice that the report has only 7 rows, meaning one variant was not found in the database. All eight of the variants are in the dbNSFP NS Functional Predictions 2.9 Matched Variants report since a nonsynonymous filter was already applied to the variant set.

Now merge the annotation information to investigate.

From NHLBI ESP6500SI-V2-SSA137 Exomes Variant Frequencies , GHI Variant Matches and Filters, select File > Join or Merge Spreadsheets. Highlight dbNSFP Predictions 2.9 Matched Variants and click OK. For New dataset name: enter NHLBI + dbNSFP Annotations. After Unmatched Rows, select Keep (fill in empty cells as missing).
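Choosing Keep for unmatched rows makes the merge behave like an outer join: a variant present in only one report still gets a row, with the other report's cells left missing. That matters here because the variant absent from the frequency catalog is exactly the one of interest. The sketch below illustrates the idea with plain dictionaries; it is not the SVS join implementation, and the function name, column names (`EA_MAF`, `SIFT`), variant identifiers and values are all made up for illustration.

```python
def join_keep_unmatched(left, right, left_cols, right_cols, missing=None):
    """Join two {variant_id: {column: value}} tables on variant id.

    Rows found in only one table are kept, and the cells that table
    cannot supply are filled with `missing`.
    """
    merged = {}
    # Left ids first, then ids that appear only in the right table.
    all_ids = list(left) + [vid for vid in right if vid not in left]
    for vid in all_ids:
        row = {c: left.get(vid, {}).get(c, missing) for c in left_cols}
        row.update({c: right.get(vid, {}).get(c, missing) for c in right_cols})
        merged[vid] = row
    return merged


# Made-up annotation reports (illustration only).
frequencies = {
    "var_a": {"EA_MAF": 0.012},
    "var_b": {"EA_MAF": 0.0004},
}
predictions = {
    "var_a": {"SIFT": "Tolerated"},
    "var_b": {"SIFT": "Damaging"},
    "var_c": {"SIFT": "Damaging"},
}

merged = join_keep_unmatched(frequencies, predictions, ["EA_MAF"], ["SIFT"])

# Variants with no frequency entry were not found in the catalog.
not_in_catalog = [vid for vid, row in merged.items() if row["EA_MAF"] is None]
print(merged)
print(not_in_catalog)
```

With a join that dropped unmatched rows, `var_c` would disappear from the merged table, which is why the Keep option is the right choice for this workflow.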

Figure 7-2. Annotate by NS Functional Predictions.

The dialog should look like Figure 7-3. Click OK.

The tenth column in the merged spreadsheet contains allele frequencies for the European American study population. Sort on this column to move interesting variants to the top.

Right-click on European American MAF (column 10) and choose Sort Ascending.

Now the first row contains the variant not found in the NHLBI database. Scroll over to the end of the spreadsheet to view the functional prediction information (starting on column 24). Notice that the first row is also listed as Damaging, Probably Damaging and Disease Causing by SIFT, PolyPhen2 and MutationTaster respectively. The second row is predicted as damaging by SIFT and disease causing by MutationTaster but does not contain a prediction for PolyPhen2. All other rows are not consistently predicted as damaging.

X: SNV is very likely the causal variant. This variant is found in the 8th row of Coding Variant Classification. Looking at this spreadsheet, you can see that the variant falls in the NAA10 gene and does indeed change the protein sequence (Column 7: p.Ser37Pro).

8. Conclusion

The causal variant found in this workflow is the same causal variant discovered by Dr. Lyon and his research team. This variant is found in the NAA10 gene, and you can visually validate the inheritance pattern using SVS.

Open NS Candidate Variants and select GenomeBrowse > Variant Map. Then add the BAM files for the AffectedMale, UnaffectedBrother and Mother by selecting File > Add, navigating to the downloaded files and clicking Plot & Close. Turn on the Feature List to easily navigate to each identified variant: View > Dock Window > Feature List.

In the screenshot below, the affected male child clearly carries the variant while the mother and grandmother carry the variant in a heterozygous state, and the unaffected males do not contain the variant.

In the GenomeBrowse plot, you can also add all of the annotations hosted on our data server, including the ones used in this tutorial. As expected, the variant is not found in any of the common variant probe tracks and is covered by the OMIM track.

Figure 7-3. Join or Merge Spreadsheets dialog.

Figure 7-4. Variant Classification for Potential Causal Variant.

Figure 8-1. Validate Results in GenomeBrowse plot