RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data.

Similar documents
Assignment 9: Genetic Variation

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Prioritization: from vcf to finding the causative gene

Novel Variant Discovery Tutorial

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017

USER MANUAL for the use of the human Genome Clinical Annotation Tool (h-gcat) uthors: Klaas J. Wierenga, MD & Zhijie Jiang, P PhD

Variant prioritization in NGS studies: Annotation and Filtering "

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Genomics: Human variation

Exome Sequencing and Disease Gene Search

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte

Variant calling in NGS experiments

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es

CITATION FILE CONTENT / FORMAT

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Variant Finding. UCD Genome Center Bioinformatics Core Wednesday 30 August 2016

Read Mapping and Variant Calling. Johannes Starlinger

Exploring genomic databases: Practical session "

SNP calling and VCF format

What is genetic variation?

OncoMD User Manual Version 2.6. OncoMD: Cancer Analytics Platform

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4

Analytics Behind Genomic Testing

Niemann-Pick Type C Disease Gene Variation Database ( )

SUPPLEMENTARY INFORMATION

MPG NGS workshop I: SNP calling

Overview of the next two hours...

Annotation of Genetic Variants

user s guide Question 1

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Chapter 2: Access to Information

Identifying recessive gene candidates with GEMINI

SeattleSNPs Interactive Tutorial: Database Inteface Entrez, dbsnp, HapMap, Perlegen

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Axiom mydesign Custom Array design guide for human genotyping applications

Variant Analysis. CB2-201 Computational Biology and Bioinformatics! February 27, Emidio Capriotti!

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing:

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

Training materials.

Introduc)on to Genomics

The University of California, Santa Cruz (UCSC) Genome Browser

Supplemental Figure 1: Nisoldipine Concentration-Response Relationship on ipsc- CMs. Supplemental Figure 2: Exome Sequencing Prioritization Strategy

Runs of Homozygosity Analysis Tutorial

Genetics module. DNA Structure, Replication. The Genetic Code; Transcription and Translation. Principles of Heredity; Gene Mapping

Gathering of pathogenicity evidence for novel variants. By Lewis Pang

Hands-On Four Investigating Inherited Diseases

Detection of Rare Variants in Degraded FFPE Samples Using the HaloPlex Target Enrichment System

Investigating Inherited Diseases

Comparing a few SNP calling algorithms using low-coverage sequencing data

Using VarSeq to Improve Variant Analysis Research

ANNOVAR Variant Annotation and Interpretation

MassArray Analytical Tools

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Genomes contain all of the information needed for an organism to grow and survive.

Molecular Markers CRITFC Genetics Workshop December 9, 2014

Introduction to RNA-Seq in GeneSpring NGS Software

Map-Based Cloning of Qualitative Plant Genes

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Evidence of Purifying Selection in Humans. John Long Mentor: Angela Yen (Kellis Lab)

The Gene Gateway Workbook

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Result Tables The Result Table, which indicates chromosomal positions and annotated gene names, promoter regions and CpG islands, is the best way for

MedSavant: An open source platform for personal genome interpretation

Explain why the scientists used the same restriction endonuclease enzymes on each DNA sample

Genomic resources. for non-model systems

VARIANT ANNOTATION. Vivien Deshaies.

Supplementary information ATLAS

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

user s guide Question 3

The Diploid Genome Sequence of an Individual Human

Autozygosity by difference a method for locating autosomal recessive mutations. Geoff Pollott

Genetic databases. Anna Sowińska-Seidler, MSc, PhD Department of Medical Genetics

Motif Discovery in Drosophila

Handbook. Precautions/warnings: For professional use only. To support clinical diagnosis. Not yet offered in the US. Page 1

Types of Databases - By Scope

Genetics and Heredity Power Point Questions

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

user s guide Question 3

Applied Bioinformatics

Genetic load. For the organism as a whole (its genome, and the species), what is the fitness cost of deleterious mutations?

Guided tour to Ensembl

Nature Genetics: doi: /ng Supplementary Figure 1. Neighbor-joining tree of the 183 wild, cultivated, and weedy rice accessions.

Annotating your variants: Ensembl Variant Effect Predictor (VEP) Helen Sparrow Ensembl EMBL-EBI 2nd November 2016

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

Gene-centered databases and Genome Browsers

Gene-centered databases and Genome Browsers

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014

Transcription Start Sites Project Report

Strand NGS Variant Caller

Variant Callers. J Fass 24 August 2017

Identification of the Photoreceptor Transcriptional Co-Repressor SAMD11 as Novel Cause of. Autosomal Recessive Retinitis Pigmentosa

Release Notes for Genomes Processed Using Complete Genomics Software

Introduction to human genomics and genome informatics

Briefly, this exercise can be summarised by the follow flowchart:

SUPPLEMENTARY INFORMATION

Heredity: Inheritance and Variation of Traits

Supplementary Figure 2.Quantile quantile plots (QQ) of the exome sequencing results Chi square was used to test the association between genetic

Oral Cleft Targeted Sequencing Project

Transcription:

RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data. Adam Gudyś and Tomasz Stokowy October 30, 2017 Introduction The search for causative genetic variants in rare diseases of presumed monogenic inheritance has been boosted by the implementation of whole genome sequencing(wgs). Analysis and visualisation of WGS data is demanding due to its size and complexity. To aid this challenge, we have developed a WGS data analysis suite RareVariantVis 2. This new, significantly extended implementation of RareVariantVis (Stokowy et al., Bioinformatics 2016) annotates, filters and visualises whole human genome in less than 30 minutes. Method accepts and integrates vcf files for single nucleotide, structural and copy number variants. Proposed method was successfully used to disclose causes of three rare monogenic disorders, including one non-coding variant. This vignette was created to present how to efficiently visualize and interprete genomic variants in R. Package RareVariantVis aims to present genomic variants (especially rare ones) in a global, per chromosome way. Visualization is performed in two ways standard that outputs png figures and interactive that uses JavaScript d3 package. Interactive visualization allows to analyze trio/family data, for example in search for causative variants in rare Mendelian diseases. This vignette presents example of Genome in a Bottle NA12878 sample (chromosome 19) which is analyzed in about one minute on laptop/desktop computer. Examples include reading of all chr19 variants, filtering, annotation, visualization and homozygous region calling. Example The analysis is preceeded by loading neccessary packages. library(rarevariantvis) library(datararevariantvis) After that, single nucleotide variants (SNV) and structural variants (SV) for NA12878 have to be loaded from DataRareVariantVis package. Directories of these files are stored in sample and sv sample variables. 1

sample = system.file("extdata", "CoriellIndex_S1.vcf.gz", package = "DataRareVariantVis") sv_sample = system.file("extdata", "CoriellIndex_S1.sv.vcf.gz", package = "DataRareVariantVis") The RareVariantVis package supports vcf files generated by speedseq and GATK variant callers. The discovery of rare variants in chromosome 19 is performed by executing the following command. chromosomevis(sample=sample, sv_sample=sv_sample, chromosomes=c("19")) As a result, the package creates a set of files describing discovered rare variants: ˆ RareVariants CoriellIndex S1.txt table with single nucleotide variants, ˆ ComplexVariants CoriellIndex S1.txt table with complex variants, ˆ StructuralVariants CoriellIndex S1.txt table with structural variants, ˆ CoriellIndex S1 chr19.png per-chromosome variants visualisation. If more than one chromosome is to be processed, separate visualisation files are generated. All tables are tab-separated text files with rows corresponding to variants and columns describing their properties. Single nucleotide variants They are de- This table contain variants with a single alternative allele. scribed by the following columns: ˆ final positions position in chromosome, ˆ variant type type of mutation: synonymous/nonsynonymous/nonsense, ˆ ref allele reference allele, ˆ alt allele alternative allele, ˆ final variations homozygosity (ratio of alternative allele to depth), ˆ conservation conservation index according to UCSC phastcons. Additional columns are filled only for variants localized in coding regions: ˆ gene name primary name of gene according to UCSC Human Genome annotation, ˆ in exon flag indicating whether variant is in exon, ˆ Entry corresponding UniProt entry, ˆ Status protein status (UniProt), ˆ Protein.names names of the corresponsing proteins (UniProt), ˆ Gene.names all gene aliases (UniProt), ˆ Annotation ˆ Tissue.specificity tissue specificity (UniProt), ˆ Gene.ontology..biological.process description of biological process the gene is involved in and ontology identifier (UniProt), ˆ Involvement.in.disease known involvement in diseases (UniProt) ˆ Cross.reference..Orphanet ˆ PubMed.ID colon-separated list of related PubMed identifiers (UniProt). 2

Complex variants Variants with more than one alternative allele are referred to as complex. They are described by: ˆ start position starting position in the chromosome, If variant spans over coding regions, additional additional UniProt columns are filled analogously as for SNVs. Structural variants The table with structural variants have following columns: ˆ start position starting position in the chromosome, ˆ ID optional variant identifier, ˆ REF reference allele, ˆ ALT alternative allele, ˆ QUAL Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data, ˆ SVTYPE variant type: DUP(duplication)/DEL(deletion), ˆ GT genotype: 0/1 (heterozygous) or 1/1 (homozygous) ˆ end position genotype If variant spans over coding regions, additional additional UniProt columns are filled analogously as in SNVs. Per-chromosome visualisation For each analyzed chromosome, discovered variants are visualized. The example visualisation file for NA12878 chromosome 19 is presented in Figure 1. Figure illustrates variants (blue dots) in their genomic coordinates (X axis). Ratio of alternative reads and depth (Y axis) gives information about type of variant: homozygous alternative (expected ratio 1) and heterozygous (expected ratio 0.5). Green dots represent rare variants that pass filters: coding/utr, nonsynonymous variant with dbsnp frequency 0.01 and ExAC frequency 0.01. Orange vertical lines depict position of centromere. Orange dots depict structural and copy number variants that overlap with coding region and are of relatively good quality (QUAL 0). Red curve illustrates moving average of alternative reads/depth ratio. High values of this curve (exceeding 0.75) can suggest potential homozygous/deleterious regions. Interactive visualisation In order to perform interactive visualisation of NA12878 rare SNVs, the following piece of code has to be executed: rarevariantvis("rarevariants_coriellindex_s1.txt", "RareVariants_CoriellIndex_S1.html", "CorielIndex") 3

Figure 1: The visualisation file for NA12878 sample, chromosome 19. It reads a file containing table of rare variants (obtained from chromosomevis function) and provides an adequate visualization (Figure 2). Function outputs visualization html figure in the current working directory. Figure illustrates variants (dots) in their genomic coordinates (X axis). Ratio of alternative reads and depth (Y axis) gives information about type of variant: homozygous alternative (expected ratio 1) and heterozygous (expected ratio 0.5). Zooming the plot is also supported. Pointing on variants provides basic information about the variant name of gene and the position on the chromosome. If variants from many chromosomes are present in the table, all chromosomes are added to the same html file. Figure 2: Interactive visualisation file for NA12878 sample, chromosome 19. Multiple sample visualisation Another feature of RareVariantVis2 package is the ability to perform visualisation of multiple samples. Let RareVariants CoriellIndex S1.txt and RareVariants Coriell S2.txt be variant tables from different geneomes. One can visualise 4

their 19 chromosome variants simulateously by running: inputfiles = c("rarevariants_coriellindex_s1.txt", "RareVariants_Coriell_S2.txt") samplenames = c("coriellindex_s1", "Coriell_S2"); multiplevis(inputfiles, "CorielSamples.html", samplenames, "19") 5