Variant Analysis. CB2-201 Computational Biology and Bioinformatics. February 27, 2015. Emidio Capriotti

Variant Analysis. CB2-201 Computational Biology and Bioinformatics. February 27, 2015. Emidio Capriotti, http://biofold.org/emidio. Division of Informatics, Department of Pathology.

Variant Call Format

The final result of the variant calling procedure is a VCF file.

    ##fileformat=VCFv4.1
    ##tcgaversion=1.1
    ##reference=<id=hg19,source=.>
    ##phasing=none
    ##geneanno=none
    ##INFO=<ID=VT,Number=1,Type=String,Description="Variant type, can be SNP, INS or DEL">
    ##INFO=<ID=VLS,Number=1,Type=Integer,Description="Final validation status relative to non-adjacent Normal,... >
    ##FILTER=<ID=CA,Description="Fail Carnac (Tumor and normal coverage, tumor variant count, mapping quality,... >
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth at this position in the sample">
    ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Depth of reads supporting alleles 0/1/2/3...">
    ##FORMAT=<ID=BQ,Number=.,Type=Integer,Description="Average base quality for reads supporting alleles">
    ##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal,0=wildtype,...">
    ##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic score between 0 and 255">
    ##FORMAT=<ID=MQ60,Number=1,Type=Integer,Description="Number of reads (mapping quality=60) supporting variant">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL PRIMARY
    1 10048 . C CCT . CA VT=INS;VLS=5 GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0
    1 10078 . CT C . CA VT=DEL;VLS=5 GT:DP:AD:BQ:SS:SSC:MQ60 0/0:25:.,0:.:0:.:0 0/1:13:.,2:.:2:.:0
    1 10177 . A AC . CA VT=INS;VLS=5 GT:DP:AD:BQ:SS:SSC:MQ60 0/0:57:.,0:.:0:.:0 0/1:22:.,2:.:2:.:0
    ...
    1 900505 . G C . PASS VT=SNP;VLS=5 GT:DP:AD:BQ:SS:SSC:MQ60 0/1:188:.,89:26:1:.:81 0/1:210:.,113:24:1:.:100
    ...
    1 1991007 . G T . PASS VT=SNP;VLS=5 GT:DP:AD:BQ:SS:SSC:MQ60 0/0:222:.,1:2:0:.:1 0/1:88:.,41:25:2:50:34
    ...

File Content

The file contains information about single nucleotide variants and indels in one or more samples. For each variant, it reports the number of reads supporting the reference and alternative alleles. The original VCF does not contain any information about the functional effect of the variants.
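To make the per-sample columns concrete, here is a minimal sketch that pulls one FORMAT field (for example GT or DP) out of every sample column using awk. The two records are copied from the example VCF above and written as tab-separated lines; the function names `make_example_vcf` and `vcf_format_field` are made up for this illustration and are not part of any VCF toolkit.

```shell
#!/usr/bin/env bash

# Two records from the example VCF, written as tab-separated lines.
make_example_vcf() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tPRIMARY\n'
  printf '1\t900505\t.\tG\tC\t.\tPASS\tVT=SNP;VLS=5\tGT:DP:AD:BQ:SS:SSC:MQ60\t0/1:188:.,89:26:1:.:81\t0/1:210:.,113:24:1:.:100\n'
  printf '1\t1991007\t.\tG\tT\t.\tPASS\tVT=SNP;VLS=5\tGT:DP:AD:BQ:SS:SSC:MQ60\t0/0:222:.,1:2:0:.:1\t0/1:88:.,41:25:2:50:34\n'
}

# Print CHROM, POS and the chosen FORMAT field for every sample.
# Usage: vcf_format_field KEY < file.vcf   (hypothetical helper)
vcf_format_field() {
  awk -F'\t' -v key="$1" '
    /^##/     { next }                                   # skip meta-information lines
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }  # remember sample names
    {
      # Column 9 (FORMAT) lists the keys; find the position of the requested one.
      n = split($9, keys, ":")
      idx = 0
      for (k = 1; k <= n; k++) if (keys[k] == key) idx = k
      if (idx == 0) next
      line = $1 "\t" $2
      for (i = 10; i <= NF; i++) {
        split($i, vals, ":")
        line = line "\t" name[i] "=" vals[idx]
      }
      print line
    }'
}

make_example_vcf | vcf_format_field GT
make_example_vcf | vcf_format_field DP
```

In the second record the NORMAL sample is 0/0 and the PRIMARY sample is 0/1, which is the pattern expected for a somatic variant.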

Main data sources

Single genetic variants are collected in different databases:

dbSNP - variation from all species. http://www.ncbi.nlm.nih.gov/snp/
EVS - specific for human. http://evs.gs.washington.edu/evs/
ClinVar - variants and human health. http://www.ncbi.nlm.nih.gov/clinvar/
COSMIC - somatic mutations in cancer. http://cancer.sanger.ac.uk/

This information is important for variant calling, but it cannot capture the complexity of the genotype/phenotype relationship. VCF files are more informative because they let us analyze co-occurring events. The major sources of VCF data are:

1000 Genomes: WGS data of individuals. http://www.1000genomes.org/
TCGA: cancer genomes. https://tcga-data.nci.nih.gov/

Most common tools

The most common tools for the manipulation of VCF files are:

tabix: a fast indexer for tab-separated files, distributed with samtools. http://samtools.sourceforge.net/
vcftools: a package designed for working with VCF files. http://vcftools.sourceforge.net/

Tabix with SAM and VCF

Tabix works with bgzip-compressed files. It needs two files: the bgzipped data file, which must be sorted by position, and its index file.

    > bgzip $file.sam
    > tabix -p sam $file.sam.gz
    > tabix $file.sam.gz chr:pos1-pos2

How can we get the TP53 variants present in 1000 Genomes? TP53 = chr17:7571720-7590868

    > tabix -h $ftpfile.gz chr:pos1-pos2

chr17: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
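The TP53 query can be sketched as a small script. One detail to check before running it: tabix matches the chromosome name literally, and the 1000 Genomes phase 1 files are assumed here to name chromosome 17 as `17` rather than `chr17`, so the sketch strips the prefix first. The helper names `strip_chr_prefix` and `fetch_region` are hypothetical, and the remote call is left commented out because it needs network access and tabix on the PATH.

```shell
#!/usr/bin/env bash

# Turn "chr17:7571720-7590868" into "17:7571720-7590868".
# Regions without the prefix are returned unchanged.
strip_chr_prefix() {
  printf '%s\n' "${1#chr}"
}

# Print the header (-h) and the records overlapping a region of a
# bgzipped, tabix-indexed VCF given as a local path or a URL.
fetch_region() {
  local vcf="$1" region="$2"
  tabix -h "$vcf" "$(strip_chr_prefix "$region")"
}

TP53="chr17:7571720-7590868"
echo "tabix region: $(strip_chr_prefix "$TP53")"

# Example call against the chr17 file listed above:
# fetch_region "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" "$TP53" > TP53.1000g.vcf
```

If the query returns only the header, the chromosome naming is the first thing to verify, for instance with `tabix -l`, which lists the sequence names stored in the index.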

vcftools

A set of tools for the manipulation of VCF files.

    > vcf-merge $file1.vcf.gz $file2.vcf.gz   # indexed files
    > cat $file.vcf | vcf-tstv
    > vcf-query $file.vcf.gz chr:pos1-pos2

Select particular samples in a multi-sample VCF:

    > vcf-subset -c sample1,sample2 $file.vcf.gz
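To show what a transition/transversion summary counts, here is a rough awk sketch that classifies biallelic SNPs. It is not the vcf-tstv implementation and its output format is invented for this example. A>G, G>A, C>T and T>C are transitions; every other single-base change is a transversion. The first three records come from the example VCF earlier in these notes; the last one (A>G at position 2000000) is made up so that the sketch sees a transition.

```shell
#!/usr/bin/env bash

# Count transitions (ts) and transversions (tv) among single-base,
# single-ALT records read from standard input. Indels are skipped.
tstv_counts() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    length($4) == 1 && length($5) == 1 && $5 != "." {
      pair = toupper($4 $5)
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
      else tv++
    }
    END { printf "ts=%d tv=%d\n", ts, tv }'
}

# Only the first five VCF columns are needed for this count.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '#CHROM' 'POS' 'ID' 'REF' 'ALT' \
  1 10048   . C CCT \
  1 900505  . G C \
  1 1991007 . G T \
  1 2000000 . A G \
  | tstv_counts
```

The insertion at position 10048 is ignored, G>C and G>T count as transversions, and the made-up A>G record counts as a transition.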

Variant Annotation

There are different tools for variant annotation; among the most used are ANNOVAR and SnpEff.

    # Annotation
    > java -jar snpEff.jar $db $file.vcf > $file.snpeff.vcf
    # Filtering
    > java -jar SnpSift.jar extractFields -s "," -e "." $file.anno.vcf CHROM POS REF ALT "ANN[*].EFFECT" "ANN[*].GENE" "ANN[*].HGVS_P"
    # Remove 0/0
    > cat $file.snpeff.vcf | java -jar SnpSift.jar rmRefGen
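Once a VCF carries SnpEff annotations, simple counts can be taken directly from the INFO column. The sketch below counts records with at least one `missense_variant` annotation. It assumes the ANN layout in which each comma-separated annotation is a `|`-separated list whose second field is the effect, with several effects joined by `&`. The function name `count_missense`, the gene names and the ANN values in the demo records are all invented for illustration, and the ANN entries are truncated to their first four fields.

```shell
#!/usr/bin/env bash

# Count VCF records whose ANN field has at least one missense_variant effect.
count_missense() {
  awk -F'\t' '
    /^#/ { next }
    {
      n = split($8, info, ";")                    # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^ANN=/) continue
        m = split(substr(info[i], 5), anns, ",")  # one entry per annotation
        for (j = 1; j <= m; j++) {
          split(anns[j], f, "|")
          # f[2] is the effect; pad with "&" so combined terms match exactly
          if (("&" f[2] "&") ~ /&missense_variant&/) { count++; break }
        }
      }
    }
    END { print count + 0 }'
}

# Demo with made-up annotations (GENE_A, GENE_B and GENE_C are placeholders).
{
  printf '1\t900505\t.\tG\tC\t.\tPASS\tVT=SNP;ANN=C|missense_variant|MODERATE|GENE_A\n'
  printf '1\t1991007\t.\tG\tT\t.\tPASS\tVT=SNP;ANN=T|synonymous_variant|LOW|GENE_B\n'
  printf '1\t2000000\t.\tA\tG\t.\tPASS\tVT=SNP;ANN=G|missense_variant&splice_region_variant|MODERATE|GENE_C\n'
} | count_missense
```

To count per sample, the same function can be applied after selecting one sample with vcf-subset and removing its 0/0 genotypes.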

All in One

Exomiser is a variant analysis tool that tests for the presence of variants associated with specific phenotypes. http://www.sanger.ac.uk/resources/software/exomiser/submit/

Problem

Write a shell script that takes as input a genomic location (chr:start-end) and a sample ID, and annotates the returned portion of the genome. Calculate the number of missense variants for two samples of your choice.