CNV and variant detection for human genome resequencing data - for biomedical researchers (II)


Chuan-Kun Liu (劉傳崑), Senior Manager, National Center for Genome Medicine, bioit@ncgm.sinica.edu.tw

Abstract
- Common NGS Data Analysis Pipelines
- Data Output of NGS Platforms
- Quality Check and Read Trimming
- Sequence Alignment
- Variant Detection
- HiPipe Project

Common NGS Data Analysis Pipelines

Common NGS data analysis pipelines (DNA)
- Format Conversion and Demultiplexing (CASAVA)
- Quality Check (FastX / FastQC)
- Sequence Alignment (BWA)
- de novo Assembly (Velvet)
- Duplicate Marking (Sambamba)
- Variant Calling (FreeBayes)
- Sequence Realignment (GATK)
- Duplicate Marking (GATK)
- Variants Annotation (VarioWatch)
- Multi-sample Variant Calling (GATK)
- LOH Detection (ExomeCNV or NCGM)
- CNV Detection (ExomeCNV)
- Translocation (SVDetect)
- Variants Annotation (VarioWatch)
- Visualization (Circos)

Common NGS data analysis pipelines (RNA)
- Format Conversion and Demultiplexing (CASAVA)
- Quality Check (FastX / FastQC)
- Sequence Alignment (Bowtie2)
- Sequence Alignment (Bowtie)
- de novo Assembly (Trinity)
- Differential Expression (Cufflinks)
- Sequence Realignment (GATK)
- Gene-fusion Detection (TopHat-fusion)
- miRNA Expression Analysis (miRDeep2)
- Visualization (CummeRbund)
- Variant Calling (GATK)
- Gene Annotation (VarioWatch)
- Gene Annotation (VarioWatch)
- Variants Annotation (VarioWatch)

NGS data analysis pipelines (Others)
- ChIP
- RNA IP
- Methylation
- Aptamer
- Virus integration
- Metagenomics

Data Output of NGS Platforms

Data output of NGS platforms
- Illumina HiSeq Platform: BCL files
- Roche GS FLX+: Standard Flowgram Format (SFF) files
- Thermo Fisher Ion Proton: FASTQ files
- PacBio RS II: FASTQ files
Common format for downstream analysis: FASTQ

Format Conversion and Demultiplexing
Illumina HiSeq 2000 (multiplexed BCL files) -> FASTQ by samples

Example FASTQ record:
@HWI-ST688:211:C0F02ACXX:2:2310:7574:93175 2:N:0:GCCAAT
ATGCAAATAAACTAGAAAATCTAGAAGAAATGGAGAAATTCCTGGACACAC
+
CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
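The Phred table on this slide follows Q = -10 * log10(P), where P is the probability of an incorrect base call. The sketch below is a minimal illustration, assuming the quality line of the example read uses an ASCII offset of 33 (the Sanger / Illumina 1.8+ convention); the function names are illustrative and do not come from any tool named in the talk.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


# Quality line of the example read shown on the slide.
qual = "CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???"
scores = decode_qualities(qual)

print("first three scores:", scores[:3])
print("last three scores:", scores[-3:])
print("error probability at Q30:", phred_to_error_prob(30))
```

With this offset, the leading `C` characters decode to Q34 and the trailing `?` characters to Q30, so every base of the example read sits at or above the 1-in-1000 error level.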

ASCII code

Three described FASTQ variants

Description              OBF name         ASCII range   Offset   Quality score type   Quality score range
Sanger standard*         fastq-sanger     33-126        33       PHRED                0 to 93
Solexa/early Illumina*   fastq-solexa     59-126        64       Solexa               -5 to 62
Illumina 1.3+*           fastq-illumina   64-126        64       PHRED                0 to 62
Illumina 1.8+            fastq-sanger     33-126        33       PHRED                0 to 93

*Nucleic Acids Res. 2010 April; 38(6): 1767-1771.
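Because the variants differ only in their ASCII range and offset, the observed characters narrow down which encoding a file can be using. The following sketch lists the encodings from the table that are compatible with a set of quality strings. It is a range check only, not a definitive detector: a handful of high-quality reads can fit more than one variant, so in practice many reads should be inspected. The function and constant names are hypothetical.

```python
# ASCII ranges taken from the table of FASTQ variants on this slide.
ASCII_RANGES = {
    "fastq-sanger": (33, 126),
    "fastq-solexa": (59, 126),
    "fastq-illumina": (64, 126),
}


def compatible_encodings(quality_strings):
    """Return the encodings whose ASCII range contains every observed character."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (low, high) in ASCII_RANGES.items()
        if low <= lowest and highest <= high
    ]


# A read containing '#' (ASCII 35) can only be offset-33 data.
print(compatible_encodings(["#1=DDFFFHHHHH"]))
# A read whose lowest character is '?' (ASCII 63) rules out fastq-illumina only.
print(compatible_encodings(["CCCFFFFFHHHGHJ???"]))
```

The second call shows why a single read is often not enough: characters between `?` and `J` are valid in both the Sanger and the Solexa ranges.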

Quality Check and Read Trimming

Quality Check: FastQC
- Project Home: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Merge fastq files into one file
- Java environment
- Interactive UI

Trimming (When)
When to trim reads
- Format conversion
- Before alignment
- Upon alignment
Trimming
- Trimming by lane
- Trimming by reads

Trimming (How)
How to trim reads
- Specify the bases to be trimmed (e.g. n5y*n5)
- Specify the lower-bound threshold of base quality and trim accordingly

@HWI-ST688:211:C0F02ACXX:2:2310:7574:93175 2:N:0:GCCAAT
ATGCAAATAAACTAGAAAATCTAGAAGAAATGGAGAAATTCCTGGACACAC
+
CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???

A tool for read trimming
Trimmomatic: A flexible read trimming tool for Illumina NGS data
- Project Home: http://www.usadellab.org/cms/?page=trimmomatic
- Adaptor trimming
- Quality trimming using a sliding window strategy: cuts a read when the average base quality within a sliding window drops below the lower-bound threshold
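The sliding window strategy can be sketched in a few lines. This is a simplified illustration of the idea described on the slide, not Trimmomatic's actual implementation; the window size of 4 and the threshold of Q20 are arbitrary values chosen for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Cut a read at the first window whose mean quality is below min_mean_q.

    seq   -- read bases as a string
    quals -- Phred scores, one per base
    Returns the kept (seq, quals) prefix.
    """
    if not quals:
        return seq, quals
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_mean_q:
            # Everything from the start of the failing window is dropped.
            return seq[:start], quals[:start]
    return seq, quals


# Hypothetical read whose quality decays toward the 3' end.
bases = "ACGTACGTAC"
scores = [38, 38, 37, 36, 35, 30, 12, 10, 8, 2]
trimmed_bases, trimmed_scores = sliding_window_trim(bases, scores)
print(trimmed_bases, trimmed_scores)
```

In this example the window starting at the sixth base (scores 30, 12, 10, 8) is the first whose mean falls below 20, so the read is cut to its first five bases.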

Trimming Threshold
- Bases above Q30: 85% (2 x 50 bp), 80% (2 x 100 bp)
  http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.html
- Q30 or Q24 or Q5?
- Is it necessary to trim reads?
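One way to inform the choice of threshold is to measure what share of bases in a run reaches a given quality. The sketch below counts bases at or above a threshold; treating "above Q30" as "Q30 or higher" and using an ASCII offset of 33 are both assumptions, and the function name is hypothetical.

```python
def fraction_at_or_above(quality_strings, threshold=30, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    offset=33 assumes Sanger / Illumina 1.8+ encoded quality strings.
    """
    total = 0
    passed = 0
    for quality in quality_strings:
        for ch in quality:
            total += 1
            if ord(ch) - offset >= threshold:
                passed += 1
    return passed / total if total else 0.0


# The example read from the earlier slides: every base is Q30 or better.
example = ["CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???"]
print(fraction_at_or_above(example))

# Hypothetical read with two good bases ('I' = Q40) and two poor ones ('#' = Q2).
print(fraction_at_or_above(["II##"]))
```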

Sequence Alignment

Tools for Sequence Alignment
Open source
- BWA aln, BWA mem
- Bowtie, Bowtie2
- STAR
Commercial
- CASAVA, Isaac (Illumina)
- CLC Genomics Workbench

Human reference sequence
- NCBI36 (Aug, 2005)
  - The last assembly produced by the Human Genome Project (HGP)
  - hg18, Ensembl release 54
- GRCh37 (Feb, 2009)
  - The first assembly submitted by the Genome Reference Consortium (GRC)
  - hg19, Ensembl release 55~75 (v75 Feb, 2014)
- GRCh38 (Dec, 2013)
  - hg38, Ensembl release 76+ (v76 Aug, 2014)

GRCh37
- Primary assembly
  - Chromosome assembly (chr1-22, X, Y, Mt)
  - Unlocalized sequence
  - Unplaced sequence
- Alternate loci: a sequence that provides an alternate representation of a locus found in a largely haploid assembly (MHC region, UGT2B17, MAPT)
- Patches: a contig sequence that is released outside of the full assembly release cycle

Choose your reference genome
Human_g1k_v37
- The reference sequence provided by the 1000 Genomes Project*
- Unlocalized and unplaced sequences are included (full primary assembly)
- The mitochondrial sequence was replaced with the revised Cambridge Reference Sequence (rCRS; AC: NC_012920) (Apr 30, 2010)**
- Male or female
*ftp://ftp.ncbi.nih.gov/1000genomes/ftp/technical/reference/readme.human_g1k_v37.fasta.txt
**http://www.ncbi.nlm.nih.gov/nuccore/nc_012920

SAM/BAM format
- Sequence Alignment/Map (SAM) format: text file
- BAM format: binary file, to reduce the file size

Coverage
- Whole genome sequencing: mean coverage
- Whole exome sequencing: over 90% of the target region is covered at 0.2x the mean coverage or higher
- Custom panel
  - PCR-based (~ mean coverage)
  - Capture-based (~ whole exome sequencing)
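Both coverage measures are simple to compute once per-base depths are available. The sketch below is illustrative only: the 3 Gb genome size is a round-number assumption, the per-base depths are made-up values, and the function names are hypothetical.

```python
def mean_coverage(num_reads, read_length, target_size):
    """Expected mean coverage: total sequenced bases divided by target size."""
    return num_reads * read_length / target_size


def fraction_covered(depths, min_fraction_of_mean=0.2):
    """Fraction of target bases whose depth is >= min_fraction_of_mean * mean depth."""
    mean_depth = sum(depths) / len(depths)
    cutoff = min_fraction_of_mean * mean_depth
    return sum(1 for depth in depths if depth >= cutoff) / len(depths)


# Whole genome: 900 million reads of 100 bp over an assumed 3 Gb genome.
print(mean_coverage(900_000_000, 100, 3_000_000_000))

# Whole exome style metric on ten hypothetical target positions.
depths = [100, 120, 80, 10, 0, 90, 110, 95, 105, 90]
print(fraction_covered(depths))
```

In the second example the mean depth is 80, so the cutoff is 16; eight of the ten positions reach it, which would fall short of the 90% criterion on the slide.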

TruSeq Exome Enrichment Kit

Variant Detection

DNA Variant Detection
[Figure: sequencing reads aligned to the reference sequence (GRCh37); variants are detected from the alignment result. Base colour legend: A, C, G, T]

Variant Detection
- Input: BAM files
- Output: VCF files
Tools
- Germline mutation
  - Genome Analysis Toolkit (GATK) (MAF > 5%)
  - FreeBayes
- Somatic mutation
  - MuTect
  - Somatic Variant Caller (MAF > 1%)

Genome Analysis Toolkit (GATK)
- Project Home: https://www.broadinstitute.org/gatk/
- Maintenance: Broad Institute
- License
  - Version before 2.3-9: free for all users (the MIT license)
  - Version after 2.4: free for academics, fee for commercial use

GATK Best Practices
1. Pre-processing
   - Mark duplicates
   - Realign indels
   - Recalibrate bases
2. Variant discovery
   - Call variants
   - Filter variants
3. Callset refinement
   - Refine genotypes
   - Annotate variants
   - Evaluate variants
https://www.broadinstitute.org/gatk/guide/best-practices

Mark Duplicates
AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
  GGAAACCACACAGGCTT---AGGCCATTGGAA
   GAAACCACACAGGCTT---AGGCCATTGGAAT
    AAACCACACAGGCTT---AGGCCATTGG
    AAACCACACAGGCTT---AGGCCATTGG
       CCACACAGGCTT---AGGCCATTGGAA
        CACACAGGCTT---AGGCCATTGGAAT

Mark Duplicates
- The same DNA fragment may be sequenced several times.
- The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant.
- The process only marks the reads; it does not remove them.
- It should NOT be applied to amplicon sequencing data.
https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1
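The marking step can be pictured as grouping reads that start at the same place and flagging all but one of them. The sketch below is a deliberately simplified model, not the algorithm of any specific tool: it assumes single-end reads described by hypothetical dictionary fields, groups them by chromosome, start position and strand, keeps the read with the highest summed base quality, and sets the SAM duplicate bit (0x400) on the others without deleting anything.

```python
from collections import defaultdict

DUPLICATE_FLAG = 0x400  # SAM flag bit for a PCR or optical duplicate


def mark_duplicates(reads):
    """Set the duplicate flag on all but the best read at each start position.

    Each read is a dict with the (hypothetical) keys:
    name, chrom, pos, strand, base_qual_sum, flag.
    Reads are marked in place and never removed.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    for group in groups.values():
        best = max(group, key=lambda r: r["base_qual_sum"])
        for read in group:
            if read is not best:
                read["flag"] |= DUPLICATE_FLAG
    return reads


# Hypothetical reads; r3 and r4 start at the same position on the same strand.
reads = [
    {"name": "r1", "chrom": "1", "pos": 3, "strand": "+", "base_qual_sum": 1100, "flag": 0},
    {"name": "r2", "chrom": "1", "pos": 4, "strand": "+", "base_qual_sum": 1150, "flag": 0},
    {"name": "r3", "chrom": "1", "pos": 5, "strand": "+", "base_qual_sum": 1000, "flag": 0},
    {"name": "r4", "chrom": "1", "pos": 5, "strand": "+", "base_qual_sum": 900, "flag": 0},
]
for read in mark_duplicates(reads):
    print(read["name"], bool(read["flag"] & DUPLICATE_FLAG))
```

A variant caller can then skip any read whose flag has the 0x400 bit set, which is how marked duplicates stop counting as extra evidence while staying in the file.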

Realign Indels (Before)
AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
  GGAAACCACACAGGCTT---AGGCCATTGGAA
   GAAACCACACAGGC---TTAGGCCATTGGAAT
    AAACCACACAGGCT---TAGGCCATTGG
    AAACCACACAGGCTT---AGGCCATTGG
       CCACACAGGC---TTAGGCCATTGGAA
        CACACAGG---CTTAGGCCATTGGAAT

Realign Indels (After)
AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
  GGAAACCACACAGGCTT---AGGCCATTGGAA
   GAAACCACACAGGCTT---AGGCCATTGGAAT
    AAACCACACAGGCTT---AGGCCATTGG
    AAACCACACAGGCTT---AGGCCATTGG
       CCACACAGGCTT---AGGCCATTGGAA
        CACACAGGCTT---AGGCCATTGGAAT

Realignment
Reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts.
https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1

FreeBayes
- Project Home: https://github.com/ekg/freebayes
- Maintenance: Erik Garrison, Gabor Marth (http://arxiv.org/abs/1207.3907)
- License: free for all users (the MIT license)
- Pipelines using FreeBayes
  - SpeedSeq (http://dx.doi.org/10.1038/nmeth.3505)
  - Varpipe (http://www.ashg.org/2015meeting/abstract/detail?page=inthtml&project=ashg15&id=150120759)

VCF format
(1) Summary and description of fields
(2) Basic info of variants
(3) Variants info by individual
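The three parts map onto the text layout of a VCF file: `##` meta lines describe the fields, the first eight tab-separated columns of each record hold the basic variant information, and the FORMAT column plus one column per sample hold the per-individual information. The parser below is a minimal sketch that assumes every record carries genotype columns; the example file content, including the sample names, is made up for illustration.

```python
def parse_vcf(text):
    """Split VCF text into meta lines, sample names and parsed records."""
    meta, samples, records = [], [], []
    for line in text.strip().splitlines():
        if line.startswith("##"):
            # (1) Summary and description of fields
            meta.append(line)
        elif line.startswith("#CHROM"):
            # Column header: sample names start at the 10th column
            samples = line.split("\t")[9:]
        else:
            cols = line.split("\t")
            # (2) Basic info of the variant: the eight fixed columns
            record = {
                "chrom": cols[0],
                "pos": int(cols[1]),
                "id": cols[2],
                "ref": cols[3],
                "alt": cols[4].split(","),
                "qual": cols[5],
                "filter": cols[6],
                "info": dict(item.partition("=")[::2] for item in cols[7].split(";")),
            }
            # (3) Variant info by individual: FORMAT keys paired with each sample column
            keys = cols[8].split(":")
            record["samples"] = {
                name: dict(zip(keys, col.split(":")))
                for name, col in zip(samples, cols[9:])
            }
            records.append(record)
    return meta, samples, records


# Hypothetical two-sample VCF used only to exercise the parser.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:14\t1/1:16",
])

meta, samples, records = parse_vcf(EXAMPLE_VCF)
print(len(meta), samples)
print(records[0]["chrom"], records[0]["pos"], records[0]["samples"])
```

For real data a maintained VCF library is the safer choice; this sketch only makes the three-part structure concrete.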

Output File Size

Data                                    Human whole genome 30X   Human whole exome 100X
Raw reads (FASTQ)                       100 GB                   20 GB
Alignment results (BAM)                 100 GB                   20 GB
Variant detection results (VCF)         2 GB                     0.1 GB
Functional analysis of variants (CSV)   1 GB                     0.1 GB

HiPipe Project

Why HiPipe?
NGS data analysis is hard for researchers with a biology background:
- Most analysis tools run only in the Linux environment
- Few websites provide proper analysis
- Many tools are only applicable to small genome analysis
- It takes a long time to get a result
- Transferring large amounts of data between websites is required to complete an analysis
- Orchestrating different analysis tools is error-prone

User Expectations for NGS
- Easy to learn and use analysis tools
- Researchers without a computer science background can complete an analysis themselves
- Doing analysis anywhere, anytime
- Suitable for any genome size
- No file size limitation
- Upload sequence data upon read sequencing
- Get an analysis result rapidly
- Combine different experiment data in one analysis

Challenges
- Familiarity with bioinformatics tools (行中, 亭均)
- Job scheduling and pipeline development (Louis)
- Cluster and distributed system management (方智)
- Performance tuning
  - Software (耀德, 仁屏)
  - Hardware (MIS Team): bandwidth, disk I/O
- User interface (爾瞻, 萬嘉, 中饋)
- Team coordination (傳崑)
Adam Yao

Architecture

HiPipe (Basic) http://hipipe.ncgm.sinica.edu.tw

HiPipe (Basic)
- Online since Jun, 2013
- Provides 7 popular NGS data analysis pipelines:
  - Whole Genome & Exome Variant Detection
  - Differential Expression Analysis
  - Gene Fusion Detection
  - miRNA Analysis
  - RNA Variant Detection
  - De novo Assembly
  - Exome CNV Detection

HiPipe Professional
- Authentication / Authorization
- Web-based interface supporting unlimited upload file size and resumable file upload
- Multi-sample indel realignment and variant detection
- Cloud / Onsite
- Providing integrated NGS data analysis and data store service

HiPipe Basic vs HiPipe Professional
HiPipe: HiPipe Basic
HiPipe Professional: HiPipe Cloud, HiPipe Cluster, HiPipe Quad, HiPipe Duo, HiPipe Uno

                                           Basic     Cloud     Cluster   Quad    Duo       Uno
Single sample analysis
Multi-sample analysis
Data storage location                      Cloud     Cloud     Local     Local   Local     Local
Nodes                                      10        10        10        1       1         1
CPU cores                                  640       640       640       64      16        6
40X human whole genome variant detection   56 mins   56 mins   56 mins   7 hrs   ~12 hrs   ~36 hrs
Expected ship date                         2013 Q2   2015 Q4   2016      2016    2016      2016

Authentication / Authorization
1. Select a project
2. Add or remove members of the project

Upload files through a browser
- Unlimited upload file size and resumable file upload

Create Project / Analysis

Configure settings

Choose samples

Save settings

Confirm settings

Launch analysis

Multi-sample analysis result
- Multi-sample indel realignment

Integrative Genomics Viewer
1. Integrate and visualize NGS, microarray and annotation data in genomics
2. Load data only in specified regions to reduce physical memory usage
3. Customize the attributes of samples for further filtering
4. Browse multiple regions at the same time
5. It's FREE
http://www.broadinstitute.org/igv

Browse VCF and BAM in IGV

VCF Track I: the allele fraction
[Figure legend: A, C, G, T]

VCF Track II: Heterozygous, Homozygous, Reference
[Figure legend: A, C, G, T]
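The three genotype classes shown in the track correspond to the GT field of each sample in the VCF file. The sketch below classifies a GT string in that spirit; it is an illustration of the VCF convention (0 for the reference allele, positive integers for alternate alleles, `.` for a missing call), not a description of how IGV computes its display, and the function name and returned labels are hypothetical.

```python
def classify_genotype(gt):
    """Classify a VCF GT string such as '0/1', '1|1' or './.'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no call"
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous variant"


for gt in ["0/0", "0/1", "1/1", "1|2", "./."]:
    print(gt, "->", classify_genotype(gt))
```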

Zoom in to see alignment results
Thorvaldsdóttir H et al. Brief Bioinform 2013;14:178-192. © The Author(s) 2012. Published by Oxford University Press.

Browse alignment results in a larger region
- Set to 200 kb for browsing most genes
- Status: memory usage

VarioWatch: functional analysis of variants
http://genepipe.ncgm.sinica.edu.tw/variowatch

VarioWatch Visualized result

Performance (Human)

Analysis        Estimated Analysis Time   Note
Whole Genome    56 minutes                40x Coverage
Exome           10 minutes                100x Coverage
Transcriptome   1-3 days                  12 * 2 Gb (Paired Sample)
de novo         24 hours                  100 Gb, K-mer = 73

NGS data analysis service

Category   Analysis Pipeline                  Samples
DNA        Whole Genome Variant Detection     135
DNA        Whole Exome Variant Detection      1241
DNA        CNV Detection                      68
DNA        LOH Detection                      47
DNA        de novo Assembly                   3
RNA        Differential Expression Analysis   153
RNA        Gene-fusion Detection              32
RNA        miRNA Expression Analysis          14
RNA        Variant Detection                  43
RNA        de novo Assembly                   4
Other      ChIP                               25
Other      Aptamer                            25
Other      RNA IP                             15
Other      Virus integration                  15
Total (2011/10 ~ 2015/06)                     1820

Applying for NGS data analysis service
1. Contact us and discuss the details in a Pre-Service Meeting
2. Create accounts for the LIMS system and the issue tracking system (JIRA)
3. Create an issue for data analysis on the issue tracking system (JIRA)
4. Confirm the analysis pipeline and the estimated price
5. Transfer the raw data (raw reads)
6. NGS data analysis
7. Provide NGS data analysis results (preserved for 6 months)
8. Confirm the final price
9. Case closed and charged through the LIMS system
10. Discuss results in a Post-Service Meeting

HiPipe Basic (for free): http://hipipe.ncgm.sinica.edu.tw
HiPipe Cloud (beta), apply for a trial account: https://goo.gl/i09e2s
Thanks for your attention!