CNV and variant detection for human genome resequencing data - for biomedical researchers (II)

Chuan-Kun Liu 劉傳崑
Senior Manager, National Center for Genome Medicine
bioit@ncgm.sinica.edu.tw

Abstract
- Common NGS Data Analysis Pipelines
- Data Output of NGS Platforms
- Quality Check and Read Trimming
- Sequence Alignment
- Variant Detection
- HiPipe Project

Common NGS Data Analysis Pipelines

Common NGS data analysis pipelines (DNA)
- Format Conversion and Demultiplexing (CASAVA)
- Quality Check (FastX / FastQC)
- Sequence Alignment (BWA)
- de novo Assembly (Velvet)
- Mark Duplicates (Sambamba)
- Variant Calling (FreeBayes)
- Sequence Realignment (GATK)
- Mark Duplicates (GATK)
- Variant Annotation (VarioWatch)
- Multi-sample Variant Calling (GATK)
- LOH Detection (ExomeCNV or NCGM)
- CNV Detection (ExomeCNV)
- Translocation (SVDetect)
- Variant Annotation (VarioWatch)
- Visualization (Circos)

Common NGS data analysis pipelines (RNA)
- Format Conversion and Demultiplexing (CASAVA)
- Quality Check (FastX / FastQC)
- Sequence Alignment (Bowtie2)
- Sequence Alignment (Bowtie)
- de novo Assembly (Trinity)
- Differential Expression (Cufflinks)
- Sequence Realignment (GATK)
- Gene-fusion Detection (TopHat-fusion)
- miRNA Expression Analysis (miRDeep2)
- Visualization (CummeRbund)
- Variant Calling (GATK)
- Gene Annotation (VarioWatch)
- Variant Annotation (VarioWatch)

NGS data analysis pipelines (Others)
- ChIP
- RNA IP
- Methylation
- Aptamer
- Virus integration
- Metagenomics

Data Output of NGS Platforms

Data output of NGS platforms
- Illumina HiSeq platform: BCL files
- Roche GS FLX+: Standard Flowgram Format (SFF) files
- Thermo Fisher Ion Proton: FASTQ files
- PacBio RS II: FASTQ files
Common format for downstream analysis: FASTQ

Format Conversion and Demultiplexing
Illumina HiSeq 2000 (multiplexed BCL files) -> FASTQ by sample

Example FASTQ record:
@HWI-ST688:211:C0F02ACXX:2:2310:7574:93175 2:N:0:GCCAAT
ATGCAAATAAACTAGAAAATCTAGAAGAAATGGAGAAATTCCTGGACACAC
+
CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
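The Phred table above follows the standard relation Q = -10 * log10(P), where P is the probability of an incorrect base call. A minimal Python sketch of that conversion is given here; the function names are illustrative and not part of any tool mentioned in these slides.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Rebuild the rows of the table from the formula.
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f} %")
```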

ASCII code

Three described FASTQ variants

Description             OBF name         ASCII range   Offset   Score type   Score range
Sanger standard*        fastq-sanger     33-126        33       PHRED        0 to 93
Solexa/early Illumina*  fastq-solexa     59-126        64       Solexa       -5 to 62
Illumina 1.3+*          fastq-illumina   64-126        64       PHRED        0 to 62
Illumina 1.8+           fastq-sanger     33-126        33       PHRED        0 to 93

*Nucleic Acids Res. 2010 April; 38(6): 1767-1771.
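Decoding a quality string comes down to subtracting the offset from each character's ASCII code. The sketch below assumes PHRED-type scores (offset 33 for fastq-sanger and Illumina 1.8+, offset 64 for Illumina 1.3+); the function name is hypothetical, and Solexa-type scores would need an extra conversion that is not shown.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer quality scores.

    offset=33 matches fastq-sanger / Illumina 1.8+, offset=64 matches
    Illumina 1.3+ (see the table of FASTQ variants).
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    # Quality line of the example read shown in these slides.
    quality = "CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???"
    scores = decode_qualities(quality)
    print(scores[:5], "...", scores[-3:])
```

With offset 33 the trailing `?` characters decode to Q30 and the leading `C` characters to Q34.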

Quality Check and Read Trimming

FastQC (Quality Check)
Project Home: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Merge FASTQ files into one file
- Java environment
- Interactive UI

Trimming (When)
When to trim reads
- Format conversion
- Before alignment
- Upon alignment
Trimming
- Trimming by lane
- Trimming by reads

Trimming (How)
How to trim reads
- Specify the bases to be trimmed (e.g. n5y*n5)
- Specify the lower bound threshold of base quality and trim accordingly

@HWI-ST688:211:C0F02ACXX:2:2310:7574:93175 2:N:0:GCCAAT
ATGCAAATAAACTAGAAAATCTAGAAGAAATGGAGAAATTCCTGGACACAC
+
CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???

A tool for read trimming
Trimmomatic: a flexible read trimming tool for Illumina NGS data
Project Home: http://www.usadellab.org/cms/?page=trimmomatic
- Adaptor trimming
- Quality trimming using a sliding window strategy: cuts a read when the average base quality within a sliding window drops below the lower bound threshold
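The sliding window strategy can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Trimmomatic's actual implementation: the function name, the default window size, and the default threshold are assumptions, and reads shorter than one window are left untouched.

```python
def sliding_window_trim(sequence, qualities, window=4, min_mean_quality=15):
    """Cut a read at the first window whose mean quality is below the threshold.

    sequence  -- read bases as a string
    qualities -- list of integer quality scores, one per base
    Returns the (possibly shortened) sequence and quality list.
    """
    for start in range(0, len(qualities) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            # Keep everything before the first failing window.
            return sequence[:start], qualities[:start]
    return sequence, qualities


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    quals = [40, 40, 40, 40, 40, 40, 2, 2, 2, 2]
    print(sliding_window_trim(seq, quals, window=4, min_mean_quality=20))
```

In this example the window starting at the sixth base is the first one whose mean falls below 20, so the read is cut to its first five bases.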

Trimming Threshold
Bases above Q30 (Illumina performance specifications)
- 85% (2 x 50 bp)
- 80% (2 x 100 bp)
Q30 or Q24 or Q5?
Is it necessary to trim reads?
http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.html
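To compare a run against a "bases above Q30" figure, the fraction of bases at or above a quality threshold can be counted directly from the quality strings. The sketch below assumes offset-33 PHRED encoding and treats "above Q30" as Q >= 30; the function name is hypothetical.

```python
def fraction_at_or_above(quality_strings, threshold=30, offset=33):
    """Fraction of all bases whose quality score is >= threshold."""
    total = 0
    passing = 0
    for quality_string in quality_strings:
        for ch in quality_string:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


if __name__ == "__main__":
    # Quality line of the example read shown in these slides.
    example = ["CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???"]
    print(f"{100 * fraction_at_or_above(example):.1f} % of bases at or above Q30")
```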

Sequence Alignment

Tools for Sequence Alignment
Open source
- BWA aln, BWA mem
- Bowtie, Bowtie2
- STAR
Commercial
- CASAVA, Isaac (Illumina)
- CLC Genomics Workbench

Human reference sequence
- NCBI36 (Aug 2005): the last assembly produced by the Human Genome Project (HGP); hg18, Ensembl release 54
- GRCh37 (Feb 2009): the first assembly submitted by the Genome Reference Consortium (GRC); hg19, Ensembl releases 55-75 (v75: Feb 2014)
- GRCh38 (Dec 2013): hg38, Ensembl release 76+ (v76: Aug 2014)

GRCh37
- Primary assembly
  - Chromosome assembly (chr1-22, X, Y, Mt)
  - Unlocalized sequence
  - Unplaced sequence
- Alternate loci: a sequence that provides an alternate representation of a locus found in a largely haploid assembly (MHC region, UGT2B17, MAPT)
- Patches: a contig sequence that is released outside of the full assembly release cycle

Choose your reference genome
Human_g1k_v37
- The reference sequence provided by the 1000 Genomes Project*
- Unlocalized and unplaced sequences are included (full primary assembly)
- The mitochondrial sequence was replaced with the revised Cambridge Reference Sequence (rCRS; AC:NC_012920) (Apr 30, 2010)**
- Male or female

*ftp://ftp.ncbi.nih.gov/1000genomes/ftp/technical/reference/readme.human_g1k_v37.fasta.txt
**http://www.ncbi.nlm.nih.gov/nuccore/nc_012920

SAM/BAM format
- Sequence Alignment/Map (SAM) format: text file
- BAM format: binary file, used to reduce the file size
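Because SAM is tab-separated text, one alignment line can be split into its 11 mandatory fields with plain string handling. The sketch below is illustrative only: the field names and FLAG bit values follow the SAM specification, while the function name and the example line are made up. Real pipelines would normally use samtools or a dedicated library instead.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse the 11 mandatory fields of one SAM alignment line."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    flag = record["FLAG"]
    # A few of the FLAG bits defined by the SAM specification.
    record["flags"] = {
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
        "duplicate": bool(flag & 0x400),
    }
    return record


if __name__ == "__main__":
    # Hypothetical alignment line, for illustration only.
    line = "read1\t1040\t1\t100\t60\t5M\t=\t200\t150\tACGTA\tIIIII"
    print(parse_sam_line(line))
```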

Coverage
- Whole genome sequencing: mean coverage
- Whole exome sequencing: over 90% of the target region covered at 0.2x the mean coverage or more
- Custom panel
  - PCR-based (~ mean coverage)
  - Capture-based (~ whole exome sequencing)
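Both coverage metrics can be computed from per-base depths. The sketch below assumes the exome criterion means "fraction of target bases with depth of at least 0.2 times the mean coverage"; this is an interpretation of the slide, and the function names and depth values are hypothetical. The expected mean coverage is the usual estimate: number of reads times read length divided by target size.

```python
def expected_mean_coverage(n_reads, read_length, target_size):
    """Rough expected mean coverage from sequencing yield."""
    return n_reads * read_length / target_size


def mean_coverage(depths):
    """Mean depth over all target bases."""
    return sum(depths) / len(depths)


def fraction_covered(depths, min_depth):
    """Fraction of target bases with depth >= min_depth."""
    return sum(1 for d in depths if d >= min_depth) / len(depths)


if __name__ == "__main__":
    # Made-up per-base depths over a tiny target region.
    depths = [0, 10, 20, 30, 40]
    mean = mean_coverage(depths)
    print("mean coverage:", mean)
    print("fraction at >= 0.2x mean:", fraction_covered(depths, 0.2 * mean))
```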

TruSeq Exome Enrichment Kit

Variant Detection

DNA Variant Detection
[Figure: sequencing reads aligned to the reference sequence (GRCh37); variants are detected from the alignment result. Base legend: A, C, G, T]

Variant Detection
Input: BAM files
Output: VCF files
Tools
- Germline mutation
  - Genome Analysis Toolkit (GATK) (MAF > 5%)
  - FreeBayes
- Somatic mutation
  - MuTect
  - Somatic Variant Caller (MAF > 1%)

Genome Analysis Toolkit (GATK)
Project Home: https://www.broadinstitute.org/gatk/
Maintenance: Broad Institute
License
- Versions before 2.3-9: free for all users (the MIT license)
- Versions after 2.4: free for academics, fee for commercial use

GATK Best Practices
1. Pre-processing
   - Mark duplicates
   - Realign indels
   - Recalibrate bases
2. Variant discovery
   - Call variants
   - Filter variants
3. Callset refinement
   - Refine genotypes
   - Annotate variants
   - Evaluate variants
https://www.broadinstitute.org/gatk/guide/best-practices

Mark Duplicates
AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
  GGAAACCACACAGGCTT---AGGCCATTGGAA
   GAAACCACACAGGCTT---AGGCCATTGGAAT
    AAACCACACAGGCTT---AGGCCATTGG
    AAACCACACAGGCTT---AGGCCATTGG
       CCACACAGGCTT---AGGCCATTGGAA
        CACACAGGCTT---AGGCCATTGGAAT

Mark Duplicates
- The same DNA fragment may be sequenced several times.
- The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant.
- The process only marks the duplicate reads; it does not remove them.
- It should NOT be applied to amplicon sequencing data.
https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1
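The marking idea can be illustrated with a much simplified Python sketch. It assumes single-end reads and treats reads with the same chromosome, start position, and strand as copies of one fragment, keeping the copy with the highest summed base quality unmarked. The `Read` class and its fields are hypothetical; production tools also consider clipping and mate positions.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    pos: int
    strand: str
    base_quality_sum: int
    is_duplicate: bool = False


def mark_duplicates(reads):
    """Flag duplicate reads in place; no read is removed."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        current = best.get(key)
        if current is None or read.base_quality_sum > current.base_quality_sum:
            best[key] = read
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        # Every read except the best one at its position is a duplicate.
        read.is_duplicate = best[key] is not read
    return reads


if __name__ == "__main__":
    reads = [
        Read("r1", "1", 100, "+", 900),
        Read("r2", "1", 100, "+", 950),
        Read("r3", "1", 104, "+", 700),
    ]
    for read in mark_duplicates(reads):
        print(read.name, "duplicate" if read.is_duplicate else "kept as evidence")
```

All reads stay in the output, which mirrors the point that the step marks reads rather than removing them.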

Realign Indels (Before)
AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
  GGAAACCACACAGGCTT---AGGCCATTGGAA
   GAAACCACACAGGC---TTAGGCCATTGGAAT
    AAACCACACAGGCT---TAGGCCATTGG
    AAACCACACAGGCTT---AGGCCATTGG
       CCACACAGGC---TTAGGCCATTGGAA
        CACACAGG---CTTAGGCCATTGGAAT

Realign Indels (After)
AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
  GGAAACCACACAGGCTT---AGGCCATTGGAA
   GAAACCACACAGGCTT---AGGCCATTGGAAT
    AAACCACACAGGCTT---AGGCCATTGG
    AAACCACACAGGCTT---AGGCCATTGG
       CCACACAGGCTT---AGGCCATTGGAA
        CACACAGGCTT---AGGCCATTGGAAT

Realignment
Reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs but are actually mapping artifacts.
https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1

FreeBayes
Project Home: https://github.com/ekg/freebayes
Maintenance: Erik Garrison, Gabor Marth (http://arxiv.org/abs/1207.3907)
License: free for all users (the MIT license)
Pipelines using FreeBayes
- SpeedSeq (http://dx.doi.org/10.1038/nmeth.3505)
- Varpipe (http://www.ashg.org/2015meeting/abstract/detail?page=inthtml&project=ashg15&id=150120759)

VCF format
(1) Summary and description of fields
(2) Basic info of variants
(3) Variants info by individual
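The three parts of a VCF file map onto three kinds of lines: `##` meta lines (summary and description of fields), the eight fixed columns of each record (basic info of variants), and the FORMAT plus per-sample columns (variants info by individual). The sketch below separates them with plain string handling; the function name and the example content are made up for illustration and no VCF library is assumed.

```python
def parse_vcf(lines):
    """Split VCF text lines into meta lines, sample names, and records."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)                      # (1) field descriptions
        elif line.startswith("#CHROM"):
            samples = line.split("\t")[9:]         # sample column names
        elif line:
            fields = line.split("\t")
            chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
            record = {                             # (2) basic variant info
                "CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
                "ALT": alt.split(","), "QUAL": qual, "FILTER": flt,
                "INFO": info, "genotypes": {},
            }
            if len(fields) > 9:                    # (3) per-individual info
                keys = fields[8].split(":")
                for sample, value in zip(samples, fields[9:]):
                    record["genotypes"][sample] = dict(zip(keys, value.split(":")))
            records.append(record)
    return meta, samples, records


if __name__ == "__main__":
    # Hypothetical two-sample VCF content, for illustration only.
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
        "1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:14\t1/1:16",
    ]
    meta, samples, records = parse_vcf(example)
    print(samples, records[0]["genotypes"])
```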

Output File Size

                                          Human whole genome 30X   Human whole exome 100X
Raw reads (FASTQ)                         100 GB                   20 GB
Alignment results (BAM)                   100 GB                   20 GB
Variant detection results (VCF)           2 GB                     0.1 GB
Functional analysis of variants (CSV)     1 GB                     0.1 GB

HiPipe Project

Why HiPipe?
It is hard for researchers with a biology background to analyze NGS data:
- Most analysis tools run only in the Linux environment
- Few websites provide proper analysis
- Many tools are applicable only to small genome analysis
- It takes a long time to get a result
- Transferring large amounts of data between websites is required to complete an analysis
- It is error-prone to orchestrate different analysis tools

User Expectations for NGS
- Easy to learn and use analysis tools
- Researchers without a computer science background can complete an analysis themselves
- Doing analysis anywhere, anytime
- Suitable for any genome size
- Without file size limitation
- Upload sequence data upon read sequencing
- Get an analysis result rapidly
- Combine different experiment data in one analysis

Challenges
- Familiarity with bioinformatics tools (行中亭均)
- Job scheduling and pipeline development (Louis)
- Cluster and distributed system management (方智)
- Performance tuning
  - Software (耀德仁屏)
  - Hardware (MIS Team)
  - Bandwidth
  - Disk I/O
- User interface (爾瞻萬嘉中饋)
- Team coordination (傳崑)
Adam Yao

Architecture

HiPipe (Basic) http://hipipe.ncgm.sinica.edu.tw

HiPipe (Basic)
Online since June 2013
Provides 7 popular NGS data analysis pipelines:
- Whole Genome & Exome Variant Detection
- Differential Expression Analysis
- Gene Fusion Detection
- miRNA Analysis
- RNA Variant Detection
- De novo Assembly
- Exome CNV Detection

HiPipe Professional
- Authentication / Authorization
- Web-based interface supporting unlimited upload file size and resumable file upload
- Multi-sample indel realignment and variant detection
- Cloud / Onsite
- Providing an integrated NGS data analysis and data storage service

HiPipe Basic vs HiPipe Professional

                                            HiPipe     HiPipe Professional
                                            Basic      Cloud      Cluster    Quad     Duo       Uno
Data storage location                       Cloud      Cloud      Local      Local    Local     Local
Nodes                                       10         10         10         1        1         1
CPU cores                                   640        640        640        64       16        6
40X human whole genome variant detection    56 mins    56 mins    56 mins    7 hrs    ~12 hrs   ~36 hrs
Expected ship date                          2013 Q2    2015 Q4    2016       2016     2016      2016

Also compared: single sample analysis, multi-sample analysis (per-edition marks not shown).

Authentication / Authorization
1. Select a project
2. Add or remove members of the project

Upload files through a browser
Unlimited upload file size and resumable file upload

Create Project / Analysis

Configure settings

Choose samples

Save settings

Confirm settings

Launch analysis

Multi-sample analysis result Multi-sample indel realignment

Integrative Genomics Viewer
1. Integrate and visualize NGS, microarray, and annotation data in genomics
2. Load data only in specified regions to reduce physical memory usage
3. Customize the attributes of samples for further filtering
4. Browse multiple regions at the same time
5. It's FREE
http://www.broadinstitute.org/igv

Browse VCF and BAM in IGV

VCF Track I
The allele fraction (base legend: A, C, G, T)

VCF Track II
Heterozygous, Homozygous, Reference (base legend: A, C, G, T)

Zoom in to see alignment results
Thorvaldsdóttir H et al. Brief Bioinform 2013;14:178-192. © The Author(s) 2012. Published by Oxford University Press.

Browse alignment results in a larger region
- Set to 200 kb for browsing most genes
- Status: memory usage

VarioWatch
Functional analysis of variants
http://genepipe.ncgm.sinica.edu.tw/variowatch

VarioWatch Visualized result

Performance (Human)

Analysis         Estimated analysis time   Note
Whole Genome     56 minutes                40x coverage
Exome            10 minutes                100x coverage
Transcriptome    1-3 days                  12 * 2 Gb (paired sample)
de novo          24 hours                  100 Gb, k-mer = 73

NGS data analysis service

        Analysis pipeline                   Samples
DNA     Whole Genome Variant Detection      135
        Whole Exome Variant Detection       1241
        CNV Detection                       68
        LOH Detection                       47
        de novo assembly                    3
RNA     Differential Expression Analysis    153
        Gene-fusion Detection               32
        miRNA expression analysis           14
        Variant Detection                   43
        de novo assembly                    4
Other   ChIP                                25
        Aptamer                             25
        RNA IP                              15
        Virus integration                   15
Total (2011/10 ~ 2015/06)                   1820

Applying for NGS data analysis service
1. Contact us and discuss the details in a Pre-Service Meeting
2. Create accounts for the LIMS system and the issue tracking system (JIRA)
3. Create an issue for data analysis on the issue tracking system (JIRA)
4. Confirm the analysis pipeline and the estimated price
5. Transfer the raw data (raw reads)
6. NGS data analysis
7. Provide NGS data analysis results (kept for 6 months)
8. Confirm the final price
9. Case closed and charged through the LIMS system
10. Discuss results in a Post-Service Meeting

HiPipe Basic (for free): http://hipipe.ncgm.sinica.edu.tw
HiPipe Cloud (beta), apply for a trial account: https://goo.gl/i09e2s
Thanks for your attention!