CNV and variant detection for human genome resequencing data - for biomedical researchers (II)
Chuan-Kun Liu 劉傳崑
Senior Manager, National Center for Genome Medicine
bioit@ncgm.sinica.edu.tw
Abstract
  Common NGS Data Analysis Pipelines
  Data Output of NGS Platforms
  Quality Check and Read Trimming
  Sequence Alignment
  Variant Detection
  HiPipe Project
Common NGS Data Analysis Pipelines
Common NGS data analysis pipelines (DNA)
  Format Conversion and Demultiplexing (CASAVA)
  Quality Check (FastX / FastQC)
  Sequence Alignment (BWA)
  de novo Assembly (Velvet)
  Mark Duplicates (Sambamba)
  Variant calling (FreeBayes)
  Sequence Realignment (GATK)
  Mark Duplicates (GATK)
  Variants Annotation (VarioWatch)
  Multi-sample Variant calling (GATK)
  LOH Detection (ExomeCNV or NCGM)
  CNV Detection (ExomeCNV)
  Translocation (SVDetect)
  Variants Annotation (VarioWatch)
  Visualization (Circos)
Common NGS data analysis pipelines (RNA)
  Format Conversion and Demultiplexing (CASAVA)
  Quality Check (FastX / FastQC)
  Sequence Alignment (Bowtie2)
  Sequence Alignment (Bowtie)
  de novo assembly (Trinity)
  Differential expression (Cufflinks)
  Sequence Realignment (GATK)
  Gene-fusion Detection (TopHat-fusion)
  miRNA expression analysis (miRDeep2)
  Visualization (CummeRbund)
  Variant calling (GATK)
  Gene Annotation (VarioWatch)
  Gene Annotation (VarioWatch)
  Variants Annotation (VarioWatch)
NGS data analysis pipelines (Others)
  ChIP
  RNA IP
  Methylation
  Aptamer
  Virus integration
  Metagenomics
Data Output of NGS Platforms
Data output of NGS platforms
  Illumina HiSeq Platform (BCL files)
  Roche GS FLX+ (Standard Flowgram Format (SFF) files)
  Thermo Fisher Ion Proton (FASTQ files)
  PacBio RS II (FASTQ files)
  Common downstream format: FASTQ
Format Conversion and Demultiplexing
  Illumina HiSeq 2000 (multiplexed BCL files) converted to FASTQ files by sample
  Example FASTQ record:
  @HWI-ST688:211:C0F02ACXX:2:2310:7574:93175 2:N:0:GCCAAT
  ATGCAAATAAACTAGAAAATCTAGAAGAAATGGAGAAATTCCTGGACACAC
  +
  CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???
  Phred Quality Score | Probability of incorrect base call | Base call accuracy
  10 | 1 in 10 | 90 %
  20 | 1 in 100 | 99 %
  30 | 1 in 1000 | 99.9 %
  40 | 1 in 10000 | 99.99 %
  50 | 1 in 100000 | 99.999 %
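The Phred table above follows the standard relation Q = -10 * log10(P), where P is the probability of an incorrect base call. A minimal Python sketch of that conversion is given here; the function names are illustrative and not part of any tool named in these slides.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table for Q10 to Q50.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.3f} %")
```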
ASCII code
Three described FASTQ variants
  Description, OBF name | ASCII characters: Range | ASCII characters: Offset | Quality score: Type | Quality score: Range
  Sanger standard*, fastq-sanger | 33-126 | 33 | PHRED | 0 to 93
  Solexa/early Illumina*, fastq-solexa | 59-126 | 64 | Solexa | -5 to 62
  Illumina 1.3+*, fastq-illumina | 64-126 | 64 | PHRED | 0 to 62
  Illumina 1.8+, fastq-sanger | 33-126 | 33 | PHRED | 0 to 93
  *Nucleic Acids Res. 2010 April; 38(6): 1767-1771.
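As a small illustration of the offsets in this table, the sketch below decodes a FASTQ quality string into Phred scores by subtracting the ASCII offset. It assumes the example record shown earlier is in Illumina 1.8+ (fastq-sanger, offset 33) encoding; the function name is hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer quality scores.

    offset=33 corresponds to fastq-sanger / Illumina 1.8+;
    offset=64 corresponds to fastq-illumina (Illumina 1.3+).
    """
    return [ord(ch) - offset for ch in quality_string]


# Quality line of the example record from the Format Conversion slide.
quality = "CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???"
scores = decode_qualities(quality)
print(scores[:5], min(scores), max(scores))
```

Using the wrong offset shifts every score by 31, so it is worth confirming the encoding before applying any quality threshold.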
Quality Check and Read Trimming
Quality Check: FastQC
  Project Home: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  Merge fastq files into one file
  Java environment
  Interactive UI
Trimming (When)
  When to trim reads
    Format conversion
    Before alignment
    Upon alignment
  Trimming
    Trimming by lane
    Trimming by reads
Trimming (How)
  How to trim reads
    Specify the bases to be trimmed (e.g. n5y*n5)
    Specify the lower bound threshold of base quality and trim accordingly
  @HWI-ST688:211:C0F02ACXX:2:2310:7574:93175 2:N:0:GCCAAT
  ATGCAAATAAACTAGAAAATCTAGAAGAAATGGAGAAATTCCTGGACACAC
  +
  CCCFFFFFHHHGHJIIJJIJJJIIJJIJJIJJJFIIJAJJIJIJJJJI???
A tool for read trimming
  Trimmomatic: A flexible read trimming tool for Illumina NGS data
  Project Home: http://www.usadellab.org/cms/?page=trimmomatic
  Adaptor trimming
  Quality trimming
    Using Sliding Window strategy
    Cuts a read when the average base quality within a sliding window drops below the lower bound threshold
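The sliding-window idea can be sketched in a few lines of Python. This is a simplified illustration, not Trimmomatic's actual implementation: the function name, the default window of 4 bases, and the default threshold of 15 are assumptions chosen for the example, and reads shorter than the window are returned unchanged.

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=15):
    """Cut a read at the first window whose mean quality is below min_avg_q.

    seq   : read bases as a string
    quals : list of integer Phred scores, same length as seq
    Returns the (possibly shortened) sequence and quality list.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    # Scan windows from the 5' end; keep everything before the first bad window.
    for start in range(0, len(quals) - window + 1):
        window_mean = sum(quals[start:start + window]) / window
        if window_mean < min_avg_q:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read whose quality decays toward the 3' end.
read = "ACGTACGTAC"
read_quals = [38, 38, 37, 36, 30, 20, 10, 5, 2, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(read, read_quals, window=4, min_avg_q=20)
print(trimmed_seq, trimmed_quals)
```

A larger window smooths over isolated low-quality bases, while a smaller one cuts earlier; the right setting depends on the threshold question raised on the next slide.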
Trimming Threshold
  Bases above Q30: 85% (2 x 50 bp), 80% (2 x 100 bp)
  http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.html
  Q30 or Q24 or Q5?
  Is it necessary to trim reads?
Sequence Alignment
Tools for Sequence Alignment
  Open source
    BWA aln, BWA mem
    Bowtie, Bowtie2
    STAR
  Commercial
    CASAVA, Isaac (Illumina)
    CLC Genomics Workbench
Human reference sequence
  NCBI36 (Aug, 2005)
    The last assembly produced by the Human Genome Project (HGP)
    hg18, Ensembl release 54
  GRCh37 (Feb, 2009)
    The first assembly submitted by the Genome Reference Consortium (GRC)
    hg19, Ensembl release 55~75 (v75 Feb, 2014)
  GRCh38 (Dec, 2013)
    hg38, Ensembl release 76+ (v76 Aug, 2014)
GRCh37
  Primary assembly
    Chromosome assembly (chr1-22, X, Y, Mt)
    Unlocalized sequence
    Unplaced sequence
  Alternate loci
    A sequence that provides an alternate representation of a locus found in a largely haploid assembly (MHC region, UGT2B17, MAPT)
  Patches
    A contig sequence that is released outside of the full assembly release cycle
Choose your reference genome
  Human_g1k_v37
    The reference sequence provided by the 1000 Genomes Project*
    Unlocalized and unplaced sequences are included (full primary assembly)
    Mitochondrial sequence was replaced with the revised Cambridge Reference Sequence (rCRS; AC:NC_012920) (Apr 30, 2010)**
  Male or female
  *ftp://ftp.ncbi.nih.gov/1000genomes/ftp/technical/reference/readme.human_g1k_v37.fasta.txt
  **http://www.ncbi.nlm.nih.gov/nuccore/nc_012920
SAM/BAM format
  Sequence Alignment/Map (SAM) format: text file
  BAM format: binary file, used to reduce the file size
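Because SAM is tab-delimited text, a single alignment line can be inspected with plain string handling. The sketch below splits one line into the 11 mandatory SAM fields and checks two FLAG bits (0x4 unmapped, 0x400 duplicate). The example line and the function name are made up for illustration; real pipelines would normally use a dedicated library or samtools instead.

```python
# The 11 mandatory columns of a SAM alignment line, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict (header lines start with '@')."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_duplicate"] = bool(record["FLAG"] & 0x400)
    record["tags"] = parts[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical alignment line; FLAG 1123 = 1 + 2 + 32 + 64 + 1024.
example = "read1\t1123\t1\t10000\t60\t10M\t=\t10100\t110\tACGTACGTAC\tIIIIIIIIII"
sam_record = parse_sam_line(example)
print(sam_record["RNAME"], sam_record["POS"], sam_record["is_duplicate"])
```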
Coverage
  Whole genome sequencing
    Mean coverage
  Whole exome sequencing
    Over 90% of the target region is covered at 0.2x the mean coverage or higher
  Custom panel
    PCR-based (~ mean coverage)
    Capture-based (~ whole exome sequencing)
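The two coverage measures on this slide can be illustrated with a short Python sketch. It assumes per-base depths over the target region are already available (for example from a depth report), and the function names and toy numbers are hypothetical.

```python
def expected_mean_coverage(num_reads, read_length, target_size):
    """Expected mean coverage = total sequenced bases / target size."""
    return num_reads * read_length / target_size


def coverage_summary(depths, fraction_of_mean=0.2):
    """Return (mean coverage, fraction of bases with depth >= fraction_of_mean * mean)."""
    mean_cov = sum(depths) / len(depths)
    threshold = fraction_of_mean * mean_cov
    covered = sum(1 for d in depths if d >= threshold)
    return mean_cov, covered / len(depths)


# Whole genome: 900 million reads of 100 bp over a 3 Gb genome.
wgs_cov = expected_mean_coverage(900_000_000, 100, 3_000_000_000)

# Exome-style uniformity check on a toy list of per-base depths.
toy_depths = [0, 5, 50, 100, 120, 80, 95, 110, 60, 100]
mean_cov, frac_covered = coverage_summary(toy_depths)
print(wgs_cov, mean_cov, frac_covered)
```

In this toy example only 80% of positions reach 0.2x the mean, so the capture would fall short of the 90% criterion stated above.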
TruSeq Exome Enrichment Kit
Variant Detection
DNA Variant Detection
  [Figure: sequencing reads aligned to the reference sequence (GRCh37); variants are identified from the alignment result. Base legend: A, C, G, T]
Variant Detection
  Input: BAM files
  Output: VCF files
  Tools
    Germline mutation
      Genome Analysis Toolkit (GATK) (MAF > 5%)
      FreeBayes
    Somatic mutation
      MuTect
      Somatic Variant Caller (MAF > 1%)
Genome Analysis Toolkit (GATK)
  Project Home: https://www.broadinstitute.org/gatk/
  Maintenance: Broad Institute
  License
    Version before 2.3-9: free for all users (the MIT license)
    Version after 2.4: free for academics, fee for commercial use
GATK Best Practices
  1. Pre-processing
     Mark duplicates
     Realign indels
     Recalibrate bases
  2. Variant discovery
     Call variants
     Filter variants
  3. Callset refinement
     Refine genotypes
     Annotate variants
     Evaluate variants
  https://www.broadinstitute.org/gatk/guide/best-practices
Mark Duplicates
  AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
    GGAAACCACACAGGCTT---AGGCCATTGGAA
     GAAACCACACAGGCTT---AGGCCATTGGAAT
      AAACCACACAGGCTT---AGGCCATTGG
      AAACCACACAGGCTT---AGGCCATTGG
         CCACACAGGCTT---AGGCCATTGGAA
          CACACAGGCTT---AGGCCATTGGAAT
Mark Duplicates
  The same DNA fragments may be sequenced several times
  The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant
  The process only marks the reads; it does not remove them
  Should NOT be applied to amplicon sequencing data
  https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1
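A toy version of duplicate marking is sketched below in Python. It groups reads by chromosome, start position, and strand, keeps the read with the highest summed base quality as the representative, and flags the rest without deleting them. The `Read` class and its fields are hypothetical, and the grouping key is a simplification: production tools also consider mate positions and clipped bases.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    pos: int           # leftmost aligned position
    strand: str        # "+" or "-"
    qual_sum: int      # sum of base qualities, used to pick the representative
    is_duplicate: bool = False


def mark_duplicates(reads):
    """Flag all but the best read in each (chrom, pos, strand) group; nothing is removed."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)
        for read in group:
            read.is_duplicate = read is not best
    return reads


reads = [
    Read("r1", "1", 102, "+", 1180),
    Read("r2", "1", 103, "+", 1175),
    Read("r3", "1", 104, "+", 1010),
    Read("r4", "1", 104, "+", 1120),  # same start as r3: one of the two is a duplicate
    Read("r5", "1", 107, "+", 1090),
]
mark_duplicates(reads)
print([(r.name, r.is_duplicate) for r in reads])
```

Because reads are only flagged, downstream callers can choose to ignore them while the original data stay intact.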
Realign Indels (Before)
  AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
    GGAAACCACACAGGCTT---AGGCCATTGGAA
     GAAACCACACAGGC---TTAGGCCATTGGAAT
      AAACCACACAGGCT---TAGGCCATTGG
      AAACCACACAGGCTT---AGGCCATTGG
         CCACACAGGC---TTAGGCCATTGGAA
          CACACAGG---CTTAGGCCATTGGAAT
Realign Indels (After)
  AGGGAAACCACACAGGCTTCTTAGGCCATTGGAAT
    GGAAACCACACAGGCTT---AGGCCATTGGAA
     GAAACCACACAGGCTT---AGGCCATTGGAAT
      AAACCACACAGGCTT---AGGCCATTGG
      AAACCACACAGGCTT---AGGCCATTGG
         CCACACAGGCTT---AGGCCATTGGAA
          CACACAGGCTT---AGGCCATTGGAAT
Realignment
  Reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts.
  https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1
FreeBayes
  Project Home: https://github.com/ekg/freebayes
  Maintenance: Erik Garrison, Gabor Marth (http://arxiv.org/abs/1207.3907)
  License: free for all users (the MIT license)
  Pipelines using FreeBayes
    SpeedSeq (http://dx.doi.org/10.1038/nmeth.3505)
    Varpipe (http://www.ashg.org/2015meeting/abstract/detail?page=inthtml&project=ashg15&id=150120759)
VCF format
  (1) Summary and description of fields
  (2) Basic info of variants
  (3) Variants info by individual
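The three parts of a VCF file can be separated with a few lines of Python. The sketch below is illustrative only: the embedded VCF text is a made-up single-sample example, and the function name is hypothetical. Lines starting with `##` are the field descriptions (1), the first eight columns hold the basic variant information (2), and the FORMAT column plus the sample columns hold the per-individual information (3).

```python
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n'
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT\t0/1\n"
)


def split_vcf(text):
    """Split VCF text into meta lines, sample names, and (site, per-sample) records."""
    meta_lines, sample_names, variants = [], [], []
    columns = []
    for line in text.splitlines():
        if line.startswith("##"):
            meta_lines.append(line)                  # (1) field descriptions
        elif line.startswith("#"):
            columns = line[1:].split("\t")           # column header line
            sample_names = columns[9:]
        elif line:
            fields = line.split("\t")
            site = dict(zip(columns[:8], fields[:8]))  # (2) basic variant info
            keys = fields[8].split(":") if len(fields) > 8 else []
            per_sample = {                           # (3) info by individual
                name: dict(zip(keys, value.split(":")))
                for name, value in zip(sample_names, fields[9:])
            }
            variants.append((site, per_sample))
    return meta_lines, sample_names, variants


meta_lines, sample_names, variants = split_vcf(VCF_TEXT)
site, per_sample = variants[0]
print(len(meta_lines), site["CHROM"], site["POS"], per_sample["sample1"]["GT"])
```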
Output File Size
  File | Human whole genome 30X | Human whole exome 100X
  Raw reads (FASTQ) | 100 GB | 20 GB
  Alignment results (BAM) | 100 GB | 20 GB
  Variant detection results (VCF) | 2 GB | 0.1 GB
  Functional analysis of variants (CSV) | 1 GB | 0.1 GB
HiPipe Project
Why HiPipe?
  It is hard for researchers with a biology background to analyze NGS data:
    Most analysis tools run only in the Linux environment
    Few websites provide proper analysis
    Many tools are applicable only to small genomes
    It takes a long time to get a result
    Transferring large amounts of data between websites is required to complete an analysis
    It is error-prone to orchestrate different analysis tools
User Expectations for NGS
  Easy to learn and use analysis tools
  Researchers without computer science background can complete an analysis themselves
  Doing analysis anywhere, anytime
  Suitable for any genome size
  Without file size limitation
  Upload sequence data upon read sequencing
  Get an analysis result rapidly
  Combine different experiment data in one analysis
Challenges
  Familiarity with bioinformatics tools (行中, 亭均)
  Job scheduling and pipeline development (Louis)
  Cluster and distributed system management (方智)
  Performance tuning
    Software (耀德, 仁屏)
    Hardware (MIS Team)
      Bandwidth
      Disk I/O
  User interface (爾瞻, 萬嘉, 中饋)
  Team coordination (傳崑)
  Adam Yao
Architecture
HiPipe (Basic) http://hipipe.ncgm.sinica.edu.tw
HiPipe (Basic)
  Online since Jun, 2013
  Provides 7 popular NGS data analysis pipelines:
    Whole Genome & Exome Variant Detection
    Differential Expression Analysis
    Gene Fusion Detection
    miRNA Analysis
    RNA Variant Detection
    De novo Assembly
    Exome CNV Detection
HiPipe Professional
  Authentication / Authorization
  Web-based interface supporting unlimited upload file size and resumable file upload
  Multi-sample indel realignment and variant detection
  Cloud / Onsite
  Providing integrated NGS data analysis and data storage service
HiPipe Basic vs HiPipe Professional
  Edition | HiPipe Basic | HiPipe Cloud | HiPipe Cluster | HiPipe Quad | HiPipe Duo | HiPipe Uno
  Product line | HiPipe | HiPipe Professional | HiPipe Professional | HiPipe Professional | HiPipe Professional | HiPipe Professional
  Single sample analysis | | | | | |
  Multi-sample analysis | | | | | |
  Data storage location | Cloud | Cloud | Local | Local | Local | Local
  Nodes | 10 | 10 | 10 | 1 | 1 | 1
  CPU cores | 640 | 640 | 640 | 64 | 16 | 6
  40X human whole genome variant detection | 56 mins | 56 mins | 56 mins | 7 hrs | ~12 hrs | ~36 hrs
  Expected ship date | 2013 Q2 | 2015 Q4 | 2016 | 2016 | 2016 | 2016
Authentication / Authorization 1. Select a project 2. Add or remove members of the project
Upload files through a browser Unlimited upload file size and resumable file upload
Create Project / Analysis
Configure settings
Choose samples
Save settings
Confirm settings
Launch analysis
Multi-sample analysis result Multi-sample indel realignment
Integrative Genomics Viewer
  1. Integrate and visualize NGS, microarray and annotation data in genomics
  2. Load data only in specified regions to reduce physical memory usage
  3. Customize the attributes of samples for further filtering
  4. Browse multiple regions at the same time
  5. It's FREE
  http://www.broadinstitute.org/igv
Browse VCF and BAM in IGV
VCF Track I: the allele fraction (base legend: A, C, G, T)
VCF Track II: heterozygous, homozygous, reference (base legend: A, C, G, T)
Zoom in to see alignment results
  Thorvaldsdóttir H et al. Brief Bioinform 2013;14:178-192. © The Author(s) 2012. Published by Oxford University Press.
Browse alignment results in a larger region
  Set to 200 kb for browsing most genes
  Status: memory usage
VarioWatch Functional analysis of Variants http://genepipe.ncgm.sinica.edu.tw/variowatch
VarioWatch Visualized result
Performance (Human)
  Analysis | Estimated analysis time | Note
  Whole Genome | 56 minutes | 40x coverage
  Exome | 10 minutes | 100x coverage
  Transcriptome | 1-3 days | 12 * 2 Gb (paired sample)
  de novo | 24 hours | 100 Gb, K-mer = 73
NGS data analysis service
  Category | Analysis pipeline | Samples
  DNA | Whole Genome Variant Detection | 135
  DNA | Whole Exome Variant Detection | 1241
  DNA | CNV Detection | 68
  DNA | LOH Detection | 47
  DNA | de novo assembly | 3
  RNA | Differential Expression Analysis | 153
  RNA | Gene-fusion Detection | 32
  RNA | miRNA expression analysis | 14
  RNA | Variant Detection | 43
  RNA | de novo assembly | 4
  Other | ChIP | 25
  Other | Aptamer | 25
  Other | RNA IP | 15
  Other | Virus integration | 15
  Total (2011/10 ~ 2015/06) | | 1820
Applying for NGS data analysis service
  1. Contact us and discuss the details in a Pre-Service Meeting
  2. Create accounts for the LIMS system and the issue tracking system (JIRA)
  3. Create an issue for the data analysis on the issue tracking system (JIRA)
  4. Confirm the analysis pipeline and the estimated price
  5. Transfer the raw data (raw reads)
  6. NGS data analysis
  7. Provide NGS data analysis results (retained for 6 months)
  8. Confirm the final price
  9. Case closed and charged through the LIMS system
  10. Discuss results in a Post-Service Meeting
HiPipe Basic (for free): http://hipipe.ncgm.sinica.edu.tw
HiPipe Cloud (beta): apply for a trial account at https://goo.gl/i09e2s
Thanks for your attention!