Data Analysis with CASAVA v1.8 and the MiSeq Reporter

Similar documents
Introduction to the MiSeq

Illumina s Suite of Targeted Resequencing Solutions

Illumina Custom Capabilities: Amplicons, Enrichment and iselect

The Expanded Illumina Sequencing Portfolio New Sample Prep Solutions and Workflow

Genome Analyzer. RNA ChIP

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

Introduction to Next Generation Sequencing

Illumina Technology Updates

Analytics Behind Genomic Testing

MiSeq. system applications

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Introduction to the NextSeq System

Illumina Sequencing Overview

Introduction to RNA-Seq in GeneSpring NGS Software

Next-Generation Sequencing: custom solutions

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

G E N OM I C S S E RV I C ES

How much sequencing do I need? Emily Crisovan Genomics Core


ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Galaxy for Next Generation Sequencing 初探次世代序列分析平台 蘇聖堯 2013/9/12

Deep Sequencing technologies

Getting high-quality cytogenetic data is a SNP.

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Revolutionizing genetics in the cattle industry. Extensive toolset. Proven technology. Collaborative approach.

Course Presentation. Ignacio Medina Presentation

How much sequencing do I need? Emily Crisovan Genomics Core September 26, 2018

EpiGnome Methyl-Seq - DNA Methylation Analysis using Whole Genome Bisulfite Sequencing

CNV and variant detection for human genome resequencing data - for biomedical researchers (II)

Illumina Diagnostics. Emily Winn-Deen, PhD Vice President, Product Development Diagnostics

IDENTIFYING A DISEASE CAUSING MUTATION

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010

Next-Generation Sequencing. Technologies

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Benefits to validating GWAS studies with custom genotyping

QIAseq Targeted Panel Analysis Plugin USER MANUAL

Introduction to NGS analyses

Genomics and Transcriptomics of Spirodela polyrhiza

Quantifying gene expression

DATA FORMATS AND QUALITY CONTROL

Quality Control of Next Generation Sequence Data

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

Long and short/small RNA-seq data analysis

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

Next-Generation Sequencing Services à la carte

Sanger vs Next-Gen Sequencing

Applications of short-read

RNA-sequencing. Next Generation sequencing analysis Anne-Mette Bjerregaard. Center for biological sequence analysis (CBS)

Transcriptome analysis

Short Read Alignment to a Reference Genome

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte

About Strand NGS. Strand Genomics, Inc All rights reserved.

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

RNA-Sequencing analysis

14 March, 2016: Introduction to Genomics

Next Generation Sequencing. Tobias Österlund

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Galaxy Platform For NGS Data Analyses

Compatible with: Ion Torrent Platforms Roche Sequencing Platforms Illumina Sequencing Platforms Life Technologies SOLiD System

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Bioinformatics for NGS projects. Guidelines. genomescan.nl

NGS part 2: applications. Tobias Österlund

Lecture 7. Next-generation sequencing technologies

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012

NGS in Pathology Webinar

RNA-Seq Software, Tools, and Workflows

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics

Genomic resources. for non-model systems

SNP calling and VCF format

NEXT GENERATION SEQUENCING. Farhat Habib

RNA-Seq Module 2 From QC to differential gene expression.

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Transforming Services in the World of Life Sciences

10/06/2014. RNA-Seq analysis. With reference assembly. Cormier Alexandre, PhD student UMR8227, Algal Genetics Group

Genome Resequencing. Rearrangements. SNPs, Indels CNVs. De novo genome Sequencing. Metagenomics. Exome Sequencing. RNA-seq Gene Expression

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Bioinformatics in next generation sequencing projects

Analysing genomes and transcriptomes using Illumina sequencing

Applications of Next Generation Sequencing in Metagenomics Studies

Welcome to the NGS webinar series

SMRT Analysis Barcoding Overview (v6.0.0)

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era

Roadmap: genotyping studies in the post-1kgp era. Alex Helm Product Manager Genotyping Applications

Agilent Genomic Workbench 7.0

Introduction to bioinformatics (NGS data analysis)

Services Presentation Genomics Experts

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

02 Agenda Item 03 Agenda Item

Next Generation Sequencing: An Overview

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Variant Discovery. Jie (Jessie) Li PhD Bioinformatics Analyst Bioinformatics Core, UCD

Transcription:

Data Analysis with CASAVA v1.8 and the MiSeq Reporter Eric Smith, PhD Bioinformatics Scientist September 15 th, 2011 2010 Illumina, Inc. All rights reserved. Illumina, illuminadx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iselect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

CASAVA v1.8: Integrated Secondary Analysis Package Real Time Analysis Bcl conversion Demultiplex Alignment Variant Calling RNA counting Visualization (Genome Studio) Tertiary Analysis CASAVA v1.8 was released in June 2011. 2

Bcl Conversion with CASAVA v1.8 BCL files Bcl Conversion and De-multiplexing Compressed Fastq Files CASAVA (not OLB) will perform Bcl conversion and de-multiplexing in a single step. To support standard file formats, Fastq will replace Qseq as the output format. The Fastq quality score encoding will use the standard offset value of 33 rather than the previous Illumina-specific offset value of 64. 3

Project/Sample-based Sample Sheet The Sample Sheet is provided by the user for each sequencing run Sample Name Barcode/Index Project Name FCID Lane Sample SampleRef Index Description Control Recipe Operator Project FC62DBU 1 NA12156_Index_1 Human ATCACG 1:250method1 N test.xml AT testproject1 FC62DBU 1 NA11992_Index_2 Human CGATGT 1:250method1 N test.xml AT testproject1 FC62DBU 1 NA11882_Index_4 Human TGACCA 1:250method1 N test.xml AT testproject1 FC62DBU 1 NA11881_Index_3 Human TTAGGC 1:250method1 N test.xml AT testproject1 FC62DBU 1 lane1 unknown Undetermined unknown barcode N test.xml AT Undetermined_indices FC62DBU 2 NA12156_Index_1 Human ATCACG 1:500method1 N test.xml AT testproject1 FC62DBU 2 NA11992_Index_2 Human CGATGT 1:500method1 N test.xml AT testproject1 FC62DBU 2 NA11882_Index_4 Human TGACCA 1:500method1 N test.xml AT testproject1 FC62DBU 2 NA11881_Index_3 Human TTAGGC 1:500method1 N test.xml AT testproject1 FC62DBU 2 lane2 unknown Undetermined unknown barcode N test.xml AT Undetermined_indices FC62DBU 3 NA12156_Index_1 Human ATCACG equal Volume N test.xml AT testproject1 FC62DBU 3 NA11992_Index_2 Human CGATGT equal Volume N test.xml AT testproject1 FC62DBU 3 NA11882_Index_4 Human TGACCA equal Volume N test.xml AT testproject1 FC62DBU 3 NA11881_Index_3 Human TTAGGC equal Volume N test.xml AT testproject1 FC62DBU 3 lane3 unknown Undetermined unknown barcode N test.xml AT Undetermined_indices 4

Project-based Run Folder The Sample Sheet provides names of Project and Samples Run folder Unaligned Aligned Build 5

Project-based Run Folder The Sample Sheet provides names of Project and Samples Run folder Unaligned Bcl conversion and De-multiplexing Aligned Alignment Build Variant calling 6

Project-based Run Folder The Sample Sheet provides names of Project and Samples Unaligned Project DNA Sample Rat Run folder Unaligned Undetermined indexes Rat_ATGACG_L002_R1.001.fastq.gz Rat_ATGACG_L002_R1.002.fastq.gz Aligned Aligned Project DNA Build Sample Rat Export and Summary files Build Project DNA Sample Rat BAM and variant files 7

Alignment with CASAVA v1.8 ELAND Alignment (read 1) ELAND Alignment (read 2) Alignment Resolver Reference sequences no longer need to be squashed fasta format is supported. The GERALD summary file will be modified in accordance with the new directory structure The runtime of Alignment Resolver has been significantly reduced 8

Alignment: Overlapping Seeds Repeat resolution with Eland v2e First pass: align using a single seed Take reads that did not get matched or hit a repetitive sequence Second fourth passes: Align using overlapping seeds Report seeds that hit a non-repetitive sequence Increases sensitivity across repeat regions 100 bp Stage 1: (single seed) Stage 2: (multiple seeds) 9

CASAVA v1.8 Orphan Alignment Sometimes only one read in a paired-end fragment (orphaned read) aligns to the genome, this tends to happen around repeat regions: Read 1 repeat region CASAVA 1.8 uses a sensitive alignment algorithm around the region where the missing read is expected: Read 1 Read 2 repeat region This helps recover reads in highly repetitive regions. 10

Alignment Accuracy with ELANDv2e Comparative assessment 25 million simulated read pairs, introduced error rate <2%, introduced SNPs and short indels ELANDv2 bowtie bwa ELANDv2e bow tie bw a ELANDv2e Correctly mapped Incorrectly mapped 88.1% 94.1% 94.1% 3.0% 5.0% 0.09% 0% 20% 40% 60% 80% 100% correctly mapped pairs incorrectly mapped pairs unmapped pairs 11

CASAVA v1.8 Variant Calling Uses a new probabilistic Bayesian method (like MAQ and GATK) Single-sample variant caller Calculates genotype probabilities under two prior distributions - genomic and polymorphic The output format are text-based genotype calls and the standard file format BAM. Gapped Alignment (ELANDv2e) Bin/Sort Reads Remove Duplicates Filter by mismatch density, coverage, and mapping score Indel Detection (de novo assembly) Realignment of reads and contigs Genotyper BAM Sites Snps Indels 12

CASAVA 1.8: Indel Detection with Local Realignment 1. Cluster Orphan Reads Clusters of Orphan Reads 1. Cluster Shadow and Partially Aligned Reads Aligned Reads Reference TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA 2. Cluster Anomalous Reads 2. Cluster Anomalous Reads Aligned Reads Cluster of Anomalous Reads (insert too short) Reference 3. Merge Clusters from Same Event TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA 3. Merge Cluster from Same Event Merged Cluster 13

CASAVA 1.8: Indel Detection with Local Realignment 4. Assemble Cluster into Contig 4. De novo Assemble Clusters into Contigs Orphan Reads in Contig Anomalous Reads in Contig TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT New Contig 5. Align Contig to Reference 5. Align Contigs to Reference GATCGGTGTTTGTGGTGGG Potential New Insert New Contig Reference TAGCATGCATGCATGCACGGACTTT TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA 6. Send Contig and Read Alignments to Genotyper for genotype calling Contig and Read Alignments Genotyper 14

Improved SNP Calling Accuracy and Sensitivity Sequenced human Yoruba trio: NA18507, NA18506, NA18508 95 Gb raw data each, 2x100 bp, 200G Chemistry Violations of Mendelian inheritance indicate miscalled variants Results for chromosome 20 CASAVA 1.7 CASAVA 1.8 Total sites ( N excluded) 59,505,520 59,505,520 Mapped in all 3 of trio 58,188,449 (97.8%) 58,988,964 (99.1%) Called in all 3 of trio 56,881,257 (95.6%) 57,696,195 (97.0%) Mendelian conflicts 5,072 162 Conflict rate 0.0087% of called sites 0.00028% of called sites 15

16 Analysis Visual Controller(AVC)

Analysis Visual Controller (AVC) a GUI for CASAVA AVC is a Windows GUI to run CASAVA (support for v1.8 coming soon). It s designed to allow biologists to run their own analyses without learning Linux commands. AVC supports multi-cores Linux machines as well as SGE clusters, and offers archive and deletion functions to manage your data (and disk space!). No installation is required on the Linux side. AVC can be downloaded on icom for free and is supported by Tech Support. 17

18 RNA Workflows

RNA Workflows: TopHat / Cufflinks or CASAVA to fit your application Feature TopHat + Cufflinks CASAVA v1.8 Counting of exons, genes, splice junctions Yes Yes Differential gene/isoform expression Yes Possible Novel isoform discovery Yes No Paired-End Reads Yes No Analysis without gene/exon annotations Yes SNP detection No (Yes with pileup) Yes Short indel detection No (Yes with pileup) Yes No Technical Support No (documentation provided by Illumina) Yes 19

igenomes Reference Files for CASAVA and TopHat/Cufflinks Includes commonly analyzed reference sequences and annotation files. Files downloaded from NCBI, UCSC, and Ensembl. 25 species supported; available on icom and ftp. Each igenome includes reference sequences, Bowtie index (for Tophat/Cufflinks) and annotation files (for RNA workflows). 20

Small RNA Alignment -- Coming Soon Align small RNA sequences using ELAND to: known small RNA sequences (eg mirbase) genome sequence Built from previous release of Flicker Adaptor sequences are trimmed Output will be in SAM format Downstream applications include: Differential expression De novo discovery 21

Use Software Tools That Meet Your Project s Needs Illumina software tools are available for you to use. Many 3 rd party tools are also very useful use what suits your needs best! All Illumina software has full technical support at your disposal. CASAVA NovoAlign BWA MAQ Bowtie BFAST Mosaik SOAP 22

23 MiSeq Reporter

MiSeq Reporter Workflows Overview Instrument Control Software (MCS) Images MiSeq Reporter RTA Base calls & Quality Scores Resequencing Amplicon Library QC Small RNA De Novo Assembly 16S Metagenomics Limited Visualization via HTTP interface Download Alignment/FASTQ, Variants and Statistics for further analysis The resequencing workflow will have the option to only generate FASTQ files for users that want to use their own tools. 24

MiSeq Specifications Up to ~5 million clusters per run Single and paired-end runs, with support for indexing Data analysis takes <2 hours Seven reference genomes delivered; users can add new ones Data volume for 2 x 150bp run (yields ~1.5 Gbp): For more info see http://www.illumina.com/help/miseq_reporter/default.htm 25

26 MiSeq Reporter: General Summary Report

MiSeq Reporter -- Resequencing Workflows Resequencing Amplicon Resequencing Workflow Output files: BAM, VCF Algorithms: based on CASAVA v1.8 Zoom in/out Library QC De novo Assembly SmallRNA Coverage and Error Plots Q-score Plot SNPs/indels & Gene Annotation Metagenomics Table of Samples and Variants Amplicon and Library QC workflows will have similar output 27

MiSeq Reporter De novo Assembly Workflows Resequencing De novo Assembly Workflow Output files: Fasta of contigs; optional dotplot Algorithms: based on Velvet Amplicon Library QC Syntenic dot-plot De novo Assembly SmallRNA Metagenomics Contig metrics 28

MiSeq Reporter Small RNA Workflows Distribution of reads by type of small RNA Histogram of most abundant small RNAs Resequencing Amplicon Library QC De novo Assembly SmallRNA Trimmed Read-Length Distribution (in Summary Report) Metagenomics Small RNA Workflow Output files: Count text files of small RNAs by type Algorithms: based on Flicker & CASAVA v1.8 29

MiSeq Reporter -- Metagenomics Workflows Resequencing Amplicon Library QC Distribution of reads by alignment to 16S rrna database De novo Assembly SmallRNA Metagenomics Metagenomics Workflow Output files: Annotation counts at different taxonomic levels Algorithms: alignment to 16S rrna 30

31

32

33 DNA Alignment with CASAVA

CASAVA v1.8 Alignment Resolver If in a paired-end fragment, each read aligns to several locations in the genome: Read 1 Read 2 The reads are expected to have a particular orientation: And occur within a given fragment length: Read 1 Read 2 Read 2 Read 1 good orientation, good fragment length Read 1 Read 2 chromosome 7 wrong orientation spurious alignment Read 1 Read 2 Read 1 chromosome 4 anomalous fragment length Read 1 Read 2 chromosome 13 34

35 Variant Calling with CASAVA

Other Bioinfomatics Tools Support for Tophat/Cufflinks for RNA-Seq (see Illumina s HowTo document) The Analysis Visual Controller (AVC) is a GUI for running CASAVA The igenomes are a collection of references sequence and annotation files for commonly analyzed species. Includes files needed for CASAVA and Tophat/Cufflinks. Downloadable on icom under Downloads. Flicker is small-rna tool that works with CASAVA coming soon. GenomeStudio is a tool that helps you visualize CASAVA output now taking input in BAM format. 36

CASAVA Methods RNA counting Raw read counts are normalized by: Gene/exon size to allow comparisons between genes/exons Read coverage to allow comparisons between experiments to produce RPKM values Normalized Values Genes and Exons RPKM = 10 9 x Cb / NbL Splice Junctions RPKM = 10 9 x Cr / NrL Counting Method RPKM = Reads Per Kilobase of exon model per Million mapped reads Cb= the number of bases that fall on the feature Nb= total number of mapped bases in the experiment L= the length of the feature in base pairs Cr= the number of reads that cover the junction point Nr= total number of mapped reads in the experiment L= the length of the feature in base pairs 37

38 TruSeq Exome Analysis

TruSeq Exome Script Output Run ID 101022_SN140_0234_A805N8ABXX_2 Lane 2 Read length 101 Targeted Regions TruSeq_exome Pull-down region size (calculated on single probe regions) 340 Total regions size 62085295 Mean coverage ESTIMATE (calculated from read counts) 102 # of reads (PF + Aligned) 85899275 # of reads in targeted regions 62759149 Enrichment in targeted regions 73.06 Enrichment in targeted regions +/-150 bp 83.88 # targeted regions 201071 # of targeted regions with no reads 473 % of targeted regions with no reads 0.24 General quality metrics Coverage plots Read counts by GC content Control coverage and snp frequencies 39

40 Visualization with GenomeStudio

GenomeStudio Data Visualization, now supports BAM Powerful coupling of table-based data with visualization tools Modules for DNA, RNA, ChIP-Seq Licensebased product 41

42 Data Volume Management

Throughput Continues To Increase 600 500 400 300 200 100 0 GB per Week 2008 2009 2010 2011 43

Data Volumes: HiSeq2000 + v3 Chemistry + CASAVA 1.8 (600Gb output per run; Q2, 2011) Data Volume File Format Size Comment Base Call / Quality Score Data Read-level Data Internal alignment output Alignment Output & Archiving Bcl compressed FASTQ export.txt BAM 660 GB 660 GB ~5 TB 660 GB ~5TB of temp space is needed per 600Gb HiSeq2000 run 44

45 MiSeq Reporter

Running A MiSeq User fills out Sample Sheet at desk Excel Add-In verifies that it is valid User scans bar-code on flow cell Sample Sheet is automatically loaded (if named after bar code) Flow cell/reagents loaded Run Begins User reviews progress and results via HTTP interface 46

MiSeq Reporter Workflow Reports Resequencing coverage, q-scores, variants (outputs BAM, VCF) Small RNA trimmed reads distribution, reads by RNA type, most abundant small RNAs De novo assembly dotplot and metrics (outputs fasta of contigs) Metagenomics 16S rrna alignment 47

MiSeq Reporter: Resequencing details report Coverage and error Scope of view Zoom in/out Q score SNP/indel + annotation Table of samples/ variants 48

MiSeq Reporter: De Novo details report Syntenic dot plot De novo metrics 49

MiSeq Reporter: Small RNA summary page Trimmed read lengths 50

MiSeq Reporter: Small RNA details page Distribution of small RNA reads by type Top 10 most abundant small RNA 51

52 MiSeq Reporter: 16S metagenomics details report

53 Future Software Products

Software Workflow Manager Under Development Run diverse workflows from a single GUI application Mix and match software components within a workflow Enter custom workflows Possible LIMS integration DNA Custom Workflow RNA Workflow Manager Methylation ChIP-Seq 54