Introduction to Next Generation Sequencing


The Sequencing Revolution: Introduction to Next Generation Sequencing
Dena Leshkowitz, WIS
1st BIOmics Workshop

High-Throughput Short-Read Sequencing Technologies
- Highly parallel reactions (millions to billions possible)
- Performed on cloned DNA populations

Companies
- 454/Roche: launched in 2005. Pyrosequencing by synthesis.
- Solexa/Illumina: launched in late 2006. Reversible-terminator sequencing by synthesis (dye-labeled nucleotides).
- Agencourt/ABI/Invitrogen: launched mid-2007. Sequencing by ligation (dye-labeled dinucleotides).

DNA Sequencing Throughput History

Cost for Sequencing the Human Genome

Objectives
- Illumina Genome Analyzer pipeline overview
- Interpreting the pipeline's output
- Tools and approaches to further analyze the sequences

Pipeline Components Overview
Master Script (GOAT): Image Analysis -> Base Calling -> Alignment (GERALD)

Technology Overview

Image Analysis & Base Calling
- A flow cell contains eight lanes (Lane 1 ... Lane 8).
- Each lane contains two columns, and each column contains up to 50 tiles.
- Each tile is imaged four times per cycle, one image per base.
- DNA clusters are located and quantified across all images.
- Each cluster, at each cycle, generates four fluorescence intensities.
- Naively, the highest of the four values determines the base.

Base Calling: Intensity Correction
- Cross-talk correction: the emission spectra of the four dyes overlap.
- Normalization: a scaling factor makes the intensities equivalent.
[Figure: overlapping emission spectra of dyes X and Y]

Base Calling: Phasing/Prephasing Correction
[Figure: phasing and prephasing; corrected intensity, base call and quality score]
- The correction requires a sample with a random, balanced base composition and is therefore usually estimated on PhiX, our control.
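The naive base-calling rule above (the highest of the four intensities wins, after a per-channel scaling factor) can be sketched in a few lines. This is only an illustration: the channel order A, C, G, T, the function names and the intensity values are assumptions made for the example, and the real pipeline also applies cross-talk and phasing corrections that are not modelled here.

```python
BASES = "ACGT"  # assumed channel order for this sketch


def call_base(intensities, scale=(1.0, 1.0, 1.0, 1.0)):
    """Return the base whose scaled intensity is highest in one cycle."""
    if len(intensities) != 4:
        raise ValueError("expected four intensities (A, C, G, T)")
    # Normalization: one scaling factor per channel (hypothetical values).
    normalized = [value * factor for value, factor in zip(intensities, scale)]
    best = max(range(4), key=lambda i: normalized[i])
    return BASES[best]


def call_read(cycles, scale=(1.0, 1.0, 1.0, 1.0)):
    """Call one base per cycle for a single cluster."""
    return "".join(call_base(cycle, scale) for cycle in cycles)


# Made-up intensities for one cluster over three cycles.
example_cycles = [
    (910.0, 45.0, 30.0, 12.0),
    (20.0, 15.0, 780.0, 40.0),
    (35.0, 640.0, 25.0, 310.0),
]
print(call_read(example_cycles))
```

The third cycle shows why the naive rule is fragile: the second-brightest channel is close to half the brightest, which is the situation the chastity filter is designed to catch.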

Quality Score
- Each base has a quality score.
- Solexa's base scoring is similar to Phred scores: a way of expressing estimates of sequencing error probabilities.
- $Q_{\text{phred}} = -10 \log_{10}(P_e)$, where $P_e$ is the error probability of a particular base call.
- Q20 = 1 error in 100 bases
- Q30 = 1 error in 1000 bases
- The quality score is stored in ASCII format: ASCII character code = quality value + 64.

Quality Filtering (GERALD)
- Chastity: the ratio of the brightest intensity over the sum of the brightest and second-brightest intensities. With $I_A$ the brightest and $I_B$ the second-brightest intensity, the threshold is $C = \frac{I_A}{I_A + I_B} > 0.6$.
- Filter (pure bases): a sequence with a chastity of less than 0.6 on two or more bases among the first 25 bases is filtered out.

Alignment: Program ELAND
- Very fast
- Only 2 mismatches allowed in the first 32 bases (N is not counted as a mismatch)
- Alignments are used to estimate error rates

Alignment: Programs in GERALD (ELAND)

| Eland type | Application | Description |
| Eland_extended | Single reads | Aligns single reads to a reference |
| Eland_pair | Paired reads | Aligns paired reads |
| Eland_tag | DGE | Aligns to a nonredundant reference set of sequence tags |
| Eland_rna | Single reads, whole transcriptome | Aligns to a reference genome, splice junctions and contaminations |
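The Phred relation and the chastity filter described above are simple enough to write down directly. The following is a minimal sketch; the function names and the example intensities are assumptions for illustration and are not part of the Illumina pipeline.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(Pe)."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q):
    """Inverse relation: Pe = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Keep a read unless two or more of its first 25 bases fall below 0.6."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


print(round(phred_from_error(0.01)))   # Q20 corresponds to 1 error in 100
print(error_from_phred(30))            # Q30 corresponds to about 0.001

clean = (900.0, 100.0, 0.0, 0.0)   # chastity 0.9
mixed = (500.0, 500.0, 0.0, 0.0)   # chastity 0.5
print(passes_filter([clean] * 24 + [mixed]))      # one low-chastity base: kept
print(passes_filter([mixed, mixed] + [clean] * 23))  # two: filtered out
```

Low-chastity bases after the first 25 cycles do not affect the decision in this sketch, which follows the rule as stated on the slide.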

Sequence Output Formats

Sequence Output: FASTQ (s_1_sequence.txt)
- Line 1: unique ID for a sequencing read
- Line 2: sequence
- Line 3: repeat of the ID (preceded by a + sign)
- Line 4: base-calling quality scores (analogous to Phred scores, but encoded as ASCII characters)

Example:
@30LH2AAXX:8:1:984:225
ATTCCCCTGTACTGAGACATAGAGAGTTTGCAAGACCA
+30LH2AAXX:8:1:984:225
\\\\\\\\\\Z\\\ZZZ\\\\\\W\\\\\ZYYYVYVVV

ELAND Alignment Outputs

s_n_export.txt
- Results of the alignment of all reads in the lane.
- The fields are tab-separated to facilitate export to databases.
- The last field on each line is a flag telling you whether or not the read passed the filter (Y or N).

s_n_sorted.txt
- Contains only entries for reads which pass pure-bases filtering and have a unique alignment in the reference.
- Alignments are sorted by alignment position.

Example:
30LL2AAXX 1 53 735 205 ACGTGCTTACCCTACCACTCTATACCACCATCACTACC UUUUUUUUUUUUUUUUUUUUUUULUULUUUQQOQQIOO NC_001133.fna 354 F 19T10C3ATG1 0
30LL2AAXX 1 8 348 612 ACGTGCTTACCCTACCACTTTATACCACCACCACATGC UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUQQQQQOMO NC_001133.fna 354 F 38 59
30LL2AAXX 1 78 835 1401 TACCCTACCACTTTATACCACCACCACATGCCATACTC UUUUUUUUUUUUUUUUU

Alignment File Format (tab-delimited)
- Run folder name
- Lane
- Tile
- X coordinate of cluster
- Y coordinate of cluster
- Index string (blank for a non-indexed run)
- Read number (1 or 2 for paired-read analysis)
- Read
- Quality string: in symbolic ASCII format
- Match chromosome: name of the chromosome match OR a code indicating why no match resulted
- Match contig: gives the contig name
- Match position: always with respect to the forward strand
- Match strand: F for forward, R for reverse
- Match descriptor: concise description of the alignment
- Single-read alignment score
- Paired-read alignment score
- Partner chromosome (paired read)
- Partner contig (paired read)
- Partner offset
- Partner strand
- Filtering: did the read pass quality filtering? Y for yes, N for no

Example:
30LH2AAXX 8 85 1701 577 CAAATATGTTCAACAAAATTATAGTAGAAAGCTTTCCA ]]]]]]]]]]]]]]]]\]]]]]]\\]Z]]]YYYYYVVV NC_000067.5.fasta 3011999 F 30A7 11 Y

Run Statistics: Quality Control
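To make the four-line FASTQ layout concrete, here is a small reader that yields one record at a time and decodes the quality line with the offset of 64 given earlier. It is a sketch under stated assumptions: the function name and the four-base test record are made up for the example, and other FASTQ variants use a different offset (commonly 33), so the offset is a parameter.

```python
import io


def read_fastq(handle, offset=64):
    """Yield (read_id, sequence, quality_values) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (a blank line also ends parsing here)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near " + repr(header))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # ASCII character code = quality value + offset
        yield header[1:], sequence, [ord(ch) - offset for ch in quality]


# Hypothetical record, not taken from a real run.
example = io.StringIO("@read_1\nACGT\n+read_1\nhhZB\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)
```

Reading record by record rather than loading the whole file keeps memory use flat, which matters for lanes with millions of reads.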

Run Statistics: Summary.htm (Report folder)
- The number of detected clusters
- The number of clusters that passed filtering
- The average intensity of all color channels in all tiles for the first cycle: should be above 100
- Percent intensity after 20 cycles: should be 50% or more
- %PF: should be above 50% (possible problems: too many clusters, faint clusters)
- %Aligned: filtered reads that are uniquely aligned
- %Error rate: should be 1.5 or below

IVC.htm: Intensity Versus Cycle
- The percentage of each base called as a function of the cycle. Each channel (A, T, G, C) is plotted separately.

Error.htm
- The red bar shows the percentage of bases at each cycle that are wrong, based on the ELAND alignment.
- The error rate rises with the cycles.
- Remark: the sequences were selected on their ability to align over the first 32 bases.

Pipeline Outputs
You will find the following folders within the run folder:
[Table: folder, data type, storage space. Folders listed: FC1012X, Gerald, Images, GERALD_29-01-2009, FINAL_29-01-2009, Report. Data types listed: original folders from the pipeline (optional data); the original GERALD folder from the pipeline, which can contain CASAVA (optional data); final text outputs, i.e. sequences and alignments (also found in GERALD); and the summary as a web page, Summary.htm (also found in GERALD). Storage figures listed: 750Gb, 250Gb, and <100Gb transferred to the storage server (dapsas).]
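The rules of thumb for reading Summary.htm can be collected into a small check that flags a lane falling outside them. This is only a sketch: the dictionary keys, the function name and the example values are hypothetical, and Summary.htm itself is an HTML report, so the numbers would first have to be copied or parsed out of it.

```python
def check_lane(stats):
    """Return a list of warnings for one lane, using the thresholds above."""
    warnings = []
    if stats["first_cycle_intensity"] <= 100:
        warnings.append("first-cycle intensity should be above 100")
    if stats["pct_intensity_after_20_cycles"] < 50:
        warnings.append("intensity after 20 cycles should be 50% or more")
    if stats["pct_pf"] <= 50:
        warnings.append("%PF should be above 50% (too many or faint clusters?)")
    if stats["pct_error_rate"] > 1.5:
        warnings.append("error rate should be 1.5 or below")
    return warnings


# Made-up values for two lanes.
good_lane = {
    "first_cycle_intensity": 250,
    "pct_intensity_after_20_cycles": 80,
    "pct_pf": 75,
    "pct_error_rate": 0.9,
}
poor_lane = {
    "first_cycle_intensity": 90,
    "pct_intensity_after_20_cycles": 40,
    "pct_pf": 45,
    "pct_error_rate": 2.3,
}
print(check_lane(good_lane))
for message in check_lane(poor_lane):
    print("WARNING:", message)
```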

Statistics of Runs Performed

The Jigsaw Puzzle
One run yields 4 Gb, made of 100 million pieces, each 40 bases long, and some of the pieces do not fit correctly.

First Step in Analysis
Sequence data (bases & quality) go to one of two kinds of analysis:
- Mapping: aligning to a reference sequence
  1. Resequencing
  2. Transcriptome analysis (RNA-seq)
  3. Cistrome analysis (ChIP-seq)
- De novo assembly: assembling individual sequences into larger sequences

De Novo Sequencing Example: Pseudomonas syringae
Butler et al. FEMS Microbiol Lett 291 (2009) 103-111
- 6-million-base genome, 42x coverage
- ~3.5 million paired-end reads of 36 bases
- De novo assembly using VELVET and EDENA
- At least 3% of the reference genome was absent from the assembly (842 unassembled regions).
- The unassembled regions are noncoding RNA.
- 90% of the protein-coding genes were assembled with 100% accuracy over their full length.

Differences Among the Mapping Applications
- Speed (Bowtie: string matching using the Burrows-Wheeler Transform)
- Use of quality data (MAQ, consed)
- Ability to perform multiple mapping (Nexalign)
- Number of mismatches and indels supported (SOAP)
- Length of seed alignment supported (ELAND: 32 bases)
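The figures quoted above are tied together by the usual coverage arithmetic: total sequenced bases divided by genome size. The sketch below reproduces both numbers. It assumes that "~3.5 million paired-end reads" means 3.5 million read pairs (two 36-base reads each), which is the reading that gives the stated 42x; the function name is made up for the example.

```python
def coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


# Jigsaw puzzle slide: 100 million reads of 40 bases each.
total_bases = 100_000_000 * 40
print(total_bases)  # 4,000,000,000 bases, i.e. 4 Gb

# P. syringae example: 3.5 million pairs (assumed) of 36-base reads
# over a genome of about 6 million bases.
pairs = 3_500_000
print(coverage(pairs * 2, 36, 6_000_000))
```

The same function answers the planning question in reverse: multiplying a target depth by the genome size and dividing by the read length gives the number of reads to request.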

Resequencing Example: SNP Detection & Reporting using CONSED

Consed Can Detect an Inserted Base

CASAVA: Consensus Assessment of Sequence And Variation (Illumina)
- Post-sequencing analysis: uses the export.txt files from the ELAND alignment as input.
- For resequencing projects: produces a set of allele calls of SNPs.
- For RNA-seq (whole-transcriptome sequencing): provides counts for exons, genes and splice junctions.

RNA-Seq
http://en.wikipedia.org/wiki/rna-seq
- The expression value is calculated by counting the number of reads per gene, exon or splice junction.
- Normalization of the expression value is done by:
  - dividing the number of reads by the virtual length of the gene or exon
  - scaling the number of reads between the samples

| Chromosome | Start | End | GeneSymbol | Count_Normalized_Lane2 | Count_Lane2 |
| c12 | 11351282 | 11354633 | PRB4 | 0.57268 | 524 |
| c4 | 70896237 | 70902762 | STATH | 31.23833 | 18743 |
| c7 | 142539296 | 142546956 | PIP | 11.35417 | 6540 |
| c12 | 10889715 | 10893342 | PRR4 | 137.22163 | 77393 |
| c12 | 11310124 | 11313908 | PRB3 | 7.40788 | 8082 |
| c20 | 43314293 | 43316620 | SLPI | 26.63712 | 15929 |

RNA-Seq: An Example of Alternative Splicing
Marioni JC et al. Genome Res. 2008;18:1509-1517. © 2008 by Cold Spring Harbor Laboratory Press
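The two normalization steps above (divide by length, scale between samples) are often combined into reads per kilobase per million mapped reads. The sketch below uses that form as an assumption; the slides do not state which exact formula CASAVA applies, and the function name and numbers are made up, so the result is not expected to reproduce the Count_Normalized column.

```python
def reads_per_kb_per_million(count, length_bp, total_mapped_reads):
    """Length- and depth-normalized expression value (RPKM-style)."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # count / (length in kb) / (total reads in millions)
    return count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bases, in a lane of 10 million reads.
print(reads_per_kb_per_million(500, 2_000, 10_000_000))

# The same gene in a lane sequenced twice as deeply, with twice the reads,
# gets the same value, which is the point of scaling between samples.
print(reads_per_kb_per_million(1_000, 2_000, 20_000_000))
```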

Visualize the Sequence Data: Importing to Genome Viewers

Basic output files: BED
An example of a BED-format file for reads that mapped to a genome:

| CHR | START | STOP | NAME | COUNT | STRAND |
| chr1 | 17071700 | 17071733 | seqname | 2 | + |
| chr1 | 17071700 | 17071734 | seqname | 3 | + |
| chr1 | 17071700 | 17071735 | seqname | 4 | + |
| chr1 | 17071700 | 17071736 | seqname | 26 | + |
| chr1 | 17071701 | 17071736 | seqname | 2 | + |
| chr1 | 17071702 | 17071736 | seqname | 3 | + |
| chr1 | 17088793 | 17088829 | seqname | 1 | + |

Basic output files: WIG
Sequencing "signal" as a wiggle track:

| Locus | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| Signal | 0 | 2 | 3 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 4 | 3 | 1 | 0 |

variableStep chrom=chr1
1 0
2 2
3 3
4 5
5 6
6 6
7 6
8 6
9 6
10 6
11 6
12 6
13 4
14 3
15 1

[Figure: BED and wiggle files imported into the IGB genome browser]
Sultan et al. Science. 2008 Aug 15;321(5891):956-60

Defining DNA-Protein Interactions: ChIP-Seq
MACS: Model-based Analysis for ChIP-Seq
- Use confident peaks to model the shift size
CSHL 2009 - Shirley Liu
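The relationship between the two formats above can be illustrated by piling up BED intervals into a per-base signal and writing it as a variableStep track. This is a sketch under stated assumptions: the fifth column is treated as the number of reads at that interval (as in the example table), BED coordinates are taken as 0-based half-open and wiggle positions as 1-based, and the function name and test intervals are made up.

```python
from collections import defaultdict


def bed_to_variable_step(bed_lines):
    """Pile up BED intervals and return variableStep wiggle lines."""
    coverage = defaultdict(lambda: defaultdict(int))
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip blank lines and header rows
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Assumption: column 5 holds the read count for this interval.
        count = int(fields[4]) if len(fields) > 4 else 1
        for pos in range(start, stop):          # 0-based, half-open
            coverage[chrom][pos + 1] += count   # wiggle is 1-based
    lines = []
    for chrom in sorted(coverage):
        lines.append(f"variableStep chrom={chrom}")
        for pos in sorted(coverage[chrom]):
            lines.append(f"{pos} {coverage[chrom][pos]}")
    return lines


# Two small made-up intervals that overlap on chr1.
example_bed = [
    "chr1 0 3 seqname 2 +",
    "chr1 1 4 seqname 1 +",
]
print("\n".join(bed_to_variable_step(example_bed)))
```

A per-base dictionary is fine for a demonstration; for whole genomes a dedicated coverage tool or an array-based approach would be the more practical choice.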

Example of a Peak (MACS)

| chr | start | end | length | summit | tags | -10*log10(p-value) | Fold enrichment | FDR (%) |
| chr1 | 4838075 | 4838758 | 684 | 278 | 68 | 459.98 | 42.53 | 0.84 |

Bioinformatics wiki: http://bip.weizmann.ac.il/wiki
Everybody is invited to read and add to this wiki!

THANKS
See you at the workshop this afternoon.