Next-Generation Sequencing. Technologies

Similar documents
Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop

Overview of Next Generation Sequencing technologies. Céline Keime

Welcome to the NGS webinar series

Matthew Tinning Australian Genome Research Facility. July 2012

Third Generation Sequencing

Deep Sequencing technologies

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014

Next Generation Sequencing. Tobias Österlund

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

NextGen Sequencing Technologies Sequencing overview

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Lecture 7. Next-generation sequencing technologies

Outline General NGS background and terms 11/14/2016 CONFLICT OF INTEREST. HLA region targeted enrichment. NGS library preparation methodologies

Next-generation sequencing Technology Overview

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

Next-generation sequencing and quality control: An introduction 2016

NGS in Pathology Webinar

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

The Journey of DNA Sequencing. Chromosomes. What is a genome? Genome size. H. Sunny Sun

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

The New Genome Analyzer IIx Delivering more data, faster, and easier than ever before. Jeremy Preston, PhD Marketing Manager, Sequencing

Next-generation sequencing technologies

Human genome sequence

Surely Better Target Enrichment from Sample to Sequencer and Analysis

Get to Know Your DNA. Every Single Fragment.

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

Introduction to Next Generation Sequencing (NGS)

Agilent NGS Solutions : Addressing Today s Challenges

NEXT GENERATION SEQUENCING. Farhat Habib

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

Wheat CAP Gene Expression with RNA-Seq

you can see that if if you look into the you know the capability kilobases per day, per machine kind of calculation if you do.

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Surely Better Target Enrichment from Sample to Sequencer

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Bioinformatics Advice on Experimental Design

Integrated NGS Sample Preparation Solutions for Limiting Amounts of RNA and DNA. March 2, Steven R. Kain, Ph.D. ABRF 2013

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

A Crash Course in NGS for GI Pathologists. Sandra O Toole

The Expanded Illumina Sequencing Portfolio New Sample Prep Solutions and Workflow

Genome Resequencing. Rearrangements. SNPs, Indels CNVs. De novo genome Sequencing. Metagenomics. Exome Sequencing. RNA-seq Gene Expression

RNA Sequencing. Next gen insight into transcriptomes , Elio Schijlen

Contact us for more information and a quotation

Analytics Behind Genomic Testing

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

RNA-Seq data analysis course September 7-9, 2015

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

DNA-Sequencing. Technologies & Devices

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

About Strand NGS. Strand Genomics, Inc All rights reserved.

DNA-Sequencing. Technologies & Devices

SureSelect Target Enrichment for the Ion Proton TM Next Generation Sequencing System

G E N OM I C S S E RV I C ES

Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Next Generation Sequencing. Target Enrichment

Understanding the science and technology of whole genome sequencing

Research school methods seminar Genomics and Transcriptomics

NGS technologies: a user s guide. Karim Gharbi & Mark Blaxter

02 Agenda Item 03 Agenda Item

Next Gen Sequencing. Expansion of sequencing technology. Contents

Sequencing techniques

Introductory Next Gen Workshop

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Wet-lab Considerations for Illumina data analysis

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

Chapter 7. DNA Microarrays

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience

Genome 373: High- Throughput DNA Sequencing. Doug Fowler

Targeted Sequencing in the NBS Laboratory

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias

2nd (Next) Generation Sequencing 2/2/2018

Introduction to RNA-Seq in GeneSpring NGS Software

Next Generation Sequencing (NGS)

Outline. General principles of clonal sequencing Analysis principles Applications CNV analysis Genome architecture

Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Modern Epigenomics. Histone Code

Next Generation Sequencing in Genetic Diagnostics Alan Pittman, PhD

Read Mapping and Variant Calling. Johannes Starlinger

Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Sample to Insight. Dr. Bhagyashree S. Birla NGS Field Application Scientist

Introduction to the NextSeq System

CM581A2: NEXT GENERATION SEQUENCING PLATFORMS AND LIBRARY GENERATION

Using New ThiNGS on Small Things. Shane Byrne

Gene Expression Technology

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

Services Presentation Genomics Experts

Introducing QIAseq. Accelerate your NGS performance through Sample to Insight solutions. Sample to Insight

Next- gen sequencing. STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan

Sequencing techniques and applications

MHC Region. MHC expression: Class I: All nucleated cells and platelets Class II: Antigen presenting cells

Transcription:

Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

Generations of Sequencing Technologies Zero-Generation First-Generation Second-Generation 1970s-80s 1980s-90s 2000s Next-Generation Sequencing Technologies Sanger Chain Termination Radiolabelling Automated Fluorescent Capillary Sequencers Nicholas E. Navin, Ph.D. ABI3700 Massively Parallel Sequencing 454 Pyrosequencer Output: 700bp / day Output: 384kb /day Output: 1 gigabase / day

Next-Generation Sequencing Technologies Second-Generation 2005-2012 Third-Generation 2010 Third-Generation 2012 Next-Generation Sequencing Technologies Illumina HiSeq2000 PacBio Real-time sequencing Nicholas E. Navin, Ph.D. Output: 150gb / week Read Length: 100bp Output: 1gb /day Read Length: 10kb Oxford Nanopore Sequencing Output:? Read Length: 40kb

Data Analysis Bottleneck Today we will begin to learn how to do the $100,000 analysis

Genome Sequencing Centers! Initially, all genome sequencing was performed at Genome Sequencing Centers

Today NGS can be performed at most Universities Many Universities (including MD Anderson) have sequencing core facilities that can sequence DNA or RNA samples provided by users Machines such as the Illumina HiSeq 2000 can generate: ~150 gigabases per week Very low error rates: 0.01% ~2 human genomes at 40-50X coverage Cost of sequencing Genome: $5000 Exome: $1000 Transcriptome: $500

Next-Generation Sequencing Platforms Which sequencing platform to choose?

Technical Considera/ons: Considerations Sequencing Pla6orms Run Time Sequencing Error Rates Amount of Data Output Library Input Requirements Hardware Access Sample Multiplexing Life Sciences Proton Ion Life Sciences Ion Proton Complete Genomics Oxford Nanopore gridion Illumina HiSeq2000 PacBio RS

Technical Considerations Instrument Primary Errors Final Error Rate % Data Output Per Run Read Length (bp) AB 3730 (Capillary) Substitutions 0.1-1 48 Mb 700 454 GS FLX+ Indels 1 0.7 Gb 300-500 Illumina HiSeq Substitutions 0.1 300 Gb 150 Ion Proton Indels 1 10 Gb 100-200 SOLID 5500xl A-T Bias 0.1 180 Gb 50 Oxford Nanopore Deletions 4 variable 10-40kb PacBio RS CG Deletions 15 0.1 Gb 10,000 Complete Genomics Indels 0.0001 60 Gb 35

Pyro Sequencing Systems: GS Junior, GS FLX by Roche 454 Advantages: Long read length (300-500bp), short run time (10 hours) Disadvantages: Cannot resolve homopolymers, resulting in false indels; Data output is limited to ~700mb per run; emulsion pcr library prep

Semi-Conductor Sequencing Systems: Ion Torrent, Ion Proton (Life Technologies) Advantages: Highly Scalable, short run time 2 hours, 10gb data output Disadvantages: cannot resolve homopolymers (resulting in false indels), emulsion pcr library prep

Sequencing by Synthesis (Illumina) Systems: HiSeq2000 MiSeq, GA2 (Illumina) Advantages: High data output (150gb per run), Paired-end sequencing, inexpensive, resolves homopolymers well Disadvantages: Long runtime (1 hour per base), must run 8 lanes in parallel (hiseq) Illumina Video

Real-time Sequencing Zeptoliter (1e-20) chamber uses Zero Mode Wavelength (ZMW) camera Advantageous: Run time occurs in real time; several hours; Read length of 5-10kb; Can detect epigenetic modification to DNA; single molecule sequencing. Disadvantageous: High error rate 15%; low data output: 50mb Pacbio video

Nanopore Sequencing Advantageous: Very long read lengths (40kb), direct single molecule sequencing, low-throughput USB device will be released Disadvantageous: High error rate 5%, Probably many other problems, but no papers have been published yet Ofxord Nanapore Video

NGS Can Detect Different Classes of DNA Mutations Sequencing throughput has increased from 700 bases per run to 150 gigabases per run on the Illumina HiSeq2000 Meyerson et al, Nature 2011

Applications of NGS DNA Copy Number Amplifications and Deletions Structural Variations (translocations, inversions) Indels (insertions and deletions) Point mutations (nonsense, missense, splice-site) RNA RNA expression levels (mrna, microrna, lincrna) Alternative Transcripts Fusion transcripts Protein Binding Chip-Seq Epigenetic Bisulfite Sequencing Third Generation Sequencing Platforms

Illumina Next- Generation Sequencing Introduction to Bioinformatics

Illumina Sequencing Approaches Single-End 67% of the market Sequencing uses Illumina sequencing platforms Mate-Pair for NGS experiments Sequencing Resulting Data One sequence read per molecule; good for detecting point mutations and small indels Workflow for Illumina Sequencing Paired-End Sequencing Data Processing Good for genome assembly and resolving large regions with repetitive DNA elements Illumina

Single-End, Paired-End or Mate-Pairs? Single-End DNA seq RNA seq Chip-Seq Paired-End Genome Sequencing Genome Assembly Detection of Chromosome rearrangements Useful when more data is required; pooled exomes Mate-Pairs Genome Assembly Resolving genomic regions with complex repetitive elements

Library Construction Single or Paired-End Libraries Mate-Pair Sequencing Insert size = 200-500bp Insert size = 2000-5000bp

Considerations: Read Length Short Reads (36-76bp) Useful for Counting Applications : RNA-Seq for determining expression levels from read counts Copy Number Profiling measuring copy number from read count density Chip-Seq determining protein binding by read density Also, targeted sequencing of small genomic regions Long Reads (100-150bp) Useful for: RNA-Seq to detect alternative splice forms Exome sequencing in which more data is needed Genome Assembly Genome Sequencing Chromosome Structural Rearrangement Important: The read length must be shorter than the library insert size, otherwise many reads will sequence into the adapters

Insert Size Distributions Experimental Calculated from Sequence Read Data 200bp- 400bp 100bp- 200bp During library construction the insert size of the DNA libraries can be selected by gel extraction The insert size has important implications for sequencing, and determining which read length to use, 200-400bp is common Note: the insert size must be larger than the read length, otherwise the adapters will be sequenced this will cause errors in alignment

Library Construction DNA Fragmentation 8bp Indexes allow samples to be pooled together in a single lane Illumina

Indexing (Barcoding) Libraries for Pooled Sequencing Reactions Index = 8bp unique sequences that is added to each library Each library contains a unique index sequence Typically samples are pooled together using 4-12 samples ATGCAGAT TTCGAGAT CCGACGAC GAGATCAT CGATGCTA ACTGCAGC Sample 1 Reads Sample 2 Reads Sample 3 Reads Sample 4 Reads Sample 5 Reads Sample 6 Reads advantages: major cost and time savings

Exome Capture For many research projects it may be advantageous to sequence only the coding regions of the human genome (the exome) This can be achieved using exome capture kits (Illumina, Agilent, Nimblegen) to capture DNA in exonic regions by hybridizing to biotinylated probes before sequencing

Flowcell lane1 lane8 Flowcell surface is coated with a lawn of primers that hybridize to the library adapters During the sequencing run, reagents are flown through the chambers after each cycle Each Flowcell has 8 lanes for sequencing 8 different samples Illumina

Cluster Amplification Hybridization to primer lawn, 3 extension, denaturation Repeated rounds of bridge amplification and denaturation lead to clonal expansion of reads Reverse strand cleavage, blockage and hybridization of sequencing primers Sequencing libraries are hybridized to a flowcell surface and clonally amplified so that they can be imaged by fluorescent microscopy on the HiSeq2000 Illumina.

Reversible Terminator Chemistry Illumina Inc.

Sequencing Run Modified bases with reversible terminators and fluorophores are incorporated during each cycle The fluorophore is excited and TIF images are collected across the surface of the flowcell tiles for base-calling The fluorophore and terminator are then cleaved and the next base is incorporated Illumina

Paired-End Sequencing Run Illumina

Run Metrics: Cluster Density Excessive cluster density makes it difficult for the image analysis algorithm (RTA) to detect clusters for base calling Important: to quantify the DNA library concentration accurately before sequencing (qpcr, Bioanalyzer)

Illumina Sequencers Are Fluorescent Microscopes Illumina

Base Quality Scores Q= -10 log 10 (e) e = estimated probability base call is wrong Q Score Prob Incorrect 10 1 in 10 90% 20 1 in 100 99% Base Accuracy 30 1 in 1000 99.9% Base quality scores are encoded in ASCII characters representing numbers 0-41 (QSCORE +33) ACTCAGCATCATCATTTCTATTCATTAGTA DDCCBABA@@A?A?>>+<+<;;;::98765 ASCII Value Q-Score! 33 0 34 1 # 35 2 $ 36 3 % 37 4 & 38 5 ` 39 6 ( 40 7 ) 41 8 * 42 9 + 43 10, 44 11-45 12. 46 13 / 47 14 ) 48 15 1 49 16 2 50 17 3 51 18 4 52 19 5 53 20 6 54 21 7 55 22 8 56 23 9 57 24 : 58 25 ASCII Value Q-Score ; 59 26 < 60 27 = 61 28 > 62 29? 63 30 @ 64 31 A 65 32 B 66 33 C 67 34 D 68 35 E 69 36 F 70 37 G 71 38 H 72 39 I 73 40 J 74 41

Quality Score 100 Paired-End run 40 0 0 100 200 Quality scores decrease at the ends of reads due to fluorescence quenching after each cycle

Coverage Depth vs. Coverage Breadth Coverage depth mean number of read counts across all the bases of a sequenced sample. usually denoted by X : ex. 30X coverage depth Coverage breadth % of the genome sequence at 1X read coverage Usually denoted by %: ex. 80% coverage breadth 10X 80% *These metrics are calculated using reads with unique mappings only

Coverage Depth Coverage depth normally follows a Poisson distribution 62X Coverage Depth Yong Wang

Coverage Depth Poisson Distribution Using a Poisson probability distribution it is possible to calculate how many reads will have X amount of target coverage: Example: what it the probability that a site will be covered exactly ten times if the sample has 5X mean coverage depth? 5 10 * e -5 / 10! = 0.018 1.8% of the bases will have exactly 10X coverage

How Deep Should I Sequence? Depends largely on how rare the expect alleles are in the sample For a normal genome, most allele frequencies are 1.0 (homozygous) and 0.5 (heterozygous), so 30X coverage will provide 100% power of detection P(d) = 1 (1 a) c However in tumors (where heterogeneity is common) alleles may occur at 10% or 1% frequencies, which requires 50X or 500X coverage to detect with 100% power

Coverage Breadth Normally coverage breadth will approach 90% in a human genome sample, when reads are mapped with unique coordinates. About 10% of the genome cannot be mapped uniquely due to repetitive elements Alternatively coverage breadth can be calculated using multiple mapped reads, in which case it approaches 100% Good coverage breadth of the genome calculated using uniquely mapped reads Poor coverage breadth Navin, Wang

Data Processing Pipelines

Processing Pipeline Sequencing & Base Calling Library Construction Data Processing Align FASTQ files to Human Genome Barcoding Convert to SAM file Cluster Generation Sequencing Run Generate TIF Images Convert to BAM file (binary file) Sort Reads by Chromosome Position RTA Cluster Detection Remove PCR Duplicates Base Calling Variant Calling (GATK, Varscan) Deconvolution of Indexes Variant Annotation (Annovar, COSMIC) Generate FASTQ files Prediction Algorithms (SIFT, Polyphen)

Read Alignment Popular Short-Read Alignment Software: Paired-End Sequencing Mate-Pair Sequencing A sequencing run generates millions of reads that need to be aligned to the human genome (assembly HG18, HG19) in order to determine their chromosome positions In a typical experiment, between 80-90% of the reads will map uniquely to the human genome However, some samples, such as mouth swabs may contain significant contamination from bacterial DNA, which can decrease the number of reads that align to the human genome BWA Bowtie http://bio-bwa.sourceforge.net http://bowtie-bio.sourceforge.net MAPPING ERRORS: short read aligners have a higher number of mapping errors compared to pair-wise alignment methods (BLAST, Smith- Waterman). Poorly mapped reads can be filtered by mapping quality score before variant detection to eliminate false-positives Good for genome assembly and resolving large regions with repetitive DNA elements

Removing PCR Duplicates During library construction, PCR is used to enrich the DNA before sequencing, Single-End Sequencing which can generate duplicate reads with false-positive Mate-Pair mutations Sequencing Paired-End Sequencing PCR Duplicates Good for genome assembly and resolving large regions with repetitive DNA elements PCR duplicates usually constitute about 10% of the total sequence reads They can be removed informatically in post-processsing, because they have identical start and stop coordinates

Sequencing Errors and Stand Bias Mate-Pair Sequencing Paired-End Sequencing Sequencing errors can be randomly distributed (1) in single reads or expanded across multiple reads (2,3) Expanded errors are dangerous because they can be called as real mutations by variant calling algorithms One way to mitigate these errors is to look at strand bias Good for genome assembly and Normal mutations should occur on both + and strands resolving with large equal regions frequencies with repetitive DNA elements If the mutation occurs on only one strand (2,3) then it is likely an error

GC Bias Mate-Pair Sequencing Paired-End Sequencing In most genomes GC content follows a normal distribution However, PCR and sequencing can generate GC bias in which reads with lower GC content are amplified or sequenced more frequently (right panel) Good for genome assembly and resolving large regions with repetitive DNA elements

Indel Alignment Errors Before Indel Realignment Mate-Pair Sequencing After Indel Realignment Paired-End Sequencing False-positive mutations frequently occur in regions with indels. To remove these errors, it is necessary to realign the sequences around the indels using a more accurate alignment algorithm Good for genome assembly and resolving large regions with repetitive DNA elements

Summary Single-End Select the appropriate Sequencing sequencing platform for your project Mate-Pair and consider Sequencing what type of biological information you are interested in measuring (DNA, RNA, CHIP, Epigenetics) Sequencing technologies are constantly evolving, understand the caveats and advantageous of each system. The newest platform is not always best. Consider the appropriate read length, single or paired end sequencing and whether indexing or exome capture is advantageous for your study Paired-End Sequencing Determine what target sequence depth is appropriate for your sample and calculate the number of lanes necessary to achieve that depth Be aware of false-positive technical errors (PCR amplification, sequencing Good for genome assembly and resolving large regions with repetitive DNA elements errors, mapping errors) and how to mitigate them in your processing pipeline

Acknowledgements Several slides in this lecture were provided by: Jude Kendall (Cold Spring Harbor Laboratory) Future IPTC across the street Michael Schatz (Cold Spring Harbor Laboratory) Yong Wang (MD Anderson) Illumina Inc. (San Diego)