Analysing genomes and transcriptomes using Illumina sequencing

Similar documents
Introductory Next Gen Workshop

Matthew Tinning Australian Genome Research Facility. July 2012

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

Next-generation sequencing technologies

Introduction to Next Generation Sequencing

Next Generation Sequencing. Tobias Österlund

Deep Sequencing technologies

Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop

RNA-Seq data analysis course September 7-9, 2015

RNA-Sequencing analysis

The Expanded Illumina Sequencing Portfolio New Sample Prep Solutions and Workflow

Ultrasequencing: methods and applications of the new generation sequencing platforms

How much sequencing do I need? Emily Crisovan Genomics Core

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

NEXT GENERATION SEQUENCING. Farhat Habib

Third Generation Sequencing

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento

NextGen Sequencing and Target Enrichment

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Introduction to Next Generation Sequencing (NGS)

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

Transcriptome analysis

G E N OM I C S S E RV I C ES

Applications of Next Generation Sequencing in Metagenomics Studies

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

Research school methods seminar Genomics and Transcriptomics

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

How much sequencing do I need? Emily Crisovan Genomics Core September 26, 2018

NextSeq 500 System WGS Solution

Next-Generation Sequencing. Technologies

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

NOW GENERATION SEQUENCING. Monday, December 5, 11

Introduction to RNA-Seq in GeneSpring NGS Software

Next-generation sequencing technologies

Read Quality Assessment & Improvement. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

RNA sequencing with the MinION at Genoscope

Contact us for more information and a quotation

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014

Modern Epigenomics. Histone Code

About Strand NGS. Strand Genomics, Inc All rights reserved.

Genome Resequencing. Rearrangements. SNPs, Indels CNVs. De novo genome Sequencing. Metagenomics. Exome Sequencing. RNA-seq Gene Expression

Chapter 7. DNA Microarrays

CBC Data Therapy. Metagenomics Discussion

Wet-lab Considerations for Illumina data analysis

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

Human genome sequence

Genomic resources. for non-model systems

Welcome to the NGS webinar series

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Bioinformatics Advice on Experimental Design

The New Genome Analyzer IIx Delivering more data, faster, and easier than ever before. Jeremy Preston, PhD Marketing Manager, Sequencing

High throughput sequencing technologies

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

Next Gen Sequencing. Expansion of sequencing technology. Contents

Overview of Next Generation Sequencing technologies. Céline Keime

Genome Analyzer. RNA ChIP

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

The Journey of DNA Sequencing. Chromosomes. What is a genome? Genome size. H. Sunny Sun

Wheat CAP Gene Expression with RNA-Seq

Illumina s Suite of Targeted Resequencing Solutions

Genome sequencing in Senecio squalidus

What is Bioinformatics?

Using New ThiNGS on Small Things. Shane Byrne

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

An introduction to RNA-seq. Nicole Cloonan - 4 th July 2018 #UQWinterSchool #Bioinformatics #GroupTherapy

Services Presentation Genomics Experts

Ultrasequencing: Methods and Applications of the New Generation Sequencing Platforms

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

GREG GIBSON SPENCER V. MUSE

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems

High Cross-Platform Genotyping Concordance of Axiom High-Density Microarrays and Eureka Low-Density Targeted NGS Assays

Low input RNA-seq library preparation provides higher small non-coding RNA diversity and greatly reduced hands-on time

Processing Data from Next Generation Sequencing

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

DNA. bioinformatics. epigenetics methylation structural variation. custom. assembly. gene. tumor-normal. mendelian. BS-seq. prediction.

FRAUNHOFER INSTITUTE FOR INTERFACIAL ENGINEERING AND BIOTECHNOLOGY IGB NEXT-GENERATION SEQUENCING. From wet lab to dry lab complete sample analysis

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

Lecture 7. Next-generation sequencing technologies

Next Generation Sequencing Technologies. Some slides are modified from Robi Mitra s lecture notes

Differential gene expression analysis using RNA-seq

Measuring and Understanding Gene Expression

02 Agenda Item 03 Agenda Item

DNA-Sequencing. Technologies & Devices

Next- gen sequencing. STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Next-generation sequencing Technology Overview

Transcription:

Analysing genomes and transcriptomes using Illumina uencing Dr. Heinz Himmelbauer Centre for Genomic Regulation (CRG) Ultrauencing Unit Barcelona

The Sequencing Revolution High-Throughput Sequencing 2000 High-Throughput Sequencing 2009 Image credit: U.S. Department of Energy, JGI 96 uences per hour 2.6 million uences per hour

Next generation uencing platforms Applications mrna-seq: expression profiling, transcript discovery Small RNAs (mirnas) ChIP-Seq, RIP-Seq DNA methylation Re-uencing, haplotyping (SNPs and structural variation) De novo uencing

Sequencing platforms Technology Year Read length Chemistry Template (nt) amplification Sanger 1977 1000 SBS cloning/pcr 454 2005 250-500 Pyro PCR Solexa 2006 36-75 SBS PCR SOLiD 2007 35 Ligation PCR Helicos 2008 30 SBS no amplification PacBio 2010? 1000 SBS no amplification VisiGen 2009? 1000 SBS no amplification Oxford Nanopore?? Elco no amplification SBS = Sequencing by Synthesis Pyro = Pyrouencing Ligation = Sequencing by ligation assays Elco = Electrochemical detection

Illumina/Solexa technology Library preparation Cluster generation Cycle uencing Solexa GA II run: - 45 hours - 1 TB image data - 10 x 8 mio. 36-50 nt - up to 2.9 Gbp Base calling

Summary of Solexa properties Reads are short (36 nt), but get progessively longer (50 nt, 75 nt) 8 samples can be uenced in parallel Protocols in use at the CRG: Shotgun uencing (genomic DNA, cdna, ChIP-Seq, RIP-Seq) mrna-seq Indexed RNA-Seq Small RNA (mirnas) identification and profiling Gene expression profiling (Tag uencing, similar to SAGE) Paired end uencing (long and short paired ends) Advantages Millions of reads from your sample Relatively cheap Disadvantages Millions of reads from your sample Some biases, poorly understood Sequencing errors (but Illumina improves) Demands on IT infrastructure (data processing, analysis, backup)

Illumina GA II workflow 4 images/tile per cycle ~23 GB per cycle ~800GB for 36 cycles 1 intensity file per tile ~1.4 GB per cycle ~ 50 GB for 36 cycles Illumina Genome Analyzer 1 flowcell 8 lanes Run summary Error report Alignment coordinates Several files per lane ~40 GB for 36 cycles Gerald: Aligns uences with Eland Firecrest: quantifies the intensity levels in images. Bustard: calls bases by comparing intensity levels 1 uence file per tile ~1.1 GB per cycle ~40 GB for 36 cycles

Images taken by GA II Computing Infrastructure Images written to Dell Precision Workstation (2 Dual-cores 2.66 Ghz, 3GB RAM, 1TB RAID) Images copied to HP Proliant IPAR (2 Quad cores 3 GHz, 16GB RAM, 4TB RAID). ensity files created 'real-time'. Images Images Images ensity files copied to analysis server ensity files Data stored on 1TB FATA discs (17TB RAID). Images written to tape and deleted. Base-calling, Gerald and post-analysis performed on HP Proliant analysis servers (2 or 4 Quad-cores ~3.2 GHz, 32GB RAM)

Solexa/Illumina uencing at the CRG Set up of GA I Upgrade to GA II Set up of 2nd GA II 60 Number of flowcell lanes used 50 40 30 20 10 0 02/2008 04/2008 06/2008 08/2008 10/2008 12/2008 02/2009 04/2009 01/2008 03/2008 05/2008 07/2008 09/2008 11/2008 01/2009 03/2009 05/2009 Month

Open-source LIMS for Solexa run management Dadabik (database interface creator) web browser user interface Dadabik web browser admin interface SQL database Automated scripts which update the database and send notification e-mails Requires Dadabik, SQL, Apache, PHP, and a Linux web server

Open-source LIMS for Solexa run management

Cluster densities 150.000 clusters per tile recommended 100 tiles per lane, 15 million raw uences per lane Number of usable uences not affected by further density increase Improved software (Illumina pipeline 1.3.2 and 1.4.0) to deal with merged clusters

Solexa uencing: Many millions of of short reads : AATTATGCCTTGATGTGTGGGCGATCGCTTCACATA : ATGAAAAAATCGCGCTAGATTTATTGCCGGTGATTT : AAACTCAAATTCCTATTATATCAAACAATTTTTATA : AACTGACGGAGTTCCTTTTTGGTCGCTCCGCTTTGC : AGCTTGAATGCCCCTAGTGCTTTTATTTTATCGCGC : ATCCGCATAAAAGCATTTATCCATGTGGGTAAGCTG : ATTGGGGGTGGGATTAGCGCTTGCACTAGATTTTGG : AATAAAACCCCATGCGTTTTACTTTTTCTTGGGCGT : AGCAAAATAAACCAATCATCTTAAGCCAGCAAAGCA : GCTCTTTTGCACGATAAATTTCACCGCTTCTTTATG : AAAAGCGCAATTGAAAGACGATTTACGCCAATTGAG : AGCTTACCCTCTTTATCATGGTAAGTGAAAGGGGGG : GTGGTCAATATTTTGGGGTTTTTTAACATTAAATTT : GGGTTTAAAGCAGTCTTTTTTATAAAAAAGTGATTT : AACAACAAGCTCTCTGTGGGGCTTTTTGGCGGTATC : AGGGATCGTTTTTGCTCACTAACTCGCCCTTTGGCT : GTGATTAAAATGGGGCGCACCCAGCTTCAAGACGCT : ATGAAGCTTTGTCGCTTTGCTTAGTTATAGAGAGTT : GGGTGGTAGGAGTGTGGAGCTAGTAGAGGCTGTGGT : AATTTGGAATAAGTAGAGATAAGTCGTTTAGTATCC : GAGTTTTAAAGTGTCTAGCCCATAAGAAGAAAAAGT [...]

Analysis of Solexa data Comparison with reference uence (not possible with de novo uencing) Genome uence Transcripts or gene predictions Selected data sets (e.g. mirna uences) Positioning Solexa reads on the reference genome by alignment No match Match at exactly one position Match at more than one position Mostly only reads with exactly one match in the reference uence are considered for downstream analysis Fast algorithms required

Exemplary data (GA I) Postioning 12.3 million reads on the Helicobacter acinonychis reference genome Up to two mismatches permitted

Software to align reads against a reference genome Programs differ by: Speed, sensitivity, mismatch tolerance, permitted read length Eland (Illumina) - 2 mismatches in alignment tolerated, no Indels - Reads need to have identical lengths - Read lengths up to 32 nt + Very fast SeqMap (Jiang, Wong, Bioinformatics, 2008) + Tolerates up to 5 mismatches and Indels + Different read lengths of input uences is possible + Reads may exceed lengths of 32 nt - Slow Soap (Li et al., Bioinformatics, 2008) + Tolerates up to 5 mismatches and Indels - Reads should have similar lengths + Read lengths up to 60 nt + Fast for entire genomes, demanding on hardware

Solexa reuencing Alignment against reference genome Estimation of uencing error rate Discovery of biologically relevant uence differences Discovery of variation within a species, by inter-strain comparisons

De novo genome assembly from Solexa data? YES! Input data: 12.3 million reads (32 nt) from Helicobacter acinonychis str. Sheeba (1.6 Mbp) Previously uenced with Sanger technology (Eppinger et al., 2006) Identical DNA preparations used Reads Contigs NotMatched Mean Len Max Len N50 N80 Seq Cov a Q >20 1174569 1330 5 1157 10866 2257 1015 97,39% Q >25 1098233 1379 5 1110 13684 2232 958 97,16% Q >30 1034034 1628 4 937 13673 1893 772 96,49% Q >35 978900 1967 4 773 13020 1558 597 96,33% b merged 932 5 1603 18577 3476 1364 95,21% c all merged 937 1636 18854 3659 1501 97,70% d contig overlap >5 bp 228 6762 62478 19890 8012 97,70% contigs N50 size N80 size 0 10 20 30 40 50 60 70 80 90 100 %

De novo genome assembly from Solexa data? YES! Genome Research, June 2009

Acknowledgements CRG - Ultrauencing Unit Lab: Ana Vivancos Ester Castillo Anna Menoyo Maik Zehnsdorf Bioinformatics: Robert Kofler Juliane Dohm Matt Ingham Debayan Datta André Minoche Cooperations CRG Systems Biology Program Marc Güell Luis Serrano CRG- Genes and Disease Program Eulàlia Martí Universitat Autonoma de Barcelona Miguel Pérez-Enciso Mònica Bayés