NOW GENERATION SEQUENCING. Monday, December 5, 11

SEQUENCING TIMELINE
1953: Structure of DNA
1975: Sanger method for sequencing
1985: Human Genome Sequencing Project begins
1990s: Clinical sequencing begins
1998: NHGRI $1000 Genome Goal established
2003: Human Genome Project completed. $100 million x 10 years
2005: First Next-Generation Platform released
2011: Small form-factor NGS: IonTorrent, MiSeq

EXPONENTIAL GROWTH
The rate at which sequence data can be acquired is outpacing Moore's Law.
[Figure: Bases/Run and FLOPS plotted against Year]

PLATFORM NUTS+BOLTS
Roche 454: http://454.com/products-solutions/multimedia-presentations.asp
Illumina Genome Analyzer: http://www.youtube.com/watch?v=77r5p8ibwjk
Life Technologies SOLiD: http://media.invitrogen.com.edgesuite.net/ab/applications-technologies/solid/solid_video_final.html
Pacific Biosciences: http://www.youtube.com/watch?v=_b_cuz8hsyu
IonTorrent: http://www.youtube.com/watch?v=yvf2295jqug
Platform mechanistic underpinnings have real bioinformatic consequences.

QUICK COMPARISON

Roche/454 GS FLX Titanium
  Read length: 400-500bp. Run time: 8-12h. Gb/run: 0.45. Cost: $500k.
  Pros: Longer read length; Short run time.
  Cons: Relatively expensive compared to Illumina or SOLiD. Polynucleotide tract errors.

Illumina GA II, HiSeq, MiSeq
  Read length: 36-100bp. Run time: 1-4d. Gb/run: 18-500. Cost: $550k.
  Pros: Massive throughput; Inexpensive.
  Cons: Higher error rates; Rapidly evolving platform. Less ability to multiplex.

Lifetech SOLiD and ABI 5500
  Read length: 50bp. Run time: 4-5d. Gb/run: 30-170. Cost: $595k.
  Pros: Colorspace (2bit) bioinformatics.
  Cons: Colorspace bioinformatics; Lower adoption rate than Illumina. Long run times.

PacBio
  Read length: 850-1500bp. Run time: 40m. Gb/run: 0.40. Cost: $695k.
  Pros: Speed; Run length.
  Cons: Perpetual beta; Strobing; Low throughput; Cost.

IonTorrent
  Read length: >200bp. Run time: 2h. Gb/run: 0.1-1.0. Cost: $75k.
  Pros: Low instrument cost and convenient form factor. Short run time. Good run length.
  Cons: Low throughput.

Source: Metzker, Nat. Rev. Genetics, Jan 2010 + Additional research

EXPERIMENT TYPES*
Genome discovery (genomic DNA)
  De novo assembly
  Reference-guided assembly
Variant identification
  SNPs
  Indels and rearrangements
Enrichment-Seq
  ChIPseq
  C3
  Histone occupancy
  DNase I hypersensitivity
Methylation profiling
  Bisulfite sequencing
  MeDIP (ChIP)
  Methylation-sensitive restriction enzymes
RNA transcript discovery
  De novo assembly
  Reference-guided discovery
Small molecule profiling
  miRNA sequencing
  Small RNA sequencing
Quantitative measures
  RNA-seq
  Allele-specific expression

GENOME DISCOVERY
De novo assembly: many sequence reads, randomly generated -> algorithm -> contigs/chromosomes
Reference-guided assembly: related genome

GENOME DISCOVERY
Many sequence reads, randomly generated + reference genome -> algorithm -> genome sequence alignments
From the alignments: consensus sequence, deletion, rearrangement

ENRICHMENT
Complex biochemistry involving ultrasonic DNA breakage, antibodies, liquid buffers and other things that will ruin your computer...
-> many sequence reads, enriched for specific areas of the reference genome
-> algorithm (with reference genome) -> genome sequence alignments
-> algorithm -> intervals

Visualization is key to ChIP
ChIP-seq data in Broad Institute IGV
  Distribution of reads is decidedly non-random
  Visible peak structures in ChIP data
Empirical method to assess saturation
  Genome alignments
  Run feature-finding algorithm on each sample
  Sub-sample 0.1, 0.3, 0.5, 0.7, 0.9 x
  Plot features vs coverage depth
[Figure: features detected (axis 0 to 90) vs subsampling fraction (0.1 to 0.9)]
ChIPseq Community Challenge: http://seqanswers.com/forums/showthread.php?t=1039
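
The saturation check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `count_features` is a hypothetical stand-in for a real feature-finding (peak-calling) algorithm, the bin size and read threshold are arbitrary, and the read positions are simulated rather than taken from any real ChIP-seq run.

```python
import random
from collections import Counter


def count_features(read_starts, bin_size=100, min_reads=5):
    """Hypothetical stand-in for a peak caller: a 'feature' is any
    fixed-width bin that holds at least min_reads read start positions."""
    bins = Counter(pos // bin_size for pos in read_starts)
    return sum(1 for n in bins.values() if n >= min_reads)


def saturation_curve(read_starts, fractions=(0.1, 0.3, 0.5, 0.7, 0.9), seed=0):
    """Sub-sample the reads at each fraction and count features found."""
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        subset = rng.sample(read_starts, int(len(read_starts) * frac))
        curve.append((frac, count_features(subset)))
    return curve


# Simulated data (assumption): three enriched regions plus uniform background.
sim = random.Random(1)
peaks = [sim.gauss(center, 20) for center in (1000, 5000, 9000) for _ in range(50)]
background = [sim.uniform(0, 10000) for _ in range(200)]
reads = [int(p) for p in peaks + background]

for frac, n_features in saturation_curve(reads):
    print(frac, n_features)
```

If the feature count is still climbing at 0.9 x, the library has probably not been sequenced to saturation; if it flattens early, additional depth is unlikely to reveal many new features.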

TRANSCRIPT DISCOVERY
Many RNA sequence reads, semi-randomly generated
Reference-guided discovery: reads + reference genome -> algorithm
De novo discovery (de novo assembly): reads -> algorithm -> transcript contigs

QUANTITATIVE RNA-SEQ
RNA libraries from n treatments
Sample 1/n + reference genome -> algorithm -> Transcriptome alignment 1/n
Sample 2/n + reference genome -> algorithm -> Transcriptome alignment 2/n
Sequence alignments -> tabular results
Correct for:
  Library size
  Transcript length
  Sequencing biases
  Platform biases
  Paralogous genes
  Polymorphism
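
The first two corrections in the list (library size and transcript length) can be illustrated with a short Python sketch. The slide does not name a normalization; RPKM and TPM are used here as assumed, commonly seen choices, and the gene names, counts and lengths are invented for the example. The other corrections (sequencing and platform biases, paralogs, polymorphism) are not addressed.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rate.values())
    return {gene: rate[gene] * 1e6 / scale for gene in counts}


# Invented example: geneB gets 3x the reads only because it is 3x longer.
counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}

print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

After length correction the two genes come out with equal expression, which the raw counts alone would hide.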

ASSEMBLERS
1. Hash read space
2. Build De Bruijn graph
3. Walk graph, starting at arbitrary nodes
4. Pop bubbles
5. Perform scaffolding
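
Steps 1 to 3 can be sketched as a toy example in Python. This is a minimal illustration, not how any of the assemblers listed in this deck is implemented: the function names are hypothetical, reads are assumed error-free and single-stranded, and bubble popping and scaffolding (steps 4 and 5) are left out.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Steps 1-2: break reads into k-mers; each distinct k-mer becomes an
    edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Step 3: walk forward from start while the path is unambiguous
    (exactly one way out of the current node, one way into the next)."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen or indegree[nxt] != 1:
            break  # stop at cycles and at merge points
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # overlapping toy reads
graph = build_de_bruijn(reads, k=4)
print(extend_contig(graph, "ATG"))  # ATGGCGTGCA
```

The walk stops wherever the graph branches, which is where repeats and sequencing errors show up in real data and where bubble popping and scaffolding take over.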

ASSEMBLERS
Assembler: Special details
SSAKE/VCAKE: Overlap and extend. Limited assembly size.
ALLPATHS-LG: Winner of the recent Assemblathon. Up-front sequencing recipe required.
Velvet: De Bruijn. Robust repeat resolution. Support for mixed assemblies. Large RAM requirement.
Newbler: 454 only. Unpublished mechanics.
Trinity: Optimized for transcriptome assembly. Large RAM requirement*
ABySS: MPI-parallel. Complex implementation. Can still require large RAM node at one step.
ALLPATHS: Assumes specific sequencing and library construction strategy. Ranked well in Assemblathon.
SOAPdenovo: Assumes specific sequencing and library construction strategy. Ranked well in Assemblathon.
clcbio: Doesn't do scaffolding. Commercial. But fast and considered to be good.
Celera WGS: Older bioinformatics formats. Supports mixed assemblies. Not very fast :-/

STRATEGIC THINKING
Whole genome (WGS) or gene space (RR)?
Whole genome
  Estimate coverage: (Contig lengths * Desired Coverage) / Read length
  What assembler(s) used?
    ALLPATHS, SOAPdenovo: construct PE libs of very specific size
    Others: less constrained library construction
Gene space: RNAseq or reduced representation
  Estimate coverage: consult literature!
1. Do you need to do full de novo assembly?
2. Simulation is good practice
3. How will you evaluate your assembly qualities?
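
The coverage formula above works out to a number of reads. A minimal Python sketch follows, assuming the length term is the total target length in base pairs; the function name and the example figures (a 5 Mbp target, 30x coverage, 100 bp reads) are illustrative only.

```python
import math


def reads_needed(target_length_bp, desired_coverage, read_length_bp):
    """(target length * desired coverage) / read length, rounded up."""
    return math.ceil(target_length_bp * desired_coverage / read_length_bp)


# Illustrative numbers only: 5 Mbp target, 30x coverage, 100 bp reads.
print(reads_needed(5_000_000, 30, 100))  # 1500000
```

For paired-end libraries, each pair contributes two reads, so the number of fragments to sequence is half this figure.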

META-ASSEMBLY
Assembler: URL
Velvet
ABySS
PCAP/CAP3: http://seq.cs.iastate.edu/
Minimus2: http://sourceforge.net/apps/mediawiki/amos/index.php?title=minimus2
Contig Set A + Contig Set B -> meta -> Even Better Contigs
Combine results from multiple methods to leverage the best features of various algorithms.

Get up to date on the Assemblathon: http://assemblathon.org/pages/results
Learn all about...
N50 vs NG50: NG50 = Scaffold/contig length at which you have covered 50% of total genome length
Fragment analysis: Count how many randomly chosen fragments from species A genome can be found in assembly
Repeat analysis: Choose fragments that either overlap or don't overlap a known repeat
Gene finding: How many genes can be identified in the new assembly?
Bacterial contamination: How much non-target contamination is in the genome assembly?
MAUVE analysis: Miscalled, Uncalled, Missing, Extra bases; Misassemblies; Double cut & join distance
BWA analysis: What fraction of contigs align to a reference? Lengths of coverage islands. Validity, multiplicity, parsimony.
Which assembler was THE BEST?
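
The N50 vs NG50 distinction is easy to see in code. The sketch below follows the definition given above for NG50 and uses the usual assembly-relative definition for N50; the contig lengths and the genome size of 400 are made-up numbers.

```python
def nx50(lengths, target_total):
    """Largest length L such that contigs of length >= L together
    cover at least half of target_total. Returns 0 if they never do."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= target_total:
            return length
    return 0


def n50(lengths):
    """N50: measured against the total length of the assembly itself."""
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: measured against the (known or estimated) genome size."""
    return nx50(lengths, genome_size)


contigs = [80, 70, 50, 40, 30, 20]  # made-up contig lengths, total 290
print(n50(contigs))        # 70
print(ng50(contigs, 400))  # 50
```

Because N50 depends on the assembly's own total, an assembler that simply drops short contigs improves its N50; NG50 is insensitive to that because the denominator is fixed by the genome.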

VIEWERS: THE BIG 5
IGV
SeqMonk
Gbrowse
UCSC
Ensembl

Browser: Pros / Cons

Broad IGV
  Pros: Easy to configure and use. Can add new genomes. Integrates with network-accessible data.
  Cons: Limited by client side resources and network speed. No capacity to perform analyses.

SeqMonk
  Pros: Easy to configure and use. Can add new genomes. Includes basic analytical workflows.
  Cons: Limited by client side resources and network speed.

Ensembl
  Pros: Web client is easy to configure and use. Powerful query API. Integrates with network-accessible data. Highly responsive.
  Cons: Server setup and administration is complex. Large, sophisticated database schema.

Gbrowse
  Pros: Web client is easy to configure and use. Powerful query API. Integrates with network-accessible data. Can be highly responsive.
  Cons: Server setup and administration is complex due to Perl module dependencies.

UCSC
  Pros: Highly responsive. Nice configuration options. Powerful web-based query API. Massive # of tracks.
  Cons: You will never figure out how to install or extend it on your own.

DO I NEED HPC?
"I don't have time to learn the command line or any of this UNIX stuff. I just need to get from point A to point B!"
Galaxy (http://usegalaxy.org/): Complexity of analysis and data management is hidden, but can still rear its head.
clcbio Workbench: $10,000 per seat.
Consulting arrangements: Communications-intensive. Longer turnaround times. Dependency relationship.

GALAXY
[Screenshot of the Galaxy interface, annotated: Shared data; Data products; Tool selection and data entry; Parameterize and view results of analyses; Contextual help]

CLCBIO WORKBENCH

DO I NEED HPC?
"I can use Linux and all the associated tools but don't need a lot of horsepower"
Deploy a personal system
  You have 100% control of the system -> You're 100% responsible for the system
  Your upgrade path is limited to your immediate financial resources
Use The Cloud
  CPU time is cheap, but data transfer and storage fees add up rapidly
  Limited performance and resources
Use a traditional HPC cluster
  You learn in a shared environment
  Master getting data in and out of the shared system for other applications
  Someone can help you with maintenance, support, data security
  System is constantly being upgraded, sometimes transparently to the user
You don't have to consume 10000 cores and a PB of disk to qualify for TACC systems. Use as little or as much of the systems as you need.

DO I NEED HPC?
"I need a lot of resources"

Ranger
  Best features: Loads of cores. Optimized for MPI.
  Use cases: Very high core count MPI jobs.

Longhorn
  Best features: >2000 GPGPU; Fast disk; Fat RAM (144GB) nodes.
  Use cases: Interactive sessions; GPU computing; SciViz.

Lonestar
  Best features: Very fast cores; 1 TB RAM nodes; GPGPU nodes; Faster bandwidth than Ranger.
  Use cases: Pretty much any throughput biocomputing. Genome assembly.

System choice is important!

PRACTICALS
Log into Lonestar
Survey Lonestar's filesystems
Accounts and allocations
Quick introduction to modules
Tutorial 1: A simple serial job (Picard MergeSamFiles)
Tutorial 2: An MPI-based parallel job (mpiblast)
Tutorial 3: A parametric serial job (multi-cpu BWA)