Haploid Assembly of Diploid Genomes

Challenges, Trials, Tribulations
13 October 2011
İnanç Birol

Assembly By Short Sequencing
IEEE InfoVis 2009


ABySS in Literature
~40 citations on tool comparisons
~20 citations on using ABySS for a biology study
Crowded field: 17 teams in Assemblathon 1
Overlap-Layout-Consensus: ARACHNE, CAP3, Celera assembler, MIRA, Newbler, Phred/Phrap, SGA
De Bruijn Graph: Euler, Velvet, ABySS, SOAPdenovo, ALLPATHS

Assembly Problem
read1: TCGATCGATTTTCGGCCTAA
read2: ATTTTCGGCCTAATATTAGG
GCATCGATCGATTTTCGGCCTAATATTAGGCCGATAATCGACGATC
A partial and unambiguous read-to-read alignment extends the length of sequence information
First stage of an assembly algorithm is to find such alignments
Assembly algorithms differ in the way they find and use these alignments
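The read-to-read alignment on this slide can be illustrated with a minimal sketch. It assumes exact suffix-prefix matching with no sequencing errors; the function names `longest_overlap` and `merge_reads` are hypothetical and are not part of any assembler discussed here.

```python
def longest_overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def merge_reads(left: str, right: str, min_len: int = 1) -> str:
    """Merge two reads on their longest exact suffix-prefix overlap."""
    n = longest_overlap(left, right, min_len)
    if n == 0:
        raise ValueError("no overlap of the required length")
    return left + right[n:]


# Reads and reference sequence copied from the slide.
read1 = "TCGATCGATTTTCGGCCTAA"
read2 = "ATTTTCGGCCTAATATTAGG"
genome = "GCATCGATCGATTTTCGGCCTAATATTAGGCCGATAATCGACGATC"

overlap = longest_overlap(read1, read2)
merged = merge_reads(read1, read2)
print(overlap, merged, merged in genome)
```

The two 20 bp reads share a 13 bp overlap, so the merged sequence is 27 bp long and still matches the underlying sequence.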

Algorithm
SE Assembly: k-mer extension on a de Bruijn graph
PE Assembly: search for unambiguous contig merging along paths
  [Figure labels: d=6±5, d=5±4]
Scaffolding: search for unambiguous linkage across distant contigs
  [Figure labels: d=12±5, d=26±9]
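The SE stage (k-mer extension) might be sketched as follows. This is a simplified illustration, not the ABySS implementation: it extends only to the right, ignores reverse complements and sequencing errors, and uses hypothetical helper names (`kmers_of`, `extend_right`).

```python
def kmers_of(reads, k):
    """Collect every k-mer that occurs in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend_right(seed, kmers, k):
    """Extend `seed` one base at a time while exactly one next k-mer exists."""
    contig = seed
    seen = {seed}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [suffix + b for b in "ACGT" if suffix + b in kmers]
        if len(candidates) != 1 or candidates[0] in seen:
            break  # dead end, branch, or cycle: stop extending
        seen.add(candidates[0])
        contig += candidates[0][-1]
    return contig


# Illustrative input: two short sequences that diverge after "GACATT".
reads = ["GACATTGC", "GACATTAT"]
k = 5
contig = extend_right("GACAT", kmers_of(reads, k), k)
print(contig)
```

Extension stops at the first branch, which is why ambiguous regions end up as separate contigs that the PE and scaffolding stages then try to connect.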

Software

De Bruijn Graph
Description of read-to-read overlaps
2x4 possible extensions of every k-mer
Provides an O(n) algorithm for SE assembly
Example with k = 5:
  seq1: GACATTGC
  seq2: GACATTAT
  k-mers: GACAT, ACATT, CATTG, CATTA, ATTGC, ATTAT
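A minimal sketch of building the graph for the k = 5 example on this slide. It assumes a plain dictionary of k-mer to successor k-mers and only forward-strand extensions (the slide's 2x4 counts both directions); the function name `de_bruijn` is hypothetical.

```python
def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it with a (k-1)-base overlap."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = {km: set() for km in kmers}
    for km in kmers:
        # Each k-mer has at most four possible right extensions.
        for base in "ACGT":
            nxt = km[1:] + base
            if nxt in kmers:
                graph[km].add(nxt)
    return graph


graph = de_bruijn(["GACATTGC", "GACATTAT"], k=5)
for km in sorted(graph):
    print(km, "->", sorted(graph[km]))
```

Because each k-mer is checked against a constant number of candidate neighbours using hash lookups, construction is linear in the number of k-mers.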

Adjacency Graph
Description of contig overlaps
Built during SE assembly: overlap = k-1 bp
Generalized during PE assembly: arbitrary overlap

Linkage Graph
Built through read pairs aligned to different contigs
PE reads from a tight fragment length distribution: reliable distance estimates
MP reads from a broader insert length distribution: noisy data
Used in PE assembly (PE) and scaffolding (PE and MP) stages

Anchor
Scrubbing homozygous variations
  Indel
  SNPs

Anchor
Local directional assembly
  scaffold gap filling (bridging)
  extension (planking)

Case Study: Mountain Pine Beetle Genome Assembly

Mountain Pine Beetle Genome
Assembly statistics
                      contigs      scaffolds
n                     1,128,463    1,103,221
n:500bp               33,591       11,657
n:n50                 4,324        82
N50 (bp)              11,220       541,443
Max (bp)              276,135      3,583,207
Reconstruction (Mb)   201.9        200.4
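For readers unfamiliar with the N50 statistic in this table, a short sketch of how it is commonly computed. It assumes that n:500bp counts sequences of at least 500 bp and that n:n50 is the number of longest sequences needed to reach half of the total length; the slides do not spell out these definitions, and the function name `n50_stats` is hypothetical.

```python
def n50_stats(lengths, min_len=0):
    """Return (N50, count of longest sequences needed to reach half the total length).

    Only sequences of at least `min_len` bases are considered.
    """
    kept = sorted((x for x in lengths if x >= min_len), reverse=True)
    half = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # no sequences passed the length filter


# Toy contig lengths, not data from the beetle assembly.
print(n50_stats([2, 3, 4, 5, 6]))
print(n50_stats([100, 400, 500, 600, 900], min_len=500))
```

A larger N50 reached with fewer sequences, as in the scaffolds column, indicates a more contiguous assembly.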

Assembly As a Hairball
ABySS v1.2.7
PE/MP information disambiguates short contig extensions
Node connectivity* (rows: in, columns: out)
in\out   1        2       3       4     5     6+
1        15822    7354    1882    530   109   1
2        7354     9814    1817    456   72    3
3        1882     1817    1074    238   31    1
4        530      456     238     126   13    1
5        109      72      31      13    10    0
6+       1        3       1       1     0     0
* For contigs 2 kb

Scaffolding

Quality Assessment
Alignment of 81,047,980 reads
             Before Anchor          After Anchor           Change
Mapped       65,624,456 (80.97%)    66,949,341 (82.60%)    +1,324,885
Paired       43,207,118 (53.31%)    44,732,320 (55.19%)    +1,525,202
Single-end   9,536,178 (11.77%)     8,846,977 (10.92%)     -689,201
Gene alignments
             2,180 ESTs             248 Conserved Genes
             Complete   Partial     Complete   Partial
Contigs      968        1,169       212        18
Scaffolds    1,481      619         228        5

Date            ABySS Version   Data                         n:500     N50       Max         Sum
August 2009     1.0.11          3x GAiix                     81,431    1,526     20,755      107.3e6
November 2009   1.0.15          +2x GAiix                    104,958   2,333     55,845      195.8e6
February 2010   1.1.1           +4x GAiix                    157,081   2,790     136,637     346.3e6
July 2010       1.2.0           +2x GAiix                    146,313   3,354     129,008     376.2e6
November 2010   1.2.4           +1x GAiix, +1x GAiix (MP)    100,690   4,474     294,323     268.8e6
May 2011        1.2.7           --                           18,660    108,158   1,908,773   201.4e6
July 2011       1.2.7           +1x HiSeq, +1x HiSeq (MP)    11,657    541,443   3,583,207   200.4e6
August 2011     1.2.7           --                           11,523    561,847   3,746,698   206.5e6

Transcriptome Assembly

Transcriptome Sequencing
RNA-seq protocol
Brings information on how a genome acts
  Expression levels
  Allelic expression
  Present isoforms
  Gene fusions
  Other transcriptional events
  Post-transcriptional RNA editing
Rodrigo Goya

Transcriptome Assembly
Transcriptome assembly is different from genome assembly
  varying coverage levels
  varying expression levels
  split assembly paths
  isoforms/splice variants
  small contig sizes
  small product sizes
Transcript models

What Overlap to Choose?

Selection of k

What Overlap to Choose?
Selection of parameter k depends on read coverage depth
Expression levels vary over 5 orders of magnitude
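One way to see why k depends on coverage depth is the commonly used relation between base coverage C, read length L and k-mer coverage, C_k = C (L - k + 1) / L. This relation is not stated on the slides; the read length of 100 bp and the depths below are purely illustrative assumptions, and `kmer_coverage` is a hypothetical helper.

```python
def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """Expected k-mer coverage for error-free reads: C_k = C * (L - k + 1) / L."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_len - k + 1) / read_len


# Illustrative depths spanning several orders of magnitude, as for transcripts.
for depth in (1000.0, 10.0, 1.0):
    for k in (26, 51, 76):
        print(f"depth={depth:>7} k={k:>2} k-mer coverage={kmer_coverage(depth, 100, k):.2f}")
```

Under these assumptions a large k leaves weakly expressed transcripts with too little k-mer coverage to assemble, while a small k is less specific for highly expressed ones, which motivates assembling at several values of k.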

Assembly Merging
[Figure labels: buried, parent, untouched]

Multi-k Assembly
We capture a wide range of expression levels
Gray: all transcripts with a read alignment
Blue: at least 80% of a transcript in a single contig
Red: at least 80% of a transcript is reconstructed

Trans-ABySS
A versatile tool for
  Transcript reconstruction
  Gene identification
  InDel and SNV discovery
  Chimeric transcript discovery
    Gene fusions
    Trans-splicing
  Expression analysis

Transcriptome Assembly
Trans-ABySS: de novo assembly based on ABySS
Cufflinks 0.8.3, Scripture: reference-based assembly based on TopHat alignments
[Trapnell et al., 2010; Guttman et al., 2010; Trapnell et al., 2009]

Events
+ chimeric transcripts

Performance
Compared to mapping-based analysis tools, Trans-ABySS constructs as many transcripts, with better sensitivity and specificity
[Trapnell et al., 2010; Guttman et al., 2010; Trapnell et al., 2009]

Case Study: Acute Myeloid Leukemia Transcriptome Assembly

Fusions
Assembled transcriptome contigs span multiple genes
Break point (usually) corresponds to exon boundaries
Break point is supported by
  Spanning reads
  Read pairs linking regions
Gene fusions are often drivers in AML and define subtypes (e.g. PML/RARα and M3 subtype)
Lucas Swanson, Readman Chiu and Gordon Robertson

AML Gene Fusions
71 events in 65/173 (38%) patients
30 different gene fusions identified
  Known AML fusion events (12)
  Known polymorphism (1)
  Novel fusion event (17)
94% validation by RT-PCR sequencing
[Figure: number of patients per candidate fusion event; annotations: 9%, 5%, 4%, MLL fusions, low frequency (<1%)]
Karen Mungall

Validation of a Novel Fusion
Chr 17p13.1: DNA directed RNA polymerase II polypeptide A (POLR2A), 5' UTR, Exon 1, 2
Chr 19p13.2: Fibrillin 3 (FBN3), Exon 47, 48
POLR2A-FBN3: 5' UTR, Exon 1, Exon 48, Exon 63; EGF-like, calcium binding domains
Gel lanes: M: 1kb plus DNA ladder; 1: A00160 (2938); band at 505bp
Andy Mungall

Internal Tandem Duplications
Contig alignments result in
  Query gaps
  Contiguous target blocks
Read support on break point(s)
Aberrant read pair distances
Known AML ITDs:
  29/173 (17%) harbour partial FLT3 exon 14 duplication
  6/173 (3.5%) harbour partial WT1 exon 7 duplication
Nakao et al., Leukemia 1996; Christiansen et al., Leukemia 2001

Known ITD in FLT3
A 33 bp duplication in exon 14
CTCCCATttgagatcatattcatattctctgaaatcaacgTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAA
Karen Mungall
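The duplication in the sequence above can be checked with a naive scan for exact adjacent repeats. This is only an illustration on the slide's sequence, not the detection method described in the talk (which relies on contig alignments and read support); `tandem_duplications` is a hypothetical helper.

```python
def tandem_duplications(seq, min_unit=10):
    """Return (start, unit_length) for every exact adjacent duplication in `seq`.

    Case is ignored, so the lowercase copy on the slide matches the uppercase one.
    """
    s = seq.upper()
    hits = []
    for unit in range(min_unit, len(s) // 2 + 1):
        for start in range(len(s) - 2 * unit + 1):
            if s[start:start + unit] == s[start + unit:start + 2 * unit]:
                hits.append((start, unit))
    return hits


# Sequence copied from the slide; lowercase marks the duplicated copy.
contig = ("CTCCCATttgagatcatattcatattctctgaaatcaacg"
          "TTGAGATCATATTCATATTCTCTGAAATCAACGTAGAA")
hits = tandem_duplications(contig, min_unit=30)
print((7, 33) in hits)
```

The scan is quadratic in sequence length, which is acceptable for a single contig but not for genome-scale input.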

Partial Tandem Duplications
Usually coexist with the wild-type
PTD event manifested in a particular contig type
  A short contig with 50/50 split alignment
Break point is supported by
  Spanning reads
  Read pairs in opposite orientation
Known AML PTD: 10/173 (5.8%) harbour duplication of MLL exons 2-10
Dorrance et al., Blood 2008
Identified 88 genes with PTDs

Novel PTD in Arid1a
Exons 2-4 tandemly repeated in 5 AML libraries
[Figure labels: WT, CT]
Recurrent across tissues and species
Source         Observations
AML            5/173 libraries
LBC            5/54 libraries
Normal mouse   3/7 libraries
NCBI EST       colon_ins, placenta_normal

Summary

ABySS Team: Shaun Jackman, Tony Raymond, Rod Docking
Beetle Project: Joerg Bohlmann, Chris Keeling, Nancy Liao, Greg Taylor, Simon Chan, Diana Palmquist
Trans-ABySS Team: Readman Chiu, Karen Mungall, Gordon Robertson, Ka Ming Nip, Jenny Qian, Rong She, Lucas Swanson
AML Project: Richard Moore, Yongjun Zhao, Andy Mungall, Aly Karsan
GSC: Sequencing Team, Library Core, Systems Team, Steven Jones, Marco Marra

Final Hairball
ABySS v1.2.7
Read pairs and inferred distances allow for scaffolding
                      contigs      scaffolds
n                     1,128,463    1,103,221
n:500bp               33,591       11,657
n:n50                 4,324        82
N50 (bp)              11,220       541,443
Max (bp)              276,135      3,583,207
Reconstruction (Mb)   201.9        200.4

Biotin Read-Through
[Figure label: circularized insert]


Triage of MP Reads
Challenge: A B or B A, which one?
Information:
  Distances from contig ends
  Base mismatches on read ends
  Inferred contig orientations

Triage of MP Reads
[Figure: Read 1 versus Read 2, with regions labelled MP-like and PE-like]
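As a simplified illustration of the MP-like versus PE-like distinction, the sketch below labels a read pair aligned to a single contig by its relative orientation. It assumes that mate-pair reads appear outward-facing and paired-end-like contaminants inward-facing, which is typical of circularization-based mate-pair libraries but is not stated on the slides. The triage described in the talk also uses distances from contig ends and base mismatches on read ends, which this sketch ignores. `classify_pair` and its arguments are hypothetical.

```python
def classify_pair(pos1, strand1, pos2, strand2):
    """Label a read pair on one contig by relative orientation.

    pos1/pos2 are leftmost alignment coordinates; strand is "+" or "-".
    Inward-facing pairs are called "PE-like", outward-facing "MP-like",
    and same-strand pairs "other".
    """
    if strand1 == strand2:
        return "other"
    fwd_pos = pos1 if strand1 == "+" else pos2
    rev_pos = pos2 if strand1 == "+" else pos1
    # Forward read upstream of the reverse read means the reads point at each other.
    return "PE-like" if fwd_pos <= rev_pos else "MP-like"


# Illustrative coordinates only.
print(classify_pair(100, "+", 350, "-"))
print(classify_pair(100, "-", 3100, "+"))
```

Orientation alone cannot resolve pairs that span two contigs of unknown relative order, which is the "A B or B A" challenge noted on the earlier slide.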