Genome Assembly Workshop Titles and Abstracts

TUESDAY, MARCH 15, 2011

08:15 AM
Richard Durbin, Wellcome Trust Sanger Institute
A generic sequence graph exchange format for assembly and population variation

Although the inputs and standard outputs of assemblers are sets of sequences (reads and contigs, respectively), all modern assemblers internally use a sequence graph representation that allows sequence segments to connect in multiple different ways. Furthermore, assembly is a multi-step process, but because sequence graph representations are private to assemblers, each software package typically must implement all steps. In addition, essentially the same sequence graph representation is natural for representing population variation in a species, with any individual genome composed of a walk through segments representing all the sequence observed in the population. I will discuss a generic exchange format and interface for sequence graphs, with initial draft implementation SQG (for SeQuence Graph), to support development of modular software. Sequence is attached to nodes, and arbitrary additional information can be attached to nodes or edges and carried through processing steps. There is a standard notation for walks that correspond to finite sequences that are consistent with the graph. As well as supporting modular assembly components, an aspiration is that availability of a toolkit including efficient search will enable users of assemblies to use the resulting graph in place of the set of contigs, since the graph is richer, and also support a population reference sequence representation that allows more accurate and complete alignment of reads than a single linear reference. If time is available I will show how the FM index used by BowTie, BWA, and SOAP can be extended to sequence graphs, supporting an efficient search process on SQG graphs.
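
The idea of sequence attached to nodes, with walks spelling out finite sequences, can be illustrated with a minimal sketch. This is not the SQG format or its interface, which the abstract does not specify; the class and method names (SequenceGraph, add_node, add_edge, spell) are hypothetical, and the sketch ignores strand orientation and overlaps between segments.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph: sequence on nodes, directed edges between nodes."""
    sequences: dict = field(default_factory=dict)    # node id -> sequence
    annotations: dict = field(default_factory=dict)  # node id -> extra info
    edges: set = field(default_factory=set)          # (source id, target id)

    def add_node(self, node_id, sequence, **info):
        # Arbitrary additional information travels with the node.
        self.sequences[node_id] = sequence
        self.annotations[node_id] = info

    def add_edge(self, source, target):
        self.edges.add((source, target))

    def spell(self, walk):
        """Return the sequence spelled by a walk, or raise if it leaves the graph."""
        for source, target in zip(walk, walk[1:]):
            if (source, target) not in self.edges:
                raise ValueError(f"no edge {source} -> {target}")
        return "".join(self.sequences[node_id] for node_id in walk)


# A two-allele site: two individual genomes are two walks through the same graph.
graph = SequenceGraph()
graph.add_node(1, "ACG", source="contig_left")
graph.add_node(2, "T", allele="ref")
graph.add_node(3, "G", allele="alt")
graph.add_node(4, "TTA", source="contig_right")
for edge in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(*edge)

reference_genome = graph.spell([1, 2, 4])
variant_genome = graph.spell([1, 3, 4])
```

Here the two walks share nodes 1 and 4 and differ only at the variant site, which is the sense in which one graph can represent both an assembly ambiguity and population variation.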

08:45 AM
Ian Korf, University of California, Davis
Dent Earl, University of California, Santa Cruz
Results of the Assemblathon

10:00 AM
David B. Jaffe, Broad Institute
High-quality draft assemblies of a dozen vertebrate genomes from massively parallel sequence data

Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100 base) sequence reads at very low cost. While such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but 1000 times more expensive) capillary-based sequencing approach. We report the development of a new algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from a dozen vertebrate genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high ( 99.95%) and the scaffold sizes (e.g., N50 size = 11.5 Mb for human and 17.4 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of new sequencing technology and new computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.

10:30 AM
Sante Gnerre, Broad Institute
ALLPATHS-LG algorithms for large genome assembly

ALLPATHS-LG is a new algorithm to assemble low-cost ~100 base reads, producing high quality assemblies of genomes from megabase-sized bacteria to gigabase-sized vertebrates. This process includes a series of innovations and optimizations.

1) Error correction. For each 24-mer, the algorithm examines the stack of reads containing it and then proposes edits to the reads, in cases where individual read bases differ from the overwhelming consensus of the stack. Read bases having conflicting status (from membership in multiple stacks) are not edited, thus avoiding false corrections.
2) Read doubling. Read pairs whose ends overlap or have only a small gap are merged together using a third read from some other pair, thus yielding double reads of size ~180 bases and thereby enabling use of a large minimum overlap K. This increases resilience to repeats.
3) Local assembly of low coverage regions. Despite sequencing at ~100x coverage, parts of the genome recalcitrant to sequencing can be poorly covered. We handle such data by allowing much smaller K in localized regions.
4) Optimized use of jumping reads. Assembly quality depends on linking from jumping libraries, yet their read pairs have artifacts: they contain circularization junctions and are polluted with non-jump pairs in reverse orientation. The algorithm handles these by locating junction points and by treating each pair as belonging to two libraries.
5) Computational performance. Finally, ALLPATHS-LG has been optimized and parallelized to reduce run time and memory usage, and for mammalian-size genomes it can be run on a commercial server (Dell R815, $39,000).
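
The error-correction step described above (stack reads on a shared k-mer, edit toward an overwhelming consensus, leave conflicting bases alone) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ALLPATHS-LG implementation: the function name correct_reads, the thresholds min_depth and min_fraction, and the small default k are all hypothetical choices made for readability (the abstract uses 24-mers), and quality scores are ignored.

```python
from collections import Counter, defaultdict


def correct_reads(reads, k=4, min_depth=3, min_fraction=0.8):
    """Toy stack-consensus correction; returns corrected copies of the reads."""
    # Index every k-mer occurrence: k-mer -> list of (read index, offset in read).
    stacks = defaultdict(list)
    for r, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            stacks[read[offset:offset + k]].append((r, offset))

    # For each read base, collect the consensus base proposed by every stack.
    proposals = defaultdict(set)
    for members in stacks.values():
        if len(members) < min_depth:
            continue
        # Line the reads up on the shared k-mer and group bases by column.
        columns = defaultdict(list)
        for r, offset in members:
            for pos, base in enumerate(reads[r]):
                columns[pos - offset].append((r, pos, base))
        for column in columns.values():
            if len(column) < min_depth:
                continue
            base, count = Counter(b for _, _, b in column).most_common(1)[0]
            if count / len(column) >= min_fraction:
                for r, pos, _ in column:
                    proposals[(r, pos)].add(base)

    corrected = [list(read) for read in reads]
    for (r, pos), bases in proposals.items():
        # Edit only when every stack agrees; conflicting proposals are skipped.
        if len(bases) == 1:
            corrected[r][pos] = next(iter(bases))
    return ["".join(read) for read in corrected]


# Five clean reads and one read with a single substitution (G -> C).
noisy = ["ACGTACGGTCA"] * 5 + ["ACGTACGCTCA"]
fixed = correct_reads(noisy)

# Two equally supported variants: no overwhelming consensus, so nothing is edited.
balanced = ["ACGTACGGTCA"] * 3 + ["ACGTACGCTCA"] * 3
untouched = correct_reads(balanced)
```

In the second example the stacks built on the variant-specific k-mers disagree with each other only by staying silent at the variant column, so the heterozygous-looking site survives, which mirrors the abstract's point about avoiding false corrections.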

11:00 AM
Daniel Zerbino, University of California, Santa Cruz
Columbus: Templated assembly of partially mapped reads

Since the advent of high-throughput sequencing, analysis tools have had to be adapted to deal with the exponential increase in the quantity of data and the greater ambiguity posed by the shorter reads. This led to specialized mapping and de novo assembly tools that are now routinely used in large-scale projects. The mapping tools are extremely efficient computationally and can process large amounts of data, whereas the de novo assemblers are more flexible with respect to new sequence structures. We therefore extended the Velvet de novo assembler with a module named Columbus, which accepts the output of a generic read mapper (using SAM/BAM files) and helps Velvet anchor its assembly onto a known template, while still keeping all of its de novo assembly capacities. Columbus was tested on 17 mouse strain resequencing projects and was shown to resolve many more complex sequences than Velvet alone could. We expect that Columbus will allow users to easily design and implement novel analysis pipelines, which combine the computational efficiency and biological a priori knowledge of read mapping with the flexibility of de novo assembly.

11:30 AM
Yingrui Li, BGI-Shenzhen
NGS de novo assembly: Progresses and challenges

Assembling new genomes from scratch has often been a hotspot for bioinformatics development. The issue becomes especially attractive when Next-Generation Sequencing (NGS) is available to provide large-scale but short-read data at a significantly lower per-base cost compared with Sanger sequencing technology. Several types of algorithms have been applied to prove the concept that NGS could do de novo assembly for large genomes, yet the quality and continuity are always of great concern for annotations and follow-up studies.
Here we present our progress and discuss explorations of challenging issues in NGS de novo assembly methodologies, especially on genomes with different levels of complexity. We believe that NGS assembly still has large potential to achieve more satisfactory results that can serve as a basis for further studies of any species.

01:00 PM
Aaron Klammer, Pacific Biosciences, Inc., Menlo Park CA
De novo assembly of Vibrio cholerae using Pacific Biosciences SMRT DNA sequencing technology

We present the de novo assembly of the genome of the bacterium Vibrio cholerae using reads from Pacific Biosciences single-molecule real-time (SMRT) DNA sequencing technology combined with Illumina short reads. Our approach uses the open source scaffolder Bambus along with other elements of the AMOS assembly software package and employs several novel algorithms tailored to Pacific Biosciences reads. Using this suite of algorithms we are able to produce an assembly of the V. cholerae genome from 30X sequence coverage of PacBio long reads with significantly longer contig N50s than a comparable assembly using Illumina reads alone. In addition, we scaffolded the V. cholerae assembled contigs using 20X sequence generated by the PacBio strobe sequencing technology, a sequencing protocol that allows the linkage of multiple reads across large distances, in a fashion similar to mate-pair sequencing. The addition of strobe reads further increases the scaffold N50 for the V. cholerae genome by spanning large repeats on the order of several kilobases. The use of PacBio long and strobe reads shows high promise for simplifying the completion of draft and finished bacterial genomes.

01:30 PM
Jared Simpson, Sanger Centre
Efficient assembly algorithms using the FM-index

The assembly of large genomes from short reads remains a computational challenge. Currently available assemblers require either very large amounts of memory, typically in the hundreds of gigabytes, or a large compute cluster to assemble a human genome. To address this challenge, we have developed a set of efficient algorithms based on the FM-index data structure. As the FM-index is compressed, our method has a very low memory footprint. Using this data structure, we have designed parallel algorithms for error correction, read filtering, and string graph construction. We have packaged these algorithms into an assembler called SGA (for String Graph Assembler), which is open source and available at github.com/jts/sga. In our talk, we will present the algorithms and results for a recent human genome assembly from 40X sequence data which required less than 60GB of memory. We will also discuss our experience with the Assemblathon competition.

WEDNESDAY, MARCH 16, 2011

08:00 AM
Jason Rafe Miller, J. Craig Venter Institute, Rockville MD
HMP assembly analysis at JCVI

The Human Microbiome Project (HMP) seeks to characterize the microbial load carried by healthy people. Bacterial populations were sampled from multiple individuals at several body sites and at up to three time points.
Samples were analyzed by either 16S sequencing or metagenomics sequencing. Select bacterial strains are being cultured and sequenced to generate a reference genome collection. As part of the reference genome effort, our institute has put over 200 bacterial genomes through a high-throughput whole-genome shotgun pipeline based on next-generation sequencing technology. Reference strains are sequenced by Illumina paired end, 454 paired end, 454 unpaired, or some combination of those. Sequence coverage is adjusted using pooled libraries of bar-coded samples. Sequence data is assembled by several assembly programs. Assembled results are reviewed for completeness, accuracy, and signs of contamination, and compared with each other. At most one assembly per genome is submitted to the public databases. We will present analysis that was aimed at optimizing this pipeline. We will rate the utility of various measures of assembly quality and list features that ideal assemblers would self-report so as to facilitate assembly comparison.
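
N50, quoted for contigs and scaffolds in several abstracts in this program, is one widely used contiguity measure of the kind an assembler could self-report. A minimal sketch of its usual definition follows; the function name n50 is illustrative, and the abstract above does not say which measures the talk will rate.

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Nine contigs totalling 54 bases: the three longest (10 + 9 + 8) reach half.
contig_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends only on the length distribution, it says nothing about correctness; a mis-joined assembly can score higher than an accurate one, which is one reason to report it alongside accuracy and completeness checks.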

08:30 AM
Jay Shendure, University of Washington
Experimental approaches to massively parallel contiguity mapping

Massively parallel technologies have reduced the per-base cost of DNA sequencing by several orders of magnitude. However, limited read lengths and a lack of methods to establish contiguity over even modest distances have prevented these technologies from achieving the high-quality, low-cost de novo assembly of mammalian genomes. Even as revolutionary sequencing technologies further mature, it may continue to be the case that the best technologies in terms of cost-per-base yield reads that are of an insufficient length or quality for the effective de novo assembly of large genomes. To meet this need, we are exploring novel experimental strategies to facilitate the massively parallel recovery of contiguity information at different scales.

09:00 AM
Jim Knight, Roche
Newbler and large genome assembly

This talk describes recent updates to the Newbler assembler for large genomes, including support for FASTQ files and hybrid assemblies of 454, Sanger, and/or Illumina sequences, as well as algorithms for handling diploid genome assembly. Updates on the new 454 long reads will also be presented.

09:30 AM
Graham Ruby, University of California, San Francisco
De novo genome assembly from metagenomic mixtures using PRICE

Many organisms cannot be collected or cultured independently from their ecological surroundings. This is particularly true of disease-causing pathogens that directly depend on host biology to persist and replicate. The presence of large quantities of irrelevant sequence in metagenomic shotgun datasets poses a particular challenge to the assembly of pathogen genomes. In order to address this challenge, we have devised and implemented a strategy for genome assembly using paired-end reads and iterative contig extension (PRICE).
We have applied this strategy to the targeted de novo assembly of novel viral genomes from complex metagenomic samples that were sequenced using low-cost, high-throughput, short-read DNA sequencing technology. We have also successfully applied PRICE to conventional (non-meta) genome sequencing and de novo assembly.

10:15 AM
Michael Schatz, Cold Spring Harbor Laboratory
Assembly and validation of large genomes from short reads

During my presentation I'll describe the short-read genome assembly pipeline developed in conjunction with the University of Maryland, the National Biodefense Analysis and Countermeasures Center, and the J. Craig Venter Institute. This pipeline includes the new algorithm Quake for pre-assembly sequence error correction and quality trimming, the Celera Assembler enhanced for Illumina sequences, and other related tools for post-assembly contig and scaffold refinement. I will describe the effectiveness of this pipeline for assembling short reads using four recently sequenced genomes ranging in size from 2 Mbp to 3 Gbp: Staphylococcus aureus, Bombus impatiens (a species of bee), Linepithema humile (the Argentine ant), and human. The results of these assemblies, along with detailed comparisons to the assemblies of these data with other leading assemblers, are posted as part of our Genome Assembly Gold-standard Evaluations (GAGE), available at http://gage.cbcb.umd.edu. It is our hope with GAGE to produce a realistic assessment of the current state of the art in genome assembly software using real data in the rapidly changing field of next-generation sequencing. I will conclude my presentation by describing our genome assembly forensics pipeline for validating assemblies and discovering mis-assemblies. The pipeline includes various statistical tests for recognizing abnormal variations in depth of coverage, read heterogeneity, sequence composition, mate-pair placement, and read-breakpoint analysis. We find these mis-assembly signatures have nearly perfect sensitivity for detecting mis-assemblies, which can be used to guide assembly repair routines or reconcile differences between alternate assemblies.

10:45 AM
Joan Pontius, National Cancer Institute, Frederick, Maryland
A call for standardization of physical markers for use in the analysis of genome assemblies

Physical maps of genomic markers of unique sequence (Sequence Tagged Sites, STSs) allow scaffolds from genome assemblies to be assigned to chromosome positions. The STSs are mapped using radiation hybrid (RH) experiments, analysis of genetic linkage, and cytogenetic analysis of chromosomes using fluorescent in situ hybridization (FISH). Physically mapped markers can also be used to assess the accuracy of the final assembly, for example, by helping to detect chimeric scaffolds which include markers derived from more than one chromosome. Chimeric scaffolds can also be detected when a scaffold sequence aligns to more than one chromosome of the genome assembly of a closely related species.
Although some of these scaffolds may represent real rearrangements that have occurred over the course of evolution, others may uncover assembly artifacts (or physical map inaccuracies) that can be remedied. The importance of physical markers calls for standardization of the data describing them, namely:

1) The primer sequences and the sequence of their PCR products should be documented so that accurate computational mapping to the assembly can be confirmed.
2) Ideally, markers should map to one and only one locus in the genome and also map to unique and orthologous regions in a second genome.
3) The genomics community would benefit by adopting a standard for mapping information, so that efficient computational methods may be developed for their use.

Here we present a QC analysis of the genome sequences of several vertebrate species, with attention to scaffold chimerism, syntenic orthology, and physical map concordance with other genome assemblies.
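
The marker-based chimerism check described in this abstract reduces to a simple grouping step once markers have been placed on both the physical map and the assembly. The sketch below assumes two hypothetical lookup tables (marker to scaffold, marker to chromosome) and a hypothetical function name; it is an illustration of the idea, not the QC pipeline presented in the talk.

```python
from collections import defaultdict


def find_chimeric_scaffolds(marker_to_scaffold, marker_to_chromosome):
    """Return scaffolds whose mapped markers come from more than one chromosome."""
    chromosomes_by_scaffold = defaultdict(set)
    for marker, scaffold in marker_to_scaffold.items():
        chromosome = marker_to_chromosome.get(marker)
        if chromosome is not None:  # skip markers absent from the physical map
            chromosomes_by_scaffold[scaffold].add(chromosome)
    return {
        scaffold: sorted(chromosomes)
        for scaffold, chromosomes in chromosomes_by_scaffold.items()
        if len(chromosomes) > 1
    }


# Made-up marker placements for illustration only.
marker_to_scaffold = {
    "sts1": "scaffold_1", "sts2": "scaffold_1",
    "sts3": "scaffold_2", "sts4": "scaffold_2", "sts5": "scaffold_2",
}
marker_to_chromosome = {
    "sts1": "chrA", "sts2": "chrA",
    "sts3": "chrB", "sts4": "chrC",
}
candidates = find_chimeric_scaffolds(marker_to_scaffold, marker_to_chromosome)
```

A flagged scaffold is only a candidate: as the abstract notes, the discordance may reflect a real rearrangement or an error in the physical map rather than a mis-assembly.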

11:15 AM
David C. Schwartz, University of Wisconsin, Madison
Optical mapping and nanocoding systems for genome assembly and analysis

Modern sequence data acquisition and assembly techniques are rapidly increasing the quality of genome assemblies while decreasing their cost. Consequently, these developments are fueling efforts like the Genome 10K Project, which will require equally innovative ways to effectively complete and validate genome assemblies as references for comparative studies. This problem becomes more acute as increasingly obscure species are sequenced and analyzed in the absence of associated scientific communities, detailed knowledge of life cycles, and/or genetic resources, the venerable elements used for the completion of genomes. Irrespective of any sequencing technology, multiple data types must be used for genome assembly and validation, since all measurements and analysis schemes have errors requiring complementary approaches for accurate, comprehensive mediation. In this regard, the Optical Mapping System, a purely single-molecule platform, has cost-effectively complemented over 80 sequencing efforts through high-resolution physical maps that scaffold nascent sequence assemblies, allowing comprehensive and independent validation across entire genomes. These genomes have included human, mouse, rat, rice, maize, and numerous fungal and bacterial genomes. We are now complementing Optical Mapping with a newer approach, Nanocoding, which promises higher resolution, higher throughput, and lower costs. These advancements are providing the means for the broad dissemination of this new technology for genome assembly and analysis (structural variation) through greatly simplified systems.

11:45 AM
Can Alkan, University of Washington
"Dark side" of genomes: What's missing in current sequence assemblies?

The advances in genome sequencing technology have opened the way to analyze genomes at a previously unimaginable pace.
Building a draft reference genome assembly previously cost billions of dollars and took years. Now this can be done for a fraction of the cost and within a very short time frame. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short-sequence reads. We recently compared de novo assemblies of human genomes built from short sequence reads generated on the Illumina platform and found that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Recent improvements in sequence quality, larger insert sizes (or "jump libraries"), and algorithmic innovations promise to ameliorate this effect to generate better assemblies. In this talk, I will present genome quality comparisons, mainly based on the segmental duplication content, and compare the clone-by-clone sequencing (NCBI Build 36) with capillary-based WGS assemblies (Celera) and short read sequencing (YH assembly with SOAP, and NA12878 with ALLPATHS-LG). I will also present similar analyses on non-human genome assemblies such as the bonobo (454), gorilla (capillary and Illumina), and mouse (Illumina), and describe what we can "expect" to miss in our analyses.

01:15 PM
Deanna M. Church, Genome Reference Consortium and NCBI Assembly Groups
Modernizing and managing genome assembly data

As we celebrate the publications of the first draft human assemblies, it is useful to review what we have learned over the last decade. During this time we've seen a dramatic improvement in the quality of the human reference as the public assembly continues to be improved and challenging regions finished. The availability of this data has increased our understanding of genomic biology and caused us to rethink the models we must use to represent an organism's genome. As part of our curation of the human genome, the Genome Reference Consortium (GRC) has helped propose a more robust assembly model that represents complex allelic variants in a way that facilitates annotation. Additionally, we have seen an explosion of genome assemblies from multiple species. With over 2,400 assemblies available in GenBank, robust management of assembly data is needed. While GenBank is well suited for tracking the history of a single sequence, most genome assemblies represent a collection of sequences, and there is a need to track both the relationship of these sequences and any metadata that is associated with the assembly. To this end we have developed an assembly database to manage assembly submission and retrieval. Finally, we are developing tools to allow for assembly comparison and quality assurance.

01:45 PM
Federica DiPalma, Broad Institute
What quality do we need to achieve for Genome 10K genomes?

The Broad Institute has been involved in the sequencing of >30 vertebrate genomes. Our goal has always been to design genome projects of high scientific merit, produce high quality reference sequence, and to ensure that the community's needs are met.
Genomes have been sequenced for various scientific reasons, including generating reference sequences for biomedical models, studying vertebrate evolution, and improving the annotation of the human genome through comparative sequence analysis. Different types of projects require different levels of accuracy and continuity in the assemblies, which in turn require different amounts and quality of DNA input. We will discuss these needs and how to achieve them.

02:15 PM
Laurie Goodman, BGI-Shenzhen