Bioinformatics for High Throughput Sequencing


Naiara Rodríguez-Ezpeleta, Michael Hackenberg, Ana M. Aransay (Editors)

Editors

Naiara Rodríguez-Ezpeleta, Genome Analysis Platform, CIC biogune, Derio, Bizkaia, Spain
Ana M. Aransay, Genome Analysis Platform, CIC biogune, Derio, Bizkaia, Spain
Michael Hackenberg, Computational Genomics and Bioinformatics Group, Genetics Department & Biomedical Research Center (CIBM), University of Granada, Spain

Springer New York Dordrecht Heidelberg London

Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media

Preface

The purpose of this book is to collect in a single volume the essentials of high throughput sequencing data analysis. These new technologies make it possible to perform, at unprecedentedly low cost and high speed, a panoply of experiments spanning the sequencing of whole genomes or transcriptomes, the profiling of DNA methylation, and the detection of protein-DNA interaction sites, among others. Each experiment generates a massive amount of sequence information, making data analysis the major challenge in high throughput sequencing-based projects. Hundreds of bioinformatics applications have been developed so far, most of them focusing on specific tasks. Indeed, numerous approaches have been proposed for each analysis step, while integrated analysis applications and protocols are generally missing. As a result, even experienced bioinformaticians struggle to choose among the countless possibilities for analyzing their data. This, together with a shortage of qualified personnel, reveals an urgent need to train bioinformaticians in existing approaches and to develop integrated, start-to-end software applications to face present and future challenges in data analysis.

Given this scenario, our motivation was to assemble a book covering the aforementioned aspects. Following three fundamental introductory chapters, the core of the book focuses on the bioinformatics aspects, presenting a comprehensive review of the methods and programs available to analyze the raw data obtained from each experiment type. In addition, the book is meant to provide insight into the challenges and opportunities faced by both biologists and bioinformaticians during this new era of sequencing data analysis.

Given the vast range of high throughput sequencing applications, we set out to edit a book suitable for readers from different research areas, academic backgrounds and degrees of acquaintance with this new technology. At the same time, we expect the book to be equally useful to researchers involved in the different steps of a high throughput sequencing project. Newcomers eager to learn the basics of high throughput sequencing technologies and data analysis will find what they are looking for especially in the first introductory chapters, but also by skipping the details and picking up the rudiments of the core chapters. On the other hand, biologists who are familiar with the fundamentals of the technology and analysis steps, but who have little bioinformatics training, will find in the core chapters an invaluable resource from which to learn about the different existing approaches, file formats, software, parameters, etc. for data analysis. The book will also be useful to those scientists performing downstream analyses on the output of high throughput sequencing data, as a thorough understanding of how their initial data were generated is crucial for an accurate interpretation of further outcomes. Additionally, we expect the book to appeal to computer scientists or biologists with a strong bioinformatics background, who will hopefully find in the problematic issues and challenges raised in each chapter motivation and inspiration for improving existing tools and developing new ones for high throughput data analysis.

Naiara Rodríguez-Ezpeleta
Michael Hackenberg
Ana M. Aransay

Contents

1 Introduction
   Naiara Rodríguez-Ezpeleta and Ana M. Aransay
2 Overview of Sequencing Technology Platforms
   Samuel Myllykangas, Jason Buenrostro, and Hanlee P. Ji
3 Applications of High-Throughput Sequencing
   Rodrigo Goya, Irmtraud M. Meyer, and Marco A. Marra
4 Computational Infrastructure and Basic Data Analysis for High-Throughput Sequencing
   David Sexton
5 Base-Calling for Bioinformaticians
   Mona A. Sheikh and Yaniv Erlich
6 De Novo Short-Read Assembly
   Douglas W. Bryant Jr. and Todd C. Mockler
7 Short-Read Mapping
   Paolo Ribeca
8 DNA-Protein Interaction Analysis (ChIP-Seq)
   Geetu Tuteja
9 Generation and Analysis of Genome-Wide DNA Methylation Maps
   Martin Kerick, Axel Fischer, and Michal-Ruth Schweiger
10 Differential Expression for RNA Sequencing (RNA-Seq) Data: Mapping, Summarization, Statistical Analysis, and Experimental Design
   Matthew D. Young, Davis J. McCarthy, Matthew J. Wakefield, Gordon K. Smyth, Alicia Oshlack, and Mark D. Robinson
11 MicroRNA Expression Profiling and Discovery
   Michael Hackenberg
12 Dissecting Splicing Regulatory Network by Integrative Analysis of CLIP-Seq Data
   Michael Q. Zhang
13 Analysis of Metagenomics Data
   Elizabeth M. Glass and Folker Meyer
14 High-Throughput Sequencing Data Analysis Software: Current State and Future Developments
   Konrad Paszkiewicz and David J. Studholme
Index

Contributors

Ana M. Aransay: Genome Analysis Platform, CIC biogune, Parque Tecnológico de Bizkaia, Derio, Spain

Douglas W. Bryant, Jr.: Department of Botany and Plant Pathology, Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA; Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA

Jason Buenrostro: Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA

Yaniv Erlich: Whitehead Institute for Biomedical Research, Cambridge, MA, USA

Axel Fischer: Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany

Elizabeth M. Glass: Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; Computation Institute, The University of Chicago, Chicago, IL, USA

Rodrigo Goya: Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada; Centre for High-Throughput Biology, University of British Columbia, Vancouver, BC, Canada; Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

Michael Hackenberg: Computational Genomics and Bioinformatics Group, Genetics Department, University of Granada, Granada, Spain

Hanlee P. Ji: Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA

Martin Kerick: Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany

Marco A. Marra: Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Davis J. McCarthy: Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia

Folker Meyer: Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; Computation Institute, The University of Chicago, Chicago, IL, USA; Institute for Genomics and Systems Biology, The University of Chicago, Chicago, IL, USA

Irmtraud M. Meyer: Centre for High-Throughput Biology, University of British Columbia, Vancouver, BC, Canada; Department of Computer Science, University of British Columbia, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Todd C. Mockler: Department of Botany and Plant Pathology, Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA

Samuel Myllykangas: Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA

Alicia Oshlack: Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia; School of Physics, University of Melbourne, Melbourne, Australia; Murdoch Childrens Research Institute, Parkville, Australia

Konrad Paszkiewicz: School of Biosciences, University of Exeter, Exeter, UK

Paolo Ribeca: Centro Nacional de Análisis Genómico, Baldiri Reixac 4, Barcelona, Spain

Mark D. Robinson: Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia; Department of Medical Biology, University of Melbourne, Melbourne, Australia; Epigenetics Laboratory, Cancer Research Program, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia

Naiara Rodríguez-Ezpeleta: Genome Analysis Platform, CIC biogune, Parque Tecnológico de Bizkaia, Derio, Spain

Michal-Ruth Schweiger: Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany

David Sexton: Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA

Mona A. Sheikh: Whitehead Institute for Biomedical Research, Cambridge, MA, USA

Gordon K. Smyth: Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia; Department of Mathematics and Statistics, University of Melbourne, Melbourne, Australia

David J. Studholme: School of Biosciences, University of Exeter, Exeter, UK

Geetu Tuteja: Department of Developmental Biology, Stanford University, Stanford, CA, USA

Matthew J. Wakefield: Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia; Department of Zoology, University of Melbourne, Melbourne, Australia

Matthew D. Young: Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia

Michael Q. Zhang: Department of Molecular and Cell Biology, Center for Systems Biology, The University of Texas at Dallas, Richardson, TX, USA; Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China


Chapter 1
Introduction

Naiara Rodríguez-Ezpeleta and Ana M. Aransay

N. Rodríguez-Ezpeleta (*), A.M. Aransay: Genome Analysis Platform, CIC biogune, Parque Tecnológico de Bizkaia, Building 502, Floor 0, Derio, Spain; e-mail: nrodriguez@cicbiogune.es; amaransay@cicbiogune.es

Abstract

Thirty-five years have elapsed between the development of modern DNA sequencing and today's apogee of high-throughput sequencing. During that time, from the sequencing of the first small phage genome (5,386 bases in length) to the sequencing of 1,000 human genomes (three billion bases each), massive amounts of data from thousands of species have been generated and made available in public repositories. This is mostly due to the development of a new generation of sequencing instruments a few years ago. With the advent of these data, new bioinformatics challenges have arisen, and work is needed to teach biologists to swim in this ocean of sequences so that they reach port safely.

1.1 History of Genome Sequencing Technologies

Sanger Sequencing and the Beginning of Bioinformatics

The history of modern genome sequencing technologies starts in 1977, when Sanger and collaborators introduced the dideoxy method (Sanger et al.), whose underlying concept was to use nucleotide analogs to cause base-specific termination of primed DNA synthesis. When the dideoxy reactions for each of the four nucleotides were electrophoresed in adjacent lanes, it was possible to visually decode the corresponding base at each position of the read. From the beginning, this method made it possible to read sequences about 100 bases long, a figure that was later increased to 400. By the late 1980s, the amount of sequence data obtained by a single person in a day had gone up to 30 kb (Hutchison 2007). Although this seems negligible compared with the amount of sequence data we deal with today, data analysis and processing were already an issue at this scale. Computer programs were needed to assemble the small sequence chunks into a complete sequence, to allow editing of the assembled sequence, to search for restriction sites, or to translate sequences in all reading frames.

It was in these early days of bioinformatics that the first suite of computer programs applied to biology was developed by Roger Staden. With the Staden package (Staden 1977), still in use today (Staden et al.; Bonfield and Whitwham 2010), widely used file formats (Dear and Staden 1992) and ideas, such as the use of base quality scores to estimate accurate consensus sequences (Bonfield and Staden 1995), were already being put forward.

As the amount of sequence data increased, the need for a data repository became evident. In 1982, GenBank was created by the National Institutes of Health (NIH) to provide a timely, centralized, accessible repository for genetic sequences (Bilofsky et al.), and 1 year later more than 2,000 sequences were already stored in this database. Tools for comparing and aligning sequences were rapidly developed. Some spread fast and are still in use today, such as FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al.). Even in those early times, it was already clear that bioinformatics is central to the analysis of sequence data, to the generation of hypotheses and to the resolution of biological questions.

Automated Sequencing

In 1986, Applied Biosystems (ABI) introduced automatic DNA sequencing, in which a different fluorescently end-labelled primer was used in each of the four dideoxy sequencing reactions. When the reactions were combined in a single electrophoresis gel, the sequence could be deduced by measuring the characteristic fluorescence spectrum of each of the four bases.
Computer programs were developed that automatically converted fluorescence data into a sequence, without the need to autoradiograph the sequencing gel and manually decode the bands (Smith et al.). Compared to manual sequencing, automation allowed data analysis to be integrated into the process, so that problems at each step could be detected and corrected as they appeared (Hutchison 2007). Very shortly after the introduction of automatic sequencing, the first sequencing facility, with six automated sequencers, was set up at the NIH by Craig Venter and colleagues; it was expanded to 30 sequencers in 1992 at The Institute for Genomic Research (TIGR). One year later, one of today's most important sequencing centres, the Wellcome Trust Sanger Institute, was established. Among the earliest achievements of automated sequencing was the reporting of 337 new and 48 homolog-bearing human genes via the expressed sequence tag (EST) approach (Adams et al.), which allows fragments of gene transcripts to be sequenced selectively. Using this approach, fragments of more than 87,000 human transcripts were sequenced shortly after, and today over 70 million ESTs from over 2,200 different organisms are available in dbEST (Boguski et al.). In 1996, DNA sequencing