Bioinformatics for High Throughput Sequencing

Naiara Rodríguez-Ezpeleta, Michael Hackenberg, and Ana M. Aransay, Editors

Editors

Naiara Rodríguez-Ezpeleta, Genome Analysis Platform, CIC bioGUNE, Derio, Bizkaia, Spain, nrodriguez@cicbiogune.es

Ana M. Aransay, Genome Analysis Platform, CIC bioGUNE, Derio, Bizkaia, Spain, amaransay@cicbiogune.es

Michael Hackenberg, Computational Genomics and Bioinformatics Group, Genetics Department & Biomedical Research Center (CIBM), University of Granada, Spain, mlhack@gmail.com

ISBN 978-1-4614-0781-2
e-ISBN 978-1-4614-0782-9
DOI 10.1007/978-1-4614-0782-9
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011937571

© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The purpose of this book is to collect in a single volume the essentials of high throughput sequencing data analysis. These new technologies make it possible to perform, at unprecedentedly low cost and high speed, a panoply of experiments spanning the sequencing of whole genomes or transcriptomes, the profiling of DNA methylation, and the detection of protein-DNA interaction sites, among others. Each experiment generates a massive amount of sequence information, making data analysis the major challenge in high throughput sequencing-based projects. Hundreds of bioinformatics applications have been developed so far, most of them focusing on specific tasks. Indeed, numerous approaches have been proposed for each analysis step, while integrated analysis applications and protocols are generally missing. As a result, even experienced bioinformaticians struggle to choose among the countless possibilities for analyzing their data. This, together with a shortage of qualified personnel, reveals an urgent need to train bioinformaticians in existing approaches and to develop integrated, start-to-end software applications to face present and future challenges in data analysis.

Given this scenario, our motivation was to assemble a book covering the aforementioned aspects. Following three fundamental introductory chapters, the core of the book focuses on the bioinformatics aspects, presenting a comprehensive review of the methods and programs available to analyze the raw data obtained from each experiment type. In addition, the book is meant to provide insight into the challenges and opportunities faced by both biologists and bioinformaticians during this new era of sequencing data analysis.

Given the vast range of high throughput sequencing applications, we set out to edit a book suitable for readers from different research areas, academic backgrounds, and degrees of acquaintance with this new technology.
At the same time, we expect the book to be equally useful to researchers involved in the different steps of a high throughput sequencing project. Newcomers eager to learn the basics of high throughput sequencing technologies and data analysis will find what they are looking for especially in the introductory chapters, but also by skipping the details and picking up the rudiments of the core chapters. On the other hand, biologists who are familiar with the fundamentals of the technology and analysis steps, but who have little bioinformatics training, will find in the core chapters an invaluable resource for learning about the different existing approaches, file formats, software, parameters, etc. for data analysis. The book will also be useful to scientists performing downstream analyses on the output of high throughput sequencing data, as a sound understanding of how their initial data was generated is crucial for an accurate interpretation of further outcomes. Additionally, we expect the book to appeal to computer scientists and biologists with a strong bioinformatics background, who will hopefully find in the open problems and challenges raised in each chapter motivation and inspiration for improving existing tools and developing new ones for high throughput data analysis.

Naiara Rodríguez-Ezpeleta
Michael Hackenberg
Ana M. Aransay

Contents

1 Introduction
  Naiara Rodríguez-Ezpeleta and Ana M. Aransay
2 Overview of Sequencing Technology Platforms
  Samuel Myllykangas, Jason Buenrostro, and Hanlee P. Ji
3 Applications of High-Throughput Sequencing
  Rodrigo Goya, Irmtraud M. Meyer, and Marco A. Marra
4 Computational Infrastructure and Basic Data Analysis for High-Throughput Sequencing
  David Sexton
5 Base-Calling for Bioinformaticians
  Mona A. Sheikh and Yaniv Erlich
6 De Novo Short-Read Assembly
  Douglas W. Bryant Jr. and Todd C. Mockler
7 Short-Read Mapping
  Paolo Ribeca
8 DNA-Protein Interaction Analysis (ChIP-Seq)
  Geetu Tuteja
9 Generation and Analysis of Genome-Wide DNA Methylation Maps
  Martin Kerick, Axel Fischer, and Michal-Ruth Schweiger
10 Differential Expression for RNA Sequencing (RNA-Seq) Data: Mapping, Summarization, Statistical Analysis, and Experimental Design
  Matthew D. Young, Davis J. McCarthy, Matthew J. Wakefield, Gordon K. Smyth, Alicia Oshlack, and Mark D. Robinson
11 MicroRNA Expression Profiling and Discovery
  Michael Hackenberg
12 Dissecting Splicing Regulatory Network by Integrative Analysis of CLIP-Seq Data
  Michael Q. Zhang
13 Analysis of Metagenomics Data
  Elizabeth M. Glass and Folker Meyer
14 High-Throughput Sequencing Data Analysis Software: Current State and Future Developments
  Konrad Paszkiewicz and David J. Studholme
Index

Contributors

Ana M. Aransay: Genome Analysis Platform, CIC bioGUNE, Parque Tecnológico de Bizkaia, Derio, Spain
Douglas W. Bryant, Jr.: Department of Botany and Plant Pathology, Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA; Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Jason Buenrostro: Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA
Yaniv Erlich: Whitehead Institute for Biomedical Research, Cambridge, MA, USA
Axel Fischer: Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
Elizabeth M. Glass: Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; Computation Institute, The University of Chicago, Chicago, IL, USA
Rodrigo Goya: Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada; Centre for High-Throughput Biology, University of British Columbia, Vancouver, BC, Canada; Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
Michael Hackenberg: Computational Genomics and Bioinformatics Group, Genetics Department, University of Granada, Granada, Spain
Hanlee P. Ji: Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA
Martin Kerick: Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
Marco A. Marra: Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Davis J. McCarthy: Bioinformatics Division, Walter and Eliza Hall Institute
Folker Meyer: Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; Computation Institute, The University of Chicago, Chicago, IL, USA; Institute for Genomics and Systems Biology, The University of Chicago, Chicago, IL, USA
Irmtraud M. Meyer: Centre for High-Throughput Biology, University of British Columbia, Vancouver, BC, Canada; Department of Computer Science, University of British Columbia, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Todd C. Mockler: Department of Botany and Plant Pathology, Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
Samuel Myllykangas: Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA
Alicia Oshlack: Bioinformatics Division, Walter and Eliza Hall Institute, School of Physics, University of Melbourne, Murdoch Childrens Research Institute, Parkville, Australia
Konrad Paszkiewicz: School of Biosciences, University of Exeter, Exeter, UK
Paolo Ribeca: Centro Nacional de Análisis Genómico, Baldiri Reixac 4, Barcelona, Spain
Mark D. Robinson: Bioinformatics Division, Walter and Eliza Hall Institute, Department of Medical Biology, University of Melbourne, Epigenetics Laboratory, Cancer Research Program, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
Naiara Rodríguez-Ezpeleta: Genome Analysis Platform, CIC bioGUNE, Parque Tecnológico de Bizkaia, Derio, Spain
Michal-Ruth Schweiger: Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
David Sexton: Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA
Mona A. Sheikh: Whitehead Institute for Biomedical Research, Cambridge, MA, USA
Gordon K. Smyth: Bioinformatics Division, Walter and Eliza Hall Institute, Department of Mathematics and Statistics, University of Melbourne
David J. Studholme: School of Biosciences, University of Exeter, Exeter, UK
Geetu Tuteja: Department of Developmental Biology, Stanford University, Stanford, CA, USA
Matthew J. Wakefield: Bioinformatics Division, Walter and Eliza Hall Institute, Department of Zoology, University of Melbourne
Matthew D. Young: Bioinformatics Division, Walter and Eliza Hall Institute
Michael Q. Zhang: Department of Molecular and Cell Biology, Center for Systems Biology, The University of Texas at Dallas, Richardson, TX, USA; Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
