Bioinformatics for High Throughput Sequencing


Naiara Rodríguez-Ezpeleta, Michael Hackenberg, and Ana M. Aransay (Editors)

Editors

Naiara Rodríguez-Ezpeleta
Genome Analysis Platform, CIC bioGUNE
Derio, Bizkaia, Spain
nrodriguez@cicbiogune.es

Ana M. Aransay
Genome Analysis Platform, CIC bioGUNE
Derio, Bizkaia, Spain
amaransay@cicbiogune.es

Michael Hackenberg
Computational Genomics and Bioinformatics Group
Genetics Department & Biomedical Research Center (CIBM)
University of Granada, Spain
mlhack@gmail.com

ISBN 978-1-4614-0781-2
e-ISBN 978-1-4614-0782-9
DOI 10.1007/978-1-4614-0782-9
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011937571

© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The purpose of this book is to collect in a single volume the essentials of high throughput sequencing data analysis. These new technologies make it possible to perform, at unprecedentedly low cost and high speed, a panoply of experiments spanning the sequencing of whole genomes or transcriptomes, the profiling of DNA methylation, and the detection of protein-DNA interaction sites, among others. Each experiment generates a massive amount of sequence information, making data analysis the major challenge in high throughput sequencing-based projects.

Hundreds of bioinformatics applications have been developed so far, most of them focusing on specific tasks. Indeed, numerous approaches have been proposed for each analysis step, while integrated analysis applications and protocols are generally missing. As a result, even experienced bioinformaticians struggle when they have to choose among countless possibilities to analyze their data. This, together with a shortage of qualified personnel, reveals an urgent need to train bioinformaticians in existing approaches and to develop integrated, start-to-end software applications to face present and future challenges in data analysis.

Given this scenario, our motivation was to assemble a book covering the aforementioned aspects. Following three fundamental introductory chapters, the core of the book focuses on the bioinformatics aspects, presenting a comprehensive review of the methods and programs available to analyze the raw data obtained from each experiment type. In addition, the book is meant to provide insight into the challenges and opportunities faced by both biologists and bioinformaticians during this new era of sequencing data analysis.

Given the vast range of high throughput sequencing applications, we set out to edit a book suitable for readers from different research areas, academic backgrounds and degrees of acquaintance with this new technology. At the same time, we expect the book to be equally useful to researchers involved in the different steps of a high throughput sequencing project. Newcomers eager to learn the basics of high throughput sequencing technologies and data analysis will find what they are looking for especially in the introductory chapters, but also by skipping the details and picking up the rudiments of the core chapters. Biologists who are familiar with the fundamentals of the technology and analysis steps, but who have little bioinformatics training, will find in the core chapters an invaluable resource for learning about the different existing approaches, file formats, software, parameters, etc. for data analysis.

The book will also be useful to scientists performing downstream analyses on the output of high throughput sequencing data, as a thorough understanding of how their initial data were generated is crucial for an accurate interpretation of further outcomes. Additionally, we expect the book to appeal to computer scientists and biologists with a strong bioinformatics background, who will hopefully find in the problematic issues and challenges raised in each chapter motivation and inspiration for improving existing tools and developing new ones for high throughput data analysis.

Naiara Rodríguez-Ezpeleta
Michael Hackenberg
Ana M. Aransay

Contents

1 Introduction
Naiara Rodríguez-Ezpeleta and Ana M. Aransay

2 Overview of Sequencing Technology Platforms
Samuel Myllykangas, Jason Buenrostro, and Hanlee P. Ji

3 Applications of High-Throughput Sequencing
Rodrigo Goya, Irmtraud M. Meyer, and Marco A. Marra

4 Computational Infrastructure and Basic Data Analysis for High-Throughput Sequencing
David Sexton

5 Base-Calling for Bioinformaticians
Mona A. Sheikh and Yaniv Erlich

6 De Novo Short-Read Assembly
Douglas W. Bryant Jr. and Todd C. Mockler

7 Short-Read Mapping
Paolo Ribeca

8 DNA-Protein Interaction Analysis (ChIP-Seq)
Geetu Tuteja

9 Generation and Analysis of Genome-Wide DNA Methylation Maps
Martin Kerick, Axel Fischer, and Michal-Ruth Schweiger

10 Differential Expression for RNA Sequencing (RNA-Seq) Data: Mapping, Summarization, Statistical Analysis, and Experimental Design
Matthew D. Young, Davis J. McCarthy, Matthew J. Wakefield, Gordon K. Smyth, Alicia Oshlack, and Mark D. Robinson

11 MicroRNA Expression Profiling and Discovery
Michael Hackenberg

12 Dissecting Splicing Regulatory Network by Integrative Analysis of CLIP-Seq Data
Michael Q. Zhang

13 Analysis of Metagenomics Data
Elizabeth M. Glass and Folker Meyer

14 High-Throughput Sequencing Data Analysis Software: Current State and Future Developments
Konrad Paszkiewicz and David J. Studholme

Index

Contributors

Ana M. Aransay
Genome Analysis Platform, CIC bioGUNE, Parque Tecnológico de Bizkaia, Derio, Spain

Douglas W. Bryant, Jr.
Department of Botany and Plant Pathology, Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA

Jason Buenrostro
Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA

Yaniv Erlich
Whitehead Institute for Biomedical Research, Cambridge, MA, USA

Axel Fischer
Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany

Elizabeth M. Glass
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
Computation Institute, The University of Chicago, Chicago, IL, USA

Rodrigo Goya
Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada
Centre for High-Throughput Biology, University of British Columbia, Vancouver, BC, Canada
Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

Michael Hackenberg
Computational Genomics and Bioinformatics Group, Genetics Department, University of Granada, Granada, Spain

Hanlee P. Ji
Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA

Martin Kerick
Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany

Marco A. Marra
Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, Canada
Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Davis J. McCarthy
Bioinformatics Division, Walter and Eliza Hall Institute

Folker Meyer
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
Computation Institute, The University of Chicago, Chicago, IL, USA
Institute for Genomics and Systems Biology, The University of Chicago, Chicago, IL, USA

Irmtraud M. Meyer
Centre for High-Throughput Biology, University of British Columbia, Vancouver, BC, Canada
Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Todd C. Mockler
Department of Botany and Plant Pathology, Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA

Samuel Myllykangas
Division of Oncology, Department of Medicine, Stanford Genome Technology Center, Stanford University School of Medicine, Stanford, CA, USA

Alicia Oshlack
Bioinformatics Division, Walter and Eliza Hall Institute, School of Physics, University of Melbourne, Murdoch Childrens Research Institute, Parkville, Australia

Konrad Paszkiewicz
School of Biosciences, University of Exeter, Exeter, UK

Paolo Ribeca
Centro Nacional de Análisis Genómico, Baldiri Reixac 4, Barcelona, Spain

Mark D. Robinson
Bioinformatics Division, Walter and Eliza Hall Institute, Department of Medical Biology, University of Melbourne, Epigenetics Laboratory, Cancer Research Program, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia

Naiara Rodríguez-Ezpeleta
Genome Analysis Platform, CIC bioGUNE, Parque Tecnológico de Bizkaia, Derio, Spain

Michal-Ruth Schweiger
Cancer Genomics Group, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany

David Sexton
Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA

Mona A. Sheikh
Whitehead Institute for Biomedical Research, Cambridge, MA, USA

Gordon K. Smyth
Bioinformatics Division, Walter and Eliza Hall Institute, Department of Mathematics and Statistics, University of Melbourne

David J. Studholme
School of Biosciences, University of Exeter, Exeter, UK

Geetu Tuteja
Department of Developmental Biology, Stanford University, Stanford, CA, USA

Matthew J. Wakefield
Bioinformatics Division, Walter and Eliza Hall Institute, Department of Zoology, University of Melbourne

Matthew D. Young
Bioinformatics Division, Walter and Eliza Hall Institute

Michael Q. Zhang
Department of Molecular and Cell Biology, Center for Systems Biology, The University of Texas at Dallas, Richardson, TX, USA
Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
