A Gentle Introduction to Support Vector Machines in Biomedicine Volume 1: Theory and Methods

Alexander Statnikov, New York University, USA
Constantin F Aliferis, New York University, USA
Douglas P Hardin, Vanderbilt University, USA
Isabelle Guyon, ClopiNet, USA

World Scientific
New Jersey, London, Singapore, Beijing, Shanghai, Hong Kong, Taipei, Chennai

Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

A GENTLE INTRODUCTION TO SUPPORT VECTOR MACHINES IN BIOMEDICINE
Volume 1: Theory and Methods
Copyright © 2011 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN-13 978-981-4324-38-0
ISBN-10 981-4324-38-8

Typeset by Stallion Press
Email: enquiries@stallionpress.com
Printed in Singapore.

Contents

Preface
About the Authors

1. Introduction
    Classes of Data-Analytic Problems Considered in This Book
    Basic Principles of Classification
    Main Ideas of the Support Vector Machine (SVM) Classification Algorithm
    History of SVMs and Their Use in the Literature

2. Necessary Mathematical Concepts
    Geometrical Representation of Objects
    Basic Operations on Vectors
    Hyperplanes as Decision Surfaces
    Basics of Optimization

3. Support Vector Machines (SVMs) for Binary Classification: Classical Formulation
    Hard-Margin Linear SVM for Linearly Separable Data
    Soft-Margin Linear SVM for Data That Is Not Exactly Linearly Separable Due to Noise or Outliers
    Non-Linear SVM and Kernel Trick for Linearly Non-Separable Data

4. Basic Principles of Statistical Machine Learning
    Generalization and Overfitting
    Loss + Penalty Paradigm for Learning to Avoid Overfitting and Ensure Generalization

5. Model Selection for SVMs
    Motivation of Model Selection Strategy
    Commonly Used Parameters/Kernels of SVM Classifiers
    Cross-Validation for Accuracy Estimation
    Cross-Validation for Accuracy Estimation and Model Selection
    Statistical Considerations

6. SVMs for Multi-Category Classification
    One-Versus-Rest SVMs
    One-Versus-One SVMs
    Methods by Crammer and Singer and by Weston and Watkins

7. Support Vector Regression (SVR)
    Hard-Margin Linear ε-insensitive SVR for Modeling Linear Relations
    Soft-Margin Linear ε-insensitive SVR for Modeling Almost Linear Relations
    Non-Linear ε-insensitive SVR for Modeling Non-Linear Relations
    Comparing ε-insensitive SVR with Other Popular Regression Methods
    On Model Selection for ε-insensitive SVR

8. Novelty Detection with SVM-Based Methods
    Hard-Margin Linear One-Class SVM
    Soft-Margin Linear One-Class SVM
    Non-Linear One-Class SVM
    On Model Selection for One-Class SVM

9. Support Vector Clustering (Contributed by Nikita I. Lytkin)
    The Minimal Enclosing Hyper-Sphere
    Cluster Assignment in SVC
    Dealing with Noise in the Data
    Relationship Between the Minimal Enclosing Hyper-Sphere and One-Class SVM

10. SVM-Based Variable Selection
    Understanding the SVM Weight Vector
    Simple SVM-Based Variable Selection Algorithm
    SVM-RFE Variable Selection Algorithm
    Variable Selection and Estimation of Generalization Accuracy

11. Computing Posterior Class Probabilities for SVM Classifiers
    Simple Binning Method for Posterior Probability Estimation
    Platt's Method for Posterior Probability Estimation

12. Conclusions

Appendix
Bibliography
Index

Preface

Recent breakthroughs in molecular biology methods, including both structural assays (e.g., sequencing) and functional ones (e.g., gene expression), combined with an explosive proliferation of electronically accessible medical data, offer remarkable opportunities for new discoveries that are already revolutionizing the biomedical research and patient care landscapes. These discoveries are directly relevant to improved patient management (e.g., enhanced diagnostic modeling, as well as personalized treatment modeling), and to more foundational scientific discoveries (e.g., understanding the molecular mechanisms of diseases and devising targeted new therapies to prevent or cure them). Closing the gap between the massive amounts of biomedical knowledge hidden inside raw data waiting to be discovered, on the one hand, and the next generation of diagnostics, personalized treatments and new drugs, on the other hand, relies completely on sophisticated data analytics. Fortunately, in parallel with the development of powerful new ways to measure microscopic and macroscopic phenotypes, the scientific community has been working hard on, and has witnessed the emergence of, many powerful analytic methods that can measure up to the challenge.

The purpose of the present book is to disseminate conceptually clear and operationally useful information about a cutting-edge class of data analysis tools, the family of Support Vector Machine (SVM) learning methods. SVMs are very important because they handle well datasets and modeling tasks that are very problematic for other analysis methods. For example: (i) SVMs work well in datasets that have a very large number of variables and a relatively small sample size; (ii) SVMs can learn both simple and highly complex models; (iii) SVMs have strong built-in protections against overfitting, a phenomenon that is deleterious to modern high-dimensional modeling (intuitively, overfitting occurs when models work well in discovery datasets but fail subsequently in application or validation data). Because of these properties, SVMs have documented superior performance compared to other algorithms in many, if not the majority of, types of biomedical data to date.

However, biomedical researchers often experience difficulties grasping both the theory and applications of these important methods. This is, we believe, primarily due to the lack of necessary technical background in mathematics, computer science and machine learning. A consequence of this situation is the observed significant lag in the adoption of SVMs in the biomedical scientific community compared to the general sciences. The purpose of this book is to help alleviate this problem by introducing SVMs and their extensions, thus allowing biomedical researchers to understand them and apply them effectively in real-life research, education, and possibly clinical practice.

We acknowledge that many excellent books have been written on SVMs, and we ourselves have been beneficiaries of their teachings. However, in our experience as educators, the majority of biomedical researchers cannot fully benefit from the current literature on SVMs because of the advanced level of mathematical sophistication it requires. Our work aims to circumvent these prerequisites and to be accessible to the full spectrum of biomedical researchers, assuming only a rudimentary prior knowledge of mathematics and computation (roughly high-school or first-year college level). Given the stated focus, we use terminology that is familiar to biomedical researchers and healthcare professionals.

The book consists of two parts: in volume I we cover basic theory; in volume II we present several application and validation case studies. As the reader will quickly discover, volume I (theory) follows the approach of programmed learning, whereby material is presented in short sections ("frames"). Each frame consists of a very small amount of information to be learned, a problem, and answers to the problem. The reader proceeds to the next frame after verifying that he/she gave (or understands) the correct answers to the current frame. We chose this method because it is particularly effective for breaking down technically complex concepts and making them easily digestible. Volume II, with case studies, follows a conventional narrative approach since it does not involve hard-to-grasp technical material.

Over the years we have taught SVMs (at different levels of depth) to a variety of audiences that include graduate students and post-doctoral fellows in Biomedical Informatics, Computer Science, Clinical Investigation, and Clinical Pathology, as well as professional statisticians, medical informaticians, biological researchers, etc. We have taught aspects of SVMs in graduate school courses and professional society seminars, and to pathology residents (due to the relevance of SVMs to molecular profiling of diseases). In a research context, authors of this book have contributed to the invention of the core SVM methods; they have produced theorems and empirical results for understanding the behavior of SVMs; they have applied SVMs to solve real-life scientific and industrial projects; they have guided students' M.S. and Ph.D. theses with a heavy SVM emphasis; and they have authored software that uses SVMs to perform complex analytic tasks. We hope that these theoretical and applied contributions and experiences will be adequately reflected in making this book both technically accurate and up-to-date, as well as grounding it in state-of-the-art research.

We wish to emphasize that in writing this book we were very careful to point out both the strengths and the weaknesses of SVMs and not to create the impression that SVMs are a "one solution fits all" data analysis paradigm (or that we advocate them as such). Toward that goal we have dedicated considerable space to comparing SVMs with other important statistical and machine learning models, showing current limitations of SVMs wherever they exist, and pointing out theoretical and empirical evidence for appropriate use of the right tool for the job.

The measure of success of this book is whether it will empower researchers, educators and practitioners in the health sciences, in both academic and industry settings, to advance science and patient care through the understanding and use of a very powerful class of computational methods.

The book is accompanied by a website that provides errata and many useful updated links on SVM software, interactive demonstrations and animations: http://www.svms-inbiomedicine.org.

In closing, we would like to acknowledge: Nikita Lytkin for contributing a chapter on SVM-based clustering and valuable advice on presentation of various concepts in the book; Zhiguo Li for help with proofreading this book; Discovery Holdings, LLC for permission to use proprietary materials and information.

Alexander Statnikov

My colleagues and students at Vanderbilt University and New York University for working with me over the years on numerous exciting projects that involved SVMs; I would not have been involved in SVM research in particular without the mentoring of my good friend Doug Hardin (a co-author of this book), and this is a good place to express my great gratitude for this; Discovery Holdings, LLC for permission to use proprietary materials and information.

Constantin Aliferis

About the Authors

Alexander Statnikov is Assistant Professor in the Department of Medicine and Center for Health Informatics and Bioinformatics at New York University Langone Medical Center, Director of the Computational Causal Discovery Laboratory, and Benchmarking Director of the Best Practices Integrative Informatics Consultation Service. He is an author of more than 40 peer-reviewed publications (books, book chapters, journal papers, conference papers, etc.) and a co-inventor of 4 pending patents in machine learning and biomedical informatics. Most of his papers and research rely on the use of Support Vector Machine (SVM) algorithms. Dr. Statnikov is a co-inventor and a primary developer of GEMS, an SVM-based software system for automated development of molecular signatures and biomarker discovery from microarray gene expression data, which has more than 1000 registered users all over the world. The primary publication about the GEMS system has received more than 300 citations so far. Dr. Statnikov has also made a significant contribution to the development of the SVM-based system FAST-AIMS for automated analysis of mass-spectrometry data. In addition to the above, Dr. Statnikov has designed many algorithms, conducted their empirical evaluations, and made other important contributions to the fields of machine learning and pattern recognition, analysis of high-throughput biomedical data, computational causal discovery, and biomedical informatics.

Constantin Aliferis is the Director of the Center for Health Informatics and Bioinformatics at New York University, the Director of Informatics for the NYU Clinical and Translational Science Institute, Director of the Molecular Signatures Laboratory, Scientific Director of the Best Practices Integrative Informatics Consultation Service at NYULMC, and an elected Fellow of the American College of Medical Informatics (ACMI). He is an Associate Professor in the Department of Pathology of NYU School of Medicine, and has adjunct appointments in Biomedical Informatics and Biostatistics at Vanderbilt University. In the past, he has also held faculty appointments in Computer Science and Cancer Biology at Vanderbilt. Dr. Aliferis has made methodological and applied contributions to the fields of machine learning and pattern recognition, analysis of high-throughput biomedical data, computational causal discovery, biomedical information retrieval, and biomedical informatics. He is an author of 100 peer-reviewed publications and principal investigator, co-principal investigator and co-investigator in 20 federal grants. He has 9 pending and granted patents in machine learning and biomedical informatics. In addition, Dr. Aliferis is a co-inventor of two SVM-based software systems (GEMS and FAST-AIMS). Dr. Aliferis and his lab have used SVMs extensively for academic and commercial projects in a variety of biomedical application domains.

Douglas Hardin is Professor in the Departments of Mathematics and Biomedical Informatics at Vanderbilt University. He is an author of more than 60 peer-reviewed publications and investigator in 11 grants. One of his primary research directions is variable selection with SVMs. Dr. Hardin has designed a course for biomedical informatics students and fellows that teaches SVMs without requiring extensive mathematical background. This course has been offered at Vanderbilt University for the last 7 years and has been very successful in educating dozens of students and fellows about SVMs.

Isabelle Guyon is a researcher and consultant in pattern recognition, machine learning, statistical data analysis, and data mining. She is a co-inventor of the support vector machine (SVM) method and the SVM-RFE variable selection method. These contributions have thousands of citations. For the past 10 years, she has been involved in analysis of high-throughput molecular data (from DNA microarrays, antibody arrays, and mass-spectrometers) and development of predictive models/signatures for many complex diseases and phenotypes. Her application of SVMs to prostate cancer DNA microarray data has led to the development of a diagnostic test for prostate cancer, which is in the process of commercialization. She has organized several challenges in machine learning involving biological high-throughput molecular data in an effort to benchmark methods and identify the most effective ones. She has co-authored and edited several books.