Computing with large data sets

1 Computing with large data sets Richard Bonneau, spring 2009 Lecture 16 (week 10): bioconductor: an example R multi-developer project

2 Acknowledgments and other sources:
- Ben Bolstad, Biostats lectures, Berkeley
- Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Gentleman, Carey, Huber, Irizarry, Dudoit)
- The bioconductor website:
v : computing with data, Richard Bonneau Lecture 14 16

3 Central dogma of gene expression: DNA -> RNA -> Protein

5 measuring mRNA: microarrays (Bacillus subtilis)
- Genome of 4,106 protein-coding genes, one spot per gene
- PCR-amplified probes printed on aminosilane-coated slides, UV-crosslinked
- Spotting inconsistencies

6 measuring mRNA: Affymetrix chip microarrays
- Time spent on experiment: ~7 days
- Cost of experiment: $150-$600

7 2D - LC/LC
- Study protein complexes without gel electrophoresis (trypsin)
- Peptides all bind to cation exchange column
- Successive elution with increasing salt gradients separates peptides by charge
- Complex mixture is simplified prior to MS/MS by 2D LC
- Peptides are separated by hydrophobicity on reverse phase column

8 transcription factors control expression of genes

9 bioconductor
- Bioconductor (BioC) is an open-source, open-development software project that actively develops tools for the analysis of many types of genomic data
- Mainly written in R
- Global and open source
- Licensed under the GPL/LGPL/BSD licenses

10 bioconductor
- R gives us a wide range of powerful statistical and graphical methods
- Tracking and management of biological metadata in the analysis of experimental data
- R facilitates the rapid development of extensible, scalable, and interoperable software
- Each package has high-quality documentation and supports reproducible research
- The team provides training workshops in computational and statistical methods for genomic analysis

11 bioconductor features
- Platform independent: Linux/Unix, Windows
- Predominantly command line interface
- Often object oriented: S4 objects
- Most of the current tools are designed for the analysis of microarray data
- R is used by many statisticians and has a large repository of packages which might also be useful: cran.r-project.org
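To make the "S4 objects" point concrete, here is a minimal sketch of an S4 class with a validity check and accessor generics, using only the methods package that ships with R. The class name ToyExprSet and the accessors toyExprs() and toySampleNames() are hypothetical names invented for this illustration; they are not Bioconductor classes or functions.

```r
library(methods)

# Hypothetical toy container: an expression matrix (genes x samples) plus one
# row of phenotype data per sample. Not a Bioconductor class.
setClass("ToyExprSet",
         representation(exprs = "matrix", pheno = "data.frame"))

# The validity method runs whenever new() builds an object of this class.
setValidity("ToyExprSet", function(object) {
  if (ncol(object@exprs) != nrow(object@pheno)) {
    "one phenotype row per sample (column of exprs) is required"
  } else {
    TRUE
  }
})

# Accessor generics hide the slot layout from users of the class.
setGeneric("toyExprs", function(object) standardGeneric("toyExprs"))
setMethod("toyExprs", "ToyExprSet", function(object) object@exprs)

setGeneric("toySampleNames", function(object) standardGeneric("toySampleNames"))
setMethod("toySampleNames", "ToyExprSet",
          function(object) colnames(object@exprs))

toy <- new("ToyExprSet",
           exprs = matrix(1:6, nrow = 3,
                          dimnames = list(paste0("gene", 1:3), c("s1", "s2"))),
           pheno = data.frame(condition = c("control", "treated"),
                              row.names = c("s1", "s2")))

print(toySampleNames(toy))
print(dim(toyExprs(toy)))
```

The design point is that callers go through the accessors, so the internal slots can change without breaking code that uses the class.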

12 bioconductor : open source
- Full access to algorithms and their implementation
- The ability to fix bugs
- To encourage good scientific computing and statistical practice by providing appropriate tools and instruction
- To provide a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data
- To ensure that the international scientific community is the owner of the software tools needed to carry out research

13 bioconductor: docs
- Each package contains at least one vignette: a document that provides a textual, task-oriented description of the package's functionality and that can be used interactively.
- Many are simple "HowTo"s, that is, they are designed to demonstrate how a particular task can be accomplished with that package's software.
- Others provide a more thorough overview of the package, or might even discuss general issues related to the package.
- The vignettes are generated using the Sweave function (in current R it is provided by the utils package). They are documents that intermix text, code, and output (textual and graphical) and can be regenerated automatically whenever the data or analyses change.
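As a rough illustration of how a Sweave document intermixes text and code, the sketch below writes a tiny .Rnw file to a temporary location and weaves it into a .tex file with utils::Sweave. The file contents and the chunk label toy-mean are made up for this example; a real vignette would live in a package and be considerably longer.

```r
# Write a minimal Sweave (.Rnw) source: LaTeX text plus one R code chunk.
rnw <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the toy data is computed in the chunk below.",
  "<<toy-mean, echo=TRUE>>=",
  "x <- c(1, 2, 3, 4)",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Weave: run the chunk and splice its code and output into a .tex file.
# No LaTeX installation is needed for this step, only for compiling the PDF.
tex <- sub("\\.Rnw$", ".tex", rnw)
utils::Sweave(rnw, output = tex, quiet = TRUE)

tex_lines <- readLines(tex)
print(tex_lines)
```

Rerunning Sweave after changing the data in the chunk regenerates the output, which is the "regenerated automatically" property the slide describes.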

14 bioconductor : packages
- There are currently almost 90 packages in the 1.4 release (May 2004)
- The first release in May 2002 had only 15 packages
- Some are very simple while others provide extensive capabilities for the analysis of a particular type of data
- There is some level of dependency among the packages
- We will explore a subset of the packages

15 bioconductor : Biobase
Accessor functions that can be applied to exprSets:
- exprs(): access the expression values
- se.exprs(): access standard error estimates
- pData(): access phenotype data
- description(): obtain the MIAME information
- geneNames(): access the names of the genes
- sampleNames(): names of the samples
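The accessors above can be tried on a small hand-built object. The sketch below assumes the Bioconductor package Biobase is installed (the guard skips the example otherwise) and uses its ExpressionSet class, which in current Biobase releases takes the place of the older exprSet; with it, featureNames() is used where the slide lists geneNames(). All gene names, sample names, and values are invented.

```r
# Assumption: Biobase is installed. If it is not, the example is skipped.
if (requireNamespace("Biobase", quietly = TRUE)) {
  library(Biobase)

  # Toy expression matrix: rows are genes, columns are samples (made-up data).
  expr_mat <- matrix(c(7.1, 8.3, 5.2, 7.4, 9.0, 5.1), nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("s1", "s2")))

  # Phenotype data: one row per sample, row names matching the matrix columns.
  pheno <- data.frame(condition = c("control", "treated"),
                      row.names = c("s1", "s2"))

  eset <- ExpressionSet(assayData = expr_mat,
                        phenoData = AnnotatedDataFrame(pheno))

  print(exprs(eset))         # expression values
  print(pData(eset))         # phenotype data
  print(featureNames(eset))  # gene (feature) names
  print(sampleNames(eset))   # sample names
}
```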

16 bioconductor : affy, a package for Affymetrix
The core package for low-level analysis of Affymetrix data. Provides:
- Mechanisms for reading and storing CEL file data (raw probe intensities)
- Tools for exploring probe-intensity data
- Methods for pre-processing: background correction, normalization
- Computing expression measures
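To give a feel for what a normalization step does, here is a base-R sketch of quantile normalization, one commonly used way to make the intensity distributions of several arrays identical. It is an illustration on simulated data, not the affy package's own implementation; the function name quantile_normalize and the array names are made up.

```r
# Quantile normalization sketch: force every column (array) to share the same
# empirical distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                         # sort each array
  target <- rowMeans(sorted)                          # mean quantiles
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(42)
# Simulated log-intensities for 100 probes on 3 arrays with different offsets.
raw <- cbind(array1 = rnorm(100, mean = 8.0),
             array2 = rnorm(100, mean = 8.5),
             array3 = rnorm(100, mean = 9.0))

normalized <- quantile_normalize(raw)

# Before: medians differ between arrays. After: they agree.
print(apply(raw, 2, median))
print(apply(normalized, 2, median))
```

Within each array the ordering of the probes is preserved; only the values attached to each rank change.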

17 bioconductor : affy, a package for Affymetrix
Figure panels: boxplot(), hist()

18 affyPLM - Pseudo-chip images
Figure panels drawn with image(): weights, residuals, positive residuals, negative residuals

19 affyPLM - RLE plots
Relative Log Expression, drawn with Mbox()
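The quantity behind an RLE plot can be sketched in base R: for each gene, subtract its median log expression across arrays, then draw one box per array. The data below are simulated, with one array deliberately shifted to mimic a problem chip; this is an illustration of the idea and does not call affyPLM.

```r
set.seed(1)
# Simulated log2 expression: 200 genes on 4 arrays (made-up data).
log_expr <- matrix(rnorm(200 * 4, mean = 8, sd = 1), nrow = 200,
                   dimnames = list(paste0("gene", 1:200),
                                   paste0("array", 1:4)))
# Shift one array upward to imitate a poor-quality hybridization.
log_expr[, "array4"] <- log_expr[, "array4"] + 1

# Relative log expression: each value minus that gene's median across arrays.
gene_medians <- apply(log_expr, 1, median)
rle_values <- sweep(log_expr, 1, gene_medians, "-")

# A well-behaved array has a box centered near 0 with a small spread.
array_medians <- apply(rle_values, 2, median)
print(round(array_medians, 3))
boxplot(rle_values, main = "RLE (simulated data)", ylab = "Relative log expression")
```

In this simulation the shifted array stands out because its box sits well above zero.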

20 affyPLM - NUSE plots
Normalized Unscaled Standard Errors, drawn with boxplot()
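The normalization step in NUSE can be sketched the same way: each gene's standard error on each array is divided by that gene's median standard error across arrays, so a typical array centers near 1. The standard errors below are simulated rather than taken from a fitted probe-level model, so this shows only the arithmetic, not affyPLM itself.

```r
set.seed(2)
# Simulated standard errors for 200 genes on 4 arrays (made-up data; in
# practice these would come from a probe-level model fit).
se <- matrix(runif(200 * 4, min = 0.05, max = 0.15), nrow = 200,
             dimnames = list(paste0("gene", 1:200), paste0("array", 1:4)))
# Inflate the errors on one array to imitate a noisy chip.
se[, "array3"] <- se[, "array3"] * 1.5

# NUSE: standard error divided by the per-gene median across arrays.
gene_median_se <- apply(se, 1, median)
nuse <- sweep(se, 1, gene_median_se, "/")

nuse_medians <- apply(nuse, 2, median)
print(round(nuse_medians, 3))
boxplot(nuse, main = "NUSE (simulated data)", ylab = "Normalized unscaled SE")
abline(h = 1, lty = 2)  # reference line: typical arrays center near 1
```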

21 QC : affyPLM
Fitting probe-level models to Affymetrix data provides quality control information. Quality assessment focuses on:
- Residuals
- Weights from a robust fitting procedure
- Relative log expression
- Standard errors

22 getting data from the web

23 getting data from the web
library(Biobase)
library(GEOquery)
# Download the GDS file, put it in the current directory, and load it:
gds858 <- getGEO('GDS858', destdir=".")
# Or, open an existing GDS file (even if it is compressed):
gds858 <- getGEO(filename='GDS858.soft.gz')
A good example of using the connection to GEO: peter_cock/r/geo/

24 annotate
- Handles annotation
- Convert between UniGene, LocusLink, Affymetrix probeset IDs and other annotation methods
- Methods for accessing online information from PubMed, GenBank

25 Rgraphviz
Allows you to create graphs with nodes and edges

26 other useful packages
- cluster: for clustering
- class: for classification
- rpart: trees
- mclust: model-based clustering
- mgcv: smoothers

27 There is only one take-home message from this entire lecture: R can support a big effort very well, with web services, interfaces to data repositories, better languages used for web and database programming, access to high-level statistics and machine learning work, graphical interfaces, automated generation of interactive reports, etc.
