pint: probabilistic data integration for functional genomics

Similar documents
The SVM2CRM data User s Guide

Analysis of genomic arrays using quantile smoothing

An Introduction to exomepeak

IdeoViz Plot data along chromosome ideograms

IdeoViz Plot data along chromosome ideograms

Using DeMAND, a package for Interrogating Drug Mechanism of Action using Network Dysregulation analysis

Generating Grey Lists from Input Libraries

immunoclust - Automated Pipeline for Population Detection in Flow Cytometry

DBChIP: Detecting differential binding of transcription factors with ChIP-seq

immunoclust - Automated Pipeline for Population Detection in Flow Cytometry

immunoclust - Automated Pipeline for Population Detection in Flow Cytometry

User s guide to flowmap

An Introduction Mixture Model Normalization with the metabomxtr Package

ichip: A Package for Analyzing Multi-platform ChIP-chip data with Various Sample Sizes

Exploring protein interaction data using ppistats

DBChIP: Detecting differential binding of transcription factors with ChIP-seq

GEOsubmission. Alexandre Kuhn October 30, 2017

Introduction to the Data Analysis of the Roche xcelligence System with RTCA Package

Using the attract Package To Find the Gene Expression Modules That Represent The Drivers of Kauffman s Attractor Landscape

methyanalysis: an R package for DNA methylation data analysis and visualization

The CopyNumber450k User s Guide: Calling CNVs from Illumina 450k Methylation Arrays

mirnatap example use

daglogo guide Jianhong Ou, Lihua Julie Zhu April 16, Introduction 1 4 using daglogo to analysis Catobolite Activator Protein 9 5 References 11

msgbsr: an R package to analyse methylation sensitive genotyping by sequencing (MS-GBS) data

MGFM: Marker Gene Finder in Microarray gene expression data

cogena, a workflow for co-expressed gene-set enrichment analysis Zhilong Jia, Michael R. Barnes

mirnatap example use

prot2d : Statistical Tools for volume data from 2D Gel Electrophoresis

1 curatedbladderdata: Clinically Annotated Data for the Bladder Cancer Transcriptome... 1

flowvs: Variance stabilization in flow cytometry (and microarrays)

Design Microarray Probes

flowvs: Variance stabilization in flow cytometry (and microarrays)

SVM2CRM: support vector machine for the cis-regulatory elements detections

GeneExpressionSignature: Computing pairwise distances between different biological states

Bioconductor mapkl Package

traser: TRait-Associated SNP EnRichment analyses

The REDseq user s guide

The REDseq user s guide

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

ChIP-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

I/O Suite, VCF (1000 Genome) and HapMap

User s Manual Version 1.0

Package COCOA. November 6, 2018

Bioinformatics for Biologists

The first thing you will see is the opening page. SeqMonk scans your copy and make sure everything is in order, indicated by the green check marks.

Human Fibroblast IMR90 Hi-C Data (Dixon et al.)

Oscope: a statistical pipeline for identifying oscillatory genes in unsynchronized single cell RNA-seq experiments

Oscope: a statistical pipeline for identifying oscillatory genes in unsynchronized single cell RNA-seq experiments

Package ABC.RAP. October 20, 2016

Bioinformatics for Biologists

methyanalysis: an R package for DNA methylation data analysis and visualization

Bioinformatics. Microarrays: designing chips, clustering methods. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

Bayesian Inference of Regulatory influence on Expression (birte)

Package GSAQ. September 8, 2016

methyanalysis: an R package for DNA methylation data analysis and visualization

prot2d : Statistical Tools for volume data from 2D Gel Electrophoresis

Feature Selection of Gene Expression Data for Cancer Classification: A Review

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

ChIP-seq and RNA-seq. Farhat Habib

Finding Chimeric Sequences

Machine Learning in Computational Biology CSC 2431

Calculation of Spot Reliability Evaluation Scores (SRED) for DNA Microarray Data

Gene-Level Analysis of Exon Array Data using Partek Genomics Suite 6.6

CS273B: Deep learning for Genomics and Biomedicine

Introduction to human genomics and genome informatics

seqpattern: an R package for visualisation of oligonucleotide sequence patterns and motif occurrences

Retrieval of Gene Expression Measurements with Probabilistic Models

seqpattern: an R package for visualisation of oligonucleotide sequence patterns and motif occurrences

Comp/Phys/Mtsc 715. Example Videos. Administrative 4/12/2012. Bioinformatics Visualization. Vis 2005, Bertram. Vis 2005, Cantarel(tighten.

The opossom Package. Henry Löffler-Wirth, Martin Kalcher. October 30, 2017

Epigenetics and DNase-Seq

ChIP-seq and RNA-seq

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Syllabus for BIOS 101, SPRING 2013

Inferring Gene-Gene Interactions and Functional Modules Beyond Standard Models

goseq: Gene Ontology testing for RNA-seq datasets

Cross-organism prediction of drug hepatotoxicity by sparse group factor analysis

Machine Learning. HMM applications in computational biology

Agilent Genomics Software Future Directions

user s guide Question 1

The essentials of microarray data analysis

About Strand NGS. Strand Genomics, Inc All rights reserved.

Shannon pipeline plug-in: For human mrna splicing mutations CLC bio Genomics Workbench plug-in CLC bio Genomics Server plug-in Features and Benefits

Introduction to gene expression microarray data analysis

Introduction to Bioinformatics and Gene Expression Technology

Computing with large data sets

Bayesian Inference of Regulatory influence on Expression (birte)

Biology 644: Bioinformatics

Data Mining for Biological Data Analysis

CS-E5870 High-Throughput Bioinformatics Microarray data analysis

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Supplementary Methods

3.1.4 DNA Microarray Technology

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

goseq: Gene Ontology testing for RNA-seq datasets

Tutorial. Bisulfite Sequencing. Sample to Insight. September 15, 2016

Transcription:

pint: probabilistic data integration for functional genomics Olli-Pekka Huovilainen 1* and Leo Lahti 1,2 (1) Dpt. Information and Computer Science, Aalto University, Finland (2) Dpt. Veterinary Bioscience, University of Helsinki, Finland October 30, 2017 1 Introduction Multiple genomic observations from the same samples are increasingly available in biomedical studies, including measurements of gene- and micro-rna expression levels, DNA copy number, and methylation status. By investigating dependencies between different functional layers of the genome it is possible to discover mechanisms and interactions that are not seen in the individual measurement sources. For instance, integration of gene expression and DNA copy number can reveal cancer-associated chromosomal regions and associated genes with potential diagnostic, prognostic and clinical impact [6]. This package implements probabilistic models for integrative analysis of mrna expression levels with DNA copy number (acgh) measurements to discover functionally active chromosomal alterations. The algorithms can be used to discover functionally altered chromosomal regions and to visualize the affected genes and samples. The algorithms can be applied also to other types of biomedical data, including epigenetic modifications, SNPs, alternative splicing and transcription factor binding, or in other application fields. The methods are based on latent variable models including probabilistic canonical correlation analysis [2] and related extensions [1, 4, 6], implemented in the dmt package in CRAN [5, 3]. Probabilistic formulation deals rigorously with uncertainty associated with small sample sizes common in biomedical studies and provides tools to guide dependency modeling through Bayesian priors [6]. 1.0.1 Dependencies The CRAN packages dmt and mvtnorm are required for installation. * ohuovila@gmail.com 1

2 Examples This Section shows how to apply the methods for dependency detection in functional genomics. For further details on the dependency modeling framework, see the dependency modeling package dmt in CRAN 1. 2.1 Example data Our example data set contains matched observations of gene expression and copy number from a set of gastric cancer patients [7]. Load the package and example data with: > library(pint) > data(chromosome17) The example data contains (geneexp and genecopyn um) objects. These lists contain two elements: ˆ data matrix with gene expression or gene copy number data. Genes are in rows and samples in columns and rows and columns should be named and the probes and samples are matched between the two data sets. ˆ inf o data frame with additional information about the genes in the data object; in particular, loc indicates the genomic location of each probe in base pairs; chr and arm indicate the chromosome and chromosomal arm of the probe. The models assume approximately Gaussian distributed observations. With microarray data sets, this is typically obtained by presenting the data in the log 2 domain, which is the default in many microarray preprocessing methods. 2.2 Discovering functionally active copy number changes Chromosomal regions that have simultaneous copy number alterations and gene expression changes will reveal potential cancer gene candidates. To detect these regions, we measure the dependency between expression and copy number for each region and pick the regions showing the highest dependency as such regions have high dependency between the two data sources. A sliding window over the genome is used to quantify dependency within each region. Here we show a brief example on chromosome arm 17q: > models <- screen.cgh.mrna(geneexp, genecopynum, windowsize = 10, chr = 17, arm = 'q') The dependency is measured separately for each gene within a chromosomal region ( window ) around the gene. A fixed dimensionality (window size) is necessary to ensure comparability of the dependency scores between windows. The scale of the chromosomal regions can be tuned by changing the window size ( windowsize ). The default dependency modeling method is a constrained version of probabilistic CCA; [6]. See help(screen.cgh.mrna) for further options. 1 http://dmt.r-forge.r-project.org/ 2

2.3 Application in other genomic data integration tasks Other genomic data sources such as micro-rna or epigenetic measurements are increasingly available in biomedical studies, accompanying observations of DNA copy number changes and mrna expression levels [8]. Given matched probes and samples, the current functions can be used to screen for dependency between any pair of genomic (or other) data sources. 3 Summarization and interpretation 3.1 Visualization Dependency plots will reveals chromosomal regions with the strongest dependency between gene expression and copy number changes: > plot(models, showtop = 10) Dependency score for chromosome 17q dependency score 0.00 0.05 0.10 0.15 0.20 0.25 10 1 234 5678 9 30 40 50 60 70 80 gene location Figure 1: The dependency plot reveals chromosomal regions with the strongest dependency between gene expression and copy number. Here the highest dependency is between 30-40Mbp which is a known gastric cancer-associated region. Note that the display shows the location in megabasepairs while location is provided in basepairs. The top-5 genes with the highest dependency in their chromosomal neighborghood can be retrieved with: > topgenes(models, 5) 3

[1] "ENSG00000141738" "ENSG00000141736" "ENSG00000173991" "ENSG00000131748" [5] "ENSG00000141744" It is also possible investigate the contribution of individual patients or probes on the overall dependency based on the model parameters W and the latent variable z that are easily retrieved from the learned dependency model (Fig. 2). In 1-dimensional case the interpretation is straightforward: z will indicate the shared signal strength in each sample and W describes how the shared signal is reflected in each data source. With multi-dimensional W and z, the variableand sample effects are approximated (for visualization purposes) by the loadings and projection scores corresponding of the first principal component of W z is used to summarize the shared signal in each data set. > model <- topmodels(models) > plot(model, geneexp, genecopynum) matched case Wx = Wy with unconstrained W. Check covariances from parameters. model around gene ENSG00000141738 at 35.155936 Sample effects 6 Weight 4 2 0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20 S21 S22 S23 S24 S25 S26 S27 S28 S29 S30 S31 S32 S33 S34 S35 S36 S37 S38 S39 S40 S41 S42 S43 S44 S45 S46 S47 S48 S49 S50 S51 Variable effects (first data set) Variable effects (second data set) Weight 0.4 0.3 0.2 0.1 0.0 Weight 0.4 0.3 0.2 0.1 0.0 ENSG00000173991 ENSG00000141744 ENSG00000161395 ENSG00000141736 ENSG00000141738 ENSG00000186075 ENSG00000108342 ENSG00000008838 ENSG00000171475 ENSG00000094804 ENSG00000173991 ENSG00000141744 ENSG00000161395 ENSG00000141736 ENSG00000141738 ENSG00000186075 ENSG00000108342 ENSG00000008838 ENSG00000171475 ENSG00000094804 Figure 2: Samples and variable contribution to the dependencies around the gene with the highest dependency score between gene expression and copy number measurements in the chromosomal region. The visualization highlights the affected patients and genes. 3.2 Summarization Other useful functions for summarizing and investigating the results include: 4

ˆ join.top.regions: merge overlapping models that exceed the threshold; gives a list of distinct, continuous regions detected by the models. ˆ summarize.region.parameters: provides a summary of sample and probe effects over partially overlapping models 4 The dependency modeling framework Detailed description of the model parameters and available dependency detection methods is provided with the dmt package in CRAN [5]. The models are based on probabilistic canonical correlation analysis and related extensions [2, 6]. In summary, the shared signal between two (multivariate) observations X, Y is modeled with a shared latent variable z. This can have different manifestation in each data set, which is described by linear transformations W x and W y. Standard multivariate normal distribution for the shared latent variable and data set-specific effects gives the following model: X W x z + ε x Y W y z + ε y (1) The data set-specific effects are modeled with multivariate Gaussians ε. N (0, Ψ. ) with covariances Ψ x, Ψ y, respectively. Dependency between the data sets X, Y is quantified by the ratio of shared vs. data set-specific signal (see?dependency.score ), calculated as T r(w W T ) T r(ψ) (2) 5 Details ˆ Licensing terms: the package is licensed under FreeBSD open software license ˆ Citing pint: Please cite [6] This document was written using: > sessioninfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.6-bioc/r/lib/librblas.so LAPACK: /home/biocbuild/bbs-3.6-bioc/r/lib/librlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 5

[7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grdevices utils datasets methods base other attached packages: [1] pint_1.28.0 dmt_0.8.20 MASS_7.3-47 Matrix_1.2-11 mvtnorm_1.0-6 loaded via a namespace (and not attached): [1] compiler_3.4.2 tools_3.4.2 grid_3.4.2 lattice_0.20-35 Acknowledgements We would like to thank prof. Sakari Knuutila (University of Helsinki) for providing the example data set. References [1] C. Archambeau, N. Delannay, and M. Verleysen. Robust probabilistic projections. In W. Cohen and A. Moore, editors, Proceedings of the 23rd International conference on machine learning, volume 148, pages 33 40, Pittsburgh, Pennsylvania, 2006. ACM. [2] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005. [3] O.-P. Huovilainen. Screening of functional copy number changes with dependency models. Master s thesis, Aalto University School of Science and Technology, Department of Information and Computer Science, 2010. [4] A. Klami and S. Kaski. Probabilistic approach to detecting dependencies between data sets. Neurocomputing, 72(1-3):39 46, 2008. [5] L. Lahti et al. Dependency modeling toolkit. ICML workshop, June 2010. Implementation provided in the package dmt at CRAN. [6] L. Lahti, S. Myllykangas, S. Knuutila, and S. Kaski. Dependency detection with similarity constraints. In Proceedings MLSP 09 IEEE International Workshop on Machine Learning for Signal Processing XIX, pages 89 94, Piscataway, NJ, September 2-4 2009. IEEE Signal Processing Society. [7] S. Myllykangas, S. Junnila, A. Kokkola, R. Autio, I. Scheinin, T. Kiviluoto, M.-L. Karjalainen-Lindsberg, J. Hollmén, S. Knuutila, P. Puolakkainen, and O. Monni. Integrated gene copy number and expression microarray analysis of gastric cancer highlights potential target genes. International Journal of Cancer, 123(4):817 825, 2008. 6

[8] The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455:1061 1068, 2008. 7