Mayday and beyond: New Approaches for the Visualisation of Genomics and Transcriptomics Data

Kay Nieselt, Center for Bioinformatics Tübingen, University of Tübingen

Genomics
Genomics is the study of the genomes of organisms.
Today, large projects compare genomes within a species, such as the 1000 Genomes Project (human), the 1001 Genomes Project (A. thaliana), and many more.

Genome Comparison
Comparison is often based on alignment of whole genomes or parts of genomes.
Whole-genome alignments elucidate similarity and diversity on different scales.
Large-scale variations: genomic rearrangements (translocations, inversions).
[Figure: inversion, translocation, duplication]

GenomeRing: Visualizing genomic diversity
Application of the SuperGenome to the visualization of multiple aligned genomes.
The SuperGenome is a common coordinate system of all aligned genomes, independent of a prechosen reference genome.
Alignments are represented as blocks; color coding distinguishes the genomes; paths represent genomic architecture.
Outer ring = forward strand; inner ring = reverse strand.
Herbig, Jäger, Battke, Nieselt, 2012, Bioinformatics

GenomeRing: some applications
GenomeRing of 4 Campylobacter jejuni strains

GenomeRing: some applications
GenomeRing of 32 Staphylococcus aureus strains

GenomeRing: some applications
GenomeRing of 8 Yersinia pestis strains

From Genome to Transcriptomes
Observation: dissimilarity of genomes is not sufficient to explain differences in phenotypes.
Add expression data.
In particular: study all the genes of a cell or tissue at the DNA (genotype), mRNA (transcriptome) or protein (proteome) level.

The central dogma of molecular biology
Biological events are controlled by gene expression: the process by which information from a gene is used in the synthesis of a functional gene product.
When and to what extent is each gene in a given cell expressed?
mRNA levels are easier to measure than protein levels.
Regulation of gene expression can only be understood by studying the transcriptome.
http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

Key research questions
Catalogue all species of transcripts, e.g. RNAs of protein-coding as well as non-coding genes.
Compare abundance changes of genes between different conditions: detection of differentially expressed genes.
Identify genes expressed in the same process.
How are genes regulated (derive the regulatory network)?
Determine the transcriptional structure of genes: transcriptional start sites (TSS), 5' and 3' UTRs, splicing patterns (eukaryotes), fusion genes, antisense transcription.

Technologies
The most commonly employed method uses microarrays.
Microarrays measure mRNA concentration through hybridization to a probe immobilized on a carrier material ("chip").
Two-dimensional grid of features (mostly oligos), usually arranged on a glass slide.
A typical microarray contains 100,000-1,000,000 microscopic DNA spots.

Work flow for microarray expression data
Normalization: reduce the technical variation to a minimum and make the array experiments comparable.
Filtering: identify transcripts with very little signal variation and filter them out.
Analysis: expression levels, clustering, differential expression, pathway analysis.
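The filtering step can be illustrated with a minimal Python sketch. It assumes that expression values sit in a plain dictionary keyed by transcript name and that "very little signal variation" is judged by the population variance against a hand-picked threshold. The function name `filter_low_variance`, the threshold and the toy values are illustrative assumptions, not Mayday's implementation.

```python
from statistics import pvariance


def filter_low_variance(expression, min_variance):
    """Keep only transcripts whose variance across experiments exceeds min_variance."""
    return {
        transcript: values
        for transcript, values in expression.items()
        if pvariance(values) > min_variance
    }


# Toy normalized expression values (invented for illustration).
expression = {
    "geneA": [8.1, 8.0, 8.2, 8.1],   # nearly flat profile, will be filtered out
    "geneB": [5.0, 7.5, 9.0, 11.2],  # clearly varying profile, will be kept
}
kept = filter_low_variance(expression, min_variance=0.5)
print(sorted(kept))
```

In practice the threshold would be chosen from the data, for example from the distribution of variances over all transcripts.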

RNA-seq: Digital transcription profiling
mRNAs are first converted into a library of fragmented cDNA, and adaptors are added.
High-throughput sequencing yields short sequence reads.
With fast mapping algorithms, reads are aligned to the reference sequence.
Expression quantification yields a base-resolution digital count for each transcript.
Figure from: Wang et al. 2009, Nature Reviews Genetics

Work flow for RNA-seq data
Raw read processing, map reads, aggregate and quantify, normalize.
Analysis: expression levels, novel genes, differential expression, alternative splicing.
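As a sketch of the "normalize" step, the following Python snippet scales raw read counts to counts per million (CPM). CPM is only one common library-size normalization and is chosen here as an assumption; the slides do not state which method is used. The function name and the toy counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw read counts so that the sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no mapped reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Toy read counts for one sample (invented for illustration).
sample = {"gene1": 500, "gene2": 1500, "gene3": 0}
cpm = counts_per_million(sample)
print(cpm)
```

After this scaling, samples with different sequencing depths can be compared on the same scale.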

Expression profiling
Measures expression (activity) of genes under certain conditions.
Expression profiling: when, where and to what extent is a gene expressed?
[Figure: expression profiles of gene 1 and gene 2]

Challenges for visualisation
Transcriptomics data (like other omics data) is high-dimensional (there can be more than 50,000 transcripts in 5-100 experiments), is noisy, has linear and non-linear correlations, and contains patterns.
Visualizations should help to detect important signals (active genes) and be guided by Shneiderman's visual information seeking mantra: present an overview of the entire data, then zoom and filter, finally details on demand.

Mayday
Mayday is short for Microarray Data Analysis.
Workbench for visualization, analysis and storage of expression data (originally derived with microarrays; the latest version can analyze any type of abundance data).
Written in the Java programming language: runs on all platforms supporting Java runtime environment 1.7, stand-alone or as Java Web Start version.
Open source software (GNU General Public License).
Ongoing project with continuously added functionality.
Battke, Symons, Nieselt, 2010, BMC Bioinformatics

Mayday's Features
Data mining methods: partitioning clustering (k-means, QT, ...), hierarchical clustering (NJ, UPGMA, ...), multi-class gene mining, machine learning (via WEKA), Gene Set Enrichment Analysis (GSEA).
Statistical methods: Student's t-test, WAD, SAM, Rank Product, Information Gain, Gini Index, Quartet Mining, ...
Processing pipeline: repetitive processing steps can be automated; processing modules can easily be combined into pipelines. [Image of a processing pipeline]
Dynamic filtering: filters are built from modules chained together; module chains can be linked with AND, OR, NOT.
Visualisations: scatterplots, profile plots, heatmap, enhanced heatmap, dendrogram, genome browser.

Visualisations in Mayday
Online data manipulation: change the data only for one visualisation (z-scoring).
Interactivity: zooming, selections, ...
Connectivity: different visualisations of the same data are synchronized.
Enhancements: color for additional information; enhance plots using meta information.
Powerful framework: new plots are easy to implement with small amounts of code.
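Z-scoring, mentioned above as an on-the-fly data manipulation, can be sketched in a few lines of Python. The sketch assumes that each expression profile is a list of numbers and uses the population standard deviation; how Mayday handles constant profiles is not stated in the slides, so returning zeros is an assumption made here.

```python
from statistics import mean, pstdev


def z_score(profile):
    """Return the profile shifted to mean 0 and scaled to standard deviation 1."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        # Constant profile: no variation to scale (assumed convention).
        return [0.0] * len(profile)
    return [(value - mu) / sigma for value in profile]


# Toy expression profile across three experiments (invented for illustration).
z = z_score([2.0, 4.0, 6.0])
print(z)
```

Because only the copy shown in the plot is transformed, profiles with very different absolute levels become comparable in shape while the underlying data stays unchanged.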

Example Application: Systems Biology for Microorganisms
Time series transcriptome, proteome and metabolome data from submerged batch fermentations of wildtype and several mutant strains of Streptomyces coelicolor grown under different starvation conditions.
Transcriptomes: assessed with a custom-designed Affymetrix GeneChip for Streptomyces coelicolor (protein-coding and non-coding genes).

Transcriptome Time Series wildtype
32 samples taken along the growth curve: 20-44h hourly; 46-60h 2-hourly.
Array data produced for all 32 samples from one fermenter.
[Figure: growth curve, CDW [g/l] versus time after inoculation]
Goal: find differentially expressed genes and compare to proteome data.
Nieselt et al., BMC Genomics 2010; Battke et al., Software Tools and Algorithms for Biological Systems 2010; Thomas et al., Mol. Cell. Prot. 2012

Bioinformatics pipeline
All powered by Mayday:
1. Normalization of Affymetrix CEL files: using Mayday SeaSight RMA (background correction, quantile normalization, summarization), check quality
2. Differential expression: filtering by regularized variance
3. Unsupervised clustering of sample expression profiles (Neighbor Joining, Euclidean): time tree
4. Unsupervised clustering of transcript expression profiles (QT, Pearson correlation)
All assessed by visualizations in Mayday
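Quantile normalization, one of the RMA steps named in step 1, can be sketched as follows. This is a textbook version in plain Python, assuming one list of intensities per array with equal lengths and breaking ties by position; it is not the SeaSight code, and the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same value distribution (the mean of the sorted arrays)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across all arrays.
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by increasing value
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two toy arrays with three probes each (invented for illustration).
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(normalized)
```

After normalization every array contains exactly the same set of values, only in a different probe order, which is what makes the arrays comparable.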

1. Normalization: use SeaSight
Import raw microarray data as well as mapped sequencing reads.
Available transformations: background correction; RNA-seq: computing expression values from mapped reads; normalization; summarization; linking microarray probes to genomic coordinates.
Combine transformations in a user-friendly graphical interface to quickly construct powerful normalization pipelines.

SeaSight applied to CEL files

Check normalisation
Boxplot, histogram, qq-plot

2. Compute differentially expressed genes


3. Clustering: finding co-expressed transcripts
Paradigm of expression profiling: similar profiles indicate co-expression, which suggests co-regulation.
Clustering is the process of assigning transcripts with similar profiles to a common cluster.
Many different types of clustering algorithms:
hierarchical: one transcript can be a member of more than one cluster
partitioning: each transcript is a member of exactly one cluster
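What "similar profiles" means depends on the distance measure. A minimal sketch of a Pearson correlation distance, one common choice for expression profiles, is given below; the definition 1 - r and the function name are assumptions made for illustration.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of two profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [value - mean_x for value in x]
    dy = [value - mean_y for value in y]
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant profile")
    r = sum(a * b for a, b in zip(dx, dy)) / norm
    return 1.0 - r


# Toy profiles (invented): same shape at a different scale, and a mirrored shape.
same_shape = pearson_distance([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
opposite = pearson_distance([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
print(same_shape, opposite)
```

Profiles with the same shape get a distance close to 0 regardless of their absolute level, while mirrored profiles get a distance close to 2.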

Clustering of time points: Trees
[Figure: time tree; label: phosphate depletion]

Visualisations: profile plots


Visualisations: profile plot
Linked visualisations

Visualisations: profile plot
Colored by gene annotation

Visualisations: profile plot
Additional column attached to data (concept developed in SpRay*)

SpRay: Visual Analytics of expression data
Dietzsch, Heinrich, Nieselt, Bartz, IEEE Symp. on VAST 2009.

Enhanced Heatmap
Add features to the traditional heatmap:
Additional columns representing meta information, derived from annotation data or statistical computations (e.g. p-value of a test).
Sorting of rows also according to additional columns.
Gehlenborg et al. 2005, Inform Vis; Battke et al. 2010, BMC Bioinformatics

TIALA - Time Series Alignment Analysis
Powerful visual analytics approach for large-scale expression data.
First and only tool for comparing two or more time series of expression data both analytically and visually.
Jäger, Battke, Nieselt, IEEE Symposium on Biol. Data Visualization 2011.


Mayday as an -Omics tool
Mayday is not limited to expression data: any matrix of numerical data, or several of them, can be analyzed.
Microarray and RNA-seq data, proteome data, metabolome data, qPCR results, eQTL data, ...

Integrative Omics
Here: comparing the proteome with the transcriptome across 8 time points of S. coelicolor wildtype.

Integrative Omics: Combining Genomes and Expression
Expression in the genomic context: GenomeBrowser in Mayday

Genome Browser
Continuous zoom from whole genome to single-base resolution

Integrative Omics: Combining Genomes and Expression
GenomeRing is integrated into Mayday and allows linking genomes with, for example, differentially expressed genes.

Genomics and Transcriptomics
GenomeRing of 3 Helicobacter pylori strains with an added expression track that shows up- and down-regulated genes in genome 26695.
Linked with the genome browser.

Going even further
Combining and integrating genomics (SNP genotypes), transcriptomics (expression phenotypes) and clinical phenotypes.

Reveal - Visual eQTL Analytics

Genetic variation and disease association
(Complex) diseases can be better understood by studying genetic variation across the whole genome.
Genome-wide association studies (GWAS) examine genetic variants (mainly SNPs) in connection with traits, e.g. diseases.

iHAT: interactive Hierarchical Aggregation Table
[Figure: aggregated view of SNPs, subjects and meta-information; legend: reference genotype, SNP in one allele, SNP in both alleles; blue-white-red color gradient for quantitative meta-information]
Offers aggregation techniques to reveal hidden structure in the data.

eQTL - expression Quantitative Trait Locus
GWAS cannot specify the genes causal for the phenotype.
Biological events are controlled by gene expression: the process by which information from a gene is used in the synthesis of a functional gene product.
eQTL are genomic loci regulating gene expression.
Goal: connect genotypes with phenotypes such that causal associations are identified.

Key challenges of eQTL experiments
Detect those significant genomic variations that affect expression levels.
Identify the underlying mechanisms, typically large networks.
Very large, heterogeneous and complex data: a typical complete data set would comprise O(10^6) loci, O(10^4) genes in O(10^2) tissues for O(10^3) subjects.
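To get a feeling for these orders of magnitude, the short Python calculation below treats each O(...) figure as an exact power of ten. The derived quantities (matrix sizes and numbers of tests) are back-of-the-envelope illustrations that assume one test per locus, gene and tissue; they are not numbers from the talk.

```python
# Orders of magnitude from the slide, taken as exact powers of ten (assumption).
loci = 10**6
genes = 10**4
tissues = 10**2
subjects = 10**3

genotype_entries = loci * subjects               # genotype matrix: 10^9 entries
expression_entries = genes * tissues * subjects  # expression values: 10^9 entries
single_locus_tests = loci * genes * tissues      # one test per locus, gene, tissue: 10^12
two_locus_pairs = loci * (loci - 1) // 2         # unordered SNP pairs: about 5 * 10^11

print(genotype_entries, expression_entries, single_locus_tests, two_locus_pairs)
```

Even before two-locus effects are considered, the number of possible associations is far beyond what can be inspected without aggregation.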

Levels of complexity in eQTL studies
[Figure: three levels of increasing complexity (Level 1, Level 2, Level 3) relating clinical phenotype, genotype (SNPs) and gene expression; analysis labels: GWAS, gene expression analysis, association / linkage analysis]

Reveal (part of Mayday)
With Reveal we address these challenges and have introduced several visualisations. One is the association graph: a node-link graph that visualizes the relationship of SNPs and gene expression (phenotype) and allows visualisation of trans as well as cis effects (for level 2).
Jäger, Battke, Nieselt, Bioinformatics 2012.

Association Graph
Visualize the association of genotype and expression:
start with pairs of SNPs that jointly affect the expression of a gene (result of a statistical analysis)
for each SNP identify the closest gene
each gene is represented by a node
give a node a color if at least one SNP within a two-locus pair lies in that gene
an edge is drawn if there exists a two-locus SNP pair
an edge gets the color of the node of the gene that is affected by that SNP pair
common edges are aggregated and weighted according to the number of SNP pairs
Example: 1280 SNP pairs from CDH22 and CDH7 influence the expression of CDH10.
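The construction steps can be sketched in Python as follows. The sketch assumes that the statistical analysis delivers triples (SNP, SNP, affected gene) and that a lookup table from SNP to closest gene already exists. The SNP identifiers and counts are invented, and the function is a hypothetical illustration rather than Reveal's code.

```python
from collections import Counter


def build_association_graph(snp_pairs, closest_gene):
    """Aggregate two-locus SNP pairs into weighted gene-gene edges.

    Each edge is keyed by the two genes closest to the SNPs plus the gene whose
    expression is affected (which would determine the edge color). The edge
    weight is the number of SNP pairs supporting it.
    """
    edges = Counter()
    for snp_a, snp_b, affected_gene in snp_pairs:
        endpoints = tuple(sorted((closest_gene[snp_a], closest_gene[snp_b])))
        edges[(endpoints, affected_gene)] += 1
    return edges


# Toy input (invented): four SNPs near two genes, three significant SNP pairs.
closest_gene = {"snp1": "CDH22", "snp2": "CDH22", "snp3": "CDH7", "snp4": "CDH7"}
snp_pairs = [
    ("snp1", "snp3", "CDH10"),
    ("snp2", "snp4", "CDH10"),
    ("snp4", "snp1", "CDH10"),
]
graph = build_association_graph(snp_pairs, closest_gene)
print(graph)
```

All three SNP pairs collapse into a single CDH22-CDH7 edge of weight 3, which is the kind of aggregation that keeps the graph readable when thousands of pairs are involved.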

Association Graph
Example: 15 genes, full graph based on 62,136 SNP pairs

Association Graph - edge weight filtering
Edges filtered: only edges with weights larger than 50
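Edge weight filtering amounts to a simple threshold on the aggregated edge weights. A minimal Python sketch, assuming edges are stored in a dictionary from gene pair to weight; apart from the CDH22-CDH7 example with 1280 SNP pairs, the gene names and weights are placeholders.

```python
def filter_edges(edge_weights, threshold):
    """Keep only edges whose weight is strictly larger than the threshold."""
    return {edge: weight for edge, weight in edge_weights.items() if weight > threshold}


# Edge weights: number of SNP pairs per gene pair (placeholder values except the first).
edge_weights = {
    ("CDH22", "CDH7"): 1280,
    ("geneA", "geneB"): 12,
    ("geneB", "geneC"): 50,
}
strong_edges = filter_edges(edge_weights, threshold=50)
print(strong_edges)
```

Note that an edge with weight exactly 50 is dropped, matching the "larger than 50" wording.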

Expression Heatmap
Rank genes by significant differential expression.
[Figure: heatmap of affected and unaffected patients, aggregated]
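Ranking genes by differential expression between affected and unaffected patients can be sketched with a per-gene test statistic. The slide does not name the statistic, so Welch's t-statistic is used here purely as an assumption; the function names and all expression values are invented.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of expression values."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    if standard_error == 0:
        raise ValueError("both groups are constant")
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(affected, unaffected):
    """Sort genes by decreasing absolute t-statistic."""
    scores = {gene: abs(welch_t(affected[gene], unaffected[gene])) for gene in affected}
    return sorted(scores, key=scores.get, reverse=True)


# Toy expression values per gene and patient group (invented for illustration).
affected = {"geneX": [9.0, 9.5, 10.0], "geneY": [7.0, 7.2, 6.8]}
unaffected = {"geneX": [5.0, 5.5, 6.0], "geneY": [7.1, 6.9, 7.0]}
ranking = rank_genes(affected, unaffected)
print(ranking)
```

A real analysis would also compute p-values and correct them for multiple testing before calling a gene significant.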

Integrating SNP - expression

Outlook
Challenges of future applications:
Scalability: data becomes bigger and bigger, with 1000s of genomes and other omics data sets to analyze and visualize; powerful aggregation and innovative techniques are needed.
Full visual analytics methods.
Combine all levels of complexity of eQTL data: inPHAP* (also includes phased haplotype data).
*Jäger, Peltzer, Nieselt, BMC Bioinformatics 2014.

Acknowledgements
My current and former doctoral students: Florian Battke, Alexander Herbig, Günter Jäger, Aydin Polatkan, Stephan Symons
Collaboration partners: Michael Bonin (MFT Services, now IMGM Munich), Wolfgang Wohlleben (SysMO S. coelicolor, Univ. Tü), Karsten Borgwardt (Reveal, ETH Zürich)

Thank you for your attention! Questions?
Download Mayday at: http://it.inf.uni-tuebingen.de/

10-11 July 2015
Find out more at: http://biovis.net/
Call for participation: Papers Feb 15; Posters May 29; Data contest May 1; Design contest May 1