Mayday and beyond: New Approaches for the Visualisation of Genomics and Transcriptomics Data


Kay Nieselt, Center for Bioinformatics Tübingen, University of Tübingen

Genomics: Genomics is the study of the genomes of organisms. Today, large projects compare genomes within a species, such as the 1000 Genomes Project (human), the 1001 Genomes Project (A. thaliana), and many more.

Genome Comparison: Comparison is often based on the alignment of whole genomes or parts of genomes. Whole-genome alignments elucidate similarity and diversity on different scales. Large-scale variations: genomic rearrangements (translocations, inversions). [Figure: inversion, translocation, duplication.]

Genome Comparison (continued): Small-scale variations: mutations, insertions, deletions. Example alignment, showing a deletion and a mutation:
ACGGTGCAGTTACCA
AC----CAGTCACCA
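The small-scale variations in the example alignment can be read off column by column. The following is a minimal sketch, not part of the original slides; the function name `classify_columns` and the event labels are illustrative assumptions, and the first sequence is treated as the reference.

```python
def classify_columns(ref: str, query: str):
    """Label every differing column of a pairwise alignment.

    Both strings must already be aligned (equal length, '-' marks a gap).
    Returns a list of (position, kind, ref_base, query_base) tuples.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    events = []
    for pos, (r, q) in enumerate(zip(ref, query)):
        if r == q:
            continue  # identical column, nothing to report
        if q == "-":
            kind = "deletion"      # base present in ref, missing in query
        elif r == "-":
            kind = "insertion"     # base present in query only
        else:
            kind = "substitution"  # point mutation
        events.append((pos, kind, r, q))
    return events


# The two sequences shown on the slide.
ref = "ACGGTGCAGTTACCA"
qry = "AC----CAGTCACCA"
for event in classify_columns(ref, qry):
    print(event)
```

On the slide's sequences this reports four deleted columns and one substitution (T to C).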

Genome Comparison (continued): Small-scale variations also include gene content, insertions and deletions. Genome variations and their analyses are often subject to visualization.

GenomeRing: Visualizing genomic diversity. Application of the SuperGenome to the visualization of multiple aligned genomes. The SuperGenome is a common coordinate system of all aligned genomes, independent of a prechosen reference genome. Alignments are represented as blocks, genomes are color-coded, and paths represent the genomic architecture. Outer ring = forward strand; inner ring = reverse strand. (Herbig, Jäger, Battke, Nieselt, 2012, Bioinformatics)

GenomeRing, some applications: GenomeRing of 4 Campylobacter jejuni strains.

GenomeRing, some applications: GenomeRing of 32 Staphylococcus aureus strains.

GenomeRing, some applications: GenomeRing of 8 Yersinia pestis strains.

From Genome to Transcriptomes: Observation: the dissimilarity of genomes is not sufficient to explain differences in phenotypes, so expression data is added. In particular: study all the genes of a cell or tissue at the DNA (genotype) and mRNA (transcriptome) levels, or at the protein (proteome) level.

The central dogma of molecular biology: Biological events are controlled by gene expression, the process by which information from a gene is used in the synthesis of a functional gene product. When and to what extent is each gene in a given cell expressed? mRNA levels are easier to measure than protein levels. Regulation of gene expression can only be understood by studying the transcriptome. (http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology)

Key research questions: Catalogue all species of transcripts, e.g. RNAs of protein-coding as well as non-coding genes. Compare abundance changes of genes between different conditions (detection of differentially expressed genes). Identify genes expressed in the same process. How are genes regulated (derive the regulatory network)? Determine the transcriptional structure of genes: transcriptional start sites (TSS), 5′ and 3′ UTRs, splicing patterns (eukaryotes), fusion genes, antisense transcription.

Technologies: The most commonly employed method uses microarrays. Microarrays measure mRNA concentration through hybridization to a probe immobilized on a carrier material ("chip"): a two-dimensional grid of features (mostly oligos), usually arranged on a glass slide. A typical microarray contains 100,000-1,000,000 microscopic DNA spots.

Workflow for microarray expression data: (1) Normalization: reduce the technical variation to a minimum and make the array experiments comparable. (2) Filtering: identify transcripts with very little signal variation and filter them out. (3) Analysis: expression levels, clustering, differential expression, pathway analysis.

RNA-seq: Digital transcription profiling. mRNAs are first converted into a library of fragmented cDNA; adaptors are added; high-throughput sequencing yields short sequence reads; with fast mapping algorithms, reads are aligned to the reference sequence; expression quantification yields a base-resolution digital count for each transcript. (Figure from: Wang et al. 2009, Nature Reviews Genetics)

Workflow for RNA-seq data: raw read processing; map reads; aggregate and quantify; normalize; analyse (expression levels, novel genes, differential expression, alternative splicing).
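The slides do not say which normalization is applied to the aggregated read counts. As one illustration of the "normalize" step, the sketch below scales raw counts to counts per million (CPM); the choice of CPM, the function name and the example gene identifiers are assumptions, not something taken from the talk.

```python
def counts_per_million(counts):
    """Scale raw per-gene read counts of one sample to counts per million.

    counts: dict mapping gene id -> number of reads assigned to that gene.
    CPM is used here only as an example of a library-size normalization.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no mapped reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


# Hypothetical aggregated counts for one sample.
raw = {"geneA": 1, "geneB": 3}
print(counts_per_million(raw))
```

Dividing by the library size makes samples sequenced to different depths comparable, which is the purpose the workflow assigns to the normalization step.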

Expression profiling: measures the expression (activity) of genes under certain conditions, i.e. when, where and to what extent a gene is expressed. [Figure: gene 1, gene 2.]

Challenges for visualisation: Transcriptomics data (like other omics data) is high-dimensional (often more than 50,000 transcripts in 5-100 experiments), is noisy, has linear and non-linear correlations, and contains patterns. Visualizations should help to detect important signals (active genes) and be guided by Shneiderman's visual information seeking mantra: present an overview of the entire data, then zoom and filter, finally details on demand.

Mayday: short for Microarray Data Analysis. A workbench for visualization, analysis and storage of expression data (originally derived with microarrays; the latest version can analyze any type of abundance data). Written in the Java programming language: runs on all platforms supporting Java runtime environment 1.7, stand-alone or as a Java Web Start version. Open source software (GNU General Public License). Ongoing project with continuously added functionality. (Battke, Symons, Nieselt, 2010, BMC Bioinformatics)

Mayday's Features
Data mining and statistical methods: partitioning clustering (k-Means, QT, ...); hierarchical clustering (NJ, UPGMA, ...); multi-class gene mining; machine learning (via WEKA); Gene Set Enrichment Analysis (GSEA); Student's t-test, WAD, SAM, Rank Product; Information Gain, Gini Index, Quartet Mining, ...
Processing pipeline: repetitive processing steps can be automated; processing modules can easily be combined into pipelines.
Dynamic filtering: filters are built from modules chained together; module chains can be linked with AND, OR, NOT.
Visualisations: scatterplots, profile plots, heatmap, enhanced heatmap, dendrogram, genome browser.

Visualisations in Mayday. Online data manipulation: change the data for only one visualisation (e.g. z-scoring). Interactivity: zooming, selections, ... Connectivity: different visualisations of the same data are synchronized. Enhancements: color for additional information; enhance plots using meta information. Powerful framework: new plots are easy to implement with small amounts of code.
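The z-scoring mentioned as an on-the-fly manipulation rescales each expression profile to mean 0 and unit standard deviation, so that profiles with different absolute levels can be compared by shape. A minimal sketch follows; it uses the population standard deviation, which is an assumption (the slides do not specify the variant), and the handling of constant profiles is likewise an illustrative choice.

```python
from statistics import mean, pstdev


def z_score(profile):
    """Return the z-scored copy of one expression profile.

    The input list is left untouched, mirroring the idea that the
    transformation affects only the view, not the stored data.
    """
    mu = mean(profile)
    sd = pstdev(profile)  # population SD; an assumption of this sketch
    if sd == 0:
        # Constant profile: no variation to scale, return all zeros.
        return [0.0] * len(profile)
    return [(x - mu) / sd for x in profile]


print(z_score([1.0, 2.0, 3.0]))
```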

Example application: Systems Biology for Microorganisms. Time-series transcriptome, proteome and metabolome data from submerged batch fermentations of the wildtype and several mutant strains of Streptomyces coelicolor grown under different starvation conditions. Transcriptomes were assessed with a custom-designed Affymetrix GeneChip for Streptomyces coelicolor (protein-coding and non-coding genes).

Transcriptome Time Series, wildtype: 32 samples taken along the growth curve (20-44 h hourly; 46-60 h 2-hourly). Array data produced for all 32 samples from one fermenter. [Figure: growth curve, CDW [g/l] against time after inoculation.] Goal: find differentially expressed genes and compare to proteome data. (Nieselt et al., BMC Genomics 2010; Battke et al., Software Tools and Algorithms for Biological Systems 2010; Thomas et al., Mol. Cell. Prot. 2012)

Bioinformatics pipeline, all powered by Mayday:
1. Normalization of Affymetrix CEL files using Mayday SeaSight: RMA (background correction, quantile normalization, summarization); check quality.
2. Differential expression: filtering by regularized variance.
3. Unsupervised clustering of sample expression profiles (Neighbor Joining, Euclidean distance), giving a time tree.
4. Unsupervised clustering of transcript expression profiles (QT, Pearson correlation).
All steps are assessed by visualizations in Mayday.
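Of the three RMA stages named in step 1, quantile normalization is the easiest to illustrate in isolation: every array is forced onto the same intensity distribution by replacing each value with the mean of all values sharing its rank. The sketch below shows only that stage on plain Python lists. It is not Mayday's or SeaSight's implementation, the function name is hypothetical, and ties are resolved by input order, which is a simplification.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities per array).

    All arrays must have the same length. Returns new lists in which every
    array has exactly the same set of values (the rank-wise means).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Mean intensity at each rank across all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays)
                  for i in range(n)]

    normalized = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After the transformation both example arrays contain the values 1.5, 3.5 and 5.5, each placed at the position of the original value with the same rank.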

1. Normalization: use SeaSight. Import raw microarray data as well as mapped sequencing reads. Available transformations: background correction; for RNA-seq, computing expression values from mapped reads; normalization; summarization; linking microarray probes to genomic coordinates. Transformations are combined in a user-friendly graphical interface to quickly construct powerful normalization pipelines.

SeaSight applied to CEL files

Check normalisation: boxplot, histogram, Q-Q plot

2. Compute differentially expressed genes


3. Clustering: finding co-expressed transcripts. Paradigm of expression profiling: similar profiles suggest co-expression, which in turn suggests co-regulation. Clustering is the process of assigning transcripts with similar profiles to a common cluster. There are many different types of clustering algorithm: hierarchical (one transcript can be a member of more than one cluster) and partitioning (each transcript is a member of exactly one cluster).
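Whether two transcripts have "similar profiles" depends on the distance measure. The pipeline described earlier clusters transcripts using Pearson correlation; a common way to turn the correlation r into a distance is 1 - r, which is the assumption made in this sketch (the slides do not give the exact formula).

```python
from math import sqrt


def pearson_distance(x, y):
    """Distance 1 - r between two expression profiles of equal length.

    0 means perfectly correlated profiles, 2 perfectly anti-correlated ones.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1 - cov / sqrt(vx * vy)


# Same shape at a different scale: distance close to 0.
print(pearson_distance([1, 2, 3], [2, 4, 6]))
# Opposite trend: distance close to 2.
print(pearson_distance([1, 2, 3], [3, 2, 1]))
```

Because the correlation ignores scale and offset, two transcripts that rise and fall together end up close even if one is expressed at a much higher level.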

Clustering of time points: trees. [Figure annotation: phosphate depletion.]

Visualisations: profile plots

Visualisations: profile plot

Visualisations: profile plot, linked visualisations

Visualisations: profile plot, colored by gene annotation

Visualisations: profile plot, additional column attached to data (concept developed in SpRay*)

*SpRay: Visual Analytics of expression data. (Dietzsch, Heinrich, Nieselt, Bartz, IEEE Symp. on VAST 2009)

Enhanced Heatmap: adds features to the traditional heatmap. Additional columns represent meta information, derived from annotation data or statistical computations (e.g. the p-value of a test). Rows can also be sorted according to the additional columns. (Gehlenborg et al. 2005, Inform Vis; Battke et al. 2010, BMC Bioinformatics)

TIALA (Time Series Alignment Analysis): a powerful visual analytics approach for large-scale expression data. First and only tool for comparing two or more time series of expression data both analytically and visually. (Jäger, Battke, Nieselt, IEEE Symposium on Biol. Data Visualization 2011)

TIALA (Time Series Alignment Analysis)

Mayday as an -omics tool: Mayday is not limited to expression data. Any matrix of numerical data, or several such matrices, can be analyzed: microarray and RNA-seq data, proteome data, metabolome data, qPCR results, eQTL data, ...

Integrative Omics: here, comparing the proteome with the transcriptome across 8 time points of S. coelicolor wildtype.

Integrative Omics, combining genomes and expression: expression in the genomic context with the GenomeBrowser in Mayday.

Genome Browser: continuous zoom from whole genome to single-base resolution.

Integrative Omics, combining genomes and expression: GenomeRing is integrated into Mayday and allows genomes to be linked with, for example, differentially expressed genes.

Genomics and Transcriptomics: GenomeRing of 3 Helicobacter pylori strains with an added expression track that shows up- and down-regulated genes in genome 26695, linked with the genome browser.

Going even further: combining and integrating genomics (SNP genotypes), transcriptomics (expression phenotypes) and clinical phenotypes.

Reveal: Visual eQTL Analytics

Genetic variation and disease association: (Complex) diseases can be better understood by studying genetic variation across the whole genome. Genome-wide association studies (GWAS) examine genetic variants (mainly SNPs) in connection with traits, e.g. diseases.

iHAT: interactive Hierarchical Aggregation Table. Offers aggregation techniques to reveal hidden structure in the data. [Figure: table of subjects against SNPs and meta-information, with an aggregated view; color legend: reference genotype, SNP in one allele, SNP in both alleles; blue-white-red color gradient for quantitative meta-information.]

eQTL (expression Quantitative Trait Locus): GWAS cannot specify the genes causal for the phenotype. Biological events are controlled by gene expression, the process by which information from a gene is used in the synthesis of a functional gene product. eQTLs are genomic loci regulating gene expression. Goal: connect those genotypes with phenotypes such that causal associations are identified.

Key challenges of eQTL experiments: detect those significant genomic variations that affect expression levels; identify the underlying mechanisms, typically large networks; handle very large, heterogeneous and complex data: a typical complete data set would comprise O(10^6) loci and O(10^4) genes in O(10^2) tissues for O(10^3) subjects.
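To get a feeling for these orders of magnitude, the short calculation below multiplies them out. Only the four sizes come from the slide; which quantities are derived from them (matrix sizes, single-locus tests, two-locus pairs) is an illustrative assumption.

```python
# Orders of magnitude quoted on the slide.
LOCI = 10**6
GENES = 10**4
TISSUES = 10**2
SUBJECTS = 10**3

# Size of the genotype matrix (one value per locus and subject).
genotype_values = LOCI * SUBJECTS

# Size of the expression data (one value per gene, tissue and subject).
expression_values = GENES * TISSUES * SUBJECTS

# Exhaustive single-locus association tests: every locus against every gene.
tests_per_tissue = LOCI * GENES
tests_all_tissues = tests_per_tissue * TISSUES

# Number of distinct two-locus SNP pairs, before pairing them with genes.
two_locus_pairs = LOCI * (LOCI - 1) // 2

print(f"genotype values:    {genotype_values:.0e}")
print(f"expression values:  {expression_values:.0e}")
print(f"tests, one tissue:  {tests_per_tissue:.0e}")
print(f"tests, all tissues: {tests_all_tissues:.0e}")
print(f"two-locus pairs:    {two_locus_pairs:.1e}")
```

Even the single-locus case reaches 10^12 locus-gene-tissue combinations, which is why aggregation is a recurring theme in the visualisations that follow.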

Levels of complexity in eQTL studies. [Figure: three levels (Level 1, Level 2, Level 3) of increasing complexity, combining clinical phenotype, genotype (SNPs) and gene expression; analyses named in the figure: GWAS, gene expression analysis, association / linkage analysis.]

Reveal (part of Mayday): with Reveal we address these challenges and have introduced several visualisations. One is the association graph, a node-link graph that visualizes the relationship between SNPs and gene expression (phenotype) and allows visualisation of trans as well as cis effects (for level 2). (Jäger, Battke, Nieselt, Bioinformatics 2012)

Association Graph: visualize the association of genotype and expression. Start with pairs of SNPs that jointly affect the expression of a gene (the result of a statistical analysis). For each SNP, identify the closest gene. Each gene is represented by a node; a node is given a color if at least one SNP within a two-locus pair lies in that gene. An edge is drawn if there exists a two-locus SNP pair; the edge gets the color of the node of the gene that is affected by that SNP pair. Common edges are aggregated and weighted according to the number of SNP pairs. Example: 1280 SNP pairs from CDH22 and CDH7 influence the expression of CDH10.
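The construction rules above amount to grouping SNP pairs by the genes they map to and counting them. The following sketch shows that aggregation under stated assumptions: the input format (triples of two SNP ids and the affected gene), the `closest_gene` lookup and all SNP ids are hypothetical, and the result is a plain counter rather than Reveal's actual data structure.

```python
from collections import Counter


def build_association_graph(snp_pairs, closest_gene):
    """Aggregate two-locus SNP pairs into weighted gene-gene edges.

    snp_pairs:    iterable of (snp_a, snp_b, target_gene) triples, where
                  target_gene is the gene whose expression the pair affects.
    closest_gene: dict mapping each SNP id to its closest gene (the node).

    Returns a Counter keyed by (gene_1, gene_2, target_gene); the value is
    the edge weight, i.e. the number of SNP pairs behind that edge. The
    target gene determines the edge color in the drawn graph.
    """
    edges = Counter()
    for snp_a, snp_b, target_gene in snp_pairs:
        # Sort the endpoints so that (A, B) and (B, A) count as one edge.
        gene_1, gene_2 = sorted((closest_gene[snp_a], closest_gene[snp_b]))
        edges[(gene_1, gene_2, target_gene)] += 1
    return edges


# Hypothetical SNP ids; gene names follow the example on the slide.
closest_gene = {"snp1": "CDH22", "snp2": "CDH22", "snp3": "CDH7"}
snp_pairs = [("snp1", "snp3", "CDH10"), ("snp3", "snp2", "CDH10")]
graph = build_association_graph(snp_pairs, closest_gene)
print(graph)
```

Both pairs collapse into a single CDH22-CDH7 edge of weight 2 that would be drawn in the color of CDH10.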

Association Graph, example: 15 genes; full graph based on 62,136 SNP pairs.

Association Graph, edge weight filtering: edges are filtered so that only edges with weights larger than 50 remain.
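Edge-weight filtering is a simple threshold on the aggregated counts. A minimal sketch, assuming edges are stored as a mapping from an edge key to its weight; apart from the CDH22/CDH7/CDH10 edge with 1280 SNP pairs, the example edges and weights are made up for illustration.

```python
def filter_edges(edge_weights, min_weight=50):
    """Keep only edges whose weight is strictly larger than min_weight."""
    return {edge: w for edge, w in edge_weights.items() if w > min_weight}


edge_weights = {
    ("CDH22", "CDH7", "CDH10"): 1280,   # weight quoted on the slide
    ("GENE_A", "GENE_B", "GENE_C"): 50,  # hypothetical, exactly at threshold
    ("GENE_A", "GENE_D", "GENE_C"): 3,   # hypothetical, weak edge
}
print(filter_edges(edge_weights))
```

Because the comparison is strict, an edge with weight exactly 50 is removed, matching "larger than 50".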

Expression Heatmap: rank genes by significant differential expression. [Figure labels: affected patients, unaffected patients, aggregate.]

Integrating SNPs and expression

Outlook: challenges of future applications. Scalability: data becomes bigger and bigger, with 1000s of genomes and other omics data sets to analyze and visualize; powerful aggregation and innovative techniques are needed. Full visual analytics methods. Combining all levels of complexity of eQTL data: inPHAP* (also includes phased haplotype data). *Jäger, Peltzer, Nieselt, BMC Bioinformatics 2014.

Acknowledgements
My current and former doctoral students: Florian Battke, Alexander Herbig, Günter Jäger, Aydin Polatkan, Stephan Symons.
Collaboration partners: Michael Bonin (MFT Services, now IMGM Munich), Wolfgang Wohlleben (SysMO S. coelicolor, Univ. Tü), Karsten Borgwardt (Reveal, ETH Zürich).

Thank you for your attention! Questions? Download Mayday at: http://it.inf.uni-tuebingen.de/

10-11 July 2015. Find out more at: http://biovis.net/ Call for participation: papers Feb 15; posters May 29; data contest May 1; design contest May 1.