Bioinformatics. Centro de Investigación Príncipe Felipe (CIPF) and Functional Genomics Node (INB)


1 Bioinformatics. Department of Bioinformatics and Genomics (BIG), Centro de Investigación Príncipe Felipe (CIPF), and Functional Genomics Node (INB), Valencia, Spain.

2 The pre-genomics paradigm. From genotype to phenotype. Genes in the DNA... code for proteins... whose structure accounts for function... plus the environment... produces the final phenotype. Example sequence: >protein kinase acctgttgatggcgacagggactgtatgctgatct atgctgatgcatgcatgctgactactgatgtgggg gctattgacttgatgtctatc...

3 From genotype to phenotype (in the functional post-genomics scenario). Next Generation Sequencing: 10^9 bp per round. Genes in the DNA... which can be different because of the variability (15 million SNPs)... when expressed in the proper moment and place (a typical tissue is expressing between 5000 and [...] genes)... code for proteins... that undergo post-translational modifications, somatic recombination... ([...]K-500K proteins) whose structures account for function... in cooperation with other proteins, conforming complex interaction networks (each protein has an average of 8 interactions)... whose final effect configures the phenotype. Example sequence: >protein kinase acctgttgatggcgacagggactgtatgctgatct atgctgatgcatgcatgctgactactgatgtgggg gctattgacttgatgtctatc...

4 The future ahead of us

5 GEPAS and Babelomics. Input: Affymetrix, Agilent and two-colour arrays (raw data). Normalization: Preprocessor. Clustering: hierarchical, SOM, SOTA, K-means, CAAT. Differential expression: T-Rex (two classes, multi classes, correlation, survival). Class prediction: Prophet (KNN, DLDA, SVM, random forest). Arrays-CGH: ISACGH; RIDGE analysis. Functional annotation: BLAST2GO (automatic functional annotation). Functional enrichment: FatiGO, FatiGO+, Marmite, TMT. Gene set enrichment: GSEA, FatiScan. References: Herrero et al., 2003, 2004; Vaquerizas et al., 2005 NAR; Montaner et al., 2006 NAR; Al-Shahrour et al., 2005, 2006, 2007 NAR; 2005 Bioinformatics, 2007 BMC Bioinformatics; Tarraga et al., NAR 2008; Al-Shahrour et al., NAR 2008.

6 Clustering of experiments: distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Overview of the combined in vitro and breast tissue specimen cluster diagram: a scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells. Symbolic representation.

7 Gene selection: the T-Rex tool. The simplest way: univariate, gene-by-gene. Other multivariate approaches can be used. Two classes: t-test, Bayes, data-adaptive, Clear. Multiclass: ANOVA, Clear. Continuous variable (e.g. level of a metabolite): Pearson, Spearman, regression. Survival: Cox model.
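The gene-by-gene, two-class case can be sketched as follows. This is a minimal illustration and not the T-Rex implementation: it assumes Welch's unequal-variance t statistic, and the names `welch_t` and `rank_genes` and the input layout (a dict of expression values plus a 0/1 label per sample) are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # both groups are constant; no evidence either way
    return (mean(a) - mean(b)) / se


def rank_genes(expr, labels):
    """Score every gene independently and sort by absolute t statistic.

    expr:   {gene_name: [expression value per sample]}  (hypothetical layout)
    labels: [0 or 1 per sample], same order as the values
    """
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, c in zip(values, labels) if c == 0]
        group1 = [v for v, c in zip(values, labels) if c == 1]
        scores[gene] = welch_t(group1, group0)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Each gene is tested on its own, so the result is a ranked list; p-values and multiple-testing correction would be applied on top of these statistics.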

8 Genes differentially expressed between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC). 86 genes with significantly different expression patterns between normal endometrium and endometrioid endometrial carcinoma (FDR-adjusted p < 0.05), selected among the ~7000 genes in the CNIO OncoChip. Moreno et al., 2003, Cancer Research 63.
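The slide does not say which FDR procedure was used. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up adjustment; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted value falls below 0.05 would then be reported, which is the kind of threshold quoted on the slide.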

9 Prognostic and diagnostic predictors. PROPHET and the MAQC-II initiative. Medina 2007, Bioinformatics.

10 The MicroArray Quality Control (MAQC) Project: an FDA-led effort toward personalized medicine. MAQC-II objective: reaching consensus on the best practices (Data Analysis Protocol, DAP) in developing and validating microarray-based predictive models (classifiers) for clinical and preclinical applications. An international consortium of 36 data analysis teams submitted prediction results from 18,202 models for 6 datasets to the MAQC-II

11 Studying copy number alterations (CNA). Correlation between CNA and expression. Figure panels: amplification and deletion; minimum region with gains and losses; zoom of the region.

12 Understanding why genes differ in their expression between two different conditions. Lymphomas from mature lymphocytes (LB) and precursor T-lymphocytes (PTL). Genes differentially expressed were selected among the ~7000 genes in the CNIO OncoChip. Genes differentially expressed between both groups were mainly related to immune response (activated in mature lymphocytes). Martinez et al., Clinical Cancer Research 10.

13 Genome annotation: structural annotation and functional annotation. Levels: gene annotation and gene set annotation. Sources shown: biological databases; Gene Ontology (biological process, molecular function, cellular component); KEGG pathways; Biocarta pathways; Reactome; Swissprot keywords; protein-protein interactions; protein structure; motifs; domains; bioentities from literature; gene expression modules; MSigDB; regulatory elements (miRNA, CisRED, transcription factor binding sites).

14 Case study: functional differences in a class comparison experiment (Mootha et al., 2003). Group A: 8 with impaired glucose tolerance (IGT) + 18 with type 2 diabetes mellitus (DM2). Group B: 17 with normal tolerance to glucose (NTG). No single gene shows significant differential expression upon the application of a t-test. Nevertheless, many pathways and functional blocks are significantly activated/deactivated (healthy vs diabetic, upregulated or downregulated). Functional classes reported, drawn from the GO, KEGG and Swissprot keyword repositories: oxidative phosphorylation, ATP synthesis, ribosome, ubiquinone, ribosomal protein, ribonucleoprotein, mitochondrion, transit peptide, nucleotide biosynthesis, NADH dehydrogenase (ubiquinone) activity, nuclease activity, insulin signalling pathway.

15 Beyond discrete variables: survival data. Microarrays: 34 samples from tumours of hypopharyngeal cancer (GEO GDS1070). Gene selection: Cox proportional-hazards model to study how the expression of each gene across patients is related to their survival. Since FatiScan depends only on a list of ordered genes, and not on the original experimental values, it can be applied to different experimental designs. Example ranked list (gene: risk): Gene1 5.8, Gene2 5.6, Gene3 5.4, Gene4 5.2, Gene5 5.2, ...
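The point that only the gene ordering matters can be illustrated with a partition scan over a ranked list. This is a simplified sketch and not FatiScan itself: it assumes a one-sided hypergeometric test at a few evenly spaced cut points, with no multiple-testing correction, and the names `hypergeom_upper_tail` and `partition_scan` are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1))
    return hits / total


def partition_scan(ranked_genes, annotated, n_partitions=4):
    """Test over-representation of a term in the top part of a ranked list.

    ranked_genes: gene names ordered by any statistic (t, Cox risk, ...)
    annotated:    set of genes carrying the functional term
    Returns (cut position, annotated genes above the cut, p-value) per cut.
    """
    N = len(ranked_genes)
    K = sum(1 for g in ranked_genes if g in annotated)
    results = []
    for p in range(1, n_partitions):
        cut = N * p // n_partitions
        k = sum(1 for g in ranked_genes[:cut] if g in annotated)
        results.append((cut, k, hypergeom_upper_tail(k, N, K, cut)))
    return results
```

Because the input is just an ordered list, the same scan applies whether the ranking came from a t-test, a correlation, or a Cox model risk.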

16 Methodologies: protein interaction networks. Evaluation of the cooperative behaviour of the list. Generation of the functional module (GO terms, KEGG pathways or bioentities). Minimum Connection Network (functional module): we find the shortest paths between all pairs of nodes in the list. We accept the paths that connect two nodes either directly or through a given number of nodes not included in the list. Figure labels: list of selected proteins; shortest paths; MCN; Prot 1; Prot 2; nodes included in the list; nodes not included in the list.
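The minimum connection network (Red de Conexión Mínima) construction described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the interactome is taken to be an unweighted adjacency dict, only one shortest path per pair is examined (breadth-first search), and the names `shortest_path`, `minimum_connection_network` and `max_external` are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def minimum_connection_network(graph, selected, max_external=1):
    """Collect edges of accepted shortest paths between selected proteins.

    A path is accepted when it uses at most `max_external` intermediate
    nodes that are not in the selected list (0 means direct or via list nodes).
    """
    selected = set(selected)
    edges = set()
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        external = [n for n in path[1:-1] if n not in selected]
        if len(external) <= max_external:
            edges.update(frozenset(e) for e in zip(path, path[1:]))
    return edges
```

Raising `max_external` admits longer detours through proteins outside the list, so the resulting network grows but becomes less specific to the selected proteins.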

17 Variation of physical connections in biochemical pathways (normal vs cancer). Cellular processes (connection gains in cancer). Tissues: prostate, mammary gland. Connection gains: (1) auto-connections in Cell cycle; (2) Cell cycle - Tight junction; (3) Gap junction - Insulin signaling pathway; (4) Gap junction - Fc epsilon RI signaling pathway; (5) Toll like receptor signaling pathway auto-connections; (6) Toll like receptor signaling pathway - B cell receptor signaling; (7) Insulin signaling pathway - Melanogenesis.

18 The Babelomics suite for functional profiling of genomic experiments. Al-Shahrour et al., 2005, 2006, 2007, 2008 NAR; 2004, 2005 Bioinformatics; 2007 BMC Bioinformatics. Biological information from: GO, Interpro motifs, KEGG pathways, Biocarta pathways, Swissprot keywords, TFBSs (Transfac), regulatory motifs (CisRED), miRNAs, protein interactions, tissues, text-mining, chromosomal location. For human, mouse, rat, chicken, cow, fly, worm, yeast, A. thaliana and bacteria. Tests for functional enrichment and gene set enrichment.

19 Expanding the concept of functional profiling. Better functional annotations will help. Testing models will increase our sensitivity. Functions and pathways are correlated (higher levels of organization). In general, (systems) biology is behind. Our questions must be inspired directly by biology.

20 Successful reception by the scientific community. GEPAS: currently the most cited web-based platform for transcriptomic analysis (482 Google Scholar citations). Babelomics: third most cited platform (575 Google Scholar citations; FatiGO is amongst the 50 most cited papers in Bioinformatics). Approximately 1000 users per day; 1500 registered users (6 months). Microarray data analysis web tools with at least 10 citations (Google Scholar citations over all the references of the tool, June 2008): GEPAS, ExpressionProfiler (52), caGEDA (36), GenePublisher (25), ExpressYourself (26), RACE (22), ArrayPipe (20), VAMPIRE (19), MIDAW (15), t-profiler (16), CARMAweb (12).

21 Functional genomics: SNP analysis. PupaSuite: interactive selection of optimal sets of SNPs for large-scale genotyping. SNPeffect database: phenotyping of human SNPs and disease mutations.

22 Genome projects. Design and implementation of a workflow for high-throughput genotyping at CeGen. Problem 1: feed the monster (e.g. Illumina: [...] genotypes at a time). Experimental design (linkage, pathway, etc.) and computer-aided selection with PupaSuite (Conde et al., 2004, 2005, 2007 NAR). Cancer SNPs DB server. October 2004: [...] SNPs designed. Problem 2: store results... along with clinical data. Problem 3: query the database (LD, case-control, haplotypes, odds ratios, etc.)... and submit to analysis programs.

23 Functional genomics: siRNA analysis. SiDE (siRNA Design): highly specific and accurate selection of siRNAs for high-throughput functional assays.

24 Next generation sequencing: throughput up to 10^9 bp/day. Platforms: Illumina Genome Analyzer (Solexa); Genome Sequencer FLX System (454); SOLiD (Applied Biosystems). Applications: re-sequencing, de novo sequencing, CNVs, SNPs, transcriptomics, ChIP-on-chip, etc.

25 Next Generation Sequencing for: transcriptomics, resequencing, SNPs, copy number, ChIP-on-chip-like experiments.

26 Technological services. Sequencing facility: Applied Biosystems 3730XL DNA analyzer, 1.6 Mbp per day; sequencing of plasmids, cosmids and BAC ends; SNPs, microsatellites, methylation profiles, fragment analysis. Microarray facility: labelling, hybridization, scan; commercial arrays (Agilent, Operon, Eppendorf, Clontech, GE Healthcare); DNA microarrays, ChIP-on-chip, exon arrays, microRNAs, BAC arrays; array design: probe selection. Computing power: computing cluster with 230 CPUs (20 servers x 2 Xeon quad-core, 20 servers x 2 Opteron processors, 30 PCs x 1 Athlon processor); 20 TB disk storage; x86_64 architecture; 0.5 M BLAST runs against nr in 24 hours.

27 The bioinformatics department at the Centro de Investigación Príncipe Felipe (Valencia, Spain): Joaquín Dopazo, Eva Alloza, Leonardo Arbiza, Fátima Al-Shahrour, Emidio Capriotti, Jose Carbonell, Ana Conesa, Hernán Dopazo, Pablo Escobar, Francisco García, Stefan Goetz, Jaime Huerta, Rafael Jimenez, Marc Martí, Ignacio Medina, Pablo Minguez, David Montaner, François Serra, Joaquín Tárraga... the INB, National Institute of Bioinformatics (Functional Genomics Node), and the CIBER-ER Network of Centers for Rare Diseases.