Current questions in science: How can bioinformatics help to solve them?
Overview
Introduction
- Historical overview
- Current questions in science
Genome projects
- Current status of genome projects
- Sequencing strategies and methods
- Strategies for gene identification
Proteomics
- Proteome: 2D gels, mass spectrometry
- Protein interactions
- Gene expression: microarrays, SAGE
Data analysis
- Sequence comparison
- Data mining
Historical overview
- Classification in biology: Carl von Linné (1707-1778)
- Evolution: Charles Darwin (1809-1882)
- Genetics: Gregor Mendel (1822-1884)
- 1869: Discovery of nuclein, Friedrich Miescher (1844-1895)
- 1952: DNA is the genetic material (Hershey-Chase)
- 1953: Molecular structure of DNA (Chargaff; 1962 Nobel Prize: James Watson, Francis Crick)
- 1970: Recombinant DNA, DNA sequencing (1980 Nobel Prize: Walter Gilbert, Frederick Sanger, Paul Berg)
- 1983: Amplification of DNA (PCR) (Kary Mullis & others, 1993 Nobel Prize)
Classical genetics
Mutant → Phenotypic feature → Protein function → Gene
Examples: biochemical pathways, enzymes, cell cycle, visual signal response, development of tissues, development of organisms, embryogenesis, immune response, receptor proteins, hormones
The limits
The dogma "gene → protein → specific function" does not hold for all biological functions.
- Cellular processes involve many different gene products and their interactions.
- Cellular processes are complex and multidimensional.
This calls for a completely new kind of research.
Current questions in science
High throughput! Genome, Transcriptome, Regulome, Proteome, Metabolome
Current questions in science
Goal: to understand complex biological processes in the cell and organism.
Proteomics is linked to:
- Biology: research
- Medicine: disease, diagnostics
- Biotechnology
- Pharmacology: drug targeting, synthetic substances
Methanococcus jannaschii
Highlights in Genome Projects
Organism                       Year  Millions of bases  Number of genes  Genes per million bases
Saccharomyces cerevisiae       1996                 12             5800                      483
Caenorhabditis elegans         1998                 97            19099                      197
Drosophila melanogaster        2000                116            13601                      117
Arabidopsis thaliana           2000                115            25498                      221
Human genome (public sequence) 2001               2693            31780                       12
Human genome (Celera)          2001               2654            39114                       15
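The last column of the slide above is the number of genes divided by the genome size in millions of bases. A minimal Python sketch of that arithmetic, using only the figures from the slide (the names `GENOMES` and `genes_per_megabase` are illustrative, not from the lecture):

```python
# (organism, millions of bases, number of genes), copied from the slide
GENOMES = [
    ("Saccharomyces cerevisiae", 12, 5800),
    ("Caenorhabditis elegans", 97, 19099),
    ("Drosophila melanogaster", 116, 13601),
    ("Arabidopsis thaliana", 115, 25498),
    ("Human genome (public sequence)", 2693, 31780),
    ("Human genome (Celera)", 2654, 39114),
]


def genes_per_megabase(megabases, genes):
    """Gene density: number of genes per million bases."""
    return genes / megabases


for name, megabases, genes in GENOMES:
    density = genes_per_megabase(megabases, genes)
    print(f"{name:32s} {density:7.1f} genes per million bases")
```

The computed densities agree with the slide up to rounding, and they make the contrast explicit: yeast carries a gene roughly every 2,000 bases, the human genome one every 70,000 to 85,000 bases.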
Complete genomes
Whole-genome sequences for more than 800 organisms (bacteria, archaea, and eukaryota, as well as many viruses and organelles) are either complete or being determined.
Human Genome Project
Goals:
- Determine the sequence (0.75 Gb of data)
- Identify all the genes in the human DNA
- Store this information in databases
- Develop tools for data analysis
- Address the ethical, legal, and social issues that may arise from the project
Human Genome Project
- 30,000-40,000 genes
- Current estimate: 100,000-140,000 functional genes
- More transcripts due to alternative splicing or recombination (~ one gene, three proteins)
- Proteome: ~250,000 proteins, 1,300 protein families
- More than 95% of the human genome is not coding
- Mostly DNA with unknown functions
- CpG islands (45,000 per haploid genome)
- Repeated sequences (SINEs and LINEs)
High throughput! Genome projects
Sequencing a genome
- Clone large parts into special vectors (BACs, can contain up to 1 Mbp)
- Primer walking: sequence the BACs from beginning to end
- Shotgun sequencing:
  - Fractionate the BAC insert into small fragments
  - Shotgun sequence these (only the ends)
  - Computer-assemble all pieces
  - 10-fold excess (coverage) is essential
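Why a 10-fold excess matters can be illustrated with a short calculation. The sketch below assumes reads land uniformly at random on the insert, so that the fraction of bases hit by no read is approximately exp(-coverage) (a Poisson approximation). The insert size and read length in the example are illustrative assumptions, not values from the lecture.

```python
import math


def coverage(num_reads, read_length, target_length):
    """Average number of reads covering each base of the target."""
    return num_reads * read_length / target_length


def expected_uncovered_fraction(cov):
    """Poisson approximation: fraction of bases not covered by any read."""
    return math.exp(-cov)


def reads_needed(cov, read_length, target_length):
    """Number of reads required to reach the requested average coverage."""
    return math.ceil(cov * target_length / read_length)


# Assumed example: a 150,000 bp BAC insert sequenced with 500 bp reads.
insert_length = 150_000
read_length = 500

for fold in (1, 3, 10):
    n = reads_needed(fold, read_length, insert_length)
    gap = expected_uncovered_fraction(fold)
    print(f"{fold:2d}-fold: {n:5d} reads, ~{gap * insert_length:8.1f} bases uncovered")
```

Under these assumptions, 1-fold coverage leaves more than a third of the insert unsequenced, while 10-fold coverage leaves only a handful of bases, which is what makes computer assembly of the pieces feasible.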
Genome projects
Physical maps of genomes are built by mapping known diseases to certain areas, or by placing more abstract landmarks on the map, such as PCR fragments (STS, sequence-tagged sites). These can be random fragments of DNA or those corresponding to ESTs or other cDNAs.
EXAMPLE: Duchenne Muscular Dystrophy
Duchenne muscular dystrophy (DMD) is one of a group of muscular dystrophies characterized by enlargement of muscles. All are X-linked and affect mainly males. "Dystrophy" refers to any of a number of disorders characterized by weakening, degeneration or abnormal development of muscle.
(Figure: X chromosome)
Genome projects Fluorescent in situ hybridisation
Predicting protein encoding genes
- Transcription: pre-mRNA
- Splicing: mRNA, (A)200
- Translation, modification: protein
Basic strategies to find gene-specific sequence motifs
- Homology searching
- Analysis of sequence signals
- Statistical analysis
- Whole genome comparison
Gene identification: ideally, gene prediction tools should be able to identify and automatically annotate all genes.
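As an illustration of the "analysis of sequence signals" strategy, the following Python sketch scans the forward strand for open reading frames: an ATG start codon followed in the same frame by a stop codon. It is a deliberately simplified toy, not one of the gene prediction tools referred to in the lecture: it ignores the reverse strand, introns, splice signals and codon statistics, and the function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=3):
    """Find open reading frames on the forward strand.

    Returns a list of (start, end, frame) tuples, where `end` is exclusive
    and the reported span includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence codon by codon in this reading frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


sequence = "CCATGAAATTTTAAGG"
for start, end, frame in find_orfs(sequence):
    print(f"frame {frame}: {sequence[start:end]} at {start}-{end}")
```

In a real genome, such signal-based candidates would typically be combined with the other strategies listed above (homology, statistics, genome comparison) before a gene is annotated.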
Recognition sites for gene regulation
Genome comparison
Genomes
Bioinformatics: Why sequence comparison?
Evolutionary relationships (figure: tree with an ancestor gene, paralogs, and orthologs in species 1, species 2 and species 3)
Homo sapiens chromosome X versus Mus musculus chromosome X
Identical genome Totally different proteome
Proteome
Sequence the proteome:
- Tissue
- Isolate proteins
- Run a 2D SDS-PAGE
- Isolate single protein spots
- Sequence the protein
2-dimensional SDS-PAGE
Step 1: + -
Step 2: - +
Proteome
Identify a protein with mass spectrometry:
- Tissue
- Isolate proteins
- Run a 2D SDS-PAGE
- Isolate single protein spots
- Enzymatic digestion
- Peptide mass fingerprinting
- Database search
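The last three steps (enzymatic digestion, peptide mass fingerprinting, database search) can be sketched in a few lines of Python. The sketch assumes trypsin as the enzyme (cleavage after K or R, but not before P), standard monoisotopic residue masses, and a tiny made-up "database" of two toy sequences. The function names, the tolerance and the sequences are all illustrative assumptions, not part of the lecture.

```python
# Monoisotopic residue masses in Dalton (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein):
    """Cut after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def score_candidates(observed_masses, database, tolerance=0.5):
    """Rank database proteins by how many observed masses they explain."""
    scores = []
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        hits = sum(
            any(abs(obs - theo) <= tolerance for theo in theoretical)
            for obs in observed_masses
        )
        scores.append((name, hits))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy database and "measured" spectrum.
toy_database = {"toyA": "MKWVTFRPLK", "toyB": "GGGKAAAR"}
observed = [peptide_mass(p) for p in tryptic_digest(toy_database["toyA"])]
print(score_candidates(observed, toy_database))
```

Real fingerprinting searches additionally account for missed cleavages, modifications and measurement error, which is why the matching here is only a tolerance window around each theoretical mass.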
Proteome
- pre-mRNA: 5'UTR, Exon 1 (ATG), Intron 1, Exon 2, Intron 2, Exon 3 (TAA), 3'UTR
- Splicing / polyadenylation
- mRNA: ATG ... TAA, poly(A) tail (AAAAAAAAA)
- Translation
- Active protein: CPLTW...GFL
- Splice variant: CPLTW...PJC
- Post-translational modification: CPLTW...LAC
Proteome
Proteomics Ultimate goal of proteomics Identical expression pattern Receptor/ligand relationship Sequence identity
Novel definitions in biology
Genome: the complete set of chromosomes with the genes they contain. It's more or less static information!
Proteome: all proteins encoded by the genome, including
- Splice variants
- Post-translational modifications
- Polymorphisms
- Disease mutations
Proteomics
HPI: Human Proteomics Initiative
A major effort of the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI)
GOALS
- Annotation of all known human proteins (+ splice variants).
- Annotation of all known human polymorphisms and disease mutations.
- Annotation of all known post-translational modifications in human proteins.
- Tight links to structural information.
- Annotation of mammalian orthologs of human proteins.
Proteomics
- Many interactions with other proteins and compounds
- Changes of protein concentrations:
  - Subcellular localization
  - Time
  - Tissue
  - Developmental stages
  - ...
Proteomics
Goal: reconstructing the molecular circuitry of a living cell.
Techniques:
- Molecular genetics: gene expression (microarrays, SAGE)
- Protein analysis: 2-D gel electrophoresis, mass spectrometry
- Protein interactions (peptide arrays, yeast two-hybrid)
- And bioinformatics, to integrate heterogeneous data from different knowledge databases
High throughput! DNA Microarrays
DNA microarrays are perfectly suited for comparing gene expression. Different samples are compared to find, e.g.:
- Tissue-specific genes
- Regulatory gene defects in cancer medicine
- Disease-related metabolic pathways
- Candidate genes
- Cellular responses to the environment
Microarray technique
Affymetrix GeneChip or spotted DNA microarrays
- Two different cell samples
- RNA extraction and reverse transcription
- Labelling
- Hybridisation
Microarrays: a flood of data
- Data collection: arrays are scanned to extract signal intensities from the image.
- Normalization: data are calibrated, e.g. by dividing the RNA signal by the genomic DNA signal.
- Clustering: bioinformatics methods identify groups of up- and down-regulated genes.
- Annotation: bioinformatics methods and data mining provide more information about the function and interaction of up- and down-regulated genes.
- Submission to public repositories
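The normalization and grouping steps can be sketched as follows. The sketch divides each RNA signal by the corresponding genomic DNA signal, as described above, and then compares two samples on a log2 scale. The two-fold threshold, the gene names and all intensities are made-up assumptions, and the simple thresholding shown here stands in for a real clustering method rather than implementing one.

```python
import math


def normalize(rna_signal, genomic_signal):
    """Calibrate each spot by dividing RNA signal by genomic DNA signal."""
    return {gene: rna_signal[gene] / genomic_signal[gene] for gene in rna_signal}


def log2_ratios(sample_a, sample_b):
    """log2(sample_a / sample_b) per gene; +1 means a two-fold increase."""
    return {gene: math.log2(sample_a[gene] / sample_b[gene]) for gene in sample_a}


def group_by_regulation(ratios, threshold=1.0):
    """Split genes into up-, down- and unchanged groups by log2 ratio."""
    groups = {"up": [], "down": [], "unchanged": []}
    for gene, ratio in sorted(ratios.items()):
        if ratio >= threshold:
            groups["up"].append(gene)
        elif ratio <= -threshold:
            groups["down"].append(gene)
        else:
            groups["unchanged"].append(gene)
    return groups


# Hypothetical scanned intensities for three spots.
genomic = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
tumor_rna = {"geneA": 800.0, "geneB": 100.0, "geneC": 400.0}
healthy_rna = {"geneA": 200.0, "geneB": 400.0, "geneC": 400.0}

tumor = normalize(tumor_rna, genomic)
healthy = normalize(healthy_rna, genomic)
print(group_by_regulation(log2_ratios(tumor, healthy)))
```

Working on the log2 scale treats a two-fold increase and a two-fold decrease symmetrically (+1 and -1), which is why expression ratios are usually reported that way.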
High throughput! Serial Analysis of Gene Expression (SAGE)
- Isolate tissue-specific RNA (poly(A) tail: AAAAAAA).
- Reverse transcribe to cDNA (4 nucleotides, RT, primer TTTTTT...).
- The cDNA is linked to a matrix via biotin/streptavidin.
- Digest with enzyme 1.
- Remove unbound fragments.
Serial Analysis of Gene Expression (SAGE)
- Divide the sample into two parts.
- Ligate two different linkers to the samples.
- Digest with (type II) enzyme.
- Ligate, amplify with PCR, and clone.
(Figure labels: 14 bp, linker + RE)
Serial Analysis of Gene Expression (SAGE)
- The result is a huge chain of 14 bp fragments.
- The sequence of the concatemer is determined.
- 14 bp (4^14 possible combinations) are sufficient to characterize any individual RNA.
- Determine the frequency of each transcript.
Transcript    Copies in tumor cells  Copies in healthy cells
interleukin                      60                       10
actin                            93                       91
synthase                         14                       14
unknown                         110                        0
Goal: identify novel genes involved in disease or investigate how known genes are regulated.
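The counting step can be sketched in Python. The sketch makes a simplifying assumption: the sequenced concatemer is treated as 14 bp tags placed directly back to back, whereas real concatemers also contain linker and restriction-site sequence between tags. The tag sequences and the function name `count_tags` are hypothetical.

```python
from collections import Counter

TAG_LENGTH = 14


def count_tags(concatemer, tag_length=TAG_LENGTH):
    """Cut the concatemer into consecutive tags and count each tag."""
    tags = [
        concatemer[i:i + tag_length]
        for i in range(0, len(concatemer) - tag_length + 1, tag_length)
    ]
    return Counter(tags)


# Number of distinct 14 bp tags over the alphabet A, C, G, T.
print(4 ** TAG_LENGTH)

# Hypothetical tags standing in for two different transcripts.
tag_a = "ACGTACGTACGTAC"
tag_b = "TTTTGGGGCCCCAA"
counts = count_tags(tag_a + tag_b + tag_a)
for tag, copies in counts.most_common():
    print(tag, copies)
```

With 4^14 (about 268 million) possible tags, two unrelated transcripts are unlikely to share a tag, so the tag counts can be read as transcript copy numbers like those in the table above.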
Transcriptome
High throughput! In addition to determining the sequence of the genome (DNA), many laboratories determine the sequence of Expressed Sequence Tags (ESTs):
- Tissue
- Isolate RNA
- Reverse transcribe into DNA
- Sequence both ends (max. 500 bp)
- The resulting sequences are ESTs
High throughput! Protein-protein interaction
Peptide arrays
Peptide chips provide new ways to:
- study protein-protein interaction,
- unravel signal transduction pathways,
- perform multi-parameter diagnosis,
- study individual immunological repertoires, e.g. autoimmune reactions.
Protein-protein interaction Yeast two hybrid system
Bioinformatics: Current status of data analysis
Scientific exploitation of molecular biology databases:
- Database searches to find related sequences
- Pair-wise comparison of two sequences
- Alignment of multiple sequences
- Evolutionary analysis of molecular sequence data
- Analysis of protein secondary structure
- Analysis of RNA secondary structure
- Gene prediction
- ...
Thousands of software tools exist.
Bioinformatics: Data analysis
- Enter and edit sequences
- Analyze one sequence: e.g. restriction maps, calculation of MW, structure prediction, gene prediction
- Compare sequences: database searches, assembling, sequence alignments, phylogeny
Bioinformatics: Why sequence comparison?
Sequence comparison is often used:
- To find related genes in the database.
- When dealing with a sequence of unknown function: the presence of similar domains suggests a similar function.
Homologous sequences share the same ancestral sequence. They can be orthologous or paralogous.
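To make "comparing two sequences" concrete, here is a minimal Python sketch of a global pairwise alignment score computed by dynamic programming (the Needleman-Wunsch scheme). It is offered as one common way to quantify similarity, not as the method behind any tool named in the lecture. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, and real protein comparisons would use a substitution matrix instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b.

    Only the previous row of the dynamic programming table is kept,
    since each cell depends on its left, upper and upper-left neighbours.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GCATGCG"))
```

A higher score means the two sequences can be lined up with more matches and fewer mismatches or gaps; database search programs use faster heuristics but rank hits by the same kind of score.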
Bioinformatics Example: Blast2P
Searching for homologous sequences
Searching...done
Sequences producing significant alignments:                          Score   E
                                                                     (bits)  Value
>>>swissprot:25a1_mouse P11928 mus musculus (mouse). (2'-5')oli...     629   e-180
>>>swissprot:25a2_mouse P29080 mus musculus (mouse). (2'-5')oli...     495   e-140
>>>swissprot:25a2_human P04820 homo sapiens (human). (2'-5')oli...     495   e-140
>>>swissprot:25a1_human P00973 homo sapiens (human). (2'-5')oli...     495   e-140
>>>swissprot:25a3_mouse P29081 mus musculus (mouse). (2'-5')oli...     492   e-139
>>>swissprot:25a6_human P29728 homo sapiens (human). 69/71 kd (...     350   2e-96
>>>swissprot:tr14_human Q15646 homo sapiens (human). thyroid re...      77   4e-14
>>>swissprot:rn14_yeast P25298 saccharomyces cerevisiae (baker'...      32   1.5
..
Bioinformatics: Why sequence comparison?
Evolution of genes and proteins
Many proteins consist of several different domains, each with a specific function.
- Gene duplication
- Domain shuffling
(Figure labels: gene, protein)
Bioinformatics: Why sequence comparison?
Evolution of genes and proteins
Gene duplication results in:
- Mostly pseudogenes (without function), or
- A similar gene product with a new function (e.g. haemoglobin alpha, beta chain)
Bioinformatics: Why sequence comparison?
Evolution of genes and proteins
Many genes and proteins are members of families which share a common biochemical function or evolutionary origin.
(Figure labels: Protein A, Protein B, Protein C, Protein D1, Protein D2)
Bioinformatics: The birth of molecular evolution
- In the early days, evolution was studied by comparison of morphological features.
- In the 50s and 60s, the protein sequences of insulin (Sanger), haemoglobins and cytochrome c became available and sequence comparisons became possible.
Bioinformatics: The birth of molecular evolution
Comparison of
- the phylogenetic tree of all cytochrome c proteins and
- the phylogenetic tree of the species (organisms)
revealed a great overlap, supporting classical phylogeny. At the same time, minor variations helped to improve existing trees.
Gene prediction
The challenge: the gap between data collection and data interpretation is growing rapidly.
High-throughput data collection
- Worldwide collection of data
- Storage in databases
- Global efforts to collect: sequence data, structure data, protein expression profiles, functional data, metabolic pathways, ...
Data analysis: bioinformatics, data mining
The Biocomputing Service Group
HUSAR: Heidelberg UNIX Sequence Analysis Resources
- Sequence retrieval: SRS
- Analysis packages: GCG / EGCG, STADEN, PHYLIP, EST CLUSTERING, BIOCCELERATOR
- Databases: EMBL, GENBANK, Swissprot, PIR, TRANSFAC, ...; genome databases (GDB/OMIM, Flybase, AceDB, ...)
- Mapping methods: Linkage Package, Mapmaker, Crimap, Map, Pedpack, APM, LIPED, LDB, SIGMA
- User support: scientific consulting, training, workshops, hotline
- Hardware environment
HUSAR Program Package
- GCG (~130 programs)
- EGCG
- EMBOSS (~150 programs)
- Third-party programs (~150 programs)
- In-house developments: own programs, automated tasks
- Databases: >300, prompt updates (daily, weekly)
- SRS (Sequence Retrieval System)
The number of analysis programs is huge, and programs must be combined for many purposes. Users need compact, presentable reports on analysis results, especially for high-throughput analysis.
- Mapping in the human genome
- Exhaustive gene structure analysis
- Extraction of most recent annotation information
- Merging with precomputed data from the NCBI pipeline
http://genome.dkfz-heidelberg.de