

2 Dina El-Khishin (Ph.D.) Deputy Director of AGERI & Head of the Genomics, Proteomics & Bioinformatics Research Facility Agricultural Genetic Engineering Research Institute (AGERI) Giza EGYPT

3 Bioinformatics Bibliotheca Alexandrina December 2007

4 Assumptions
- PC with Microsoft Windows
- Internet connection
- Use of internet browsers and computers
- Background in molecular biology

5 What's in the name? Bioinformatics: Multiple Sequence Alignment; Genome Mapping; Protein Analysis; Proteomics; Database Homology Searching; Sequence Analysis; 3D Modeling; Homology Modeling; Docking; Sample Registration & Tracking; Integrated Data Repositories; Common Visual Interfaces; Intellectual Property Auditing

6 What is Bioinformatics? Computerized annotation of genomic and biological information and data (databases). Transformation and manipulation of these data (software tools). - computational analysis of biological data.

7 The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology. Here we will consider the use of Bioinformatics tools rather than their design and construction, and the access and analysis of data and information items rather than their generation, storage or annotation.

8 Typical Bioinformatics: Multi-Disciplinary
- Scientists: Experimental Design & Interpretation; Laboratory Protocols & Standards/Controls
- Mathematicians: Analysis & Correlation of Data; Validation methodologies
- Computer Scientists: Information Storage / Control Vocabulary; Data Mining

9 Bioinformatics
- USERS: of Information; of Tools; of Instrumentation
- INTERPRETERS: of Information
- DEVELOPERS*: of Information; of Tools; of Instrumentation
- In-Silico Modeling
* of Architecture/Storage, Algorithms, Modeling Strategies, Visualization

10 Overall Aim of Bioinformatics: To provide biologically important predictions from annotated data and from the transformation / manipulation of these data.

11 Bioinformatics
- SCOPE: Biological information - Acquisition, Processing, Storage, Distribution, Analysis, Interpretation
- TECHNIQUES: Mathematics, Computer science, Biology
- OBJECTIVE: Understanding the biological significance of biological data

12 Bioinformatics Databases Nucleotide and protein sequences. Protein structures. All sorts of functional data related to genes, proteins and their regulation, interactions etc. Curated and non-curated databases.

13 Software tools (computer programs): sequence analysis, database construction and management, evolutionary relations, structural analyses, pathways, microarray analysis, proteomic analysis. Software tools integrated into databases.

14 The Need for Bioinformatics:
- Whole Genome Analyses and Sequences.
- Experimental analyses for thousands of genes simultaneously.
- DNA Chips and Array Analyses: Expression Arrays; Comparative Analyses between Species and Strains.
- Proteomics: 'Proteome' of an Organism... 2D gels, Mass Spec.
- Medical applications: Genetic Disease... SNPs; Pharmaceutical and Biotech Industry.
- Forensic applications.
- Agricultural applications.

15 Main Bioinformatics Applications Comparison and analysis of nucleotide and protein sequences. Comparison and analysis of molecular structures (especially proteins). Results of comparisons can be used in evolutionary and phylogenetic studies. Data mining of genomic data (large-scale gene expression results etc.).

16 Main Goals
- Introduction to biological databases.
- Good knowledge of major databases.
- An overview of the large variety of minor databases.
- Learning to use tools at major database sites: GenBank/NCBI, ExPASy/SwissProt, PDB.
- Introduction to sequence searching and sequence alignments.
- The tools with most practical "everyday" usability.

17 DNA Sequence Submission Sequence Alignments (Pairwise and Multiple) Scoring Matrices Motifs and Patterns Genes, Exons, and Introns Promoters, Transcription-factor-binding Sites Other Regulatory Sites RNA Secondary Structure RNA-specifying Genes, Motifs Protein Sequence Alignment Motifs, Patterns, and Profiles

18 II. Sequence Databases and Their Use: A. Primary Sequence Databases:
Nucleic Acid Databases
- NCBI (National Center for Biotechnology Information) - GenBank
- EBI (European Bioinformatics Institute) - EMBL
- DISC - DNA Information and Stock Center, Japan

19 Protein Databases
- NCBI - GenPept
- ExPASy - SwissProt and TrEMBL
- EBI (European Bioinformatics Institute) - SwissProt, TrEMBL, PIR
- DISC - DNA Information and Stock Center, Japan

20 B. Uses of Sequence Databases: Information Retrieval Analysis: "given a new DNA sequence, what's in it?" Finding Homologues Finding Genes Finding Motifs - DNA Binding Sites

21 C. Retrieve Info from Sequence Databases:
- NCBI - Entrez
  - Types of Databases Available
  - Entrez Help
  - Retrieve Large Data Sets
- ExPASy - SwissProt and TrEMBL
  - SwissProt - Bairoch's well-annotated, non-redundant protein DB
  - TrEMBL - Translation of EMBL DNA coding sequences
- DISC - DNA Information and Stock Center, Japan - DDBJ
  - SRS - Sequence Retrieval System
  - Software Tools - FASTA, BLAST, MpSrch
- EBI - SwissProt, TrEMBL, PIR
  - SRS - Sequence Retrieval System
  - Software Tools - FASTA, WU-Blast2, ClustalW
- EBI2 - second server at EBI
- PDB - Protein DataBank: Protein 3D Structure database
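Retrieving a large data set from Entrez is usually scripted rather than done through the browser. The sketch below only builds a request URL for NCBI's E-utilities EFetch service; it does not contact the server. The function name `efetch_fasta_url` and the default database `nuccore` are choices made for this illustration, not something the slides specify.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions, database="nuccore"):
    """Build an EFetch URL asking for the given records as FASTA text.

    `accessions` is a list of accession numbers or UIDs (assumed valid).
    """
    params = {
        "db": database,               # Entrez database to query
        "id": ",".join(accessions),   # comma-separated record identifiers
        "rettype": "fasta",           # record format
        "retmode": "text",            # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(efetch_fasta_url(["U49845"]))
```

The resulting URL can be opened in a browser or passed to any HTTP client. For many records, NCBI's usage guidelines on request rates should be checked first.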

22 D. Sequence Analysis: finding Homologues
- Homologues - sequences descended from a common ancestor
- Comparison of Sequences using Distance Matrix approach
- DOT PLOTS - 2D graph comparing every position of one sequence with every position of the other
- FASTA - fast, heuristic database search tool of Pearson and Lipman (reports local alignments)
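A dot plot is simple enough to sketch in a few lines. The version below marks a cell when a short window of one sequence is identical to a window of the other; the `window` parameter and the text rendering are illustrative choices, and real tools usually add a mismatch threshold and a graphical display.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return a matrix of booleans for a dot plot.

    Cell [i][j] is True when the `window` residues starting at seq_a[i]
    are identical to the `window` residues starting at seq_b[j].
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Similar regions show up as diagonal runs of '*'.
    print(render(dot_plot("GATTACA", "GATCACA", window=2)))
```

Increasing `window` removes much of the random background, at the cost of missing short or imperfect matches.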

23 BLAST - Basic Local Alignment Search Tool
- BLASTN - NA query vs. NA database
- BLASTP - Protein query vs. Protein database
- BLASTX - NA query (translated) vs. Protein database
- TBLASTN - Protein query vs. NA (translated) database
- TBLASTX - NA query (translated) vs. NA (translated) database
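The choice among the five BLAST programs follows mechanically from the type of the query, the type of the database, and whether the comparison is made at the protein level. The lookup below restates that table as code; the function name `choose_blast_program` and the string labels are hypothetical, introduced only for this sketch.

```python
# Keys: (query type, database type, compare at the protein level?)
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", True): "blastp",
    ("nucleotide", "protein", True): "blastx",      # query is translated
    ("protein", "nucleotide", True): "tblastn",     # database is translated
    ("nucleotide", "nucleotide", True): "tblastx",  # both are translated
}


def choose_blast_program(query, database, protein_level):
    """Return the BLAST program name for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query, database, protein_level)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for query={query!r}, database={database!r}, "
            f"protein_level={protein_level!r}"
        ) from None


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein", True))
```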

24 E. Sequence Analysis: finding Genes in DNA
Methods:
- Gene Search by Signal: look for signals - Promoter Sites, Splice Sites,...
- Gene Search by Content:
  - Open Reading Frame
  - Use of Statistical Properties of Protein Coding Regions: unequal use of amino acids; unequal numbers of codons per amino acid; available codons not used equally - Codon Usage
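Gene search by content can be illustrated with two small functions: one that scans for open reading frames and one that tabulates codon usage. This is a deliberately simplified sketch: it scans only the forward strand, assumes the standard stop codons and an ATG start, and ignores introns. The names `find_orfs` and `codon_usage` and the `min_codons` threshold are assumptions made for the example.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ATG...stop open reading frames on the forward strand.

    Returns a list of (start, end, frame) tuples, where dna[start:end]
    runs from the ATG through the stop codon. `min_codons` counts the
    codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs


def codon_usage(coding_sequence):
    """Count how often each codon occurs in an in-frame coding sequence."""
    codons = [
        coding_sequence[i:i + 3]
        for i in range(0, len(coding_sequence) - 2, 3)
    ]
    return Counter(codons)


if __name__ == "__main__":
    dna = "CCATGAAACCCGGGTAACC"
    for start, end, frame in find_orfs(dna):
        print(frame, dna[start:end], dict(codon_usage(dna[start:end])))
```

Comparing the codon counts of a candidate ORF with the known codon usage of the organism is one way to judge whether the ORF is likely to be a real gene.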

25 F. Sequence Analysis: finding Motifs
Motifs:
- Motif - a recurrent thematic element
- Structural motifs - pieces of folded 3D structure
- Sequence motifs - conserved "blocks" of sequences
DNA Motifs:
- Protein binding sites... regulatory elements
- Relatively short
- Statistically difficult
- Cooperative binding often important
- Structural elements may be important - bends, kinks

26 Protein Motifs:
- Secondary structure - alpha helices, beta sheets
- Super secondary structure - 4 helix bundle, etc.
Basic Methods:
- Consensus sequence - single, best sequence
- Regular Expression - multiple characters per site
- Weight Matrix (Profile) - any character per site, with score
- Hidden Markov Model
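The difference between a consensus sequence and a weight matrix is easiest to see on a toy set of aligned sites. The sketch below uses a DNA alphabet for brevity, although the same code works for protein sites if `alphabet` is changed. The uniform background frequency and the pseudocount of 1 are simplifying assumptions, and all function names are illustrative.

```python
import math


def count_matrix(sites, alphabet="ACGT"):
    """Count each character at each position of equal-length aligned sites."""
    length = len(sites[0])
    return [
        {ch: sum(site[pos] == ch for site in sites) for ch in alphabet}
        for pos in range(length)
    ]


def consensus(counts):
    """Consensus sequence: the single most frequent character per position."""
    return "".join(max(col, key=col.get) for col in counts)


def log_odds_matrix(counts, pseudocount=1.0):
    """Weight matrix: log2(observed frequency / uniform background)."""
    matrix = []
    for col in counts:
        background = 1.0 / len(col)
        total = sum(col.values()) + pseudocount * len(col)
        matrix.append({
            ch: math.log2((n + pseudocount) / total / background)
            for ch, n in col.items()
        })
    return matrix


def score(matrix, window):
    """Sum the per-position weights of `window` (same length as matrix)."""
    return sum(col[ch] for col, ch in zip(matrix, window))


def best_hit(matrix, sequence):
    """Start index of the highest-scoring window in `sequence`."""
    width = len(matrix)
    return max(
        range(len(sequence) - width + 1),
        key=lambda i: score(matrix, sequence[i:i + width]),
    )


if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    counts = count_matrix(sites)
    print(consensus(counts))
    print(best_hit(log_odds_matrix(counts), "GGGTATAATGGG"))
```

Unlike the consensus, the weight matrix still gives partial credit to sites such as TATGAT that differ from the best sequence at one position.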

27 Protein Family Classifications
Prosite: Database of protein families and domains - at ExPASy and elsewhere
- Regular expressions (Patterns) and Profiles
Programs:
- Search Prosite for Pattern or Profile
- ScanProsite - scan a sequence against Prosite, or a pattern against SwissProt
- ProfileScan - scan a sequence against Profile Database
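Prosite patterns are regular expressions written in their own notation, so a pattern can be scanned against a sequence after a straightforward translation. The converter below is a minimal sketch covering the common elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors at the ends of a pattern); it does not handle anchors placed inside brackets. The function names are made up for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite pattern into Python regular-expression syntax."""
    regex = pattern.rstrip(".")                             # trailing period ends a pattern
    regex = regex.replace("-", "")                          # '-' only separates elements
    regex = regex.replace("{", "[^").replace("}", "]")      # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")       # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".")                         # x      -> any residue
    regex = regex.replace("<", "^").replace(">", "$")       # terminus anchors
    return regex


def find_motif(sequence, pattern):
    """Return the start positions of non-overlapping pattern matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [match.start() for match in compiled.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern: N, not P, S or T, not P.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motif("AANGTAA", "N-{P}-[ST]-{P}."))
```

Short patterns like this one match very often by chance, which is why Prosite also provides profiles for families that a pattern cannot describe well.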

28 G. Multiple Sequence Analysis
Basics: Progressive Sequence Alignment
- Pairwise alignment of most similar, then next most similar, etc.
Steps:
- Do pairwise alignment for all sequences
- Get Matrix of approximate Distances between each pair
- Create an approximate phylogenetic tree - Guide Tree
- Use this to determine order of addition of sequences to alignment
- Align: two sequences; sequence to sub-alignment; two sub-alignments
- Keep GAPS that appear early - 'Once a gap, always a gap'
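The first two steps of progressive alignment, all-against-all pairwise alignment followed by a distance matrix, can be sketched as follows. The pairwise aligner is a plain Needleman-Wunsch global alignment with a linear gap penalty; the scoring values (match 1, mismatch -1, gap -2) and the use of "1 minus fraction of identical columns" as the distance are assumptions made for this example, not the scheme used by any particular program.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def distance(a, b):
    """Approximate distance: 1 - fraction of identical aligned columns."""
    aligned_a, aligned_b = needleman_wunsch(a, b)
    identical = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 1 - identical / len(aligned_a)


def distance_matrix(sequences):
    """Distances for every pair of sequence indices (i < j)."""
    return {
        (i, j): distance(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }


if __name__ == "__main__":
    seqs = ["ACGT", "AGT", "TTTT"]
    matrix = distance_matrix(seqs)
    print(matrix)
    # The closest pair is aligned first when the guide tree is built.
    print(min(matrix, key=matrix.get))
```

A guide tree built from this matrix (for example by UPGMA or neighbor joining) then fixes the order in which sequences and sub-alignments are merged.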

29 Web sites for Multiple Sequence Alignment
Clustal W:
- Weighting - different weights given to unequally sampled sequences
- Position Dependent Weights: Position-Specific Gap Penalties (Opening vs Extension)
- Sequence Weighting: Weights for Adding New Sequences to existing Alignment - extra weight to sequences most similar to alignment
- Clustal W Servers
Other Web Programs:
- MAP, PIMA, MSA
- Many others available

30 Web Databases of Multiple Sequence Alignments Fold Classification via Structure-Structure Protein alignments (FSSP) Homology derived Secondary Structure Assignments (HSSP) Database of Secondary Structure Assignments (DSSP)

31 H. Phylogenetics Basics: Trees - Rooted vs Unrooted Rooted Tree - position of Ancestor is known Unrooted Tree - no Ancestral Node Topology - Branching Pattern of the Tree

32 Terminology
- 1, 2, 3, 4, 5: Taxa or External Nodes
- X, Y, Z: Internal Nodes
- R1: Root
- a, b, c, d, e: External Branches
- f, g: Internal Branches
- h: Internal Branch ONLY IF tree is Rooted; else h is part of the
- Outgroup: Taxon 5... used to "root trees"

33 Methods:
Distance Matrix methods
- UPGMA - Unweighted Pair Group Method with Arithmetic Mean: fixed 'clock', averages used to get distances
- Fitch & Margoliash - 3 branches calculated at a time
- Neighbor Joining - pairs of taxa; finds the closest pair, giving the tree with the smallest sum of Branch Lengths
- Other methods also available
Parsimony methods
- Find tree with fewest inferred mutations
- Programs: PHYLIP package; PAUP
Maximum Likelihood methods
- Use a mathematical model of the process of evolution
- Model contains a parameter which is used to Maximize the Likelihood that observed changes took place
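Of the distance matrix methods, UPGMA is the simplest to write down: repeatedly join the two closest clusters and replace their distances to every other cluster by a size-weighted average. The sketch below returns only the tree topology as nested tuples and omits branch lengths; the input format (a dict keyed by pairs of names) and the function name are choices made for this example.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    `distances` maps a pair of names, e.g. ("A", "B"), to their distance;
    every pair of names must be present once, in either order.
    """
    dist = {frozenset(pair): d for pair, d in distances.items()}
    # Each cluster label maps to (subtree, number of taxa in the cluster).
    clusters = {name: (name, 1) for name in names}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = f"({a},{b})"
        # Distance from the new cluster to each remaining cluster is the
        # average over all member taxa, hence the size weighting.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    tree, _size = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    distances = {
        ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
    }
    print(upgma(names, distances))
```

Because UPGMA assumes a fixed clock, it can give a misleading topology when lineages evolve at different rates; neighbor joining does not make that assumption.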

34 Confidence - "How good is the Tree?"
- Bootstrap - resampling of the alignment columns (sites) with replacement
  - How robust is the tree to such resampling? Always the same tree?
- How much better is this "best" tree than other trees?
  - Use set of "User defined" Trees... how good is each?
  - PHYLIP programs
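The bootstrap idea can be sketched independently of any tree-building method. The code below resamples alignment columns with replacement and reports how often the resampled data give the same tree as the original data. The `build_tree` argument is a hypothetical callable supplied by the user (any function that turns an alignment into a comparable tree description); packages such as PHYLIP usually report support per branch rather than for the whole tree.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    `alignment` is a list of equal-length aligned sequences; the result
    has the same number of sequences and the same length.
    """
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, build_tree, replicates=100, seed=0):
    """Fraction of bootstrap replicates that reproduce the original tree."""
    rng = random.Random(seed)          # fixed seed for repeatable runs
    reference = build_tree(alignment)
    same = sum(
        build_tree(bootstrap_alignment(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return same / replicates


if __name__ == "__main__":
    alignment = ["ACGTACGT", "ACGTACGA", "TCGAACGA"]
    print(bootstrap_alignment(alignment, random.Random(1)))
```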

35 III. Whole Genomes A. Implications TOTAL information on Heritable Properties of an Organism What an Organism CAN do... and CAN NOT do... Major step toward Understanding an Organism and toward making Biology a PREDICTIVE SCIENCE Current: identify Genes, predict Function

36 Next: Deduce Life Style of the Organism Predict Metabolic and Genetic Pathways Predict Adaptive Responses, Developmental Pathways ORGANISM DATABASES

37 B. TIGR
- The Institute for Genomic Research
- First to Sequence whole Genome of Free-living Organism
- Sequenced the first Three Eubacteria and First Two Archaea
- TIGR Database (TDB) - links to specific organisms

38 IV. Organisms and Other Databases A. Need for Organism Databases Direct result of Genome Physical Mapping efforts Need for Maps, Genes, Sequences, References Incomplete Genome Information plus other Information NOW: Complete Genome Information

39 B. Web Organism Databases
ACeDB - A C. elegans Data Base
- Created by Durbin and Thierry-Mieg for Sulston R.Mapping Program
- Over 40 organisms represented in ACeDB databases
- Highly variable Types of Information in each
- Examples: C. elegans, yeast, fly, grains, Arabidopsis, human chromosomes

40 Saccharomyces Genome Database (SGD)
- Basic database is Web enhancement over ACeDB SacchDB
- Excellent interface to yeast genome maps
- Many resources including analysis tools: BLAST and FASTA facilities
- SacchDB extended to include: Genome Deletion Project; Yeast Evolution Project; Sacch3D - protein 3D structure information; Worm and Mammalian Homology to Yeast; Yeast SAGE data

41 The Arabidopsis Information Resource (TAIR): Arabidopsis thaliana Database
- based on Oracle relational database system
- Much underlying information from ACeDB AatDB
- Analysis tools and Viewers, including BLAST and FASTA
- Arabidopsis Genome Initiative (AGI)
PlantsP: Plant Phosphorylation Proteins (kinases, phosphatases)
- underlying MySQL database
- display and usage is Web based
- many other resources, links, download, etc.

42 Berkeley Drosophila Genome Project (BDGP)
- Outgrowth of Encyclopedia of Drosophila (EofD)
- Excellent Map Viewers - largely Java applets; Example: CytoView
- Includes FlyBase, ACeDB database of Drosophila
Mouse Genome Informatics (MGI)
- Integrated access to mouse genetics and biology
- Mouse Genome Database (MGD)
- Mouse Gene Expression Database (GXD)
- Encyclopedia of the Mouse Genome
- links to Mouse Tumor Biology database
- Rat Data resource

43 Human Genome Resources at NCBI
- Information and links to Human Genome Project
- Human Genes
- OMIM - Online Mendelian Inheritance in Man: McKusick catalog of human genes and disorders; over 10,000 entries
- LocusLink - single interface to all human locus info
- Human/Mouse Homology Relationships
- Examples of Info on Candidate Human Genes for Hypertension

44 VI. Problems... Directions to Go
A. Problems:
- Sequence DBs and Others are Flat File Databases: one piece of information at a time
- Analysis Tools are largely Single Task oriented: from Task to Task, User must make Decisions
- Automate Basic Analytical Tasks for new DNA Sequences: this is currently done in some facilities and in some expensive commercial packages. Examples: Pangea, Incyte

45 B. Need: "smart" Analysis Packages Need "smart" Analysis Packages that can "learn" from DB info. "predict" next best options for User. Analysis: DNA seq --> gene --> protein --> motifs --> 3D structure.

46 Basic Problem with Biology becoming a "Predictive Science"
1. Large number of Different Molecules, e.g. Proteins
- Large Variety per Cell
- Variety Changes with Type of Cell in Organism
2. Often a Small number of Each Molecule
Thus: Statistical Analysis is often not Appropriate

47 The Potential of Bioinformatics
- The new paradigm is that all the genes will be known ("resident in databases available electronically").
- Biological investigations will be theoretical: scientists will start with a theoretical conjecture and only then turn to experiment to follow or test the hypothesis.

48 Bioinformatics scientists have developed new techniques to analyze genes on an industrial scale, resulting in a new area of science known as 'Genomics'. Genomics is revolutionizing our entire approach to science.

49 Gene Discovery Informatics
[Workflow diagram. Steps: Microdissection; Create DNA Libraries; DNA Sequencing; Signature Hybridization; Clustering by Signature; Expression Profiles; Differential Expression; Micro Arrays; In Situ Hybridization; Functional Assays; Functional Predictions; Gene Assignments; Small Molecule Drugs. Databases: Tissue & Cell Lines; DNA Libraries; Clones; Clustering; Annotated Sequence; Micro Array; Small Molecule; Assays & Validation.]


51 Thank you

52 References
- Martti Tolvanen & Bairong Shen, University of Tampere
- Bioinformatics for Dummies (Wiley, 2003)
- Internet Bioinformatics