Big picture and history Bioinformatics (and Computational Biology) CS-5700 / BIO-5323
Outline 1 2 3 4
First to be databased were proteins The development of protein-sequencing methods (Sanger and Tuppy 1951) led to the sequencing of representatives of several of the more common protein families, such as cytochromes, from a variety of organisms. Margaret Dayhoff (1972, 1978) and her collaborators at the National Biomedical Research Foundation (NBRF), Washington, DC, were the first to assemble databases of these sequences into a protein sequence atlas in the 1960s, and their collection center eventually became known as the Protein Information Resource (PIR, formerly Protein Identification Resource). Dayhoff and her coworkers organized the proteins into families and superfamilies based on the degree of sequence similarity.
DNA sequence databases were first assembled at Los Alamos National Laboratory (LANL), New Mexico, by Walter Goad and colleagues in the GenBank database and at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. Initially, a sequence entry included a computer filename and DNA or protein sequence files. These were eventually expanded to include much more information about the sequence, such as function, mutations, encoded proteins, regulatory sites, and references. This information was then placed along with the sequence into a database format that could be readily searched for many types of information.
Sequence retrieval from public databases An important step in providing sequence database access was the development of Web pages that allow queries to be made of the major sequence databases (GenBank, EMBL, etc.). An early example of this technology at NCBI was a menu-driven program called GEN-INFO developed by D. Benson, D. Lipman, and colleagues. This program searched rapidly through previously indexed sequence databases for entries that matched a biologist's query. Subsequently, a derivative program called ENTREZ with a simple window-based interface, and eventually a Web-based interface, was developed at NCBI. The idea behind these programs was to provide an easy-to-use interface with a flexible search procedure to the sequence databases.
Because DNA sequencing involves ordering a set of peaks (A, G, C, or T) on a sequencing gel, the process can be quite error-prone, depending on the quality of the data. As more DNA sequences became available in the late 1970s, interest also increased in developing computer programs to analyze these sequences in various ways. In 1982 and 1984, Nucleic Acids Research published two special issues devoted to the application of computers for sequence analysis, including programs for large mainframe computers down to the then-new microcomputers.
Dot matrix method for comparing sequences In 1970, A.J. Gibbs and G.A. McIntyre (1970) described a new method for comparing two amino acid and nucleotide sequences in which a graph was drawn with one sequence written across the page and the other down the left-hand side. Whenever the same letter appeared in both sequences, a dot was placed at the intersection of the corresponding sequence positions on the graph.
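The dot placement rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the original program: the function names `dot_matrix` and `render` and the example sequences are assumptions made for this sketch, and no windowing or filtering of random matches is applied.

```python
def dot_matrix(seq_a, seq_b):
    """Build a dot plot: seq_a runs across the page, seq_b down the left side.

    Each returned string is one row; '*' marks positions where the two
    sequences carry the same letter, ' ' marks everything else.
    """
    return [
        "".join("*" if a == b else " " for a in seq_a)
        for b in seq_b
    ]


def render(seq_a, seq_b):
    """Return the dot plot as text, labelled with both sequences."""
    rows = dot_matrix(seq_a, seq_b)
    header = "  " + seq_a  # two-character margin for the row labels
    body = [f"{b} {row}" for b, row in zip(seq_b, rows)]
    return "\n".join([header] + body)


if __name__ == "__main__":
    # Hypothetical example sequences, chosen only for illustration.
    print(render("GATTACA", "GATCACA"))
```

Regions of similarity show up as diagonal runs of marks; identical sequences give one unbroken main diagonal.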
Sequence alignment: global, local, and multiple Various methods for aligning entire sequences, small matching adjacent segments, and multiple variable-length segments.
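To make the global versus local distinction concrete, here is a minimal dynamic-programming sketch in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The slide does not name these algorithms, so treat the choice as an assumption. The function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are also assumptions. Only the score is computed, not the alignment itself, and multiple alignment is not covered.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Score the best global (default) or local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]
```

With these scores, two identical sequences of length n score n in both modes, while the local mode can pick out a short shared segment from otherwise unrelated sequences.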
Prediction of RNA secondary structure Methods for predicting RNA secondary structure on computers were also developed at an early time. For example, if the complement of a sequence on an RNA molecule is repeated down the sequence in the opposite chemical direction, the regions may base-pair and form a hairpin structure.
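The hairpin condition above amounts to finding a segment whose reverse complement occurs further along the same molecule. The sketch below is a toy search under stated assumptions, not a real structure predictor: it considers only Watson-Crick pairs (no G-U wobble), ignores energetics, and the names `reverse_complement` and `hairpin_candidates`, the stem length of 4, and the minimum loop of 3 bases are all choices made for this illustration.

```python
# Watson-Crick pairing only; G-U wobble pairs are deliberately ignored here.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the complement of rna read in the opposite chemical direction."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def hairpin_candidates(rna, stem=4, min_loop=3):
    """List (i, j) pairs where rna[i:i+stem] could base-pair with rna[j:j+stem].

    The partner must start at least min_loop bases after the first stem ends,
    leaving room for the unpaired loop of the hairpin.
    """
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        partner = reverse_complement(rna[i:i + stem])
        j = rna.find(partner, i + stem + min_loop)
        while j != -1:
            hits.append((i, j))
            j = rna.find(partner, j + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical sequence: stem GCAU, loop AAAA, then the reverse complement AUGC.
    print(hairpin_candidates("GCAUAAAAAUGC"))
```

Practical predictors score many competing pairings by estimated free energy; this sketch only flags where a perfect stem is possible.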
Prediction of protein and RNA structure There are a large number of proteins whose sequences are known, but very few whose structures have been solved. Solving protein structures involves the time-consuming and highly specialized procedures of X-ray crystallography and nuclear magnetic resonance (NMR). Consequently, there is much interest in trying to predict the structure of a protein, given its sequence. Early attempts were made at predicting protein structure from sequence.
Protein, DNA, and RNA Variations within a family of related nucleic acid or protein sequences provide an invaluable source of information for evolutionary biology, enabling the discovery of relationships between species in an objectively quantifiable manner.
The first genome database The first genome database was called ACEDB (a C. elegans database), and the methods to access this database were developed by Mike Cherry and colleagues (Cherry and Cartinhour 1993). This database was accessible through the internet and allowed retrieval of sequences, information about genes and mutants, investigator addresses, and references. Similar databases were subsequently developed using the same methods for A. thaliana and S. cerevisiae.
And then the field of bioinformatics exploded From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months. As of 15 August 2017, GenBank release 221.0 has 203,180,606 loci and 240,343,378,258 bases, from 203,180,606 reported sequences.
Nexus of many fields
Contrasted to data science Same job, way worse pay...
Slightly more detail
Even more detail
A different perspective
AI methods in bioinformatics
Bioinformatics and computational biology
https://en.wikipedia.org/wiki/computational_epidemiology
https://en.wikipedia.org/wiki/mathematical_modelling_of_infectious_disease
https://en.wikipedia.org/wiki/compartmental_models_in_epidemiology
https://en.wikipedia.org/wiki/computational_biology
https://en.wikipedia.org/wiki/
https://en.wikipedia.org/wiki/_assembly
https://en.wikipedia.org/wiki/_analysis
https://en.wikipedia.org/wiki/comparative_genomics
https://en.wikipedia.org/wiki/health_informatics
https://en.wikipedia.org/wiki/imaging_informatics
https://en.wikipedia.org/wiki/neuroinformatics
https://en.wikipedia.org/wiki/computational_neuroscience
https://en.wikipedia.org/wiki/modelling_biological_systems
https://en.wikipedia.org/wiki/computational_phylogenetics
https://en.wikipedia.org/wiki/computational_genomics
https://en.wikipedia.org/wiki/biodiversity_informatics
https://en.wikipedia.org/wiki/biological_network
https://en.wikipedia.org/wiki/structural_bioinformatics
https://en.wikipedia.org/wiki/ecosystem_model
https://en.wikipedia.org/wiki/models_of_dna_evolution
https://en.wikipedia.org/wiki/translational_bioinformatics
https://en.wikipedia.org/wiki/gene_
https://en.wikipedia.org/wiki/gene_prediction
https://en.wikipedia.org/wiki/bioimage_informatics
https://en.wikipedia.org/wiki/ prediction
https://en.wikipedia.org/wiki/computational_anatomy
https://en.wikipedia.org/wiki/cellular_model
Ontology In computer science and information science, an ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that really exist in a particular domain of discourse. An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. It usually employs a core glossary that contains the terms and associated object descriptions as they are used in various relevant domain sets, for example, the Basic Formal Ontology (BFO).
Domain ontology: Open Biomedical Ontologies (abbreviated OBO; formerly Open Biological Ontologies) is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO forms part of the resources of the U.S. National Center for Biomedical Ontology, where it will form a central element of the NCBO's BioPortal.
The Sequence Ontology (SO) at www.sequence.org/ is a collaborative project for the definition of sequence features used in biological sequence annotation. For example, an X element combinatorial repeat is a repeat region located between the X element and the telomere or adjacent Y element.
The Gene Ontology (GO) is a controlled vocabulary that connects each gene to one or more functions. The ontology is intended to categorize gene products rather than the genes themselves. Different products of the same gene may play very different roles, and labelling and treating all of these functions under the same gene name may (and often does) lead to confusion.
Databases More to come later