BIOINFORMATICS IN BIOCHEMISTRY

Bioinformatics is a field at the interface of molecular biology, computer science, and mathematics. It focuses on the analysis of molecular sequences (DNA, RNA, and proteins).

The National Institutes of Health (NIH) definition of bioinformatics: "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data."

How is bioinformatics important to biochemistry?

The tools of bioinformatics include algorithms and computer programs for analyzing molecular sequences, and these analyses help reveal the structure and function of macromolecules. Bioinformatics analysis gives valuable information that can guide experimental work.

AMINO ACID SEQUENCE ALIGNMENT

A way to compare 2 or more sequences. The sequences are lined up ("aligned"), one above the other, so that each residue of one sequence can be compared to the corresponding residue of the other sequence. Sometimes one sequence must be cut, and a gap introduced, so that it aligns optimally with the other sequence.

An example of a pairwise amino acid sequence alignment (2 sequences):

sequence_1   1 MLFMCHQRVMKKEAEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA  50
                            .|||||..:           ||:::||||.||||||.
sequence_2   1              MEEKLKKTK-----------IIFVVGGPGSGKGTQCE  26

Residues that are identical in the two sequences are indicated with the | symbol between them; residues that are chemically similar are indicated with the : or . symbol, such as W and F (both have aromatic side chains). Note that a gap (the ----------- region) was introduced into sequence_2 in order to make it align optimally with sequence_1.
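The marker line between two aligned sequences can be generated mechanically. The following is a minimal Python sketch, not the program that produced the alignment above: the function name `match_line` is hypothetical, and it marks only identical residues with `|`. Reproducing the `:` and `.` similarity marks would require an amino acid substitution matrix, which the slide does not specify. Only the aligned region is used (residues 14 to 50 of sequence_1 and 1 to 26 of sequence_2).

```python
def match_line(seq_a: str, seq_b: str) -> str:
    """Return a marker line: '|' where residues are identical, ' ' elsewhere.

    Both inputs must already be aligned (equal length, '-' for gap positions).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(seq_a, seq_b)
    )


# Aligned region from the example alignment on this slide.
seq_1 = "AEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA"
seq_2 = "MEEKLKKTK-----------IIFVVGGPGSGKGTQCE"

markers = match_line(seq_1, seq_2)
identical = markers.count("|")

print(seq_1)
print(markers)
print(seq_2)
print(f"{identical} identical residues in {len(markers)} aligned columns")
```

Counting the `|` marks gives the number of identical residues, the same quantity that BLAST reports as "Identities".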

BLAST (Basic Local Alignment Search Tool)

A bioinformatics tool that allows users to compare a protein or DNA sequence to databases of other protein or DNA sequences from many organisms.

A web-based version is available free of charge at the National Center for Biotechnology Information (NCBI) website: http://www.ncbi.nlm.nih.gov/blast/

The output from a BLAST search is a series of sequence alignments.

EXAMPLE OF A BLAST SEARCH

Suppose you have the sequence of a human protein and want to know whether there is a homologous protein in the fruit fly Drosophila melanogaster. The amino acid sequence of the human protein will be the query for the BLAST search. The BLAST algorithm compares the query sequence to all proteins encoded in the Drosophila genome.

The BLAST output will show a list of the Drosophila proteins that have statistically significant sequence similarity to the human query protein. These Drosophila proteins are referred to as BLAST hits.

Below this list of BLAST hits, there will be a series of sequence alignments, one between the human query protein and each Drosophila protein in the list. The first alignment will be between the query and the Drosophila protein that is most similar in sequence; the second alignment will be between the query and the Drosophila protein that is the second-best match in terms of sequence similarity, and so on.

The next slide shows just one of these alignments from a BLAST search. The last 2 slides explain some of the features of the alignment.

Query = a human protein
Subject (Sbjct) = the Drosophila protein that is most similar to this human protein

Sample from BLAST output (see explanation on next 2 slides):

>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
Length = 229

Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)

Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Query: 51  SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG 110
           S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI
Sbjct: 75  SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA 134

Query: 111 QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG 168
              L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +
Sbjct: 135 PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT 194

Query: 169 IVRKVNAEGSVDSVFSQVCTHLDAL 193
           +   +NAE  VD +F +V   +D +
Sbjct: 195 LT--INAERDVDDIFLEVVQAIDCV 217
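The two summary lines in the sample output follow a regular pattern, so their numbers can be pulled out with a few regular expressions. This is a minimal Python sketch under the assumption that the text looks exactly like the sample on this slide; `parse_summary` is a hypothetical helper, not part of BLAST, and other BLAST versions may format these lines differently.

```python
import re

# Summary lines copied from the sample BLAST output on this slide.
SUMMARY = (
    "Score = 179 bits (453), Expect = 1e-45 "
    "Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)"
)


def parse_summary(text: str) -> dict:
    """Extract the scores and the three count/length pairs from summary text."""
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    expect = re.search(r"Expect\s*=\s*([\d.eE+-]+)", text)
    if score is None or expect is None:
        raise ValueError("summary text is not in the expected format")
    result = {
        "bit_score": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "expect": float(expect.group(1)),
    }
    for name in ("Identities", "Positives", "Gaps"):
        found = re.search(rf"{name}\s*=\s*(\d+)/(\d+)", text)
        if found is None:
            raise ValueError(f"{name} not found in summary text")
        # Store (count, alignment length), e.g. (96, 205).
        result[name.lower()] = (int(found.group(1)), int(found.group(2)))
    return result


stats = parse_summary(SUMMARY)
for key, value in stats.items():
    print(key, value)
```

Keeping each count together with the alignment length makes it easy to recompute the percentages shown in parentheses.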

First you will see sequence identification information for the subject (Drosophila) protein in the alignment. This protein is called "Adenylate kinase-1":

>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]

Next you will see the total length of the subject protein, 229 amino acid residues:

Length = 229

Looking at the sequence alignment itself, you will see that it wraps around, taking up 3 ½ rows. The first row is shown at the bottom of this slide. Residues 2 to 193 of the query protein are aligned with residues 15 to 217 of the Drosophila protein (see the numbers on the left and right sides of the alignment on the previous slide).

The middle line of each row (the line between the Query and Sbjct lines) is called the consensus sequence. Whenever a residue is identical in the query protein and the subject protein, its one-letter code is shown in this middle line. Whenever the two residues are chemically similar (a conservative substitution), the position is marked with a + symbol. Positions where the residues are neither identical nor similar are left blank. If one of the sequences must be cut in order to align it with the other, this is indicated with - symbols in that sequence. This is referred to as a gap in the alignment.

Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74
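The middle-line conventions described above can be tallied directly. The following Python sketch is an illustration only (the `tally` function and the `ROWS` layout are hypothetical, not BLAST code): it counts letters in the middle line as identities, letters plus + symbols as positives, and - symbols in either sequence as gap positions, using the four rows of the sample alignment.

```python
# Each row is (query, middle line, subject), copied from the sample alignment.
ROWS = [
    ("EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG",
     "EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG",
     "EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG"),
    ("SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG",
     "S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI",
     "SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA"),
    ("QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG",
     "   L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +",
     "PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT"),
    ("IVRKVNAEGSVDSVFSQVCTHLDAL",
     "+   +NAE  VD +F +V   +D +",
     "LT--INAERDVDDIFLEVVQAIDCV"),
]


def tally(rows):
    """Return (identities, positives, gaps, alignment_length) for the rows."""
    identities = positives = gaps = length = 0
    for query, middle, sbjct in rows:
        if len(query) != len(sbjct):
            raise ValueError("query and subject rows must have equal length")
        length += len(query)
        # A '-' in either sequence is one gap position.
        gaps += query.count("-") + sbjct.count("-")
        for symbol in middle:
            if symbol.isalpha():      # identical residue
                identities += 1
                positives += 1
            elif symbol == "+":       # conservative substitution
                positives += 1
    return identities, positives, gaps, length


identities, positives, gaps, length = tally(ROWS)
print(f"Identities = {identities}/{length}")
print(f"Positives = {positives}/{length}")
print(f"Gaps = {gaps}/{length}")
```

The totals agree with the summary reported in the sample output: 96 identities, 131 positives, and 15 gap positions over 205 aligned positions.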

Just above the sequence alignment itself you will see statistical information for the alignment (essentially telling you how similar the two sequences are):

Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)

This tells you that of the 205 aligned positions, 96 are identical between the query protein and the subject protein. Of the 205 aligned positions, 131 are either identical OR similar (shown as a letter or a + symbol in the middle line). 15 gap positions were introduced into the sequences (shown with the - symbol).

The expect value, or E-value (1 x 10^-45 in this case; a very small number!), is the number of alignments scoring at least this well that would be expected to occur by chance in a search of a database of the size that was searched. For values this small, it is effectively the probability of finding such a match by chance. The bottom line: the smaller the E-value, the less likely it is that the similarity arose by chance, and the stronger the evidence that the two sequences are related.
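The link between the bit score and the E-value can be made concrete with the standard Karlin-Altschul relationship, E = m x n x 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. The slides do not state this formula, so the Python sketch below is an illustration under that assumption. The function name and the lengths used are hypothetical, and BLAST itself uses corrected "effective" lengths, so its reported E-values will differ somewhat.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2 ** (-S'), with m = query_len and n = db_len.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to show how E scales.
QUERY_LEN = 200          # residues in the query (assumed)
DB_LEN = 10_000_000      # total residues in the database (assumed)

for bits in (30, 50, 179):
    e_value = expect_value(bits, QUERY_LEN, DB_LEN)
    print(f"bit score {bits:>3}: E = {e_value:.3g}")
```

Two properties follow from the formula: each additional bit of score halves the E-value, and searching a database twice as large doubles it. This is why the same alignment can be more or less significant depending on the size of the database that was searched.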