BIOINFORMATICS AN OVERVIEW

T.R. Sharma
Genoinformatics Lab, National Research Centre on Plant Biotechnology, I.A.R.I., New Delhi 110012
trsharma@nrcpb.org

Introduction

Bioinformatics is the computational analysis of biological data, chiefly the information stored in the form of DNA and protein sequences in various biological databases. The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics which assess relationships among members of large data sets, the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."

Analyses in bioinformatics focus on three types of datasets: genome sequences, macromolecular structures, and functional genomics experiments (e.g. microarray data). However, bioinformatics tools are also applied to various other data, e.g. phylogenetic and metabolic pathway analysis, the text of scientific papers, and plant varietal information and statistics.

Analysis of biological data requires a large number of techniques, such as primary sequence alignment, protein 3D structure alignment, phylogenetic tree construction, prediction and classification of protein structure, prediction of RNA structure, prediction of protein function, and expression data clustering. Development of suitable algorithms is therefore an important part of bioinformatics. Many techniques and algorithms were developed specifically for the analysis of biological data; for instance, the dynamic programming algorithm for sequence alignment is among the methods most widely used by biologists.
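
To make the dynamic programming idea mentioned above concrete, the following is a minimal sketch of a global (Needleman-Wunsch style) alignment score in Python. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; they are not taken from the text, and production tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of sequences a and b.

    Illustrative scoring scheme (an assumption, not from the source text):
    +1 per match, -1 per mismatch, -2 per gap position.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            # Either align a[i-1] with b[j-1], or place a gap in one sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "ACT"))
```

Each cell of the table depends only on its three neighbours, which is why the whole alignment can be scored in time proportional to the product of the two sequence lengths.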
The sequence information generated worldwide is stored systematically in different types of databases. It is therefore necessary to understand what databases are and which types exist.

What is a database?

A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions. A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record in a nucleotide sequence database typically contains information such as a contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.
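
The record structure described above can be sketched as a small Python data class with one simple query. All names here (`SequenceRecord`, `find_by_organism`, the field names) are hypothetical, and the two example records, including their accession labels and sequences, are invented placeholders rather than real database entries.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record of a simple sequence database (hypothetical field names)."""
    accession: str          # placeholder identifier, not a real accession
    contact_name: str
    molecule_type: str      # e.g. "DNA" or "mRNA"
    organism: str           # scientific name of the source organism
    sequence: str
    citations: list = field(default_factory=list)


def find_by_organism(records, organism):
    """Return every record whose source organism matches, ignoring case."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Invented placeholder records, for illustration only.
records = [
    SequenceRecord("EXAMPLE001", "A. Researcher", "DNA", "Oryza sativa", "ATGGCCATTG"),
    SequenceRecord("EXAMPLE002", "B. Researcher", "mRNA", "Zea mays", "ATGAAACCCG"),
]

if __name__ == "__main__":
    for rec in find_by_organism(records, "oryza sativa"):
        print(rec.accession, rec.organism, len(rec.sequence))
```

Because every record carries the same set of fields, a query is just a filter over the collection; real database systems add indexing so that such look-ups remain fast on millions of records.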

Divisions of DNA databases

Since the databases are growing rapidly, they have been further broken into divisions on the basis of the taxonomy of the organisms. The GenBank divisions fall into two general categories, organismal and functional. Sequences derived from specific organisms are stored in the organismal category, whereas the functional category includes divisions that are independent of taxonomic classification, e.g. EST, STS and HTG. Each GenBank division is identified by a three-letter code shown at the beginning of each sequence entry. For instance, the HTG (high throughput genome) division contains sequences generated from different organisms. These sequences are generally unfinished and are further classified as Phase 1 (sequences that are unfinished, unordered and contain gaps) and Phase 2 (sequences that are unfinished but ordered and contain a few gaps). Once a sequence is finished and all gaps are resolved (Phase 3), it is moved to its organismal division, e.g. PLN in the case of plants.

The huge wealth of information in the form of DNA and protein sequences and publications on molecular biology is stored in data banks (Fig. 1). The major public data banks that hold DNA and protein sequences are GenBank in the USA (http://www.ncbi.nlm.nih.gov), EMBL (European Molecular Biology Laboratory) in Europe (http://www.ebi.ac.uk/embl/) and DDBJ (DNA Data Bank of Japan) in Japan (http://www.ddbj.nig.ac.jp). The growth of DNA sequence data in GenBank is depicted in Fig. 1. This rapid growth has occurred because various collaborative international programmes were started during the past few years to sequence the complete genomes of various organisms. The whole genomes of various microorganisms have already been sequenced by The Institute for Genomic Research (TIGR) and can be seen on its website, www.tigr.org.
Large genomes such as human (3 billion bp), rice (450 Mb), Arabidopsis (130 Mb) and mouse (2.5 billion bp) have also been sequenced, and the data are in the public domain in GenBank. These DNA sequences now have to be used in meaningful ways for the welfare of mankind. The different types of sequences of important crops available in the public domain are listed in Table 1.

Fig. 1. Status of sequences submitted to GenBank (Source: NCBI)
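
The HTG phase scheme described above can be written down as a small decision rule. This is only a sketch of the classification as this text states it; the function name `htg_phase` and its boolean arguments are hypothetical and do not correspond to any GenBank tool or file field.

```python
def htg_phase(finished, ordered, has_gaps):
    """Return the HTG phase (1, 2 or 3) as described in the text above.

    Phase 1: unfinished, unordered, contains gaps.
    Phase 2: unfinished, ordered, contains a few gaps.
    Phase 3: finished, all gaps resolved (the record then moves to an
             organismal division such as PLN for plants).
    """
    if finished and not has_gaps:
        return 3
    if ordered:
        return 2
    return 1


if __name__ == "__main__":
    print(htg_phase(finished=False, ordered=False, has_gaps=True))
    print(htg_phase(finished=False, ordered=True, has_gaps=True))
    print(htg_phase(finished=True, ordered=True, has_gaps=False))
```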

Table 1. Different types of sequences of important crops available in the public domain*

Type of database in public domain | Plant species
Whole genome | Oryza sativa, Arabidopsis thaliana
Partial genome | T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, L. esculentum, V. vinifera, Poncirus trifoliata, Medicago truncatula, Lotus corniculatus
EST | Aegilops tauschii, Allium cepa, Arabidopsis thaliana, Avena sativa, Beta vulgaris subsp. vulgaris, Brassica napus, Brassica oleracea, Brassica rapa, Capsicum annuum, Coffea arabica, Glycine max, Gossypium arboreum, Gossypium hirsutum, Helianthus annuus, Hordeum vulgare, Lactuca sativa, Lolium perenne, Lotus corniculatus, Lycopersicon esculentum, Malus domestica, Medicago sativa, Medicago truncatula, Nicotiana benthamiana, Nicotiana tabacum, Oryza sativa, Phaseolus coccineus, Phaseolus vulgaris, Saccharum officinarum, Secale cereale, Solanum melongena, Solanum tuberosum, Sorghum bicolor, Triticum monococcum, Vitis vinifera, Zea mays
mRNA | T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, L. esculentum, V. vinifera, Medicago truncatula, L. corniculatus, O. sativa, A. thaliana
Protein | Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, V. vinifera, C. sinensis, M. truncatula, E. globulus, O. sativa, A. thaliana
BAC end | Oryza australiensis, O. brachyantha, O. glaberrima, O. granulata, O. latifolia, O. minuta, O. officinalis, O. punctata, O. ridleyi, O. rufipogon, O. schlechteri, G. hirsutum

* Source: NCBI

Divisions of Protein databases

Protein sequences are mainly stored in two databases, EMBL and GenBank. Swiss-Prot, a very well maintained and curated database, was established at the Swiss Institute of Bioinformatics. Although it is a small database, it carries important annotations that are freely available to academic users. GenBank created PIR, a protein database, as a translation of GenBank.
The PIR database is further subdivided into four sections, PIR1, PIR2, PIR3 and PIR4, on the basis of the degree of annotation.

DNA Sequence Analysis

With the advent of the internet and various web browsers, bioinformatics tools are now easily available to biologists. These tools are indispensable for any genome sequencing centre. The analysis of DNA sequences starts as soon as they come off the sequencing machines. The first and foremost task of a biologist is to check the accuracy of the sequence obtained from the machine. One way is to locate the cloning sites of the insert in the sequencing vector. If the insert is a PCR product, one should look for the primer sequences used to amplify that product. One can then perform a Basic Local Alignment Search Tool (BLAST) search against the DNA sequence database in GenBank and examine the probable matches. If the unknown sequence shows hits with any sequence from the same or a related organism, it is considered a true sequence. These are the basic steps, which can be performed manually if the dataset is very small or if one has to deal with a single sequence or only a few. However, in large genome sequencing projects one has to handle thousands of sequences at a time.

Searching for Sequence Alignment

Once a high quality sequence is obtained, one has to ask an important question: is this a new sequence, or is it similar to other DNA sequences available in the databases? To answer this question, one has to perform a database search for sequence comparison. All sequence searching methods rely on the basic concepts of alignment and distance between sequences, and pairwise sequence alignment is performed. There are different algorithms for global and local alignment (Fig. 2). In global alignment, the input sequence is aligned over its complete length with sequences available in the databases, whereas in local alignment only the most similar segments of the input sequence are aligned with the database sequences.

Comparison of a DNA or protein sequence against a database is one of the most important and powerful tools of bioinformatics. This type of comparison is generally performed with two programmes, BLAST and FASTA, which compare an unknown sequence against a sequence database. BLAST finds the best local alignments between the unknown sequence and the database by using an approach based on matching short sequence fragments together with a powerful statistical model. FASTA, in contrast, uses an approximation method that concentrates only on the significant alignments. In the BLAST search output, Expected (E) values and bit scores are reported so that the significance of a match between the unknown sequence and the database sequences can be judged (Fig. 3). The significance of a BLAST hit is very important for the interpretation of results. As a rough guide, about 67% identity at the DNA level can correspond to 100% identity at the protein level, since the third codon position can often vary without changing the amino acid. It is also suggested that at least 75% sequence identity between two sequences should be observed before a hit is considered significant.

Fig. 2. Global and local alignments between two DNA sequences
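
To illustrate the difference between global and local alignment and the link between bit scores and E values discussed in this section, here is a minimal Python sketch. The function names and the scoring values are assumptions chosen for illustration. `local_alignment_score` is a plain Smith-Waterman style scoring routine, not the word-matching heuristic that BLAST itself uses, and `expect_value` applies the standard relation E = m × n × 2^(-S') between a bit score S', the query length m and the database length n, ignoring the effective-length corrections that real BLAST applies.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of segments of a and b (Smith-Waterman style).

    Unlike global alignment, a cell is never allowed to drop below zero,
    so poorly matching flanking regions do not penalise a good segment.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score.

    Simplified: uses raw lengths rather than BLAST's effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The shared segment ACGTACGT is found despite unrelated flanks.
    print(local_alignment_score("TTTTACGTACGT", "GGACGTACGTCC"))
    # A higher bit score gives a smaller E value for the same search space.
    print(expect_value(40, 500, 10**9), expect_value(60, 500, 10**9))
```

The E value grows with the size of the search space, which is why the same alignment becomes less significant as the database grows.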

Fig. 3. BLAST output showing bit scores and E values after a similarity search

Gene Prediction and Annotation

Simply determining the order of the four letters (A, T, G, C) in the DNA of an organism has little value until some meaning is derived from it by gene prediction. Gene prediction is complex work, and no algorithm can predict the true exons in a DNA sequence exactly. Two major considerations apply when predicting a gene: 1) identification of structural elements of the unknown sequence, such as start and stop codons and splice sites, and 2) homology searches against protein, EST and cDNA databases to identify potential coding regions. A very commonly used gene prediction program is GENSCAN, developed at MIT, USA (http://www.genes.mit.edu/genscan.html), which is freely available on the Web and allows online analysis of DNA sequences. The output obtained from GENSCAN is then used for gene annotation: BLAST is used to search public or private DNA sequence databases for matches between the unknown query sequence and the millions of sequences available in GenBank. BLAST is available from NCBI's home page at http://www.ncbi.nlm.nih.gov, where searches can be performed using various criteria and options (Fig. 4).
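
Returning to the first consideration in gene prediction, the identification of start and stop codons, the idea can be sketched as a simple open reading frame (ORF) scan. This is a deliberately naive illustration: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the forward strand is scanned, and introns and splice sites are ignored, so it is in no way a substitute for a probabilistic gene finder such as GENSCAN.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) positions of forward-strand ORFs.

    An ORF runs from the first ATG in a reading frame to the next in-frame
    stop codon. `end` is exclusive and includes the stop codon, so
    dna[start:end] is the full ORF. `min_codons` counts the start and stop
    codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next ATG
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGG"   # invented test sequence
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```

In practice the candidate regions found this way would then be checked against protein, EST and cDNA databases, which is the second consideration listed above.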

Fig. 4. Performing a BLAST search from the NCBI home page

Primer Design

Another important use of genome sequence data, once genes have been predicted, is the design of primers for PCR or for sequencing. Such primers are used to amplify genes or their alleles from known sources so that the best use can be made of them. Although the PRIME program within the GCG package is mainly used for this purpose, PRIMER3, a web-based program (www-genome.wi.mit.edu/genome_software/other/primer3.html), is also commonly used for designing primers. PCR primer pairs are designed to amplify a well-defined target sequence from the template. Some of the important considerations while designing primers are the GC content, the melting temperature, the primer size, and the size of the PCR product to be amplified. These parameters can be left at their default settings or changed as required.

Phylogenetic Analysis

Once a similarity search has been performed between the unknown sequence and the database sequences to find the per cent similarity between them, the natural next question is how these sequences are related to each other. Sequences derived from two closely related organisms show more similarity at the DNA level, and sequences from distantly related organisms show more dissimilarity. To find the evolutionary relationship among sequences derived from different organisms, a phylogenetic tree is constructed (Fig. 5). Such an evolutionary tree can be constructed on the basis of phenotypic markers, molecular markers or sequence information. A typical phylogenetic tree is composed of nodes, branches and the termini of the branches. When all the branches emerge from a common node, that node is termed the root of the tree; some trees, however, are constructed as unrooted trees because the common evolutionary origin is not known. For constructing a phylogenetic tree, the PILEUP option of the GCG package is most commonly used. DNASTAR software (www.dnastar.com) also has options for constructing trees from different DNA or protein sequences, and web-based tools like MacClade (www.phylogeny.arizona.edu/macclade/) can be used for evolutionary studies of different organisms based on their DNA sequences. Similarly, bioinformatics tools can be used for protein function analysis by database search. SSR and SNP markers can be found in silico in EST or genome sequences by using different algorithms, which will also be discussed in the presentation.

Fig. 5. Phylogenetic analysis of resistance gene analogue sequences (sk21, sk95, sk10, sk3, sk76, sk101 and sk65) obtained from rice and known resistance gene sequences (L6, M, N, RPS2 and Xa1) isolated from different crops. Analysis was performed with DNASTAR software.

Conclusions

In functional genomics, gene expression at the whole-genome level under different stresses can be studied by using microarrays. Nowadays such gene expression databases are being prepared for different organisms and even for different tissues. Bioinformatics tools are helpful in locating DNA sequences in GenBank simply by entering accession numbers, aligning two or more sequences, performing similarity searches for unknown sequences in GenBank, assembling short sequence reads and developing consensus sequences, finding genes and markers in silico, and performing comparative analysis of different genomes.

Selected References and Web Resources

Sobral, B.W.S. 1997. Common language of bioinformatics. Nature 389:418.
Brown, S.M. 2000. Bioinformatics: A Biologist's Guide to Biocomputing and the Internet. Eaton Publishing, Natick, MA, USA.
Baxevanis, A.D. and Ouellette, B.F.F. 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Second Edition. John Wiley and Sons, Inc., New York.
GENSCAN: http://genes.mit.edu/genscan.html
FGENESH: http://www.softberry.com/berry.phtml