
BIOINFORMATICS: AN OVERVIEW

T.R. Sharma
Genoinformatics Lab, National Research Centre on Plant Biotechnology, I.A.R.I., New Delhi 110012
trsharma@nrcpb.org

Introduction

Bioinformatics is the computational analysis of biological data, consisting of the information stored in the form of DNA and protein sequences in various biological databases. The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics which assess relationships among members of large data sets, the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."

Analyses in bioinformatics focus on three types of datasets: genome sequences, macromolecular structures, and functional genomics experiments (e.g. microarray data). However, bioinformatics tools are also applied to various other data, e.g. phylogenetic and metabolic pathway analysis, the text of scientific papers, and plant varietal information and statistics. Analysis of biological data requires a large number of techniques, such as primary sequence alignment, protein 3D structure alignment, phylogenetic tree construction, prediction and classification of protein structure, prediction of RNA structure, prediction of protein function, and expression data clustering. Development of suitable algorithms is therefore an important part of bioinformatics. Many techniques and algorithms were developed specifically for the analysis of biological data; for instance, the dynamic programming algorithm for sequence alignment is among the methods most widely used by biologists.
The sequence information generated worldwide is stored systematically in different types of databases. Hence, it is necessary to understand what databases are and which types exist.

What is a database?

A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions. A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record in a nucleotide sequence database typically contains information such as a contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.
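The idea of a database as many records with the same set of fields, which a program can query, can be illustrated with a minimal Python sketch. The field names follow the record contents listed above; the class name, the helper function, the accession identifiers and the example entries are all hypothetical and do not correspond to any real database schema or real sequence entries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """One entry in a toy nucleotide sequence database (hypothetical schema)."""
    accession: str        # made-up identifier, not a real accession number
    contact_name: str     # person who submitted the sequence
    molecule_type: str    # description of the type of molecule, e.g. "DNA"
    organism: str         # scientific name of the source organism
    sequence: str         # the input sequence itself
    citations: List[str] = field(default_factory=list)  # literature citations


def find_by_organism(records, organism):
    """Return all records whose source organism matches (case-insensitive)."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# A "single file containing many records", here simply a Python list.
records = [
    SequenceRecord("DEMO0001", "A. Researcher", "DNA", "Oryza sativa", "ATGGCGTAA"),
    SequenceRecord("DEMO0002", "B. Researcher", "mRNA", "Zea mays", "ATGAAATGA"),
    SequenceRecord("DEMO0003", "A. Researcher", "DNA", "Oryza sativa", "ATGCCCTAG"),
]

hits = find_by_organism(records, "oryza sativa")
print([r.accession for r in hits])
```

Real sequence databases add indexing, persistence and update tools on top of this basic record-and-query model.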

Divisions of DNA databases

Because the databases are growing rapidly, they have been broken into divisions on the basis of the taxonomy of the organisms. The GenBank divisions fall into two general categories, organismal and functional. Sequences derived from specific organisms are stored in the organismal divisions, whereas the functional divisions, e.g. EST, STS and HTG, hold sequences irrespective of their taxonomic classification. Each GenBank division is identified by a three-letter code given at the beginning of each sequence entry. For instance, the HTG (high throughput genome) division contains sequences generated from different organisms. These sequences are generally unfinished and are further classified as Phase 1 (unfinished, unordered sequences that contain gaps) and Phase 2 (unfinished, ordered sequences that contain a few gaps). Once a sequence is finished and all gaps are resolved (Phase 3), it is moved to a specific organismal division, e.g. PLN in the case of plants.

The huge wealth of information in the form of DNA and protein sequences and publications on molecular biology is stored in data banks (Fig.1). The major public data banks for DNA and protein sequences are GenBank in the USA (http://www.ncbi.nlm.nih.gov), EMBL (European Molecular Biology Laboratory) in Europe (http://www.ebi.ac.uk/embl/) and DDBJ (DNA Data Bank of Japan) in Japan (http://www.ddbj.nig.ac.jp). The growth of DNA sequence data in GenBank is depicted in Fig. 2. This rapid growth is due to the various collaborative international programmes started during the past few years to sequence the complete genomes of various organisms. The whole genomes of various microorganisms have already been sequenced by The Institute for Genomic Research (TIGR) and can be seen on its website, www.tigr.org.
Large genomes such as human (3 billion bp), rice (450 Mb), Arabidopsis (130 Mb) and mouse (2.5 billion bp) have also been sequenced, and the data are in the public domain in GenBank. These DNA sequences now have to be used in meaningful ways for the welfare of mankind. Different types of sequences of important crops available in the public domain are listed in Table 1.

Fig.1. Status of sequences submitted to GenBank (Source: NCBI)
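The three-letter division codes described in this section can be read programmatically from the first line of a sequence entry. The following is a minimal sketch, assuming a GenBank-style flat-file entry whose first line starts with LOCUS and carries the division code immediately before the date. The function name, the partial lookup table and the example line are illustrative assumptions, and DEMO0001 is not a real accession.

```python
# Partial, illustrative table of division codes named in the text;
# the descriptions are paraphrased, not official wording.
DIVISION_NAMES = {
    "PLN": "plants (organismal division)",
    "EST": "expressed sequence tags (functional division)",
    "STS": "sequence tagged sites (functional division)",
    "HTG": "high throughput genome sequences (functional division)",
}


def division_of(locus_line):
    """Return (code, description) for the division named on a LOCUS line."""
    tokens = locus_line.split()
    if len(tokens) < 3 or tokens[0] != "LOCUS":
        raise ValueError("expected a line starting with LOCUS")
    # Assumption: the division code is the token just before the date.
    code = tokens[-2]
    return code, DIVISION_NAMES.get(code, "division not in this partial table")


# Made-up entry line for illustration only.
example = "LOCUS       DEMO0001     1200 bp    DNA     linear   PLN 01-JAN-2000"
print(division_of(example))
```

Token-based parsing is used here for brevity; a production parser would more likely rely on an established library than on whitespace splitting.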

Table 1. Different types of sequences of important crops available in the public domain*

Type of database in public domain: Plant species

Whole genome: Oryza sativa, Arabidopsis thaliana

Partial genome: T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, L. esculentum, V. vinifera, Poncirus trifoliata, Medicago truncatula, Lotus corniculatus

EST: Aegilops tauschii, Allium cepa, Arabidopsis thaliana, Avena sativa, Beta vulgaris subsp. vulgaris, Brassica napus, Brassica oleracea, Brassica rapa, Capsicum annuum, Coffea arabica, Glycine max, Gossypium arboreum, Gossypium hirsutum, Helianthus annuus, Hordeum vulgare, Lactuca sativa, Lolium perenne, Lotus corniculatus, Lycopersicon esculentum, Malus domestica, Medicago sativa, Medicago truncatula, Nicotiana benthamiana, Nicotiana tabacum, Oryza sativa, Phaseolus coccineus, Phaseolus vulgaris, Saccharum officinarum, Secale cereale, Solanum melongena, Solanum tuberosum, Sorghum bicolor, Triticum monococcum, Vitis vinifera, Zea mays

mRNA: T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, L. esculentum, V. vinifera, Medicago truncatula, L. corniculatus, O. sativa, A. thaliana

Protein: Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, V. vinifera, C. sinensis, M. truncatula, E. globulus, O. sativa, A. thaliana

BAC end: Oryza australiensis, O. brachyantha, O. glaberrima, O. granulata, O. latifolia, O. minuta, O. officinalis, O. punctata, O. ridleyi, O. rufipogon, O. schlechteri, G. hirsutum

*Source: NCBI

Divisions of Protein databases

Protein sequences are mainly stored in two databases, EMBL and GenBank. Swiss-Prot, a very well maintained and curated database, was established at the Swiss Institute of Bioinformatics. Though it is a small database, it has important annotations which are freely available to academic users. PIR (the Protein Information Resource) is a further protein database, and translations of GenBank coding sequences are also available as protein records.
The PIR database is further subdivided into four sections, PIR1, PIR2, PIR3 and PIR4, on the basis of degree of annotation.

DNA Sequence Analysis

With the advent of the internet and web browsers, bioinformatics tools are now easily available to biologists. These tools are indispensable for any genome sequencing centre. The analysis of DNA sequences starts as soon as they come off the sequencing machine. The first task of the biologist is to check the accuracy of the sequence obtained from the machine. One way is to look for the cloning sites of the insert in the sequencing vector. If the insert is a PCR product, one should look for the primer sequences used in its amplification. One can then perform a Basic Local Alignment Search Tool (BLAST) search against the DNA sequence database in GenBank and examine the probable matches. If the unknown sequence shows hits with any sequence of the same or a related organism, it is considered a true sequence. These are the basic steps, which can be performed manually if the dataset is very small or if one has to deal with a single sequence or a few sequences. However, in large genome sequencing projects one has to handle thousands of sequences at a given time.

Searching for Sequence Alignment

Once a high-quality sequence is obtained, one has to ask an important question: is this a new sequence, or is it similar to other DNA sequences available in the databases? To answer this question, one has to perform a database search for sequence comparison. All sequence searching methods rely on the basic concepts of alignment and distance between sequences, and a pairwise sequence alignment is performed. There are different algorithms to perform global and local alignments (Fig.2). In global alignment, the input sequence is aligned over its entire length with the sequences available in the databases, whereas in local alignment only the most similar segments of the input sequence are aligned with the database sequences. Sequence comparison (DNA/protein) against a database is one of the most important and powerful tools of bioinformatics. It is generally performed with two programmes, BLAST and FASTA, which compare an unknown sequence against a sequence database. BLAST finds the best local alignments between the unknown sequence and the database by using an approach based on matching short sequence fragments and a powerful statistical model, whereas FASTA uses a method of approximation that tries to concentrate only on significant alignments. In the BLAST search output, Expect (E) values and bit scores are reported to help determine whether the match between the unknown sequence and a database sequence is significant (Fig.3). The significance of a BLAST hit is very important for the interpretation of results. Because of the degeneracy of the genetic code, coding sequences that are only about 67% identical at the DNA level can, in the extreme case, still encode proteins that are 100% identical.
It is also suggested that at least 75% sequence identity between two sequences should be observed before a hit is considered significant.

Fig.2. Global and local alignments between two DNA sequences
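The difference between global and local alignment can be made concrete with a small dynamic programming sketch in Python. It computes only the best alignment score, in the style of the classical Needleman-Wunsch (global) and Smith-Waterman (local) methods, together with a simple per cent identity for two already-aligned strings. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration. BLAST and FASTA do not run this full calculation against every database sequence; they use faster heuristics.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False scores a global alignment (both sequences end to end);
    local=True scores the best-matching pair of segments only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


def percent_identity(aligned_a, aligned_b):
    """Per cent of identical columns in two equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return 100.0 * same / len(aligned_a)


query, subject = "TTTACGTTTT", "GGACGTGG"
print("global:", alignment_score(query, subject))
print("local: ", alignment_score(query, subject, local=True))
print("identity:", percent_identity("ACGT", "ACGA"))
```

For the two example sequences the local score is higher than the global one, because only the shared ACGT segment aligns well while the flanking regions do not.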

Fig.3. BLAST output showing bit scores and E values after a similarity search

Gene Prediction and Annotation

Simply determining the four letters (A, T, G, C) of the DNA sequence of an organism has no value until some meaning is derived from it by gene prediction. Gene prediction is complex work, and there is no algorithm which can exactly predict the true exons in a DNA sequence. Basically, two major considerations are taken into account while predicting a gene: 1) identification of structural elements such as start/stop codons and splice sites in the unknown sequence, and 2) a homology search against protein, EST and cDNA databases to identify potential coding regions. A very commonly used gene prediction software is GENSCAN, developed at MIT, USA (http://www.genes.mit.edu/genscan.html), which is freely available on the web and allows online analysis of DNA sequences. The output obtained from GENSCAN is then used for gene annotation, using BLAST to search public or private DNA sequence databases for matches between the unknown query sequence and the millions of sequences available in GenBank. BLAST is available from NCBI's home page (http://www.ncbi.nlm.nih.gov), which performs searches using various criteria and options (Fig.4).
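The first consideration above, identifying structural elements such as start and stop codons, can be sketched as a simple open reading frame (ORF) scan. The Python sketch below is only an illustration of that one step. It scans the forward strand in three reading frames, ignores splice sites and the reverse strand, and is not how GENSCAN works, since GENSCAN uses a probabilistic model of gene structure. The function name and the `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Find ORFs on the forward strand in all three reading frames.

    Returns a list of (start, end, frame) tuples, where seq[start:end]
    runs from an ATG start codon through the first in-frame stop codon.
    min_codons is the minimum number of codons before the stop codon,
    counting the ATG itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


example = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])
```

Candidate regions found this way would still need the second step described above, a homology search against protein, EST and cDNA databases, before being treated as potential genes.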

Fig. 4. Performing a BLAST search at the NCBI home page

Primer Design

Another important use of genome sequence data, once genes have been predicted, is the design of primers for either PCR or sequencing. Such primers are used to amplify a gene or its alleles from known sources. Though the PRIME software within the GCG package has mainly been used for this purpose, PRIMER3, a web-based software (www-genome.wi.mit.edu/genome_software/other/primer3.html), is now commonly used for designing primers. PCR primer pairs are designed to amplify a well-defined target sequence from the template. Some of the important considerations while designing primers are the GC content, melting temperature, primer size, and size of the PCR product to be amplified. These parameters can be left at their default settings or changed as required.

Phylogenetic Analysis

Once a similarity search has been performed between the unknown sequence and the database sequences to find the per cent similarity between them, the natural next question is how these sequences are related to each other. Sequences derived from two closely related organisms show more similarity at the DNA level, and those from distantly related organisms show more dissimilarity. To find the evolutionary relationship among sequences derived from different organisms, a phylogenetic tree is constructed (Fig.5). Such an evolutionary tree can be constructed on the basis of phenotypic markers, molecular markers or sequence information. A typical phylogenetic tree comprises nodes, branches and the termini of the branches. When all the branches emerge from a common node, that node is termed the root of the tree. Some trees, however, are constructed as unrooted trees, where the common evolutionary point is not known. For constructing a phylogenetic tree, the PILEUP option of the GCG package is commonly used. Besides, DNASTAR software (www.dnastar.com) also has options to construct trees from different DNA or protein sequences. Tools available on the web, such as MacClade (http://www.phylogeny.arizona.edu/macclade/), can also be used for evolutionary studies of different organisms based on their DNA sequences. Similarly, bioinformatics tools can be used for protein function analysis by database search. Finding SSR markers and SNP markers in EST or genome sequences can be performed in silico by using different algorithms, which will also be discussed in the presentation.

Fig. 5. Phylogenetic analysis of resistance gene analogue sequences (sk21, sk95, sk10, sk3, sk76, sk101 and sk65) obtained from rice and known resistance gene sequences (L6, M, N, RPS2 and Xa1) isolated from different crops. Analysis was performed with DNASTAR software.

Conclusions

In functional genomics, gene expression at the whole-genome level under different stresses can be studied by using microarrays. Nowadays, gene expression databases of this type are being prepared for different organisms and even for different tissues. Bioinformatics tools are helpful in locating DNA sequences in GenBank simply by entering accession numbers, aligning two or more sequences, performing similarity searches for unknown sequences in GenBank, assembling short sequence reads and developing consensus sequences, finding genes and markers in silico, and performing comparative analysis of different genomes.

Selected References and Web Resources

Sobral, B.W.S. 1997. Common language of bioinformatics. Nature 389:418.
Brown, S.M. 2000. Bioinformatics: A Biologist's Guide to Biocomputing and the Internet.
Eaton Publishing, Natick, MA, USA.
Baxevanis, A.D. and Ouellette, B.F.F. 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Second Edition. John Wiley and Sons, Inc., NY.
GENSCAN: http://genes.mit.edu/genscan.html
FGENESH: http://www.softberry.com/berry.phtml