Viewing the Proteome from Oligopeptides and Prediction of Protein Function

Genome Informatics 16(2), 2005

Hisayuki Horai (1, 2), Kouichi Doi (1), Hirofumi Doi (1, 2)
(1) Nara Institute of Science and Technology, Takayama-cho, Ikoma, Nara 63-92, Japan
(2) Celestar Lexico-Sciences, Inc., MTG D7, -3 Nakase, Mihama-ku, Chiba 26-85, Japan

Abstract

Building a lexicon of relatively short oligopeptides has been one of our first steps toward viewing the proteome from the perspective of oligopeptides. As an application of the lexicon, we propose a new method for predicting protein function, especially Gene Ontology terms (GO terms), based on statistical characteristics of oligopeptides. In the lexicon, a known function of a protein is inherited by its oligopeptides, and the correspondence between oligopeptides and the function is calculated over all proteins. Our method then uses this correspondence to predict unknown functions of proteins automatically. We measured the prediction performance for several GO terms with recall-precision graphs, using the 28,52 whole human proteins registered in RefSeq. The GO terms include membrane, nucleus, ATP binding, hydrolase activity, GTP binding, intracellular signaling cascade and ubiquitin cycle. In most cases, the method scores 7% recall with 8% precision. The prediction for ATP binding and GTP binding performs particularly well: it scores 8% recall with 8% precision. Even in the worst case (ubiquitin cycle), it scores 62.6% recall with 8% precision. These results suggest that the proposed method is quite efficient for predicting GO terms.

Keywords: oligopeptide, prediction, protein function, proteome

1 Introduction

Proteomics is the large-scale study of proteins, particularly their structures, functions and existence. Identifying the function of each newly determined sequence by means of bioinformatics techniques is one of the most important problems in proteomics [5, 6].
Although the function of each protein can be described across a wide spectrum and predicted from different properties, we focus on the prediction of Gene Ontology (GO) terms [1] based on sequence. Methods proposed for this problem are based on homology search [6] and pattern matching [5]. Methods based on homology search focus on the similarity of a relatively long subsequence or of the full-length sequence. In many cases, a protein is related to several GO terms. When a new protein is homologous to such a multifunctional protein, it is difficult to determine which of those GO terms should be annotated to it. Consequently, further investigation of the protein at the level of shorter subsequences is needed after homology search. Methods based on pattern matching focus on the similarity of a relatively short subsequence. These are conservative methods that rely on similarity to clearly defined protein families whose members are annotated with GO terms. It is difficult to predict all GO terms this way because many GO terms have not yet been related to any family.

An oligopeptide is a subsequence of fixed length. For example, in the 28,52 whole human proteins registered in RefSeq (Reference Sequence of the National Center for Biotechnology Information, dated 3-May-25), there exist 2,36,75 kinds of oligopeptides of length 5. The occurrence of oligopeptides shows quite interesting characteristics [2]: (1) some oligopeptides occur commonly in many proteins while others occur unevenly; (2) some oligopeptides occur far more often than expected from the occurrence probabilities of their component amino acids; and (3) many oligopeptides do not occur in the world of proteins at all (specificity of oligopeptide). Therefore, viewing the world of proteins from the perspective of oligopeptides will provide a new computational science of proteomics.

[Figure 1: Characteristic of oligopeptide. The oligopeptides of length 5 contained in an example protein are listed; one of them, ADIKT, occurs twice, and the duplicate-free set forms the Characteristic Oligopeptides of the protein.]

As one of the first steps of such computational proteomics, we propose a new method based on the concept of the lexicon of oligopeptides, which we have devoted much effort to constructing [2]. In our method, each protein is characterised by the oligopeptides it contains. In this paper we use oligopeptides of length 5, but this length is not mandatory. Our method is based on the co-occurrence and the specificity of oligopeptides. The longer the oligopeptide, the lower its co-occurrence, while in [5] the specificity of oligopeptides is not strongly observed for lengths less than 5. Our method predicts the GO terms annotated to a protein based on a set of already annotated proteins, called Annotated Proteins here. Every Annotated Protein is divided into a set of its oligopeptides, and each GO term annotated to the protein is regarded as related to all of its oligopeptides.
Finally, the correspondence between oligopeptides and GO terms in the Annotated Proteins is calculated. The correspondence between an oligopeptide and a GO term is the number of proteins that contain the oligopeptide and are annotated with the GO term. This correspondence is uniquely defined for each set of Annotated Proteins and is stored in a matrix, the PepGO Matrix. The correspondence between a new protein and each GO term is calculated from all oligopeptides in the protein and the PepGO Matrix.

To evaluate the prediction performance, we carried out several experiments. In the evaluation, we use measures from information retrieval research, such as recall, precision and f-measure [4]. These measures are effective for evaluating a score-based prediction method, and our method is regarded as a score-based method that uses the correspondence as the score. In a score-based method, the global behaviour of the performance over varied score thresholds is more important than the best performance at a specific threshold. This global behaviour is usually shown in a recall-precision graph.
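
The threshold sweep behind a recall-precision graph can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `recall_precision_points` and the five toy scores and labels are assumptions made up for the example. Items are ranked by descending score, the top i items are treated as the positive prediction for each i, and precision, recall and f-measure are computed at every cut-off.

```python
from typing import List, Tuple


def recall_precision_points(
    scores: List[float], labels: List[bool]
) -> List[Tuple[float, float, float]]:
    """Return (recall, precision, f_measure) for the top-i items, i = 1..N."""
    # Rank items by descending score; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for _, is_positive in ranked if is_positive)

    points = []
    true_positive = 0
    for i, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            true_positive += 1
        precision = true_positive / i  # TP / (TP + FP): i items are predicted
        recall = true_positive / total_positive if total_positive else 0.0
        if true_positive:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        points.append((recall, precision, f_measure))
    return points


# Toy data (hypothetical): five items with scores and true annotations.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False]

points = recall_precision_points(scores, labels)
best_f = max(f for _, _, f in points)
for recall, precision, f_measure in points:
    print(f"recall={recall:.3f} precision={precision:.3f} f={f_measure:.3f}")
print(f"maximum f-measure = {best_f:.3f}")
```

Connecting the printed (recall, precision) pairs in order gives the curve; the largest f-measure over all cut-offs summarises it in one number.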

2 Method

2.1 Characteristic Oligopeptide

In our method, each protein is characterised by a set of oligopeptides, called Characteristic Oligopeptides. The length of an oligopeptide is an arbitrary fixed number n. The Characteristic Oligopeptides of a protein are the set (without duplication) of all oligopeptides that occur in the protein. When the length of a protein is m, the number of Characteristic Oligopeptides is less than or equal to m - n + 1. If no oligopeptide is duplicated in the protein, the number of its Characteristic Oligopeptides is equal to m - n + 1. Figure 1 shows a simplified example. In this example, n is 5, m is 2, oligopeptides duplicate once, and 6 Characteristic Oligopeptides are obtained.

2.2 PepGO Matrix

The PepGO Matrix is a non-negative integer matrix. Each row is related to an oligopeptide, while each column is related to a GO term. Each cell denotes the number of Annotated Proteins that have the corresponding oligopeptide and are annotated with the corresponding GO term. The order of rows and columns is arbitrary because our method focuses neither on relations among oligopeptides nor on relations among GO terms, but only on the correspondence between oligopeptides and GO terms. In this subsection, we explain how to generate the PepGO Matrix MAT(Ps) from a set of Annotated Proteins Ps.

[Figure 2: Matrix generated from a protein. Protein P, annotated with GO terms T1, T2 and T3 in RefSeq, yields a matrix whose rows are its Characteristic Oligopeptides and whose columns are its GO terms.]

[Figure 3: PepGO matrix. The matrices generated from protein P (GO terms T1, T2 and T3) and protein P2 (GO terms T2 and T4) are accumulated into a single PepGO Matrix with columns T1 to T4.]
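
The definition of Characteristic Oligopeptides given above amounts to collecting every window of length n into a set. The sketch below illustrates this under stated assumptions: the function name `characteristic_oligopeptides` is invented for the example, and the 20-residue sequence is a hypothetical protein chosen so that exactly one pentapeptide (ADIKT) occurs twice; it is not taken from the experimental data.

```python
from typing import Set


def characteristic_oligopeptides(sequence: str, n: int = 5) -> Set[str]:
    """Return the duplicate-free set of all length-n windows of the sequence."""
    # A sequence shorter than n yields an empty range and thus an empty set.
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


# Hypothetical 20-residue protein in which ADIKT appears twice.
protein = "MADIKTGIFADIKTKRLNRA"
n = 5

ops = characteristic_oligopeptides(protein, n)
windows = len(protein) - n + 1  # upper bound m - n + 1

print(windows)      # 16 windows of length 5
print(len(ops))     # 15 distinct oligopeptides, because ADIKT is duplicated
print(sorted(ops))
```

Using a set makes the "without duplication" requirement automatic, and the bound m - n + 1 is reached only when no window repeats.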
We start with the generation of a PepGO Matrix from each Annotated Protein. Consider an Annotated Protein P, an element of Ps, annotated with several GO terms GO(P). Let the Characteristic Oligopeptides of P and the PepGO Matrix generated from P be OP(P) and Mat(P), respectively. The rows and the columns of Mat(P) are related to OP(P) and GO(P), respectively. The cells of Mat(P) are all one: Mat(P) denotes that all oligopeptides in P correspond with all GO terms annotated to P. Figure 2 shows a simplified example using the protein in Figure 1. In this example, P is annotated with 3 GO terms T1, T2 and T3.

After generating PepGO Matrices from all Annotated Proteins one by one, we accumulate them into a single PepGO Matrix MAT(Ps). The rows and the columns of MAT(Ps) are the set unions of the rows and the columns of all PepGO Matrices, respectively. The orders of the rows and the columns are arbitrary. For every cell of every PepGO Matrix, the value, i.e. 1, is added to the corresponding cell of MAT(Ps). Finally, a cell of MAT(Ps) is the number of proteins in Ps that have the corresponding oligopeptide and are annotated with the corresponding GO term. When an oligopeptide is not related to a GO term in any PepGO Matrix, the corresponding cell of MAT(Ps) is zero. Figure 3 shows a simplified example. In this example, Ps consists of only two proteins, P and P2. P is the protein in Figures 1 and 2. The numbers of OP(P), GO(P), OP(P2) and GO(P2) are 5, 3, 5 and 2, respectively. Some oligopeptides and GO terms occur in several Annotated Proteins, e.g. oligopeptide DIKIG and GO term T2 in Figure 3, and others occur only in a specific Annotated Protein.

We define some notation concerning the PepGO Matrix for the rest of this paper. MAT(Ps).cell(op, t) denotes the cell of MAT(Ps) corresponding to oligopeptide op and GO term t. MAT(Ps).row(op) denotes the sum of the cells in the row of MAT(Ps) corresponding to oligopeptide op.

2.3 Prediction of GO Term

In our method, the prediction of a GO term is the calculation of the correspondence between each protein and each GO term by means of MAT(Ps). For a given protein X and GO term T, let their correspondence calculated by means of MAT(Ps) be Cor(X, T, Ps). First, for every element of OP(X), i.e. every Characteristic Oligopeptide of X, the correspondence between that oligopeptide and T is calculated. Let the correspondence between oligopeptide op and GO term T be cor(op, T, Ps). We define cor(op, T, Ps) as MAT(Ps).cell(op, T) / MAT(Ps).row(op), because op can be related to several GO terms and the specificity of op for T should be taken into account. When op is related to many GO terms, cor(op, T, Ps) becomes relatively smaller than when op is related to only a few GO terms, even if MAT(Ps).cell(op, T) is large. Figure 4 shows a simplified example of calculating cor(op, T, Ps). In this example, MAT(Ps) is the PepGO Matrix in Figure 3, op is an oligopeptide contained in both proteins, and T is T2. MAT(Ps).cell(op, T) and MAT(Ps).row(op) are 2 and 5, respectively, so cor(op, T, Ps) becomes 2/5 = 0.4. After calculating the correspondence for all Characteristic Oligopeptides of X one by one, Cor(X, T, Ps) is calculated from the results.
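
The accumulation of per-protein matrices and the ratio cell/row described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the names `build_pepgo` and `cor_op` are invented, the matrix is stored as a nested dictionary instead of a dense array, and the two annotated sequences are hypothetical proteins constructed to share one pentapeptide (DIKTG).

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

PepGO = Dict[str, Dict[str, int]]


def oligopeptides(sequence: str, n: int = 5) -> Set[str]:
    """Duplicate-free set of length-n windows (Characteristic Oligopeptides)."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def build_pepgo(annotated: Iterable[Tuple[str, Set[str]]], n: int = 5) -> PepGO:
    """Accumulate one count per (oligopeptide, GO term) pair per protein."""
    matrix: PepGO = defaultdict(lambda: defaultdict(int))
    for sequence, go_terms in annotated:
        for op in oligopeptides(sequence, n):
            for term in go_terms:
                matrix[op][term] += 1  # every cell of Mat(P) contributes 1
    return matrix


def cor_op(matrix: PepGO, op: str, term: str) -> float:
    """cell(op, term) / row(op); 0.0 if op never occurs in the matrix."""
    row = matrix.get(op)
    if not row:
        return 0.0
    return row.get(term, 0) / sum(row.values())


# Hypothetical Annotated Proteins: (sequence, set of GO terms).
annotated = [
    ("MADIKTGIFADIKTKRLNRA", {"T1", "T2", "T3"}),
    ("MCDIKTGAA", {"T2", "T4"}),
]
mat = build_pepgo(annotated)

print(mat["DIKTG"]["T2"])           # 2: both proteins contain DIKTG and carry T2
print(sum(mat["DIKTG"].values()))   # 5: row sum over T1..T4
print(cor_op(mat, "DIKTG", "T2"))   # 0.4
```

Dividing by the row sum is what penalises oligopeptides that are spread over many GO terms: an oligopeptide seen with a single term scores 1.0 for it, whereas one shared among five term assignments scores at most 0.4 here.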
Fundamentally, we consider that the correspondence between a protein and a GO term is positively correlated with the sum of the correspondences between all Characteristic Oligopeptides of the protein and the GO term. Generally speaking, protein lengths differ, and length has a strong positive impact on this sum. Because we also intend to compare the correspondence of a GO term among proteins, we define Cor(X, T, Ps) as the sum of cor(op, T, Ps) over every element op of OP(X), divided by the number of elements of OP(X). Figure 5 shows a simplified example. In this example, MAT(Ps) is the PepGO Matrix in Figures 3 and 4. The number of OP(X) is 5. All oligopeptides of OP(X) except KTGIA, i.e. four of them, appear in MAT(Ps). Cor(X, T, Ps) becomes 7.

[Figure 4: Correspondence between oligopeptide and GO term, computed on the PepGO Matrix MAT(Ps) for GO term T2.]

[Figure 5: Correspondence between protein and GO term, computed for protein X and GO term T3 from the cor(op, T, Ps) values of the Characteristic Oligopeptides of X.]
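
The length-normalised score Cor(X, T, Ps) can be sketched as an average of per-oligopeptide ratios. The example below is a hedged illustration: the function names, the four matrix rows and the query sequence are all hypothetical values invented for the sketch, and an oligopeptide missing from the matrix is assumed to contribute 0 to the sum while still counting in the denominator.

```python
from typing import Dict, Set

PepGO = Dict[str, Dict[str, int]]


def oligopeptides(sequence: str, n: int = 5) -> Set[str]:
    """Duplicate-free set of length-n windows (Characteristic Oligopeptides)."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def cor_op(matrix: PepGO, op: str, term: str) -> float:
    """cell(op, term) / row(op); 0.0 when op is absent from the matrix."""
    row = matrix.get(op)
    if not row:
        return 0.0
    return row.get(term, 0) / sum(row.values())


def cor_protein(matrix: PepGO, sequence: str, term: str, n: int = 5) -> float:
    """Mean of cor_op over the Characteristic Oligopeptides of the sequence."""
    ops = oligopeptides(sequence, n)
    if not ops:
        return 0.0
    return sum(cor_op(matrix, op, term) for op in ops) / len(ops)


# Hypothetical excerpt of a PepGO Matrix: rows are oligopeptides,
# columns are GO terms, cells are protein counts.
pepgo: PepGO = {
    "MCDIK": {"T2": 1, "T4": 1},
    "CDIKT": {"T2": 1, "T4": 1},
    "DIKTG": {"T1": 1, "T2": 2, "T3": 1, "T4": 1},
    "IKTGI": {"T1": 1, "T2": 1, "T3": 1},
}

# Hypothetical query protein with five oligopeptides; KTGIA is not in the matrix.
query = "MCDIKTGIA"
score = cor_protein(pepgo, query, "T3")
print(round(score, 4))  # (0 + 0 + 1/5 + 1/3 + 0) / 5, about 0.1067
```

Dividing by the number of Characteristic Oligopeptides keeps scores of long and short proteins comparable, at the cost of diluting a few highly specific oligopeptides inside a long sequence.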

[Figure 6: Protein, oligopeptide and GO term in experiments. Whole Proteins: RefSeq without 'BJOUXZ' (28,52); oligopeptides of length 5: AAAAA, AAAAB, AAAAC, ..., YYYYS, YYYYW, YYYYY (2,36,75); GO terms extracted from the database entries (4,488); protein and GO term pairs (,924 pairs).]

3 Results

3.1 Outline of Experiment

We carried out several experiments to evaluate our method. First, we built a set of proteins from Reference Sequence (RefSeq) [3] of the National Center for Biotechnology Information, dated 3-May-25. This set includes all human proteins in RefSeq whose sequences are completely known, i.e. do not include B, J, O, U, X or Z. The number of these human proteins, called Whole Proteins, is 28,52. Figure 6 shows the Whole Proteins, the oligopeptides of length 5 and the GO terms of our experiments. 2,36,75 kinds of oligopeptides of length 5 are extracted from the Whole Proteins. The GO terms of each RefSeq protein are annotated in the database record of the protein. 4,488 kinds of GO terms and ,924 pairs of protein and GO term were extracted from the Whole Proteins.

Each experiment evaluates the prediction performance for a given GO term T. We chose GO terms that are annotated to relatively many proteins. For every protein X of the 28,52 Whole Proteins, we divide the Whole Proteins into X and the remainder Ps (28,59 proteins) and calculate Cor(X, T, Ps), i.e. the correspondence between X and T calculated by means of the other proteins. The obtained correspondence depends only on the proteins other than X; X itself does not affect it. After the calculation for all Whole Proteins, we characterise the prediction performance with measures from information retrieval research, namely precision (or accuracy), recall (or sensitivity) and f-measure (i.e. the harmonic mean of precision and recall), as follows.
P = TP / (TP + FP)
R = TP / (TP + FN)
F-measure = 2 P R / (P + R)

where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.

To draw the recall-precision graph for a GO term, we first sort the Whole Proteins in descending order of correspondence. For each i from 1 to the number of Whole Proteins, i.e. 28,52 in the experiments, we select the top i proteins. The top i proteins are regarded as a tentative prediction that uses Cor(the i-th protein, T, Ps) as the threshold. Using each tentative prediction, we calculate precision, recall and f-measure, and plot a point in the recall-precision graph. Finally, 28,52 points are plotted in the recall-precision graph, and we connect these points. We also obtain the maximum f-measure.

3.2 Prediction of GO Component

We carried out experiments for two GO terms: membrane [GO:0016020] and nucleus [GO:0005634]. The numbers of proteins annotated with membrane and nucleus are 2,549 and 4,28, respectively.

The recall-precision graph of prediction for membrane is shown in Figure 7. 6.7% of the proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 95.4%, 86.% and 6.% precision, respectively. With 8% and 5% precision, 7.7% and 83.4% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is 57.

The recall-precision graph of prediction for nucleus is shown in Figure 8. 4 (6.5%) proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 9%, 85.2% and 6% precision, respectively. With 8% and 5% precision, 7 and 84.8% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is .

[Figure 7: Recall-precision graph for membrane.]

[Figure 8: Recall-precision graph for nucleus.]

3.3 Prediction of GO Function

We carried out experiments for three GO terms: ATP binding [GO:0005524], hydrolase activity [GO:0016787] and GTP binding [GO:0005525]. The numbers of proteins annotated with ATP binding, hydrolase activity and GTP binding are ,655, ,2 and 47, respectively.

The recall-precision graph of prediction for ATP binding is shown in Figure 9. 2.% of the proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 97.%, 95.5% and 79.4% precision, respectively.
With 8% and 5% precision, 8.% and 62.2% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is 4.

The recall-precision graph of prediction for hydrolase activity is shown in Figure 10. 44 (3.%) proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 95.5%, 86.3% and 35.% precision, respectively. With 8% and 5% precision, 69.6% and 7.% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is 5.

The recall-precision graph of prediction for GTP binding is shown in Figure 11. 5 (2.5%) proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 87.9%, 85.6% and 83.8% precision, respectively. With 8% and 5% precision, 82.% and 89.9% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is .

[Figure 9: Recall-precision graph for ATP binding.]

[Figure 10: Recall-precision graph for hydrolase activity.]

[Figure 11: Recall-precision graph for GTP binding.]

3.4 Prediction of GO Process

We carried out experiments for two GO terms: intracellular signaling cascade [GO:0007242] and ubiquitin cycle [GO:0006512]. The numbers of proteins annotated with intracellular signaling cascade and ubiquitin cycle are 492 and 334, respectively.

The recall-precision graph of prediction for intracellular signaling cascade is shown in Figure 12. 8.7% of the proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 95.%, 87.5% and 63.5% precision, respectively. With 8% and 5% precision, 73.6% and 8.3% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is 69.

The recall-precision graph of prediction for ubiquitin cycle is shown in Figure 13. 9.8% of the proteins annotated with the GO term are predictable without false prediction. 5%, 66% and 8% of the proteins annotated with the GO term are predictable with 95.5%, 85.5% and 35.% precision, respectively. With 8% and 5% precision, 62.6% and 76.% of the proteins annotated with the GO term are predictable, respectively. The maximum f-measure is 5.

[Figure 12: Recall-precision graph for intracellular signaling cascade.]

[Figure 13: Recall-precision graph for ubiquitin cycle.]

4 Conclusion

We proposed a new method for predicting the functions of a protein, especially GO terms, based on statistical characteristics of oligopeptides. To evaluate the method, we carried out several experiments using the known annotation in RefSeq. In the experiments for GO component terms, the method scores 7% recall with 8% precision, and the maximum f-measure is 5. The results suggest that our method is efficient for predicting GO component terms. In the experiments for GO function terms, ATP binding and GTP binding score 8% recall with 8% precision, and the maximum f-measure is greater than . The results suggest that our method is quite efficient for predicting these GO function terms. In contrast, the prediction performance for GO process terms is mixed. The results for intracellular signaling cascade are almost as favorable, while ubiquitin cycle scores lower than the others (62.6% recall with 8% precision and f-measure = 5). We consider one of the reasons to be that far fewer proteins are annotated with this term than with the other GO terms.

Generally speaking, every recall-precision graph in the experiments has the following shape: it holds horizontally at relatively high precision from 0 to a certain value of recall, and then slants toward the lower right corner. Prediction performance mainly depends on the length of the horizontal part. GTP binding (see Figure 11) is an excellent instance. In a usual information retrieval system, this horizontal part is so short that the prediction performance is quite low. By the common standards of information retrieval, the experiments suggest excellent performance for our method.

Furthermore, we can explain the difference between ATP binding and GTP binding by the characteristics of the correspondence calculation in our method. Because the correspondence is normalised by the number of included oligopeptides, which is proportional to the length, each oligopeptide has a larger impact, and the characteristics of the prediction performance appear more clearly, for a short protein than for a long one. Because proteins annotated with GTP binding (approx. 46 amino acids on average) are much shorter than those annotated with ATP binding (approx. 9 amino acids on average), GTP binding gives better results than ATP binding.

We mentioned in the Introduction that there are methods for this problem based on homology search and pattern matching. The comparison of our method with these methods remains future work. The results in this paper suggest that our method is fundamentally quite effective for predicting a wide variety of GO terms, while they also reveal that the applicability may be substantial for some GO terms. We need further investigation of oligopeptides and improvement of the method. Future work includes investigating the oligopeptides that have a predominant positive or negative impact on prediction.

Acknowledgments

The authors would like to thank Dr. Tomohiro Mitsumori, a postdoctoral fellow of NAIST, for expert computational assistance and useful discussion.

References

[1] Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., Harris, M., Hill, D., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., Richardson, J., Ringwald, M., Rubin, G., and Sherlock, G., Gene ontology: Tool for the unification of biology, Nat. Genet., 25:25-29, 2000.
[2] Doi, H., Kitajima, M., Watanabe, I., Kikuchi, Y., Matsuzawa, F., Aikawa, S., Takiguchi, K., and Ohno, S., Diverse incidences of individual oligopeptides (dipeptidic to hexapeptidic) in proteins of human, baker's yeast, and Escherichia coli origin registered in the Swiss-Prot data base, Proc. Natl. Acad. Sci. USA, 92(7), 1995.
[3] Pruitt, K., Tatusova, T., and Maglott, D., NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res., 33:D501-D504, 2005.
[4] Salton, G., Automatic Text Processing - The Transformation, Analysis, and Retrieval of Information by Computer, Addison Wesley, 1989.
[5] Schug, J., Diskin, S., Mazzarelli, J., Brunk, B., and Stoeckert, C., Predicting gene ontology functions from ProDom and CDD protein domains, Genome Res., 12(4), 2002.
[6] Zehetner, G., OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms, Nucleic Acids Res., 31, 2003.


More information

The Two-Hybrid System

The Two-Hybrid System Encyclopedic Reference of Genomics and Proteomics in Molecular Medicine The Two-Hybrid System Carolina Vollert & Peter Uetz Institut für Genetik Forschungszentrum Karlsruhe PO Box 3640 D-76021 Karlsruhe

More information

The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches

The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches DOI 10.1186/s13742-015-0083-4 RESEARCH The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches Ishita K. Khan 1, Qing Wei 1, Samuel Chapman 3, Dukka

More information

BIOINFORMATICS Introduction

BIOINFORMATICS Introduction BIOINFORMATICS Introduction Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a 1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? (Molecular) Bio -informatics One idea

More information

Fundamentals of Bioinformatics: computation, biology, computational biology

Fundamentals of Bioinformatics: computation, biology, computational biology Fundamentals of Bioinformatics: computation, biology, computational biology Vasilis J. Promponas Bioinformatics Research Laboratory Department of Biological Sciences University of Cyprus A short self-introduction

More information

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments BLAST 100 times faster than dynamic programming. Good for database searches. Derive a list of words of length w from query (e.g., 3 for protein, 11 for DNA) High-scoring words are compared with database

More information

Biotechnology Explorer

Biotechnology Explorer Biotechnology Explorer C. elegans Behavior Kit Bioinformatics Supplement explorer.bio-rad.com Catalog #166-5120EDU This kit contains temperature-sensitive reagents. Open immediately and see individual

More information

user s guide Question 3

user s guide Question 3 Question 3 During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers.

More information

I nternet Resources for Bioinformatics Data and Tools

I nternet Resources for Bioinformatics Data and Tools ~i;;;;;;;'s :.. ~,;;%.: ;!,;s163 ~. s :s163:: ~s ;'.:'. 3;3 ~,: S;I:;~.3;3'/////, IS~I'//. i: ~s '/, Z I;~;I; :;;; :;I~Z;I~,;'//.;;;;;I'/,;:, :;:;/,;'L;;;~;'~;~,::,:, Z'LZ:..;;',;';4...;,;',~/,~:...;/,;:'.::.

More information

user s guide Question 3

user s guide Question 3 Question 3 During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers.

More information
