AN OVERVIEW OF GENE STRUCTURE & FUNCTION PREDICTION. Marcus Chibucos, Ph.D. University of Maryland School of Medicine June 2013


1 AN OVERVIEW OF GENE STRUCTURE & FUNCTION PREDICTION Marcus Chibucos, Ph.D. University of Maryland School of Medicine June 2013

2 Overview & goals. Understand: 1. How we predict the presence & structure of coding and non-coding genes in the genome. 2. How we know what a gene product does & how evidence is used to support this. When searching databases like FungiDB or InterPro, understand the meaning of terms like protein motif, domain, ortholog, HMM, EC, GO annotation, and so forth. Learn fundamentals with prokaryotes, then an overview of eukaryotes.

3 GENE STRUCTURAL ANNOTATION

4 What is a gene model? Yandell and Ence (2012) Nature Reviews Genetics. 13:

5 Fundamental methods of pattern detection. Intrinsic (ab initio/de novo, "from the beginning"): uses only DNA sequence and the inherent patterns within it; canonical features like start & stop codons. Extrinsic: uses additional sources of evidence, such as homologous proteins, mRNA (ESTs, RNA-Seq), and synteny.

6 Prokaryotic gene structure. Figure labels: DNA with promoter, RBS, ATG start, and TAG stop; mRNA with RBS, AUG start, and UAG stop; open reading frame (ORF). Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

7 Start with DNA sequence

8 Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013. DNA sequence has 6 translation frames: 3 on the forward strand, 3 on the reverse strand.

9 Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013. Graphical display of 6-frame translation. Each horizontal bar represents one of the translation frames. Tall vertical lines represent translation stops (TAG, TAA, TGA). Short vertical lines represent translation starts (ATG, GTG, TTG).

10 Graphical display of 6-frame translation. These are examples of the many ORFs in this graphic (figure labels: start, stop). What is an ORF?
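The six-frame scan described on these slides can be sketched in a few lines of Python. This is only an illustration, not the method of any particular gene finder: the function names, the 0-based coordinate convention, and the `min_codons` parameter are assumptions made for the example. It uses the start codons (ATG, GTG, TTG) and stop codons (TAG, TAA, TGA) listed above, and reports an ORF as the stretch from the first start codon in a frame to the next in-frame stop.

```python
# Illustrative sketch: scan all six reading frames for start-to-stop ORFs.
STOPS = {"TAG", "TAA", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf_sequence) tuples.

    Coordinates are 0-based on the strand being scanned; `end` is exclusive
    and includes the stop codon. `min_codons` counts start and stop codons.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # first start codon since the last stop
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


for orf in find_orfs("CCATGAAATAGGG"):
    print(orf)
```

In a real genome the minimum length would be far larger than three codons; the tiny value here only keeps the toy example readable.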

11 Prokaryotic gene finders: Glimmer (prok and euk versions); Prodigal; GeneMark (prok and euk versions); EasyGene. Many others exist.

12 Glimmer. Tool uses interpolated Markov models (IMMs) to predict which ORFs in a genome contain real genes. Glimmer compares nucleotide patterns it finds in a training set of genes known (or believed) to be real to nucleotide patterns of ORFs in the whole genome. ORFs with patterns similar to the patterns in the training genes are considered real themselves. Using Glimmer is a two-part process: 1. Train Glimmer with genes from the organism that was sequenced, which are known, or strongly believed, to be real genes. 2. Run trained Glimmer against the entire genome sequence. This is how most ab initio gene predictors work, including eukaryotic predictors like Augustus, GeneID, SNAP, and others.

13 Gathering the training set. Using verified, published sequences is ideal but not always possible. Minimum needed is 250 kb of total sequence. Option 1: BLAST translated ORFs against a protein database (slow) and keep only very strong matches. Option 2: gather long non-overlapping ORFs (fast). Many more complex strategies exist, especially for eukaryotes. (Figure labels mark which ORFs to keep: "these", "not these".)

14 Training Glimmer. All k-mers of size 5-8 in the sequence are tracked. The frequency of each nucleotide following any given k-mer is recorded. This data set is used to build a statistical model that provides the probability that any given nucleotide will follow any given k-mer. This model is used to score the ORFs in the genome. Those where the patterns of nucleotides/k-mers match the model are predicted to be real genes.
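The training idea on this slide (record which nucleotide follows each k-mer, then score ORFs with the resulting probabilities) can be sketched as below. This is a simplified fixed-order model with a single k, not Glimmer's interpolated Markov model, which blends several context lengths. The function names, the add-one pseudocount, and the uniform 0.25 background are assumptions made for the example.

```python
import math
from collections import defaultdict


def train_context_model(training_genes, k=5):
    """Count which nucleotide follows each k-mer in the training genes."""
    counts = defaultdict(lambda: {"A": 0, "C": 0, "G": 0, "T": 0})
    for gene in training_genes:
        for i in range(len(gene) - k):
            context, following = gene[i:i + k], gene[i + k]
            if following in "ACGT":  # skip ambiguous bases such as N
                counts[context][following] += 1
    return counts


def next_base_probability(counts, context, base, pseudocount=1.0):
    """P(base | context); the pseudocount makes unseen contexts give 0.25."""
    row = counts.get(context, {})
    total = sum(row.values()) + 4 * pseudocount
    return (row.get(base, 0) + pseudocount) / total


def score_orf(counts, orf, k=5):
    """Sum of log-odds of each base versus a uniform background of 0.25."""
    score = 0.0
    for i in range(len(orf) - k):
        p = next_base_probability(counts, orf[i:i + k], orf[i + k])
        score += math.log(p / 0.25)
    return score


# Toy example with k=2 so the numbers are easy to follow.
model = train_context_model(["ATGATGATGATG"], k=2)
print(score_orf(model, "ATGATG", k=2))  # resembles the training gene
print(score_orf(model, "ATCATC", k=2))  # does not resemble it
```

ORFs whose score is clearly positive look like the training genes; a real tool would also need a principled threshold and a much larger training set.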

15 Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013. Candidate ORFs. Choose a minimum length cut-off. Blue ORFs meet this minimum. Each blue ORF will be scored against the model built from the training genes.

16 Categorizing ORFs as genes or not. Some ORFs will score well to the model (green); some will not (red). Green ORFs will be retained as predicted genes (blue arrows depicted along the DNA molecule in black at the bottom of the figure).

17 Potential problems to watch for. False positives: an ORF is predicted to be a gene, but really isn't; may result in overlaps. False negatives: an ORF is not predicted to be a gene, but really is; may result in gaps in feature predictions. Wrong start site chosen: most genes have multiple start codons near the beginning, so it can be hard to determine which is the true one.

18 Is one of these a false positive? Probably. Genes don't generally overlap to this extent in prokaryotes. What about eukaryotes?

19 Is this a false negative? Probably. There are not large regions without gene content in prokaryotes. What about eukaryotes? Why might this happen? If a region of DNA differs in composition from the rest of the genome, then the gene finders will score its ORFs poorly when in fact they are real genes. Different composition may come about in many ways; one common way is lateral (or horizontal) transfer (phage integration, transposition, etc.).

20 Translation start site considerations. - Start site frequency: ATG >> GTG >> TTG. - Ribosome binding site (RBS): AG-rich sequence 5-11 bp upstream of the start codon. - Similarity to match proteins, in BER & multiple alignments. - In the example below (showing just the beginning of one BER alignment; here the DNA sequence reads down in columns for each codon), homology starts exactly at the first atg (the current chosen start, aa #1), and there is a very favorable RBS beginning 9 bp upstream of this atg (gagggaga). There is no reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity and it does not have an RBS). Figure labels: 3 possible start sites; RBS upstream of chosen start; BER match; this ORF's upstream boundary.

21 Overlap analysis. When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If neither matches anything, other considerations such as presence in a putative operon and potential start codon quality are considered. Small regions of overlap are allowed (circle).

22 Interevidence regions. Areas of the genome with no genes, and areas within genes without any kind of evidence (no match to another protein, HMM, etc.; such regions may include an entire gene in the case of "hypothetical proteins"), are translated in all 6 frames and searched against a non-redundant protein database.

23 It's not just about proteins. Can predict many genes beyond protein-coding ones.

26 Manatee genome viewer

27 Artemis gene model curation tool

28 Eukaryotic gene structure prediction: now things get more complicated

29 Gene finder evaluation. Sensitivity (Sn) measures false negatives: the fraction of a known reference feature that is predicted by a gene predictor, Sn = TP / (TP + FN). Specificity (Sp) measures false positives: the fraction of the prediction that overlaps a known reference feature, Sp = TP / (TP + FP). Assessed at different levels: base, exon, transcript, gene.
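As a small worked example of the two formulas above at the base level, the sketch below compares a reference annotation with a prediction. The helper names and the 1-based inclusive interval convention are assumptions for illustration; Sp here follows the slide's gene-prediction definition, TP / (TP + FP).

```python
def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of bases."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def base_level_accuracy(reference, predicted):
    """Return (Sn, Sp) for two sets of coding base positions."""
    tp = len(reference & predicted)   # coding in both
    fn = len(reference - predicted)   # real coding bases that were missed
    fp = len(predicted - reference)   # predicted coding bases that are not real
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


reference = positions([(1, 100)])
predicted = positions([(51, 150)])
print(base_level_accuracy(reference, predicted))
```

The same counting applies at exon, transcript, or gene level if the sets hold whole features rather than base positions.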

30 Intrinsic (ab initio) success rates. Prokaryotic: very good, >95% correct. Eukaryotic: not so good, ~50% correct (shown below). (accessed May 2013)

31 Complexities of eukaryotic gene finding. Large genomes in eukaryotes. Low coding density: in proks virtually all long ORFs encode a gene, but not so in euks. Genomic repeats. Non-canonical (non-ATG) start codons. Splicing (exons & introns). Alternative splicing (40-50% of genes). Pseudogenes. Long genes or short genes. Long introns. Non-canonical introns. UTR introns. Overlapping genes on opposite strands. Nested genes, overlapping on the same strand or in an intron. Polycistronic peptide-coding genes: one mRNA codes for several very short (~11 aa) peptides with regulatory function. Even if you have some RNA (helpful), transcription is not always active; need multiple biological conditions.

32 Masking repeats is essential. RepeatMasker finds interspersed repeats & low complexity DNA sequences by comparing DNA sequence to curated genome-specific libraries. Simple repeats: 1-5 bp duplications such as A, CA, CGG. Tandem repeats: bases found at centromeres & telomeres. Segmental duplications: kilobases blocks copied to another genomic region. Interspersed repeats: processed pseudogenes, retrotranscripts (short interspersed elements, SINEs: non-functional copies of RNA genes reintegrated into the genome via reverse transcriptase); DNA transposons; retrovirus retrotransposons; non-retrovirus retrotransposons (long interspersed elements, LINEs). ~50% of human genomic DNA currently will be masked. RepeatModeler searches for repeats ab initio and can find repeats not previously characterized.

33 Repeats yield similarities in nonhomologous regions. [Figure: alignment of GENE1 and GENE2 using unmasked genomic DNA versus masked genomic DNA.] Alkes L. Price, Neil C. Jones and Pavel A. Pevzner (June 28, 2005)

34 Predicted genes that are actually repeats. Gene predictors using masked genomic DNA: no models. Using unmasked genomic DNA: predicted models. Track shown for comparison: repeats.

35 Multiple predictors give different results on same data set. Factors affecting gene predictor results: underlying algorithm; program parameters; training set (number and quality of models); additional extrinsic inputs (expression data, protein/genome alignment).

Predictor                          Fungus species 1    Fungus species 2
GeneMark-ES (self training)        9,024               9,527
Augustus trained on Botrytis       8,194               9,011
Augustus trained on Neurospora     7,335               7,955
GeneID trained on Stagonospora     10,313              12,894
GeneID trained on Sclerotinia      10,691              13,837
GLEAN consensus                    8,705               9,523

36 Which model is correct? [Figure tracks: models from three different predictors/conditions; consensus model; protein alignments.]

37 We rely on certain conventions. Rules are based on gene composition & signal. First, what is the basic structure of a gene? Coding region (exon) is inside the ORF of one reading frame. All exons are on the same strand for a given gene. Exons within a gene can have different reading frames. Inherent frequency patterns exist.

38 Dimer frequency distribution. Dimer frequency in protein sequence is not evenly distributed and is organism specific. Some amino acids "prefer" to be next to one another. Most dicodons are biased toward either coding or non-coding, not neutral. Expected frequency of a dimer if random = 0.25% (1/20 * 1/20). If a dimer has lower than expected frequency, a protein is less likely to contain it, and the reasoning follows that if a sequence does contain it, it is less likely to be in a coding region. Example: in the human genome, AAA AAA appears 1% of the time in coding regions and 5% of the time in non-coding regions.
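The arithmetic on this slide can be checked with a short sketch. It counts amino acid dimers in a set of proteins, compares each observed frequency with the uniform expectation of 1/20 * 1/20, and turns the slide's coding versus non-coding example (1% versus 5%) into a log-odds value. The function names and the toy protein are assumptions for illustration only.

```python
import math
from collections import Counter

# Uniform expectation for any amino acid dimer: 1/400, i.e. 0.25%.
EXPECTED_UNIFORM = (1 / 20) * (1 / 20)


def dimer_frequencies(proteins):
    """Observed frequency of each adjacent amino acid pair."""
    counts = Counter()
    for protein in proteins:
        for i in range(len(protein) - 1):
            counts[protein[i:i + 2]] += 1
    total = sum(counts.values())
    return {dimer: n / total for dimer, n in counts.items()}


def enrichment(freqs, dimer):
    """Observed frequency divided by the uniform expectation."""
    return freqs.get(dimer, 0.0) / EXPECTED_UNIFORM


def coding_log_odds(freq_coding, freq_noncoding):
    """Positive values favor coding sequence, negative values non-coding."""
    return math.log(freq_coding / freq_noncoding)


freqs = dimer_frequencies(["MKKL"])        # toy protein, three dimers
print(enrichment(freqs, "KK"))
print(coding_log_odds(0.01, 0.05))         # the slide's AAA AAA example
```

A negative log-odds for AAA AAA means that seeing this dicodon counts as evidence against a coding region, which is how such frequencies feed into a coding score.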

39 Splicing. Find all GT/AG donor/acceptor sites. Score with position-specific scoring matrix (PSSM) model. Figure labels: splice donor, branch point, polypyrimidine tract, splice acceptor. Modified from:

40 Position Specific Scoring Matrix (PSSM). Rows of the matrix: A, G, C, U. Let's say you look at 5 splice donor (GU) sites: AUCGUCGC, UCAGUGGC, CUCGUCCC, GUCGUUAC, CACGUCUA. Gene finders use this information to predict where gene features are. For this to work, one must have confirmed splice sites to use for training. These are not always available for new genomes, some splice sites are non-canonical, and some genes are alternatively spliced, so it can become somewhat complex.
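The matrix values from this slide did not survive in the transcript, but they can be recomputed from the five example sites. The sketch below builds a position frequency matrix with rows A, G, C, U. It assumes the first site is written with U throughout (AUCGUCGC), treating the T in the transcribed site as a typo; the function name is also an assumption.

```python
# The five splice donor sites from the slide, written in the RNA alphabet.
SITES = ["AUCGUCGC", "UCAGUGGC", "CUCGUCCC", "GUCGUUAC", "CACGUCUA"]
ALPHABET = "AGCU"


def position_frequencies(sites):
    """Return one {base: frequency} dict per aligned position."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({base: column.count(base) / len(sites) for base in ALPHABET})
    return matrix


matrix = position_frequencies(SITES)
for base in ALPHABET:
    # One row per base, one column per position in the site.
    print(base, " ".join(f"{col[base]:.1f}" for col in matrix))
```

Positions 4 and 5 come out as 1.0 for G and U respectively, which is the invariant GU of the donor site; the flanking positions carry the weaker, organism-specific signal.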

41 Translation start prediction. Position-specific scoring matrix (PSSM). Certain nucleotides tend to be in positions around the start site (ATG), and others not so. Such biased nucleotide distribution is the basis for translation start prediction. Figure courtesy of Sucheta Tripathy

42 Mathematical model. $F_i(X)$: frequency of X (A, G, C, T) in position $i$. Score a string by $\sum_i \log\left(F_i(X)/0.25\right)$. Figure courtesy of Sucheta Tripathy
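A minimal sketch of this score follows. The frequency matrix below is invented purely for illustration, the natural logarithm is an arbitrary choice (the slide does not state a base), and the small `floor` value is an added assumption to avoid taking the log of zero for a base never seen at a position.

```python
import math


def log_odds_score(freqs, window, background=0.25, floor=0.01):
    """Sum over positions of log(F_i(X) / background) for the bases in window."""
    return sum(
        math.log(max(column[base], floor) / background)
        for column, base in zip(freqs, window)
    )


# Hypothetical 3-position frequency matrix (not taken from real data).
FREQS = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.25, "C": 0.125, "G": 0.5, "T": 0.125},
]

print(log_odds_score(FREQS, "AAG"))  # favored bases at positions 1 and 3
print(log_odds_score(FREQS, "TAT"))  # disfavored bases at positions 1 and 3
```

A position where every base has frequency 0.25 contributes zero, so only positions with a biased distribution move the score.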

43 Pattern-based exon & gene prediction. Assess different criteria: coding region inside ORF (start & stop, no interrupting stops); dimer frequency; coding score; donor site score; acceptor site score. Other factors to consider: GC content; exon length distribution; polymerase II promoter elements (GC box, CCAAT box, TATA region); ribosome binding site; polyadenylation signal upstream of poly-A cleavage site; termination signal downstream of poly-A cleavage site.

44 Example of ab initio gene predictor flow

45 Confirming a predicted gene with cDNA. 26 exons!

46 Extrinsic evidence & manual curation. Expression data: EST (expressed sequence tag) sequences; RNA-Seq reads (mRNA → cDNA, high-throughput sequencing, align reads to genome sequence). Homology-based approaches: protein (or expression data) sequences from other organisms; nucleic acid conservation via tblastx or many other methods; ortholog mapping/synteny. Experimentally confirmed gene products & gene families. Manual curation is often done by experts in a domain.

47 RNA-seq of transcripts as evidence for gene models. mRNA is converted to cDNA and sequenced, giving many millions of reads, for example: GCTAATGCGAAGTCCTAGACCAGATTGAC ATGCGATGCAGCTGACGCTGGCTAATGCG CGCATAGCCAGATGACCATGATGCGATGC TGACAGATTAGACAGTAGGACAGATAGAC. Reads are mapped to the genome with gene models: 1. Gene model is confirmed by transcript information. 2. Part of the gene model is confirmed but the exons predicted in the middle do not have transcript evidence. Does this mean they are not real? Not necessarily. 3. Transcript sequencing allows for novel gene detection. There is transcript evidence for the presence of a gene (or at least transcription) in an area of the genome without a gene model currently predicted.

48 Splice boundaries and alternate transcripts. Some reads will span the intron/exon boundaries (figure label: intron). Allows for verification of gene models and observation of alternate transcripts.

49 Multiple genome alignment & conservation

50 Experimentally based manual curation. We have an experimentally characterized protein. What do I know about this gene family? What do I know about genes in general? No introns in multiples of three, short introns, et cetera.

51 Leverage comparative genomics Arnaud, et al. (2010) Nucleic Acids Res. 38(Database issue): D420-7.

52 Gather models for ab initio training set. Get models verified via expression, homology, or manual curation: use manually curated genes from your organism; generate a preliminary ab initio model set and then do a homology search at Swiss-Prot, retaining the most-conserved genes; use CEGMA (Core Eukaryotic Genes Mapping Approach) to predict highly conserved genes; align proteins from related organisms to your genome with a splice-aware aligner, thus creating models with exon boundaries that have homologs; align RNA-seq or EST reads to your genome to create new models or update existing ones. Use models with multiple sources & remove highly similar ones. OR use a pre-existing training set from an organism related to yours; for example, I could use chicken if I am studying finch. Many software packages provide parameter files for common organisms.

53 Run gene finder online or stand-alone. Augustus web has text & graphical output (click to view). Predictions are stored in GFF3, GFF2, or GTF format.
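For readers who have not met these formats, a GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with the attributes written as `key=value` pairs separated by semicolons. The sketch below parses one such line; the example record is hypothetical and is not actual Augustus output, and the function name is an assumption.

```python
from urllib.parse import unquote

GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; attributes become a sub-dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 uses URL-style escapes
    record["attributes"] = attributes
    return record


# Hypothetical feature line for illustration.
example = "contig1\tAUGUSTUS\tgene\t1000\t2500\t0.87\t+\t.\tID=g1;Name=example%20gene"
print(parse_gff3_line(example))
```

Comment lines starting with `#` and multi-valued attributes are left out to keep the sketch short.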

54 RNA-Seq can show differential expression of alternative transcripts

55 Combiners. Incorporate multiple evidence types including ab initio predictions, expression data, and homology; these usually perform the best. Examples: GLEAN; EVidenceModeler (EVM); JIGSAW; MAKER (actually a whole pipeline that can be used online); PASA (combines predicted structures with expression data); and more. Note that many ab initio predictors, for example Augustus, incorporate other data types such as protein alignments or expression data.

56 One example, the Glean combiner. Top track below is a statistically derived combination of the ones below it.

57 Example of annotation pipeline: Fungal Genome Annotation Standard Operating Procedure (SOP) at JGI. Repeat masking. Mapping ESTs from the organism (BLAT) and publicly available proteins from related taxa (BLASTx). Ab initio (FGENESH, GeneMark), homology-based (FGENESH+, Genewise seeded by BLASTx against nr), and EST-based (EST_map) gene prediction. EST clustering to improve gene models. Filtering overlapping gene models based on protein homology and EST support to derive the best model. Non-coding genes with tRNAscan-SE. Ready for functional annotation.

58 nGASP: the nematode genome annotation assessment project

59 Take home message. Intrinsic & extrinsic prediction methods. Intrinsic gene finders need high-quality training datasets in order to produce good predictions. Correct gene predictions are a moving target: note the steady decrease in the number of predicted genes as the human genome is further curated. Gene finders & gene finding pipelines produce predictions, which must be verified and refined; do not take them at face value. The more pieces of high-quality evidence you add to the process, the better. In eukaryotes especially, there is not necessarily only one correct model.

60 PROTEIN FUNCTIONAL ANNOTATION

61 Annotation defined. annotate: to make or furnish critical or explanatory notes or comment. -- Merriam-Webster dictionary. genome annotation: the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes. -- Lincoln Stein, PMID. Gene Ontology (GO) annotation: the process of assigning GO terms to gene products according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based. --

62 What do our predicted genes do? What we would like: experimental knowledge of function, via literature curation or performing the experiment. Not possible for all proteins in most organisms (not even close in most). What we actually have: sequence similarity. Similarity to motifs, domains, or whole sequences. Protein, not DNA, for finding function. Shared sequence can imply shared function. All sequence-based annotations are putative until proven experimentally.

63 Basic set of protein annotations. protein name - descriptive common name for the protein, e.g. ribokinase. gene symbol - mnemonic abbreviation for the gene, e.g. recA. EC number - only applicable to enzymes. role - what the protein is doing in the cell and why, e.g. amino acid biosynthesis. supporting evidence - accession numbers of BER and HMM matches; TMHMM, SignalP, LipoP; whatever information you used to make the annotation. unique identifier - e.g. locus ids.

64 Alignments/Families/Motifs. pairwise alignments: two proteins' amino acid sequences aligned next to each other so that the maximum number of amino acids match. multiple alignments: 3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column; more meaningful than pairwise alignments since it is much less likely that several proteins will share sequence similarity due to chance alone than that 2 will. Therefore, such shared similarity is more likely to be indicative of shared function. protein families: clusters of proteins that all share sequence similarity and presumably similar function; may be modeled by various statistical techniques. motifs: short regions of amino acid sequence shared by many proteins, e.g. transmembrane regions, active sites, signal peptides.

65 Important terms to understand. homologs: two sequences that have evolved from the same common ancestor; they may or may not share the same function. Two proteins are either homologs of each other or they are not; a protein cannot be more, or less, homologous to one protein than to another. orthologs: a type of homolog where the two sequences are in different species that arose from a common ancestor; the speciation event itself created the two copies of the sequence. Orthologs often, but not always, share the same function. paralogs: a type of homolog where the two sequences have arisen due to a gene duplication within one species. Paralogs will initially have the same function (just after the duplication), but as time goes by one copy will be free to evolve new functions while the other copy maintains the original function. This process is called neofunctionalization. xenologs: a type of homolog where the two sequences have arisen due to lateral (or horizontal) transfer.

66 Figure: from a common ancestor, speciation leads to orthologs; duplication leads to paralogs; lateral transfer to a different species makes xenologs; one paralog evolves a new function. Neofunctionalization: the duplicated gene/protein develops a new function.

67 Pairwise alignments. There are numerous tools available for pairwise alignments: NCBI BLAST resources, FASTA searches, and many more. At IGS we use a tool called BER (BLAST-extend-repraze) that combines BLAST and Smith-Waterman approaches. Actually, much of bioinformatics is based on reusing tools in new and creative ways.

68 BER. BLAST: genome's protein set vs. non-redundant protein database. Significant hits (using a liberal cutoff) are put into mini-dbs for each protein: mini-db for protein #1, mini-db for protein #2, mini-db for protein #3 ... mini-db for protein #3000. Modified Smith-Waterman alignment: the query protein, extended by 300 nt, vs. its mini database gives the BER alignment.

69 BER alignment: to look through in-frame stop codons and across frameshifts to determine if similarity continues

71 Extensions in BER. Figure labels: end5, 300 bp, ORFxxxxx, 300 bp, end3; search protein; match protein; normal full-length match. The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein. ! = FS: similarity extending through a frameshift upstream or downstream into extensions. * = PM: similarity extending in the same frame through a stop codon.

72 How do you know when an alignment is good enough to determine function? Good question! No easy answer. Generally, you want a minimum of 40%-50% identity over the full lengths of both query and match, with conservation of all important structural and catalytic sites. However, some information can be gained from partial alignments: domains, motifs. BEWARE OF TRANSITIVE ANNOTATION ERRORS

73 Pitfalls of transitive annotation. Transitive annotation is the process of passing annotation from one protein (or gene) to another based on sequence similarity: A → B, B → C, C → D. A's name has passed to D from A through several intermediates. - This is fine if A is similar to D. - This is NOT fine if A is NOT similar to D. Transitive annotation errors are easy to make and happen often. Current public datasets are full of such errors. A good way to avoid transitive annotation errors is to require that in a pairwise match, the match annotation must be trusted. Be conservative: err on the side of not making an annotation when possibly you should, rather than making an annotation when probably you shouldn't.

74 Trusted annotations. It is important to know which proteins in our search database are characterized: proteins marked as characterized in public databases; Gene Ontology repository (more on this later); GenBank (only recently began); UniProt proteins at protein existence level 1; proteins with literature reference tags indicating characterization.

75 UniProt. Swiss-Prot: European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB); all entries manually curated; annotation includes links to references, coordinates of protein features, and links to cross-referenced databases. TrEMBL: EBI and SIB; entries have not been manually curated; once they are, accessions remain the same but move into Swiss-Prot. Protein Information Resource (PIR).

76 UniProt

81 Enzyme Commission. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse. Not sequence based. Categorized collection of enzymatic reactions. Reactions have accession numbers indicating the type of reaction, for example EC

82 EC number hierarchy. All ECs starting with 1 are some kind of oxidoreductase. Further numbers narrow the specificity of the type of enzyme. A four-position EC number describes one particular reaction.
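The hierarchy can be made concrete with a small parser. The sketch below splits an EC number into its levels and looks up the top-level class; it also accepts partial numbers written with dashes (for example `1.1.-.-`), which appear in family-level annotations. The function name and return structure are assumptions, and class 7 (translocases) was added to the EC system after these slides were written. The ribokinase number is used only as a familiar example.

```python
# Top-level EC classes; class 7 was introduced in 2018.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec(ec_number):
    """Split an EC number into its specified levels and name its class."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four positions, got {ec_number!r}")
    levels = []
    for part in parts:
        if part == "-":  # unspecified from here on (partial EC number)
            break
        levels.append(int(part))
    if not levels:
        raise ValueError("the first EC position must be specified")
    return {
        "class_name": EC_CLASSES[levels[0]],
        "levels": levels,
        "is_complete": len(levels) == 4,  # four positions = one reaction
    }


print(parse_ec("EC 2.7.1.15"))   # ribokinase
print(parse_ec("1.1.-.-"))       # a partial EC number
```

Comparing the `levels` lists of two EC numbers from the left shows how far up the hierarchy two enzymes agree.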

83 Example entry for one specific enzyme

84 Metabolic pathway databases: KEGG, MetaCyc/BioCyc, BRENDA

87 Hidden Markov models (HMMs). Statistical model of the patterns of amino acids in a multiple alignment of proteins (called the "seed") which share sequence and, presumably, functional similarity. Two sets routinely used for protein functional annotation: TIGRFAMs and Pfam (pfam.sanger.ac.uk). Each TIGRFAM model is assigned to a category which describes the type of functional relationship the proteins in the model have to each other. Equivalog - one specific function, e.g. ribokinase. Subfamily - group of related functions generally with different substrate specificities, e.g. carbohydrate kinase. Superfamily - different specific functions that are related in a very general way, e.g. kinase. Domain - not necessarily the full length of the protein; contains one functional part or structural feature of a protein; may be fairly specific or very general, e.g. ATP-binding domain.

88 Annotation attached to HMMs. Functionally specific HMMs have specific annotations. TIGR00433 (accession number for the model): name: biotin synthase; category: equivalog; EC: ; gene symbol: bioB; roles: biotin biosynthesis (TIGR 77/GO: ), biotin synthase activity (GO: ). Functionally general HMMs have general annotations. PF04055: name: radical SAM domain protein; category: domain; EC: not applicable; gene symbol: not applicable; roles: enzymes of unknown specificity (TIGR role 703), catalytic activity (GO: ), metabolism (GO: ).

89 Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013. HMM building. Proteins from many species: alignments of functionally related proteins act as training sets for HMM building. Statistical model: model specific to a family of proteins, generally found across many species.

90 HMM scores. When a protein is searched against an HMM it receives a bit score and an e-value indicating the significance of the match. The person building the HMM will search the new HMM against a protein database and decide on the trusted (T) and noise (N) cutoff scores. The search protein's score is compared with the trusted and noise cutoff scores attached to the HMM: proteins scoring above the trusted cutoff can be assumed to be members of the family; proteins scoring below the noise cutoff can be assumed NOT to be members of the family; when a protein scores in between the trusted and noise cutoffs, it may or may not be a member of the family.

91 Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013. HMM databases. Proteins from many species: alignments of functionally related proteins act as training sets for HMM building. Statistical model with trusted (T) and noise (N) cutoffs: model specific to a family of proteins, generally found across many species. Add this model to the database. Database of HMM models, each specific to one protein family and/or functional level. Examples: Pfam and TIGRFAM.

92 The cutoff scores attached to HMMs are sometimes high, sometimes low, and sometimes even negative. There is no inherent meaning in how high or low a cutoff score is; the important thing is the query protein's score (P) relative to the trusted (T) and noise (N) scores. Above trusted: the protein is a member of the family the HMM models. Below noise: the protein is not a member of the family the HMM models. In between noise and trusted: the protein MAY be a member of the family the HMM models. Above trusted and some or all scores are negative: the protein is a member of the family the HMM models. [Figure: four score axes running from -50 to 100, each marking P relative to N and T.]
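The decision rule on this slide depends only on where the protein's score falls relative to the two cutoffs, not on the sign or size of the numbers. A minimal sketch follows; the function name and the treatment of scores exactly equal to a cutoff are assumptions made for illustration.

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Place a protein's HMM score relative to the model's two cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "member"
    if score < noise_cutoff:
        return "not a member"
    return "possible member"


print(classify_hmm_hit(120, trusted_cutoff=100, noise_cutoff=50))
print(classify_hmm_hit(75, trusted_cutoff=100, noise_cutoff=50))
print(classify_hmm_hit(10, trusted_cutoff=100, noise_cutoff=50))
# Negative scores work the same way: only the relative position matters.
print(classify_hmm_hit(-20, trusted_cutoff=-50, noise_cutoff=-100))
```

The last call mirrors the slide's point that a protein can be a family member even when every number involved is negative.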

93 Orthologous groups. COGs: have not been updated in a long time. eggNOG: newer, more complete. Figure: bi-directional best BLAST hits pair 1 with A, 2 with B, and 3 with C.
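Bi-directional (reciprocal) best hits are simple to compute once the BLAST results are reduced to (query, subject, bit score) triples. The sketch below is a bare-bones illustration with hypothetical hit lists; it ignores ties, e-value filters, and alignment coverage, which real orthology pipelines would take into account.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) triples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits: genome 1 has genes 1-3, genome 2 has genes A-C.
a_vs_b = [("1", "A", 200), ("1", "B", 50), ("2", "B", 180),
          ("3", "C", 90), ("3", "A", 95)]
b_vs_a = [("A", "1", 210), ("B", "2", 170), ("C", "3", 88)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data gene 3's best hit is A, but A's best hit is 1, so 3 is left unpaired even though C points back to it.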

94 Motif searches. PROSITE - consists of documentation entries describing protein domains, families and functional sites as well as associated patterns to identify them. Center for Biological Sequence Analysis: protein sorting (7 tools), including SignalP (finds potential secreted proteins), LipoP (finds potential lipoproteins), and TargetP (predicts subcellular location of proteins); protein function and structure (9 tools), including TMHMM (finds potential membrane spans); post-translational modifications (14 tools); immunological features (9 tools); gene finding and splice sites (9 tools); DNA microarray analysis (2 tools); small molecules (2 tools).

95 One-stop shopping - InterPro. Brings together multiple databases of HMM, motif, and domain information. Excellent annotation and documentation.

96 Making annotations. Use the information from the evidence sources to decide what the gene/protein is doing. Assign annotations that are appropriate to your knowledge: name, EC number, role, etc.

97 TIGR roles. Main categories: Amino acid biosynthesis; Purines, pyrimidines, nucleosides, and nucleotides; Fatty acid and phospholipid metabolism; Biosynthesis of cofactors, prosthetic groups, and carriers; Central intermediary metabolism; Energy metabolism; Transport and binding proteins; DNA metabolism; Transcription; Protein synthesis; Protein fate; Regulatory functions; Signal transduction; Cell envelope; Cellular processes; Other categories; Unknown; Hypothetical; Disrupted reading frame; Unclassified (not a real role). Each main category has several subcategories.

98 Names (and other annotations) should reflect knowledge. Specific function, example: adenylosuccinate lyase, purB. Varying knowledge about substrate specificity, a good example: ABC transporters (ribose ABC transporter; sugar ABC transporter; ABC transporter). Choosing the name at the appropriate level of specificity requires careful evaluation of the evidence, looking for specific characterized matches and HMMs. Family designation (no gene symbol, partial EC): CbbY family protein; carbohydrate kinase, FGGY family. Hypotheticals: hypothetical protein; conserved hypothetical protein.

99 Names can be problematic because humans do not always use precise and consistent terminology. Our language is riddled with synonyms (different names for the same thing) and homonyms (different things with the same name). This makes data mining/query difficult. What name should you assign? What name should you use when you search UniProt or NCBI or any other database?

100 Synonyms. Within any domain do people use precise & consistent language? Take biologists, for example. Mutually understood concepts: DNA, RNA, protein; translation & protein synthesis. Synonym: one thing, more than one name. Enzyme Commission reactions: standardized id, official name & alternative names.

101 Homonyms. Different things known by the same name. Common in biology. Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?). Sporulation: endospore formation (Bacillus anthracis) or reproductive sporulation (asci & ascospores, Morchella elata (morel))? PG Warner 2008 (accessed 17-Sep-09) ASMOnly/details.asp?id=1426&Lang= L Stauffer 2003 (accessed 17-Sep-09)

102 Standardization with controlled vocabularies (CVs). An official list of precisely defined terms used to classify information & facilitate its retrieval: flat list, thesaurus, catalog. Benefits of CVs: allow standardized descriptions; synonyms & homonyms addressed; can be cross-referenced externally; facilitate electronic searching. A CV can be used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.

103 Ontology: CV with defined relationships. Formalizes knowledge of a subject with precise textual definitions. Networked terms; child more specific ("granular") than parent. National Drug File

104 An example is the Gene Ontology, with three controlled vocabularies
- Molecular Function: what the gene product is doing
- Biological Process: why the gene product is doing what it does
- Cellular Component: where a gene product is doing what it does

105 The Gene Ontology
- A good example of a biological ontology
- Relationships among networked, defined terms
- Vascular terms shown with relationships

106 Example: a GO annotation
- Associating a GO term with a gene product (GP)
  - GP has function (6-phosphofructokinase activity)
  - GP participates in process (glycolysis)
  - GP is located in part of cell (cytoplasm)
- Linking a GO term to a GP asserts that it has that attribute
- Based on literature or computational methods
- Always involves:
  - Learning something about the gene product
  - Selecting an appropriate GO term
  - Providing an appropriate evidence code
  - Citing a reference [preferably open access]
  - Entering the information into a GO annotation file
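The pieces that every GO annotation involves can be sketched as a small record type. This is an illustration under stated assumptions, not the GO annotation file format itself: the class name `GOAnnotation`, the gene product identifier `pfkA`, and the placeholder reference `PMID:0000000` are hypothetical, the evidence-code set is only a subset of the codes GO defines, and the GO IDs shown should be checked against the current ontology release before reuse.

```python
import re
from dataclasses import dataclass

# A subset of GO evidence codes (assumption: enough for this example).
EVIDENCE_CODES = {"EXP", "IDA", "IMP", "IGI", "IEP", "ISS", "IEA", "TAS", "NAS", "IC", "ND"}
ASPECTS = {"F": "molecular function", "P": "biological process", "C": "cellular component"}


@dataclass(frozen=True)
class GOAnnotation:
    gene_product: str   # what is being annotated
    go_id: str          # the selected GO term
    aspect: str         # F, P or C
    evidence_code: str  # how we know
    reference: str      # where the evidence is reported

    def __post_init__(self):
        # Reject records that are missing one of the required parts.
        if not re.fullmatch(r"GO:\d{7}", self.go_id):
            raise ValueError(f"malformed GO ID: {self.go_id!r}")
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")
        if self.evidence_code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence_code!r}")
        if not self.reference:
            raise ValueError("an annotation must cite a reference")


# The three assertions from the slide, for a hypothetical gene product "pfkA".
annotations = [
    GOAnnotation("pfkA", "GO:0003872", "F", "IDA", "PMID:0000000"),  # 6-phosphofructokinase activity
    GOAnnotation("pfkA", "GO:0006096", "P", "IDA", "PMID:0000000"),  # glycolysis
    GOAnnotation("pfkA", "GO:0005737", "C", "IDA", "PMID:0000000"),  # cytoplasm
]
for a in annotations:
    print(a.gene_product, a.go_id, ASPECTS[a.aspect], a.evidence_code)
```

Validating on construction means an annotation without an evidence code or reference cannot be created at all, which mirrors the requirement that every assertion be backed by evidence.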

107 Annotation becomes a series of IDs linked to other proteins/genes/features
"This protein is integral to the plasma membrane and is part of an ATP-binding cassette (ABC) transporter complex. It functions as part of a transporter to accomplish the transport of sulfate across the plasma membrane using ATP hydrolysis as an energy source."
= GO: GO: GO: GO:

108 Parts of a GO term record
- Term name
- GO ID (unique numerical identifier)
- Synonyms for searching, alternative names, misspellings
- GO slim
- Ontology relationships (next page)
- Precise textual definition that describes some aspect of the biology of the gene product
- Definition reference

109 Genomes can be compared
- High-level biological process terms used to compare Plasmodium and Saccharomyces (made by "slimming")
MJ Gardner, et al. (2002) Nature 419:
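"Slimming" means mapping each specific annotation up to a small set of high-level terms so that two genomes can be compared on the same categories. The sketch below assumes a toy term graph, a two-term slim, and made-up gene identifiers; none of it reproduces the data or procedure of the cited comparison.

```python
from collections import Counter

# Hypothetical toy ontology (term -> direct parents) and slim.
PARENTS = {
    "biological process": [],
    "metabolism": ["biological process"],
    "transport": ["biological process"],
    "carbohydrate metabolism": ["metabolism"],
    "glycolysis": ["carbohydrate metabolism"],
    "sulfate transport": ["transport"],
}
SLIM = {"metabolism", "transport"}


def with_ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def slim_counts(gene_to_terms):
    """Count genes per slim term; a gene counts once per slim term it maps to."""
    counts = Counter()
    for terms in gene_to_terms.values():
        slim_hits = set()
        for term in terms:
            slim_hits |= with_ancestors(term) & SLIM
        counts.update(slim_hits)
    return counts


genome_a = {
    "geneA1": ["glycolysis"],
    "geneA2": ["sulfate transport"],
    "geneA3": ["carbohydrate metabolism", "glycolysis"],
}
print(dict(slim_counts(genome_a)))
```

Running the same function over a second genome's annotations gives counts on identical categories, which is what makes a side-by-side comparison possible.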

110 The importance of evidence tracking
- The process of functional annotation involves assessing available evidence and reaching a conclusion about what you think the protein is doing in the cell and why.
- Functional annotations should only be as specific as the supporting evidence allows.
- All evidence that led to the annotation conclusions must be stored.
- In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
- "I conclude that you are a cat." Why?
  - You look like other cats I know
  - I heard you meow and purr
- "I conclude that you code for a protein kinase." Why?
  - You look like other protein kinases I know
  - You have been observed to add phosphate to proteins

111 Knowledge & annotation specificity
- How much can we accurately say?
- Available evidence for three genes
- Corresponding GO annotations

112 Types of Evidence
- Experiments (the only truth)
- Pairwise/multiple alignments
- HMM/domain matches scoring above trusted cutoff
- Metabolic pathway analysis
- Match to an ortholog group (COG, eggNOG)
- Motifs
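The "scoring above trusted cutoff" test for HMM matches can be sketched as a simple comparison. The example assumes each model carries a trusted cutoff and a noise cutoff, as curated HMM libraries commonly provide; the model names, scores, cutoff values, and the three labels returned are invented for illustration.

```python
from typing import NamedTuple


class HmmHit(NamedTuple):
    model: str             # hypothetical model name
    score: float           # bit score of the protein against the model
    trusted_cutoff: float  # scores at or above this are accepted as family members
    noise_cutoff: float    # scores below this are treated as non-members


def classify_hit(hit: HmmHit) -> str:
    """Label a hit by where its score falls relative to the model's cutoffs."""
    if hit.score >= hit.trusted_cutoff:
        return "trusted"    # strong enough to support an annotation
    if hit.score >= hit.noise_cutoff:
        return "uncertain"  # needs other evidence before it is used
    return "noise"


hits = [
    HmmHit("example_kinase_family", 310.5, 250.0, 120.0),
    HmmHit("example_transporter_family", 140.0, 200.0, 90.0),
    HmmHit("example_lyase_family", 35.2, 180.0, 60.0),
]
for hit in hits:
    print(hit.model, classify_hit(hit))
```

Only the first hit would count as evidence on its own under this rule; the second would have to be weighed together with alignments, pathway context, or other evidence types from the list above.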

113 The Evidence Ontology
- Two main classes
- ECO terms have standardized definitions & references
- Related to GO evidence codes
- Allows standardizing evidence description and searching by evidence type

114 Evidence Ontology & GO Codes
- Approximately 20 GO evidence codes exist
- Shown: some of the over 250 ECO terms

115 The big picture: an example pipeline
- DNA sequence (assembly, masking)
- Gene prediction -> predicted protein-coding genes -> translation
- RNA finding (tRNAscan, RNAmmer, homology searches) -> predicted RNA genes
- MySQL database using the Chado schema
- Automated start site and gene overlap correction
- Searches:
  - Pairwise BER searches against UniRef100
  - HMM searches against Pfam and TIGRFAM
  - Motif searches with LipoP, TMHMM, PROSITE
  - NCBI COGs
  - PRIAM profiles
- Genome viewer/editor
- Flat files of annotation information
- Automatic annotation using the evidence hierarchy of Pfunc

116 Some concluding themes
- The best annotation comes from looking at multiple sources of evidence.
- It is important to track and check the evidence used in an annotation.
- Do not assume the annotation you see on a protein is correct unless it comes from a trusted source.
- Always err on the side of under-annotating rather than over-annotating.
- Consider using UniProt (UniRef) for searches, not NCBI nr, simply for the depth of information it provides.