1. Proteomics database contents Protein sequence databases


1 1. Proteomics database contents: Protein sequence databases. Salvador Martínez de Bartolomé (smartinez@proteored.org), Bioinformatics support, ProteoRed Proteomics Facility, National Center for Biotechnology, Madrid

2 Menu: Introduction (bioinformatics and sequence databases); Nucleic acid sequence databases; Protein sequence databases (sources); Protein sequence databases (other)

3 Biology of the XXI century. Three major developments: high-throughput analysis techniques (DNA sequencing, mass spectrometry, microarrays); numerous biological databases available through the Web; bioinformatics tools available through the Web

4 An overwhelming number of unordered resources


6 [Figure: map of bioinformatics resources grouped by category.] Categories: references / nomenclature; phylogenetics & taxonomy; subcellular localization; protein sequence; 3° structure; protein 2D PAGE & MS; PTM; protein identification & characterization; PTM prediction tools; 1° structure analysis; 3° structure prediction; nucleotide / amino acid translator; sequence alignment; similarity search; gene expression; protein interactions; species / genomic; functional; 2° structure prediction; polymorphism / mutation / disease database; topology prediction; pattern & profile search; domains & classification; 2° structure; nucleotide sequence repository. Resources shown: UniProtKB (Swiss-Prot/TrEMBL), TargetP, EcoGene, Ensembl, FlyBase, MGD, SGD, SubtiList, TIGR CMR, HIV, TAIR, MEROPS, ENZYME, TRANSFAC, KEGG, HAMAP, PROSITE, InterPro, Pfam, ProDom, BLOCKS, TIGRFAM, ProtoMap, CATH, SCOP, PDB, SWISS-MODEL, ScanProsite, MotifScan, HSSP, Jpred, GOR, DIP, IntAct, ProtScale, ProtParam, BLAST, FASTA, dbSNP, GeneCards, OMIM, CleanEx, DDBJ, GenBank, EMBL, TreeBase, NEWT, Taxonomy, PSORT, Glycosuite, PhosphBase, NetOGlyc, ChloroP, PeptideMass, Mascot, Phenyx, ECO2DBASE, Siena-2D PAGE, SWISS-2D PAGE, TMHMM, SOSUI, PubMed, HUGO, GO, ClustalW, DIALIGN, Translate

7 Molecular bioinformatics: an operational definition. The application of computer science to molecular biology, in particular to the study of macromolecules such as proteins, nucleic acids and oligosaccharides

8 Protein sequence databases are used for: - Identification of proteins by proteomics --> completeness, sequence quality - Similarity searches (functional prediction) --> sequence quality (non-redundancy) - Training datasets (prediction tools) --> sequence and annotation quality - Genome annotation

9 Proteome complexity Not predictable at the genome level! (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: ).

10 Avalanche of sequence data

15 ~1630 genomes sequenced (single organism, varying sizes); ~952 ongoing genome sequencing projects; ~200 metagenome sequencing projects (environmental samples: multiple unknown organisms, varying sizes). Ecological metagenomes: beach sand, Sargasso Sea. Organismal metagenomes: mouse gut. ~17 million sequences being processed at the Venter Institute

16 How many protein sequences at the end? For fun, an estimate: ~30 million species (1.5 million named)
- 20 million bacteria/archaea x 4'000 genes
- 5 million protists x 6'000 genes
- 3 million insects x 14'000 genes
- 1 million fungi x 6'000 genes
- 0.6 million plants x 20'000 genes
- 0.2 million molluscs, worms, arachnids, etc. x 20'000 genes
- 0.2 million vertebrates x 25'000 genes
The calculation: $2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times25000 = 1.79\times10^{11}$, i.e. 179'000'000'000 protein sequences. AMB, SP20
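The estimate above can be re-checked with a few lines of Python. This is only a sketch of the slide's back-of-the-envelope arithmetic: the species counts and genes-per-genome figures are the slide's rough guesses, and the names `ESTIMATES` and `total_protein_sequences` are introduced here purely for illustration.

```python
# Rough estimates from the slide: (number of species, genes per genome).
ESTIMATES = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}


def total_protein_sequences(estimates):
    """Sum species x genes over all groups (one protein per gene assumed)."""
    return sum(species * genes for species, genes in estimates.values())


total_species = sum(species for species, _ in ESTIMATES.values())
total = total_protein_sequences(ESTIMATES)
print(f"{total_species:,} species, {total:,} protein sequences")
```

The sum ignores splice variants and PTMs, so it counts one sequence per gene only.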

17 Protein sequence origin. About 4.5 million known protein sequences (in 2007). More than 99 % of the protein sequences are derived from the translation of nucleotide sequences. Less than 1 % come from direct protein sequencing (Edman, MS/MS). -> It is important that users know where a protein sequence comes from (sequencing & gene prediction quality)!



19 The hectic life of a sequence. [Flow diagram] cDNAs, ESTs (expressed sequence tags), genes and genomes are submitted to EMBL, GenBank or DDBJ. Some data are not submitted to public databases*, or their submission is delayed or cancelled.

20 Contribution: EMBL 10 %; GenBank 75 %; DDBJ 15 %

21 Goal: to accept, process and make freely available sequence data from individual researchers, research groups and patent offices. The data are available via SRS/Entrez, FTP, web services and similarity search tools.

22 The tremendous increase in nucleotide sequences 1980: 80 genes fully sequenced!

23 EMBL/GenBank/DDBJ serve as archives: nothing goes out. They contain all public sequences derived from: genome projects (> 80 % of entries); sequencing centers (cDNAs, ESTs); individual scientists (15 % of entries); patent offices (e.g. European Patent Office, EPO). Currently: ~152 x 10^6 sequences, ~242 x 10^9 bp; Sequences from > different species;

24 More than species, but human mouse rat Human/Mouse/Rat: organisms with the highest redundancy!

25 Where was the sequenced specimen collected? Geographical origin of sequenced samples since 2005 (lat_lon: latitude_longitude qualifier)

26 EMBL/GenBank/DDBJ: a very important annotation for proteomics is the CoDing Sequence (CDS), in particular for eukaryotes

27 Entries are submitted to EMBL, GenBank and DDBJ with or without an annotated CDS provided by the authors. CDS (CoDing Sequence): the portion of DNA/RNA translated into protein (from Met to STOP), either experimentally proven or derived from gene prediction

28 5 Problems

29 Problem 1 Complete genome (submitted) only ~ 2,015 CDS available!


30 [Charts comparing human, mouse and rat at the nucleic acid level and at the protein level (example with UniProtKB/TrEMBL).] The CDS of viruses and bacteria are easy to obtain!

31 Problem 2: variable level of sequence quality (sequencing quality and gene prediction quality). Authors can specify the nature of the CDS by using the qualifier "/evidence=experimental" or "/evidence=not_experimental", but this is very rarely done.

33 UniProtKB/Swiss-Prot protein knowledgebase release 56.6 statistics (16-Dec-08). Protein existence (PE): 1: Evidence at protein level, 15.3 %; 2: Evidence at transcript level, 15.8 %; 3: Inferred from homology, 65.2 %; 4: Predicted, 3.4 %; 5: Uncertain, 0.3 %

34 Problem 3: highly redundant. A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository) -> similarity searches are not straightforward

35 Problem 4: author authority --> variable quality of the annotation (CDS and other), e.g. gene/protein name attribution

36 EMBL/GenBank/DDBJ: the authors have full authority over the content of the entries they submit! (Editorial control of the content belongs to the authors.) Exception: TPA (Third Party Annotation), since January 2003

37 Problem 5: environmental samples

38 Environmental sequences (ENV). Aim: to sequence all DNA present in a given sample, without knowing from which species the DNA is derived - Sargasso Sea (Craig Venter) - human fluids - earth

41 No idea of the species (microbial population). No idea of the gene prediction program to be used. No idea of the genetic code to be used for translation! Not always associated with a CDS. If they are, the protein sequences are present in the protein sequence databases


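43 [Flow diagram] cDNAs, ESTs and genomes are submitted to the nucleic acid databases (EMBL, GenBank, DDBJ); some data are not submitted to public databases, or are delayed or cancelled. Sequences reach the protein sequence databases if the submitters provide an annotated Coding Sequence (CDS) (1/10 EMBL entries) or through gene prediction; entries with no CDS do not.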

44 Major protein sequence database sources: PIR, PDB, PRF; UniProtKB: Swiss-Prot + TrEMBL; NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq (diagram labels: integrated resources, cross-references, separated resources)
- UniProtKB/Swiss-Prot: manually annotated protein sequences
- UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot
- GenPept: submitted CDS (GenBank); redundant with Swiss-Prot
- PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
- PDB: Protein Data Bank: 3D data and associated sequences
- PRF: journal scan of published peptide sequences
- RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4 000 species)

45 UniProt, the Universal Protein Resource, is maintained by the UniProt consortium: SIB + EBI + PIR. SIB = Swiss Institute of Bioinformatics; EBI = European Bioinformatics Institute; PIR = Protein Information Resource


48 The UniProt Knowledgebase (UniProtKB): an encyclopedia of proteins, released biweekly

49 EMBL -> TrEMBL: automated extraction of the protein sequence (translated CDS), gene name and references, plus automated annotation

50 !!!! The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry. Automated annotation: RuleBase, using rules derived from Swiss-Prot manually annotated entries but with no manual oversight; Spearmint, using automatically generated rules

51 EMBL -> TrEMBL: automated extraction of the protein sequence (translated CDS), gene name and references, plus automated annotation. TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information

52 UniProtKB from TrEMBL to Swiss-Prot Sequence check

53 UniProtKB/Swiss-Prot: 1 entry <-> 1 gene (1 species). i) Merge of all known protein sequences (CDS) derived from the same gene -> avoids redundancy and improves sequence reliability (for human: ~6 different sequence reports per entry). ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants, etc.) -> annotation of protein diversity

54 Righting the wrongs Sequences are rarely deposited in a mature state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections. Sequencing error rates: ~1 base in

55 Protein existence: the type of evidence that supports the existence of a protein. Different qualifiers: 1. Evidence at protein level (~15.3 %); 2. Evidence at transcript level (~15.8 %); 3. Inferred from homology (~65.2 %); 4. Predicted (~3.4 %); 5. Unassigned (mainly in TrEMBL) (0.3 %)

56 Annotation Focal point of our efforts to maintain and develop UniProtKB/Swiss-Prot; Enables individual researchers to obtain a summary of what is known about a protein

57 In a UniProtKB/Swiss-Prot entry, you can expect to find: a (often corrected) protein sequence and the description of various isoforms/variants; its biological origin, with links to the taxonomic databases; all the names of a given protein (and of its gene); a summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D data, etc.; a description of important sequence features: domains, PTMs, variations, etc.; a selection of references; selected keywords; numerous cross-references (central hub)

58 An easy way to access the history of a protein sequence entry UniSave homepage:

61 Other UniProt databases

63 UniRef

64 UniRef is useful for comprehensive BLAST searches because it provides sets of representative sequences («collapsing BLAST results»). It consists of three collections of sequence clusters from the UniProt Knowledgebase and EnsEMBL, IPI, EMBL_WGS:
- One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments with 11 or more residues are placed into a single record) -> reduction of 12 %
- One UniRef90 entry -> sequences that have at least 90 % identity -> reduction of 40 %
- One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Clustering is independent of the species!
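The idea behind the UniRef100 level can be sketched in Python as a grouping of entries by exact sequence identity. This is a simplified illustration, not UniProt's clustering procedure: it ignores the sub-fragment rule (11 or more residues) and the 90 % / 50 % identity levels, and the entry names and sequences are made up.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (identifier, sequence) pairs that share exactly the same sequence."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        clusters[sequence.upper()].append(identifier)
    return dict(clusters)


def reduction_percent(n_before, n_after):
    """Percentage of records removed by collapsing."""
    return 100.0 * (n_before - n_after) / n_before


# Hypothetical input: four entries, three of which carry the same sequence.
records = [
    ("entryA", "MKTAYIAK"),
    ("entryB", "MKTAYIAK"),
    ("entryC", "MKTAYIAR"),
    ("entryD", "mktayiak"),
]
clusters = collapse_identical(records)
print(clusters)
print(reduction_percent(len(records), len(clusters)))
```

Searching one representative per cluster instead of every member is what collapses redundant BLAST hits.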

65 UniParc


67 UniParc: the UniProt Archive (UniParc) is part of the UniProt project. It is a non-redundant archive of protein sequences extracted from public databases: UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, PIR-PSD, EMBL, EMBL WGS, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, H-Invitational Database, TROME, European Patent Office proteins, United States Patent and Trademark Office proteins (USPTO) and Japan Patent Office proteins. UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the cross-references. Each unique sequence is stored only once, with a stable identifier. The format of the identifier is UPI followed by ten hexadecimal digits.

68 UniParc: use with extreme caution! It also contains pseudogenes, incorrect CDS predictions, etc., as well as patent office data (EPO, ESPO).
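The identifier format described above (UPI followed by ten hexadecimal digits) can be checked with a regular expression. The sketch below assumes uppercase hexadecimal digits; the test identifiers are made up for illustration and are not claimed to be real UniParc records.

```python
import re

# "UPI" followed by exactly ten hexadecimal digits (uppercase assumed here).
UPI_PATTERN = re.compile(r"UPI[0-9A-F]{10}")


def looks_like_uniparc_id(identifier):
    """Return True if the whole string matches the UPI identifier format."""
    return UPI_PATTERN.fullmatch(identifier) is not None


# Hypothetical identifiers used only to exercise the pattern.
for candidate in ("UPI00000000A1", "UPI12345", "UPI00000000G1", "P12345"):
    print(candidate, looks_like_uniparc_id(candidate))
```

A format check says nothing about whether the identifier exists in the archive.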

69 Not downloadable

70 UniMES

71 The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data. UniMES is available in FASTA format on the UniProt FTP servers, in the new subdirectory current_release/unimes: ftp.uniprot.org/pub/databases/uniprot, ftp.ebi.ac.uk/pub/databases/uniprot, ftp.expasy.org/databases/uniprot

73 NCBInr (Entrez protein)

74 Protein sequences: «NR» Entrez protein



77 RefSeq

79 RefSeq: the Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts and proteins.
- 3,648,590 entries (22-May-2007); 4,300 species
- 5,590,364 entries (11-July-2008); 5,395 species
- 6,042,750 entries (20-November-2008); 5,726 species
Accession numbers: RNA (NM_); genomic (NT_); protein (NP_); predicted protein (XP_)
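The prefix convention listed above lends itself to a simple lookup. The sketch below covers only the four prefixes named on the slide (RefSeq uses more); the function name and the sample accessions are hypothetical and serve only as illustration.

```python
# Only the prefixes mentioned on the slide; RefSeq defines additional ones.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
    "XP_": "predicted protein",
}


def refseq_record_type(accession):
    """Return the record type for a RefSeq-style accession, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Made-up accessions, used only to show the lookup.
for accession in ("NP_000001", "XP_123456", "P12345"):
    print(accession, "->", refseq_record_type(accession))
```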

81 [Screenshot of a database entry, highlighting the accession number (AC), keywords (KW), taxonomy and references]

82 NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF. PRF: sequences derived from scientific publications («journal scan»; integrated into TrEMBL). PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt). GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot. PDB: 3D structure database containing all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

83 PIR

84 PIR: the Protein Information Resource. PIR-PSD is no longer updated, but exists as an archive

85 PDB

86 PDB (Protein Data Bank), 3D structure database. Contains the spatial coordinates of the atoms of macromolecules whose 3D structure has been obtained by X-ray or NMR studies. Also contains the corresponding protein sequences*. Also includes mutated protein sequences (effect of a mutation on the 3D structure). *The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

87 PDB: Protein Data Bank. Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA). Associated specialized programs allow the visualization of the corresponding 3D structure (e.g. SwissPDB-viewer, Chime, Rasmol). Currently there are structural data for about different proteins, but far fewer protein families (highly redundant)!

88 PDB: example

89 Sequence Coordinates of each atom

90 Visualisation with Jmol

91 PRF

92 PRF looks for peptide sequences described in publications (and which are not submitted to databases!)

94 Query at Entrez protein (NCBInr)

95 Typical result of a query at «Entrez protein»: entries from RefSeq, GenPept (gb/embl/ddbj), PIR, Swiss-Prot and PDB

96 AC GenInfo identifier number

97 GI number: GenInfo identifier number - In addition to an AC number specific to the original database, each protein sequence in NCBInr has a GI number. - If the sequence changes in any way, a new GI number is assigned -> not a stable identifier - A separate GI number is also assigned to each protein translation within a nucleotide sequence record (alternative products) - A Sequence Revision History tool is available to track the various GI numbers, version numbers and update dates for sequences that appeared in a specific GenBank record

99 EnsEMBL not only for proteins.

100 EnsEMBL Automated genome annotation and subsequent visualisation of annotated genomes. Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant and fungal genomes.

101 - EnsEMBL aligns the genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes) - It also performs gene prediction (-> novel genes) - DNA, RNA and protein sequences are available for ~30 species - Browsing tool

103 Browsing tool available for 49 species


105 CCDS Consensus CDS protein set

107 CCDS (human): combines different approaches (ab initio, by similarity) and takes advantage of the expertise acquired by different institutes, including manual annotation. Consensus between 4 institutions

109 IPI International Protein Index

111 IPI (International Protein Index) provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken and cow proteomes: Swiss-Prot, TrEMBL, RefSeq and Ensembl (and H-InvDB, TAIR and VEGA). IPI is built to provide maximum coverage of the major publicly available protein (and gene) databases for the same protein. For each protein in IPI, an entry from one of the constituent databases is selected as the master entry, and supplies the IPI entry with its sequence and annotation. Stable identifiers (with incremental versioning) are maintained to allow the tracking of sequences between IPI releases.

113 IMGT (international ImMunoGeneTics information) is a collection of high-quality integrated databases specialising in immunoglobulins, T cell receptors and the Major Histocompatibility Complex (MHC) of all vertebrate species.

116 Protein sequence databases for proteomics

117 Databases used by protein identification tools:
- Phenyx: UniProtKB
- PROWL: NCBInr, Swiss-Prot, dbEST
- Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL*
- Peptident (Aldente): UniProtKB
- Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB
EST sequences are translated in the 6 frames (ESTs are not associated with annotated CDSs!). * OWL is obsolete since 1999
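Because ESTs carry no annotated CDS, search tools translate them in all six reading frames. A minimal Python sketch of that translation follows; it assumes the standard genetic code and marks codons containing ambiguous bases as X, and it is not the implementation used by any of the tools listed above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse-complement strand of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Made-up nine-base sequence used only to show the six frames.
for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Stop codons appear as `*`; a search engine would typically split each frame at stops before matching peptides.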

118 OWL: non-redundant protein database, including Swiss-Prot, PIR, NRL3-D* and GenPept. *The PIR-NRL3D database makes the sequence information in PDB available for similarity searches

121 ID/AC mapping

122 -> Accession / version number jungle! Depending on the database, an AC number can be associated with an entry (gene product: stable even if the sequence changes) or with a sequence (it changes as soon as the sequence changes)

123 In summary, for the same protein sequence you can find: a UniProtKB/Swiss-Prot entry; a RefSeq (or GenPept) entry; an EnsEMBL entry; a CCDS entry; a UniParc entry (archive); an IPI entry

124 The AC number jungle (type of record: sample accession format)
- GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345; two letters followed by six digits, e.g. AF
- Swiss-Prot/TrEMBL: one letter and five digits/letters, e.g. P12345, A0B533
- RefSeq nucleotide: two letters, underscore and six digits, e.g. mRNA NM_, e.g. genomic NT_
- RefSeq protein: e.g. NP_00483
- RefSeq prediction: e.g. XM_, e.g. XP_
- PDB (protein structure): one digit followed by three letters, e.g. 1TUP
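The formats listed above can be turned into regular expressions to see why this is a jungle: some accessions fit more than one format. The patterns below follow the slide's simplified descriptions only, not the databases' official syntax, and the RefSeq test accession is made up.

```python
import re

# Simplified patterns taken from the slide's descriptions (assumptions).
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "Swiss-Prot/TrEMBL": re.compile(r"[A-Z][A-Z0-9]{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d{6}"),
    "PDB": re.compile(r"\d[A-Z]{3}"),
}


def candidate_types(accession):
    """Return every record type whose simplified format fits the accession."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession.upper())
    ]


for accession in ("U12345", "A0B533", "NM_123456", "1TUP"):
    print(accession, candidate_types(accession))
```

`U12345` fits both the nucleotide and the protein pattern, which is why format alone cannot identify the source database.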

125 uniprot.org


127 UniProtKB and PTMs

128 Proteome complexity Not predictable at the genome level! (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: ).

129 Chemical aspects: post-translational modifications (PTMs) consist of the breaking and/or the making of covalent bonds, catalyzed by enzymes. PTMs modify both the protein mass and the isoelectric point (pI)
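The mass effect of a PTM can be illustrated as a sum of mass shifts. The monoisotopic shifts below are standard values added here and are not taken from the slides; the unmodified peptide mass is an arbitrary number, and the change in pI is not modelled.

```python
# Monoisotopic mass shifts in Da (standard values, not from the slides).
DELTA_MASS = {
    "phosphorylation": 79.96633,  # + HPO3
    "acetylation": 42.01057,      # + C2H2O
    "methylation": 14.01565,      # + CH2
    "oxidation": 15.99491,        # + O
}


def modified_mass(unmodified_mass, modifications):
    """Add the mass shift of each listed modification to the unmodified mass."""
    return unmodified_mass + sum(DELTA_MASS[name] for name in modifications)


# Arbitrary peptide mass, used only to show the shifts.
peptide_mass = 1000.0
print(modified_mass(peptide_mass, ["phosphorylation"]))
print(modified_mass(peptide_mass, ["acetylation", "oxidation"]))
```

These shifts are what a mass spectrometry search engine looks for when it considers variable modifications.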


130 The PTM variety. Amino acids: Gly, Ala, Val, Leu, Ile, Lys, Arg, His, Asp, Glu, Asn, Gln, Cys, Ser, Thr, Met, Pro, Phe, Tyr, Trp. Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar. N-terminal modifications: acetylation, methylation, acylation, crosslinks. C-terminal modifications: GPI, amidation, crosslinks, methylation. Colour key of the figure: black = cytoplasmic modifications; dark grey = both cytoplasmic and extracellular modifications, depending on the exact type; light grey = extracellular modifications

131 PTM distribution among kingdoms. [Venn diagram across archaea, bacteria and eukaryotes.] Modifications shown: FMN binding, bacterial lipid anchor, pyrrolysine, bacteria-specific methylation, lanthionine crosslink, archaea-specific methylation, acetylation, archaean lipid anchor, phosphorylation, myristoylation, methylation, FAD binding, diphthamide, palmitoylation, GPI-anchor, amidation, sulfation, eukaryote-specific methylation

132 PTM annotation in UniProtKB entries: PTMs are annotated in the feature table ("sequence annotation") when they can be assigned a position on the protein sequence, and in the comments when they cannot.

133 PTM-dedicated FT keys (FT key: usage)
- CARBOHYD (Glycosylation): sugars
- DISULFID (Disulfide bond): disulfide bonds
- CROSSLNK (Cross-link): other crosslinks
- LIPID: lipids
- MOD_RES (Modified residue): other modifications
PTMs are grouped by type, and are specifically and uniquely annotated by the use of a controlled vocabulary and a set of specific FT keys

134 PTM annotation in UniProtKB entries: PTMs are annotated in the feature table when they can be assigned a position on the protein sequence, and in the comments when they cannot. Associated keywords are also assigned.
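The FT keys above make PTM features easy to pull out of an entry in flat-file form. The sketch below assumes a whitespace-separated feature line of the form `FT KEY start end description`, as in older flat-file releases; the sample entry text is invented, and continuation lines or fuzzy positions are simply skipped.

```python
# PTM-dedicated feature keys and their usage, as listed on the slide.
PTM_KEYS = {
    "CARBOHYD": "sugars",
    "DISULFID": "disulfide bonds",
    "CROSSLNK": "other crosslinks",
    "LIPID": "lipids",
    "MOD_RES": "other modifications",
}


def ptm_features(flat_file_text):
    """Return (key, start, end, description) for each PTM feature line."""
    features = []
    for line in flat_file_text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 4 or parts[0] != "FT" or parts[1] not in PTM_KEYS:
            continue
        if not (parts[2].isdigit() and parts[3].isdigit()):
            continue  # skip fuzzy or unknown positions in this sketch
        description = parts[4] if len(parts) > 4 else ""
        features.append((parts[1], int(parts[2]), int(parts[3]), description))
    return features


# Invented feature lines, assumed layout (not copied from a real entry).
SAMPLE = """\
FT   CHAIN         1    120       Hypothetical protein.
FT   MOD_RES      15     15       Phosphoserine.
FT   DISULFID     30     75
FT   CARBOHYD     42     42       N-linked (GlcNAc...).
"""
for feature in ptm_features(SAMPLE):
    print(feature)
```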


136 Find all mouse proteins which are phosphorylated

138 UniProtKB/Swiss-Prot: number of PTMs in Swiss-Prot release 51, all organisms. [Table; columns: Pot. (potential), By sim. (by similarity), Exp. & Prob. (experimental and probable), total; rows: signal peptide, N-GlcNAc, O-GalNAc, O-GlcNAc, phosphorylation, sulfation, myristate, GPI-anchor]

139 RESID

140 RESID is a database of 473 natural modifications (Rel. 56.00) with chemical and structural annotations such as recommended name and synonyms, delta mass, 3D structure, UniProt annotations, etc. FTP sites: ftp://ftp.ebi.ac.uk/pub/databases/resid/ and ftp://ftp.ncifcrf.gov/pub/users/residues

143 Other PTM databases: UNIMOD; PSI-MOD (ontology); Delta Mass

144 GO

146 GO scope: three disjoint axes. Cellular component: sub-cellular location, e.g. nucleus, ribosome, origin recognition complex. Molecular function: molecular role, e.g. catalytic activity, binding. Biological process: broad biological phenomena, e.g. mitosis, growth, digestion

147 GO structure: terms are related within a hierarchy. Terms are linked by two relationships: is-a and part-of

148 GO structure. [Diagram: example graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]
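The two relationships can be pictured as labelled edges in a graph, where a term may have several parents. The Python sketch below collects all ancestors of a term; the edges are an assumed reading of the slide's diagram, not an export from the Gene Ontology.

```python
# (child, relationship, parent) edges: an assumed reading of the diagram.
RELATIONS = [
    ("membrane", "part-of", "cell"),
    ("chloroplast", "part-of", "cell"),
    ("mitochondrial membrane", "is-a", "membrane"),
    ("chloroplast membrane", "is-a", "membrane"),
    ("chloroplast membrane", "part-of", "chloroplast"),
]


def ancestors(term, relations):
    """Return every term reachable from `term` by following is-a/part-of edges."""
    parents = {}
    for child, _relationship, parent in relations:
        parents.setdefault(child, set()).add(parent)
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane", RELATIONS)))
```

Because a term can have more than one parent, the structure is a graph rather than a strict tree, which is why the traversal tracks visited terms.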

149 GOA: Gene Ontology Annotation What is GOA? GOA aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase and International Protein Index, using GO terms. The GOA project is run by EBI and is a member of the GO consortium since In 2001, the first phase of the GOA project involved the large-scale assignment of GO terms to Swiss-Prot and TrEMBL entries using electronic methods, namely the mappings spkw2go, ec2go and Interpro2go.


151 e-proxemis:
