Proteomics databases

Lorena Randall

1 Proteomics databases and protein characterization tools Part I Proteomics databases

2 Proteomics databases
1. Sequence databases: «The story of a protein sequence's life»
2. Swiss-Prot: a quick overview
3. UniProt utilities: UniRef and UniParc
4. Swiss-Prot and the other protein databases

Where do the protein sequences come from? How reliable are they? What do you have to watch out for?

3 Real life of a protein sequence
[Diagram: flow of a protein sequence through the databases]
- Sources: cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, or are delayed or cancelled)
- EMBL, GenBank, DDBJ: nucleic acid sequences, with or without annotated CDS
- TrEMBL, GenPept: CoDing Sequences provided by submitters
- RefSeq (XP_NNNNN): CoDing Sequences provided by submitters and «de novo» gene prediction
- PRF: sequences derived from scientific publications
- PIR
- 3D structures
- Swiss-Prot: manually annotated

UniProt: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

Let's start at the very beginning...

4 Real life of a protein sequence
cDNAs, ESTs, genomes, ... -> EMBL, GenBank, DDBJ (with or without annotated CDS provided by authors). Some data are not submitted to public databases, or are delayed or cancelled.
CDS: CoDing Sequence, the portion of DNA/RNA translated into protein (from Met to STOP).

EMBL/GenBank/DDBJ
- The 3 main public nucleic acid sequence databases are EMBL (EBI), GenBank (NCBI) and DDBJ (Japan): «different views of the same data set», shared within 2-3 days
- Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
- EMBL: since 1982
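The CDS definition above (translated from Met to STOP) can be illustrated with a small sketch. It assumes the standard genetic code and a CDS that is already spliced and on the forward strand; the function name `translate_cds` is ours, not part of any database tool.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate from the first ATG to the first in-frame stop codon."""
    dna = dna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        # "X" stands for codons containing ambiguous bases.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Illustrative input only: a short made-up fragment, not a database entry.
print(translate_cds("ccATGGGGGTGCACGAATGAtt"))
```

Real CDS features are often split over several exons or lie on the complementary strand, so the nucleotide sequence usually has to be assembled from the feature location before a step like this can be applied.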

5 EMBL/GenBank/DDBJ
- Serve as archives
- Contain all public sequences derived from:
  - Genome projects (> 80 % of entries)
  - Sequencing centers (cDNAs, ESTs, ...)
  - Individual scientists (15 % of entries)
  - Patent offices (e.g. European Patent Office, EPO)
- Currently: 30 x 10^6 sequences, ~36 x 10^9 bp
- Sequences from > 50 000 different species

The tremendous increase in nucleotide sequences
[Chart: growth of nucleotide sequences for Human, Mouse, Rat and Other]
- 1980: 80 genes fully sequenced!
- Human/Mouse/Rat: organisms with the highest redundancy!

6 EMBL/GenBank/DDBJ
- A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository)
- The authors have full authority over the content of the entries they submit! (exception: TPA, since January 2003)

An EMBL entry (slide callouts: DNA (genomic) or RNA, keywords, taxonomy, references, cross-references):
ID   HSERPG standard; genomic DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP
RX   MEDLINE;
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313: (1985).
XX
DR   GDB; ; EPO.
DR   GDB; ; TIMP1.
DR   Swiss-Prot; P01588; EPO_HUMAN.
XX

7 An EMBL entry (continued)
CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294.
FT   CDS             join( , , , , )
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join( , , , )
FT                   /product=erythropoietin
FT   sig_peptide     join( , )
FT   exon
FT                   /number=1
FT   intron
FT                   /number=1
FT   exon
FT                   /number=2
FT   intron
FT                   /number=2
FT   exon
FT                   /number=3
FT   intron
FT                   /number=3
FT   exon
FT                   /number=4
FT   intron
FT                   /number=4
FT   exon
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 120

Slide callouts: CDS = CoDing Sequence (proposed by submitters); annotation (prediction or experimentally determined); sequence.

8 EMBL entry, CDS feature:
FT   CDS             complement(45959..47332)
FT                   /db_xref="SPTREMBL:Q9UZ71"
FT                   /note="PAB2386"
FT                   /transl_table=11
FT                   /product="4-aminobutyrate qui se dilate AMINOTRANSFERASE
FT                   (EC 2.6.1.19)"
FT                   /protein_id="CAB50188.
FT                   /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT                   FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT                   LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT                   VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT                   VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT                   VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT                   NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT                   AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"
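Feature locations such as `complement(45959..47332)` or `join(...)` tell you which bases make up a CDS. The following is a minimal sketch of how such a location string could be read; it only handles plain `start..end` segments wrapped in `join(...)` and/or `complement(...)`, and the helper names are hypothetical.

```python
import re


def parse_location(location: str):
    """Return (is_complement, [(start, end), ...]) for a simple location string."""
    loc = location.replace(" ", "")
    is_complement = loc.startswith("complement(") and loc.endswith(")")
    if is_complement:
        loc = loc[len("complement("):-1]
    if loc.startswith("join(") and loc.endswith(")"):
        loc = loc[len("join("):-1]
    # Each segment is written start..end with 1-based, inclusive coordinates.
    segments = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", loc)]
    return is_complement, segments


def spliced_length(segments) -> int:
    """Total number of bases covered by the segments."""
    return sum(end - start + 1 for start, end in segments)


strand, segs = parse_location("complement(45959..47332)")
print(strand, segs, spliced_length(segs))

# A made-up two-exon location, for illustration only.
print(parse_location("join(10..20,31..45)"))
```

For the CDS shown above the span covers 1374 bases, a multiple of three, which is a quick sanity check one can run on any annotated CDS. More complex location syntax (partial ends, remote references) is deliberately not covered here.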

9 Proteomics databases
1. Sequence databases: «The story of a protein sequence's life»
2. Swiss-Prot: a quick overview
3. UniProt utilities: UniRef and UniParc
4. Swiss-Prot and the other protein databases

Real life of a protein sequence
- cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, or are delayed or cancelled)
- EMBL (nucleic acids), with or without annotated CDS
- TrEMBL (amino acids): CoDing Sequences provided by submitters
- Swiss-Prot (amino acids): manually annotated

10 Since December 15, 2003, Swiss-Prot and TrEMBL constitute the UniProt Knowledgebase (with integration of the PIR data) -> gives access to all known* protein sequences
* submitted to the public databases (EMBL, GenBank, DDBJ, SWISS-PROT)

11 A SWISS-PROT entry = a protein sequence associated with biological information that is:
- manually checked
- well structured
- periodically updated
- searchable

A TrEMBL entry = a protein sequence associated with biological information that is:
- computer-annotated
- well structured
- periodically updated
- searchable

12 [Diagram: EMBL -> CDS -> TrEMBL -> Swiss-Prot]
- Once in Swiss-Prot, no more in TrEMBL -> minimal redundancy
- Annotation of conflicts

13 [Diagram: EMBL -> CDS -> TrEMBL -> Swiss-Prot]
How to make things clear? Depending on the server:
- UniProt = Swiss-Prot + TrEMBL = SPTR = SWALL
- Swiss-Prot = UniProt/Swiss-Prot
- TrEMBL = UniProt/TrEMBL = SPTrEMBL
- TrEMBL = SPTrEMBL + TrEMBLnew**
** is going to disappear soon!
EMBnet 2004: Proteomics

14 Swiss-Prot
1. Minimal redundancy;
2. Maximal manual annotation;
3. Integration with other databases.

Swiss-Prot 1. Minimal redundancy: 1 gene (1 species) -> 1 Swiss-Prot entry
Identical sequences are merged, as are variants, fragments and alternative splicing isoforms.

15 Swiss-Prot 2. Maximal manual annotation:
- Function(s)
- Interactions
- Subcellular localization and tissue expression
- Structure (domains, ...)
- Post-translational modifications (PTMs)
- Variants (alternative splicing, polymorphisms, ...)
- Similarities

Swiss-Prot 3. Integration with other databases:
Release (26-Sep-2003): 83 links to other databases.

16 Up-to-date sources:
- Swiss-Prot (since 1986) -> ExPASy
- TrEMBL -> EBI (European Bioinformatics Institute)
You can install the ExPASyBar on your computer.
Amos links

17 Search also with accession numbers (Swiss-Prot or other databases)

18 Swiss-Prot an overview View «by default» on the ExPASy server

19 ExPASy EBI NCBI

20 It is not always obvious which database your protein sequence comes from!
Topology of a Swiss-Prot entry: sequence

21 Swiss-Prot protein sequence:
- The longest sequence is usually «displayed»
- Precursor (except INIT_MET 0 and «amino acid sequencing»)
- Comparison of genomic and cDNA sequences -> carefully checked; validated! -> choose the most representative
-> The sequence quality is always increasing.

Swiss-Prot's daily bread:
- Alternative splicing?
- Same gene?
- Polymorphisms?
- Alternative initiation?
- RNA editing?
- Usage of an alternative promoter?
- Selenocysteine?
- Fragment?
- Sequencing errors?

22 Topology of a Swiss-Prot entry: Identifier, Accession Nr., Protein name, Gene name, sequence
Always cite the primary accession number!

23 Topology of a Swiss-Prot entry: Identifier, Accession Nr., Protein name, Gene name, Taxonomy, References, sequence

24 References cover:
- Complete sequences; fragments
- Function, characterization, interaction
- Post-translational modifications
- 3D structure (crystallography or NMR)
- Polymorphisms

Topology of a Swiss-Prot entry: Identifier, Accession Nr., Protein name, Gene name, Taxonomy, References, Comments, sequence

25 Comment lines
- Function(s) and role(s); for enzymes: a. Catalytic activity (if EC number); b. Cofactor; c. Enzyme regulation; d. Pathway
- Subunit (protein/protein interactions)
- Subcellular location
- Alternative products (alt. splicing, alt. initiation, RNA editing)
- Tissue specificity (Northern and Western results)
- Developmental stage
- Induction (genetic control)
- Domain
- Post-translational modifications (PTM)
- Mass spectrometry
- Polymorphisms
- Disease
- Biotechnology
- Pharmaceutical
- Miscellaneous
- Similarities
- Caution
- Database (specialized cross-references)

Information in comment lines is derived from: publications; databases; personal communications; predictions; brainstorming.

26 Experimental qualifiers:
- «-»: experimentally proved;
- «By similarity»: experimentally proved in an ortholog or in another member of the family;
- «Probable»: not proved, but realistic;
- «Potential»: predicted.
Examples: ICOL_HUMAN, O75144; AAA1_HUMAN, Q9NS82; BRH2_HUMAN, Q9NY43

27 Topology of a Swiss-Prot entry: Identifier, Accession Nr., Protein name, Gene name, Taxonomy, References, Comments, Cross-references, sequence

Cross-references (X-ref)
- Swiss-Prot was the first database with X-refs
- Explicit links to 53 databases
- Implicit X-references to 30 additional databases, added by the ExPASy servers on the WWW (such as GenBank, Ensembl, ...)
=> links to 83 databases from the ExPASy servers
- Currently 1.2 x 10^6 cross-references in Swiss-Prot
Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55

28 Swiss-Prot currently acts as the main index for the 15 federated 2D-PAGE databases.

Cross-references: ICE8_HUMAN, Q14790
- DNA (index of low redundancy)
- Examples of implicit links to GenBank and DDBJ, added on the fly by the ExPASy server
- 3D
- Genomic

29 Cross-references S_HUMAN P D-PAGE

30 [Screenshot callouts]
- Theoretically computed pI and MW
- Experimentally determined position
- Theoretically computed pI and MW with potential phosphorylation and acetylation sites

Topology of a Swiss-Prot entry: Identifier, Accession Nr., Protein name, Gene name, Taxonomy, References, Comments, Cross-references, Keywords, sequence
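As a rough illustration of what «theoretically computed MW» means, the sketch below sums average residue masses, adds one water molecule, and optionally adds the mass shift of phosphorylation or acetylation. This is an assumption about the general approach, not the ExPASy implementation; the function and dictionary names are ours, and pI computation is not covered.

```python
# Average residue masses in Da (amino acid minus one water molecule).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Average mass added by each modification (Da).
PTM_SHIFT = {"phospho": 79.9799, "acetyl": 42.0367}


def molecular_weight(sequence: str, modifications=()) -> float:
    """Average molecular weight of a peptide chain plus optional PTM shifts."""
    mass = WATER + sum(RESIDUE_MASS[aa] for aa in sequence.upper())
    mass += sum(PTM_SHIFT[name] for name in modifications)
    return mass


# Hypothetical peptide, used only to show the call.
peptide = "MGVHECPAWLW"
print(round(molecular_weight(peptide), 2))
print(round(molecular_weight(peptide, ["phospho", "acetyl"]), 2))
```

Comparing such a computed value with the experimentally determined position on a 2D gel is one way to spot processing events or modifications that the sequence alone does not reveal.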

31 Keywords (automated and manual annotation) Q9HC96 Calpain 10 n=481 entries

32 Topology of a Swiss-Prot entry: Identifier, Accession Nr., Protein name, Gene name, Taxonomy, References, Comments, Cross-references, Keywords, Feature table, sequence

33 Sequence features: manual annotation (ICOL_HUMAN, O75144)
- General topology
- Domains
- PTM

34 Sequence features: manual annotation (ICOL_HUMAN, O75144)
- General topology
- Domains
- PTM
- Alternative splicing

35 All the «alternatively spliced sequences» are available on the ExPASy server in FASTA format, e.g. for BLAST searches or proteomic tools. Some proteomic tools on other servers, such as Mascot, also include these «alternatively spliced sequences» in their search engines.
BRC2_HUMAN, P51587
Polymorphisms: differences between the sequence shown and other submitted sequences

36 Swiss-Prot and PTM annotations
PTM annotations appear in four parts of a Swiss-Prot entry:
- References (Rx lines)
- Comments (CC lines): CC -!- PTM:
- Keywords (KW lines): KW
- Feature table (FT lines): FT

37 References and comments (CC PTM), examples:
- The N-terminus is blocked.
- Phosphorylation of Tyr-660 reduces the ability of 4.1 to promote the assembly of the spectrin/actin/4.1 ternary complex.
- Sulfated.
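PTM comments like the ones above live in the CC lines of the flat file, each topic introduced by `-!-`. A minimal sketch of how those lines could be collected is shown here; the sample lines are assembled from the comment texts quoted on the slide for illustration only, and `parse_comments` is a hypothetical helper.

```python
def parse_comments(lines):
    """Collect (topic, text) pairs from the CC lines of a Swiss-Prot flat file."""
    comments = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("---"):
            # A dashed line opens the trailing notice block; stop there.
            break
        if body.startswith("-!-"):
            topic, _, text = body[3:].partition(":")
            comments.append([topic.strip(), text.strip()])
        elif comments and body:
            # Continuation line of the current topic.
            comments[-1][1] += " " + body
    return [(topic, text) for topic, text in comments]


sample = [
    "CC   -!- PTM: Phosphorylation of Tyr-660 reduces the ability of 4.1 to",
    "CC       promote the assembly of the spectrin/actin/4.1 ternary complex.",
    "CC   -!- PTM: Sulfated.",
]
for topic, text in parse_comments(sample):
    print(topic, "->", text)
```

Filtering the result on the topic `PTM` gives the free-text PTM annotation, which complements the structured information in the KW and FT lines.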

38 PTM keywords (KW lines)
- Cleavage: Signal, Transit peptide, Protein splicing, etc.
- Linkage: Acetylation, Amidation, D-amino acid, Formylation, Glycoprotein, GPI-anchor, Hydroxylation, Hypusine, Iodination, Myristate, Palmitate, Phosphorylation, Prenylation, Sulfation, etc.
- Cross-link: Thioether bond, Thioester bond.

PTM features (FT lines)
- Cleavage: INIT_MET, PROPEP, SIGNAL, TRANSIT
- Linkage: MOD_RES, CARBOHYD, LIPID, BINDING
- Cross-link: DISULFID, CROSSLNK

39 Swiss-Prot & TrEMBL introduce a new arithmetical concept!
- Redundancy in TrEMBL, and redundancy between TrEMBL and Swiss-Prot
- In 3 years: more than ... protein sequences
- But in the future, redundancy is going to decrease: «new» genome sequencing -> «new» proteins (AB, Sept. 2002)
- In the case of human proteins, the redundancy is still very high: about ... *
* human gene number estimation: ...

Are missing:
- Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
- Not yet predicted or known genes («no CDS provided by the submitters» or no DNA sequence)
- Confidential data (patent application sequences)
- Immunoglobulins, T-cell receptors (-> UniParc)

40 Take home message
- Swiss-Prot is a non-redundant, manually annotated and highly cross-referenced protein knowledgebase.
- Be aware of the differences between TrEMBL and Swiss-Prot.
- Always cite the Accession number, not the ID.
- We need your feedback! swiss-prot@expasy.org

Righting the wrongs
Sequences are rarely deposited in a mature state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.
Sequencing error rates: ~1 base in
«Making people aware of errors is good and great; making people aware that they're responsible also for correcting errors is even greater»
C. Hardley, EMBO reports, 4(9), 2003.

41 Proteomics databases
1. Sequence databases: «The story of a protein sequence's life»
2. Swiss-Prot: a quick overview
3. UniProt utilities: UniRef and UniParc
4. Swiss-Prot and the other protein databases

UniProt consortium (since Oct. 2002):
- The UniProt Knowledgebase (UniProt): Swiss-Prot and TrEMBL, with integration of PIR data (Release 1, Dec. 2003).
- The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed up BLAST searches.
- The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

42 UniRef
Useful for comprehensive BLAST searches by providing sets of representative sequences («collapsing BLAST results»).
Three collections of sequence clusters from the UniProt knowledgebase (Swiss-Prot, TrEMBL):
- One UniRef100 entry -> all identical sequences (including fragments)
- One UniRef90 entry -> sequences that are at least 90 % identical
- One UniRef50 entry -> sequences that are at least 50 % identical
Independently of the species!

BLASTP: UniRef100
UniRef100 does not include TrEMBLnew (tn), because TrEMBLnew is going to «disappear» soon.
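The UniRef100 idea (one entry for all identical sequences, including fragments, regardless of species) can be sketched with a naive grouping routine. This is only an illustration of the concept under simplifying assumptions: a fragment is treated as any exact substring of a longer sequence, and the identifiers and sequences are invented. The real clusters are built by the UniProt consortium with dedicated methods.

```python
def cluster_identical_and_fragments(sequences):
    """Group identical sequences and exact fragments under one representative.

    sequences: dict mapping an identifier to its amino acid sequence.
    Returns a dict mapping each representative id to the list of member ids.
    """
    clusters = {}
    # Visit the longest sequences first, so that fragments come after
    # the full-length sequence that contains them.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if seq in sequences[rep_id]:
                clusters[rep_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]
    return clusters


# Invented identifiers and sequences, for illustration only.
example = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQR",   # identical to A
    "C": "TAYIAK",       # fragment of A
    "D": "MSSHEGGKKK",   # unrelated
}
print(cluster_identical_and_fragments(example))
```

UniRef90 and UniRef50 follow the same principle but with an identity threshold instead of exact matching, which requires sequence comparison rather than simple substring tests.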

43 BLASTP: UniRef100 BLASTP: UniRef90

44 BLASTP: UniRef90 BLASTP: UniRef50

45 UniParc
- Makes it possible to keep track of a protein sequence and of its integration in various databases
- Use with extreme caution: it also contains pseudogenes, incorrect CDS predictions, etc.!
- Also contains patent office database data (EPO, ESPO, ...).

46 Proteomics databases
1. Sequence databases: «The story of a protein sequence's life»
2. Swiss-Prot: a quick overview
3. UniProt utilities: UniRef and UniParc
4. Swiss-Prot and the other protein databases

Real life of a protein sequence
- cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, or are delayed or cancelled)
- EMBL (nucleic acids)
- TrEMBL (amino acids): CoDing Sequences provided by submitters
- Swiss-Prot (amino acids): manually annotated

47 Real life of a protein sequence
- cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, or are delayed or cancelled)
- EMBL, GenBank, DDBJ
- TrEMBL, GenPept: CoDing Sequences provided by submitters
- Swiss-Prot: manually annotated
- RefSeq (XP_NNNNN): CoDing Sequences provided by submitters and «de novo» gene prediction
- PRF: sequences derived from scientific publications

UniProt: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

Protein sequences: «NR database»

48 NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
- PRF: sequences derived from scientific publications (integrated into TrEMBL)
- GenPept: ~TrEMBL, except that it is redundant with Swiss-Prot
- PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
- PDB: 3D structure database, all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

NCBI Reference Sequence (RefSeq)
- The RefSeq collection: genomic DNA, transcript (RNA), and protein products
- RefSeq provides a non-redundant set of sequences, derived from GenBank, the literature and gene prediction.
- Release 3 includes over ... proteins from 2218 organisms (including 1100 viruses and 150 bacteria). Note: 1 entry = 1 sequence!
- The sequence data are tightly linked to LocusLink, which contains the associated biological information («interdependent curated resources»)

49 Example 1 Search for a gene name


50 Protein sequences: «NR database» AMBN 20 entries Swiss-Prot

51 «Entrez protein AMBN» Genpept Genpept RefSeq RefSeq RefSeq AC KW Taxonomy References Correspond to Swiss-Prot entry AMBN_HUMAN Q9NP70 GenBank source GenBank source

52 used for the construction of the RefSeq entry Description of the sequence differences Annotation

53 Example 2 BLAST searches Human EPO: Blastp against Swiss-Prot/TrEMBL (at the ExPASy server) *

54 Human EPO: Blastp against NR
All these human sequences are integrated into the corresponding Swiss-Prot entry, with the annotation of their differences (conflicts, variants, fragments, ...).

NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
- PRF: sequences derived from scientific publications (integrated into TrEMBL)
- GenPept: ~TrEMBL, except that it is redundant with Swiss-Prot
- PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
- PDB: 3D structure database, all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

55 PDB: Protein Data Bank
- Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
- Contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses. Proteins represent more than 90 % of available structures.
- Contains the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies.
- Specialized programs allow the visualization of the corresponding 3D structure (e.g. SwissPDB-viewer, Chime, Rasmol).
- Currently there are structural data for about ... molecules, but far fewer protein families (highly redundant)!

PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C ) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL CA 11
JRNL REF J.BIOL.CHEM. V CA 12
JRNL REFN ASTM JBCHA3 US ISSN CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE CA 21
REMARK 3 RMSD BOND DISTANCES ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27

56 PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70-1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97-1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL O LEU 144 N LEU CA 71
SHEET 7 S10 VAL 207 LEU O ILE 210 N GLY CA 72
SHEET 8 S10 TYR 191 GLY O TRP 192 N VAL CA 73
SHEET 9 S10 LYS 257 ALA O LYS 257 N THR CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST P CA 82
ORIGX CA 83
ORIGX CA 84
ORIGX CA 85
SCALE CA 86
SCALE CA 87
SCALE CA 88
ATOM 1 N TRP CA 89
ATOM 2 CA TRP CA 90
ATOM 3 C TRP CA 91
ATOM 4 O TRP CA 92
ATOM 5 CB TRP CA 93
ATOM 6 CG TRP CA 94
ATOM 7 CD1 TRP CA 95
ATOM 8 CD2 TRP CA 96
ATOM 9 NE1 TRP CA 97
ATOM 10 CE2 TRP CA 98
ATOM 11 CE3 TRP CA 99
ATOM 12 CZ2 TRP CA 100
ATOM 13 CZ3 TRP CA 101
ATOM 14 CH2 TRP CA 102

Slide callout: coordinates of each atom (ATOM records).
The same PDB entry visualized with Chime.
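PDB files are fixed-column text: each record type stores its fields at set character positions. As a small sketch, the HEADER record can be read by slicing, assuming the usual layout (classification in columns 11-50, deposition date in columns 51-59, ID code in columns 63-66). The helper `make_header` is hypothetical and only rebuilds a well-aligned line, since column alignment did not survive in the listing above.

```python
def make_header(classification: str, date: str, id_code: str) -> str:
    """Build a HEADER line with fields at their fixed column positions."""
    return "HEADER    " + classification.ljust(40) + date.ljust(9) + "   " + id_code


def parse_header(line: str) -> dict:
    """Extract classification, deposition date and ID code from a HEADER record."""
    if not line.startswith("HEADER"):
        raise ValueError("not a HEADER record")
    line = line.ljust(66)
    return {
        "classification": line[10:50].strip(),  # columns 11-50
        "date": line[50:59].strip(),            # columns 51-59
        "id_code": line[62:66].strip(),         # columns 63-66
    }


header_line = make_header("LYASE(OXO-ACID)", "01-OCT-91", "12CA")
print(parse_header(header_line))
```

The same slicing approach applies to ATOM records, where the x, y and z coordinates occupy fixed columns; splitting on whitespace is not reliable because adjacent fields can touch.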

57 3D structure databases: others
They are all derived from PDB data!
- HSSP: Homology-derived secondary structure of proteins
- FSSP: structural alignment
- SCOP: Structural classification of proteins
- CATH: hierarchical domain classification of protein structures
- HomStrad: HOMologous STRucture Alignment Database
- DALI server (EBI): network service for comparing protein structures in 3D

Protein databases used by the protein identification tools: the jungle

58 Protein databases used by the protein identification tools
- PROWL: NCBInr, Swiss-Prot, dbEST
- Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, LudwigNR, OWL*
- PeptIdent (Aldente): Swiss-Prot, TrEMBL
- Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB
* OWL is obsolete since 1999

Matrix Science (Mascot) sequence databases
MSDB: non-identical protein sequence database. Contains sequences derived from:
- PIR (now integrated into UniProt (Swiss-Prot/TrEMBL))
- TrEMBL
- REMTrEMBL (does not exist anymore, see UniParc)
- GenBank
- Swiss-Prot
- NRL3D (PDB-derived sequences)

59 The AC number jungle
Type of record: sample accession format
- GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345; two letters followed by six digits, e.g. AF
- Swiss-Prot/TrEMBL: one letter (O, P, Q) and five digits/letters, e.g. P12345
- RefSeq nucleotide: two letters, underscore and six digits, e.g. mRNA NM_; e.g. genomic NT_
- RefSeq protein: e.g. NP_00483
- RefSeq prediction: e.g. XM_; e.g. XP_
- PDB (protein structure): one digit followed by three letters, e.g. 1TUP

The end of part I
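The accession formats listed in «The AC number jungle» can be turned into a rough classifier. The regular expressions below are our own reading of the formats described on the slide, so treat them as assumptions; note that an accession such as P12345 fits both the nucleotide and the Swiss-Prot description, which is why the order of the tests matters.

```python
import re

# Order matters: P12345 also matches the "one letter + five digits" format.
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),
    ("Swiss-Prot/TrEMBL", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
    ("GenBank/EMBL/DDBJ", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("PDB", re.compile(r"^\d[A-Z0-9]{3}$")),
]


def classify_accession(accession: str) -> str:
    """Guess the database family of an accession number from its format."""
    acc = accession.strip().upper()
    for name, pattern in ACCESSION_PATTERNS:
        if pattern.match(acc):
            return name
    return "unknown"


for acc in ["U12345", "X02158", "P12345", "Q9UZ71", "1TUP", "12CA"]:
    print(acc, "->", classify_accession(acc))
```

A format match is only a hint about where an accession comes from; it says nothing about whether the entry actually exists in that database.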

60 PART II Protein characterization tools
What can we learn in silico from an amino acid sequence?
1. Domain, family attribution
2. Subcellular location
3. Post-translational modifications (PTMs)

61 Protein domain/family: some definitions
- Most proteins have «modular» structures
- Estimation: ~ 3 domains / protein
- Domains not only share a common structure but often also have a similar function, which contributes to the global activity of the proteins that contain them.

62 Domains are identified by multiple sequence alignments
Domains can be defined by different methods:
- Pattern (regular expression): used for very conserved domains
- Profiles (weighted matrices): two-dimensional tables of position-specific match, gap and insertion scores, derived from aligned sequence families; used for less conserved domains
- Hidden Markov Model (HMM): probabilistic models; another method to generate profiles

Pattern vs. profile
Pattern: [LIVM]-[ST]-A-[STAG]-H-C (answer: yes or no)
Profile (answer: score/threshold):
ID   TRYPSIN_DOM; MATRIX.
AC   PS50240;
DT   DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE   Serine proteases, trypsin domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2= ; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA   /I: B1=0; BI=-105; BD=-105;
MA   A B D E F G H I K L M N P Q R S T V W Y
MA   /M: SY='I'; M= -8,-29,-34,-26, 3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19, 3;
MA   /M: SY='N'; M= 0, 14, 10, 1,-22, -1, 6,-23, -4,-26,-17, 20,-14, -1, -6, 13, 2,-20,-34,-15;
MA   /M: SY='E'; M= -4, 4, 7, 14,-26,-13, -7,-23, 3,-22,-16, 2, 7, 3, -3, 2, -2,-21,-30,-18;
MA   /M: SY='R'; M=-12, 5, 5, 7,-23,-17, 3,-24, 8,-20,-12, 7,-16, 10, 12, -2, -6,-21,-27, -9;
MA   /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA   /M: SY='V'; M= 1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA   /M: SY='L'; M= -8,-29,-31,-22, 9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA   /M: SY='T'; M= 2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11, 1,-11, -9,-10, 23, 43, 0,-32,-12;
MA   /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10, 1, -1,-21,-18;
MA   /M: SY='A'; M= 40, -9,-17, -8,-21, 5,-18,-14, -9,-13,-12, -8,-11, -9,-16, 9, -2, -5,-21,-21;
MA   /M: SY='H'; M=-18, 0, 0, 1,-21,-19, 89,-29, -8,-21, -1, 9,-19, 11, 0, -7,-17,-29,-30, 16;
MA   /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA   /I: E1=0; IE=-105; DE=-105;
//
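A pattern such as [LIVM]-[ST]-A-[STAG]-H-C gives a yes/no answer, which makes it easy to express as a regular expression. The sketch below converts the common elements of PROSITE pattern syntax into a Python regex; the function name is ours, only the basic syntax is handled, and the test sequence is an invented fragment.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                          # element separator
    regex = regex.replace("{", "[^").replace("}", "]")      # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")       # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".")                         # x      -> any residue
    regex = regex.replace("<", "^").replace(">", "$")       # N- and C-terminal anchors
    return regex


trypsin_his = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
print(trypsin_his)

# Invented sequence fragment, for illustration only.
match = re.search(trypsin_his, "WVLTAAHCF")
print(match.start() if match else None)
```

A profile works differently: each position contributes a score from the matrix and the total is compared with a cut-off, so weakly conserved domains can still be detected where a strict pattern would answer «no».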

63 Protein domain/family databases
Integrated in InterPro:
- PROSITE: patterns / profiles
- ProDom: aligned motifs (PSI-BLAST) (Pfam B)
- PRINTS: aligned motifs
- Pfam: HMM (Hidden Markov Models)
- SMART: HMM
- TIGRfam: HMM
Others:
- DOMO: aligned motifs
- BLOCKS: aligned motifs (PSI-BLAST)
- CDD (CDART): PSI-BLAST (PSSM) of Pfam and SMART

InterPro
- Searches many domain databases simultaneously (PRINTS, PROSITE, Pfam, ProDom, SMART, and TIGRFAMs).
- Each entry contains a unique AC, a functional description of the domain and references.
- Links are made back to the relevant member databases.

65 What can we learn in silico from an amino acid sequence?
1. Domain, family attribution
2. Subcellular location
3. Post-translational modifications (PTMs)

Protein pathway in Eukaryota
[Diagram: default pathway vs. targeting with a specific signal; secretory pathway]

67 What can we learn in silico from an amino acid sequence?
1. Domain, family attribution
2. Subcellular location
3. Post-translational modifications (PTMs)

From genome to proteome (protein complexity)
- ~ ... human genes
- alternative splicing of mRNA (2-5 fold increase) -> ~ ... human transcripts
- post-translational modifications of proteins (PTMs) (5-10 fold increase) -> ~ 1'000'000 human proteins

68 PTM diversity
[Figure: a protein decorated with PTMs (GPI, Myr, Ngly, Ogly, Pho, Sul, ...)]
Legend:
- Am: Amidation
- AcN: Acetylation N-terminal
- AcI: Acetylation internal
- Alk: Alkylation
- Adp: ADP-ribosylation
- Bio: Biotinylation
- Bro: Bromination
- Cgly: C-linked glycosylation
- Ogly: O-linked glycosylation
- Ngly: N-linked glycosylation
- Dea: Deamidation
- Sul: Sulfation
- Far: Farnesylation
- Ger: Geranylgeranylation
- GPI: GPI-anchoring
- Met: Methylation
- Myr: Myristoylation
- Hyd: Hydroxylation
- Pho: Phosphorylation
- Pal: Palmitoylation
- Pyr: Pyrrolidone carboxylic acid
- Oxo: 2-amino-3-oxopropionic acid

Three major categories
- Cleavage: initiator Met, signal and transit peptides, propeptides, complex processing, etc.
- Linkage: simple chemical groups (phosphate, sulfate, methyl, hydroxyl, acetate, etc.); complex molecules (N-, O- or C-linked glycans, lipids, e.g. palmitate, myristate, GPI)
- X-linking: disulfide bonds, thioester, thioether bonds, etc.

69 PTM database
RESID is a database of protein post-translational modifications with descriptive, chemical, structural and bibliographic information. It contains 351 entries (last update Nov. 2003).

70 PTM prediction tools PTM prediction on ExPASy + PROSITE predictions (n~15)

71 PTM prediction
-> Beware of «biological consistency»! Check:
- Organism (Eubacteria, Archaea, Eukaryota)
- Subcellular location: secretory pathway (ER, Golgi); shuttle between organelles; topology
- A well-characterized orthologous protein
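The point about biological consistency can be made concrete with a deliberately crude sketch: a scan for the classical N-glycosylation sequon N-{P}-[ST]-{P}, followed by a filter that discards every hit when the protein does not enter the secretory pathway. The function names and the single boolean flag are illustrative assumptions; real tools use far richer evidence.

```python
import re

# N-glycosylation sequon N-{P}-[ST]-{P}; the lookahead allows overlapping hits.
SEQUON = re.compile(r"(?=N[^P][ST][^P])")


def n_glycosylation_candidates(sequence: str):
    """Return 1-based positions of Asn residues inside a sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


def consistent_sites(sequence: str, in_secretory_pathway: bool):
    """Keep sequon hits only if the protein can meet the glycosylation machinery."""
    if not in_secretory_pathway:
        return []
    return n_glycosylation_candidates(sequence)


# Invented sequence, for illustration only.
seq = "MNGTAANPSA"
print(n_glycosylation_candidates(seq))
print(consistent_sites(seq, in_secretory_pathway=False))
```

A sequence match alone is therefore only a candidate site: the same motif in a cytoplasmic protein is very unlikely to be glycosylated, and experimental data on a well-characterized ortholog remains the stronger argument.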

72 Some statistics
Number of PTMs in Swiss-Prot release 40, all organisms
[Table: columns Exp., Pot./prob., By sim., total; rows signal peptide, N-GlcNAc, O-GalNAc, O-GlcNAc, phosphorylation, sulfation, myristate, GPI-anchor (108)]
Total number of proteins < total number of PTMs

PTM annotation in SWISS-PROT, all organisms
[Chart: acetyl, phosphate, methyl, sulfate; total vs. proven]

73 We need your help! The end of part II
