Proteomics databases

1 Proteomics databases and protein characterization tools Part I Proteomics databases

2 Proteomics databases 1. Sequence databases: «The story of a protein sequence's life» 2. Swiss-Prot: a quick overview 3. UniProt utilities: UniRef and UniParc 4. Swiss-Prot and the other protein databases Where do the protein sequences come from? What about their reliability? What do you have to take care of?

3 Real life of a protein sequence [diagram]: cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, delayed or cancelled) -> EMBL, GenBank, DDBJ (with or without annotated CDS). CoDing Sequences provided by submitters -> TrEMBL, GenPept -> Swiss-Prot (manually annotated). CoDing Sequences provided by submitter and «de novo» gene prediction -> RefSeq (XP_NNNNN). Sequences derived from scientific publications -> PRF, PIR. 3D structures. UniProt: Swiss-Prot + TrEMBL + (PIR). NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF. Let's start at the very beginning

4 Real life of a protein sequence: cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, delayed or cancelled) -> EMBL, GenBank, DDBJ, with or without annotated CDS provided by authors. CDS: CoDing Sequence, the portion of DNA/RNA translated into protein (from Met to STOP). EMBL/GenBank/DDBJ: the 3 main public nucleic acid sequence databases are EMBL (EBI), GenBank (NCBI) and DDBJ (Japan): «different views of the same data set», exchanged within 2-3 days. Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %. EMBL: since 1982
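The CDS definition on this slide (the stretch translated from the initiator Met to the STOP codon) can be illustrated with a minimal Python sketch. The function name `translate_cds` and the toy nucleotide string are illustrative assumptions, not part of any EMBL tool; only the standard genetic code is used, and real entries may specify another translation table.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(dna: str) -> str:
    """Translate from the first ATG to the first in-frame stop codon."""
    dna = dna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""  # no initiator Met found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break  # STOP codon ends the CDS
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequence: 5' flank, a short CDS, a stop codon, 3' flank.
example = "ccATGGGGGTGCACGAATGTtgaCC"
print(translate_cds(example))
```

A submitter-provided CDS in a real entry is usually given as explicit coordinates (often a `join(...)` of exons), so the naive "first ATG" search is only a stand-in for that annotation.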

5 EMBL/GenBank/DDBJ serve as archives. They contain all public sequences derived from: genome projects (> 80 % of entries); sequencing centers (cDNAs, ESTs, ...); individual scientists (15 % of entries); patent offices (e.g. European Patent Office, EPO). Currently: 30 x 10^6 sequences, ~36 x 10^9 bp; sequences from > different species. [Figure: the tremendous increase in nucleotide sequences; series: Human, Mouse, Rat, Other.] 1980: 80 genes fully sequenced! Human/Mouse/Rat: organisms with the highest redundancy!

6 EMBL/GenBank/DDBJ A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository). The authors have full authority over the content of the entries they submit! (Exception: TPA, since January 2003.) An EMBL entry (slide callouts: keywords, taxonomy, references, cross-references, DNA (genomic) or RNA):
ID   HSERPG standard; genomic DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP
RX   MEDLINE;
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313: (1985).
XX
DR   GDB; ; EPO.
DR   GDB; ; TIMP1.
DR   Swiss-Prot; P01588; EPO_HUMAN.
XX

7 An EMBL entry, continued (slide callouts: CDS = CoDing Sequence, proposed by submitters; annotation, predicted or experimentally determined; sequence):
CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join( , , , , )
FT   CDS             join( , , , , )
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join( , , , )
FT                   /product=erythropoietin
FT   sig_peptide     join( , )
FT   exon
FT                   /number=1
FT   intron
FT                   /number=1
FT   exon
FT                   /number=2
FT   intron
FT                   /number=2
FT   exon
FT                   /number=3
FT   intron
FT                   /number=3
FT   exon
FT                   /number=4
FT   intron
FT                   /number=4
FT   exon
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120
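Because every line of an EMBL-style entry starts with a two-letter line code (ID, AC, DE, KW, OS, FT, ...), the entry shown above is easy to split into fields. The following is a minimal sketch, not an official parser: the function name `group_by_line_code` is hypothetical, and it assumes the classic layout in which the content starts at column 6.

```python
from collections import defaultdict


def group_by_line_code(entry_text: str) -> dict:
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code in ("XX", "//") or not line.strip():
            continue  # XX is a spacer line, // terminates the entry
        fields[code].append(content)
    return dict(fields)


# A few lines copied from the erythropoietin entry shown on the slides.
sample_entry = """\
ID   HSERPG standard; genomic DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
//
"""

fields = group_by_line_code(sample_entry)
print(fields["AC"])
print(fields["KW"][0].rstrip(".").split("; "))
```

Feature-table (FT) lines need more care than this, since qualifiers such as `/translation=` continue over several lines.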

8 Example of a CDS feature:
FT   CDS             complement( )
FT                   /db_xref="SPTREMBL:Q9UZ71"
FT                   /note="PAB2386"
FT                   /transl_table=11
FT                   /product="4-aminobutyrate AMINOTRANSFERASE
FT                   (EC )"
FT                   /protein_id="CAB "
FT                   /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
FT                   FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
FT                   LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
FT                   VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
FT                   VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
FT                   VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
FT                   NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
FT                   AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"

9 Proteomics databases 1. Sequence databases: «The story of a protein sequence's life» 2. Swiss-Prot: a quick overview 3. UniProt utilities: UniRef and UniParc 4. Swiss-Prot and the other protein databases. Real life of a protein sequence: cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, delayed or cancelled) -> EMBL (nucleic acids), with or without annotated CDS -> CoDing Sequences provided by submitters -> TrEMBL (amino acids) -> Swiss-Prot (manually annotated)

10 Since December 15, 2003, Swiss-Prot and TrEMBL constitute the UniProt Knowledgebase (with integration of the PIR data) -> gives access to all known* protein sequences. * submitted to the public databases (EMBL, GenBank, DDBJ, Swiss-Prot)

11 a SWISS-PROT entry = a protein sequence associated with - manually-checked - well-structured - periodically-updated - searchable biological information a TrEMBL entry = a protein sequence associated with - computer-annotated - well-structured - periodically-updated - searchable biological information

12 CDS TrEMBL EMBL Swiss-Prot CDS TrEMBL Once in Swiss-Prot, no more in TrEMBL -> Minimal redundancy Annotation of conflicts EMBL Swiss-Prot

13 CDS TrEMBL EMBL Swiss-Prot How to make things clear? Depending on the server: UniProt = Swiss-Prot + TrEMBL = SPTR = SWALL; Swiss-Prot = UniProt/Swiss-Prot; TrEMBL = UniProt/TrEMBL = SPTrEMBL; TrEMBL = SPTrEMBL + TrEMBLnew** (**is going to disappear soon!). EMBnet 2004: Proteomics

14 Swiss-Prot 1. Minimal redundancy; 2. Maximal manual annotation; 3. Integration with other databases. Swiss-Prot 1. Minimal redundancy; 1 gene (1 species) -> 1 entry Swiss-Prot Identical sequences are merged, as are variants, fragments, alternative splicing isoforms.

15 Swiss-Prot 1. Minimal redundancy. 2. Maximal manual annotation: Function(s); Interactions; Subcellular localization and tissue expression; Structure (domains, ...); Post-translational modifications (PTMs); Variants (alternative splicing, polymorphisms, ...); Similarities. Swiss-Prot 1. Minimal redundancy; 2. Manual annotation; 3. Integration with other databases: Release (26-Sep-2003): 83 links to other databases.

16 Up-to-date sources: Swiss-Prot (since 1986) -> ExPASy. TrEMBL -> EBI (European Bioinformatics Institute). You can install the ExPASyBar on your computer. Amos links

17 Search also with accession numbers (Swiss-Prot or other databases)

18 Swiss-Prot an overview View «by default» on the ExPASy server

19 ExPASy EBI NCBI

20 It is not always obvious which database your protein sequence is derived from! Topology of a Swiss-Prot entry: sequence

21 Swiss-Prot Protein sequence: - The longest sequence is usually «displayed» - Precursor (except INIT_MET 0 and «amino acid sequencing») - Comparison of genomic and cDNA sequences -> carefully checked; validated! -> choose the most representative -> The sequence quality is always increasing. Swiss-Prot's daily bread: Alternative splicing? Same gene? Polymorphisms? Alternative initiation? RNA editing? Usage of an alternative promoter? Selenocysteine? Fragment? Sequencing errors?

22 Topology of a Swiss-Prot entry: Identifier; Accession Nr.; Protein name; Gene name; sequence. Always cite the primary accession number!

23 Topology of a Swiss-Prot entry: Identifier; Accession Nr.; Protein name; Gene name; Taxonomy; References; sequence

24 References Complete sequences; Fragments ; Function, characterization, interaction ; Post translational modifications; 3D structure (crystallography or NMR); Polymorphisms. Topology of a Swiss-Prot entry Comments Identifier Accession Nr. Protein name Gene name Taxonomy References sequence

25 Comment lines: Function(s) and role(s); for enzymes: a. Catalytic activity (if EC number), b. Cofactor, c. Enzyme regulation, d. Pathway; Subunit (protein/protein interactions); Subcellular location; Alternative products (alt. splicing, alt. initiation, RNA editing); Tissue specificity (Northern and Western results); Developmental stage; Induction (genetic control); Domain; Post-translational modifications (PTM); Mass spectrometry; Polymorphisms; Disease; Biotechnology; Pharmaceutical; Miscellaneous; Similarities; Caution; Database (specialized cross-references). Information in comment lines is derived from: publications; databases; personal communications; predictions; brainstorming

26 Experimental qualifiers: «-»: experimentally proved; «By similarity»: experimentally proved in an ortholog or in another member of the family; «Probable»: not proved, but realistic; «Potential»: predicted. Examples: ICOL_HUMAN, O75144; AAA1_HUMAN, Q9NS82; BRH2_HUMAN, Q9NY43

27 Topology of a Swiss-Prot entry: Identifier; Accession Nr.; Protein name; Gene name; Taxonomy; References; Comments; Cross-references; sequence. Cross-references (X-ref): Swiss-Prot was the first database with X-refs; explicit links to 53 databases; implicit X-references to 30 additional databases added by the ExPASy servers on the WWW (such as GenBank, Ensembl, ...) => links to 83 databases from the ExPASy servers. Currently 1.2 x 10^6 cross-references in Swiss-Prot. Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55

28 Swiss-Prot currently acts as the main index for the 15 federated 2D-PAGE databases. Cross-references 1. ICE8_HUMAN Q14790 DNA (index of low redundancy). Examples of implicit links to GenBank and DDBJ added on the fly by the ExPASy server. 3D; genomic

29 Cross-references S_HUMAN P D-PAGE

30 Theoretically computed pI and MW; experimentally determined position; theoretically computed pI and MW with potential phosphorylation and acetylation sites. Topology of a Swiss-Prot entry: Identifier; Accession Nr.; Protein name; Gene name; Taxonomy; References; Comments; Cross-references; Keywords; sequence
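The «theoretically computed MW» mentioned on this slide is, in its simplest form, the sum of the average residue masses plus one water molecule. The sketch below is an illustration under stated assumptions, not the ExPASy implementation: the function name and parameters are hypothetical, the residue masses are commonly tabulated average values that should be checked against a reference table before serious use, and the pI calculation (which needs pK values) is deliberately left out.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01524       # one water for the free N- and C-termini
PHOSPHO_SHIFT = 79.98  # approximate mass added per phosphate group (HPO3)
ACETYL_SHIFT = 42.04   # approximate mass added per acetyl group (C2H2O)


def theoretical_mw(sequence: str, n_phospho: int = 0, n_acetyl: int = 0) -> float:
    """Average molecular weight of an unmodified or simply modified chain."""
    backbone = sum(RESIDUE_MASS[aa] for aa in sequence.upper())
    return backbone + WATER + n_phospho * PHOSPHO_SHIFT + n_acetyl * ACETYL_SHIFT


# Start of the erythropoietin precursor shown earlier in the slides.
print(round(theoretical_mw("MGVHECPAWL"), 1))
print(round(theoretical_mw("MGVHECPAWL", n_phospho=1), 1))
```

Comparing the unmodified and modified values shows why a potential phosphorylation or acetylation site shifts the expected position of a spot on a 2D gel.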

31 Keywords (automated and manual annotation) Q9HC96 Calpain 10 n=481 entries

32 Topology of a Swiss-Prot entry Comments Identifier Accession Nr. Protein name Gene name Taxonomy Cross-references References Keywords Feature table sequence Sequence features: Manual annotation ICOL_HUMAN, O75144 General topology

33 Sequence features: Manual annotation ICOL_HUMAN, O75144 General topology Domains Sequence features: Manual annotation ICOL_HUMAN, O75144 General topology Domains PTM

34 Sequence features: Manual annotation ICOL_HUMAN, O75144 General topology Domains PTM Alternative splicing

35 All the «alternatively spliced sequences» are available on the ExPASy server in Fasta format, e.g. for Blast searches or proteomic tools. Some proteomic tools on other servers, such as Mascot, also include these «alternatively spliced sequences» in their search engines. BRC2_HUMAN, P51587. Polymorphisms: differences between the sequence shown and other submitted sequences

36 Swiss-Prot and PTM annotations Swiss-Prot PTM annotations References (Rx lines) Comments (CC lines) CC -!- PTM: Keywords (KW lines) KW Feature table (FT lines) FT references comments keywords features

37 references references Comments (CC PTM) The N-terminus is blocked. Phosphorylation of Tyr-660 reduces the ability of 4.1 to promote the assembly of the spectrin/actin/4.1 ternary complex. comments Sulfated.

38 Keywords: Cleavage: Signal, Transit peptide, Protein splicing, etc. Linkage: Acetylation, Amidation, D-amino acid, Formylation, Glycoprotein, GPI-anchor, Hydroxylation, Hypusine, Iodination, Myristate, Palmitate, Phosphorylation, Prenylation, Sulfation, etc. Cross-link: Thioether bond, Thioester bond. Features: Cleavage: INIT_MET, PROPEP, SIGNAL, TRANSIT. Linkage: MOD_RES, CARBOHYD, LIPID, BINDING. Cross-link: DISULFID, CROSSLNK
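The mapping from feature keys to the three PTM categories listed on this slide can be written down as a small lookup table. The sketch below is only an illustration: the function name is hypothetical, the sample FT lines (positions and descriptions) are invented for the example, and continuation lines of long feature descriptions are not handled.

```python
# Feature keys grouped by PTM category, as listed on the slide.
PTM_CATEGORY_BY_FT_KEY = {
    "INIT_MET": "cleavage", "PROPEP": "cleavage",
    "SIGNAL": "cleavage", "TRANSIT": "cleavage",
    "MOD_RES": "linkage", "CARBOHYD": "linkage",
    "LIPID": "linkage", "BINDING": "linkage",
    "DISULFID": "cross-link", "CROSSLNK": "cross-link",
}


def ptm_features(entry_text: str) -> list:
    """Return (category, key, rest of line) for every PTM-related FT line."""
    found = []
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue  # only feature-table lines are of interest here
        parts = line[2:].split()
        if parts and parts[0] in PTM_CATEGORY_BY_FT_KEY:
            key = parts[0]
            found.append((PTM_CATEGORY_BY_FT_KEY[key], key, " ".join(parts[1:])))
    return found


# Hypothetical feature lines in the classic Swiss-Prot flat-file style.
sample = """\
FT   SIGNAL        1     27
FT   CARBOHYD     51     51       N-linked (GlcNAc...) (Potential).
FT   DISULFID     34    188
KW   Glycoprotein; Signal.
"""

for category, key, rest in ptm_features(sample):
    print(category, key, rest)
```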

39 Swiss-Prot & TrEMBL introduce a new arithmetical concept! Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot. In 3 years... more than protein sequences. But, in the future, redundancy is going to decrease: «new» genome sequencing -> «new» proteins (AB, Sept. 2002). In the case of human proteins, the redundancy is still very high: about * (* human gene number estimation: ). Missing are: sequences not submitted to EMBL/GenBank/DDBJ (and PIR); genes not yet predicted or known («no CDS provided by the submitters» or no DNA sequence); confidential data (patent application sequences); immunoglobulins, T-cell receptors (-> UniParc)

40 Take-home message: Swiss-Prot is a non-redundant, manually annotated and highly cross-referenced protein knowledgebase. Be aware of the differences between TrEMBL and Swiss-Prot. Always cite the accession number, not the ID. We need your feedback! swiss-prot@expasy.org Righting the wrongs: «Sequences are rarely deposited in a mature state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.» Sequencing error rates: ~1 base in «Making people aware of errors is good and great; making people aware that they're responsible also for correcting errors is even greater.» C. Hardley, EMBO reports, 4(9), 2003.

41 Proteomics databases 1. Sequence databases: «The story of a protein sequence's life» 2. Swiss-Prot: a quick overview 3. UniProt utilities: UniRef and UniParc 4. Swiss-Prot and the other protein databases. UniProt consortium (since Oct. 2002): The UniProt Knowledgebase (UniProt) (Swiss-Prot and TrEMBL; integration of PIR data) (Release 1, Dec. 2003). The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed up BLAST searches. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

42 UniRef is useful for comprehensive BLAST searches because it provides sets of representative sequences («collapsing BLAST results»). It consists of three collections of sequence clusters from the UniProt Knowledgebase (Swiss-Prot, TrEMBL): one UniRef100 entry -> all identical sequences (including fragments); one UniRef90 entry -> sequences that have at least 90 % identity; one UniRef50 entry -> sequences that are at least 50 % identical. Independently of the species! BLASTP: UniRef100. UniRef100 does not include TrEMBLnew (tn), because TrEMBLnew is going to «disappear» soon
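The three identity thresholds above can be made concrete with a small sketch. It is an illustration only: the function names are hypothetical, the two sequences are assumed to be already aligned to the same length (with `-` for gaps), and the real UniRef clustering procedure involves more than a pairwise identity check.

```python
from typing import Optional


def percent_identity(a: str, b: str) -> float:
    """Percent identity over the aligned, non-gap columns of two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def tightest_uniref_level(identity: float) -> Optional[str]:
    """Tightest collection in which two sequences could share one entry."""
    if identity == 100.0:
        return "UniRef100"
    if identity >= 90.0:
        return "UniRef90"
    if identity >= 50.0:
        return "UniRef50"
    return None  # too divergent to be collapsed at any of the three levels


# Toy example: one substitution in ten aligned residues.
identity = percent_identity("MGVHECPAWL", "MGVHECPAWI")
print(identity, tightest_uniref_level(identity))
```

Note that species never enters the calculation, which matches the «independently of the species» remark on the slide.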

43 BLASTP: UniRef100 BLASTP: UniRef90

44 BLASTP: UniRef90 BLASTP: UniRef50

45 UniParc makes it possible to keep track of a protein sequence and of its integration in various databases. UniParc: use with extreme caution, as it also contains pseudogenes, incorrect CDS predictions, etc.! It also contains patent office database data (EPO, ESPO, ...).
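The idea of an archive that tracks one sequence across several source databases can be sketched as a dictionary keyed by a checksum of the sequence. Everything in this example is an assumption made for illustration: the class name, the use of MD5 as the checksum, and the second database name and accession are hypothetical and do not describe how UniParc itself is implemented.

```python
import hashlib
from collections import defaultdict


class SequenceArchive:
    """Toy archive: one record per distinct sequence, many source cross-references."""

    def __init__(self):
        self._sources = defaultdict(set)

    @staticmethod
    def _key(sequence: str) -> str:
        # Identical sequences (ignoring case) always get the same key.
        return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()

    def add(self, sequence: str, database: str, accession: str) -> None:
        self._sources[self._key(sequence)].add((database, accession))

    def sources(self, sequence: str) -> list:
        return sorted(self._sources.get(self._key(sequence), set()))


archive = SequenceArchive()
fragment = "MGVHECPAWLWLLLSLLSLPLGLPVLG"
archive.add(fragment, "Swiss-Prot", "P01588")
archive.add(fragment.lower(), "ExampleDB", "EX0001")  # hypothetical second source
print(archive.sources(fragment))
```

Because nothing is curated in such an archive, a record says only where a sequence has been seen, not whether it is a real protein, which is the reason for the caution on the slide.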

46 Proteomics databases 1. Sequence databases: «The story of a protein sequence's life» 2. Swiss-Prot: a quick overview 3. UniProt utilities: UniRef and UniParc 4. Swiss-Prot and the other protein databases. Real life of a protein sequence: cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, delayed or cancelled) -> EMBL (nucleic acids) -> CoDing Sequences provided by submitters -> TrEMBL (amino acids) -> Swiss-Prot (manually annotated)

47 Real life of a protein sequence [diagram]: cDNAs, ESTs, genomes, ... (some data are not submitted to public databases, delayed or cancelled) -> EMBL, GenBank, DDBJ. CoDing Sequences provided by submitters -> TrEMBL, GenPept -> Swiss-Prot (manually annotated). CoDing Sequences provided by submitter and «de novo» gene prediction -> RefSeq (XP_NNNNN). Sequences derived from scientific publications -> PRF. UniProt: Swiss-Prot + TrEMBL + (PIR). NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF. Protein sequences: «NR database»

48 NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF. PRF: sequences derived from scientific publications (integrated into TrEMBL). GenPept: ~TrEMBL, except that it is redundant with Swiss-Prot. PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt). PDB: 3D structure database, with all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB). NCBI Reference Sequence (RefSeq): the RefSeq collection covers genomic DNA, transcript (RNA) and protein products. RefSeq provides a non-redundant set of sequences, derived from GenBank, the literature and gene prediction. Release 3 includes over proteins from 2218 organisms (including 1100 viruses and 150 bacteria). (!!! 1 entry = 1 sequence.) The sequence data are tightly linked to LocusLink, which contains the associated biological information («interdependent curated resources»)

49 Example 1 Search for a gene name

50 Protein sequences: «NR database» AMBN 20 entries Swiss-Prot

51 «Entrez protein AMBN» Genpept Genpept RefSeq RefSeq RefSeq AC KW Taxonomy References Correspond to Swiss-Prot entry AMBN_HUMAN Q9NP70 GenBank source GenBank source

52 used for the construction of the RefSeq entry Description of the sequence differences Annotation

53 Example 2 BLAST searches Human EPO: Blastp against Swiss-Prot/TrEMBL (at the ExPASy server) *

54 Human EPO: Blastp against NR. All these human sequences are integrated into the corresponding Swiss-Prot entry with the annotation of their differences (conflicts, variants, fragments, ...). NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF. PRF: sequences derived from scientific publications (integrated into TrEMBL). GenPept: ~TrEMBL, except that it is redundant with Swiss-Prot. PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt). PDB: 3D structure database, with all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

55 PDB: Protein Data Bank. Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA). Contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses. Proteins represent more than 90% of available structures. Contains the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies. Specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol). Currently there are structural data for about molecules, but far fewer protein families (highly redundant)! PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C ) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL CA 11
JRNL REF J.BIOL.CHEM. V CA 12
JRNL REFN ASTM JBCHA3 US ISSN CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE CA 21
REMARK 3 RMSD BOND DISTANCES ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27

56 PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70-1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97-1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL O LEU 144 N LEU CA 71
SHEET 7 S10 VAL 207 LEU O ILE 210 N GLY CA 72
SHEET 8 S10 TYR 191 GLY O TRP 192 N VAL CA 73
SHEET 9 S10 LYS 257 ALA O LYS 257 N THR CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST P CA 82
ORIGX CA 83
ORIGX CA 84
ORIGX CA 85
SCALE CA 86
SCALE CA 87
SCALE CA 88
ATOM 1 N TRP CA 89
ATOM 2 CA TRP CA 90
ATOM 3 C TRP CA 91
ATOM 4 O TRP CA 92
ATOM 5 CB TRP CA 93
ATOM 6 CG TRP CA 94
ATOM 7 CD1 TRP CA 95
ATOM 8 CD2 TRP CA 96
ATOM 9 NE1 TRP CA 97
ATOM 10 CE2 TRP CA 98
ATOM 11 CE3 TRP CA 99
ATOM 12 CZ2 TRP CA 100
ATOM 13 CZ3 TRP CA 101
ATOM 14 CH2 TRP CA 102
...
Slide callout: coordinates of each atom. The same PDB entry visualized with Chime
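The ATOM records above carry the «coordinates of each atom» in fixed columns, so they are read by position rather than by splitting on whitespace. The following sketch assumes the standard PDB column layout for ATOM records (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The `Atom` type, the function name and the coordinates in the sample line are made up for the example, since the coordinates were not preserved in this transcript.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record of a PDB file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Hypothetical record built with explicit field widths so the columns line up.
sample = "ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f" % (
    1, " N", "", "TRP", "A", 5, "", 8.519, -0.751, 10.738)
atom = parse_atom_line(sample)
print(atom)
```

Viewers such as those named on the previous slide read the same columns to place each atom in space.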

57 3D structure databases: others. They are all derived from PDB data! HSSP: homology-derived secondary structure of proteins. FSSP: structural alignment. SCOP: structural classification of proteins. CATH: hierarchical domain classification of protein structures. HomStrad: HOMologous STRucture Alignment Database. DALI server (EBI): network service for comparing protein structures in 3D. Protein databases used by the protein identification tools: the jungle

58 PROWL: NCBInr, Swiss-Prot, dbEST. Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL*. Peptident (Aldente): Swiss-Prot, TrEMBL. Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB. * OWL has been obsolete since 1999. Matrix Science (Mascot) sequence databases: MSDB, a non-identical protein sequence database, contains sequences derived from: PIR (now integrated into UniProt (Swiss-Prot/TrEMBL)); TrEMBL; REMTrEMBL (does not exist anymore, see UniParc); GenBank; Swiss-Prot; NRL3D (PDB-derived sequences)

59 The AC number jungle
Type of record: sample accession format
- GenBank/EMBL/DDBJ: one letter followed by five digits (e.g. U12345), or two letters followed by six digits (e.g. AF)
- Swiss-Prot/TrEMBL: one letter (O, P, Q) and five digits/letters (e.g. P12345)
- RefSeq nucleotide: two letters, underscore and six digits (e.g. mRNA NM_, genomic NT_)
- RefSeq protein: e.g. NP_00483
- RefSeq prediction: e.g. XM_, XP_
- PDB (protein structure): one digit followed by three letters (e.g. 1TUP)
The end of part I
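The accession formats in this table translate directly into regular expressions, which also shows why the slide calls it a jungle: some accessions fit more than one format. This is a sketch under stated assumptions. The function name is hypothetical, the RefSeq pattern accepts any number of digits after the underscore because the example on the slide is shorter than the stated six, and the PDB pattern allows digits in the last three positions because the entry 12CA shown earlier has one.

```python
import re

# (database, pattern) pairs derived from the formats listed on the slide.
ACCESSION_PATTERNS = [
    ("GenBank/EMBL/DDBJ", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("Swiss-Prot/TrEMBL", re.compile(r"^[OPQ][0-9A-Z]{5}$")),
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),
    ("PDB", re.compile(r"^\d[0-9A-Z]{3}$")),
]


def candidate_databases(accession: str) -> list:
    """Return every database whose accession format fits the given string."""
    acc = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS if pattern.match(acc)]


for acc in ("U12345", "P12345", "Q9NP70", "NP_00483", "1TUP"):
    print(acc, candidate_databases(acc))
```

An accession such as P12345 matches both the nucleotide format and the Swiss-Prot format, so the format alone cannot always tell which database a number comes from.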

60 PART II Protein characterization tools. What can we learn in silico from an amino acid sequence? 1. Domain, family attribution 2. Subcellular location 3. Post-translational modifications (PTMs)

61 What can we learn in silico from an amino acid sequence? 1. Domain, family attribution 2. Subcellular location 3. Post-translational modifications (PTMs). Protein domain/family: some definitions. Most proteins have «modular» structures. Estimation: ~ 3 domains / protein. Domains not only share a common structure but often also have a similar function that contributes to the global activity of the proteins which contain them.

62 Domains are identified by multiple sequence alignments. Domains can be defined by different methods: Pattern (regular expression): used for very conserved domains. Profiles (weighted matrices): two-dimensional tables of position-specific match, gap and insertion scores, derived from aligned sequence families; used for less conserved domains. Hidden Markov Model (HMM): probabilistic models; another method to generate profiles. Pattern vs. profile. Pattern: [LIVM]-[ST]-A-[STAG]-H-C (answer: yes or no). Profile (answer: score/threshold):
ID   TRYPSIN_DOM; MATRIX.
AC   PS50240;
DT   DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE   Serine proteases, trypsin domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=234;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=229;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=0.0169; R2= ; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=1134; N_SCORE=9.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=775; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: M0=-9; D=-20; I=-20; B1=-60; E1=-60; MI=-105; MD=-105; IM=-105; DM=-105;
MA   /I: B1=0; BI=-105; BD=-105;
MA   A B D E F G H I K L M N P Q R S T V W Y
MA   /M: SY='I'; M= -8,-29,-34,-26, 3,-34,-24, 34,-26, 19, 15,-24,-21,-21,-24,-19, -8, 25,-19, 3;
MA   /M: SY='N'; M= 0, 14, 10, 1,-22, -1, 6,-23, -4,-26,-17, 20,-14, -1, -6, 13, 2,-20,-34,-15;
MA   /M: SY='E'; M= -4, 4, 7, 14,-26,-13, -7,-23, 3,-22,-16, 2, 7, 3, -3, 2, -2,-21,-30,-18;
MA   /M: SY='R'; M=-12, 5, 5, 7,-23,-17, 3,-24, 8,-20,-12, 7,-16, 10, 12, -2, -6,-21,-27, -9;
MA   /M: SY='W'; M=-16,-33,-35,-27, 13,-22,-24,-11,-18,-13,-13,-31,-27,-20,-18,-30,-21,-18, 97, 25;
MA   /M: SY='V'; M= 1,-29,-31,-28, -1,-30,-29, 31,-22, 13, 11,-27,-27,-26,-22,-12, -2, 41,-27, -8;
MA   /M: SY='L'; M= -8,-29,-31,-22, 9,-30,-21, 23,-27, 37, 20,-28,-28,-21,-20,-25, -8, 17,-20, -1;
MA   /M: SY='T'; M= 2, -1, -9, -9,-11,-17,-19,-10,-10,-13,-11, 1,-11, -9,-10, 23, 43, 0,-32,-12;
MA   /M: SY='A'; M= 45, -9,-19,-10,-20, -2,-15,-11,-10,-11,-10, -9,-11, -9,-19, 10, 1, -1,-21,-18;
MA   /M: SY='A'; M= 40, -9,-17, -8,-21, 5,-18,-14, -9,-13,-12, -8,-11, -9,-16, 9, -2, -5,-21,-21;
MA   /M: SY='H'; M=-18, 0, 0, 1,-21,-19, 89,-29, -8,-21, -1, 9,-19, 11, 0, -7,-17,-29,-30, 16;
MA   /M: SY='C'; M= -9,-18,-28,-29,-20,-29,-29,-29,-29,-20,-19,-18,-39,-29,-29, -9, -9, -9,-49,-29;
MA   /I: E1=0; IE=-105; DE=-105;
//
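A PROSITE pattern such as [LIVM]-[ST]-A-[STAG]-H-C is a regular expression in a slightly different notation, which is why it gives a yes-or-no answer. The sketch below converts the common pattern elements into Python `re` syntax. The function name `prosite_to_regex` and the test sequence are illustrative assumptions, and the N- and C-terminal anchors (`<`, `>`) of the PROSITE syntax are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into an equivalent Python regex."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # Split "core(n)" or "core(n,m)" into the residue part and the repeat part.
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                        # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {P} means any residue except P
        if low:
            core += "{%s}" % (low if high is None else low + "," + high)
        regex.append(core)
    return "".join(regex)


trypsin_his = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
print(trypsin_his)

# Hypothetical sequence fragment containing one match.
hit = re.search(trypsin_his, "IVGGWVLTAAHCGV")
print(hit.start() + 1, hit.group())
```

A profile, by contrast, sums position-specific scores such as the MA /M rows above and compares the total with a cut-off, so it can still recognise a domain in which one of these positions has diverged.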

63 Protein domain/family databases
Member databases of InterPro:
- PROSITE: patterns / profiles
- ProDom: aligned motifs (PSI-BLAST) (Pfam B)
- PRINTS: aligned motifs
- Pfam: HMM (Hidden Markov Models)
- SMART: HMM
- TIGRfam: HMM
Others:
- DOMO: aligned motifs
- BLOCKS: aligned motifs (PSI-BLAST)
- CDD (CDART): PSI-BLAST (PSSM) of Pfam and SMART
InterPro: searches many domain databases simultaneously (PRINTS, PROSITE, Pfam, ProDom, SMART, and TIGRFAMs). Contains a unique AC, a functional description of the domain and references. Links are made back to the relevant member databases.

65 What can we learn in silico from an amino acid sequence? 1. Domain, family attribution 2. Subcellular location 3. Post-translational modifications (PTMs). Protein pathway in Eukaryota: by default, or with a specific signal. Secretory pathway

67 What can we learn in silico from an amino acid sequence? 1. Domain, family attribution 2. Subcellular location 3. Post-translational modifications (PTMs). From genome to proteome: ~ human genes -> alternative splicing of mRNA (2-5 fold increase) -> ~ human transcripts -> post-translational modifications of proteins, PTMs (5-10 fold increase) -> ~ 1'000'000 human proteins. Protein complexity

68 PTM diversity [figure: a protein decorated with GPI, Myr, Ngly, Ogly, Pho and Sul groups]
Abbreviations: Am: Amidation; AcN: Acetylation N-terminal; AcI: Acetylation internal; Alk: Alkylation; Adp: ADP-ribosylation; Bio: Biotinylation; Bro: Bromination; Cgly: C-linked glycosylation; Ogly: O-linked glycosylation; Ngly: N-linked glycosylation; Dea: Deamidation; Sul: Sulfation; Far: Farnesylation; Ger: Geranylgeranylation; GPI: GPI-anchoring; Met: Methylation; Myr: Myristoylation; Hyd: Hydroxylation; Pho: Phosphorylation; Pal: Palmitoylation; Pyr: Pyrrolidone carboxylic acid; Oxo: 2-amino-3-oxopropionic acid
Three major categories:
- cleavage: initiator Met, signal and transit peptides, propeptides, complex processing, etc.
- linkage: simple chemical groups (phosphate, sulfate, methyl, hydroxyl, acetate, etc.); complex molecules (N-, O- or C-linked glycans, lipids, e.g. palmitate, myristate, GPI)
- x-linking: disulfide bonds, thioester, thioether bonds, etc.

69 PTM database: RESID is a database of protein post-translational modifications with descriptive, chemical, structural and bibliographic information. It contains 351 entries (last update: Nov. 2003)

70 PTM prediction tools PTM prediction on ExPASy + PROSITE predictions (n~15)

71 PTM prediction -> Beware of the «biological consistency»! -> Organisms (Eubacteria, Archaea, Eukaryota) -> Subcellular location -> secretory pathway (ER, Golgi) -> shuttle between organelles -> topology -> A well characterized orthologous protein
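The warning about biological consistency can be illustrated with N-glycosylation: the sequence motif alone is found in many proteins, but the modification is only plausible for a protein that passes through the secretory pathway. The sketch below uses the widely known N-glycosylation sequon N-{P}-[ST]-{P}. The function name, the set of location labels and the wording of the verdicts are assumptions made for the example and do not reproduce any ExPASy tool.

```python
import re

# N-{P}-[ST]-{P}; the lookahead lets overlapping sites be reported too.
SEQUON = re.compile(r"(?=N[^P][ST][^P])")

# Hypothetical labels for locations reached through the secretory pathway.
SECRETORY_LOCATIONS = {"secreted", "endoplasmic reticulum", "golgi", "cell membrane"}


def predict_n_glycosylation(sequence: str, subcellular_location: str) -> list:
    """Return (position, verdict) for each sequon, filtered by location."""
    consistent = subcellular_location.lower() in SECRETORY_LOCATIONS
    verdict = "potential" if consistent else "unlikely (not in the secretory pathway)"
    return [(m.start() + 1, verdict) for m in SEQUON.finditer(sequence.upper())]


# Fragment of the erythropoietin translation shown earlier in the slides.
fragment = "AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKR"
print(predict_n_glycosylation(fragment, "secreted"))
print(predict_n_glycosylation(fragment, "nucleus"))
```

The same motif hits are reported in both calls; only the verdict changes, which is the point of checking organism, location and topology before trusting a prediction.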

72 Some statistics Number of PTMs in Swiss-Prot release 40 Pot./prob. By sim. all organisms Exp. total signal peptide N-GlcNAc O-GalNAc O-GlcNAc phosphorylation sulfation myristate GPI-anchor 108 Total number of proteins < total number of PTMs PTM annotation in SWISS-PROT: all organisms acetyl phosphate methyl sulfate total proven

73 We need your help! The end of part II


More information

Computational Biology and Bioinformatics

Computational Biology and Bioinformatics Computational Biology and Bioinformatics Computational biology Development of algorithms to solve problems in biology Bioinformatics Application of computational biology to the analysis and management

More information

Dr. R. Sankar, BSE 631 (2018)

Dr. R. Sankar, BSE 631 (2018) Pauling, Corey and Branson Diffraction of DNA http://www.nature.com/scitable/topicpage/dna-is-a-structure-that-encodes-biological-6493050 In short, stereochemistry is important in determining which helices

More information

Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras. Lecture - 5a Protein sequence databases

Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras. Lecture - 5a Protein sequence databases Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras Lecture - 5a Protein sequence databases In this lecture, we will mainly discuss on Protein Sequence

More information

Virtual bond representation

Virtual bond representation Today s subjects: Virtual bond representation Coordination number Contact maps Sidechain packing: is it an instrumental way of selecting and consolidating a fold? ASA of proteins Interatomic distances

More information

Structural bioinformatics

Structural bioinformatics Structural bioinformatics Why structures? The representation of the molecules in 3D is more informative New properties of the molecules are revealed, which can not be detected by sequences Eran Eyal Plant

More information

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Overview This lecture will

More information

Will discuss proteins in view of Sequence (I,II) Structure (III) Function (IV) proteins in practice

Will discuss proteins in view of Sequence (I,II) Structure (III) Function (IV) proteins in practice Will discuss proteins in view of Sequence (I,II) Structure (III) Function (IV) proteins in practice integration - web system (V) 1 Touring the Protein Space (outline) 1. Protein Sequence - how rich? How

More information

I nternet Resources for Bioinformatics Data and Tools

I nternet Resources for Bioinformatics Data and Tools ~i;;;;;;;'s :.. ~,;;%.: ;!,;s163 ~. s :s163:: ~s ;'.:'. 3;3 ~,: S;I:;~.3;3'/////, IS~I'//. i: ~s '/, Z I;~;I; :;;; :;I~Z;I~,;'//.;;;;;I'/,;:, :;:;/,;'L;;;~;'~;~,::,:, Z'LZ:..;;',;';4...;,;',~/,~:...;/,;:'.::.

More information
