Sequence Databases and database scanning


Marjolein Thunnissen, Lund, 2012

Types of databases:
- Primary sequence databases (proteins and nucleic acids)
- Composite protein sequence databases
- Secondary databases
- 3D structure and structure classification databases
- Specific databases (GTPases, AAA proteins, ligand binding, etc.)

Examples of primary databases:

Nucleic acids: ENA (European Nucleotide Archive), NCBI (National Center for Biotechnology Information), DDBJ (DNA Database of Japan).

The ENA, formerly the EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database, is hosted on the server of the European Bioinformatics Institute (EBI). EBI also provides many other services, databases and tools.

The NCBI server (http://www.ncbi.nlm.nih.gov) includes many databases and services. It provides valuable educational material at http://www.ncbi.nlm.nih.gov/education

Proteins (amino acid sequence databases): Expasy, PIR, MIPS

The Expasy group:

The ExPASy server (Expert Protein Analysis System) is the proteomics server of the Swiss Institute of Bioinformatics (SIB). It is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE. It includes:
- SWISS-PROT and TrEMBL - Protein sequences
- PROSITE - Protein families and domains
- SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
- SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
- SWISS-MODEL Repository - Automatically generated protein models
- ENZYME - Enzyme nomenclature

ExPASy:

The server also includes a large collection of links to various proteomics tools:
- Identification and characterization
- DNA -> Protein
- Similarity searches
- Pattern and profile searches
- Post-translational modification prediction
- Primary structure analysis
- Secondary structure prediction
- Functionality
- Tertiary structure
- Transmembrane regions

Database UniprotKB/TrEMBL: sequences translated from the EMBL Nucleotide Sequence Database. For standardization purposes, the format of the sequence entries follows that of the EMBL Nucleotide Sequence Database as closely as possible. The database is updated daily.

SWISS-PROT makes cross-references to many specific databases. Examples:
- The PROSITE dictionary of sites and patterns in proteins
- The Protein Data Bank (PDB)
- Mendelian Inheritance in Man (MIM)
- Mouse Genome Database (MGD)
- The restriction enzymes database (REBASE)

And more cross-references:
- The G-protein-coupled receptor database (GCRDb)
- The Encyclopedia of Escherichia coli genes and metabolism (EcoCyc)
- The 2D gel protein database (SWISS-2DPAGE)
- The Saccharomyces Genome Database (SGD)
- The Yeast Electrophoresis Protein Database (YEPD)
- The Harefield Hospital 2D gel protein databases
- The Drosophila genome database (FlyBase)
- The database of Homology-derived Secondary Structure of Proteins (HSSP)
- The transcription factor database (Transfac)

[Figure: example of a Uniprot entry: human leukotriene A4 hydrolase]

The PIR group:

The Protein Information Resource (PIR), in collaboration with MIPS and JIPID, produces the PIR-International Protein Sequence Database (PIR-PSD), a comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain.

The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and retrieval, and provides cross-reference information for use with the other PIR Protein Sequence Databases.

Examples of specific databases:

Proteome: database on proteins from several different genome sources (human, mouse, rat, yeasts, worms and pathogenic fungi). Annotation focuses on the molecular function and biological role of proteins, expression patterns across cells, tissues and organs, consequences of gene mutation (mouse proteins only), relationships to disease, and the physical and regulatory interactions between proteins and genes. Not free.

See also, for example, FlyBase (free, also reachable through EBI): the database of the model organism Drosophila melanogaster. It is a genome database with information about genes, clones, function, expression and much more.

Enzyme nomenclature database: ENZYME contains the following data:
- EC number
- Recommended name
- Alternative names (if any)
- Catalytic activity
- Cofactors (if any)
- Pointers to the SWISS-PROT entries that correspond to the protein
- Pointers to diseases associated with the particular protein

Secondary (pattern) databases:

Secondary databases contain the results of analysis of sequences in the primary databases. Homologous sequences may be gathered in multiple alignments, within which conserved regions (motifs) are clearly visible. These motifs often reflect some vital biological role.

An unknown sequence may be searched against a database of motifs to determine whether or not it contains any of the conserved motifs. A match suggests the family to which the sequence belongs.

- PRINTS - Data library of conserved motifs used to characterize a protein family.
- PFAM - A large collection of multiple sequence alignments covering many common protein domains. Release 22.0 of Pfam (July 2007) contains alignments and models for 9318 protein families.
- PROSITE - Data library of biologically significant sites and patterns, used to reliably identify the known protein family to which a sequence belongs.
- REBASE - Data library of type 2 restriction enzymes.

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap but are separated along the sequence, though they may be contiguous in 3D space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than single motifs can.

PROSITE: dictionary of sites and patterns in proteins

Examples of patterns:
- GDSGGP - a pattern typical for a serine protease
- [AG]-x(4)-G-K-[ST] - a pattern corresponding to the ATP/GTP-binding site

The conventions for the PROSITE patterns:

- The standard IUPAC one-letter codes for the amino acids are used.
- The symbol x is used for a position where any amino acid is accepted.
- Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets [ ]. For example: [ALT] stands for Ala or Leu or Thr.
- Ambiguities are also indicated by listing between a pair of curly brackets { } the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
- Each element in a pattern is separated from its neighbour by a -.
- Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. Examples: x(3) corresponds to x-x-x; x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
- When a pattern is restricted to the N-terminus of a sequence it starts with a < symbol; when it is restricted to the C-terminus it ends with a > symbol.

Examples:

[AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

<A-x-[ST](2)-x(0,1)-V
This pattern, which must be at the N-terminus of the sequence (<), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val

[Figure: example of a PROSITE entry: phospholipase A2]
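The conventions above map almost one-to-one onto regular expressions. The following is a minimal sketch, not the PROSITE tooling itself: the function name `prosite_to_regex` is hypothetical, only the conventions listed above are handled, and commas inside brackets (as in [A,G]) are tolerated as an assumption about informal notation.

```python
import re

# One pattern element: x, a single residue, [..] or {..}, with optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z,]+\]|\{[A-Z,]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".").replace(" ", "")
    prefix = suffix = ""
    if pattern.startswith("<"):      # pattern anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):        # pattern anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":              # any amino acid
            regex = "."
        elif core.startswith("["):   # accepted residues
            regex = "[" + core[1:-1].replace(",", "") + "]"
        elif core.startswith("{"):   # residues that are not accepted
            regex = "[^" + core[1:-1].replace(",", "") + "]"
        else:                        # a single fixed residue
            regex = core
        if low is not None:          # repetition: (n) or (n,m)
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return prefix + "".join(parts) + suffix


# Usage with a made-up sequence containing an ATP/GTP-binding-like stretch.
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
hit = re.search(p_loop, "MGAAAAAGKSLL")
print(p_loop, hit.group(0) if hit else None)
```

Because the match is exact, a single substitution at a fixed position is enough to lose the hit, which is the limitation of pattern searches compared with alignment-based searches.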

Searching databases:

- Homology search - used to search for related sequences in sequence databases; it allows searching with long sequences. Programs like FASTA and BLAST are used.
- Pattern searches - used to find short sequence patterns in a single sequence, in a group of sequences or in a database. Pattern searches only decide whether there is an exact match or not (consequently no gaps or substitution matrices are used). Useful when there is no information on a particular protein.

Steps in sequence analysis:
1. Sequence the gene, translate the nucleotides.
2. Search databases using BLAST or FASTA (local alignment of short stretches). If convinced by the hits, go to step 4.
3. If the BLAST search was not 100% convincing, also search for patterns and motifs. Compare with known function, if known, etc.
4. If identified correctly: sequence alignment - align and compare the sequence to a family of related sequences.
5. Search PDB. Any homologues with a known 3D structure? Make a homology-based model - useful for a deeper understanding of function, design of mutations, etc.

BLAST:

BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.

Programs within BLAST:
- blastp - Compares an amino acid query sequence against a protein sequence database.
- blastn - Compares a nucleotide query sequence against a nucleotide sequence database.
- blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. This option can be used to find potential translation products of an unknown nucleotide sequence.

[Figure: BLAST algorithm]
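To make "translated in all reading frames" concrete, here is a minimal sketch of a six-frame translation of the kind a blastx-style search starts from. It is an illustration only and not BLAST's own code: the function names are hypothetical, the standard genetic code is assumed, and codons containing anything other than A, C, G or T are written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Map frames +1, +2, +3 (forward) and -1, -2, -3 (reverse strand) to proteins."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


# Usage with a toy sequence: start codon, one Ala codon, stop codon.
for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(frame, protein)
```

Each of the six translated strings would then be compared against the protein database, which is why a blastx-style search can reveal a coding region without knowing the strand or frame in advance.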

[Figures: example of a BLAST search; example of a BLAST search output, parts 1 to 3]

[Figure: example of a BLAST search output, part 4]

Distribution of BLAST alignments:

[Figure: alignment score distribution for a database search]
- Black bars - proteins known to be similar to the query sequence.
- White bars - unrelated sequences.
- a - the scan does not discriminate well.
- b - the usual intermediate result, showing overlap between true hits and unrelated sequences.
- c - perfect discrimination.

Profile analysis - another method:

Profile analysis is a method of sequence comparison which is distinct from homology and pattern searches. It starts with a multiple sequence alignment, which is then used to create a profile. The profile is a table that gives, for each position in the alignment, the frequency of each of the 20 amino acids (profile = position-specific scoring table). Profile searching can be useful for finding and aligning distantly related sequences and for finding new family members.

[Table: example of a profile, with columns Cons, A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, ... and consensus residues P W T T P S G K R T E G P A P; scores not shown]
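As an illustration of the idea of a position-specific table, the sketch below builds a simple frequency profile from a toy alignment and scores a candidate segment against it. It is a simplified assumption-laden example: the function names and the alignment are made up, gaps are simply ignored, and raw frequencies are used where real profile methods typically use weighted or log-odds scores.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment):
    """Return one {amino acid: frequency} table per alignment column."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        # Count only the 20 standard residues; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


def consensus(profile):
    """Most frequent residue at each position (the 'Cons' column of a profile)."""
    return "".join(max(column, key=column.get) for column in profile)


def score(profile, segment):
    """Sum the profile frequencies of the residues observed in the segment."""
    return sum(column.get(res, 0.0) for column, res in zip(profile, segment))


# Usage with a made-up three-sequence alignment.
toy_alignment = ["GKST", "GKTT", "AKST"]
toy_profile = build_profile(toy_alignment)
print(consensus(toy_profile))
print(round(score(toy_profile, "GKST"), 3), round(score(toy_profile, "AKTT"), 3))
```

Unlike an exact pattern, the profile still gives a partial score to a segment that differs from the consensus at some positions, which is what makes profile searches useful for distant relatives.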

Some rules for sequence search:

- The requirement for a common folded structure in homologous proteins usually causes these proteins to be similar over their entire length. Therefore, most sequences that share statistically significant similarity throughout their entire lengths are homologous.
- Matches that are more than 50% identical in a short amino acid region occur frequently by chance.
- Distantly related homologues may lack significant similarity. Two or more homologous sequences may have very few absolutely conserved residues.
- Homology is transitive: if A is homologous to B and B is homologous to C, then A and C are related as well.
- Low-complexity regions, transmembrane regions and coiled-coil regions frequently display significant similarity in the absence of homology (filter them out).

Guidelines for database scanning:

To assess the success of a scanning technique, test its ability to find all the members of a known family in the database. Count how many of the known family members are found with scores higher than those of non-members.

Tutorial: concepts.html see also chime/ and for King and kinemages
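The counting guideline for database scanning can be sketched as follows. This is a hypothetical helper, assuming the scan result is available as (name, score) pairs and the true family membership is known; the names and scores in the usage example are invented.

```python
def members_above_best_nonmember(hits, family):
    """Return family members that score higher than every non-member.

    hits   : iterable of (sequence name, score) pairs from one database scan
    family : set of names known to belong to the family
    """
    hits = list(hits)
    nonmember_scores = [score for name, score in hits if name not in family]
    # With no non-member in the hit list, every reported family member counts.
    threshold = max(nonmember_scores, default=float("-inf"))
    return [name for name, score in hits if name in family and score > threshold]


# Usage with invented data: famD is never reported, famC scores below a non-member.
family = {"famA", "famB", "famC", "famD"}
hits = [
    ("famA", 250.0),
    ("famB", 180.0),
    ("other1", 90.0),
    ("famC", 60.0),
    ("other2", 40.0),
]
found = members_above_best_nonmember(hits, family)
missed = sorted(family - set(found))
print(f"{len(found)} of {len(family)} family members found; missed: {missed}")
```

A scan with perfect discrimination would find every family member this way; overlap between true hits and unrelated sequences shows up as a non-empty missed list.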
