1. Proteomics database contents Protein sequence databases


1 1. Proteomics database contents: Protein sequence databases. Salvador Martínez de Bartolomé, smartinez@proteored.org. Bioinformatics support, ProteoRed. Proteomics Facility, National Center for Biotechnology, Madrid

2 Menu
- Introduction: bioinformatics and sequence databases
- Nucleic acid sequence databases
- Protein sequence databases (sources)
- Protein sequence databases (other)

3 Biology of the XXI century. Three major developments:
- High-throughput analysis techniques: DNA sequencing, mass spectrometry, micro-
- Numerous biological databases available through the Web
- Bioinformatics tools available through the Web

4 An overwhelming number of unordered resources

5 [Figure: map of bioinformatics resource categories around the protein sequence.] Categories shown: 1°, 2° and 3° structure analysis, prediction and databases; protein identification & characterization; 2D PAGE & MS; PTM databases and PTM prediction tools; nucleotide / amino acid translator; sequence alignment; similarity search; gene expression; protein interactions; species / genomic databases; functional databases; subcellular localization; polymorphism / mutation / disease databases; topology prediction; pattern & profile search; domains & classification; phylogenetics & taxonomy; references / nomenclature; nucleotide sequence repository.

6 [Figure: the same category map, populated with example resources.] Resources shown: UniProtKB (Swiss-Prot/TrEMBL), TargetP, EcoGene, Ensembl, FlyBase, MGD, SGD, SubtiList, TIGR CMR, HIV, TAIR, MEROPS, ENZYME, TRANSFAC, KEGG, HAMAP, PROSITE, InterPro, Pfam, ProDom, BLOCKS, TIGRFAM, ProtoMap, CATH, SCOP, PDB, SWISS-MODEL, ScanProsite, MotifScan, HSSP, Jpred, GOR, DIP, IntAct, ProtScale, ProtParam, BLAST, FASTA, dbSNP, GeneCards, OMIM, CleanEx, DDBJ, GenBank, EMBL, TreeBase, NEWT, Taxonomy, PSORT, Glycosuite, PhosphBase, NetOGlyc, ChloroP, PeptideMass, Mascot, Phenyx, ECO2DBASE, Siena-2D PAGE, SWISS-2D PAGE, TMHMM, SOSUI, PubMed, HUGO, GO, ClustalW, DIALIGN, Translate.

7 Molecular bioinformatics: an operational definition. The application of computer science to molecular biology, in particular to the study of macromolecules such as proteins, nucleic acids and oligosaccharides.

8 Protein sequence databases: what they are used for, and what each use requires
- Identification of proteins by proteomics --> completeness, sequence quality
- Similarity searches (functional prediction) --> sequence quality (non-redundancy)
- Training datasets (prediction tools) --> sequence and annotation quality
- Genome annotation

9 Proteome complexity Not predictable at the genome level! (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: ).

10 Avalanche of sequence data


12 ~ 1630 genomes sequenced (single organism, varying sizes) ~ 952 ongoing genome sequencing projects


15 Sequencing projects:
- ~ 1630 genomes sequenced (single organism, varying sizes)
- ~ 952 ongoing genome sequencing projects
- ~ 200 metagenome sequencing projects (environmental samples: multiple unknown organisms, varying sizes). Ecological metagenomes: beach sand, Sargasso Sea. Organismal metagenomes: mouse gut
- ~ 17 million sequences being processed at the Venter Institute

16 How many protein sequences at the end? For fun, an estimate: ~30 million species (1.5 million named)
- 20 million bacteria/archaea x 4'000 genes
- 5 million protists x 6'000 genes
- 3 million insects x 14'000 genes
- 1 million fungi x 6'000 genes
- 0.6 million plants x 20'000 genes
- 0.2 million molluscs, worms, arachnids, etc. x 20'000 genes
- 0.2 million vertebrates x 25'000 genes
The calculation: 2x10^7 x 4'000 + 5x10^6 x 6'000 + 3x10^6 x 14'000 + 1x10^6 x 6'000 + 6x10^5 x 20'000 + 2x10^5 x 20'000 + 2x10^5 x 25'000 = 179'000'000'000
AMB, SP20
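The sum on this slide can be checked with a few lines of Python. This is only a sketch: the group names and figures are copied from the slide, while the variable names are illustrative.

```python
# Number of species and average number of genes per group, as given on the slide.
groups = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}

# Total species, and total proteins assuming one protein per gene.
total_species = sum(species for species, _ in groups.values())
total_proteins = sum(species * genes for species, genes in groups.values())

print(f"{total_species:,} species")
print(f"{total_proteins:,} protein sequences")
```

The estimate counts one protein per gene, so it ignores splice variants and post-translational modifications, which the deck discusses later.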

17 Protein sequence origin
- About 4.5 million known protein sequences (in 2007)
- More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
- Less than 1 %: direct protein sequencing (Edman, MS/MS)
-> It is important that users know where a protein sequence comes from (sequencing and gene prediction quality)!


19 The hectic life of a sequence. [Diagram: cDNAs, ESTs (expressed sequence tags), genes and genomes are submitted to EMBL, GenBank and DDBJ; some data are not submitted to public databases*, or their submission is delayed or cancelled.] EMBL: GenBank: DDBJ:

20 Contribution: EMBL 10 %; GenBank 75 %; DDBJ 15 %

21 Goal
- To accept, process and make freely available sequence data from individual researchers, research groups and patent offices
- Available via SRS/Entrez, ftp, web services and similarity search tools.

22 The tremendous increase in nucleotide sequences 1980: 80 genes fully sequenced!

23 EMBL/GenBank/DDBJ serve as archives: nothing is ever removed. They contain all public sequences derived from:
- Genome projects (> 80 % of entries)
- Sequencing centers (cDNAs, ESTs)
- Individual scientists (15 % of entries)
- Patent offices (e.g. European Patent Office, EPO)
Currently: ~152x10^6 sequences, ~242x10^9 bp; sequences from > different species

24 More than species, but human mouse rat Human/Mouse/Rat: organisms with the highest redundancy!

25 Where was the sequenced specimen collected? Geographical origin of sequenced samples (since 2005) (lat_lon: latitude_longitude qualifier)

26 EMBL/GenBank/DDBJ: a very important annotation for proteomics is the CoDing Sequence (CDS), in particular for eukaryotes

27 [Diagram: cDNAs, ESTs, genes and genomes are submitted to EMBL, GenBank and DDBJ with or without an annotated CDS provided by the authors; some data are not submitted to public databases*, or are delayed or cancelled.] CDS (CoDing Sequence): the portion of DNA/RNA translated into protein (from Met to STOP). It is either experimentally proven or derived from gene prediction.

28 5 Problems

29 Problem 1 Complete genome (submitted) only ~ 2,015 CDS available!

30 [Charts: species distribution at the nucleic acid level (human, mouse, rat) and at the protein level (example with UniProtKB/TrEMBL).] The CDSs of viruses and bacteria are easy to obtain!

31 Problem 2: variable level of sequence quality
- Sequencing quality
- Gene prediction quality
Authors can specify the nature of the CDS by using the qualifier "/evidence=experimental" or "/evidence=not_experimental". This is very rarely done.


33 UniProtKB/Swiss-Prot protein knowledgebase release 56.6 statistics (16-Dec-08). Protein existence (PE), % of entries:
- 1: Evidence at protein level: 15.3 %
- 2: Evidence at transcript level: 15.8 %
- 3: Inferred from homology: 65.2 %
- 4: Predicted: 3.4 %
- 5: Uncertain: 0.3 %

34 Problem 3: highly redundant. The archives are a sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository) -> similarity searches are not straightforward

35 Problem no 4 Author authority --> variable level of the annotation (CDS and other) quality - i.e. gene/protein name attribution

36 EMBL/GenBank/DDBJ: the authors have full authority over the content of the entries they submit! (Editorial control of the content belongs to the authors.) Exception: TPA (Third Party Annotation), since January 2003.

37 Problem no 5 Environmental samples

38 Environmental sequences (ENV). Aim: to sequence all the DNA present in a given sample, without knowing which species the DNA is derived from
- Sargasso Sea (Craig Venter)
- human fluids
- earth


41 Environmental samples:
- No idea of the species (microbial population)
- No idea of the gene prediction program to be used
- No idea of the genetic code to be used for translation!
- Not always associated with a CDS. If they are, the protein sequences are present in the protein sequence databases


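43 [Diagram: from nucleic acid databases to protein sequence databases.] cDNAs, ESTs and genomes are submitted to the nucleic acid databases (EMBL, GenBank, DDBJ); some data are not submitted to public databases, or are delayed or cancelled. Sequences reach the protein sequence databases if the submitters provide an annotated Coding Sequence (CDS) (1/10 EMBL entries); entries with no CDS require gene prediction.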

44 Major protein sequence database sources
Integrated resources (cross-references): UniProtKB = Swiss-Prot + TrEMBL
Separated resources: NCBI-nr = Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
- UniProtKB/Swiss-Prot: manually annotated protein sequences ( species)
- UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot ( species)
- GenPept: submitted CDS (GenBank); redundant with Swiss-Prot ( species)
- PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
- PDB: Protein Data Bank; 3D data and associated sequences
- PRF: journal scan of published peptide sequences
- RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4 000 species)

45 UniProt, the Universal Protein Resource, is maintained by the UniProt consortium: SIB + EBI + PIR. SIB = Swiss Institute of Bioinformatics; EBI = European Bioinformatics Institute; PIR = Protein Information Resource


48 The UniProt Knowledgebase (UniProtKB): an encyclopedia of proteins, released every two weeks

49 EMBL -> TrEMBL: automated extraction of the protein sequence (translated CDS), gene name and references, plus automated annotation

50 Warning: the quality of UniProtKB/TrEMBL data, including the protein sequence, depends directly on the information provided by the submitter of the original nucleotide entry. Automated annotation:
- RuleBase: uses rules derived from manually annotated Swiss-Prot entries, but with no manual oversight
- Spearmint: uses automatically generated rules

51 EMBL -> TrEMBL -> Swiss-Prot
- EMBL -> TrEMBL: automated extraction of the protein sequence (translated CDS), gene name and references, plus automated annotation
- TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information

52 UniProtKB from TrEMBL to Swiss-Prot Sequence check

53 UniProtKB/Swiss-Prot: 1 entry <-> 1 gene (1 species)
i) Merging of all known protein sequences (CDS) derived from the same gene -> avoids redundancy and improves sequence reliability (for human: ~ 6 different sequence reports per entry)
ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants, etc.) -> annotation of protein diversity

54 Righting the wrongs Sequences are rarely deposited in a mature state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections. Sequencing error rates: ~1 base in

55 Protein existence: what evidence exists that proves the existence of a protein? Different qualifiers:
1. Evidence at protein level (~15.3 %)
2. Evidence at transcript level (~15.8 %)
3. Inferred from homology (~65.2 %)
4. Predicted (~3.4 %)
5. Unassigned (mainly in TrEMBL) (0.3 %)

56 Annotation Focal point of our efforts to maintain and develop UniProtKB/Swiss-Prot; Enables individual researchers to obtain a summary of what is known about a protein

57 In a UniProtKB/Swiss-Prot entry, you can expect to find:
- A (often corrected) protein sequence and the description of various isoforms/variants
- Its biological origin, with links to the taxonomic databases
- All the names of a given protein (and of its gene)
- A summary of what is known about the protein: function, alternative products, PTMs, tissue expression, disease, 3D data, etc.
- A description of important sequence features: domains, PTMs, variations, etc.
- A selection of references
- Selected keywords
- Numerous cross-references (central hub)

58 An easy way to access the history of a protein sequence entry UniSave homepage:


61 Other UniProt databases


63 UniRef

64 UniRef: useful for comprehensive BLAST searches, because it provides sets of representative sequences («collapsing BLAST results»). Three collections of sequence clusters from the UniProt Knowledgebase and Ensembl, IPI, EMBL_WGS:
- One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments with 11 or more residues are placed into a single record) -> reduction of 12 %
- One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %
- One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Clustering is independent of the species!
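The UniRef100 rule quoted on this slide (identical sequences and sub-fragments of 11 or more residues share one record) can be illustrated with a toy sketch. This is not the UniRef pipeline: the function name, the greedy longest-first strategy and the example sequences are all assumptions made for illustration.

```python
MIN_FRAGMENT_LENGTH = 11  # threshold quoted on the slide


def collapse_identical(sequences):
    """Group identical sequences and sub-fragments (>= 11 residues) under one representative."""
    clusters = {}  # representative sequence -> list of member sequences
    # Process the longest sequences first so that fragments find their parent.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative in clusters:
            is_fragment = len(seq) >= MIN_FRAGMENT_LENGTH and seq in representative
            if seq == representative or is_fragment:
                clusters[representative].append(seq)
                break
        else:
            # No existing cluster accepts this sequence: it starts a new one.
            clusters[seq] = [seq]
    return clusters


# Made-up sequences, for illustration only.
example = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRQ",  # identical copy
    "AYIAKQRQISFVK",           # 13-residue sub-fragment: merged
    "AYIAK",                   # 5-residue fragment: too short, kept apart
    "MSLLTEVETPIRNEWG",        # unrelated sequence
]
clusters = collapse_identical(example)
print(len(clusters), "clusters from", len(example), "sequences")
```

UniRef90 and UniRef50 need pairwise identity computed from alignments, which this sketch deliberately leaves out.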

65 UniParc

66 UniParc

67 UniParc: the UniProt Archive (UniParc) is part of the UniProt project. It is a non-redundant archive of protein sequences extracted from public databases: UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, PIR-PSD, EMBL, EMBL WGS, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, H-Invitational Database, TROME, European Patent Office proteins, United States Patent and Trademark Office proteins (USPTO) and Japan Patent Office proteins. UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the cross-references. Each unique sequence is stored only once, with a stable identifier. The format of the identifier is UPI followed by ten hexadecimal digits, e.g. upi a.
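The identifier format described here ("UPI" followed by ten hexadecimal digits) is easy to check with a regular expression. A minimal sketch: the function name is an assumption, and the identifiers in the example are made up to exercise the pattern, not real UniParc records.

```python
import re

# "UPI" followed by exactly ten hexadecimal digits, as described on the slide.
UPI_PATTERN = re.compile(r"UPI[0-9A-F]{10}")


def is_uniparc_id(identifier):
    """Return True if the string has the shape of a UniParc identifier."""
    return UPI_PATTERN.fullmatch(identifier.upper()) is not None


# Hypothetical identifiers, for illustration only.
for candidate in ("UPI00000000AB", "upi00000000ab", "UPI12345", "UPI00000000GZ"):
    print(candidate, is_uniparc_id(candidate))
```

The check is purely syntactic: a well-formed string is not guaranteed to correspond to an existing UniParc record.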

68 UniParc: use with extreme caution. It also contains pseudogenes, incorrect CDS predictions, etc., as well as patent office data (EPO, ESPO ).

69 Not downloadable

70 UniMES

71 The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data. UniMES is available in FASTA format on the UniProt ftp servers, in the new subdirectory current_release/unimes:
- ftp.uniprot.org/pub/databases/uniprot
- ftp.ebi.ac.uk/pub/databases/uniprot
- ftp.expasy.org/databases/uniprot


73 NCBInr (Entrez protein)

74 Protein sequences: «NR» Entrez protein


76 NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
- PRF: sequences derived from scientific publications («Journal scan»); integrated into TrEMBL
- PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
- GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL
- PDB: 3D structure database; all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

77 RefSeq


79 RefSeq: the Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.
- 3,648,590 entries (22-May-2007); 4,300 species
- 5,590,364 entries (11-July-2008); 5,395 species
- 6,042,750 entries (20-November-2008); 5,726 species
Accession numbers:
- RNA: NM_
- genomic: NT_
- protein: NP_
- predicted protein: XP_
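The prefix convention listed on this slide lends itself to a simple lookup. The sketch below covers only the four prefixes named here (RefSeq defines others); the function name and the example accessions are hypothetical.

```python
# Prefix -> molecule type, restricted to the four prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
    "XP_": "predicted protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")


# Hypothetical accessions, for illustration only.
for accession in ("NM_000000", "XP_000000", "AB_000000"):
    print(accession, "->", refseq_molecule_type(accession))
```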

80 AC

81 AC KW Taxonomy References

82 NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
- PRF: sequences derived from scientific publications («Journal scan»); integrated into TrEMBL
- PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)
- GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot
- PDB: 3D structure database; all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

83 PIR

84 PIR: the Protein Information Resource. PIR-PSD is no longer updated, but exists as an archive

85 PDB

86 PDB (Protein Data Bank): 3D structure database
- Contains the spatial coordinates of the atoms of macromolecules whose 3D structure has been obtained by X-ray or NMR studies
- Also contains the corresponding protein sequences*
- Includes mutated protein sequences (effect of a mutation on the 3D structure)
*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

87 PDB: Protein Data Bank. Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA). Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol). Currently there are structural data for about different proteins, but for far fewer protein families (highly redundant)!

88 PDB: example

89 Sequence Coordinates of each atom

90 Visualisation with Jmol

91 PRF

92 PRF looks for the peptide sequences described in publications (and which are not submitted to the databases!)


94 Query at Entrez protein (NCBInr)

95 RefSeq Typical result of a query at «Entrez protein» Genpept (gb/embl/ddbj) PIR Swiss-Prot PDB

96 AC GenInfo identifier number

97 GI number: GenInfo identifier number
- In addition to an AC number specific to the original database, each protein sequence in NCBInr has a GI number.
- If the sequence changes in any way, a new GI number is assigned -> not a stable identifier
- A separate GI number is also assigned to each protein translation within a nucleotide sequence record (alternative products)
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:


99 EnsEMBL not only for proteins.

100 EnsEMBL Automated genome annotation and subsequent visualisation of annotated genomes. Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant and fungal genomes.

101 EnsEMBL:
- Aligns the genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- Also performs gene prediction (-> novel genes)
- DNA, RNA and protein sequences available for ~30 species
- Browsing tool


103 Browsing tool available for 49 species


105 CCDS Consensus CDS protein set


107 CCDS (human): combines different approaches (ab initio, by similarity) and takes advantage of the expertise acquired by different institutes, including manual annotation. It is a consensus between 4 institutions.


109 IPI International Protein Index


111 IPI (International Protein Index)
- Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken and cow proteomes: Swiss-Prot, TrEMBL, RefSeq and Ensembl (and H-InvDB, TAIR and VEGA).
- IPI is built to provide maximum coverage of the major publicly available protein (and gene) databases for a given protein.
- For each protein in IPI, an entry from one of the constituent databases is selected as the master entry, and supplies the IPI entry with its sequence and annotation.
- Stable identifiers (with incremental versioning) are maintained to allow the tracking of sequences between IPI releases.


113 IMGT (the international ImMunoGeneTics information system) is a collection of high-quality integrated databases specialising in immunoglobulins, T cell receptors and the Major Histocompatibility Complex (MHC) of all vertebrate species.


116 Protein sequence databases for proteomics

117 Databases searched by protein identification tools:
- Phenyx: UniProtKB
- PROWL: NCBInr, Swiss-Prot, dbEST
- Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, LudwigNR, OWL*
- Peptident (Aldente): UniProtKB
- Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB
EST sequences are translated in the 6 frames (ESTs are not associated with annotated CDSs!)
* OWL has been obsolete since 1999
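Because ESTs carry no annotated CDS, search engines translate them in all six reading frames. The sketch below shows the idea with the standard genetic code only; the function names are illustrative, and unknown codons (for example those containing N) are rendered as X by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate both strands in the three possible reading frames each."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


for frame, peptide in six_frame_translation("ATGGCC").items():
    print(frame, peptide)
```

Only one of the six frames (at most) is the real one, which is why EST-derived search spaces are large and noisy compared with annotated protein databases.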

118 OWL: non-redundant protein database, including Swiss-Prot, PIR, NRL3D* and GenPept. *The PIR-NRL3D database makes the sequence information in PDB available for similarity searches


121 ID/AC mapping

122 -> Accession / version number jungle! Depending on the database, an AC number can be associated with an entry (gene product: stable even if the sequence changes) or with a sequence (it changes as soon as the sequence changes)

123 In summary, for the same protein sequence you can find:
- A UniProtKB/Swiss-Prot entry
- A RefSeq entry (or GenPept)
- An Ensembl entry
- A CCDS entry
- A UniParc entry (archive)
- An IPI entry

124 The AC number jungle
Type of record: sample accession format
- GenBank/EMBL/DDBJ: one letter followed by five digits (e.g. U12345), or two letters followed by six digits (e.g. AF)
- Swiss-Prot/TrEMBL: one letter and five digits/letters (e.g. P12345, A0B533)
- RefSeq nucleotide: two letters, an underscore and six digits (e.g. mRNA NM_, genomic NT_)
- RefSeq protein: e.g. NP_00483
- RefSeq prediction: e.g. XM_, XP_
- PDB (protein structure): one digit followed by three letters (e.g. 1TUP)
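The formats on this slide can be turned into regular expressions to guess where an accession comes from. This is a sketch under stated assumptions: the patterns encode only the rules as worded on the slide (real accession formats are broader), and the function name is illustrative. It returns every database whose pattern matches, which makes the "jungle" visible.

```python
import re

# Patterns transcribed from the slide's wording; real formats have more variants.
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "Swiss-Prot/TrEMBL": re.compile(r"[A-Z][A-Z0-9]{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d{6}"),
    "PDB": re.compile(r"\d[A-Z]{3}"),
}


def candidate_sources(accession):
    """Return every database whose accession format matches the string."""
    accession = accession.upper()
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for accession in ("U12345", "P12345", "A0B533", "1TUP"):
    print(accession, candidate_sources(accession))
```

Under these rules U12345 and P12345 each match both the nucleotide and the protein pattern, so the format alone cannot identify the source database.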

125 uniprot.org


127 UniProtKB and PTMs

128 Proteome complexity Not predictable at the genome level! (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: ).

129 Chemical aspects: post-translational modifications (PTMs) consist of the breaking and/or making of covalent bonds, catalyzed by enzymes. PTMs modify both the protein mass and the isoelectric point (pI).
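The effect of PTMs on mass is what makes them visible in mass spectrometry. A minimal sketch of the bookkeeping follows; the delta masses are commonly quoted monoisotopic values and are not taken from the slides, and the function name and the peptide mass in the example are made up.

```python
# Commonly quoted monoisotopic mass shifts in Da (assumption: not from the slides).
PTM_DELTA_MASS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
    "hydroxylation": 15.99491,
}


def modified_mass(unmodified_mass, modifications):
    """Add the mass shift of each listed modification to an unmodified mass."""
    return unmodified_mass + sum(PTM_DELTA_MASS[name] for name in modifications)


# Hypothetical peptide of 1000.0 Da carrying two phosphate groups.
print(round(modified_mass(1000.0, ["phosphorylation", "phosphorylation"]), 5))
```

Databases such as RESID and UNIMOD, mentioned later in the deck, are the places to look up delta masses for real work.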

130 The PTM variety. [Figure: the 20 amino acids (Gly, Ala, Val, Leu, Ile, Lys, Arg, His, Asp, Glu, Asn, Gln, Cys, Ser, Thr, Met, Pro, Phe, Tyr, Trp) and the modifications they can carry.]
- Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar
- N-terminal modifications: acetylation, methylation, acylation, crosslinks
- C-terminal modifications: GPI, amidation, crosslinks, methylation
Legend: black = cytoplasmic modifications; dark grey = both cytoplasmic and extracellular modifications, depending on the exact type; light grey = extracellular modifications

131 PTM distribution among kingdoms. [Venn diagram over archaea, bacteria and eukaryotes.] Labels shown: FMN binding, bacterial lipid anchor, pyrrolysine, bacteria-specific methylation, lanthionine crosslink, archaea-specific methylation, acetylation, archaean lipid anchor, phosphorylation, myristoylation, methylation, FAD binding, diphthamide, palmitoylation, GPI-anchor, amidation, sulfation, eukaryote-specific methylation.

132 PTM annotation in UniProtKB entries PTMs are annotated in the feature table ( sequence annotation ) when they can be assigned a position on the protein sequence - in the comments when they cannot.

133 PTM-dedicated FT keys
FT key: usage
- CARBOHYD (Glycosylation): sugars
- DISULFID (Disulfide bond): disulfide bonds
- CROSSLNK (Cross-link): other crosslinks
- LIPID: lipids
- MOD_RES (Modified residue): other modifications
PTMs are grouped by type and are specifically and uniquely annotated by the use of a controlled vocabulary and a set of specific FT keys
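The FT keys above make it straightforward to pull positioned PTMs out of an entry in flat-file form. The sketch below assumes the classic whitespace-separated layout (FT, key, start, end, description) and silently skips lines whose positions are not plain integers; the function name and the example lines are made up for illustration.

```python
# FT keys dedicated to PTMs, with the usage given on the slide.
PTM_FT_KEYS = {
    "CARBOHYD": "sugars",
    "DISULFID": "disulfide bonds",
    "CROSSLNK": "other crosslinks",
    "LIPID": "lipids",
    "MOD_RES": "other modifications",
}


def ptm_features(flat_file_lines):
    """Return (key, start, end, description) for each PTM feature line."""
    features = []
    for line in flat_file_lines:
        fields = line.split(None, 4)
        if len(fields) < 4 or fields[0] != "FT" or fields[1] not in PTM_FT_KEYS:
            continue
        if not (fields[2].isdigit() and fields[3].isdigit()):
            continue  # uncertain positions are ignored in this sketch
        description = fields[4].strip() if len(fields) > 4 else ""
        features.append((fields[1], int(fields[2]), int(fields[3]), description))
    return features


# Hypothetical lines in the assumed layout, for illustration only.
example_lines = [
    "ID   EXAMPLE_MOUSE",
    "FT   MOD_RES      15     15       Phosphoserine.",
    "FT   DOMAIN       20    120       Some domain.",
    "FT   CARBOHYD     42     42       N-linked (GlcNAc...).",
]
for feature in ptm_features(example_lines):
    print(feature)
```

PTMs that cannot be assigned a position live in the comment lines instead, so a feature-table scan like this one sees only part of the annotation.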

134 PTM annotation in UniProtKB entries PTMs are annotated in the feature table when they can be assigned a position on the protein sequence - in the comments when they cannot. Associated keywords


136 Find all mouse proteins which are phosphorylated


138 UniProtKB/Swiss-Prot: number of PTMs in Swiss-Prot release 51 ( entries), all organisms. [Table: columns Pot. (potential), By sim. (by similarity), Exp. & Prob. (experimental and probable), total; rows signal peptide, N-GlcNAc, O-GalNAc, O-GlcNAc, phosphorylation, sulfation, myristate, GPI-anchor.]

139 Resid

140 RESID: RESID is a database of 473 natural modifications (Rel ) with chemical and structural annotations such as recommended name and synonyms, delta mass, 3D structure, UniProt annotations, etc. FTP sites: ftp://ftp.ebi.ac.uk/pub/databases/resid/ ftp://ftp.ncifcrf.gov/pub/users/residues Web sites:

141 RESID

142 RESID

143 Other PTM databases. UNIMOD: PSI-MOD: ontology Delta Mass:

144 GO


146 GO scope. Three disjoint axes:
- cellular component: sub-cellular location, e.g. nucleus, ribosome, origin recognition complex
- molecular function: molecular role, e.g. catalytic activity, binding
- biological process: broad biological phenomena, e.g. mitosis, growth, digestion

147 GO structure terms are related within a hierarchy Terms are linked by two relationships is-a part-of

148 GO structure. [Diagram: example hierarchy with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships.]
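A hierarchy of terms linked by is-a and part-of can be stored as a small graph and walked upwards. The sketch below uses the five terms from this slide; the individual edges are an assumption (the slide text does not spell them out) and the function name is illustrative.

```python
# term -> list of (relationship, parent term); the edges are assumed, not from the slide text.
GO_PARENTS = {
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term):
    """Return every term reachable from `term` by following is-a or part-of links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in GO_PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane")))
```

A term can have more than one parent, which is why the structure is a graph rather than a simple tree: annotating a protein with a specific term implicitly annotates it with all of that term's ancestors.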

149 GOA: Gene Ontology Annotation What is GOA? GOA aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase and International Protein Index, using GO terms. The GOA project is run by EBI and is a member of the GO consortium since In 2001, the first phase of the GOA project involved the large-scale assignment of GO terms to Swiss-Prot and TrEMBL entries using electronic methods, namely the mappings spkw2go, ec2go and Interpro2go.


151 e-proxemis:

152

Sequence Databases and database scanning

Sequence Databases and database scanning Sequence Databases and database scanning Marjolein Thunnissen Lund, 2012 Types of databases: Primary sequence databases (proteins and nucleic acids). Composite protein sequence databases. Secondary databases.

More information

Protein Bioinformatics Part I: Access to information

Protein Bioinformatics Part I: Access to information Protein Bioinformatics Part I: Access to information 260.655 April 6, 2006 Jonathan Pevsner, Ph.D. pevsner@kennedykrieger.org Outline [1] Proteins at NCBI RefSeq accession numbers Cn3D to visualize structures

More information

Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL

Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL PIR-PSD Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database o A high quality

More information

Biological databases an introduction

Biological databases an introduction Biological databases an introduction By Dr. Erik Bongcam-Rudloff SLU 2017 Biological Databases Sequence Databases Genome Databases Structure Databases Sequence Databases The sequence databases are the

More information

ELE4120 Bioinformatics. Tutorial 5

ELE4120 Bioinformatics. Tutorial 5 ELE4120 Bioinformatics Tutorial 5 1 1. Database Content GenBank RefSeq TPA UniProt 2. Database Searches 2 Databases A common situation for alignment is to search through a database to retrieve the similar

More information

Computational Biology and Bioinformatics

Computational Biology and Bioinformatics Computational Biology and Bioinformatics Computational biology Development of algorithms to solve problems in biology Bioinformatics Application of computational biology to the analysis and management

More information

Types of Databases - By Scope

Types of Databases - By Scope Biological Databases Bioinformatics Workshop 2009 Chi-Cheng Lin, Ph.D. Department of Computer Science Winona State University clin@winona.edu Biological Databases Data Domains - By Scope - By Level of

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introduction to Bioinformatics Sequence Alignment Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ Database What is database An organized set of data Can

More information

Biological databases an introduction

Biological databases an introduction Biological databases an introduction By Dr. Erik Bongcam-Rudloff SGBC-SLU 2016 VALIDATION Experimental Literature Manual or semi-automatic computational analysis EXPERIMENTAL Costs Needs skilled manpower

More information

I nternet Resources for Bioinformatics Data and Tools

I nternet Resources for Bioinformatics Data and Tools ~i;;;;;;;'s :.. ~,;;%.: ;!,;s163 ~. s :s163:: ~s ;'.:'. 3;3 ~,: S;I:;~.3;3'/////, IS~I'//. i: ~s '/, Z I;~;I; :;;; :;I~Z;I~,;'//.;;;;;I'/,;:, :;:;/,;'L;;;~;'~;~,::,:, Z'LZ:..;;',;';4...;,;',~/,~:...;/,;:'.::.

More information

ONLINE BIOINFORMATICS RESOURCES

ONLINE BIOINFORMATICS RESOURCES Dedan Githae Email: d.githae@cgiar.org BecA-ILRI Hub; Nairobi, Kenya 16 May, 2014 ONLINE BIOINFORMATICS RESOURCES Introduction to Molecular Biology and Bioinformatics (IMBB) 2014 The larger picture.. Lower

More information

The Gene Ontology Annotation (GOA) project application of GO in SWISS-PROT, TrEMBL and InterPro

The Gene Ontology Annotation (GOA) project application of GO in SWISS-PROT, TrEMBL and InterPro Comparative and Functional Genomics Comp Funct Genom 2003; 4: 71 74. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.235 Conference Review The Gene Ontology Annotation

More information

Genome Informatics. Systems Biology and the Omics Cascade (Course 2143) Day 3, June 11 th, Kiyoko F. Aoki-Kinoshita

Genome Informatics. Systems Biology and the Omics Cascade (Course 2143) Day 3, June 11 th, Kiyoko F. Aoki-Kinoshita Genome Informatics Systems Biology and the Omics Cascade (Course 2143) Day 3, June 11 th, 2008 Kiyoko F. Aoki-Kinoshita Introduction Genome informatics covers the computer- based modeling and data processing

More information

Web-based Bioinformatics Applications in Proteomics

Web-based Bioinformatics Applications in Proteomics Web-based Bioinformatics Applications in Proteomics Chiquito Crasto ccrasto@genetics.uab.edu January 30, 2009 NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/ 1 Pubmed

More information

NiceProt View of Swiss-Prot: P18907

NiceProt View of Swiss-Prot: P18907 Hosted by NCSC US ExPASy Home page Site Map Search ExPASy Contact us Swiss-Prot Mirror sites: Australia Bolivia Canada China Korea Switzerland Taiwan Search Swiss-Prot/TrEMBL for horse alpha Go Clear NiceProt

More information
