Applied Bioinformatics Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu
Course overview
What is bioinformatics?
- A data-driven science: the creation and advancement of databases, algorithms, and computational and statistical methods to solve theoretical and practical problems arising from the management and analysis of biological data.
- Major research areas: sequence alignment, gene finding, genome assembly, protein structure prediction, gene expression and regulation, protein interaction, drug design, genome-wide association studies, computational evolutionary biology, etc.
Applied bioinformatics module
- Not a comprehensive guide to all facets of bioinformatics
- Aims to equip you with the computational understanding and expertise needed to solve bioinformatics problems that you will likely encounter in your research
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
Course overview (continued)
http://www.bioinformatics.ca/links_directory/
Course content and grades

| Date | Subject | Instructor | Homework (HW) |
|------|---------|------------|---------------|
| 2/14 | Finding information about genes | Zhang | |
| 2/16 | Navigating sequenced genomes | Zhang | |
| 2/18 | Pairwise sequence alignment and database search | Zhao | |
| 2/21 | Multiple sequence alignment | Zhao | |
| 2/23 | Inferring phylogenetic relationships from sequence data | Zhao | HW I distribution (20 pts Zhao + 10 pts Zhang) |
| 2/25 | Protein sequence annotation | Tabb | |
| 2/28 | Protein structure prediction and visualization | Tabb | HW I due |
| 3/2 | Protein identification by mass spectrometry | Tabb | HW II distribution (20 pts Tabb) |
| 3/4 | Gene prediction and annotation | Bush | |
| 3/7 | Finding regulatory and conserved elements in DNA sequence | Bush | HW II due |
| 3/9 | Assessing the impact of genetic variation | Bush | HW III distribution (20 pts Bush) |
| 3/11 | Supervised analysis of gene expression data | Zhang | |
| 3/14 | Unsupervised analysis of gene expression data | Zhang | HW III due |
| 3/16 | Functional interpretation of gene lists | Zhang | |
| 3/18 | Biological pathways | Zhang | |
| 3/21 | Biological networks | Zhang | HW IV distribution (30 pts Zhang) |
| 3/25 | | | HW IV due by 5pm! |

- HW assignments will be graded by each instructor for their respective sections.
- Final grade = sum of the HW scores (100 pts in total).
- A: 85-100; B: 70-84; C: 55-69; D: 40-54; F: 0-39
Course materials and assignments
- Lecture slides are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture.
- Homework assignments are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution date (2/23, 3/2, 3/9, 3/21).
- Homework assignments are due at 5pm on the due date (2/28, 3/7, 3/14, 3/25). Late reports lose 10% per day.
- Email your reports in pdf, doc, or docx format to the corresponding instructor(s):
  - HW I: bing.zhang@vanderbilt.edu; zhongming.zhao@vanderbilt.edu
  - HW II: david.l.tabb@vanderbilt.edu
  - HW III: william.s.bush@vanderbilt.edu
  - HW IV: bing.zhang@vanderbilt.edu
- Textbook (optional): Dear, Paul H. (2007) Methods Express: Bioinformatics. Scion, ISBN 978-1904842163.
Finding information about genes Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu
When do we need gene information?
Case 1
From Prof. Randy Blakely (Pharmacology): We have hit an uncharacterized gene in our hunt for SERT interacting proteins, ******, that appears to be highly depleted when extracts are made from SERT KO mice. Can you help us come up with some ideas as to what this gene might be.
Case 2
From Prof. Kevin Schey (Biochemistry): I've attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We've included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?
Resources
- Entrez Gene (NCBI/NIH): http://www.ncbi.nlm.nih.gov/gene
  All completely sequenced genomes; one gene per page
- GeneCards (Weizmann Institute of Science, Israel): http://www.genecards.org
  Comprehensive information on human genes
- Ensembl BioMart (EMBL-EBI and Sanger Institute): http://www.ensembl.org/biomart/martview
  Vertebrates and other selected eukaryotic species; batch information retrieval
- WikiGenes (MIT): http://www.wikigenes.org
  Collaborative annotation in a wiki system
- GLAD4U (Vanderbilt): http://bioinfo.vanderbilt.edu/glad4u
  Genes related to a specific topic
Learning objectives
- To gain a basic understanding of the Entrez Gene system
- To be able to retrieve information for individual genes using Entrez Gene
- To gain a basic understanding of the Ensembl BioMart system
- To be able to retrieve information for a list of genes using Ensembl BioMart
Entrez Gene: overview
- Data source
  - Automated analyses and curation by NCBI staff
  - Data stored in flat files
  - Updated continuously
- Unique gene identifier
  - Entrez Gene uses unique integers (GeneIDs) as stable identifiers for genes, e.g. the GeneID for human tumor protein p53 (TP53) is 7157
  - The GeneID assigned to each record is species specific, e.g. the GeneID for the mouse ortholog of TP53 (Trp53) is 22059
- Statistics as of February 2011
  - 7.2 million records distributed among 7039 taxa
  - 45,227 records for human
- Query system: Entrez
Entrez Gene: Entrez
- An integrated search and retrieval system that provides access to many discrete databases at the NCBI website.
- All databases indexed by Entrez, including Entrez Gene, can be searched via a single query string
  - Supports the Boolean operators AND, OR, NOT
  - Supports search term tags that limit the search to particular fields (title, organism, etc.)
- Sample query
  transporter[title] AND ("Homo sapiens"[organism] OR "Mus musculus"[organism])
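The sample query can also be assembled programmatically, which helps when the terms come from a list. The sketch below is illustrative only: the helper names `tagged`, `any_of`, and `all_of` are made up for this example and are not part of any NCBI library. It simply reproduces the query string shown on the slide.

```python
def tagged(term, field):
    """Limit a search term to one field, e.g. transporter[title]."""
    # Multi-word terms are quoted so Entrez treats them as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def any_of(*terms):
    """Join terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


query = all_of(
    tagged("transporter", "title"),
    any_of(
        tagged("Homo sapiens", "organism"),
        tagged("Mus musculus", "organism"),
    ),
)
print(query)
```

The printed string can be pasted into the Entrez Gene search box as-is.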
Entrez Gene: search result
[Screenshot of the Entrez Gene search results page, with callouts: Display Settings, Help, Advanced search, Filtering, Summary record, Related data]
Entrez Gene: Gene record (I)
Each Gene record integrates multiple types of information:
- Gene type: tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, protein-coding, pseudo, other, and unknown
- Nomenclature, summary descriptions, accessions of gene-specific and gene product-specific sequences, chromosomal location, reports of pathways and protein interactions, associated markers and phenotypes
- Links to other databases at NCBI, including literature citations, sequences, variations, and homologs
- Links to databases outside of NCBI
Entrez Gene: Gene record (II)
[Screenshot of the Gene record page for human TP53, with callouts: New search, Export, Expand, Help]
http://www.ncbi.nlm.nih.gov/gene/7157
Entrez Gene: advanced ways of accessing
- FTP download
  ftp://ftp.ncbi.nlm.nih.gov/gene/readme
- E-utilities (Entrez Programming Utilities)
  - Server-side programs that provide a stable interface into the Entrez query and database system
  - Use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature
  - Work with any programming language that can send a URL to the E-utilities server and interpret the XML response, e.g. Perl, Python, Java, and C++
  - Combining E-utilities components to form customized data pipelines within these applications is a powerful approach to data manipulation
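To make the "fixed URL syntax" concrete, here is a minimal Python sketch of an ESearch request against the Gene database. Several things are assumptions not stated on the slide: the base address is the E-utilities address as I understand it today, the helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written fragment in the general shape of an ESearch reply (it reuses the two GeneIDs from the overview slide), not real server output. The code only builds the URL and parses XML, so it runs without network access.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed base address of the E-utilities server.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL from the standard db/term/retmax parameters."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(xml_text):
    """Extract the record IDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Hand-written illustration of the response layout, not real output.
SAMPLE_RESPONSE = """
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>7157</Id>
    <Id>22059</Id>
  </IdList>
</eSearchResult>
"""

url = esearch_url("gene", 'transporter[title] AND "Homo sapiens"[organism]')
print(url)
print(parse_id_list(SAMPLE_RESPONSE))
```

In a real pipeline the URL would be fetched (for example with `urllib.request.urlopen`) and the returned IDs passed on to a second E-utilities call that retrieves the records.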
Entrez Gene: documentation and publications
- http://www.ncbi.nlm.nih.gov/books/nbk3841/
- Maglott et al. NAR, 39:D52-D57, 2011
Entrez Gene: exercise
Questions
1. How many records do we get for a simple search of kinase in Entrez Gene?
2. Use Boolean operators and search term tags to search for mouse genes located on chromosome 1 with kinase in the title. With the default display setting, what is the first hit?
3. Click on the first hit and identify how many publications in PubMed are associated with this gene.
4. Identify which proteins interact with the protein product of this gene.
Answers
1. 244,301 records
2. Query term: kinase[title] AND mouse[organism] AND 1[Chromosome]; the first hit is Epha4
3. Bibliography section: 220 citations in PubMed
4. Interactions section: 3 proteins (Epha4, Ngef, and Vav2)
Ensembl
- Genome databases for vertebrates and other selected eukaryotic species
  - Automated annotation system at EBI
  - Data stored in a relational database
  - Updated periodically with versions
- Unique gene identifier
  - Ensembl uses unique strings (Ensembl gene IDs) as stable identifiers for genes, e.g. the Ensembl gene stable ID for human tumor protein p53 (TP53) is ENSG00000141510
  - The Ensembl gene ID assigned to each record is species specific, e.g. the Ensembl gene stable ID for the mouse ortholog of TP53 (Trp53) is ENSMUSG00000059552
  - Clear gene, transcript, and protein relationship, e.g. ENSG00000141510 => 17 transcripts (e.g. ENST00000445888) => 13 proteins (e.g. ENSP00000391478)
- Statistics as of February 2011 (version 61)
  - 55 species
  - 53,630 genes for human
- Other species are available in the recently expanded system EnsemblGenomes
  http://www.ensemblgenomes.org
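The stable IDs above follow a visible pattern: the prefix ENS, an optional species code (none for human, MUS for mouse), a feature letter (G, T, or P), and an 11-digit number. The sketch below splits an ID into those parts. It is inferred from the examples on this slide rather than taken from an official specification, and the function and field names are hypothetical.

```python
import re
from collections import namedtuple

StableID = namedtuple("StableID", "species_code feature number")

# ENS + optional species code + feature letter + 11 digits (assumed layout).
PATTERN = re.compile(r"^ENS([A-Z]*)([GTP])(\d{11})$")
FEATURES = {"G": "gene", "T": "transcript", "P": "protein"}


def parse_stable_id(stable_id):
    """Split an Ensembl stable ID into species code, feature type, and number."""
    match = PATTERN.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    species_code, letter, digits = match.groups()
    return StableID(species_code, FEATURES[letter], int(digits))


for example in ("ENSG00000141510", "ENSMUSG00000059552", "ENST00000445888"):
    print(example, parse_stable_id(example))
```

An empty species code corresponds to human in these examples. A check like this is handy for catching Entrez GeneIDs or gene symbols that have slipped into a list meant to contain only Ensembl IDs.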
BioMart: a batch information retrieval system
- BioMart is a query-oriented data management system.
- Batch information retrieval for complex queries
- Particularly suited for providing "data mining"-like searches of complex descriptive data, such as those related to genes and proteins
- Open source and customizable
- Originally developed for the Ensembl genome databases
- Adopted by many other projects, including UniProt, InterPro, Reactome, and the Pancreatic Expression Database (see a complete list and get access to the tools from http://www.biomart.org/)
BioMart: basic concepts
Every query combines three elements: a dataset, filters, and attributes.
From Prof. Kevin Schey (Biochemistry): I've attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We've included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?
As a BioMart query: from all human genes (dataset), select those with the listed UniProt IDs (filter), and retrieve their GO annotations (attributes).
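BioMart services also accept queries written as a small XML document whose elements mirror these three concepts. The sketch below builds such a document for the GO annotation request. Treat the specifics as assumptions: the internal names `hsapiens_gene_ensembl`, `uniprotswissprot`, `go_id`, and `name_1006` differ between Ensembl releases and should be checked against the XML that the web interface generates, the function names are hypothetical, and `P04637` (the UniProt accession for human p53) merely stands in for the IDs in the spreadsheet.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a BioMart-style <Query> element from dataset, filters, attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Each filter restricts the rows; list values are comma separated.
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    # Each attribute becomes one output column.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return query


def serialize(query):
    """Render the query element as an XML document string."""
    header = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n'
    return header + ET.tostring(query, encoding="unicode")


go_query = build_query(
    dataset="hsapiens_gene_ensembl",            # assumed internal dataset name
    filters={"uniprotswissprot": ["P04637"]},   # assumed filter name, example ID
    attributes=["uniprotswissprot", "go_id", "name_1006"],  # assumed names
)
print(serialize(go_query))
```

The resulting document is what would be submitted to a BioMart server's `martservice` endpoint. The web interface remains the safer starting point, since it shows the valid filter and attribute names for the release in use.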
Ensembl BioMart analysis
1. Choose dataset
   - Choose database: Ensembl Genes 61
   - Choose dataset: Homo sapiens genes (GRCh37)
2. Set filters
   - Gene: a list of genes/proteins identified by various database IDs (e.g. IPI IDs)
   - Gene Ontology: filter for proteins with specific GO terms (e.g. cell cycle)
   - Protein domains: filter for proteins with specific protein domains (e.g. SH2 domain)
   - Region: filter for genes in a specific chromosome region (e.g. chromosome 1, bases 1 to 1,000,000, or band 11q13)
   - Others
3. Select output attributes
   - Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
   - External data: Gene Ontology, IDs in other databases
   - Expression: anatomical system, development stage, cell type, pathology
   - Protein domains: SMART, PFAM, InterPro, etc.
Ensembl BioMart: query interface
[Screenshot of the BioMart query interface, with callouts: Count, Results, Help, Perl API, Choose dataset, Set filters, Select output attributes]
Ensembl BioMart: sample output
[Screenshot of the BioMart results page, with callout: Export all results to a file]
Ensembl BioMart: documentation and publications
- http://www.ensembl.org/info/website/tutorials/index.html
- Smedley et al. BMC Genomics, 10:22, 2009
Ensembl BioMart analysis: exercise 1
Question
I have two Ensembl gene IDs, ENSG00000162367 and ENSG00000187048. How do I get their gene names from HGNC, their IDs from Entrez Gene, and any probes that contain these gene sequences from the Affymetrix microarray platform HC G110?
1. Choose dataset
   - Database: Ensembl Genes 61
   - Dataset: Homo sapiens genes (GRCh37.p2)
2. Set filters
   - Under GENE: check the "ID list limit" box
   - Select header "Ensembl Gene IDs" and enter the gene IDs into the box
3. Select output attributes
   - Select Features (default)
   - Under EXTERNAL: External References, select 'HGNC Symbol' and 'EntrezGene ID'
   - Under EXTERNAL: Microarray, select 'Affy HC G110'
4. Click on Count and then Results
5. Export all results to File, TSV
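BioMart exports one row per combination of attribute values, so a gene matched by several probes appears on several lines of the TSV file. The sketch below collapses such a file into one record per gene. The column headers are assumed to match the attribute labels selected above, and every value in `SAMPLE_TSV` other than the two Ensembl gene IDs is a made-up placeholder, not a real annotation.

```python
import csv
import io

# Placeholder export: symbols, Entrez IDs, and probe names are invented.
SAMPLE_TSV = (
    "Ensembl Gene ID\tHGNC symbol\tEntrezGene ID\tAffy HC G110\n"
    "ENSG00000162367\tSYMBOL_A\t1111\tprobe_a1\n"
    "ENSG00000162367\tSYMBOL_A\t1111\tprobe_a2\n"
    "ENSG00000187048\tSYMBOL_B\t2222\tprobe_b1\n"
)


def group_by_gene(tsv_text):
    """Collapse a BioMart TSV export into one record per Ensembl gene ID."""
    genes = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        record = genes.setdefault(
            row["Ensembl Gene ID"],
            {"symbol": row["HGNC symbol"],
             "entrez_id": row["EntrezGene ID"],
             "probes": []},
        )
        # Genes without a matching probe have an empty probe column.
        if row["Affy HC G110"]:
            record["probes"].append(row["Affy HC G110"])
    return genes


for gene_id, record in group_by_gene(SAMPLE_TSV).items():
    print(gene_id, record)
```

For a real export, replace `SAMPLE_TSV` with the contents of the downloaded file and adjust the column names to the header line it actually contains.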
Ensembl BioMart analysis: exercise 2
Question
How can I get the 2 kb upstream sequences for all mouse genes on chromosome 1?
1. Choose dataset
   - Database: Ensembl Genes 61
   - Dataset: Mus musculus genes (NCBIM37)
2. Set filters
   - Under REGION: check Chromosome, select 1
3. Select output attributes
   - Select Sequences
   - Under SEQUENCES: select Flank (Gene)
   - Under Upstream flank: check the box and enter 2000
   - Under Header Information, Gene Information: check Description
4. Click on Count (1916 of 36817 genes) and then Results
5. Export all results to File, FASTA format
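"Upstream" depends on the strand the gene lies on, which is easy to get wrong when computing flanks by hand. The sketch below shows the coordinate arithmetic under stated assumptions: 1-based inclusive coordinates with start <= end on both strands, and strand encoded as 1 or -1, as in Ensembl gene annotation. The function name and the example coordinates are hypothetical, and the function does not know the chromosome length, so a reverse-strand flank near the chromosome end is not clipped.

```python
def upstream_flank(gene_start, gene_end, strand, length=2000):
    """Return (start, end) of the region upstream of a gene, or None if empty.

    Coordinates are 1-based and inclusive, with gene_start <= gene_end.
    """
    if strand == 1:
        # Forward strand: upstream lies at lower coordinates.
        start = max(1, gene_start - length)
        end = gene_start - 1
    elif strand == -1:
        # Reverse strand: upstream lies at higher coordinates.
        start = gene_end + 1
        end = gene_end + length
    else:
        raise ValueError("strand must be 1 or -1")
    if end < start:
        return None  # the gene starts at the very first base
    return start, end


# Hypothetical gene spanning bases 5001-9000.
print(upstream_flank(5001, 9000, strand=1))
print(upstream_flank(5001, 9000, strand=-1))
```

For a reverse-strand gene the flanking sequence would additionally need to be reverse-complemented to read in the gene's 5' to 3' direction. BioMart handles both steps when Flank (Gene) is selected.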
Summary

| | Entrez Gene | Ensembl BioMart |
|---|---|---|
| URL | http://www.ncbi.nlm.nih.gov/gene | http://www.ensembl.org/biomart/martview |
| Maintained by | NCBI/NIH | EMBL-EBI and Sanger Institute |
| Coverage | All completely sequenced genomes | Mainly vertebrates |
| Storage | Data stored in flat files | Data stored in a relational database |
| Updates | Updated continuously | Updated periodically with versions |
| Unique gene identifier | Entrez Gene ID | Ensembl Gene ID |
| Query system | Entrez | BioMart |
| Output | One gene at a time | Multiple genes at the same time |