Applied Bioinformatics

Applied Bioinformatics Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu

Course overview
What is bioinformatics?
  A data-driven science: the creation and advancement of databases, algorithms, and computational and statistical methods to solve theoretical and practical problems arising from the management and analysis of biological data.
  Major research areas: sequence alignment, gene finding, genome assembly, protein structure prediction, gene expression and regulation, protein interaction, drug design, genome-wide association studies, computational evolutionary biology, etc.
Applied bioinformatics module
  Not a comprehensive guide to all facets of bioinformatics.
  Aims to equip you with the computational understanding and expertise needed to solve bioinformatics problems that you will likely encounter in your research.
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
http://www.bioinformatics.ca/links_directory/

Course content and grades
Date | Subject | Instructor | Homework (HW)
-----|---------|------------|--------------
2/14 | Finding information about genes | Zhang |
2/16 | Navigating sequenced genomes | Zhang |
2/18 | Pairwise sequence alignment and database search | Zhao |
2/21 | Multiple sequence alignment | Zhao |
2/23 | Inferring phylogenetic relationships from sequence data | Zhao | HW I distribution (20 pts Zhao + 10 pts Zhang)
2/25 | Protein sequence annotation | Tabb |
2/28 | Protein structure prediction and visualization | Tabb | HW I due
3/2 | Protein identification by mass spectrometry | Tabb | HW II distribution (20 pts Tabb)
3/4 | Gene prediction and annotation | Bush |
3/7 | Finding regulatory and conserved elements in DNA sequence | Bush | HW II due
3/9 | Assessing the impact of genetic variation | Bush | HW III distribution (20 pts Bush)
3/11 | Supervised analysis of gene expression data | Zhang |
3/14 | Unsupervised analysis of gene expression data | Zhang | HW III due
3/16 | Functional interpretation of gene lists | Zhang |
3/18 | Biological pathways | Zhang |
3/21 | Biological networks | Zhang | HW IV distribution (30 pts Zhang)
3/25 | | | HW IV due by 5pm!
HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the HW scores (100 pts in total).
A: 85-100; B: 70-84; C: 55-69; D: 40-54; F: 0-39

Course materials and assignments
Lecture slides are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture.
Homework assignments are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution dates (2/23, 3/2, 3/9, 3/21).
Homework assignments are due at 5pm on the due dates (2/28, 3/7, 3/14, 3/25). There will be a 10% per day deduction for late reports.
Email your reports in PDF, DOC, or DOCX format to the corresponding instructor(s):
  HW I: bing.zhang@vanderbilt.edu; zhongming.zhao@vanderbilt.edu
  HW II: david.l.tabb@vanderbilt.edu
  HW III: william.s.bush@vanderbilt.edu
  HW IV: bing.zhang@vanderbilt.edu
Textbook (optional): Dear, Paul H. (2007) Methods Express: Bioinformatics. Scion, ISBN 978-1904842163.

Finding information about genes Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu

When do we need gene information?
Case 1, from Prof. Randy Blakely (Pharmacology): "We have hit an uncharacterized gene in our hunt for SERT interacting proteins=****** that appears to be highly depleted when extracts are made from SERT KO mice. Can you help us come up with some ideas as to what this gene might be."
Case 2, from Prof. Kevin Schey (Biochemistry): "I've attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We've included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?"

Resources
Entrez Gene: http://www.ncbi.nlm.nih.gov/gene
  NCBI/NIH
  All completely sequenced genomes
  One gene per page
Gene Cards: http://www.genecards.org
  Weizmann Institute of Science, Israel
  Comprehensive information on human genes
Ensembl BioMart: http://www.ensembl.org/biomart/martview
  EMBL-EBI and Sanger Institute
  Vertebrates and other selected eukaryotic species
  Batch information retrieval
WikiGenes: http://www.wikigenes.org
  MIT
  Collaborative annotation in a wiki system
GLAD4U: http://bioinfo.vanderbilt.edu/glad4u
  Vanderbilt
  Genes related to a specific topic

Learning objectives
  To gain a basic understanding of the Entrez Gene system
  To be able to retrieve information for individual genes using Entrez Gene
  To gain a basic understanding of the Ensembl BioMart system
  To be able to retrieve information for a list of genes using Ensembl BioMart

Entrez Gene: overview
Data source: automated analyses and curation by NCBI staff
Data stored in flat files
Updated continuously
Unique gene identifier
  Entrez Gene uses unique integers (GeneIDs) as stable identifiers for genes, e.g. the GeneID for human tumor protein p53 (TP53) is 7157.
  The GeneID assigned to each record is species specific, e.g. the GeneID for the mouse ortholog of TP53 (Trp53) is 22059.
Statistics as of February 2011
  7.2 million records distributed among 7039 taxa
  45,227 records for human
Query system: Entrez

Entrez Gene: Entrez
An integrated search and retrieval system that provides access to many discrete databases at the NCBI website.
All databases indexed by Entrez, including Entrez Gene, can be searched via a single query string.
  Supports the Boolean operators AND, OR, NOT
  Supports search term tags that limit the search to particular fields (title, organism, etc.)
Sample query
  transporter[title] AND ("Homo sapiens"[organism] OR "Mus musculus"[organism])
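The sample query above can also be assembled programmatically, which helps when the same pattern is reused for many terms. The following is a minimal sketch; the helper names `tagged`, `any_of`, and `all_of` are hypothetical and not part of any NCBI library, and the quoting rule (quote only multi-word terms) is an assumption that matches the sample query.

```python
def tagged(term: str, field: str) -> str:
    """Limit a term to one Entrez field, quoting multi-word terms."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field}]"


def any_of(*terms: str) -> str:
    """Combine terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Combine terms with AND."""
    return " AND ".join(terms)


# Rebuild the sample query from the slide.
query = all_of(
    tagged("transporter", "title"),
    any_of(
        tagged("Homo sapiens", "organism"),
        tagged("Mus musculus", "organism"),
    ),
)
print(query)
```

The printed string can be pasted directly into the Entrez Gene search box.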

Entrez Gene: search result
[Screenshot of the search result page; labeled areas: Display Setting, Help, Advanced search, Filtering, Summary record, Related data]

Entrez Gene: Gene record (I)
Each Gene record integrates multiple types of information:
  Gene type: tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, protein-coding, pseudo, other, and unknown
  Nomenclature, summary descriptions, accessions of gene-specific and gene product-specific sequences, chromosomal location, reports of pathways and protein interactions, associated markers and phenotypes
  Links to other databases at NCBI, including literature citations, sequences, variations, and homologs
  Links to databases outside of NCBI

Entrez Gene: Gene record (II)
[Screenshot of the Gene record page http://www.ncbi.nlm.nih.gov/gene/7157; labeled areas: New search, Export, Expand, Help]

Entrez Gene: advanced ways of accessing
FTP download
  ftp://ftp.ncbi.nlm.nih.gov/gene/readme
E-Utilities (Entrez Programming Utilities)
  Server-side programs that provide a stable interface into the Entrez query and database system.
  They use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
  They work with any computer language that can send a URL to the E-utilities server and interpret the XML response, e.g. Perl, Python, Java, and C++.
  Combining E-utilities components to form customized data pipelines within these applications is a powerful approach to data manipulation.
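To make the "fixed URL syntax" and "XML response" concrete, here is a minimal sketch that builds an ESearch request URL for the Gene database and parses a response. It deliberately does not contact NCBI: `SAMPLE_RESPONSE` is a hand-written stand-in that follows the ESearch layout and reuses the two GeneIDs quoted earlier, so it is not a real search result. The function names are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL; fetching it returns matching record IDs as XML."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text: str) -> Tuple[int, List[str]]:
    """Extract the total hit count and the returned IDs from ESearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


# Hand-written sample in the ESearch layout (illustrative only, not a live result).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>7157</Id><Id>22059</Id></IdList>
</eSearchResult>"""

url = esearch_url("gene", "kinase[title] AND mouse[organism] AND 1[Chromosome]")
print(url)
print(parse_esearch(SAMPLE_RESPONSE))
```

In a real pipeline the URL would be fetched, for example with `urllib.request.urlopen`, and the response body passed to `parse_esearch`; NCBI's usage guidelines on request rates apply.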

Entrez Gene: documentation and publications
  http://www.ncbi.nlm.nih.gov/books/nbk3841/
  Maglott et al. NAR, 39:D52-D57, 2011

Entrez Gene: exercise
Questions
  1. How many records do we get for a simple search of kinase in Entrez Gene?
  2. Use Boolean operators and search term tags to search for mouse genes located on chromosome 1 with kinase in the title.
  3. With the default display setting, what is the first hit?
  4. Click on the first hit and identify how many publications in PubMed are associated with this gene.
  5. Identify which proteins interact with the protein product of this gene.
Answers
  1. 244,301 records
  2. Query term: kinase[title] AND mouse[organism] AND 1[Chromosome]
  3. Epha4
  4. Bibliography section: 220 citations in PubMed
  5. Interactions section: 3 proteins, Epha4, Ngef, and Vav2

Ensembl
Genome databases for vertebrates and other selected eukaryotic species
Automated annotation system at EBI
Data stored in a relational database
Updated periodically with versions
Unique gene identifier
  Ensembl uses unique strings (Ensembl gene IDs) as stable identifiers for genes, e.g. the Ensembl gene stable ID for human tumor protein p53 (TP53) is ENSG00000141510.
  The Ensembl gene ID assigned to each record is species specific, e.g. the Ensembl gene stable ID for the mouse ortholog of TP53 (Trp53) is ENSMUSG00000059552.
  Clear gene, transcript, and protein relationship, e.g. ENSG00000141510 => 17 transcripts (e.g. ENST00000445888) => 13 proteins (e.g. ENSP00000391478)
Statistics as of February 2011 (version 61)
  55 species
  53,630 genes for human
Other species are available in the recently expanded system EnsemblGenomes, http://www.ensemblgenomes.org
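The example IDs above share a regular layout: the prefix ENS, an optional species code (none for human, MUS for mouse), one letter for the feature type (G, T, or P), and an 11-digit number. The sketch below splits an ID along those lines. It is a simplified illustration: it covers only the three feature types and the two species shown on the slide, and the function name `describe` is hypothetical.

```python
import re

# ENS + optional species code + feature letter + 11 digits
STABLE_ID = re.compile(r"^ENS(?P<species>[A-Z]*)(?P<kind>[GTP])(?P<number>\d{11})$")

KINDS = {"G": "gene", "T": "transcript", "P": "protein"}
SPECIES = {"": "human", "MUS": "mouse"}  # only the two species used on the slide


def describe(stable_id: str):
    """Return (feature type, species) for an Ensembl stable ID."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id}")
    prefix = match.group("species")
    # Unknown species codes are returned unchanged rather than guessed.
    return KINDS[match.group("kind")], SPECIES.get(prefix, prefix)


for example in ("ENSG00000141510", "ENSMUSG00000059552",
                "ENST00000445888", "ENSP00000391478"):
    print(example, describe(example))
```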

BioMart: a batch information retrieval system
BioMart is a query-oriented data management system.
  Batch information retrieval for complex queries
  Particularly suited for providing 'data mining' like searches of complex descriptive data such as those related to genes and proteins
  Open source and can be customized
Originally developed for the Ensembl genome databases
Adopted by many other projects including UniProt, InterPro, Reactome, Pancreatic Expression Database, and many others (see a complete list and get access to the tools from http://www.biomart.org/)

BioMart: basic concepts
Three concepts: Dataset, Filter, Attribute
Example request, from Prof. Kevin Schey (Biochemistry): "I've attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We've included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?"
As a BioMart query: from all human genes (dataset), select those with the listed UniProt IDs (filter), and retrieve their GO annotations (attribute).
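The dataset, filter, and attribute triple maps directly onto the XML query that BioMart servers accept. The sketch below builds such a query for the request above. The internal names used here (`hsapiens_gene_ensembl`, `uniprotswissprot`, `go_id`, `name_1006`) are assumptions based on recent Ensembl releases and may differ in other versions, including release 61; the accession P04637 (human p53) is only a stand-in for the IDs in the spreadsheet. The function name `biomart_query` is hypothetical.

```python
import xml.etree.ElementTree as ET


def biomart_query(dataset, filters, attributes):
    """Serialise a dataset/filter/attribute selection as a BioMart XML query."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # A filter restricts which rows are returned; list values are comma-joined.
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    # An attribute is one output column.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


xml_query = biomart_query(
    dataset="hsapiens_gene_ensembl",            # all human genes (assumed name)
    filters={"uniprotswissprot": ["P04637"]},   # stand-in UniProt accession
    attributes=["uniprotswissprot", "go_id", "name_1006"],  # assumed names
)
print(xml_query)
```

The same XML is what the web interface produces behind its XML export button, so a query designed interactively can be saved and rerun later.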

Ensembl BioMart analysis
Choose dataset
  Choose database: Ensembl Genes 61
  Choose dataset: Homo sapiens genes (GRCh37)
Set filters
  Gene: a list of genes/proteins identified by various database IDs (e.g. IPI IDs)
  Gene Ontology: filter for proteins with specific GO terms (e.g. cell cycle)
  Protein domains: filter for proteins with specific protein domains (e.g. SH2 domain)
  Region: filter for genes in a specific chromosome region (e.g. chr1 1:1000000 or 11q13)
  Others
Select output attributes
  Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
  External data: Gene Ontology, IDs in other databases
  Expression: anatomical system, development stage, cell type, pathology
  Protein domains: SMART, PFAM, InterPro, etc.

Ensembl BioMart: query interface
[Screenshot of the query interface; labeled areas: Count, Results, Help, Perl API, Choose dataset, Set filters, Select output attributes]

Ensembl BioMart: sample output
[Screenshot of the results page; labeled area: Export all results to a file]

Ensembl BioMart: documentation and publications
  http://www.ensembl.org/info/website/tutorials/index.html
  Smedley et al. BMC Genomics, 10:22, 2009

Ensembl BioMart analysis: exercise 1
Question
  I have two Ensembl gene IDs, ENSG00000162367 and ENSG00000187048. How do I get their gene names from HGNC, IDs from EntrezGene, and any probes that contain these gene sequences from the Affymetrix microarray platform HC G110?
Choose dataset
  Database: Ensembl Genes 61
  Dataset: Homo sapiens genes (GRCh37.p2)
Set filters
  Under GENE: check the 'ID list limit' box
  Select header 'Ensembl Gene IDs' and enter the gene IDs into the box
Select output attributes
  Select Features (default)
  Under EXTERNAL: External References, select 'HGNC Symbol' and 'EntrezGene ID'
  Under EXTERNAL: Microarray, select 'Affy HC G110'
Click on Count and then Results
Export all results to File, TSV

Ensembl BioMart analysis: exercise 2
Question
  How can I get the 2kb upstream sequences for all mouse genes on chromosome 1?
Choose dataset
  Database: Ensembl Genes 61
  Dataset: Mus musculus genes (NCBIM37)
Set filters
  Under REGION: check Chromosome, select 1
Select output attributes
  Select Sequences
  Under SEQUENCES: select Flank (Gene)
  Under Upstream flank: check and enter 2000 into the box
  Under Header Information, Gene Information: check Description
Click on Count (1916/36817) and then Results
Export all results to File, FASTA format
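What "upstream" means depends on the strand of the gene, which is easy to get wrong when computing flanks by hand. The sketch below shows the coordinate arithmetic under the usual Ensembl conventions (1-based inclusive coordinates, start less than end, strand given as 1 or -1). It illustrates the idea only and is not BioMart's own implementation; the function name and the example coordinates are hypothetical, and clipping at the far end of the chromosome is left out.

```python
def upstream_flank(start: int, end: int, strand: int, length: int = 2000):
    """Return 1-based inclusive (flank_start, flank_end) upstream of a gene."""
    if strand == 1:
        # Forward strand: upstream lies before the gene start.
        return max(1, start - length), start - 1
    if strand == -1:
        # Reverse strand: upstream lies after the gene end on the reference;
        # the extracted sequence would then be reverse-complemented.
        return end + 1, end + length
    raise ValueError("strand must be 1 or -1")


# Hypothetical gene spanning 10,000 to 12,500 on either strand.
print(upstream_flank(10_000, 12_500, 1))
print(upstream_flank(10_000, 12_500, -1))
```

Both calls return a window of exactly 2000 bases, on opposite sides of the gene.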

Summary
 | Entrez Gene | Ensembl BioMart
---|---|---
URL | http://www.ncbi.nlm.nih.gov/gene | http://www.ensembl.org/biomart/martview
Host | NCBI/NIH | EMBL-EBI and Sanger Institute
Coverage | All completely sequenced genomes | Mainly vertebrates
Data storage | Flat files | Relational database
Updates | Continuous | Periodic, with versions
Unique gene identifier | Entrez Gene ID | Ensembl Gene ID
Query system | Entrez | BioMart
Output | One gene at a time | Multiple genes at the same time