BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Web resources:
NCBI database: http://www.ncbi.nlm.nih.gov/
Ensembl database: http://useast.ensembl.org/index.html
UCSC Genome browser: http://genome.ucsc.edu/
Exercise 1 homepage: http://biochem.slu.edu/bchm628/exercise1.html

Goals: Learn how to efficiently navigate the NCBI, EBI-Ensembl, and UCSC Genome browsers to find information on specific genes.

NOTE: RefSeq refers to records that have been reviewed by the NCBI curation staff. The RefSeq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated RefSeq records have the nomenclature NM_#### for mRNA and NP_#### for protein records. Other designations are described in the PDF file RefseqNomenclature.pdf, available from the Exercise 1 homepage.

Conduct text based searches of NCBI and Ensembl

a) Search the NCBI Gene database using the query term: p53 AND human. The AND tells it to search for both p53 and human in every field.
b) Change the search query to: p53 AND human[organism], or use the Advanced option to create the same query. This tells the search algorithm that you are searching specifically for the species human in the Organism field of the database.
c) Search the Ensembl database for the human gene encoding p53. Change the dropdown menu to human, type p53 in the search box and click GO.

The first thing you should note is that there are many matches to the query p53. There are several reasons for this:

1. You are searching every field and not just the gene name.
2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene name, and there are several different aliases for this gene.
3. The p53 protein interacts with >100 other proteins, so there is a lot of literature that mentions this protein, and thus the name will appear in the records of many other genes.

So how do you get around this? You can try searching for different aliases, look through the first few records to see if you can determine the official gene symbol, or search the literature for other aliases.

In this case, from your search of the NCBI Gene database in either a) or b), the top hit is the gene with the symbol TP53, which is the correct symbol. Read through the summary and you'll note that the official gene name is Tumor Protein p53 and that it takes part in numerous cellular processes related to gene regulation. You should also note that p53 is one of the listed aliases.

BCHM 6280 2017 NCBI & Ensembl Tutorial Page 1 of 5
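The difference between the all-fields query in a) and the fielded query in b) can also be sketched in code. The example below is a minimal illustration, not part of the original exercise: it only builds query strings and request URLs for the NCBI E-utilities ESearch service and sends nothing over the network. The helper names build_gene_query and build_esearch_url are made up for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(term, organism=None):
    """Return an Entrez query, optionally restricted to the Organism field.

    Hypothetical helper for illustration only.
    """
    if organism is None:
        return term
    return f"{term} AND {organism}[organism]"


def build_esearch_url(query, db="gene", retmax=20):
    """Return an ESearch URL for the query; no request is actually made."""
    params = {"db": db, "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Step a): both words are searched in every field.
    print(build_esearch_url(build_gene_query("p53 AND human")))
    # Step b): "human" is restricted to the Organism field.
    print(build_esearch_url(build_gene_query("p53", organism="human")))
```

Pasting either printed URL into a browser should return the matching Gene IDs as JSON, which makes it easy to compare how many records each form of the query retrieves.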

Search the Ensembl human genome with the query p53. How many results? Now restrict the results to Genes, which should reduce the list to ~443 records. However, I did not find TP53 within the first few pages. Change the search to TP53, restricted to human and Genes, and it should come up as the top record.

Central to this course is dealing with lists of genes. For this reason, we will use the official gene symbols and specific database IDs. If you have to find the official gene symbol for more than about 10 genes, you will quickly see the value of using gene identifiers that are universally recognized. You will also learn to value literature that references genes by their official symbols. Unfortunately, this is not a universal practice.

Finding transcript information about a specific gene using NCBI & Ensembl

Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several experimental techniques, including tissue-specific RNA-seq, which provides direct support for expression of exons.

The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature identifies those sequences that are considered Reference: NG_ (genomic), NM_ (mRNA) and NP_ (protein). There is a PDF on the Exercise 1 homepage that describes all of the RefSeq nomenclature. Note that some records are listed as XM_ or XP_, which indicates predicted transcripts or proteins with little or no experimental evidence for them.

Ensembl gene annotation draws on two sources, its automated annotation pipeline and manual curation by the HAVANA group (historically displayed in the Vega database); when the two are merged, the annotation is known as GENCODE. On the gene-specific pages, the transcripts are identified by whether they are protein coding or not. There is also a visual for splice variants that matches the known domains in the gene with the different transcripts.
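When working with lists of accessions, the RefSeq prefixes can be turned into a small lookup. The sketch below covers only the prefixes named in this tutorial (the full nomenclature is in the PDF on the Exercise 1 homepage); the function name classify_refseq and the XM_ accession are made up for illustration.

```python
# Prefixes named in this tutorial only; this is not the full RefSeq list.
REFSEQ_PREFIXES = {
    "NG": ("genomic", "curated"),
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XR": ("non-coding RNA", "predicted"),
    "XP": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq-style accession.

    Returns None when the accession has no underscore or the prefix is not
    in the table above. Hypothetical helper for illustration only.
    """
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    # "XM_0000000.1" is a made-up accession used only to show the prefix.
    for acc in ["NM_000546.6", "XM_0000000.1", "ENST00000269305"]:
        print(acc, classify_refseq(acc))
```

An Ensembl transcript ID such as ENST00000269305 has no RefSeq prefix, so the lookup returns None rather than guessing.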
Ensembl also makes it easy to export an Excel-compatible transcript table and usually identifies which of its transcripts have a corresponding RefSeq transcript match.

a) Within the NCBI Gene record for the TP53 gene there are 2 sections that provide transcript/protein information: Genomic regions, transcripts and products, and NCBI Reference Sequences (RefSeq). Export a PDF from the Genomic regions section. Here, genes are color coded (green for protein coding, blue for non-coding). It also lists gene models (XR or XM). RefSeq transcripts/proteins starting with X represent computational models without experimental verification. An example is provided on the Exercise 1 homepage.
b) Within the Ensembl gene record for TP53, find the transcript table. Here you can export the entire table in CSV format and then import it into Excel. An example is provided on the Exercise 1 homepage.

NOTE: The Ensembl site generally makes it easier to deal with lists of genes (both importing and exporting). The NCBI site has better cross-database functionality and is better integrated with the literature.

You should note several things about these transcript searches:

1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data on a gene (~25,000) or transcript (~160,000) level.
2. The transcript variants differ between Ensembl and NCBI, though Ensembl kindly lists those that are in common between the two sites.
3. Ensembl makes it easy to distinguish between transcripts that are protein coding or not, and also between transcripts with good experimental evidence versus computationally predicted transcripts.

Exploring the genomic context of genes using Ensembl and UCSC Genome browser

The genomic context means where on the genome the gene is located. That is:

Which chromosome
Where on that chromosome
What strand
What genes are upstream/downstream

Genome browsers offer a way to visualize data that can be placed on a chromosome. These data are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as:

Location of repetitive sequences
Level of homology to other genomes
SNPs or variants within the genome of interest
TF binding sites

The data behind a genome browser are enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources. We will use both the UCSC and Ensembl genome browsers for this exercise. Both allow you to export images of the browser window and offer links to download sequence data.

Ensembl genome browser

To access the Ensembl genome browser, click on the Location tab, which should have a title like Location: 17:7,661,779-7,687,550. This indicates that the gene is located on chromosome 17 between the coordinates 7,661,779 and 7,687,550. The first section shows a schematic of the chromosome with a red box around the coordinates of the gene (Fig. 1).
If you click on the Assembly Exceptions link, you can turn off that track and are left with just the box highlighting the gene.

Figure 1: Chromosome ideogram of chr 17 with the region for TP53 shown as a red box
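The location string shown on the Location tab can be split into its parts in a few lines of code. This is a minimal sketch that assumes the chromosome:start-end format quoted in this tutorial and 1-based inclusive coordinates; the function names are made up for illustration.

```python
import re

# Matches strings such as "17:7,661,779-7,687,550" or "chr17:7661779-7687550".
LOCATION_RE = re.compile(r"^(?:chr)?(\w+):([\d,]+)-([\d,]+)$")


def parse_location(text):
    """Split a browser location string into (chromosome, start, end)."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a location string: {text!r}")
    chromosome = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start coordinate is larger than end coordinate")
    return chromosome, start, end


def region_length(start, end):
    """Length in bases, assuming both coordinates are included."""
    return end - start + 1


if __name__ == "__main__":
    chromosome, start, end = parse_location("17:7,661,779-7,687,550")
    print(chromosome, start, end, region_length(start, end))
```

For the TP53 location quoted above, the computed length is 25,772 bases, which is where the roughly 25 Kb region shown in the browser comes from.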

Scroll down to the next section and you'll see the chromosome region in more detail, with the TP53 gene in the middle. This gives you an idea of the genomic context of the gene of interest.

Scroll down to the next section and this will display the 25 Kb region that encompasses the largest transcript isoform of the gene. You can see all the different splice variants. They are color coded by experimental support and by whether they are protein coding or not. Click on one of the transcripts and it will open a pop-up window with additional details about that transcript. You can right-click on the links within the pop-up window to open a link in a new tab or window. Click on the X to close the window.

Scroll down further and you will see additional tracks of information, such as SNP locations, associated phenotypes and %GC. These tracks can be expanded and turned on and off. It can take a while for the changes to be implemented, depending on how long a chromosomal region you are working with and how much data is in the track.

If you scroll back to the top of this section, you can zoom in or out. Sometimes tracks won't expand because the section you are viewing is so large that there would be too much information to display. If you tried expanding a track and nothing happened, try zooming in so that you are displaying <10 Kb of sequence. That will usually allow any track to be expanded. Figure 2 shows a portion of the TP53 transcripts with the SNP track expanded.

Figure 2: Part of the TP53 transcript variants with expanded SNPs below.
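If you prefer to type coordinates rather than click the zoom controls, a window of under 10 Kb can be computed from the gene coordinates. The sketch below is one possible way to do it, centring the window on the middle of the region; the function name zoom_window is made up, and the 10,000-base limit is simply the rule of thumb from the paragraph above.

```python
def zoom_window(start, end, max_width=10_000):
    """Return (start, end) of a window shorter than max_width bases.

    The window is centred on the midpoint of the input region. Regions that
    are already short enough are returned unchanged. Coordinates are treated
    as 1-based and inclusive. Hypothetical helper for illustration only.
    """
    length = end - start + 1
    if length < max_width:
        return start, end
    width = max_width - 1  # strictly less than max_width
    midpoint = (start + end) // 2
    new_start = midpoint - width // 2
    new_end = new_start + width - 1
    return new_start, new_end


if __name__ == "__main__":
    # TP53 coordinates quoted in this tutorial (chromosome 17).
    window_start, window_end = zoom_window(7661779, 7687550)
    # This string can be pasted into the browser's location box.
    print(f"17:{window_start}-{window_end}")
```

The window is centred on the gene, so to inspect a particular exon or SNP you would shift the coordinates toward the feature of interest.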

Using the UCSC Genome browser

Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38, or click the link and it will open a search window with the latest human assembly as the default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 3.

Figure 3: UCSC view of TP53

The gene size and the coordinates of where this gene falls on Chr 17 should be very similar, if not identical, to the coordinates listed in the Ensembl browser. Scroll down through the graphics. Clicking on the graphic or on the name of a track will pop open a window with information about the track. Click on any single transcript to see details about the transcript.

A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):

1) What genes are located near it or may share promoters?
2) What SNPs are found in my gene and are they located in introns, promoters or exons?
3) What strand is my gene encoded on?
4) What regulatory elements are located within or near my gene?
5) What clinical variants are associated with my gene?

Spend some time exploring the tracks and looking up what they represent and how the data is presented. You may find some of the information pertinent to your research project.
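To compare the two browsers on exactly the same region, the coordinates from the Ensembl Location tab can be turned into a direct UCSC link. This is a minimal sketch that assumes the hgTracks page with its db and position URL parameters and the hg38 assembly; the function names are made up for illustration.

```python
from urllib.parse import urlencode

# UCSC Genome Browser track display page.
UCSC_TRACKS_URL = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_position(chromosome, start, end):
    """Format a position the UCSC way, e.g. chr17:7661779-7687550."""
    name = chromosome if chromosome.startswith("chr") else f"chr{chromosome}"
    return f"{name}:{start}-{end}"


def ucsc_browser_url(chromosome, start, end, db="hg38"):
    """Build a browser link for the region; nothing is downloaded here."""
    params = {"db": db, "position": ucsc_position(chromosome, start, end)}
    # Keep ":" readable in the position instead of percent-encoding it.
    return f"{UCSC_TRACKS_URL}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    # TP53 coordinates quoted for the Ensembl browser in this tutorial.
    print(ucsc_browser_url("17", 7661779, 7687550))
```

Ensembl names the chromosome 17 while UCSC names it chr17, so the helper adds the prefix when it is missing.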