Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers


Web resources:
NCBI database:
Ensembl database:
UCSC Genome browser:
Exercise 1 homepage:

Goals: Learn how to efficiently navigate the NCBI, EBI-Ensembl, and UCSC Genome browsers to find information on specific genes.

Background on nomenclature: RefSeq refers to records that have been reviewed by the NCBI curation staff. The RefSeq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated RefSeq records have the nomenclature NM_#### for mRNA and NP_#### for protein records. Other designations are described in the PDF file RefseqNomenclature.pdf available from the Exercise 1 homepage.

Conduct text-based searches of NCBI and Ensembl

a) Search the NCBI Gene database using the query term: p53 AND human. The AND tells it to search for both p53 and human in every field. How many results were returned? [9227 in April]

b) Change the search query to: p53 AND human[organism] or use the Advanced option to create the same query. This tells the search algorithm that you are searching specifically for the species human in the Organism field of the database. How many results were returned? [2092 in April]

c) Search the Ensembl database for the human gene encoding p53. Change the dropdown menu to human, type p53 in the search box and click GO. How many results were returned? [6331 in April]

Why so many results?
1. By default, every field in the record is searched, not just the gene name.
2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene symbol, and there are several different aliases for this gene.
3. The p53 protein interacts with >100 other proteins, so a large body of literature mentions this protein, and the name therefore appears in the records of many other genes.

So how do you get around this? You can try searching for different aliases.
You can look through the first few records and see if you can determine what the official gene symbol is. You can search the literature for other aliases.
BCHM NCBI & Ensembl Tutorial Page 1 of 6
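The searches in a) and b) can also be reproduced outside the web page. The sketch below is not part of the original exercise: it assumes the NCBI E-utilities `esearch` endpoint and its `db`, `term`, `retmode` and `retmax` parameters, and the helper names `build_gene_search_url` and `count_hits` are made up for illustration. It only builds the two query URLs; calling `count_hits` needs network access, and the counts it returns will differ from the April numbers quoted above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities esearch endpoint (not given in the tutorial).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(term: str) -> str:
    """Build an esearch URL that queries the Gene database for `term`."""
    params = {
        "db": "gene",       # search the Gene database, as in steps a) and b)
        "term": term,       # same query string you would type in the search box
        "retmode": "json",  # ask for a JSON response
        "retmax": 0,        # we only want the hit count, not the ID list
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def count_hits(term: str) -> int:
    """Return the number of records matching `term` (requires network access)."""
    with urlopen(build_gene_search_url(term)) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


# Query a): every field is searched for both words.
broad_url = build_gene_search_url("p53 AND human")
# Query b): 'human' is restricted to the Organism field.
restricted_url = build_gene_search_url("p53 AND human[organism]")

print(broad_url)
print(restricted_url)
```

Comparing `count_hits("p53 AND human")` with `count_hits("p53 AND human[organism]")` should show the same narrowing effect as the field restriction in step b).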

In this case, from your search of the NCBI Gene database in either a) or b), the top hit is the gene with the symbol TP53, which is the officially recognized gene symbol. Read through the summary and you'll note that the official gene name is Tumor Protein p53 and that it takes part in numerous cellular processes related to gene regulation. You should also note that p53 is one of the listed aliases. Search the Ensembl database using the search term TP53 restricted to human and Genes, and it should come up as the top record.

In this course, you will be analyzing lists of genes. For this reason, we will use the official gene symbols and/or specific database IDs. If you have to find the official gene symbol for more than about 10 genes, you will quickly see the value of using gene identifiers that are universally recognized. You will also learn to value literature that references genes by their official symbols. Unfortunately, this is not a universal practice.

Finding transcript information about a specific gene using NCBI & Ensembl

Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several computational and experimental techniques, including tissue-specific RNA-seq, which provides direct support for expression of exons. The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature identifies those sequences that are considered Reference (NG_ (genomic), NM_ (mRNA) and NP_ (protein)) versus those with only computational support (XM_ or XP_). There is a PDF on the Exercise 1 homepage that describes the RefSeq nomenclature.

Figure 1: Transcripts for TP53 in NCBI gene record Genomics section

Ensembl draws on two gene annotation sources: its own automated pipeline and manual curation by the HAVANA group (historically displayed in the Vega browser). When the two are combined, the annotation is known as GENCODE.
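The RefSeq prefixes described above are easy to check programmatically when you are working through a list of accessions. The following is a minimal sketch covering only the five prefixes named in this tutorial; the function name `classify_refseq` is made up for illustration, and the full set of prefixes is in the RefSeq nomenclature PDF. NM_000546 is used as an example mRNA accession for TP53, and the XP_ accession is an invented placeholder.

```python
from typing import Optional, Tuple

# Only the prefixes mentioned in this tutorial; RefSeq defines more.
REFSEQ_PREFIXES = {
    "NG_": ("genomic", "curated"),
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "computational"),
    "XP_": ("protein", "computational"),
}


def classify_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule type, level of support) for a RefSeq accession.

    Returns None when the prefix is not one of those listed above,
    for example for an Ensembl transcript ID.
    """
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


for acc in ["NM_000546.6", "XP_123456789.1", "ENST00000269305"]:
    print(acc, classify_refseq(acc))
```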
On the gene-specific pages, the transcripts are identified by whether they are protein coding or not. There is also a visual for splice variants that matches the known domains in the gene with the different transcripts. Ensembl also makes it easy to export an Excel-compatible transcript table and identifies which of its transcripts have a corresponding RefSeq transcript.

a) Within the NCBI Gene record for the TP53 gene there are 2 sections that provide transcript/protein information: 1) Genomic regions, transcripts and products (Fig. 1) and 2) NCBI Reference Sequences (RefSeq), as shown in Fig. 2. In Fig. 1, note the menu bar above the chromosome ruler. In that bar, you will see a button with 3 colors on it. Click on that to change the view to show the proteins associated with each transcript. As you scroll further down in this section, you should see another set of transcripts corresponding to the Ensembl transcripts. You can turn off tracks by clicking the red X in the top right of the track. In the drop-down Tools menu, there is an option to create a PDF of the graphic. Do that and save the PDF. These PDFs can be embedded in your exercise report.

Figure 2: List of transcripts in RefSeq section of the TP53 Gene record

b) Click on the Gene: TP53 tab in Ensembl. Near the top of the record, you will see a button labeled Show transcript table. Click on that and it will expand to show a list of the annotated and predicted transcripts. From here you can export the entire table in CSV format and then import it into Excel. There is color coding to indicate which annotation pipeline the transcripts came from.

Figure 3: Ensembl transcript table for TP53 gene

NOTE: The Ensembl site generally makes it easier to deal with lists of genes (both importing and exporting). The NCBI site has better cross-database functionality and is better integrated with the literature.

You should note several things about these transcript searches:
1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data on a gene (~25,000) or transcript (~160,000) level.
2. The transcript variants differ between Ensembl and NCBI, though Ensembl helpfully lists those that the two sites have in common.

Exploring the genomic context of genes using Ensembl and UCSC Genome browser

The genomic context means where on the genome the gene is located. That is:
Which chromosome
Where on that chromosome
What strand
What genes are upstream/downstream

Genome browsers offer a way to visualize data that can be placed on a chromosome.
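The four pieces of genomic context listed above (chromosome, position, strand, neighbors) can be held in a small record. The sketch below is illustrative only: the `Gene` class, the `neighbors` function and the three toy genes with their coordinates are all invented, not real annotation. It shows one point that is easy to miss in a browser: for a gene on the reverse strand, "upstream" lies at higher coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int   # lower coordinate on the chromosome
    end: int     # higher coordinate on the chromosome
    strand: int  # +1 = forward strand, -1 = reverse strand


def neighbors(target: Gene, genes: List[Gene]) -> Dict[str, Optional[Gene]]:
    """Find the nearest non-overlapping gene on each side of `target`."""
    same_chrom = [g for g in genes if g.chrom == target.chrom and g != target]
    lower = [g for g in same_chrom if g.end < target.start]
    higher = [g for g in same_chrom if g.start > target.end]
    nearest_lower = max(lower, key=lambda g: g.end, default=None)
    nearest_higher = min(higher, key=lambda g: g.start, default=None)
    # Upstream means toward the 5' end of the gene, which depends on strand.
    if target.strand == 1:
        return {"upstream": nearest_lower, "downstream": nearest_higher}
    return {"upstream": nearest_higher, "downstream": nearest_lower}


# Toy data with made-up names and coordinates.
toy_genes = [
    Gene("GENE_A", "1", 1_000, 2_000, 1),
    Gene("GENE_B", "1", 5_000, 8_000, -1),
    Gene("GENE_C", "1", 10_000, 12_000, 1),
]

context = neighbors(toy_genes[1], toy_genes)
print("upstream:", context["upstream"].symbol)
print("downstream:", context["downstream"].symbol)
```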
The data in a genome browser are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as:
Location of repetitive sequences
Level of homology to other genomes
SNPs or variants within the genome of interest
TF binding sites

The data behind a genome browser are enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources. We will use both the UCSC and Ensembl genome browsers for this exercise. Both allow you to export images of the browser window and offer links to download sequence data.

Ensembl genome browser

To access the Ensembl genome browser, click on the Location tab, which should have the title Location: 17:7,661,779-7,687,550. This indicates that the TP53 gene is located on chromosome 17 between coordinates 7,661,779 and 7,687,550. The first section shows a schematic of the chromosome with a red box around the coordinates of the gene (Fig. 4). If you click on the Assembly Exceptions link, you can turn off that track and are left with just the box highlighting the gene.

Figure 4: Chromosome ideogram of chr 17 with the region for TP53 shown as a red box

Scroll down to the next section and you'll see the chromosome region in more detail, with the TP53 gene in the middle. This gives you an idea of the genomic context of the gene of interest. Scroll down to the next section and this will display the ~25 Kb region that encompasses the largest transcript isoform of the gene. Here you can observe all splice variants. They are color coded by experimental support and by whether they are protein coding or not. Click on one of the transcripts and it will open a pop-up window with additional details about that transcript. You can right-click on the links within the pop-up window to open a link in a new tab or window. Click on the X to close the window.

Scroll down further and you will see additional tracks of information, such as SNP locations, associated phenotypes and %GC. These tracks can be expanded and turned on and off. It can take a while for the changes to be implemented, depending on how long a chromosomal region you are working with and how much data is in the track.

Figure 5: Part of the TP53 transcript variants with expanded SNPs below.
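The title of the Location tab packs the chromosome and both coordinates into one string. As a small worked example, the sketch below parses that string and computes the length of the region, which is where the "~25 Kb" figure comes from. The function names are made up for illustration, and the length calculation assumes the coordinates are 1-based and inclusive.

```python
import re
from typing import Tuple

# Accepts '17:7,661,779-7,687,550' and also 'chr17:7661779-7687550'.
LOCATION_RE = re.compile(r"^(?:chr)?(\w+):([\d,]+)-([\d,]+)$")


def parse_location(text: str) -> Tuple[str, int, int]:
    """Split a browser location string into (chromosome, start, end)."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Not a location string: {text!r}")
    chrom, start_text, end_text = match.groups()
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError("Start coordinate is larger than end coordinate")
    return chrom, start, end


def region_length(start: int, end: int) -> int:
    """Length in bases, assuming 1-based inclusive coordinates."""
    return end - start + 1


chrom, start, end = parse_location("17:7,661,779-7,687,550")
length = region_length(start, end)
print(f"chromosome {chrom}, {length:,} bp (~{length / 1000:.0f} Kb)")
```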

If you scroll back to the top of this section, you can zoom in or out. Sometimes tracks won't expand because you are viewing a section so large that there would be too much information to display. If you tried expanding a track and nothing happened, try zooming in so that you are displaying <10 Kb of sequence. That will usually allow any track to be expanded. Figure 5 shows a portion of the TP53 transcripts with an expanded track of SNPs. A quick look at the variant legend below the SNPs provides the meaning of the color coding. Click on an individual SNP name (rs######) and it opens a pop-up window (Fig. 6) with a brief description and links to open a Variant tab that will provide yet more information about the variant.

This exercise should demonstrate the vast amount of genetic and sequence information that is available for human. There are browsers for other organisms, but the amount of data available varies, depending on the level of interest in a particular organism.

Using the UCSC Genome browser

Open the UCSC genome browser from the link in this document or from the Exercise 1 homepage. Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38, or click the link and it will open a search window with the latest human assembly as the default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 7. The gene size and the coordinates of where this gene falls on Chr 17 should be very similar, if not identical, to the coordinates listed in the Ensembl browser. Scroll down through the graphics. Clicking on the graphic or on the name of a track will open a window with information about the track. Click on any single transcript to see details about the transcript.

Figure 7: UCSC view of TP53
Figure 6: SNP pop-up window in Ensembl

A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):

1) What genes are located near my gene or may share promoters with it?
2) What SNPs are found in my gene and are they located in introns, promoters or exons?
3) What strand is my gene encoded on?
4) What regulatory elements are located within or near my gene?
5) What clinical variants are associated with my gene?

A relatively new default track at UCSC is the gene expression data in different tissues from the NIH Genotype-Tissue Expression (GTEx) project. This project was created to establish a sample and data resource for studies on the relationship between genetic variation and gene expression in multiple human tissues. This track shows median gene expression levels in 51 tissues and 2 cell lines, based on RNA-seq data from the GTEx midpoint milestone data release (V6, October 2015). This release is based on data from 8555 tissue samples obtained from 570 adult post-mortem individuals.

Take-home points:
1. There is a LOT of biomolecular data and much of it is shared between databases.
2. The various web tools/interfaces provide different approaches to viewing and interacting with this data. There is likely to be more than one way to answer whatever questions you might have. The most important thing is to document what tools you used, when you accessed them and, for genomic data, what assembly/version of the genome you accessed.
3. This tutorial provides only a sliver of the types of searches and questions you can ask. Take your time going through it and spend some extra time exploring what other features are there. All of the main websites will have readily available information to explain what you are looking at.
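One possible way to follow take-home point 2 is to keep a small machine-readable note next to each result you save. The sketch below is only a suggestion, not a course requirement: the `AccessRecord` fields and helper names are made up, and the access date defaults to the day the code is run.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class AccessRecord:
    tool: str      # which website or browser you used
    assembly: str  # genome assembly/version, for genomic data
    query: str     # what you searched for
    accessed: str  # access date in ISO format (YYYY-MM-DD)


def make_record(tool: str, assembly: str, query: str,
                accessed: Optional[date] = None) -> AccessRecord:
    """Create a record; the date defaults to today if none is given."""
    when = accessed or date.today()
    return AccessRecord(tool=tool, assembly=assembly, query=query,
                        accessed=when.isoformat())


def to_json(record: AccessRecord) -> str:
    """Serialize the record so it can be pasted into an exercise report."""
    return json.dumps(asdict(record), indent=2)


example = make_record("UCSC Genome Browser", "GRCh38/hg38", "TP53")
print(to_json(example))
```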