BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Web resources:
NCBI database: http://www.ncbi.nlm.nih.gov/
Ensembl database: http://useast.ensembl.org/index.html
UCSC Genome browser: http://genome.ucsc.edu/
Exercise 1 homepage: http://biochem.slu.edu/bchm628/exercise1.html

Goals: Learn how to efficiently navigate the NCBI, EBI-Ensembl, and UCSC Genome browsers to find information on specific genes.

NOTE: RefSeq refers to records that have been reviewed by the NCBI curation staff. The RefSeq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated RefSeq records have the nomenclature NM_#### for mRNA and NP_#### for protein records. Other designations are described in the PDF file RefseqNomenclature.pdf available from the Exercise 1 homepage.

Conduct text based searches of NCBI and Ensembl

a) Search the NCBI Gene database using the query term: p53 AND human. The AND tells it to search for both p53 and human in every field.
b) Change the search query to: p53 AND human[organism], or use the Advanced option to create the same query. This tells the search algorithm that you are searching specifically for the species human in the Organism field of the database.
c) Search the Ensembl database for the human gene encoding p53. Change the dropdown menu to human, type p53 in the search box and click GO.

The first thing you should note is that there are many matches to the query p53. There are several reasons for this:
1. You are searching every field and not just the gene name.
2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene name, and there are several different aliases for this gene.
3. The p53 protein interacts with >100 other proteins, so there is a lot of literature that mentions this protein, and thus the name will appear in the records of many other genes.

So how do you get around this? You can try searching for different aliases. You can look through the first few records and see if you can determine what the official gene symbol is. You can search the literature for other aliases. In this case, from your search of the NCBI Gene database in either a) or b), the top hit is the gene with the symbol TP53, which is the correct symbol. Read through the summary and you'll note that the official gene name is Tumor Protein p53 and that it is involved in numerous cellular processes related to gene regulation. You should also note that p53 is one of the listed aliases.

BCHM 6280 2017 NCBI & Ensembl Tutorial Page 1 of 5
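The difference between the queries in a) and b) can also be expressed programmatically. The following is a minimal sketch, assuming you want to send the same query to the NCBI Gene database through the E-utilities esearch endpoint rather than the web form. The helper names build_gene_query and build_esearch_url and the retmax value are illustrative choices, not part of the exercise; the code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (db=gene selects the Gene database).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(term, organism=None):
    """Return the search term, optionally restricted to the Organism field."""
    if organism is None:
        return term
    return f"{term} AND {organism}[organism]"


def build_esearch_url(query, retmax=20):
    """Build (but do not send) an esearch URL for the Gene database."""
    params = {"db": "gene", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Query from step a): both words are searched in every field.
all_fields_query = build_gene_query("p53 AND human")

# Query from step b): "human" must match the Organism field.
organism_query = build_gene_query("p53", organism="human")

print(all_fields_query)
print(organism_query)
print(build_esearch_url(organism_query))
```

Pasting the printed URL into a browser should return an XML list of Gene IDs; replacing p53 with the official symbol TP53 narrows the hits in the same way it does in the web interface.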

Search the Ensembl human genome with the query p53. How many results? Now restrict the results to Genes, which should reduce the list to ~443 records. However, I did not find TP53 within the first few pages. Change the search to TP53, restricted to human and Genes, and it should come up as the top record.

Central to this course is dealing with lists of genes. For this reason, we will use the official gene symbols and specific database IDs. If you have to find the official gene symbol for more than about 10 genes, you will quickly see the value of using gene identifiers that are universally recognized. You will also learn to value literature that references genes by their official symbols. Unfortunately, this is not a universal practice.

Finding transcript information about a specific gene using NCBI & Ensembl

Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several experimental techniques, including tissue-specific RNA-seq, which provides direct support for expression of exons. The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature identifies those sequences that are considered Reference: NG_ (genomic), NM_ (mRNA) and NP_ (protein). There is a PDF on the Exercise 1 homepage that describes all of the RefSeq nomenclature. Note that some records are listed as XM or XP, which indicates predicted transcripts or proteins with little or no experimental evidence for them.

Ensembl has two gene curation pipelines (VEGA & HAVANA), and when the two pipelines are combined, the annotation is known as GENCODE. On the gene-specific pages, the transcripts are identified by whether they are protein coding or not. There is also a visual for splice variants that matches the known domains in the gene with the different transcripts.
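When working with a list of RefSeq accessions, the prefixes used in this tutorial can be sorted automatically. The sketch below is illustrative only: the function name classify_refseq is made up for this example, the table covers just the prefixes this tutorial names (the RefseqNomenclature.pdf file lists more), and the accessions in the usage lines are placeholders rather than real records.

```python
# Prefix -> (molecule type, curation status), limited to the prefixes
# mentioned in this tutorial. XR_ marks predicted non-coding RNA models.
REFSEQ_PREFIXES = {
    "NG_": ("genomic", "curated"),
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "predicted"),
    "XP_": ("protein", "predicted"),
    "XR_": ("non-coding RNA", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession, or None if unknown."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


# Placeholder accessions (not real records), used only to show the lookup.
for acc in ["NM_000000", "XP_000000", "ENST00000000000"]:
    print(acc, classify_refseq(acc))
```

Returning None for an unrecognized prefix (such as an Ensembl transcript ID) keeps the lookup from silently mislabeling identifiers that come from another database.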
Ensembl makes it easy to export an Excel-compatible transcript table and usually identifies which of its transcripts have a corresponding RefSeq transcript match.

a) Within the NCBI gene record for the TP53 gene there are 2 sections that provide transcript/protein information: "Genomic regions, transcripts, and products" and "NCBI Reference Sequences (RefSeq)". Export a PDF from the Genomic regions section. Here, genes are color coded (green for protein coding, blue for non-coding). It also lists gene models (XR or XM). RefSeq transcripts/proteins starting with X represent computational models without experimental verification. An example is provided on the Exercise 1 homepage.
b) Within the Ensembl gene record for TP53, find the transcript table. Here you can export the entire table in CSV format and then import it into Excel. An example is provided on the Exercise 1 homepage.

NOTE: The Ensembl site generally makes it easier to deal with lists of genes (both importing and exporting). The NCBI site has better cross-database functionality and is better integrated with the literature.

You should note several things about these transcript searches:

1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data at the gene (~25,000) or transcript (~160,000) level.
2. The transcript variants differ between Ensembl and NCBI, though Ensembl kindly lists those that are in common between the two sites.
3. Ensembl makes it easy to distinguish between transcripts that are protein coding or not, and also between transcripts with good experimental evidence versus computationally predicted transcripts.

Exploring the genomic context of genes using Ensembl and UCSC Genome browser

The genomic context means where on the genome the gene is located. That is: which chromosome, where on that chromosome, what strand, and what genes are upstream/downstream.

Genome browsers offer a way to visualize data that can be placed on a chromosome. These data are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as: location of repetitive sequences; level of homology to other genomes; SNPs or variants within the genome of interest; and TF binding sites.

The data behind a genome browser are enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources. We will use both the UCSC and Ensembl genome browsers for this exercise. Both allow you to export images of the browser window and offer links to download sequence data.

Ensembl genome browser

To access the Ensembl genome browser, click on the Location tab (which should have a title: Location: 17:7,661,779-7,687,550). This indicates that this gene is located on Chromosome 17 between the coordinates 7,661,779-7,687,550. The first section shows a schematic of the chromosome with a red box around the coordinates of the gene (Fig. 1). If you click on the Assembly Exceptions link, you can turn off that track and are left with just the box highlighting the gene.

Figure 1: Chromosome ideogram of chr 17 with the region for TP53 shown as a red box
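The Location title packs the chromosome and both coordinates into one string. As a small sketch of how that string breaks down (the function name parse_location is an illustrative choice, and the length calculation assumes the coordinates are 1-based and inclusive), the following splits it into its parts and computes the size of the region:

```python
import re


def parse_location(location):
    """Split a 'chrom:start-end' string into (chrom, start, end).

    Commas in the coordinates and an optional 'chr' prefix are accepted.
    """
    match = re.fullmatch(r"(?:chr)?([\w.]+):([\d,]+)-([\d,]+)", location.strip())
    if match is None:
        raise ValueError(f"Unrecognized location string: {location!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("Start coordinate is larger than end coordinate")
    return chrom, start, end


chrom, start, end = parse_location("17:7,661,779-7,687,550")

# Assuming 1-based inclusive coordinates, both endpoints are counted.
length = end - start + 1

print(chrom, start, end, length)  # 17 7661779 7687550 25772
```

The result, roughly 25.8 kb, matches the size of the region shown for this gene in the browser.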

Scroll down to the next section and you'll see the chromosome region in more detail, with the TP53 gene in the middle. This gives you an idea of the genomic context of the gene of interest. Scroll down to the next section, which displays the 25 Kb region that encompasses the largest transcript isoform of the gene. You can see all the different splice variants. They are color coded by experimental support and by whether they are protein coding or not. Click on one of the transcripts and it will open a pop-up window with additional details about that transcript. You can right-click on the links within the pop-up window to open a link in a new tab or window. Click on the X to close the window.

Scroll down further and you will see additional tracks of information, such as SNP locations, associated phenotypes and %GC. These tracks can be expanded and turned on and off. It can take a while for the changes to be implemented, depending on how long a chromosomal region you are working with and how much data is in the track. If you scroll back to the top of this section, you can zoom in or out. Sometimes tracks won't expand because the section you are viewing is large enough that there would be too much information to display. If you tried expanding a track and nothing happened, try zooming in so that you are displaying <10 Kb of sequence. That will usually allow any track to be expanded. Figure 2 shows a portion of the TP53 transcript with an expanded track of SNPs.

Figure 2: Part of the TP53 transcript variants with expanded SNPs below.
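If you prefer to type a region into the browser's location box instead of clicking the zoom controls, the coordinates of a window under 10 Kb can be worked out by hand or with a few lines of code. This is a rough sketch only: the function name centered_window and the 9,999 bp default are arbitrary choices for illustration, and centering on the middle of the gene is just one reasonable option.

```python
def centered_window(start, end, max_width=9_999):
    """Return (new_start, new_end) spanning at most max_width bases,
    centered on the midpoint of start..end (1-based, inclusive)."""
    width = end - start + 1
    if width <= max_width:
        # The region is already small enough; leave it unchanged.
        return start, end
    midpoint = (start + end) // 2
    new_start = midpoint - max_width // 2
    new_end = new_start + max_width - 1
    return new_start, new_end


# Coordinates of the TP53 region from the Ensembl Location tab.
new_start, new_end = centered_window(7_661_779, 7_687_550)
print(f"17:{new_start}-{new_end}")
```

Pasting the printed region into the location box should display a window narrow enough for most tracks to expand.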

Using the UCSC Genome browser

Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38, or click the link and it will open a search window with the latest human assembly as the default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 3.

Figure 3: UCSC view of TP53

The gene size and the coordinates of where this gene falls on Chr 17 should be very similar, if not identical, to the coordinates listed for the Ensembl browser. Scroll down through the graphics. Clicking on the graphic or on the name of a track will pop open a window with information about the track. Click on any single transcript to see details about the transcript.

A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):
1) What genes are located near it or may share promoters?
2) What SNPs are found in my gene and are they located in introns, promoters or exons?
3) What strand is my gene encoded on?
4) What regulatory elements are located within or near my gene?
5) What clinical variants are associated with my gene?

Spend some time exploring the tracks and looking up what they represent and how the data is presented. You may find some of the information pertinent to your research project.
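Question 2 above comes down to comparing a variant position against the exon coordinates of a transcript, which is what the browser does visually. The following is a simplified sketch of that comparison. The function name classify_position is illustrative, the coordinates are made-up toy values (not TP53 coordinates), and promoters are left out because their boundaries are not defined by the transcript alone.

```python
def classify_position(pos, gene_start, gene_end, exons):
    """Classify a 1-based position as 'exon', 'intron' or 'outside gene'.

    exons is a list of (start, end) tuples with inclusive coordinates.
    """
    if not gene_start <= pos <= gene_end:
        return "outside gene"
    for exon_start, exon_end in exons:
        if exon_start <= pos <= exon_end:
            return "exon"
    # Inside the gene boundaries but not inside any exon.
    return "intron"


# Toy gene model with three exons (hypothetical coordinates).
toy_gene = {"start": 100, "end": 1000, "exons": [(100, 200), (400, 550), (900, 1000)]}

for snp_pos in (150, 300, 1200):
    label = classify_position(snp_pos, toy_gene["start"], toy_gene["end"], toy_gene["exons"])
    print(snp_pos, label)
```

For a real gene, the exon coordinates would come from a specific transcript, so the same SNP can be exonic in one isoform and intronic in another.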