Final exam: Introduction to Bioinformatics and Genomics
DUE: Friday, June 29th at 4:00 pm


Exam description: The purpose of this exam is for you to demonstrate your ability to use the different biomolecular databases and bioinformatics tools to find biological meaning in a list of genes identified in a transcriptome study. In addition, your report should be written clearly and concisely, and should resemble the results section of a manuscript. That is, your narratives should explain what you did and why, without listing each mouse click step by step. You should address the questions listed, but do not make this look like a short-answer exam. The questions I put on the exam should NOT appear in your report, except rephrased as part of your narrative.

There are five sections to the exam. The starting point is a list of significantly differentially expressed genes from a mouse neuronal stem cell line transduced with a putative oncogenic transcription factor, compared to transduction with a vector control. I analyzed the microarray data using two different methods, compared the gene lists from both, and kept the genes that were significantly differentially expressed by both methods. I then edited the list to remove redundant genes, leaving a final list of 198 genes. You will conduct a gene enrichment using gene ontology and pathway analyses, look at protein-protein interactions, and then choose one gene on which to conduct an in silico molecular/biochemical analysis. I strongly suggest that you read the paper to help you put the analyses into context. Some of the gene symbols are likely out of date given the advancement of annotation in recent years.

Reference: Zheng H., et al. PLAGL2 Regulates Wnt Signaling to Impede Differentiation in Neural Stem Cells and Gliomas. (2010) Cancer Cell 17: PMID:

To obtain full points, the write-up should include:
- Abstract/Introduction
- Gene ontology enrichment
- Pathway enrichment & PPI
- A background section on a gene of choice
- Virtual biochemical/molecular analysis of the same gene

Formatting requirements:
- No more than 7 pages of text, including figures and tables.
- All pages should be in portrait orientation.
- Use 12 point font and 0.75 inch margins.
- Figures should be embedded in the Word document and have figure legends. The legends can be in 10 point font.
- Tables should be generated in Word, not pasted in from a website or as a screen capture.
- If tables cannot be formatted to fit within ½ of a page, then you may submit them as a separate attachment. No more than 4 attachments (1 page each).
- The report should have your name as a header and page numbers in the footer.

The exam is worth a total of 140 points. The abstract/introduction is worth 10 points. The gene ontology and pathway enrichment sections are worth 30 points each. The background section on the single gene is worth 20 points. The virtual molecular analysis is worth 50 points.

BCHM Final Exam Page 1 of 5

When characterizing a list of genes, you are looking for clues as to why these particular genes show differential expression in response to some condition. I have suggested specific strategies in the sections below that should guide you in finding potentially useful information and in presenting that information in a concise, informative manner. You need to spend some time thinking about what the enriched GO categories, pathways or protein domains may tell you about this particular experimental system. You have the paper with the authors' conclusions. You can use this to guide your own thinking and provide context. You may choose to use a different strategy than the one I laid out, but you need to explain why you chose that particular path.

READ carefully. You will not get full points if you do not address all of the bullet points for a given analysis. If you cannot obtain the data, or the data isn't available to address a particular point, then you must state how you looked for it and the result, even if that result is negative. Choosing a gene without any known domains or annotation is probably not a wise strategy, as it will not give you much to work with.

Section 1: Abstract/introduction

The goal of this section is to provide context for your analyses and to briefly describe your results. In 800 words or fewer, describe the results of the paper from which the microarray data was obtained and the results of your own analyses. No figures or tables should be included in this section of your report. This section should probably be written last.

Section 2: Gene enrichment by gene ontology

Use the DAVID suite of programs to conduct enrichment using gene ontology on the genes that are UP-regulated in the list. Generate a table for your report that lists the top 5 enriched terms for each of the three ontology categories. The table should include: term ID, term description, count, %, P-value and fold enrichment. Identify the terms by their GO category (biological process, cellular component or molecular function).

Address the following in your narrative:
- How many genes from the original list were UP-regulated and how many were DOWN-regulated?
- Do any of the enriched terms support the conclusions of the paper with regard to the role of Plagl2 in the cell? Provide your reasoning for this conclusion.
- Are there other terms that you think might be important for the role of Plagl2 as a potential oncogenic transcription factor?
- What other over-represented GO terms would you find interesting to pursue and why?

Create a sublist from all UP-regulated genes in the top 5 biological process categories. This list should be <40 genes. If you have many more than that, then see me before moving on with your analyses. View the sublist in the Gene List report and export the file. As a supplemental table to your report, convert your sublist to an Excel table that includes: Entrez Gene ID, official gene symbol, gene name, and log2 ratio of Plagl2/ctl expression. You will use this list in the next section.
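As a sanity check on the numbers reported in Section 2, the following is a minimal Python sketch of how the UP/DOWN counts and the fold enrichment can be reproduced by hand. The function names, gene names and numbers are hypothetical and are not taken from the exam gene list. The parameter names mirror the List Total, Pop Hits and Pop Total columns of the DAVID chart. The p-value here is a plain hypergeometric upper tail; DAVID reports a more conservative modified Fisher exact p-value (the EASE score), so the two will not match exactly.

```python
from math import comb


def split_by_direction(log2_ratios):
    """Split genes into UP (log2 ratio > 0) and DOWN (log2 ratio < 0) lists."""
    up = sorted(g for g, r in log2_ratios.items() if r > 0)
    down = sorted(g for g, r in log2_ratios.items() if r < 0)
    return up, down


def fold_enrichment(count, list_total, pop_hits, pop_total):
    """Term frequency in the gene list divided by its frequency in the background."""
    return (count / list_total) / (pop_hits / pop_total)


def hypergeom_upper_tail(count, list_total, pop_hits, pop_total):
    """P(X >= count) when list_total genes are drawn from pop_total genes,
    of which pop_hits carry the term. Plain hypergeometric test, NOT the
    EASE score that DAVID reports (assumption: used only as a cross-check).
    """
    max_k = min(list_total, pop_hits)
    favorable = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(count, max_k + 1)
    )
    return favorable / comb(pop_total, list_total)


if __name__ == "__main__":
    # Hypothetical log2(Plagl2/ctl) ratios, for illustration only.
    toy_ratios = {"GeneA": 1.8, "GeneB": -0.9, "GeneC": 2.4}
    up, down = split_by_direction(toy_ratios)
    print(len(up), "UP-regulated;", len(down), "DOWN-regulated")

    # Hypothetical term: 10 of 100 list genes vs. 500 of 20000 background genes.
    print("fold enrichment:", fold_enrichment(10, 100, 500, 20000))
    print("hypergeometric p:", hypergeom_upper_tail(10, 100, 500, 20000))
```

If your hand-computed fold enrichment disagrees with the DAVID output, check first whether the List Total in DAVID is smaller than the number of genes you submitted, since unmapped IDs are dropped.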

Section 3: Protein-protein interactions and pathway enrichment

KEGG pathway enrichment:

Provide a table of the KEGG pathways over-represented or enriched in your list of genes, based on the DAVID analysis with default options. The table should include 5 columns of data from the DAVID output: the KEGG ID, pathway name, gene count, % and p-value (unadjusted).

Compare the enriched pathways to those listed in Figure 6 of the paper, which was generated using the Ingenuity Pathway Analysis software. Some pathways appear to be the same based on the names of the pathways. Others may reflect similar pathways if you dig into the KEGG pathway description, which can be found at the KEGG website. Use this information to determine which of the enriched KEGG pathways overlap with those in the paper. In a sixth column of the table of over-represented KEGG pathways, list the name of the pathway from Figure 6 that you think matches the KEGG pathway in that row of the table.

Based on the authors' assertion that Plagl2 promotes cellular renewal and inhibits differentiation, do any of the KEGG enriched pathways make sense? Discuss which ones and why.

Create a sublist of genes from the top 5 enriched KEGG pathways. As a supplemental table to your report, generate a list of all the genes that appear in the top 5 pathways including: Entrez Gene ID, official gene symbol, gene name, and log2 ratio of Plagl2/ctl expression.

Protein-protein interactions:

Submit the sublist of genes from the top 5 KEGG pathways to the STRING database. Change the view to Confidence. Use this information to answer the following questions as part of your narrative:
1. Do the proteins in this list appear to interact in clusters? Describe how many clusters there are and which proteins form the clusters.
2. What are the three most connected proteins? (You may need to shift some proteins around to count all the interactions.)
3. Does expression level (relative to control) correlate with the most connected proteins? That is, are the most connected also the most UP-regulated?
4. When you conduct an analysis in STRING, are the same pathways over-represented as you found from the KEGG analysis in DAVID? How do they differ?

Submit the Entrez Gene IDs from the top 5 BP gene list to the STRING database.
1. Do the proteins in this list appear to interact in clusters?
2. Compared to the PPIs from the top 5 KEGG list, does this list of proteins have more or fewer proteins with no known or predicted interactions?
3. What are the top three most connected proteins?
4. Are they the same as you observed when looking at the gene list from the top 5 KEGG pathways? If not, how do they differ?
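Counting interactions by eye in the STRING network view is error-prone, so the questions above on clusters and most connected proteins can also be checked from a list of interactions. The following is a minimal Python sketch under stated assumptions: the edge list is a hypothetical set of (protein A, protein B, combined score) tuples with scores on a 0 to 1 scale, the 0.4 cutoff is an assumed threshold you should set to match the confidence level you chose in STRING, and "cluster" is taken to mean a connected component of the network.

```python
from collections import defaultdict


def build_network(edges, min_score=0.4):
    """Map each protein to the set of partners it touches at or above min_score."""
    neighbors = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def most_connected(neighbors, n=3):
    """Return the n proteins with the most partners (ties broken alphabetically)."""
    return sorted(neighbors, key=lambda p: (-len(neighbors[p]), p))[:n]


def clusters(neighbors, all_proteins):
    """Group proteins into connected components; singletons have no interactions."""
    seen = set()
    groups = []
    for start in sorted(all_proteins):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            protein = stack.pop()
            if protein in group:
                continue
            group.add(protein)
            stack.extend(neighbors.get(protein, set()) - group)
        seen |= group
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    # Hypothetical proteins and scores, for illustration only.
    proteins = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
    edges = [
        ("P1", "P2", 0.9), ("P1", "P3", 0.7), ("P1", "P4", 0.6),
        ("P2", "P3", 0.5), ("P5", "P6", 0.8), ("P4", "P5", 0.2),
    ]
    net = build_network(edges)
    print("most connected:", most_connected(net))
    print("clusters:", clusters(net, proteins))
```

Comparing the proteins returned by most_connected against the log2 ratios in your supplemental table is one way to address whether the most connected proteins are also the most UP-regulated.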

In general, comment on whether you find the visual view of the PPI or the pathway visuals helpful in understanding your system. How would you use this context to choose genes to study in bench experiments?

Section 4: Background section on gene of choice

For this section and the next section of the exam, you will select one gene from the list of differentially expressed genes and find its human homolog. Provide a brief summary of the function of your GENE in the cell.

In table format:
- Provide the Entrez Gene ID, HGNC approved name and symbol, UniProt accession and the RefSeq mRNA accession number for the longest isoform of your GENE.
- Include up to two other aliases or alternative gene symbols for your gene.

In your summary narrative:
- Describe the molecular function and biological processes in which your GENE is involved using GO annotations, with evidence codes that are not IEA or NAS.
- Provide the cellular component information with evidence codes that are not IEA or NAS.
- Provide a brief summary of the cellular role of the gene, including any role it may have in the development of cancer or other diseases, if applicable.

Your summary narrative should discuss, in paragraph form, why you chose the particular gene and where you found the information.

Section 5: Virtual molecular/biochemical analysis

Examine the protein domains and protein-protein interactions found in the human homolog of your GENE.
1. What is the predicted molecular weight and pI of your protein?
2. Is your protein likely to be glycosylated?
3. What is the likely cellular location of your protein: membrane, cytosolic or secreted? Provide evidence for this conclusion. If it is localized differently depending on the condition, then describe that as well.
4. Are there any known disease-associated variants within your protein sequence?
5. Provide a table or graphic identifying the protein domains found in your GENE protein.
6. Determine if your protein is known to be involved in any protein-protein interactions (PPI).
7. Evaluate the evidence for those PPI, and provide a rationale for what criteria you would use to determine which PPI you would consider testing in an experiment. Criteria might include a type of experiment or simply that the interaction looks interesting because of the identity of the second protein.
8. Is your protein of interest known or predicted to interact with Plagl2?
9. Do any of the PPI identified involve any of the other proteins on the list for this exam?
10. For potential PPI, describe which ones you might want to confirm experimentally and why.
11. Determine the top 5 predicted phosphorylation sites based on NetPhosK. Include the class of kinase that is likely to phosphorylate the protein at each of the top 5 positions.
12. Are homologs of your protein available in any of the following model organisms: C. elegans, D. melanogaster or S. cerevisiae? Provide this information in the form of a table, giving a RefSeq protein accession for the homolog if it is present in the model organism.
13. Provide at least 1 reference that supports some of the human protein information you provided in this section.
14. Describe your approach to this section and describe any unexpected or interesting (to you) results that you may have found. Include a list of references to the websites/databases that you used for the analysis.
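For the glycosylation question in Section 5, one simple first check is to scan the protein sequence for the N-linked glycosylation sequon, N-X-S/T where X is any residue except proline. The following is a minimal Python sketch; the function name and the example sequence are hypothetical and are not taken from any gene on the exam list. A sequon only marks a candidate site: the protein must also pass through the secretory pathway for the site to be used, so combine this with your localization evidence and with a dedicated prediction tool before drawing a conclusion.

```python
import re

# N, then any residue except P, then S or T. The lookahead lets
# overlapping sequons (for example "NNSS") all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_n_glyc_sequons(sequence):
    """Return (1-based position, tripeptide) for each candidate N-X-S/T site."""
    seq = "".join(sequence.split()).upper()
    return [(m.start() + 1, seq[m.start():m.start() + 3])
            for m in SEQUON.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical peptide, for illustration only.
    toy_sequence = "MNASNPTQNGTA"
    for position, motif in find_n_glyc_sequons(toy_sequence):
        print(f"candidate N-glycosylation site at N{position}: {motif}")
```

Positions are reported 1-based so that they can be compared directly with the residue numbering used in UniProt feature tables.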