UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics


UNIVERSITY OF KWAZULU-NATAL
EXAMINATIONS: MAIN, 2010
SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics
DURATION: 3 HOURS
TOTAL MARKS: 125
Internal Examiner: Dr. Ché Pillay
External Examiner: Prof. Nicola Mulder (UCT)

CANDIDATES ARE REQUESTED, IN THEIR OWN INTERESTS, TO WRITE LEGIBLY.
ANSWER ALL QUESTIONS

Question 1 [30]
Shown below are algorithm parameters that may be varied during a BLAST search of a database. Answer the questions that follow.

1.1 Do the algorithm parameters shown above describe a blastn or a blastp search? Give a reason for your answer. [2]
1.2 Name three of the parameters (shown above) that could be modified so that you could find distantly related homologs of a query sequence. In your answer explain exactly how you would change those parameters and give a detailed reason for your choices. [12]
1.3 One of the algorithm parameters is Gap Costs. Explain why gap costs/gap penalties are needed during sequence alignment. [2]

1.4 Name the three types of gap penalties that are commonly used by sequence alignment programs and briefly describe how they work. [9]
1.5 The results from BLAST searches are reported as E-values and bit scores, which are calculated as shown below.

$$S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m \, n \, P$$

What is the main limitation of using E-values to compare the BLAST results obtained from different databases? [2]
1.6 BLAST and FASTA are both heuristic algorithms. However, BLAST searches typically take a shorter amount of time than FASTA searches. Explain why FASTA searches take a longer time. [3]

Question 2 [25]
A biotech company approaches you to help them with a project to sequence the genomes of microorganisms involved in the breakdown of industrial pollutants. The company plans to use this information to isolate enzymes that may be used for commercial applications. Answer the questions that follow.

2.1 Should the company use a shotgun sequencing or cloned contig sequencing approach when sequencing these genomes? Give two detailed reasons for your answer. [5]
2.2 This project is expected to generate several gigabases of sequencing data. Give three reasons why this data would not be stored in a flat-file type database. [3]
2.3 With respect to question 2.2, what type of database would be used to store this sequence data? Name and describe one advantage that this type of database has over flat-file databases. [3]
2.4 The biotech company had been using automated Sanger sequencing but has recently acquired a 454 sequencer. Name and describe two advantages that 454 sequencing technology has over Sanger sequencing for the project described above. [4]
2.5 For Sanger sequencing the company normally used a nine-fold coverage for the sequencing of a given genome. Give a detailed explanation as to why this coverage would not be sufficient for 454 sequencing. [4]
2.6 Assume that the company has sequenced the genomes of these microorganisms and all the contigs have been correctly assembled and ordered. Give a detailed description of how you would analyze these sequences to find the target enzymes. [6]
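The relationship between raw scores, bit scores and E-values referred to in Question 1.5 can be illustrated with a minimal Python sketch. It assumes the standard Karlin-Altschul form E = m * n * 2^(-S'), which may be written differently from the formula printed on the paper. The function names and the values of lambda, K, m and n below are hypothetical and chosen only for illustration.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'.

    lam and k are the statistical parameters of the scoring system
    (hypothetical values are used in the example below).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, m, n):
    """Expected number of chance hits with at least this bit score.

    m is the query length and n is the total database length
    (assumed Karlin-Altschul form: E = m * n * 2^(-S')).
    """
    return m * n * 2.0 ** (-bits)


# Illustrative numbers only: the same raw score searched against a small
# and a large database.
s_prime = bit_score(raw_score=80, lam=0.267, k=0.041)
small_db = e_value(s_prime, m=250, n=1_000_000)
large_db = e_value(s_prime, m=250, n=100_000_000)
print(round(s_prime, 1), small_db, large_db)
```

In this sketch the bit score is unchanged between the two searches, while the E-value scales in proportion to the database length n.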

Question 3 [27]

3.1 Using dynamic programming, align the sequences shown below using the following parameters: match +2, mismatch -1, gap penalty = -3. In your answer clearly indicate the final alignment(s) obtained and determine the score for this alignment. [10]

AGGTC
GGCT

3.2 Calculate the percentage identity for one of the alignments described in 3.1 above. [2]
3.3 Explain the purpose of using a tuple in dot-matrix sequence alignments. Why would you vary the tuple size for a given alignment? [3]
3.4 Describe two disadvantages of using dot-plots for sequence alignment. [2]
3.5 Determine the alignment score for the following alignment using the BLOSUM45 and BLOSUM62 matrices shown below. In your answer, clearly indicate which score was obtained from which matrix. [4]

G R R Y K L P C S
N R Q E C M P C Q

3.6 Explain the difference between the BLOSUM62 and BLOSUM45 matrices. [2]
3.7 Explain why the substitution scores for tryptophan (W) are usually less than zero in both matrices. [4]

BLOSUM 62

BLOSUM 45
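The dynamic programming procedure referred to in Question 3.1 can be sketched in Python as a global (Needleman-Wunsch style) alignment with a linear gap penalty. This is a minimal illustration under stated assumptions, not a model answer: the function names are hypothetical, the traceback returns only one of possibly several optimal alignments, and percentage identity is computed over the full alignment length, which is one of several conventions in use.

```python
def global_align(a, b, match=2, mismatch=-1, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: leading gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the matrix: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right cell, preferring the diagonal move.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length (one convention)."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


best, row_a, row_b = global_align("AGGTC", "GGCT")
print(best)
print(row_a)
print(row_b)
print(percent_identity(row_a, row_b))
```

Because ties in the matrix are broken in favour of the diagonal move, other equally scoring alignments may exist that this sketch does not report.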

Question 4 [8]
Shown below is an entry from the NCBI database. Answer the questions that follow.

4.1 What type of organism was this sequence obtained from and what is the length of this sequence? [1]
4.2 Why is there a difference between the accession and version numbers of this sequence entry? [1]
4.3 How many proteins are encoded by the polyprotein sequence shown in this entry? [2]
4.4 What is the RefSeq accession number for this sequence? Describe the type of sequence data that is found in the RefSeq database. [2]
4.5 Assume that your internet connection to the NCBI was not working. Name another database where you could find sequence information about this polyprotein. [1]
4.6 Which database could you use to find kinetic information about the proteins described by the entry above? [1]

Question 5 [25]
Shown on the next page is a figure comparing the accuracy of ten multiple sequence alignment programs (Nuin et al. (2006) BMC Bioinformatics 7: 471). These programs were given a Simprot dataset with an indel frequency of 15%, and the accuracy of the alignments generated by these programs was calculated using the so-called developer's score:

$$f_D = \frac{c}{r}$$

where c is the number of residue pairs in the test alignment that are correctly aligned with respect to the reference alignment and r is the number of aligned residue pairs in the reference alignment. Answer questions 5.1 to 5.7 that follow.

5.1 What are the maximum and minimum scores for $f_D$? Give an explanation for your answer. [4]
5.2 Which program performed the most accurate alignments with increasing sequence length? Give a reason for your answer. [3]
5.3 Which program performed the least accurate alignments with increasing sequence length? Give a reason for your answer. [3]
5.4 Which program showed the steepest decline in accuracy with increases in the sequence length of the target datasets? Give a reason for your answer. [3]
5.5 Why do you think the program chosen in 5.4 above performed so poorly with increases in sequence length? [4]
5.6 Most of the programs listed on the next page use progressive, consistency and/or iterative algorithms to construct multiple sequence alignments. For each of these algorithms list and describe a disadvantage. [6]
5.7 Despite their differences, what score do all these programs aim to maximize? [2]
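The developer's score defined in Question 5 can be sketched in Python as follows. This is a minimal illustration that assumes alignments are given as lists of equal-length gapped strings with "-" as the gap character and with sequences in the same order in both alignments. The function names and the toy alignments are hypothetical and are not taken from Nuin et al.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified by (sequence index, residue index), so the
    result does not depend on where gaps happen to fall.
    """
    pairs = set()
    next_residue = [0] * len(alignment)
    for col in range(len(alignment[0])):
        present = []
        for seq_index, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_index, next_residue[seq_index]))
                next_residue[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def developer_score(test, reference):
    """f_D = c / r: correctly aligned pairs over pairs in the reference."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    c = len(aligned_pairs(test) & ref_pairs)
    return c / len(ref_pairs)


# Toy example: the test alignment misplaces one gap relative to the reference.
reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]
print(developer_score(test, reference))
```

Identifying residues by their position in the ungapped sequence, rather than by column, is what allows pairs from two differently gapped alignments to be compared directly.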

Fig. 1 A comparison of alignment accuracy and increasing sequence length at high indel frequency values with different tree topologies (Nuin et al. (2006) BMC Bioinformatics 7: 471).

Question 6 [10]

6.1 The first step in developing a phylogenetic tree is multiple sequence alignment. Can a multiple sequence alignment generated by a program be used without modification? Explain your answer. [2]
6.2 Explain what Tajima's relative rate test is used for in phylogenetic studies. [1]
6.3 Explain what homoplasy is and describe why it is necessary to correct for this in phylogenetic analysis. [2]

[Figure: phylogenetic tree with branches labelled HIV-1 Group O, HIV-1 Group N, SIV, HIV-1 Group M and HIV-2]

Consider the phylogenetic tree presented above (Arien et al. (2007) Nature Reviews Microbiology 5:141). Answer questions 6.4 and 6.5 below.

6.4 A number of polytomies are present in the figure. Give an explanation of how these could have occurred. [3]
6.5 Based on the data presented in the figure, is HIV-2 more closely related to HIV-1 or SIV? Give a reason for your answer. [2]