UNIVERSITY OF EAST ANGLIA School of Computing Sciences Main Series UG Examination ALGORITHMS FOR BIOINFORMATICS CMP-6034B


Time allowed: 3 hours

All questions are worth 30 marks. Answer any FOUR.
Notes are not permitted in this examination.
Do not turn over until you are told to do so by the Invigilator.

CMP-6034B Version 1
Copyright of the University of East Anglia
Module Contact: Dr. Katharina Huber, CMP

1. (a) What is a non-coding RNA? Name two special types of non-coding RNAs, and explain their functions in the cell.

(b) For each of the following four sets, decide whether or not it can be an RNA secondary structure on the sequence ACGUCAUGCAGU:
(i) {(1,12),(3,9),(4,6),(2,10)}
(ii) {(3,5),(2,11),(7,10),(1,12)}
(iii) {(5,11),(1,4),(8,9),(7,10)}
(iv) {(3,9),(2,11),(4,6),(10,12)}

(c) Briefly explain how the concept of energy minimisation can be used to predict RNA secondary structure. Give one advantage and one disadvantage of this approach.

(d) Write down the secondary structure (((..))(.)). given in dot-bracket notation as a circle plot and in tree representation.

(e) Write down all secondary structures on an RNA sequence of length 8 that do not respect base-pairing with exactly 3 base-pairs.

(f) Compute the value of the Zuker metric D_Z(S_1, S_2) between the pair of secondary structures
S_1 = .(.)((..).)
S_2 = ((.)(..))..
represented in dot-bracket notation.

2. (a) What is sequence assembly, and how is it used in genomics? [3 marks]

(b) (i) Describe the shortest superstring problem, and explain why solving it can be useful in sequence assembly.
(ii) Give two reasons why solving the shortest superstring problem may not be a suitable approach to sequence assembly.

(c) Explain the greedy approach to finding a sequence assembly. What is a disadvantage of the greedy approach? Use the greedy approach to find an assembly of the following set of sequences: {CAGTTA, AGACG, TGCAG, CGATC}.

(d) (i) Explain how the overlap graph can be used to find a sequence assembly. Give one reason why this might not be a good approach to finding a sequence assembly.
(ii) Write down the overlap graph for the set of sequences {CGAG, GCAA, CTCG, GCTC} and use it to find a superstring of this set.

(e) Give two applications that come about from having a knowledge of the human genome sequence.

3. (a) Write the sequence, 5' to 3', of the mRNA molecule synthesised from a DNA template strand having the sequence, from 5' to 3', TTAACCATGTAACCG.

(b) Explain what is meant by a reading frame in the context of translation.

(c) Translate the following DNA sequence using the genetic code given in Appendix A in all possible reading frames: CCGCCATCCTACTCCCAG.

(d) Explain how words are used to speed up pairwise alignment of sequences. With an example, show how a look-up table can be used to speed up the alignment using words.

(e) How does BLAST search for related words? [10 marks]

4. (a) Describe in as much detail as possible, giving pseudocode for the core routines, the dynamic programming algorithm for the global alignment of two biological sequences. [10 marks]

(b) Fill in the dynamic programming matrix to determine the optimal global alignment of the sequences ATGCA and GTCAC using the identity matrix for the substitution matrix and a gap penalty of -1. Give the three possible scores in each cell, indicating the one(s) with the highest score. Give all the optimal alignments. [10 marks]

(c) Explain how the basic algorithm for global alignment of two sequences should be adapted so that overhanging ends are not penalised.

(d) How is the algorithm for pairwise sequence alignment extended for multiple-sequence alignment?

5. (a) Let X = {a,b,c,d,e,f} and consider the graph T depicted in Fig. 1.

[Figure 1: the tree T with leaves a, b, c, d, e, f; one edge is marked *.]

(i) Explain why T is a tree. [2 marks]
(ii) Explain why T is a phylogenetic tree on X. [1 mark]
(iii) Find the split of X that is induced by deleting the edge of T marked by *. [1 mark]
(iv) Are the splits ab|cdef and def|abc compatible? (Justify your answer!) [2 marks]
(v) Use Meacham's Tree Popping algorithm to construct the tree T' that displays precisely the split system comprising all splits S_x = x|X − {x}, for all x ∈ X, and S_1 = ac|bdef, S_2 = abc|def and S_3 = ef|abcd. Process the splits in the following order: S_a, S_b, S_c, S_d, S_e, S_f, S_1, S_2, S_3.
(vi) Are the trees T and T' isomorphic as phylogenetic trees? (Justify your answer!) [2 marks]
(vii) T is one instance of a phylogenetic tree on 6 leaves. How many non-isomorphic binary phylogenetic trees are there on X? (Simplify your answer as much as possible!) [3 marks]

(b) Consider the rooted phylogenetic tree R on X = {a,b,c,d} depicted in Fig. 2.
(i) State the triplet induced by R on {a,b,d}. [1 mark]
(ii) How many non-isomorphic rooted binary phylogenetic trees are there on X? (Simplify your answer as much as possible!)
(iii) Either explain why R cannot induce two non-isomorphic triplets on {a,b,c,d} or give two such triplets. [2 marks]

[Figure 2: the rooted phylogenetic tree R with leaves a, b, c, d.]

(c) Apply the Build algorithm to the triplet set T = {ab|c, bc|d, de|f, fg|d} to decide whether or not T is induced by a rooted phylogenetic tree on X = {a,b,...,g}. If it is, then construct that tree, and if it is not, then explain why.

Appendix A

END OF PAPER