Admission Exam for the Graduate Course in Bioinformatics. November 17th, 2017 NAME:


This exam contains 30 (thirty) questions divided into 3 (three) areas (maths/statistics, computer science, biological sciences). You must answer a total of 10 (ten) questions, and at least 7 (seven) must be from one area. You may answer all 10 (ten) questions from a single area, or answer 7 (seven) questions from one area and 3 (three) from the other area(s). Answer each question in its respective box. Answers that are not in the respective box will not be considered. You may answer in pen or pencil. Good luck!

Duration: 3 hours

Questions of Mathematics and Statistics

1. Consider an experiment in which 92 students are to be randomly assigned to group G1 or G2. A coin is tossed for each student: in the case of "head" the student is allocated to G1, otherwise to G2. The following results were obtained:

Group   Number of students
G1      35
G2      57
Total   92

a) Based on the data, estimate the probability π of head.
b) Calculate the 95% confidence interval for π.
c) Is the coin fair (i.e., π = 0.50)? Justify your answer.

Consider that P(Z ≤ 1.645) = 0.95 and P(Z ≤ 1.960) = 0.975, where Z follows a standard normal distribution, and that X ~ Binomial(n, π); for large n, $\hat{p} = X/n \sim N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$.
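As a rough illustration of parts (a) and (b), the sketch below computes the sample proportion and a Wald-type 95% interval from the large-sample normal approximation given in the question. The function name `wald_interval` and its default `z` argument are assumptions introduced here for illustration; they are not part of the exam.

```python
import math

def wald_interval(successes, n, z=1.960):
    """Return (p_hat, lower, upper) using the large-sample normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
    return p_hat, p_hat - z * se, p_hat + z * se

# 35 of the 92 students were allocated to G1 ("head")
p_hat, lower, upper = wald_interval(35, 92)

# Whether 0.50 lies inside the interval is one way to argue part (c)
contains_half = lower <= 0.5 <= upper
```

Under this approximation the interval is roughly (0.281, 0.480), so checking whether it contains 0.50 gives a direct basis for the justification asked for in part (c).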

2. Based on a sample of size 100 from a population, the interval 1.65 ± 0.20 is the 95% confidence interval for the parameter $\mu_F$, the mean height of women (in meters) in the population.

a) Using this confidence interval, what is the decision for the hypothesis test $H_0: \mu_F = 1.87$ versus $H_1: \mu_F \neq 1.87$? Justify your answer.
b) For this population, suppose the heights of both men and women follow a normal distribution with standard deviation equal to 0.80 m. Given that a sample of 80 men from this population had an average height of 1.87 m, find the 95% confidence interval for $\mu_M$, the mean height (in meters) of men.
c) Is there evidence of a significant difference between the mean heights of men and women? Justify.

Consider that P(Z ≤ 1.645) = 0.95 and P(Z ≤ 1.960) = 0.975, where Z follows a standard normal distribution.

3. A researcher is investigating variables associated with gene expression. To this end, the following model was adopted:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$

where $y$ is the observed gene expression measurement, $x_1$ (= 0, 1, 2) is the number of copies of allele G at a molecular marker, $x_2$ is an indicator variable for hypertension (0 = normal and 1 = hypertension), and $e$ is a random error with mean zero and constant variance. The betas are regression coefficients. Mark each of the following statements as TRUE or FALSE, and justify:

a) $\beta_0$ is the expected gene expression, independently of the other variables.
b) $\beta_1$ indicates the expected change in gene expression for a change of one unit in the number of copies of allele G, independently of the other variable.
c) $\beta_3 = 0$ indicates that parallel lines explain the effect of $x_1$ on $y$ for normal and hypertensive individuals.

4. Over a period of 30 days, the cause of hospitalizations (due to disease A or disease B) was monitored in a Health Center. At the end of this period, the following result was obtained, classified according to the patient's gender:

Gender   Disease A   Disease B   Total
Male
Female
Total

a) Based on the data, hospitalization due to disease A is most likely for which gender?
b) Consider π as the probability of an event occurring. The odds of an event is defined as π/(1-π). Calculate the odds ratio for hospitalization due to disease A among males and females.
c) Interpret the obtained odds ratio value.

5. Let X, Y and Z be random variables, and a, b, c and d fixed real values. Answer:

a) When is cor(aX + b, cY + d) = cor(X, Y)? (cor is the correlation operator)
b) When is var(X + Y) = var(X) + var(Y)? (var is the variance operator)
c) Is var(X + μ_X) greater than, less than, or equal to var(X)? (μ_X is the expected value of X)

6. Find the intervals of convergence of the Taylor series expansions of:

a) $g(t) = e^{t}$
b) $g(t) = \ln(t)$

7. Interpret the solution and give an application of the differential equation: f''(t) = -af(t).

8. Calculate (lg: logarithm to base 10; ln: logarithm to base e):

a) 2 lg ln (e! )
b) 2 ln 3e ln (9) 3 ln 5e 3 2 ln (25)

9. Formulate the problem addressed by PCA (Principal Component Analysis), and relate it to the spectral theory of symmetric matrices.

10. A symmetric matrix A is positive definite if, and only if, $x'Ax > 0$ for every nonzero vector $x$. In this case, the quadratic form associated with matrix A is also positive definite.

a) Determine the matrix A associated with the quadratic form $f(x_1, x_2) = 3x_1^2 + 3x_2^2 + x_1 x_2$.
b) Verify whether $f(x_1, x_2)$ is positive definite.
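One way to check this kind of question numerically is sketched below. It assumes the quadratic form reads f(x1, x2) = 3*x1^2 + 3*x2^2 + x1*x2, so the cross term is split evenly between the two off-diagonal entries, and it uses Sylvester's criterion (all leading principal minors positive) rather than the definition directly. The helper names are hypothetical.

```python
def is_positive_definite_2x2(a):
    """Sylvester's criterion for a symmetric 2x2 matrix: both leading principal minors > 0."""
    minor1 = a[0][0]
    minor2 = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return minor1 > 0 and minor2 > 0

def quadratic_form(a, x):
    """Evaluate x'Ax for a 2x2 matrix a and a length-2 vector x."""
    return sum(x[i] * a[i][j] * x[j] for i in range(2) for j in range(2))

# Assumed form: f(x1, x2) = 3*x1^2 + 3*x2^2 + x1*x2
# The coefficient 1 of x1*x2 is split as 0.5 + 0.5 to keep A symmetric.
A = [[3.0, 0.5],
     [0.5, 3.0]]

positive_definite = is_positive_definite_2x2(A)
```

Here the minors are 3 and 3*3 - 0.5*0.5 = 8.75. An equivalent check would be to confirm that both eigenvalues of A are positive.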

Questions of Computer Science

1. Describe a deterministic finite automaton over the alphabet Σ = {a, b} that accepts strings of the following kind: one or more a's followed by zero or more b's. For example, the following strings should be accepted: aaa and aab; and the following strings should not be accepted: bb, bba, aabba.
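A candidate automaton for this language can be simulated as in the sketch below. The state names `q0`, `q1`, `q2` and the implicit dead state are assumptions made for illustration; other equivalent automata exist.

```python
# q0 = start, q1 = one or more a's read, q2 = reading b's after the a's.
# Any transition missing from the table leads to an implicit dead (rejecting) state.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "b"): "q2",
}
ACCEPTING = {"q1", "q2"}

def accepts(word):
    """Run the automaton on word and report whether it ends in an accepting state."""
    state = "q0"
    for symbol in word:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # fell into the dead state
            return False
    return state in ACCEPTING
```

For a strictly complete DFA, the dead state would be drawn explicitly, with both symbols looping on it.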

2. Present the most efficient algorithm for sorting n integers, assuming that each number represents a person's age. You should sketch the algorithm in pseudocode and present its complexity analysis.
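Because ages fall in a small bounded range, one natural candidate is counting sort. The sketch below assumes ages are integers between 0 and a hypothetical upper bound `max_age` (150 here, chosen only for illustration).

```python
def sort_ages(ages, max_age=150):
    """Counting sort for integers in [0, max_age]; runs in O(n + k) with k = max_age + 1."""
    counts = [0] * (max_age + 1)
    for age in ages:                # O(n): tally how often each age occurs
        counts[age] += 1
    result = []
    for age, count in enumerate(counts):  # O(n + k): emit ages in increasing order
        result.extend([age] * count)
    return result
```

Since k is a constant independent of n, the running time is linear in n, which beats the Ω(n log n) lower bound that applies only to comparison-based sorting.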

3. Given a directed weighted graph G with nonnegative edge weights, we can formulate the following two problems: (a) find the shortest path from vertex u to vertex v in G; (b) find the longest path (without visiting any vertex more than once) from vertex u to vertex v in G. A path's weight is the sum of the weights of its edges. Can both of these problems be solved efficiently? Justify your answer.

4. How can the runtime of recursive algorithms be analyzed? Provide an example of a recursive algorithm and its analysis.

5. Given class C with n students and the friendship relations between pairs of students, we want to determine how many bubbles there are in the class. A student X does not belong to bubble B if X is not a friend of any student in B; otherwise X belongs to B. Present an efficient algorithm for determining the bubbles of C. A friendship relation is symmetric: if X is a friend of Y then Y is a friend of X. You need to take into consideration that any student can be considered his or her own friend; therefore bubbles with only one student are possible. Note that a bubble member is not necessarily a friend of all the other bubble members; one friend is already enough.
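Read this way, a bubble corresponds to a connected component of the friendship graph. The sketch below counts components with breadth-first search. It assumes students are numbered 0 to n-1 and friendships are given as pairs; both conventions are introduced here, not taken from the exam.

```python
from collections import deque

def count_bubbles(n, friendships):
    """Count connected components of an undirected graph in O(n + m) time."""
    adjacency = [[] for _ in range(n)]
    for x, y in friendships:        # friendship is symmetric, so add both directions
        adjacency[x].append(y)
        adjacency[y].append(x)

    visited = [False] * n
    bubbles = 0
    for start in range(n):
        if visited[start]:
            continue
        bubbles += 1                # an unvisited student starts a new bubble
        visited[start] = True
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for friend in adjacency[current]:
                if not visited[friend]:
                    visited[friend] = True
                    queue.append(friend)
    return bubbles
```

A student with no friends forms a bubble alone, which matches the single-student case mentioned in the question. Depth-first search or a Union-Find structure would serve equally well.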

6. Write pseudocode for the function pop(S, x) for stack S. If the stack is empty, pop should return false. Otherwise pop should return true, place in x the value on top of the stack, and update the stack. Assume that S is a vector.
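The behaviour described can be sketched in Python as follows. Python has no output parameters, so this version returns the pair (success, value) instead of writing to x. The representation of S as a dictionary holding a vector `items` and a counter `top` is an assumption made for illustration.

```python
def pop(stack):
    """Return (False, None) if the stack is empty; otherwise (True, top value) and shrink the stack."""
    if stack["top"] == 0:           # no elements stored
        return False, None
    stack["top"] -= 1               # the top element sits at index top - 1
    return True, stack["items"][stack["top"]]

# A vector with capacity 4 currently holding two elements: 4 (bottom) and 9 (top)
S = {"items": [4, 9, 0, 0], "top": 2}
ok, x = pop(S)
```

The vector itself is not resized; only the `top` counter changes, so the operation takes constant time.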

7. Write a regular expression that generates strings of 0s and 1s in which the number of consecutive 0s and the number of consecutive 1s is always even. Example of a correct string: 0011. Example of an incorrect string: 00011. You may use only the following operators: * (repeat zero or more times), | (character alternation), character concatenation, and parentheses for character grouping.
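One candidate expression, written with `|` for alternation, is `(00|11)*`. The sketch below tests it with Python's `re` module; the helper name `has_even_runs` is hypothetical.

```python
import re

# Every accepted string is a concatenation of the blocks "00" and "11",
# so each maximal run of equal symbols has even length.
EVEN_RUNS = re.compile(r"(00|11)*")

def has_even_runs(s):
    """True when the whole string s is generated by (00|11)*."""
    return EVEN_RUNS.fullmatch(s) is not None
```

Note that this expression also accepts the empty string, which has no runs at all. Whether that is intended depends on how the question is read.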

8. There are eight boxes. Seven of them contain five-gram balls. One of them contains four-gram balls. You have one digital weighing scale. Describe a strategy to identify the box containing four-gram balls by using the weighing scale only once. The number of balls in each box is as large as you wish.
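A well-known strategy for this kind of puzzle is to take a different number of balls from each box and read the answer from the missing weight. The sketch below simulates it. The functions `find_light_box` and `make_scale` are hypothetical helpers, and boxes are numbered 1 to 8 by assumption.

```python
def find_light_box(weigh):
    """Identify the light box with a single call to weigh(counts)."""
    counts = list(range(1, 9))          # 1 ball from box 1, 2 from box 2, ..., 8 from box 8
    expected = 5 * sum(counts)          # 180 g if every ball weighed 5 g
    deficit = expected - weigh(counts)  # each four-gram ball is 1 g short
    return deficit                      # the shortfall equals the light box's number

def make_scale(light_box):
    """Simulated scale for testing: box number light_box holds four-gram balls."""
    def weigh(counts):
        return sum(c * (4 if i == light_box else 5)
                   for i, c in enumerate(counts, start=1))
    return weigh
```

The approach works because each box contributes a distinct number of balls, so each possible answer produces a distinct total.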

9. Given an empty binary search tree of integers, show the structure of the tree after each of the values 5, 2, 4, 7, 8, 1, 3 is inserted, and then show the change to the resulting tree when 2 is deleted.
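The insertions and the deletion can be traced with the sketch below. It assumes one common convention for deleting a node with two children, namely replacing it with its in-order successor; using the in-order predecessor instead is equally valid and yields a different tree. The pre-order listing is used only as a compact way to show the tree's shape.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the subtree rooted at root and return the subtree's root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def delete(root, key):
    """Delete key, replacing a two-child node with its in-order successor (assumed convention)."""
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    elif root.left is None:
        return root.right
    elif root.right is None:
        return root.left
    else:
        successor = root.right
        while successor.left is not None:   # smallest key in the right subtree
            successor = successor.left
        root.key = successor.key
        root.right = delete(root.right, successor.key)
    return root

def preorder(root):
    """List keys in pre-order (node, left subtree, right subtree)."""
    return [] if root is None else [root.key] + preorder(root.left) + preorder(root.right)

root = None
for value in [5, 2, 4, 7, 8, 1, 3]:
    root = insert(root, value)
before = preorder(root)
root = delete(root, 2)
after = preorder(root)
```

Under the successor convention, node 2 (with children 1 and 4) is replaced by 3, the smallest key in its right subtree.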

10. For each of the following statements, answer True or False.

a) Prim's algorithm uses the Union-Find data structure.
b) In Union-Find with path compression, after we do a Find-Set(x) operation, the height of the tree that x is in always decreases.
c) In the best implementation of the Union-Find data structure, the worst case cost for each operation is O(log n).
d) In the best implementation of Union-Find, the worst case cost for the Make-Set operation is O(1).

Questions of Biological Sciences

1) The flow of genetic information involves processes called replication, transcription, and translation.

a) Justify this statement, indicating the role of each of these processes in the transmission of information in biological systems.
b) Indicate which major molecules (substrates, products and catalysts) are involved in each process.

2) How do the polar and apolar groups distribute themselves in the tertiary structure of globular proteins? Most globular proteins are denatured by brief exposure to 65 °C, but some that have cysteine residues in their chain must be heated longer and at higher temperatures to denature. What is the molecular basis of this property? How could the denaturation of these proteins with cysteine residues be facilitated?

3) Draw a scheme of the typical structure of a eukaryotic gene containing 3 exons. Indicate in the diagram, and explain in writing, the function of the most important regulatory elements for:

a) control of DNA transcription
b) RNA processing to give rise to the mature messenger RNA
c) initiation of protein translation from the mature mRNA

4) Explain why DNA replication is semi-conservative, bidirectional, and semicontinuous.

5) The 20 amino acids that make up proteins are encoded by 61 different codons. However, only 32 distinct transfer RNAs are sufficient to recognize all 61 codons and ensure protein synthesis. Explain this apparent paradox by using your knowledge about the mechanism of reading the genetic code.

6) What is the difference between genetic mutations and epigenetic modifications of the genome? Give an example of a type of frequent epigenetic modification and explain its role in the control of gene expression.

7) The DNA sequence below contains the gene encoding the A chain of human insulin. The start and end codons of the coding region are underlined and in bold:

5'
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCC
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACC
CAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCT
AGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT
GCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGG
CCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCT
CCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCG
CCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGCAAAA
3'

a) Provide the sequence of a pair of oligonucleotide primers suitable for amplification of the insulin gene. Mark on the above sequence the annealing site of each primer.
b) Draw a diagram illustrating the amplification of the desired region during the 1st cycle of PCR. Indicate the orientation of the DNA strands in the scheme.
c) By how many times would the initial amount of DNA be amplified after 10 cycles of PCR?
d) What prior procedure would be required to amplify the insulin gene by PCR from RNA isolated from a cell line?
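Two small calculations relevant to parts (a) and (c) are sketched below. The sketch assumes that each PCR cycle doubles the target perfectly, which is an idealization, and that the reverse primer is taken as the reverse complement of the last 20 nucleotides of the coding region, ending at the TAG stop codon. The 20-nucleotide length and the helper names are illustrative assumptions; practical primer design also weighs melting temperature and specificity.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def fold_amplification(cycles):
    """Ideal amplification factor, assuming perfect doubling in every cycle."""
    return 2 ** cycles

# Last 20 nucleotides of the coding region in the sequence above (ends at the TAG stop codon)
coding_end = "TGGAGAACTACTGCAACTAG"

# A reverse primer anneals to the top strand, so it is the reverse complement of that segment
reverse_primer = reverse_complement(coding_end)

# The forward primer has the same sequence as the start of the coding region (begins at ATG)
forward_primer = "ATGGCCCTGTGGATGCGCCT"

amplification_after_10_cycles = fold_amplification(10)
```

Both primers are written 5' to 3', which is the conventional orientation for reporting oligonucleotide sequences.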

8) The two columns below list a number of related terms. Link each name in the left column with a single term in the right column.

RNA polymerase     vector
Primase            translation
HindIII            eukaryotic messenger RNA
Holoenzyme         intron
5' cap             methylation of messenger RNA
Promoter           plasmid
AUG                restriction endonuclease
TATA box           splicing
Poly A             replication
Shine-Dalgarno     prokaryotic ribosomal RNA

9) Compare gene expression in prokaryotes and eukaryotes with respect to:

a) degree of coupling of transcription and translation
b) number of gene products in a primary transcript
c) number of proteins resulting from the translation of a primary transcript
d) control by protein transcription factors
e) organization of genes in operons

10) What are the main elements necessary for the transcription of genes present in the DNA sequence of eukaryotes? Describe the function of these elements and their role in the control of gene expression during differentiation and development in multicellular organisms.