Module 2: Core Bioinformatics FINAL EXAM SOLUTIONS


Master in Bioinformatics, Universitat Autònoma de Barcelona
January 9th, 2013

Question 1: Which statement does NOT apply to the FASTA format?
a. FASTA format can be used to store multiple sequences in tandem in a single computer file.
b. In a FASTA file the definition line always starts with the greater-than (>) symbol, and usually no constraints restrict its length.
c. Besides the plain text file extension (.txt), there is no other file extension for a text file containing FASTA-formatted sequences.
d. FASTA sequence lines can contain line breaks or paragraph marks < > at the end of each line.

Question 2: Search SWISS-PROT using the Sequence Retrieval System (SRS) and determine how many proteins inferred from homology and larger than 100 kDa there are in the mouse (Mus musculus) genome.
a. No entries found
b. <10 entries
c. ≥10 and <100 entries
d. ≥100 entries

Explanation: At the EBI-SRS, browse the UniProtKB/Swiss-Prot database using the combined search terms:
- Species: Mus musculus
- Protein existence: Inferred from homology
- MolWeight: > [...] Da
You will find 9 entries.

The number of occurrences of each dinucleotide in the genome of Dengue 1 virus is the following:

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
[...]

Note that, [...] = [...]. Moreover, for this virus the frequencies of the nucleotides are the following:

a c g t
[...]

Question 3: We are specifically interested in the dinucleotides tg and cg. Are they over- or under-represented?
a. Both are over-represented
b. tg is over-represented and cg is under-represented
c. tg is under-represented and cg is over-represented
d. Both are under-represented

Explanation: To answer the question we need to calculate $\rho_{tg}$ and $\rho_{cg}$. To do this, we first calculate the frequencies of the dinucleotides tg and cg: $f_{tg} = 832/10734$ and $f_{cg} = 261/10734$. We also know the frequencies of the nucleotides, $f_c$ = [...], $f_g$ = [...] and $f_t$ = [...]. Therefore,

$\rho_{tg} = f_{tg} / (f_t \times f_g)$ and $\rho_{cg} = f_{cg} / (f_c \times f_g)$

Consequently, tg is over-represented ($\rho_{tg} > 1$) and cg is under-represented ($\rho_{cg} < 1$), and the right answer is b).

Question 4: Assuming that a Markov chain model is appropriate to describe the behavior of this DNA sequence, the estimated transition probability $p_{cg} = P(X_n = g \mid X_{n-1} = c)$ is:
a. [...] × [...] = [...]
b. 261/([...]) = 261/10734 ≈ 0.0243
c. 261/([...]) = 261/2240 ≈ 0.1165
d. ([...])/([...]) = 761/10734 ≈ 0.0709

Explanation: The transition probability from c to g is calculated as the number of times that cg occurs (261) divided by the total number of occurrences of all the dinucleotides beginning with c (ca, cc, cg, ct). This is 261/2240, that is, answer c).

Question 5: The most commonly used multiple aligners
a. Align all the sequences at once
b. Align the sequences 2 by 2 using a progressive approach
c. Add the sequences one after the other to a larger alignment
d. Add the sequences 3 by 3 using a progressive approach

Question 6: Given the following guide tree, which proposition is most compatible with the order in which the sequences could be aligned?
a. A with B, followed by C with D, followed by E with F.
b. A with B, followed by A with C, followed by A with D.
c. A with B, followed by C with D, followed by AB with CD, followed by ABCD with E.
d. Any of the above-mentioned alignment orders is compatible with the tree.
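
As a numerical companion to Questions 3 and 4 above, the following is a minimal Python sketch of both calculations: the odds ratio $\rho_{xy} = f_{xy}/(f_x f_y)$ and the first-order Markov transition estimate (count of xy divided by the count of all dinucleotides starting with x). The function name `dinucleotide_stats` and the toy sequence are illustrative assumptions, not part of the exam; the only exam figures used are the counts 261 and 2240 quoted in the explanation of Question 4.

```python
from collections import Counter


def dinucleotide_stats(seq):
    """Return (rho, transition) dictionaries keyed by dinucleotide.

    rho[xy]        = f_xy / (f_x * f_y)              (values > 1: over-represented)
    transition[xy] = count(xy) / count(x followed by any base)
    """
    seq = seq.lower()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_mono = sum(mono.values())
    total_di = sum(di.values())

    # Total number of dinucleotides that start with each nucleotide.
    row_totals = Counter()
    for pair, count in di.items():
        row_totals[pair[0]] += count

    rho, transition = {}, {}
    for pair, count in di.items():
        first, second = pair
        f_pair = count / total_di
        f_first = mono[first] / total_mono
        f_second = mono[second] / total_mono
        rho[pair] = f_pair / (f_first * f_second)
        transition[pair] = count / row_totals[first]
    return rho, transition


# Hypothetical toy sequence, used only to exercise the function.
toy_rho, toy_transition = dinucleotide_stats("cgcgtt")
print(toy_rho["cg"], toy_transition["gc"])

# Transition probability from the counts quoted in the exam explanation.
p_cg_exam = 261 / 2240
print(round(p_cg_exam, 4))
```

Applied to the full table of sixteen counts, the same two ratios would give the values discussed in the explanations of Questions 3 and 4.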

Question 7: Given the tree shown above, if it takes on average 2 seconds to align a pair of sequences, how long should it take to align all the sequences using a progressive strategy?
a. 2 x 6 = 12 seconds
b. 2^6 seconds
c. 6^2 seconds
d. 10 seconds

Question 8: In the following PROSITE profile, the circled value corresponds to the cost of
a. Aligning the considered profile position with a cysteine
b. Aligning the considered profile position with an INDEL
c. Aligning the considered profile position with ANY residue
d. Aligning the considered profile position with ANY charged residue

Question 9: Discuss the similarities and differences between the prediction of genes and the prediction of regulatory motifs, for example, in terms of the intrinsic difficulties that we face in each approach, in terms of computational techniques to detect signals and variations in the composition of the sequences that must be analyzed, in terms of comparative genomics, etc. (max. 300 words).

Computational Gene Prediction (CGP) and Regulatory Motif Finding (RMF) are complex tasks. CGP has to cope with the hierarchical structure that encodes protein sequences in the genome and with the combinatorial explosion that appears when trying to assemble the basic components of a gene (splice sites and coding biases of the sequence, coding and non-coding exons, alternatively spliced transcript isoforms, etc.). RMF, on the other hand, deals with smaller sequence segments that are recognized by transcription factors at the molecular level, which implies that different sequences can fold into a similar structure, but also that the same motif can incorporate different point mutations at distinct positions, making motifs highly variable at the sequence level.

Several computational approaches can solve pieces of the problem. For instance, Position Weight Matrices (PWMs) can be used in CGP to detect signals like splice sites, which define the boundaries of exons; they have also been extensively used to find sequence motifs in RMF, and several Transcription Factor Binding Site (TFBS) databases were built on sets of PWMs (like TransFac, Jaspar, Oreganno). Hidden Markov Models (HMMs) have been used to find sequence content biases (like coding potential, codon periodicity, etc.), as well as signals (donor and acceptor splice sites), or a combination of both (as happens with the Generalized HMMs that define a gene model where each node can be a combination of signal and content sensors).

Comparative Genomics (CG) techniques can help both CGP and RMF, as they highlight regions that have been more conserved than expected, since evolution tends to keep what is important for a biological function. Finding protein-coding genes benefits from whole-genome comparison, as changes in the nucleotide sequence can be synonymous at the protein sequence level and amino acid substitutions can be better modelled with a scoring matrix (like those used by BLAST-like tools). Phylogenetic Footprinting (PF), based on the comparison of genomic sequences of suitably distant species, can point out regions conserved due to a functional constraint; these can correspond to protein-coding features, but also to regulatory features and to non-canonical genes (also known as non-coding RNAs, ncRNAs). When species are too closely related, so that their genomic sequences have not had enough time to diverge, we can still take advantage of the CG approach by applying Phylogenetic Shadowing (PS).

Next-Generation Sequencing (NGS) methodologies provide a wealth of new experimental evidence to define gene structures, based on RNA-seq data, and to detect novel regulatory motifs, thanks to ChIP-seq approaches. However, both approaches require a reference genome to which sequencing reads are mapped in order to detect functional regions on that genome.

Question 10: The surname Barbaluenga is very rare in Spain. As a matter of fact there is only one single family. Mr and Mrs Barbaluenga have a single son (called Zifban Barbaluenga). Calculate the probability that this surname is lost in the next generation.
a. 25%
b. 37%
c. 63%
d. 75%

Explanation: Surnames are passed on to all male children by their father. One surname present in a single man is akin to a new mutation in a population. The surname will be lost in the next generation unless the man has a male child. The population of Spain is large (in practical terms infinite) and constant in size (at least over the short time of a generation). Thus, the average number of children per family is 2 and the distribution of family size (k) can be described using the Poisson distribution (ppt presentation 2, slide 10). The probability that a child is a male is 1/2, so a family of k children has no sons with probability $(1/2)^k$. The probability of extinction is:

$e^{-2}(1) + 2e^{-2}\left(\tfrac{1}{2}\right) + \frac{2^2}{2!}e^{-2}\left(\tfrac{1}{4}\right) + \dots + \frac{2^k}{k!}e^{-2}\left(\tfrac{1}{2}\right)^k + \dots = e^{-1} \approx 36.8\%$ (ppt presentation 2, slide 11)
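
The series above can be checked numerically. The sketch below sums the Poisson terms for a mean of 2 children and a probability of 1/2 that each child is male, and compares the result with the closed form $e^{-1}$; the function name `extinction_probability` and the truncation at 60 terms are assumptions made for this illustration.

```python
import math


def extinction_probability(mean_children=2.0, p_male=0.5, max_k=60):
    """Probability that a man leaves no sons when family size is Poisson.

    Sums P(K = k) * (1 - p_male)**k over k = 0..max_k, where K follows a
    Poisson distribution with the given mean.
    """
    total = 0.0
    for k in range(max_k + 1):
        poisson_k = math.exp(-mean_children) * mean_children ** k / math.factorial(k)
        total += poisson_k * (1.0 - p_male) ** k
    return total


p_lost = extinction_probability()
print(f"series sum  = {p_lost:.4f}")
print(f"closed form = {math.exp(-1):.4f}")
```

Both values agree to within rounding error, which matches option b (37%) after rounding to the nearest percent.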

Question 11: In the practical exercise, we calculated the (absolute) number of neutral mutations fixed in one million years in populations with four different sizes (N = 500, 10000, [...], [...]). Which of the following statements is true?
a. This number was identical in the four cases.
b. This number was higher in large populations than in small populations.
c. This number was higher in small populations than in large populations.
d. This number was higher in populations with intermediate size than in populations with extreme sizes.

Explanation: In a population of size N, the number of new neutral mutations arising per generation is 2Nµ, where µ stands for the rate of neutral mutation. The fixation probability of a unique neutral mutation is 1/(2N). Thus, the rate of neutral evolution (number of neutral mutations fixed per generation) is 2Nµ x 1/(2N) = µ and does not depend on population size. Therefore, the absolute number of neutral mutations fixed in 10^6 years (or generations) is identical in the four cases.

Question 12: Indicate which of the following statements about the human genome is false:
a. The current assembly of the human genome contains more than 350 gaps
b. The human genome has 20,000 protein-coding genes and 13,000 long non-coding RNAs
c. Exons represent 3% of the human genome but only 1.2% of the genome is coding
d. The repetitive content of the human genome is 35%

Question 13: Which of the following statements about the functional elements found in positions 73,031,000-73,087,000 of the human X chromosome is false?
a. There is a long non-coding intergenic RNA with several alternative transcripts
b. An active promoter has been found close to the transcription start site of the longest transcript produced from this genomic region
c. The transcription of this gene has been detected by RNA-Seq but no ESTs mapping to this region are currently known
d. The second exon of the longest transcript overlaps a transposable element insertion

Question 14: The observed nucleotide divergence between the sequences of two different species shows that there are 200 differences in 1000 studied positions (that is, 20% observed divergence). Is this value a good estimate of the species divergence?
a. Yes, the observed divergence is proportional to the time of divergence by the relation Div_obs = 2*s*T, where s is the substitution rate and T is the time to the ancestor.
b. No, the observed divergence must first be corrected for possibly recurrent mutations using the relation Div_real = -3/4 * ln(1 - 4/3 * Div_obs)
c. No, the observed divergence must be corrected for possibly recurrent mutations using an appropriate evolutionary model.

Question 15: A graphical representation of the relationships among several Drosophila species has been obtained for the gene Adh. A distance matrix was calculated using an appropriate evolutionary model, and the neighbor-joining (NJ) method was used for phylogenetic reconstruction. This phylogeny represents:
a. This is a gene tree representing the evolutionary divergence among species for this particular gene, where the branches are proportional to the number of substitutions and the topology indicates the relationships among species.
b. This is a species tree representing the real time of divergence among species. This tree is ultrametric and therefore the real times of divergence are defined by the length of the branches. This is a rooted tree.
c. This is a cladogram; the topology indicates the relationships among species although the divergence times are not scaled in this phylogenetic representation.

Question 16: The root mean square deviation (rmsd) calculated from the superimposition of the backbone α-carbon atoms of proteins A and B is 3.5 Å, whereas between proteins A and C it is 6.7 Å. Which protein (B or C) is more similar to protein A?
a. B
b. C

Question 17: The following equation calculates the van der Waals interaction energy in a molecular system:

$\sum_{i}^{\text{atoms}} \sum_{j \neq i} \frac{q_i q_j}{\varepsilon_0 r_{ij}}$

a. True
b. False

Question 18: Mark the sentences that are TRUE:
- The hydrophobic effect is driven by entropy.
- The peptide backbone has three freely rotatable dihedral angles per amino-acid residue.
- Natural polypeptides and proteins contain a single type of helical structure: the alpha helix.
- The folded structure of a protein is stabilized mainly by weak non-covalent interactions.
- Under given environmental conditions, the tertiary structure of a protein is determined by its primary structure.
- The tertiary structure of a protein is only dependent on its primary structure.
- The folding process of a protein is governed by thermodynamics (differences in enthalpy and entropy between the unfolded and folded states) and kinetics (free-energy barriers along the path).
- There is more than one type of DNA helix.
- X-ray diffraction experiments on protein crystals provide structural information consisting of inter-proton distances.
- The resolution of a crystallographic protein structure, given in Ångström, is not necessarily homogeneous throughout the structure but depends on internal dynamics.
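
Two of the calculations in Questions 11 and 14 above lend themselves to a quick numerical check: the independence of the neutral fixation rate from population size, and the correction formula quoted in option b of Question 14, which is the Jukes-Cantor distance. The sketch below is only illustrative. The function names, the mutation rate `mu` and the two larger population sizes are assumptions (the exam's own values for those sizes are not preserved here), and one generation per year is assumed, as in the explanation of Question 11.

```python
import math


def neutral_substitutions(pop_size, mu, generations):
    """Expected number of neutral mutations fixed over the given generations.

    New neutral mutations per generation: 2 * N * mu.
    Fixation probability of each one:     1 / (2 * N).
    """
    new_per_generation = 2 * pop_size * mu
    fixation_probability = 1 / (2 * pop_size)
    return new_per_generation * fixation_probability * generations


def jukes_cantor_distance(p_obs):
    """Correct an observed proportion of differences for multiple hits."""
    if not 0 <= p_obs < 0.75:
        raise ValueError("p_obs must lie in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1 - 4 * p_obs / 3)


mu = 1e-8                                  # hypothetical neutral mutation rate
sizes = [500, 10_000, 100_000, 1_000_000]  # last two sizes are hypothetical
fixed = [neutral_substitutions(n, mu, 1_000_000) for n in sizes]
print(fixed)

d_corrected = jukes_cantor_distance(200 / 1000)
print(round(d_corrected, 4))
```

The corrected distance is somewhat larger than the observed 20%, because some sites have changed more than once. Jukes-Cantor is only one of several substitution models that could be used for such a correction.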

Question 19: Explain the main concepts that differentiate Quantum Mechanical and Molecular Mechanics approaches.

QM:
- Explicit representation of electrons and nuclei
- Solves the Schrödinger equation
- Can deal with changes in chemical states
- Computationally demanding: small systems

MM:
- Parameters for atoms and their intra- and intermolecular interactions
- Uses the laws of classical mechanics
- Allows large conformational explorations
- Low computational cost: large systems

Points: 1 for three or more concepts correctly mentioned; 0.75 for two concepts correctly mentioned; 0.5 for only one; 0 for none.

Question 20: Could you briefly overview the limitations of geometry optimizations versus Molecular Dynamics?

Geometry optimization only characterizes the stable conformation of a molecule that is closest to the initial geometry (generally the most populated). Temperature is not considered (the result is a minimum on the potential energy surface), hence no dynamical effects are provided by these calculations. MD provides the trajectory of the molecules (motions) and generates an ensemble of conformations that allows statistical analysis.

Points: 1 if correctly described; 0.5 if approximate; 0.25 if the comparison is not made but one of the methods is well described; 0 if off topic.

Question 21: Which region(s) of a protein are the most challenging to model using homology modeling approaches?

The most challenging regions to model by HM are those with low sequence identity with the targets. Those are generally the loops and external regions of a protein (such as those involved in protein-protein or allosteric interactions).

Points: 1 if both terms are correctly mentioned; 0.75 if only one; 0.5 if approximate; 0.25 for unclear "sounds like" answers; 0 if off topic.

Question 22: A mass spectrometer gives us:
a. the mass of an analyte
b. the weight of an analyte
c. the mass-to-charge ratio of an analyte
d. the radiofrequency of an analyte

Question 23: Which sentence is NOT correct:
a. Electrospray ionization methods generate multiple charged states of an analyte
b. MALDI-TOF only generates mono-protonated ions
c. Electrospray ionization methods use solid samples
d. ESI and MALDI-TOF are the softest ionization methods

Question 24: Create a workflow using Galaxy that answers the following question: Which is the longest exon in chromosome 21 in humans? (note: use database hg19) (hint: Text Manipulation > Compute may help you)

Any of the following exons in chr21 is correct:
chr uc002ysb.1_cds_2_0_chr21_ _f
chr uc002ysc.3_cds_2_0_chr21_ _f
chr uc002yse.1_cds_2_0_chr21_ _f

This exon in chr21 is also accepted:
chr uc021wjf.1_exon_0_0_chr21_ _r

or this exon in chr19:
chr uc002mkp.3_exon_81_0_chr19_ _r

Question 25: Which web services have you used in the previous workflow?

The tools in Galaxy are not strictly web services (but under alternative definitions of web services, the data retrieval tools or the Galaxy platform as a whole could be considered web services).

Question 26: What are 'ID' and 'class' in HTML and what are the differences between them? Why are they so useful?

ID and class are attributes. The main difference is that an ID can be used only once per page, whereas a class can be repeated. Both are important for referencing elements in order to give them formatting or functionality.
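
The logic behind the Galaxy workflow of Question 24 (compute end minus start for each exon, then keep the largest value on the chromosome of interest) can also be sketched outside Galaxy. The following Python example is a hedged illustration, not the workflow itself: it assumes BED-like rows with at least chromosome, start and end columns, and both the function name `longest_feature` and the toy records are hypothetical.

```python
def longest_feature(bed_lines, chrom):
    """Return (length, name, start, end) of the longest feature on chrom.

    Each row is whitespace-separated: chrom, start, end and, optionally, name.
    Returns None if no feature is found on the requested chromosome.
    """
    best = None
    for line in bed_lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        feature_chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        if feature_chrom != chrom:
            continue
        length = end - start  # BED intervals are zero-based, half-open
        if best is None or length > best[0]:
            best = (length, name, start, end)
    return best


# Hypothetical records; real input would be an exon table exported for hg19.
toy_bed = [
    "chr21\t100\t250\texon_A",
    "chr21\t400\t1400\texon_B",
    "chr19\t10\t5000\texon_C",
    "chr21\t2000\t2300\texon_D",
]
result = longest_feature(toy_bed, "chr21")
print(result)
```

Without the chromosome filter, the longest record in the toy data would be the chr19 one, which is one way a chr19 exon could end up as the result of such a workflow.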