

Functional Genomics Midterm II (self-questions) 2/4/05

1) One of the obstacles in whole genome assembly is dealing with the repeated portions of DNA within the genome. How do repeats cause complications, and how are these complications dealt with in WGA projects? Use a diagram with each explanation.

2) Describe the steps taken for library construction of human DNA samples, starting with the human and ending with the initial in silico step of genome assembly. Use a diagram in your explanation.

3) This diagram represents: (Indicate all correct answers)
- a genome sequencing strategy that requires less redundant coverage than whole genome shotgun sequencing
- a genome sequencing strategy that requires less redundant coverage than clone-by-clone shotgun sequencing
- a genome sequencing strategy that follows a map first, sequence second progression
- a genome sequencing strategy that follows a sequence first, map second progression
- a hybrid genome sequencing strategy that incorporates both clone-by-clone sequencing and whole genome shotgun sequencing
- none of the above

4) Nobel Prize-winning scientist Sydney Brenner proposed that only ESTs should be sequenced and NOT the entire human genome. From what we have read and discussed about the composition of the human genome and ESTs, do you agree or disagree with Brenner's proposal for not sequencing the junk DNA? Why or why not?

5) The figure above outlines the processes used by Celera to sequence the human genome. Fill in the ovals and lines with the terms listed (one term will be used twice). For the numbered blanks, briefly describe each step.

6) What are EF-G, EF-Tu, HSP70, RecA, RpoB and rRNA, and why were so many used? How were the listed phylogenetic groups determined?

7) Shown below is a diagram of a plasmid vector from a clone in a 2 kb plasmid library that has been selected for sequencing.
A. Draw an arrow(s) where sequencing primers will bind.
B. Shade in the approximate area of the insert that will be sequenced (assume 500 bp per sequencing reaction).
C. Would this clone have been specifically selected or randomly selected from the library for sequencing in (a) a clone-by-clone approach (b) a whole genome assembly approach? Briefly explain why.
D. If this clone came from a 2 kb library that provides 0.3-fold clone coverage of a 3 Gbp genome, (a) How many clones are in the library? (b) What is the sequence coverage of this library? Assume 500 bp of sequence are generated per read.

8) Human genome sequencing using the Whole Genome Assembly approach takes overcollapsed contigs and further resolves them using the Unitigger/Discriminator. Genome regions sequenced using Environmental Genome Shotgun sequencing methods could have a depth of coverage that would be considered overcollapsed in the human genome project, but with the Environmental Genome technique these contigs are assembled right away instead of being resolved further.
A. What does overcollapsed mean, and what does the Unitigger/Discriminator do to overcollapsed contigs?
B. Why doesn't the environmental genome project need to resolve highly covered contigs?
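The arithmetic behind question 7D can be sketched as follows. This is a minimal sketch, not an answer key: it assumes clone coverage means total insert length divided by genome size, and sequence coverage means total read length divided by genome size. The function names and the `reads_per_clone` parameter are illustrative assumptions; the question does not say whether one or both ends of each insert are read, so both cases are shown.

```python
def clones_in_library(clone_coverage, genome_bp, insert_bp):
    """Number of clones needed so that total insert length
    equals clone_coverage times the genome size."""
    return clone_coverage * genome_bp / insert_bp


def sequence_coverage(n_clones, reads_per_clone, read_bp, genome_bp):
    """Total sequenced bases divided by the genome size."""
    return n_clones * reads_per_clone * read_bp / genome_bp


GENOME_BP = 3_000_000_000   # 3 Gbp genome (from question 7D)
INSERT_BP = 2_000           # 2 kb library (from question 7D)
READ_BP = 500               # 500 bp per read (from question 7D)

n_clones = clones_in_library(0.3, GENOME_BP, INSERT_BP)
print(f"clones in library: {n_clones:,.0f}")

# reads_per_clone is an assumption: 1 = one end read, 2 = both ends read.
for reads_per_clone in (1, 2):
    cov = sequence_coverage(n_clones, reads_per_clone, READ_BP, GENOME_BP)
    print(f"{reads_per_clone} read(s) per clone -> {cov:.3f}-fold sequence coverage")
```

Sequence coverage is always lower than clone coverage here because each read covers only 500 bp of a 2 kb insert.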

9) How did Craig Venter and his colleagues use the scaffolds they obtained to differentiate sequences that represented discrete species from sequences that represented a population continuum of species? Which of the two does the figure below represent? Briefly explain.

10) You are one of the CSA's most important employees because your job is to write a letter to Science Magazine explaining why the CSA sequence assembly is so much better than the sequence the PFP people came up with. You were given these three pictures to make your argument with. How do you interpret this data, and more importantly, what are you going to write to Science Magazine?

11) Describe the differences between rocks and stones in the assembly of the human genome sequence. Describe why they were necessary and which the researchers preferred to use. Why didn't they use any pebbles?

12) Dr. Venter et al. used 2, 10, and 50 kbp plasmid libraries to sequence the human genome. Why is it important to know the size of the clones?

13) If the human genome has never been sequenced before, how do you know if your data is correct? The scaffolds are mapped to the chromosomes using physical mapping data such as high-density STS maps. These maps are good for mapping scaffolds to chromosomes, but are not as good for determining the correctness of the whole assembly. Why can't STSs be used to determine correctness, what are their limitations considering where the data comes from, and what is a better way to determine the accuracy of the finished genome? When answering, keep in mind where the raw sequencing data comes from and how many steps separate that data from the finished product.

14) When whole genome shotgun sequencing of one organism yields statistically improbable clone coverage at one point, this data is considered less useful than a region that has less coverage. Environmental shotgun sequencing seems to be the opposite: high clone coverage is considered good, whereas low clone coverage is not as important. What can you learn from knowing only the depth of clone coverage when sequencing one organism, or in environmental sequencing?

15) The authors of the Human Genome paper took the time to explain specific parts of the WGA and CSA, like the discriminator, screener, and unitigger. Other than the fact that the WGA and CSA processes have already been explained in the Human Genome paper, why was the Environmental paper less descriptive about the specific processes of the screener, unitigger, and discriminator?

16) Consider the following table.
              WGA     CSA     PFP
% sequence    91%     92.2%   92.5%
% Gaps        9%      7.8%    12.9%

What differences in sequencing strategies are most likely responsible for the differences in percentage of gaps? Why does CSA have the fewest gaps? Why does the public project have the most?

17) Answer the following questions about the Clone-by-Clone Approach:
a) Briefly explain why Bacterial Artificial Chromosomes were used instead of Yeast Artificial Chromosomes.
b) If you want to shotgun sequence a 30 kbp insert, about how many base pairs of sequence do you need to have sufficient sequence coverage?
c) If you need to sequence across a 200 kbp insert, about how many sequencing reactions do you need to do?

18) Answer the following questions about the Whole Genome Assembly Approach:
a) In the Human Genome Project, the plasmid libraries were constructed in three different insert sizes. How did they pick out inserts that are about 2 kbp, 10 kbp and 50 kbp?
b) Why does the PFP data have to go through the Shredder before the whole genome assembly process?
c) What is the difference between a contig and a scaffold?
d) How do you identify overcollapsed unitigs? And how does WGA solve this problem?
e) Briefly describe the two methods WGA used to close the gaps between contigs.

19) Whole genome shotgun sequencing resolved the issue of overcollapsed regions by identifying repeat-induced overlaps using the discriminator. How could this be detrimental in environmental genome shotgun sequencing, and how did the Sargasso Sea team resolve this problem?

20) The Sargasso Sea study selected for organisms between 0.2 and 0.8 µm. What were they trying to avoid and why, and why were they focusing on organisms of that size?

21) Regarding Environmental Genome Shotgun Sequencing of the Sargasso Sea: Diversity and species richness must be resolved from the abundant sequence data. When defining a species, it is the accepted standard to use rRNA genes for identification of uncultured microbes.
a) What are the disadvantages associated with using rRNA for estimates of species diversity and abundance?
b) What did Venter et al. do to negate these disadvantages?

22) Regarding The Sequence of the Human Genome: One day this spring, Venter shows up in Bellingham Bay in his sailboat with a variety of obscure sea creatures. Unfortunately, on the way to Bellingham from the Sargasso Sea, he ran into some pirates who stole all of his sequencing equipment. He really wants to know the sequence of one of these obscure organisms right away, so he offers a million-dollar prize to the first person who can accurately sequence its genome. You remember three different methods of sequencing from your Genomics class last winter (WGA, CSA and clone-by-clone) and you want to get started right away. Which method do you use and why?
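For questions 17b and 17c, the relationship between insert size, fold coverage and number of reads can be sketched as below. The target fold coverage is NOT given in the questions; `ASSUMED_FOLD` is a placeholder to be replaced with whatever value the course treats as sufficient. The 500 bp read length is borrowed from question 7, and the function names are illustrative.

```python
import math


def bases_needed(insert_bp, fold_coverage):
    """Total sequenced bases required for a given fold coverage."""
    return insert_bp * fold_coverage


def reads_needed(insert_bp, fold_coverage, read_bp):
    """Number of sequencing reactions, rounded up to a whole read."""
    return math.ceil(insert_bp * fold_coverage / read_bp)


ASSUMED_FOLD = 8       # hypothetical target coverage, not from the source
ASSUMED_READ_BP = 500  # read length taken from question 7

print(bases_needed(30_000, ASSUMED_FOLD))                    # 17b, under the assumed fold
print(reads_needed(200_000, ASSUMED_FOLD, ASSUMED_READ_BP))  # 17c, under the assumed fold
```

Both results scale linearly with the assumed fold coverage, so the choice of that value dominates the answer.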

23) Short essay. With the Human Genome Project behind us, we know several ways of dealing with gaps in the genomic sequence. The environmental shotgun sequencing in the Sargasso Sea is sure to reveal genomes full of gaps. What could be the cause of these gaps, and what are several ways that we could deal with them using the whole genome mapping approach?

24) Short answer. In the Environmental Shotgun Sequencing paper, Figure 5B shows a series of sequences that do not quite match the consensus sequence of the Prochlorococcus scaffold. The third block of sequences from the top shows a peculiar vertical pattern of T>C transitions. What could this suggest about these haplotypes from an evolutionary standpoint?

25) Whole-genome assembly (WGA) and compartmentalized shotgun assembly (CSA) are approaches that were used to sequence the human genome. Pretend that you are working in a lab and are about to begin sequencing a new organism. Give a brief explanation of each of these approaches, then choose the approach you are going to use for your experiment and give an argument for why you chose that approach.

26) Tables 2 and 3 from the Environmental Genome Shotgun Sequencing of the Sargasso Sea article are needed for this question.
a. Explain the relevance of Table 2 to the research performed on the Sargasso Sea.
b. What were the biologists trying to show with Table 3?

27) What type of plasmids were found in the Sargasso Sea and what function were they thought to have? How did they know they were plasmids?

28) About Figure 6 in Sargasso Sea: What was the main genetic marker used and why did they use more? What evidence of eukaryotes was found in the sample?

29) Imagine you are a grad student working in a lab, and the main part of your research is to sequence and analyze unknown genomes. One day you receive samples of soil from a 3rd world country where the soil used for cultivating the crop has changed its properties, so that it no longer produces crops vital for that country. The preliminary analyses show that the changes are due to the activities of unknown organisms populating the soil, and not to hazardous human activities (such as a secret, unauthorized nuclear waste dump). Your task, as a former student of Dr. Young's functional genomics course, is to analyze and identify the mystery trouble bug.
29A Outline a plan/algorithm for obtaining the necessary genomic information.
29B Propose criteria or a method by which the organism(s) might be identified.

30) In Venter's research on the Sargasso Sea, how does the data in Table 1 relate to Fig. 6? What methods might you use to determine if some of the genes had eukaryotic origins, and/or what methods did these researchers use? Why would these methods work?

31) In Whole Genome Assembly, why is it important to have both a Unitigger and a Discriminator? In your answer be sure to include the function of each step.

32) In the first project by Venter, to sequence the human genome, all sequence data was from a single organism. This means his task was "simply" to line up all of the contigs they obtained. In his attempt at environmental genome sequencing, the sequence data is from hundreds or thousands of species. What techniques did Venter employ to assign the sequence reads to their appropriate genomes?

33) Compare and contrast the assembly and validation analysis done in the human sequencing project with that done in the environmental genomic project.

34) A research opportunity on genome sequencing was announced. The announcement stated that the research was going to use a yeast artificial chromosome (YAC) as the vector. To apply for this research opportunity, you are asked to write a proposal to make the experiment more efficient. Give an outline of that proposal.

35) An analysis of microbial populations using whole genome shotgun sequencing has been completed by a researcher. In their analysis, they want to depict the species richness and abundance in the sample. In doing so, they intend to base their entire analysis on sequence similarity with known rRNA sequences. Would you agree with this approach, and would you do it similarly? Explain your logic.
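Questions 14, 18d and 31 turn on the idea that coverage depth can be "statistically improbable". One way to make that idea concrete is the toy check below. It is a simplified illustration under a Poisson model of read depth, NOT the statistic used by the Celera discriminator; the function names and the `alpha` threshold are hypothetical.

```python
import math


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


def looks_overcollapsed(observed_depth, expected_depth, alpha=1e-3):
    """Flag a region whose depth is improbably high for unique sequence.

    alpha is an arbitrary illustrative cutoff, not a published value.
    """
    return poisson_tail(observed_depth, expected_depth) < alpha


# Expected depth of 8 reads per position is an assumed example value.
for depth in (9, 25):
    print(depth, looks_overcollapsed(depth, 8))
```

In a single-genome project, a flagged region suggests that copies of a repeat were merged. In an environmental sample, the same depth may simply reflect an abundant organism, which is why the two projects treat high coverage differently.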