iGenetics: A Molecular Approach, Peter J. Russell, Third Edition


Pearson Education Limited, Edinburgh Gate, Harlow, Essex CM20 2JE, England, and Associated Companies throughout the world.

Visit us on the World Wide Web at:

© Pearson Education Limited 2014

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

ISBN 10: ISBN 13:

British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library.

Printed in the United States of America

with chemicals (e.g., alkaline conditions) and/or heat is critical to many methods used to produce and analyze cloned DNA. Give three examples of methods that rely on complementary base pairing, and explain what role complementary base pairing plays in each of these methods.

3 Restriction endonucleases are naturally found in bacteria. What purposes do they serve?

*4 A new restriction endonuclease is isolated from a bacterium. This enzyme cuts DNA into fragments that average 4,096 base pairs long. Like many other known restriction enzymes, the new one recognizes a sequence in DNA that has twofold rotational symmetry. From the information given, how many base pairs of DNA constitute the recognition sequence for the new enzyme?

*5 An endonuclease called AvrII ("a-v-r-two") cuts DNA whenever it finds the sequence
5'-CCTAGG-3'
3'-GGATCC-5'
a. About how many cuts would AvrII make in the human genome, which contains about $3 \times 10^9$ base pairs of DNA and in which 40% of the base pairs are G-C?
b. On average, how far apart (in base pairs) will two AvrII sites be in the human genome?
c. In the cellular slime mold Dictyostelium discoideum, about 80% of the base pairs in regions between genes are A-T. On average, how far apart (in base pairs) will two AvrII sites be in these regions?

6 About 40% of the base pairs in human DNA are G-C. On average, how far apart (in base pairs) will the following sequences be?
a. two BamHI sites
b. two EcoRI sites
c. two NotI sites
d. two HaeIII sites

*7 The average size of fragments (in base pairs) observed after genomic DNA from eight different species was individually cleaved with each of six different restriction enzymes is shown in Table B.
a. Assuming that each genome has equal amounts of A, T, G, and C, and that on average these bases are uniformly distributed, what average fragment size is expected following digestion with each enzyme?
b. How might you explain each of the following?
i. There is a large variation in the average fragment sizes when different genomes are cut with the same enzyme.
ii. There is a large variation in the average fragment sizes when the same genome is cut with different enzymes that recognize sites having the same length (e.g., ApaI, HindIII, SacI, and SspI).
iii. Both SrfI and NotI, which each recognize an 8-bp site, cut the Mycobacterium genome more frequently than SspI and HindIII, which each recognize a 6-bp site.

Table B
                             Enzyme and Recognition Sequence
Species                      ApaI     HindIII  SacI     SspI     SrfI      NotI
                             GGGCCC   AAGCTT   GAGCTC   AATATT   GCCCGGGC  GCGGCCGC
Escherichia coli             68,000   8,000    31,000   2, , ,000
Mycobacterium tuberculosis   2,000    18,000   4,000    32,000   10,000    4,000
Saccharomyces cerevisiae     15,000   3,000    8,000    1, , ,000
Arabidopsis thaliana         52,000   2,000    5,000    1,000    no sites  610,000
Caenorhabditis elegans       38,000   3,000    5, ,110, ,000
Drosophila melanogaster      13,000   3,000    6, ,000 83,000
Mus musculus                 5,000    3,000    3,000    3, , ,000
Homo sapiens                 5,000    4,000    5,000    1, , ,

*8 What features are required in all vectors used to propagate cloned DNA? What different types of cloning vectors are there, and how do these differ from each other?

9 The plasmid pBluescript II is a plasmid cloning vector used in E. coli. What features does it have that make it useful for constructing and cloning recombinant DNA molecules? Which of these features are particularly useful during the sequencing of a genome?

*10 A colleague has sent you a 2-kb DNA fragment excised from a plasmid cloning vector with the enzyme PstI (see Table 1 for a description of this enzyme and the restriction site it recognizes).
a. List the steps you would take to clone the DNA fragment into the plasmid vector pBluescript II (shown in Figure 4), and explain why each step is necessary.
b. How would you verify that you have cloned the fragment?

*11 E. coli, like all bacterial cells, has its own restriction endonucleases that could interfere with the propagation of foreign DNA in plasmid vectors. For example, wild-type E. coli has a gene, hsdR, that encodes a restriction endonuclease that cleaves DNA that is not methylated at certain A residues. Why is it important to inactivate this enzyme by mutating the hsdR gene in strains of E. coli that will be used to propagate plasmids containing recombinant DNA?

12 E. coli is a commonly used host for propagating DNA sequences cloned into plasmid vectors. Wild-type E. coli turns out to be an unsuitable host, however: the plasmid vectors are engineered, and so is the host bacterium. For example, nearly all strains of E. coli used for propagating recombinant DNA molecules carry mutations in the recA gene. The wild-type recA gene encodes a protein that is central to DNA recombination and DNA repair. Mutations in recA eliminate general recombination in E. coli and render E. coli sensitive to UV light. How might a recA mutation make an E. coli cell a better host for propagating a plasmid carrying recombinant DNA? (Hint: What type of events involving recombinant plasmids and the E. coli chromosome will recA mutations prevent?) What additional advantage might there be to using recA mutants, considering that some of the E. coli cells harboring a recombinant plasmid could accidentally be released into the environment?

*13 Genomic libraries are important resources for isolating genes and for studying the functional organization of chromosomes. List the steps you would use to make a genomic library of yeast in a plasmid vector. In what fundamental way would you modify this procedure if you were making the library in a BAC vector?

14 Three students are working as a team to construct a plasmid library from Neurospora genomic DNA. They want the library to have, on average, about 4-kb inserts. Each student proposes a different strategy for constructing the library, as follows:

Mike: Cleave the DNA with a restriction enzyme that recognizes a 6-bp site, which appears about once every 4,096 bp on average and leaves sticky, overhanging ends.
Ligate this DNA into the plasmid vector cut with the same enzyme, and transform the ligation products into bacterial cells.

Marisol: Partially digest the DNA with a restriction enzyme that cuts DNA very frequently, say once every 256 bp, and that leaves sticky overhanging ends. Select DNA that is about 4 kb in size (e.g., purify fragments this size after the products of the digest are resolved by gel electrophoresis). Then, ligate this DNA to a plasmid vector cleaved with a restriction enzyme that leaves the same sticky overhangs and transform the ligation products into bacterial cells.

Hesham: Irradiate the DNA with ionizing radiation, which will cause double-stranded breaks in the DNA. Determine how much irradiation should be used to generate, on average, 4-kb fragments and use this dose. Ligate linkers to the ends of the irradiated DNA, digest the linkers with a restriction enzyme to leave sticky overhanging ends, ligate the DNA to a similarly digested plasmid vector, and then transform the ligation products into bacterial cells.

Which student's strategy will ensure that the inserts are representative of all of the genomic sequences? Why are the other students' strategies flawed?

*15 Some restriction enzymes leave sticky ends, while others leave blunt ends. It is more efficient to clone DNA fragments with sticky ends than DNA fragments with blunt ends. What is the best way to efficiently clone a set of DNA fragments having blunt ends?

*16 The human genome contains about $3 \times 10^9$ bp of DNA. How many 200-kb fragments would you have to clone into a BAC library to have a 90% probability of including a particular sequence?

17 A biochemist studies a protein with antifreeze properties that he found in an Antarctic fish. After determining part of the protein's amino acid sequence, he decides he would like to obtain the DNA sequence of its gene. He has no experience in genome analysis and mistakenly thinks he needs to sequence the entire genome of the fish to obtain this information. When he asks a more knowledgeable colleague about how to sequence the fish genome, she describes the whole-genome shotgun approach and the need to obtain about 7-fold coverage. The biochemist decides that this approach provides far more information than he needs and so embarks on an alternate approach he thinks will be faster. He decides to sequence individual clones chosen at random from a library made with genomic DNA from the Antarctic fish. After sequencing the insert of a clone, he will analyze it to see if it contains an ORF with the sequence of amino acids he knows are present in the antifreeze protein. If it does, he will have found what he wants and will not sequence any additional clones. If it does not, he plans to keep obtaining and analyzing the sequences of individual clones sequentially until he finds a clone that has the sequence of interest. He thinks this approach will let him sequence fewer clones and be faster than the whole-genome shotgun approach.
He must decide which vector to use in building his genomic library. He can construct a library made in the pBluescript II vector with inserts that are, on average, 7 kb, a library made in the vector pBeloBAC11 with inserts that are, on average, 200 kb, and a library made in a YAC vector with inserts that are, on average, 1 Mb. He assumes that any library he constructs will have an equally good representation of the $2 \times 10^9$ base pairs in a haploid copy of the fish genome, that the antifreeze gene is less than 2 kb in size, and that (somehow) he can easily obtain the sequence of the DNA inserted into a clone.
a. Given the biochemist's assumptions, what is the chance that he will find the antifreeze gene if he sequences the insert of just one clone from each library? Based on this information, which library should he use if he wants to sequence the fewest number of clones?
b. When he tries to sequence the insert of the first clone he picks from the library suggested in (a), he realizes that he does not enjoy this type of lab work. So, he hires a technician with experience in genomics, assigns the project to her, and goes to Antarctica to catch more fish. He tells her to sequence the inserts of enough clones to be 95% certain of obtaining at least one insert containing the antifreeze gene and says he will analyze all of the sequence data for the presence of the antifreeze gene after he returns. How many clones should she sequence to satisfy this requirement if he constructed the genomic library in a plasmid vector? a BAC vector? a YAC vector?
c. What advantages and disadvantages does each of the different vectors have for constructing libraries with cloned genome DNA?
d. Suppose the Antarctic fish has a very AT-rich genome and the biochemist propagated the genomic library using E. coli. Will the library be representative of all the sequences in the genome of the fish?

*18 When Celera Genomics sequenced the human genome, they obtained 13,543,099 reads of plasmids having an average insert size of 1,951 bp, and 10,894,467 reads of plasmids having an average insert size of 10,800 bp.
a. Dideoxy sequencing provides only about nucleotides of sequence. About how many nucleotides of sequence did Celera obtain from sequencing these two plasmid libraries? To what fold coverage does this amount of sequence information correspond?
b. Why did they sequence plasmids from two libraries with different-sized inserts?
c. They sequenced only the ends of each insert. How did they determine the sequence lying between the sequenced ends?

*19
a. What features of pBluescript II facilitate obtaining the sequence at the ends of an insert?
b. Devise a strategy to obtain the entire sequence of a 7-kb insert in pBluescript II.
c. Devise a strategy to obtain the entire sequence of a 200-kb insert in pBeloBAC11.

20 Explain how the whole-genome shotgun approach to sequencing a genome differs from the biochemist's approach described in Question 17. What information does it provide that the biochemist's approach does not? What does it mean to obtain 7-fold coverage, and why did his colleague advise him to do this?

*21 In a sequencing reaction using dideoxynucleotides that are labeled with different fluorescent dyes, the DNA chains produced by the reaction are separated by size using capillary gel electrophoresis and then detected by a laser eye as they exit the capillary. A computer then converts the differently colored fluorescent peaks into a pseudocolored trace. Suppose green is used for A, black for G, red for T, and blue for C. What pattern of peaks do you expect to see on a sequencing trace if you carry out a dideoxy sequencing reaction after the primer 5'-CTAGG-3' is annealed to the following single-stranded DNA fragment?
3'-GATCCAAGTCTACGTATAGGCC-5'

22 How does pyrosequencing differ from dideoxy chain-termination sequencing? What advantages does it have for large-scale sequencing projects?

23 Do all SNPs lead to an alteration in phenotype? Explain why or why not.

24 Researchers at Perlegen Sciences sought to identify tag SNPs on human chromosome 21. After determining the genotypes at 24,047 common SNPs in 20 hybrid cell lines containing a single, different human chromosome 21, they used computerized algorithms to identify haplotypes containing between 2 and 114 SNPs that cover the entire chromosome. A total of 2,783 tag SNPs were selected from SNPs within these blocks.
a. What is a SNP marker?
b. How do haplotypes arise in members of a population?
c. What is a hapmap?
d. What is a tag SNP?
e. What advantages were there for the researchers to use hybrid cell lines instead of genomic DNA from 20 different individuals?
f. The 20 individuals whose chromosome 21 was used in this analysis were unrelated and had different ethnic origins. Do you expect the haplotypes and number of tag SNPs to differ if
i. the cell lines were established from blood samples drawn at a large family reunion?
ii. the cell lines were established from unrelated individuals, but their ancestors originated in the same geographical region?

*25 A set of hybrid cell lines containing a single copy of the same human chromosome from 10 different individuals was genotyped for 26 SNPs, A through Z. The SNPs are present on the chromosome in the order A, B, C, ... Z. Table C lists the SNP alleles present in each cell line. State which SNPs can serve as tag SNPs, and which haplotypes they identify. What is the minimum number of tag SNPs needed to differentiate between the haplotypes present on this chromosome?

Table C
Cell Line
A1  A1  A2  A3  A1  A3  A2  A3  A1  A2
B1  B1  B2  B3  B2  B3  B2  B3  B1  B2
C3  C3  C1  C2  C1  C2  C1  C2  C3  C1
D4  D4  D3  D2  D1  D2  D3  D2  D4  D3
E1  E1  E2  E2  E3  E2  E2  E2  E1  E2
F2  F1  F2  F2  F2  F1  F2  F2  F2  F2
G3  G2  G3  G3  G1  G2  G1  G3  G1  G3
H1  H1  H1  H1  H2  H1  H2  H1  H2  H1
I3  I1  I3  I3  I2  I1  I2  I3  I2  I3
J2  J1  J2  J2  J2  J1  J2  J2  J2  J2
K1  K1  K1  K1  K2  K1  K2  K1  K1  K1
L2  L1  L2  L2  L1  L1  L1  L2  L2  L2
M1  M1  M2  M1  M1  M2  M2  M1  M2  M1
N2  N2  N1  N2  N2  N1  N1  N2  N1  N2
O1  O1  O1  O1  O1  O2  O1  O1  O1  O2
P2  P1  P2  P1  P2  P1  P1  P1  P2  P1
Q2  Q2  Q2  Q2  Q2  Q1  Q2  Q2  Q2  Q1
R3  R1  R3  R1  R3  R2  R1  R1  R3  R2
S1  S2  S1  S2  S1  S1  S2  S2  S1  S1
T1  T1  T1  T1  T1  T1  T1  T1  T1  T1
U2  U1  U2  U1  U2  U2  U1  U1  U2  U2
V2  V2  V2  V2  V2  V2  V2  V2  V2  V2
W2  W3  W1  W2  W1  W3  W1  W1  W3  W1
X1  X2  X1  X1  X3  X2  X3  X1  X2  X3
Y2  Y1  Y4  Y2  Y3  Y1  Y3  Y4  Y1  Y3
Z1  Z1  Z2  Z1  Z2  Z1  Z2  Z2  Z1  Z2

26 Some features that we commonly associate with racial identity, such as skin pigmentation, hair shape, and facial morphology, have a complex genetic basis. However, it turns out that these features are not representative of the genetic differences between racial groups: individuals assigned to different racial categories share many more DNA polymorphisms than not, supporting the contention that race is a social and not a biological construct. How could you use DNA chips to quantify the percentage of SNPs that are shared between individuals assigned to different racial groups?

*27 Mutations in the dystrophin gene can lead to Duchenne muscular dystrophy. The dystrophin gene is among the largest known: it has a primary transcript that spans 2.5 Mb, and it produces a mature mRNA that is about 14 kb. Many different mutations in the dystrophin gene have been identified. What steps would you take if you wanted to use a DNA microarray to identify the specific dystrophin gene mutation present in a patient with Duchenne muscular dystrophy?

28 Three of the steps in the analysis of a genome's sequence are assembly, finishing, and annotation. What is involved in each step, and how do they differ from each other?

29 What is a cDNA library, and from what cellular material is it derived? How is a cDNA synthesized, and how do the steps used to clone a cDNA differ from the steps used to clone genomic DNA? How are cDNA sequences used to help annotation of a sequenced genome?

*30 Eukaryotic genomes differ in their repetitive DNA content. For example, consider the typical euchromatic 50-kb segment of human DNA that contains the human β T-cell receptor. About 40% of it is composed of various genome-wide repeats, about 10% encodes three genes (with introns), and about 8% is taken up by a pseudogene. Compare this to the typical 50-kb segment of yeast DNA containing the HIS4 gene. There, only about 12% is composed of a genome-wide repeat, and about 70% encodes genes (without introns). The remaining sequences in each case are untranscribed and either contain regulatory signals or have no discernible information. Whereas some repetitive sequences can be interspersed throughout gene-containing euchromatic regions, others are abundant near centromeres. What problems do these repetitive sequences pose for sequencing eukaryotic genomes? When can these problems be overcome, and how?

31 What is the difference between a gene and an ORF? Explain whether all ORFs correspond to a true gene, and if they do not, what challenges this poses for genome annotation.

*32 Once a genomic region is sequenced, computerized algorithms can be used to scan the sequence to identify potential ORFs.
a. Devise a strategy to identify potential prokaryotic ORFs by listing features accessible by an algorithm checking for ORFs.
b. Why does the presence of introns within transcribed eukaryotic sequences preclude direct application of this strategy to eukaryotic sequences?
c. The average length of exons in humans is about bp, while the length of introns can range from about 100 to many thousands of base pairs. What challenges do these findings pose for identifying exons in uncharacterized regions of the human genome?
d. How might you modify your strategy to overcome some of the problems posed by the presence of introns in transcribed eukaryotic sequences?
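As an illustration of the kind of scan Question 32 refers to, here is a minimal Python sketch that looks for ATG-to-stop runs in all six reading frames of an intron-free sequence. The function names and the `min_codons` threshold are hypothetical choices made for this sketch, not taken from the text, and the scan uses only start codons, stop codons, and length; it is one possible starting point, not a complete gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100):
    """List candidate ORFs as (strand, start, end, orf_sequence).

    Each frame is read codon by codon. The first ATG after the previous
    stop opens a candidate; the next in-frame stop codon closes it.
    Coordinates are 0-based on the strand that was scanned, and `end`
    is exclusive and includes the stop codon. `min_codons` (an assumed,
    adjustable cutoff) counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


# Toy sequence: one short ORF (ATG, five GCT codons, TAA) in frame 2 of the + strand.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

Because the scan requires an uninterrupted in-frame run from start to stop, it breaks down as soon as introns split a coding region, which is the difficulty parts (b) through (d) ask about.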
33 Annotation of genomic sequences makes them much more useful to researchers. What features should be included in an annotation, and in what different ways can they be depicted? For some examples of current annotations in databases, see the following websites: (Drosophila) (Arabidopsis) (humans) (humans)

*34 One powerful approach to annotating genes is to compare the structures of cDNA copies of mRNAs to the genomic sequences that encode them. Indeed, a large collaboration involving 68 research teams analyzed 41,118 full-length cDNAs to annotate the structure of 21,037 human genes (see ).
a. What types of information can be obtained by comparing the structures of cDNAs with genomic DNA?
b. During the synthesis of cDNA (see Figure 15), reverse transcriptase may not always copy the entire length of the mRNA and so a cDNA that is not full-length can be generated. Why is it desirable, when possible, to use full-length cDNAs in these analyses?
c. The research teams characterized the number of loci per Mb of DNA for each chromosome. Among the autosomes, chromosome 19 had the highest ratio of 19 loci per Mb while chromosome 13 had the lowest ratio of 3.5 loci per Mb. Among the sex chromosomes, the X had 4.2 loci per Mb while the Y had only 0.6 loci per Mb. What does this tell you about the distribution of genes within the human genome? How can these data be reconciled with the idea that chromosomes have gene-rich regions as well as gene deserts?
d. When the research teams completed their initial analysis, they were able to map 40,140 cDNAs to the available human genome sequence. Another 978 cDNAs could not be mapped. Of these 978 cDNAs, 907 cDNAs could be roughly mapped to the mouse genome. Why might some (human) cDNAs be unable to be mapped to the human genome sequence that was available at the time although they could be mapped to the mouse genome sequence? (Hint: Consider where errors and limited information might exist.)

*35 How has genomic analysis provided evidence that Archaea is a branch of life distinct from Bacteria and Eukarya?

36 The genomes of many different organisms, including bacteria, rice, and dogs, have been sequenced. Choose three phylogenetically diverse organisms. Compare the rationales for sequencing their genomes, and describe what we have learned from sequencing each genome.

37 In which type of organisms does gene number appear to be related to genome size? Explain why this is not the case in all organisms.

38 The C-value paradox states that there is no obvious relationship between an organism's haploid DNA content and its organizational and structural complexity. Discuss, citing data from the genome sequencing, whether there is also a gene-number paradox or a gene-density paradox.

39 In the United States, 3–5% of public funds used to support the Human Genome Project were devoted to research to address its ethical, legal, social, and policy implications. Some of the results are described in the website elsi/elsi.shtml. After exploring this website, answer the following questions.
a. Summarize the main ethical, legal, social, and policy issues associated with the Human Genome Project.
b. Why is legislation necessary to protect an individual's genetic privacy? What such legislation currently exists?
c. What are the pros and cons of gene testing?
d. Both presymptomatic and symptomatic individuals are subject to gene testing for an inherited disease. How are gene tests used in each situation, and how do the concerns about using gene testing differ in these situations?
e. Are laboratories that conduct genetic testing regulated by law?

Solutions to Selected Questions and Problems

2 Examples of methods that utilize the hydrogen bonding in complementary base pairing include: (1) the binding of complementary sticky ends present in a cloning vector and a DNA fragment prior to their ligation by DNA ligase; (2) the annealing of a labeled nucleic acid to a complementary single-stranded DNA fragment on a microarray; (3) the annealing of an oligo(dT) primer to a poly(A) tail during the synthesis of cDNA from mRNA; and (4) the annealing of a primer to a template during a DNA sequencing reaction. In each case, base pairing allows for nucleotides to interact in a sequence-specific manner essential for the procedure's success. For example, the binding of a primer to a template at the start of a DNA sequencing reaction requires complementary base pairing between the sequences in the primer and the template, which in turn defines where the DNA sequencing reaction will start.

4 The average length of the fragments produced indicates how often, on average, the restriction site appears. If the DNA is composed of equal amounts of A, T, C, and G, the chance of finding one specific base pair (A-T, T-A, G-C, or C-G) at a particular site is $1/4$. The chance of finding two specific base pairs at a site is $(1/4)^2$. In general, the chance of finding $n$ specific base pairs at a site is $(1/4)^n$. Here, $1/4{,}096 = (1/4)^6$, so the enzyme recognizes a 6-bp site.

5
a. Since 40% of the genome is composed of G-C pairs, $P(G) = P(C) = 0.20$ and $P(A) = P(T) = 0.30$. Therefore, $P(\text{CCTAGG}) = (0.20)^4 \times (0.30)^2 = 0.000144$. A genome with $3 \times 10^9$ base pairs will have about $3 \times 10^9$ different groups of 6-bp sequences. Thus, the number of sites is $(0.000144) \times (3 \times 10^9) = 432{,}000$.
b. $3 \times 10^9$ bp / 432,000 sites $= 1/0.000144 = 6{,}944$ bp between sites.
c. $P(\text{CCTAGG}) = (0.10)^4 \times (0.40)^2 = 0.000016$, so two AvrII sites are expected to be about $1/0.000016 = 62{,}500$ bp apart.
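The arithmetic in solutions 4 and 5 can be checked with a short Python sketch. It assumes, as the solutions do, that bases occur independently and that G and C (and likewise A and T) are equally frequent; the function names are hypothetical and chosen only for this illustration.

```python
import math


def site_probability(site: str, gc_fraction: float) -> float:
    """Chance that a given position starts a match to `site`.

    Assumes independent bases with P(G) = P(C) = gc_fraction / 2
    and P(A) = P(T) = (1 - gc_fraction) / 2.
    """
    p_gc = gc_fraction / 2
    p_at = (1 - gc_fraction) / 2
    prob = 1.0
    for base in site.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return prob


def expected_sites(site: str, gc_fraction: float, genome_bp: float) -> float:
    """Expected number of sites in a genome of `genome_bp` base pairs."""
    return site_probability(site, gc_fraction) * genome_bp


def mean_spacing(site: str, gc_fraction: float) -> float:
    """Average distance in bp between neighboring sites."""
    return 1 / site_probability(site, gc_fraction)


def site_length_from_fragment_size(mean_fragment_bp: float) -> float:
    """Recognition-site length n such that 4**n equals the mean fragment size."""
    return math.log(mean_fragment_bp) / math.log(4)


# Solution 5a and 5b: AvrII (CCTAGG) in a genome that is 40% G-C.
print(round(expected_sites("CCTAGG", 0.40, 3e9)))
print(round(mean_spacing("CCTAGG", 0.40)))
# Solution 5c: regions that are 80% A-T, i.e. 20% G-C.
print(round(mean_spacing("CCTAGG", 0.20)))
# Solution 4: mean fragment size of 4,096 bp with equal base frequencies.
print(round(site_length_from_fragment_size(4096)))
```

The same functions can be reused for Question 6 by passing a different recognition sequence, provided the sequence is supplied by the reader from a table of restriction enzymes.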