Chapter 5. Structural Genomics

Contents
5. Structural Genomics
5.1. DNA Sequencing Strategies
5.1.1. Map-based Strategies
5.1.2. Whole Genome Shotgun Sequencing
5.2. Genome Annotation
5.2.1. Using Bioinformatic Tools to Identify Putative Coding Genes
5.2.2. Comparison of predicted sequences with known sequences (at NCBI)
5.2.3. Published Genomes
5.3. DNA Sequence Polymorphisms
5.3.1. Simple Sequence Repeats (SSRs)
5.3.2. RFLPs are a Special Type of SNP
5.3.3. Detecting SNPs
5.3.4. Uses of DNA Polymorphisms
5.4. Mutations
5.4.1. Point Mutations: Base Substitutions
5.4.2. Point Mutations in Protein Coding Sequences
5.4.3. Point Mutations: Base Insertions or Deletions

CONCEPTS OF GENOMIC BIOLOGY

CHAPTER 5. STRUCTURAL GENOMICS

Genomic biology has three important branches: structural genomics, comparative genomics, and functional genomics. The ultimate goals of these branches are, respectively, the sequencing of genes and genomes; the comparison of these sequenced genes and genomes across all organisms, with the aim of understanding evolutionary relationships; and understanding how genes and genomes work to produce complex phenotypes, including gene regulation and environmental signaling.

A set of molecular genetic technologies was, and still is, critical to our ability to pursue the goals described above. The Genomic Biologist's Tool Kit provides a brief understanding of these critical tools and of how they are used in the investigation of genomes. While the techniques are intrinsically laboratory tools, the nature of what they can do and how they work can be readily studied using bioinformatic resources.

5.1. DNA SEQUENCING STRATEGIES

Beyond the method for generating DNA sequences, it is necessary to have a strategy for how to employ DNA sequencing technology. Strategies for DNA sequencing depend on the features and size of the genome being sequenced and on the technology available for doing the sequencing. As part of the Human Genome Project, two general approaches emerged as most useful and valuable. One of these strategies, the map-based approach, was employed by the publicly funded sequencing effort that involved scientists from around the world. The other strategy, called whole genome shotgun sequencing, was developed by a privately funded group at Celera Genomics. It was perhaps faster and cheaper than the map-based approach, but it works less efficiently with large genomes, although it is very useful for smaller genomes. Today these approaches are combined into hybrid strategies to obtain the advantages of both.

5.1.1. Map-based Sequencing

The map-based, or clone-contig mapping, sequencing approach was the method originally developed by the publicly funded Human Genome Project sequencing effort. The rationale for this method is that it is the best method for obtaining the sequence of most eukaryotic

genomes, and it has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. Though it is relatively slow and expensive, this method provides dependable, high-quality sequence information with a high level of confidence.

In the clone-contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial digestion with a restriction endonuclease (section 4.1), and these fragments are cloned in a high-capacity vector such as a BAC or a YAC vector (section 4.2.5). A clone contig map is made by identifying clones containing overlapping fragments that bear mapped sequence markers. These markers were originally identified using a combination of conventional genetic mapping, FISH cytogenetic mapping, and radiation hybrid mapping. More recently, common practice has been to build the clone-contig library by chromosome walking: sequence markers are generated from BAC ends, and a map of BAC-end sequences is then made. Ideally the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, RFLPs, and genes) known to be present in a particular region.

Figure 5.1. Clone contig mapping of a series of YAC clones containing human DNA.

Once the clone library and contig map have been developed, relevant clones are sequenced using the shotgun method described below (Figure 5.2.). These sequenced contigs are then aligned, using the markers and overlapping sequences on the clones to position each clone.

5.1.2. Whole Genome Shotgun Sequencing

In the whole genome shotgun approach, smaller, randomly produced fragments (1,500-2,000 bp) are generated, cloned, and sequenced. These sequences are then assembled based on random overlap into a

genome sequence. Typically, some regions are not well sequenced, and targeted sequencing is done to fill in the gaps that cannot be assembled from the randomly made pieces.

Figure 5.2. Schematic diagram of the sequencing strategy used by the publicly funded Human Genome Project. The DNA was cut into fragments of about 150 kb, which were arranged into overlapping contiguous fragments. These contigs were cut into smaller pieces and sequenced completely.

The shotgun method is faster and less expensive than the map-based approach, but it is more prone to errors due to incorrect assembly of the random fragments, especially in larger genomes. For example, if a 500 kb portion of a chromosome is duplicated and each copy is cut into 2 kb fragments, it is difficult to determine where a particular 2 kb piece should be located in the finished sequence. Duplicated copies also seldom retain their original sequences: they tend to accumulate SNPs over time, and this creates further difficulties in the proper assembly of these duplicated sequences.

Which method is better? It depends on the size and complexity of the genome. With the human genome, each group involved believed its approach was superior to the other, but a hybrid approach is now used routinely. The advent of next generation sequencing allows the use of fragment-end short-read sequencing, with much more powerful computer-based assemblers generating the finished sequences. However, the method still requires at least some second-round sequencing to obtain a completely sequenced genome.
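The idea of assembling random fragments by their overlaps can be illustrated with a toy example. The sketch below is a minimal greedy overlap assembler, not the algorithm used by any real assembler; the function names, the `min_len` threshold, and the short reads are all hypothetical and chosen only for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs that remain when no overlaps are left.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping toy reads sampled from one short sequence.
toy_reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

When two reads share no overlap, the sketch leaves them as separate contigs, which corresponds to the gaps that must be filled by targeted sequencing. A repeat longer than the reads makes several overlaps equally good, which is the ambiguity that causes misassembly of duplicated regions.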

5.2. GENOME ANNOTATION

Once a genome sequence has been obtained using one or more of the strategies outlined in the preceding sections, the hard work of deciding what the sequence means begins. Typically, to make such tasks easier, a database is created that ultimately shows the entire sequence, the location of specific genes in that sequence, and some functional annotation of the role that each gene has in the organism. The databases at NCBI are a critical repository for these types of information, but there are many other specific, and perhaps more detailed, repositories of this type of information.

The process routinely begins with the implementation of what is termed a gene-finding bioinformatic pipeline. The separate parts of such a pipeline are described below.

5.2.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes

A first approximation of gene locations in the genomic sequence is usually made using a gene prediction program to predict gene beginning and ending points, transcriptional and translational start and stop sites, intron and exon locations, and poly(A) addition sites. Often such programs also produce the sequence of the putative transcript, and/or the mature mRNA and the amino acid sequence of the protein coded for by the gene.

Many gene prediction programs rely on machine learning methods, such as neural networks, that learn from examples which sequence features mark a gene. Such programs are trained on known sequences and then used to predict gene regions; feedback on the errors they made is then given back to the program. As the programs are used, they refine and improve their predictive power.

5.2.2. Comparison of predicted sequences with known sequences (at NCBI)

Once putative coding genes are predicted, the next step is to compare the predicted mRNA (cDNA) sequences with known coding sequences in publicly available libraries. This can be done with a number of tools, but one of the best is the Basic Local Alignment Search Tool (BLAST) utility at NCBI. By taking your predicted peptide and/or nucleotide sequence and submitting it to a BLAST search of the nr (protein) or nt (nucleotide) sequence database, you can learn which sequences available at NCBI are most similar to your sequence. When you do a BLASTP (protein) comparison,

you are also shown conserved domains found in your protein. Recall that conserved domains are amino acid sequences that are conserved in various types of proteins. Thus, BLAST searches can inform you of a number of interesting and useful sequence features found in your submitted sequence.

If one or more cDNA sequence libraries are available from the organism you are working with, or if a related sequence from a previously cloned gene is available at NCBI, the BLAST search also shows you previously known cDNA and other sequences found in all of the databases at NCBI. This becomes a critical method for learning what your gene does. If you are working with a rare organism for which little sequence information is available, you can construct and sequence your own cDNA library to provide information about the protein-coding genes in your organism.

Comparing the predicted cDNA sequence with the actual sequences found in databases also shows how accurate the gene prediction program was. This can lead to editing the predicted gene to show the actual sequence found by BLAST searching, when this is appropriate based on the available data.

As more is learned about each gene, more literature related to it is published and appears in the PubMed database or in other NCBI databases. Because the databases at NCBI are interlinked, the BLAST search itself gives you access to a large body of information about sequences related to your predicted sequence and to the actual gene that you discovered in the sequenced genome.

5.2.3. Published Genomes

Once such preliminary analyses have been performed, the data need to be shared with the communities (scientific, medical, clinical, students, the interested public, etc.) to whom the information is useful. The Genomes database at NCBI is a resource where this is done. Note that genomic databases at NCBI and elsewhere are continually evolving, and new information is added as it becomes available. This can make it difficult to understand what you find, but with care you can follow the process and end up with the best information available.
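To make the gene-finding step of Section 5.2.1 more concrete, the sketch below scans a DNA sequence for open reading frames (ORFs): an ATG start codon followed in the same frame by a stop codon. This is only a first approximation and an assumption of this example; real gene prediction programs also model introns, promoters, and poly(A) sites, and they examine the reverse strand as well. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Find ORFs on the forward strand in all three reading frames.

    Returns a list of (start, end, frame) tuples, where start is the index
    of the A in ATG and end is the index just past the stop codon.
    min_codons is the minimum number of codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


toy_sequence = "CCATGAAATTTGGGTAACC"
for start, end, frame in find_orfs(toy_sequence):
    print(frame, toy_sequence[start:end])
```

A predicted ORF obtained this way is the kind of sequence that would then be submitted to a BLAST search to see whether it resembles any known coding sequence.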

5.3. DNA SEQUENCE POLYMORPHISMS

When the contig-cloning approach to genome sequencing was developed (see Section 5.1.1. Map-based Strategies), it quickly became clear that many types of sequence-tagged-site (STS) markers already known in the genome were based on some type of DNA sequence polymorphism. That is to say, known sequence variations could be found in almost every genome. Often such sequence polymorphisms were identified before genomic sequences were obtained for the organism. However, with the sequencing of multiple individual genomes of a number of organisms, including humans, it became clear that such polymorphic DNA sequences, scattered throughout the genome, are important sequence markers that are useful for relating important genes to given chromosomal locations, and subsequently for mapping important phenotypes.

Recall that we have discussed the use of single nucleotide polymorphisms (SNPs) in quantitative and population genetics previously in sections 2.6 and 2.7. In this chapter we will examine the major types of DNA polymorphisms in greater detail.

5.3.1. Simple Sequence Repeats (SSRs)

Simple Sequence Repeats (SSRs), Sequence-Tagged Microsatellite Sites (STMS, or simply microsatellites), Simple Sequence Repeat Polymorphisms (SSRPs), Variable Number Tandem Repeats (VNTRs), and Short Tandem Repeats (STRs) are all names that have been used to describe polymorphic loci present in nuclear DNA and in some organellar DNA. SSRs consist of repeated sequence units of 1-10 base pairs, most often 2-3 bp in length. SSRs are highly variable in the number of repeated units found, and they are typically evenly distributed throughout the genomes of eukaryotes.

SSR-type polymorphisms are most frequently revealed using PCR (Figure 5.3.). PCR primers are made for the DNA sequences flanking the repeated region, since the repeats themselves may occur at multiple locations in the genome. The flanking regions tend to be conserved within a species, but they may also be conserved across higher taxonomic levels. By determining the size

of the PCR fragments, the number of repeated units at a given individual locus can be determined.

SSR polymorphisms are believed to arise in the genome as a result of a mutational process. Because of the repetitive nature of the SSR, a DNA replication error can lead to extra duplication of the repeated unit. For more detail on STSs and related tool usage, visit the STS Tools page at the Probe Database at NCBI.

Figure 5.3. Schematic diagram showing the use of PCR to detect SSR polymorphisms (STRs). Specific PCR primers land outside the repeat region and copy the entire repeat region. The PCR products can then be separated on the basis of size, and the number of repeats for each allele determined.

5.3.2. RFLPs/AFLPs are special types of SNP

Restriction Fragment Length Polymorphisms (RFLPs) are a special type of SNP in which a single base inside a restriction site is altered (mutated). This causes the disappearance of the restriction site, so the DNA is no longer cut there by the particular restriction enzyme. The resulting polymorphism can be detected by electrophoresis of DNA restriction fragments, followed by blotting the fragments onto a membrane filter and probing the filter with a complementary DNA probe. This process is called Southern blotting (see Figure 5.4.), and it can be used in this application to detect the presence of an RFLP.
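The effect of losing a restriction site on fragment sizes can be simulated in a few lines. The sketch below digests two toy alleles with BamHI, which recognizes GGATCC and cuts after the first G. The allele sequences are invented for illustration and are scaled down (20 bp and 50 bp stand in for fragments of a few kb); the function name `digest` is hypothetical.

```python
def digest(seq: str, site: str = "GGATCC", cut_offset: int = 1):
    """Cut a linear DNA sequence at every occurrence of a restriction site.

    cut_offset is the position of the cut within the site (1 for BamHI,
    which cuts G^GATCC). Returns the fragment lengths in order.
    """
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + cut_offset)
        i = seq.find(site, i + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy alleles: allele 1 carries an intact BamHI site; in allele 2 a single
# base change (C to T) inside the site destroys it.
allele_1 = "A" * 19 + "GGATCC" + "T" * 45
allele_2 = "A" * 19 + "GGATTC" + "T" * 45

print(digest(allele_1))  # two fragments
print(digest(allele_2))  # one uncut fragment
```

An individual heterozygous for the two alleles would show all three fragment sizes on a gel, which is what makes the marker informative for genotyping.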

Figure 5.4. Restriction Fragment Length Polymorphisms (RFLPs) are caused by a single base change in a restriction site (BamHI in the example given). This loss of a restriction site results in a single 7 kb restriction fragment instead of a 2 kb and a 5 kb fragment.

Figure 5.5. Amplified Fragment Length Polymorphisms (AFLPs) are caused by a single base change in a restriction site (BamHI in the example given). This loss of a restriction site results in a single 2 kb PCR product instead of a 500 bp and a 1500 bp fragment.

In addition to Southern blotting, it is also possible to PCR amplify a DNA fragment around a restriction site containing the SNP. Once the amplification is complete, the amplified fragment is cut with the restriction endonuclease in question, and the cut fragments are separated by electrophoresis. The fragments are then visualized using a stain specific for double-stranded DNA (see Figure 5.5.), and the genotype of the organism at the SNP can be determined from the fragment pattern.

5.3.3. Detecting SNPs

Most SNPs do not alter restriction sites, so other methods are used for analysis. SNPs can be detected by direct DNA sequencing, and now that sequencing and analysis techniques have grown much more sophisticated, these approaches are increasingly popular and are typically the method of choice.

An early method of SNP detection involves the use of allele-specific oligonucleotide (ASO) hybridization. An oligonucleotide complementary to one SNP allele is attached to a membrane filter at a specific location on the filter. Oligonucleotides for all other SNP alleles at a given locus can also be attached to the filter at separate locations. The filter is then allowed to hybridize with labeled target DNA at high stringency, meaning under conditions where only a complete base pair match will form a stable hybrid (top panel of Figure 5.6.). If hybridization occurs, the target DNA has the allele corresponding to the oligo. If hybridization does not occur, a different allele for that SNP is present (see Figure 5.6.), and hybridization would occur at the oligo corresponding to that allele.

Figure 5.6. Allele-specific oligonucleotide hybridization. This technique allows the investigation of single nucleotide polymorphisms in the genome. In this case the probes are bound to a membrane and hybridized with labeled DNA from the sample subject. If probes for all possible nucleotide combinations are separately bound to the filter, hybridization occurs only to the probes where there is a complete match. This is called hybridization at high stringency, and it can be used to determine the SNP alleles present in a genome.
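The logic of reading an ASO filter can be sketched in code. The example below models high-stringency hybridization very crudely, as an assumption of this sketch: a probe binds only if the sample contains its exact complement, with no tolerance for mismatches. The probe and sample sequences and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridizes(probe: str, target: str) -> bool:
    """Crude high-stringency model: only a perfect match forms a hybrid."""
    return reverse_complement(probe) in target


def aso_genotype(probes: dict, sample_strands: list) -> list:
    """Return the alleles whose probe spot lights up for this sample.

    probes maps an allele name to its oligonucleotide; sample_strands holds
    the target sequences from the individual (two for a diploid locus).
    """
    return sorted(
        allele
        for allele, probe in probes.items()
        if any(hybridizes(probe, strand) for strand in sample_strands)
    )


# Toy locus with an A/G SNP in the middle of the probed region.
strand_a = "ACGTTGCAAGGCTTAAC"
strand_g = "ACGTTGCAGGGCTTAAC"
probes = {
    "A": reverse_complement("TTGCAAGGCTT"),
    "G": reverse_complement("TTGCAGGGCTT"),
}

print(aso_genotype(probes, [strand_a, strand_g]))  # heterozygote
print(aso_genotype(probes, [strand_a, strand_a]))  # homozygote
```

Each key in `probes` plays the role of one spot on the filter, so the same function extends naturally to filters carrying probes for many SNP loci at once.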

Such a technique can be extended to make membrane filters with oligonucleotides for multiple SNP loci, all incorporated into one array placed on the filter. Additionally, the technique can be extended by placing the oligonucleotide probes on other substrates, where thousands of oligos corresponding to thousands of different SNPs can be examined simultaneously. These are referred to as SNP arrays, a type of microarray that we will discuss in Chapter 6.

SNPs are an abundant source of STS markers that have become increasingly important genomic tools, both experimentally and clinically. It is clear that such polymorphic markers arise from the mutational process, and thus we turn our attention to a more detailed investigation of the process of mutation and its effect on genome structure.

5.3.4. Uses of DNA Polymorphisms

Genes have historically been used as markers for genetic mapping experiments. DNA polymorphisms, defined as two or more alleles at a locus that vary in nucleotide sequence or in the number of repeated nucleotide units (indels), behave like genes for mapping purposes. DNA markers are polymorphisms suitable for mapping, used in association with gene markers for genetic and physical mapping of chromosomes.

DNA testing is increasingly available for many genetic diseases. Genetic tests are now included in the OMIM database for a vast number of genes, and genomic medicine is one of the newest branches of medicine. A few examples of diseases for which there are now reliable DNA marker-based genetic tests include Huntington disease, hemophilia, cystic fibrosis, Tay-Sachs disease, breast cancer, and sickle-cell anemia.

Human genetic testing serves three main purposes.
1) Prenatal diagnosis, often using amniocentesis or chorionic villus sampling to assess the risk to the fetus of a genetic disorder.
2) Newborn screening, using a blood screen for phenylketonuria (PKU), sickle-cell anemia, Tay-Sachs disease, and others.
3) Carrier (heterozygote) detection, which is now available for many of the genetic diseases listed above.

Additional examples of the application of DNA-based markers include: crime scene investigation; population studies to determine variability in groups of people; proving horse pedigrees for registration purposes; conservation biology to determine genetic variation in endangered species; forensic analysis in wildlife crimes, allowing body parts of poached animals to be used as evidence; detection of pathogenic E. coli strains in foods; detection of genetically modified organisms (GMOs) in bulk or processed foods; and many others.

5.4. MUTATIONS

Mutations are low-frequency changes in the nucleotide sequence that can occur naturally (spontaneously) or can be induced by a host of chemical and physical agents. Mutations are quantified in two different ways. The mutation rate is the probability of a particular kind of mutation as a function of time (e.g., number of observed mutations per gene per generation). The mutation frequency is the number of times a particular mutation occurs in proportion to the number of cells or individuals in a population (e.g., number of mutations per 100,000 organisms).

Mutations occur in a number of ways, ranging from major chromosomal rearrangements to the so-called point mutations (one or a few base changes) or single nucleotide mutations.

5.4.1. Point Mutations: Base-pair Substitutions

A base-pair substitution replaces one base pair with another. There are two general types of substitutions. Transition mutations replace a purine with the other purine or a pyrimidine with the other pyrimidine, for example an A-T to G-C change. Transversion mutations replace a purine with a pyrimidine or vice versa, for example a C-G to G-C or an A-T to T-A change.

5.4.2. Point Mutations in Protein Coding Sequences

Another way of classifying substitutions is based on the effect of the mutation on the protein coded for by the ORF in which they occur. A missense mutation occurs when the base substitution changes one amino acid to another.
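The two substitution classes of Section 5.4.1 can be told apart by checking whether the original and the new base belong to the same chemical class (purines A and G, pyrimidines C and T). The following minimal sketch does this for a single strand; the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical: not a substitution")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("C", "T"), ("A", "T"), ("C", "G")]:
    print(ref, "->", alt, substitution_type(ref, alt))
```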

When a base substitution produces a stop codon rather than a codon for another amino acid, it is referred to as a nonsense mutation.

A silent mutation results when a base substitution, typically in the third (wobble) position of a codon, does not change the amino acid that is coded for.

When the amino acid change produced by a missense mutation substitutes one amino acid for a functionally similar amino acid (see the table of amino acid R-groups for examples of similarity), the result is a neutral mutation.

5.4.3. Point Mutations: Base Insertions or Deletions

Frameshift mutations result from single-base insertions or deletions. Frameshift mutations change all amino acids from the point of insertion or deletion onward, and they often lead to truncation of the protein because a premature stop codon is now in frame.
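The categories above can be checked mechanically with the standard genetic code. The sketch below classifies a codon change as silent, missense, or nonsense and shows how a single-base deletion shifts the reading frame. It does not try to separate neutral from other missense mutations, since that requires an amino acid similarity table not reproduced here. The function names and the toy coding sequence are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify the effect of changing ref_codon into alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


def translate(seq: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(classify_substitution("GAA", "GAG"))  # third-position change
print(classify_substitution("GAG", "GTG"))  # glutamate to valine
print(classify_substitution("TGG", "TGA"))  # tryptophan to stop

# Frameshift: deleting one base changes every codon downstream of it.
coding = "ATGGAAGTTTGGCATTAA"
shifted = coding[:3] + coding[4:]  # delete the base at index 3
print(translate(coding), translate(shifted))
```

In the deletion example every amino acid after the initial methionine differs from the original, and the original stop codon is no longer read in frame.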