The Pennsylvania State University. The Graduate School. College of Engineering INFERENCE OF ORTHOLOGS, WHILE CONSIDERING GENE CONVERSION,


The Pennsylvania State University
The Graduate School
College of Engineering

INFERENCE OF ORTHOLOGS, WHILE CONSIDERING GENE CONVERSION, TO EVALUATE WHOLE-GENOME MULTIPLE SEQUENCE ALIGNMENTS

A Dissertation in Computer Science and Engineering
by Chih-Hao Hsu

© 2009 Chih-Hao Hsu

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2009

The dissertation of Chih-Hao Hsu was reviewed and approved* by the following:

Webb Miller, Professor of Biology and Computer Science and Engineering, Dissertation Advisor, Chair of Committee
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering
Wang-Chien Lee, Associate Professor of Computer Science and Engineering
Ross Hardison, T. Ming Chu Professor of Biochemistry and Molecular Biology

*Signatures are on file in the Graduate School

ABSTRACT

The problem of computing a multiple-sequence alignment (MSA) is very important for the analysis of biological sequences. An equally critical problem is to evaluate the quality of an alignment. In the preliminary project described here, alignments produced by Multiz and ROAST of the human genome to other vertebrate genomes are evaluated using orthologous genes in 13 gene clusters from 6 mammalian species, which are identified using maximum-likelihood phylogenetic tree reconstruction methods. Analysis of the α- and β-globin gene clusters shows that the inferred ortholog relationships are accurate. The orthologous β-globin genes from over 14 species are used to evaluate the performance of four MSA programs (MLAGAN, MAVID, TBA and ROAST). The results show that the performance of ROAST is superior to the others. Furthermore, differences among gene clusters and among species are studied. This approach not only indicates the quality of a given alignment, but also helps us understand the alignment's drawbacks and gives us some clues about how to build the next generation of multiple alignment programs.

To obtain accurate orthologs, the impact of gene conversion is studied in this thesis. Gene conversion events are often overlooked in analyses of genome evolution. In such an event, an interval of DNA sequence (not necessarily containing a gene) overwrites a highly similar sequence. The event creates relationships among genomic intervals that can confound the prediction of orthologs and attempts to transfer functional information between genomes. Here we propose different gene conversion detection methods for different scales of data. Detailed information about conversion events between gene pairs is determined, including their directionality. Furthermore, we analyze 1,112,202 highly conserved pairs of human genomic intervals, and

detect a conversion event for about 13.5% of them. Properties of the putative gene conversions are analyzed, such as the distributions of the lengths of the converted regions and the spacing between source and target. Finally, we also apply our method to several well-studied gene clusters, including the globin genes.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

Chapter 1 Introduction
  Evolution of genomes
  Duplication of genome
  Orthologs and paralogs
  Inference of orthologs and paralogs
  Gene conversion

Chapter 2 Evaluation of Whole-Genome Multiple Sequence Alignments
  Introduction
  Multiple sequence alignments
  Methods for evaluation of multiple sequence alignments
  Motivation
  Methods
  Gene clusters identification
  Extracting coding sequences
  Phylogenetic tree reconstruction
  Orthology identification
  Evaluation of alignments
  Results
  Analysis of ortholog assignments for the α- and β-globin gene clusters
  Comparison of different alignment programs
  Comparison of different gene clusters
  Comparison of different species
  Conclusion

Chapter 3 Gene conversion detection between a pair of genes
  Introduction
  Motivation
  What is gene conversion
  Impact of gene conversion to the inference of orthology
  Methods for gene conversion detection
  Limitations of these methods
  Methods
  Site-by-site compatibility method
  Gene conversion inference
  Boundaries of gene conversion
  3.3 Results and limitations
  Beta and delta genes
  Two gamma genes
  Limitations

Chapter 4 Gene conversion detection for whole genome
  Introduction
  Methods
  Highly conserved pairs of sequences
  Gene conversion detection between each pair of sequences
  Space-efficient modifications
  Extension to quadruplet testing
  Multiple-comparison correction
  Directionality of gene conversion
  Results
  Number and distribution of gene conversion events in human
  Correlations with the distance, length, and relative orientation of the paralogs
  Length of converted regions
  The effect of protein-coding DNA
  Correlation with GC-content
  Discussion

Chapter 5 Applying gene conversion detection method to gene clusters
  Introduction
  Results
  Beta-globin gene cluster (hg18.chr11: 5,180,996-5,270,995)
  CCL gene cluster (hg18.chr17: 31,334,806-31,886,998)
  IFN gene cluster (hg18.chr9: 21,048,761-21,471,698)

Chapter 6 Conclusions and future works
  Conclusions
  Future Works

Bibliography

LIST OF FIGURES

Figure 1-1: Duplication caused by transposition
Figure 1-2: Normal crossing over and unequal crossing over
Figure 1-3: Construction of reconciled tree
Figure 1-4: Effect of gene conversion
Figure 2-1: Pseudoorthologs result from gene loss
Figure 2-2: Processes for extracting coding sequences for species other than human or mouse
Figure 2-3: Problem of multiple substitutions in phylogenetic tree reconstruction
Figure 2-4: Species tree and inferred trees
Figure 2-5: An example for selecting a gene as the query sequence for database search
Figure 2-6: Changing the gene tree of the TRIM gene cluster via re-rooting
Figure 2-7: Orthologous relationships within a phylogenetic tree
Figure 2-8: Some drawbacks of using a bootstrap value as the estimated reliability of orthologous genes
Figure 2-9: Evolutionary information for orthologs inference
Figure 2-10: Definitions of sensitivity and specificity
Figure 2-11: Phylogenetic trees for 25 species in β-globin gene cluster
Figure 2-12: Gene conversion and gene duplication
Figure 2-13: Phylogenetic trees for 30 species in the α-globin gene cluster
Figure 2-14: Sensitivity and specificity among different alignment programs
Figure 2-15: Sensitivity and specificity of different gene clusters
Figure 2-16: Sensitivity and specificity of different species
Figure 3-1: Phylogenetic tree of beta globin gene cluster
Figure 3-2: Gene conversion and gene duplication
Figure 3-3: Effect of gene conversion
Figure 3-4: Example shows the issues of Drouin's method
Figure 3-5: An example of bottom-up phase
Figure 3-6: An example of top-down phase
Figure 3-7: An example shows how to determine gene conversion and its directionality
Figure 3-8: All gene conversion events between beta and delta genes
Figure 3-9: All gene conversion events between two gamma genes
Figure 4-1: Evidence of gene conversion in the human δ-globin gene
Figure 4-2: Determining the occurrence of gene conversion events in a triplet
Figure 4-3: A cubic-space algorithm for computing the probabilities x_{m,n,k}
Figure 4-4: Comparisons between quadruplet testing and triplet testing
Figure 4-5: Algorithm for determining cutoff position of P-values
Figure 4-6: Timing of evolutionary events
Figure 4-7: Evidence that the β-globin gene (HBB) converted the δ-globin gene (HBD)
Figure 4-8: Frequencies of gene conversion events in each human chromosome
Figure 4-9: Frequency of intra-chromosomal gene conversions as a function of distance between the paralogs
Figure 4-10: Correlation with length of the paralogous human sequences
Figure 4-11: Correlation between orientation and separation distance of the human paralogs
Figure 4-12: Distribution for the length of the converted regions
Figure 4-13: Correlation between gene conversion and GC content
Figure 5-1: Gene tree and detected conversion events for the beta-globin gene cluster
Figure 5-2: Influences of gene conversion between the beta and delta genes
Figure 5-3: Inferred evolutionary histories for mammalian beta and delta genes
Figure 5-4: Phylogenetic trees for CCL gene cluster
Figure 5-5: Evidences of gene conversions between CCL15 and CCL23 genes
Figure 5-6: Inferred evolutionary histories for the CCL gene cluster
Figure 5-7: Gene tree and detected conversion events for the Interferon gene cluster
Figure 5-8: Influences of gene conversion to the phylogeny in the distal group

LIST OF TABLES

Table 2-1: Ortholog assignment of β-globin gene cluster
Table 2-2: The coordinates of all genes for 25 species of mammals in the β-globin gene cluster
Table 2-3: The predicted ortholog assignments of HBA-related genes for 30 mammalian species
Table 2-4: Coordinates of all genes for 30 mammalian α-globin gene clusters
Table 2-5: Detailed information for 13 gene clusters
Table 4-1: Information for duplicated human genomic intervals used in this study
Table 4-2: Distribution of intra- and inter-chromosomal gene conversions
Table 4-3: Distribution of gene conversion events classified by orientation
Table 4-4: Conversion frequency as a function of the presence of protein-coding sequence
Table 4-5: Number of conversion for different directionality in 1-coding category

ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to my advisor, Webb Miller, for giving me the opportunity to participate in this interesting field, and for his help and guidance throughout my graduate study at the Pennsylvania State University. I also want to thank my committee members, Raj Acharya, Wang-Chien Lee, and Ross Hardison, for their time and effort. Furthermore, I thank all my colleagues at the Pennsylvania State University for their assistance. Finally, I would like to thank my wife, Mei-Jen Liao, for her support and encouragement, and my children, Emily and Ethan, for their love.

Chapter 1 Introduction

In this chapter, we introduce the evolution of genomes. Two main mechanisms that shape genomes are studied, and the formation and impact of duplicated regions are described in detail. Furthermore, the differences between orthologs and paralogs are explained, and the inference of orthologs and paralogs is demonstrated. Finally, an important evolutionary force, gene conversion, is introduced to show how it can affect the inference of orthologs.

1.1 Evolution of genomes

Millions of organisms live on Earth. An organism is a living entity, which could be a plant, an animal or a virus, and may consist of a single cell (unicellular) or of many cells (multicellular). All cells contain genetic material called DNA, which encodes the essential functions of each cell and is inherited through generations. This inheritance results in shared traits among organisms in the same lineage. Nevertheless, small changes in the genetic material of organisms still occur from one generation to the next. This kind of change is called evolution. Even though the change in each generation is very small, the changes accumulated over many generations can produce new features in an organism or even form a new species.

Evolution arises in two major ways: mutations, and large-scale transfer of nucleotide sequence within or between species, such as insertions, deletions, inversions and duplications. Mutations are substitutions in the nucleotide sequence, which may be caused

from errors during cell division or exposure to radiation. Large-scale events, meanwhile, have received much attention and are believed to play a major role in evolution. Ohno (1967) even suggested that gene duplication is the most important force in evolution since the emergence of the universal common ancestor. In fact, gene duplication is the main force behind genome expansion, and the genome sizes of different organisms range from a few thousand bases (Fiers et al. 1976), e.g. in viruses, to more than one hundred billion bases, e.g. in the marbled lungfish. Many duplication events are believed to have occurred in these genomes.

1.2 Duplication of genome

There are two main mechanisms by which duplication occurs, namely transposition and unequal crossing over. Transposition is the movement of genetic material from one chromosomal location to another. There are many transposable elements, called transposons, which are parts of the DNA sequence that can move from one location to another. As shown in Figure 1-1, after a transposon is replicated, the copy can insert at another location and form a duplication. The other way to form a duplication is unequal crossing over. During meiosis, genetic material is exchanged between chromosomes, as shown in Figure 1-2A. This process is called crossing over. However, the homologous chromosomes can be misaligned in some cases (Figure 1-2B). This situation is called unequal crossing over, and a duplication arises in one chromosome.

Duplication, and gene duplication in particular, plays a very important role in evolutionary genomics. Gene duplication is the duplication of a DNA sequence that contains a gene, and it is the major way in which new genes form. Typically, after gene duplication, one copy of the gene retains the original function, while the other copy may acquire a new function.

Figure 1-1: Duplication caused by transposition.

Figure 1-2: (A) Normal crossing over. (B) Duplication arises from unequal crossing over due to the misalignment of homologous chromosomes.

1.3 Orthologs and paralogs

Because of duplication, there are many similar genes or genomic intervals in the DNA sequence. The concept of homologs, which means sharing similar characteristics inherited from a common ancestor, is very important for evolutionary genomics, and homologs can be basically classified

into two different types, orthologs and paralogs. Orthologs are genes or genomic intervals that diverged via a speciation event, while paralogs diverged via a duplication event. In general, since orthologous genes are descended from a single gene in the last common ancestor of the species, their function and structure often remain the same or similar. However, since paralogous genes are created by a duplication event within a species, one copy of the duplicated genes can be recruited for a different function.

1.4 Inference of orthologs and paralogs

A typical way to infer orthologs and paralogs is to compare the phylogenetic tree with the species tree. The concept of a reconciled tree was first introduced by Goodman (1979) and was shown to be an effective method for the inference of orthologs and paralogs (Yuan et al. 1998). Reconciliation constructs a new tree that reconciles the phylogenetic tree with the species tree by postulating gene losses. A minimum reconciled tree is one that minimizes the number of gene losses. Finding the minimum reconciled tree has been shown to be an NP-hard problem (Goodman et al. 1979). Therefore, many heuristic algorithms for constructing a reconciled tree have been proposed (Page 1994; Mirkin et al. 1995). Basically, constructing a reconciled tree consists of two major steps. The first step is to generate a phylogenetic tree, and the second is to reconcile the phylogenetic tree with the species tree at minimum cost, where the cost is the number of gene losses. Figure 1-3 shows an example of constructing a reconciled tree. Orthologs and paralogs can be read directly from the reconciled tree. In this case, there are three gene losses; a1 is orthologous to b1; a2, c2 and d2 are orthologous to each other; and a1 and b1 are paralogous to a2, c2 and d2.
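The way ortholog and paralog relationships fall out of comparing a gene tree with a species tree can be sketched in a few lines. The sketch below uses lowest-common-ancestor (LCA) mapping, a common reconciliation heuristic that is swapped in here for illustration and is not necessarily the procedure used in this thesis. Both trees are hypothetical (the trees of Figure 1-3 are not reproduced), the gene names merely echo the example above, and gene losses are not counted.

```python
from itertools import product

# Hypothetical species tree as nested tuples (an assumption for illustration).
SPECIES_TREE = (("A", "B"), ("C", "D"))
# Hypothetical gene tree; every leaf is a (gene_name, species) pair.
GENE_TREE = (
    (("a1", "A"), ("b1", "B")),
    (("a2", "A"), (("c2", "C"), ("d2", "D"))),
)


def species_below(node):
    """Return the set of species at or below a species-tree node."""
    if isinstance(node, str):
        return {node}
    return species_below(node[0]) | species_below(node[1])


def lca(tree, wanted):
    """Return the smallest species-tree clade containing all of `wanted`."""
    if not isinstance(tree, str):
        for child in tree:
            if wanted <= species_below(child):
                return lca(child, wanted)
    return tree


def classify(node, pairs):
    """Return (genes, species) below `node` and record pairwise relationships."""
    if isinstance(node[0], str):  # leaf: (gene_name, species)
        return [node[0]], {node[1]}
    left_genes, left_species = classify(node[0], pairs)
    right_genes, right_species = classify(node[1], pairs)
    here = lca(SPECIES_TREE, left_species | right_species)
    # A node is treated as a duplication when it maps to the same species
    # clade as one of its children; otherwise it is treated as a speciation.
    is_duplication = here in (
        lca(SPECIES_TREE, left_species),
        lca(SPECIES_TREE, right_species),
    )
    for g, h in product(left_genes, right_genes):
        pairs[frozenset((g, h))] = "paralog" if is_duplication else "ortholog"
    return left_genes + right_genes, left_species | right_species


pairs = {}
classify(GENE_TREE, pairs)
for pair, relation in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), relation)
```

Genes separated by a speciation node are reported as orthologs and genes separated by a duplication node as paralogs, which mirrors how relationships are read off a reconciled tree.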

Figure 1-3: Construction of reconciled tree. (A) Phylogenetic tree. (B) Species tree. (C) Reconciled tree.

1.5 Gene conversion

Using a phylogenetic tree to find orthologs and paralogs is efficient. However, the topology of the phylogenetic tree is not always correct, so the inferred orthologs and paralogs can be inaccurate. Many factors can affect the reliability of the topology of a phylogenetic tree. One of the most important is gene conversion. Gene conversion is a nonreciprocal transfer of genetic material. In the process of gene conversion, part of a DNA sequence is transferred from one DNA helix, A, to another DNA helix, B. In this case, A remains the same, while the corresponding part of B is replaced by A's sequence. Gene conversion can result from base mismatch repair during recombination or from the double-strand break repair process when DNA damage occurs. Gene conversion can affect the inference of orthologs and paralogs. For example, as shown in Figure 1-4, a gene conversion event occurred in part of the sequences between gene D and gene E. The phylogenetic tree in the conversion region (Figure 1-4B) can differ from the original tree shown in Figure 1-4A. Therefore, an unreliable phylogenetic tree could be generated, and incorrect ortholog and paralog relationships might be inferred.
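The nonreciprocal nature of gene conversion can be illustrated with a toy model. This is a minimal sketch, assuming two pre-aligned sequences of equal length; the function names `convert` and `mismatches` and the toy sequences are hypothetical and do not come from the thesis.

```python
def convert(source: str, target: str, start: int, end: int) -> str:
    """Return `target` with positions [start, end) overwritten by `source`.

    The source sequence itself is left unchanged, which is what makes the
    transfer nonreciprocal.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    return target[:start] + source[start:end] + target[end:]


def mismatches(x: str, y: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


# Toy paralogs that differ at every position (invented data).
helix_a = "AAAAAAAAAAAA"
helix_b = "CCCCCCCCCCCC"
converted_b = convert(helix_a, helix_b, 4, 8)

print(converted_b)
print(mismatches(helix_a[4:8], converted_b[4:8]))  # inside the converted region
print(mismatches(helix_a, converted_b))            # over the whole sequence
```

Inside the converted interval the two sequences become identical, so a tree built from that interval alone would group B with A regardless of their true history, while the flanking positions keep the original signal.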

Figure 1-4: Effect of gene conversion. (A) Gene tree for five genes. (B) Phylogenetic tree for the conversion region.

In this thesis, we construct phylogenetic trees to find orthologs and use the inferred orthologs to evaluate the performance of multiple sequence alignments. To study the impact of gene conversion on the inference of orthologs, three different scales are analyzed: pairs of genes, gene clusters and the whole genome. The structure of this thesis is as follows. The mechanisms for the inference of orthologs and the evaluation of multiple sequence alignments are presented in Chapter 2. Analyses of gene conversion for pairs of genes, the whole genome and individual gene clusters are presented in Chapter 3, Chapter 4 and Chapter 5, respectively. Finally, conclusions and future work are given in Chapter 6.

Chapter 2 Evaluation of Whole-Genome Multiple Sequence Alignments

In this chapter, a method to evaluate the quality of a multiple sequence alignment is proposed. In this study, orthologous genes are identified using maximum-likelihood phylogenetic tree reconstruction methods. Alignments produced by Multiz and ROAST of the human genome to other vertebrate genomes are evaluated using orthologous genes in 13 gene clusters from 6 mammalian species. Furthermore, two gene clusters, namely the α- and β-globin gene clusters, are analyzed. The orthologous α- and β-globin genes are used to evaluate the performance of four MSA programs (MLAGAN, MAVID, TBA and ROAST). Finally, differences among gene clusters and among species are studied. This approach not only indicates the quality of a given alignment, but also helps us understand the alignment's drawbacks and gives us some clues about how to build the next generation of multiple alignment programs.

2.1 Introduction

Multiple sequence alignments

A multiple sequence alignment (MSA), i.e., an alignment containing more than two sequences (namely genomic sequences or regions in this study), can be used to identify intervals that are conserved among those sequences. The goal might be to analyze phylogeny or to help infer the functional elements and structures of a genome. One of the most widely used tools for aligning multiple sequences is ClustalW (Chenna et al. 2003), which uses a straightforward progressive alignment method to add sequences one by one into the multiple

alignment using a pre-calculated guide tree. ClustalW also uses additional strategies, e.g. individual weights for each sequence, gap penalty adjustment, and automatic replacement of amino-acid substitution matrices, to improve the accuracy of the alignment. However, as more and more genomic sequences become available, it is imperative to have a tool that can align large-scale multiple genomic sequences rapidly. Recently, several programs, e.g. MLAGAN (Brudno et al. 2003), MAVID (Bray et al. 2004) and TBA (Blanchette et al. 2004), have been proposed for this purpose. MLAGAN is a multiple-sequence global aligner, which uses a pairwise aligner, LAGAN, to construct the multiple alignments progressively. Moreover, an iterative refinement is done to improve the alignment locally. MAVID automatically constructs a guide tree, with maximum-likelihood ancestral sequences inferred progressively. In addition, more information, namely protein-based anchors and a homology map, is utilized. TBA proposes a new concept, the threaded blockset, which is a local multiple alignment, and uses it to construct multiple alignments for large-scale genomic sequences.

Methods for evaluation of multiple sequence alignments

In addition to generating multiple-sequence alignments, it is equally important to evaluate the quality of an alignment. HOMSTRAD (Mizuguchi et al. 1998), OxBench (Raghava et al. 2003) and PREFAB (Edgar 2004) use 3-D protein structural superpositions to study the performance of an alignment. However, these methods can be applied only to alignments of protein sequences. In another approach, Golubchik et al. (2007) focus on gaps to determine MSA quality. Two models, overlapped gaps and non-overlapped gaps, are used in that paper to study the performance of MSA programs. However, the paper uses only artificially generated sequences to analyze these two models. In reality, few biological data match these two models.
One example studied in that paper is alternatively spliced sequences. However, there is no clear

evolutionary relationship between such sequences, which perhaps makes that example inappropriate for evaluating the performance of MSA programs that use evolutionary relationships to build alignments. For instance, ClustalW uses a guide tree to align multiple sequences. Since there is no tree-like relationship among alternative splice variants, the guide tree generated from these sequences is not well founded, which may make ClustalW perform poorly. A third evaluation strategy is employed by TBA (Blanchette et al. 2004) and Mulan (Ovcharenko et al. 2005), which use simulations to estimate the accuracy of an alignment program. A hypothetical ancestral sequence is created, and simulated neutral evolutionary processes are applied to generate multiple sequences. Furthermore, the relationships among sequences are recorded and used to score the agreement of a given alignment with the recorded (artificial) evolutionary history. However, it is sometimes very difficult to simulate evolutionary processes correctly. In particular, different genomic sequences have different evolutionary characteristics. A fourth plan is adopted by Margulies et al. (2007), who use some particular classes of sequences, e.g. annotated protein-coding sequences, ancestral repeats (AR), and Alu elements, to assess the coverage and correctness of an alignment. However, these sequences can only be used to estimate the overall quality of an alignment; they cannot determine precisely which positions are correctly aligned according to biological considerations.

Motivation

The concept of homologs, defined as "the same organ in different animals under every variety of form and function" by Richard Owen in 1843, is very important for evolutionary genomics, and homologs can be basically classified into two different types, orthologs and paralogs. Orthologs are genes or genomic intervals that diverged via a speciation event, while paralogs diverged via a duplication event.
In general, since orthologous genes are descended from a single

gene in the last common ancestor of the species, their function and structure often remain the same or similar. However, since paralogous genes are created by a duplication event within a species, one copy of the duplicated genes can be recruited for a different function. Therefore, in this thesis, the ability to align orthologous genes is used as the criterion for evaluating alignment quality. First, 13 gene clusters among 6 different species are identified, and coding sequences are extracted for each gene cluster. Then, a phylogenetic tree is generated for each cluster and orthologous genes are identified. Finally, these orthologous genes are used to study the quality of an alignment.

2.2 Methods

Gene clusters identification

A gene cluster is a set of genes that are arranged close together on a chromosome and have the same or at least somewhat similar function and structure. In particular, all genes in a gene cluster are homologous. In this project, to infer which genes are orthologous, a phylogenetic tree is reconstructed for each gene cluster. Particularly if only a few species have sequence data for a given gene cluster, we need to be aware of the possible problem of pseudoorthologs (Koonin et al. 2005) resulting from lineage-specific gene loss. For example, if only two species are used to reconstruct a phylogenetic tree and infer orthologous genes, as shown in Figure 2-1, and if genes A1 and B2 are lost, gene A2 and gene B1 might be erroneously inferred to be orthologs. However, if too many species (e.g., with poorly resolved phylogenetic relationships) are used to reconstruct the phylogenetic tree, the reconstruction is not only time-consuming but also unreliable. Therefore, in our project, gene clusters are identified among 6 different species, namely human, chimp, rhesus, mouse, rat and dog. First, functionally related genes in human are identified to

define a gene cluster. Then, corresponding regions for the other species are obtained using a program called liftOver (Kent et al. 2002), which uses chain structures (Kent et al. 2003) to convert genome coordinates and genome annotation across species. In total, 13 human clusters are used in this project. More details are given next.

Figure 2-1: Pseudoorthologs result from gene loss.

Extracting coding sequences

In order to find orthologous genes in each gene cluster, the coding sequences of six different species are identified for each gene cluster. The coding sequences for human and mouse can be extracted from the UCSC Genome Browser (Kent et al. 2002). To extract the coding sequences of the other species, we start with the genomic sequences of those species. Subsequently, the pairwise alignment of human and each species is computed by Blastz (Schwartz et al. 2003), and possible coding regions are found using the Laj interactive alignment viewer (Wilson et al. 2001). With these possible coding regions and the coding sequences of human, the corresponding coding regions for the other species can be found by using GeneWise2 (Birney et al.

2004), which is a similarity-based tool for aligning an expressed DNA sequence with a genomic sequence. These processes are shown in Figure 2-2.

Figure 2-2: Processes for extracting coding sequences for species other than human or mouse.

Phylogenetic tree reconstruction

To identify orthologous genes for each cluster, we start by generating the phylogenetic tree. Many issues can affect the correctness of phylogenetic tree reconstruction. Mutational saturation is one potential problem. A gene is called mutationally saturated when multiple substitutions have occurred at the same position so frequently that the evolutionary distance between two genes is underestimated. In Figure 2-3, multiple substitutions occurred at the same position in gene 3, so that gene 1 and gene 3 are erroneously considered more closely related. To deal with this problem, a separate phylogenetic tree is built for each gene cluster, which can generally be done with fair reliability, since the genes in each gene cluster are similar. Moreover, protein sequences are used to reconstruct the phylogenetic tree, since protein sequences are more conserved than the corresponding DNA sequences.
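The underestimation caused by multiple substitutions at the same site can be made concrete with a standard textbook correction. The sketch below compares the observed proportion of differing sites (the p-distance) with the Jukes-Cantor corrected distance, d = -(3/4) ln(1 - 4p/3), for nucleotide sequences. It is offered only as an illustration of the saturation effect, since the thesis builds its trees from protein sequences with maximum likelihood; the function names are hypothetical.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Observed fraction of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor estimate of substitutions per site from a p-distance."""
    if p >= 0.75:
        # At or beyond saturation the correction is undefined.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.30, 0.60, 0.74):
    print(f"observed p = {p:.2f}  corrected d = {jc69_distance(p):.3f}")
```

The gap between the observed and corrected values widens as p approaches 0.75, which is the regime in which uncorrected distances most severely underestimate the true divergence.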

Figure 2-3: Problem of multiple substitutions in phylogenetic tree reconstruction.

Variation of substitution rate among sites is also a critical issue for phylogenetic tree reconstruction. Some researchers (Fitch et al. 1967) have observed that the substitution rate among sites is variable in most genes or proteins. Sullivan et al. (1997) use the Jukes-Cantor model (Jukes et al. 1969) to show that phylogenetic tree reconstruction methods, e.g. parsimony, distance-based, and maximum-likelihood methods, can be misled if the substitution rate is incorrectly assumed to be the same at all sites. In general, a statistical distribution can be used to approximate the substitution rate to solve this problem. Therefore, in this project, a Hidden Markov Model (HMM) method (Felsenstein et al. 1996) is used to infer the substitution rate at different amino acid positions. The substitution rate is approximated using a Gamma distribution at variable sites, while some sites are assumed to be invariable. This is called an I + Γ model (Gu et al. 1995), which is the most widely used model to reconstruct phylogenetic trees. Two parameters are needed in this model: α, the shape parameter of the gamma distribution, and p_inv, the fraction of invariant sites. To estimate these parameters, a computer program called TREE-PUZZLE (Schmidt et al. 2003) is used. TREE-PUZZLE uses a fast tree search algorithm, called quartet puzzling, to estimate the maximum-likelihood parameters automatically.
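The two parameters of the I + Γ model can be read as a recipe for drawing per-site rates. The following is a minimal sketch, assuming a gamma distribution with mean 1 (shape α, scale 1/α) for the variable sites; it samples rates rather than estimating them and is not how TREE-PUZZLE works internally. The function name `sample_site_rates`, its arguments and the parameter values are hypothetical.

```python
import random


def sample_site_rates(n_sites: int, alpha: float, p_inv: float, seed: int = 0):
    """Draw one relative substitution rate per site under an I + Gamma model.

    With probability p_inv a site is invariant (rate 0); otherwise its rate
    is drawn from a gamma distribution with shape alpha and mean 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_inv:
            rates.append(0.0)
        else:
            # random.gammavariate takes (shape, scale); scale 1/alpha gives mean 1.
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


rates = sample_site_rates(n_sites=20000, alpha=0.5, p_inv=0.2)
invariant_fraction = sum(r == 0.0 for r in rates) / len(rates)
variable = [r for r in rates if r > 0.0]
print(f"invariant fraction: {invariant_fraction:.3f}")
print(f"mean rate at variable sites: {sum(variable) / len(variable):.3f}")
```

A small α concentrates most variable sites at low rates with a few fast-evolving sites, whereas a large α makes the rates nearly uniform, which is why estimating α matters for tree reconstruction.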

Some researchers (Van et al. 1998; Foster et al. 1999) also show that compositional bias can affect the correctness of phylogenetic tree reconstruction. When unrelated genes have similar nucleotide compositions, they can be considered related and grouped together in the phylogenetic tree. Phillips et al. (2004) show that compositional bias can mislead reconstruction via distance-based methods, e.g. UPGMA and Neighbor Joining, but not Maximum Likelihood methods. Therefore, in this project, Maximum Likelihood methods are used to reconstruct a phylogenetic tree for each gene cluster.

Figure 2-4: Species tree and inferred trees. (A) Species tree. (B) Two different inferred trees.

Long branch attraction (LBA) (Felsenstein et al. 1987) can cause inconsistency in phylogeny reconstruction by grouping long branches together. In our project, genes for 6 different species are extracted to reconstruct a phylogenetic tree for each gene cluster. The species tree for these 6 species is shown in Figure 2-4A. However, since the evolutionary rates of rodents, i.e.

mouse and rat, are faster than those of the other species, the inferred tree, shown in Figure 2-4B, can differ from the species tree. To deal with this problem, dog and rodents are considered separately when inferring orthologs. When inferring the orthology of human and rodents, nodes containing dog genes are ignored. Similarly, when inferring the orthology of human and dog, we skip the nodes with genes of mouse and rat. Consequently, even if the phylogenetic trees are inconsistent, the inferred orthologous genes are still correct.

Since the phylogenetic tree generated by the maximum likelihood method is unrooted, it must be converted into a rooted tree before orthologs can be inferred from it. In general, an outgroup can be used to find the root of an unrooted tree. Several issues should be considered. When an outgroup is too distant from all genes in a gene cluster, it can result in an incorrect phylogenetic tree due to mutational saturation of sequences. However, when an outgroup appears to be too close to some genes in a gene cluster, it might not be a real outgroup. In this project, Blastp is used to find an appropriate outgroup. Blastp uses BLAST (Basic Local Alignment Search Tool) to compare an amino acid sequence with a protein database and find the most similar proteins. For a gene cluster containing many paralogous genes (e.g. there are five paralogous genes, β, δ, γ1, γ2 and ε, in the β-globin gene cluster, and some clusters have hundreds of paralogs), the common ancestor of all genes in the cluster could be much earlier than the common ancestor of Boreoeutheria. Therefore, genes of three different species, opossum, chicken and zebrafish, are used as outgroups in this project. To find genes in these three species, one gene in the gene cluster is used to search the protein database.
In this project, this gene is determined using a heuristic method as follows: (1) A phylogenetic tree is generated for all genes of human using Maximum Likelihood. (2) The gene with the longest branch length is selected as a query sequences to search in the protein database.
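The selection step of this heuristic can be illustrated with a short sketch. It assumes the human-gene tree is available as a Newick string and that "longest branch" refers to the terminal branch leading to each gene; the function name `longest_branch_leaf`, the toy tree, and its branch lengths are hypothetical and are not taken from the thesis.

```python
import re

def longest_branch_leaf(newick: str) -> tuple[str, float]:
    """Return (gene_name, branch_length) for the leaf with the longest terminal branch."""
    # A leaf label directly follows "(" or ","; internal-node labels (for example
    # bootstrap values) follow ")" and are therefore skipped by the lookbehind.
    leaf_pattern = re.compile(r"(?<=[(,])\s*([^():,;\s]+):([0-9.eE+-]+)")
    leaves = [(name, float(length)) for name, length in leaf_pattern.findall(newick)]
    if not leaves:
        raise ValueError("no leaf branch lengths found in the Newick string")
    return max(leaves, key=lambda item: item[1])

# Hypothetical toy tree: gene names and branch lengths are illustrative only.
toy_tree = "((CCL1:0.90,CCL2:0.20)95:0.10,(CCL3:0.15,CCL4:0.12)80:0.05);"
query_gene, query_length = longest_branch_leaf(toy_tree)
print(query_gene, query_length)
```

The regular expression is deliberately minimal; a full Newick parser would be needed for quoted labels or comments.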

Figure 2-5 shows an example of how such a gene is selected. It is the phylogenetic tree of the human genes in the Chemokine Ligand (CCL1) gene cluster, generated using the Maximum Likelihood method. CCL1 is selected as the query for the protein database search, for two reasons. The first is based on the idea of mid-point rooting, in which the root of an unrooted tree is placed at the mid-point of the longest span across the tree. The second is to reduce the long branch attraction effect. Not all three outgroups, i.e. opossum, chicken and zebrafish, necessarily root the whole gene tree; some of them could instead be the root of a subtree, and grouping them with the gene that has the longest branch length could reduce the LBA effect (Swofford et al. 1996).

Figure 2-5: An example of selecting a gene as the query sequence for database search.

Nevertheless, in some gene clusters the common ancestor of all genes predates the divergence of zebrafish, and it is very difficult to find a true outgroup for these clusters. Moreover, even if a true outgroup can be found for these clusters, the problem of mutational saturation remains. Therefore, when the outgroup used to reconstruct the phylogenetic tree is not a real outgroup, a re-rooting process is executed. The concept of re-rooting is similar to the idea behind the construction of a reconciled tree (Page et al. 1998; Paola et al. 2003), which reconciles a gene tree with a species tree while minimizing the number of duplication events and gene losses. However, unlike finding a reconciled tree, which requires checking all branches to find the optimal root, re-rooting only traces along a tree branch to find the real root.

Figure 2-6 shows an example of how re-rooting works. Figure 2-6A is the gene tree of the Tripartite motif (TRIM) gene cluster, with the root between gene d1 and the other genes. To find the true root, the evolutionary history is traced first. In Figure 2-6A, branch E contains all 6 species; it is a common ancestor of all species (Boreoeutheria), so it could be the true root of the gene tree. Even though branch F does not include all species, it comprises primates and dog, so it is also a common ancestor of all species, with gene losses in rodents. Therefore, branches C, D, E and F are ancestors of Boreoeutheria, and any of them could be the true root of this gene tree. We check branch A first and find that it contains only a dog gene. We trace along the tree and use C as the new root, as shown in Figure 2-6B. Branch A is then combined with branch B to form a new ancestor of Boreoeutheria. Now we check the children of the root and find that both are ancestors of Boreoeutheria; therefore, we stop and use branch C as the root. Consequently, even if branch C is not the real root of the gene tree of the TRIM cluster (it could be D, E or F), this does not affect the inference of orthologous genes.

Figure 2-6: Changing the gene tree of the TRIM gene cluster via re-rooting. (A) Gene tree before re-rooting. (B) Gene tree after re-rooting.

Orthology identification

To infer orthologous genes for each gene cluster, the PHYLIP package (Felsenstein et al. 2002) is used to generate the phylogenetic tree. The Maximum Likelihood method is applied with 100 bootstrap replicates. Orthologous genes are then determined from the tree. Three types of orthologous relationships can be found in the tree: one-to-one, one-to-many, and many-to-many. Figure 2-7A shows a one-to-one relationship, in which A and B are orthologous to each other. Figure 2-7B shows a one-to-many relationship, in which A is orthologous to both B1 and B2. In the third case, shown in Figure 2-7C, A1 is orthologous to B1 and B2, and A2 is likewise orthologous to B1 and B2.

Figure 2-7: Orthologous relationships within a phylogenetic tree. (A) One-to-one. (B) One-to-many. (C) Many-to-many.

In general, a bootstrap value is used to evaluate the reliability of a phylogenetic tree. However, there are some drawbacks to using bootstrap values to estimate the reliability of putative orthologous genes. Figure 2-8A shows one such case. The tree in Figure 2-8A is part of the gene tree for the human and rhesus β-globin gene clusters. If we use the bootstrap value to infer the reliability of putative orthologous genes, the reliability of the orthology of HBG1 and R3 is 98%, and the reliability of the orthology of HBG2 and R3 is also 98%. However, when we checked all 100 trees generated by the bootstrap method, we found that 93 trees support the orthology of HBG2 and R3, whereas only 62 trees support the orthology of HBG1 and R3. This result shows the difference between the bootstrap value and the real reliability of orthologs.

Another case is shown in Figure 2-8B, which is the gene tree for human and mouse in the β-globin gene clusters. Branches with large bootstrap values, e.g. 1000, 998 and 988, represent the reliability of orthologs correctly. However, branches with lower bootstrap values, e.g. 510 and 452, do not show the reliability of orthologs accurately. For example, the bootstrap value suggests that the reliability of the orthology of HBG1 and Hbb-bh is 45.2%. However, a total of 591 trees support the orthology of HBG1 and Hbb-bh, which means that its reliability should be 59.1%. Thus, if orthologous genes with lower reliability are to be taken into consideration, bootstrap values should be used with caution.
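The per-pair counting used in these two examples can be sketched as follows. This is a minimal illustration, assuming that the ortholog pairs implied by each bootstrap tree have already been extracted; the function name `ortholog_support` and the toy input are hypothetical.

```python
from collections import Counter

def ortholog_support(pairs_per_tree):
    """Fraction of bootstrap trees that support each ortholog pair."""
    counts = Counter()
    for pairs in pairs_per_tree:
        # Count each pair at most once per tree, ignoring gene order within a pair.
        counts.update({frozenset(pair) for pair in pairs})
    n_trees = len(pairs_per_tree)
    return {tuple(sorted(pair)): count / n_trees for pair, count in counts.items()}

# Hypothetical toy input: ortholog pairs read off four bootstrap trees.
toy_trees = [
    [("HBG1", "R3"), ("HBG2", "R3")],
    [("HBG2", "R3")],
    [("R3", "HBG2"), ("HBG1", "R3")],
    [("HBG2", "R3")],
]
support = ortholog_support(toy_trees)
print(support)
```

Reporting support per gene pair, rather than per branch, is what lets two pairs that hang from the same branch receive different reliabilities.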

Figure 2-8: Some drawbacks of using a bootstrap value as the estimated reliability of orthologous genes. (A) Part of the gene tree of the β-globin gene cluster. (B) Gene tree for human and mouse in the β-globin gene cluster.

Therefore, to evaluate the reliability of orthologs accurately, in this project we do not use a bootstrap value to indicate the reliability of orthologs. Instead, we check all of the trees generated by the bootstrapping method (i.e. 100 trees) and count how many trees support the orthology of each pair of genes. For this purpose, a program, infer_orthologs, is needed to infer orthologs automatically for each gene tree. Before executing infer_orthologs, some evolutionary information, as shown in Figure 2-9, is collected as follows:

(1) If a node contains only genes of human and chimp, it is the common ancestor of human and chimp.
(2) If a node contains genes of rhesus and at least one gene of human or chimp (but no genes of other species), it is the common ancestor of primates (among the six species used here).
(3) If a node contains only genes of mouse and rat, it is the common ancestor of rodents.
(4) If one child of a node is the common ancestor of rodents, and the other child is one of: (a) the common ancestor of primates, (b) the common ancestor of human and chimp, (c) rhesus, (d) chimp, or (e) human, then this node is the common ancestor of Euarchontoglires.
(5) If one child of a node contains genes of dog, and the other child of the node is one of: (a) the common ancestor of Euarchontoglires, (b) the common ancestor of rodents, (c) the common ancestor of primates, (d) the common ancestor of chimp and human, (e) mouse, (f) rat, (g) rhesus, (h) chimp, or (i) human, then this node is the common ancestor of Boreoeutheria.
(6) If one child of a node is the common ancestor of Boreoeutheria, then this node is the ancestor of all species (in this set) and is not useful for inference of orthologs.
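The six rules above can be sketched as a recursive classifier. This is only an illustrative sketch, not the infer_orthologs program itself. The nested-tuple tree encoding (a leaf is a species name), the function names `species_of` and `kind_of`, the requirement that both species be present for rules 1 and 3, and letting rule 6 take precedence over rule 5 and propagate upward are all assumptions made here.

```python
PRIMATE_SIDE = (2, 1, "rhesus", "chimp", "human")
NON_DOG_SIDE = (4, 3, 2, 1, "mouse", "rat", "rhesus", "chimp", "human")

def species_of(node):
    """Set of species under a node; a leaf is a species name (str)."""
    if isinstance(node, str):
        return {node}
    left, right = node
    return species_of(left) | species_of(right)

def kind_of(node):
    """Leaf: its species name. Internal node: rule number 1-6, or 0 if no rule applies."""
    if isinstance(node, str):
        return node
    species = species_of(node)
    if species == {"human", "chimp"}:                      # rule 1
        return 1
    if ("rhesus" in species and len(species) > 1
            and species <= {"human", "chimp", "rhesus"}):  # rule 2
        return 2
    if species == {"mouse", "rat"}:                        # rule 3
        return 3
    left, right = node
    kinds = (kind_of(left), kind_of(right))
    if 5 in kinds or 6 in kinds:                           # rule 6 (assumed precedence)
        return 6
    if 3 in kinds and (kinds[0] in PRIMATE_SIDE or kinds[1] in PRIMATE_SIDE):
        return 4                                           # rule 4
    for dog_side, other_kind in ((left, kinds[1]), (right, kinds[0])):
        if "dog" in species_of(dog_side) and other_kind in NON_DOG_SIDE:
            return 5                                       # rule 5
    return 0

# Toy gene tree: ((human, chimp), rhesus) joined with (mouse, rat).
euarchontoglires = ((("human", "chimp"), "rhesus"), ("mouse", "rat"))
print(kind_of(euarchontoglires), kind_of((euarchontoglires, "dog")))
```

Recomputing `species_of` at every node is quadratic in the worst case, which is acceptable for a sketch; a real implementation would cache the species set on each node.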

Figure 2-9: Evolutionary information for ortholog inference.

Using the above evolutionary information, the algorithm of infer_orthologs is as follows:

(1) For each leaf N that corresponds to a human gene, H, do:
(2) Check the type of the parent of N.
(3) If the parent of N is of type 1, report that H is orthologous to all chimp genes in the other child of the parent of N.
(4) If the parent of N is of type 2, report that H is orthologous to all rhesus genes in the other child of the parent of N.
(5) If the parent of N is of type 4, report that H is orthologous to all mouse and rat genes in the other child of the parent of N.
(6) If the parent of N is of type 5, report that H is orthologous to all dog genes in the other child of the parent of N.
(7) If N is the root or the parent of N is of type 6, go to step 1; otherwise, assign the parent of N to N and go to step 2.

Evaluation of alignments

To evaluate the quality of an alignment, sensitivity and specificity are calculated. Informally, sensitivity measures the coverage of an alignment and specificity measures the correctness of an alignment. The definitions of sensitivity and specificity are pictured in Figure 2-10. Sensitivity is the ratio of correctly aligned orthologs to all true orthologs, and specificity is the proportion of correctly aligned orthologs among all aligned sequences. In general, sensitivity and specificity can be calculated according to the following formulas:

$$\text{Sensitivity} = \frac{\text{Correctly aligned orthologs } (A \cap T)}{\text{True orthologs } (T)} \times 100\%$$

$$\text{Specificity} = \frac{\text{Correctly aligned orthologs } (A \cap T)}{\text{Aligned sequences } (A)} \times 100\%$$

Figure 2-10: Definitions of sensitivity and specificity.

However, since we obtain only a subset of the true orthologs, the quality of an alignment is evaluated using only the corresponding subset of the alignment, say A′. Therefore, the formulas for sensitivity and specificity are modified as follows, where T now denotes the known subset of true orthologs:

$$\text{Sensitivity} = \frac{A' \cap T}{T} \times 100\% \qquad \text{Specificity} = \frac{A' \cap T}{A'} \times 100\%$$

Moreover, a reliability value R_i is assigned to each ortholog T_i for each pair of genes. Therefore, the formulas for sensitivity and specificity are modified once more as follows:

$$\text{Sensitivity} = \frac{\sum_i \left( (A'_i \cap T_i) \times R_i \right)}{\sum_i \left( T_i \times R_i \right)} \times 100\%$$

$$\text{Specificity} = \frac{\sum_i \left( (A'_i \cap T_i) \times R_i \right)}{\sum_i \left( A'_i \times R_i \right)} \times 100\%$$
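As a worked illustration of the reliability-weighted formulas, the sketch below computes both measures from per-pair counts. It assumes each ortholog pair i is summarized by three sizes (correctly aligned positions, positions in the true ortholog, and aligned positions) plus its reliability R_i; measuring in positions, the function name, and the toy numbers are assumptions made for this example.

```python
def weighted_sensitivity_specificity(records):
    """records: iterable of (correct_i, true_i, aligned_i, reliability_i).

    correct_i = size of A'_i intersect T_i, true_i = size of T_i,
    aligned_i = size of A'_i. Returns (sensitivity, specificity) in percent.
    """
    records = list(records)
    weighted_correct = sum(correct * rel for correct, _, _, rel in records)
    weighted_true = sum(true * rel for _, true, _, rel in records)
    weighted_aligned = sum(aligned * rel for _, _, aligned, rel in records)
    sensitivity = 100.0 * weighted_correct / weighted_true
    specificity = 100.0 * weighted_correct / weighted_aligned
    return sensitivity, specificity

# Hypothetical toy data: two ortholog pairs with reliabilities 1.0 and 0.5.
toy_records = [(80, 100, 100, 1.0), (20, 100, 50, 0.5)]
sens, spec = weighted_sensitivity_specificity(toy_records)
print(round(sens, 2), round(spec, 2))
```

In the toy data, the second pair contributes only half as much to both measures because only half of the bootstrap trees are assumed to support it.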

Results

Analysis of ortholog assignments for the α- and β-globin gene clusters

To assess the accuracy of inferred orthology relationships, we manually curated a set of ortholog assignments for the genes of the mammalian α- and β-globin gene clusters. Some of the ortholog assignments are based on the published literature (e.g., ENCODE Project Consortium 2004; Hou 2007; Prychitko 2005; Aguileta 2004; Aguileta 2006a; Aguileta 2006b); others are determined by analysis of protein sequences and flanking non-coding regions. The predicted ortholog assignments of HBB-related genes for 25 mammalian species are shown in Table 2-1. The coordinates of these genes are given in Table 2-2.

Table 2-1: Ortholog assignment of β-globin gene cluster.
Human 1=HBB beta 2=HBD delta 3=HBBP1 eta 4=HBG1 Agamma 5=HBG2 Ggamma 6=HBE1 epsilon
Chimp , 5 4, 5 6
Gibbon , 5 4, 5 6
Colobus , 5 4, 5 6
Rhesus , 5 4, 5 6
Baboon , 5 4, 5 6
Green_monkey
Night_monkey
Squirrel_monkey , 5 4, 5 6
Titi , 5 4, 5 6
Marmoset , 5 4, 5 6
Galago
Mouse 1, 2 3, 4 Del 5 5 Partial6, 7
Rat 1, 2, 3, 4 5 Del 6, 7 6, 7 8, 9
Rabbit 1 2 Del
Guinea_pig 1 Del
St_squirrel 1, 2, 3 Partial4 Del
Lemur 1 2 Del
RfBat Del Del 4
Dog 1 2,
Cat Partial4 Partial4 5
Cow 4, 10 2, 6 1, 5
Armadillo 1 2 Partial
Elephant 1 2 Del 3 3 4

Opossum

Table 2-2: The coordinates of all genes for 25 species of mammals in the β-globin gene cluster.
Gene | Chromosome/Accession ID | Start | End | Strand
Human 1 Chr Human 2 Chr Human 3 Chr Human 4 Chr Human 5 Chr Human 6 Chr Chimp 1 Chr Chimp 2 Chr Chimp 3 Chr Chimp 4 Chr Chimp 5 Chr Chimp 6 Chr Gibbon 1 NT_ Gibbon 2 NT_ Gibbon 3 NT_ Gibbon 4 NT_ Gibbon 5 NT_ Gibbon 6 NT_ Colobus 1 NT_ Colobus 2 NT_ Colobus 3 NT_ Colobus 4 NT_ Colobus 5 NT_ Colobus 6 NT_ Rhesus 1 Chr Rhesus 2 Chr Rhesus 3 Chr Rhesus 4 Chr Rhesus 5 Chr Rhesus 6 Chr Baboon 1 NT_ Baboon 2 NT_ Baboon 3 NT_ Baboon 4 NT_ Baboon 5 NT_ Baboon 6 NT_ Green_monkey 1 NT_ Green_monkey 2 NT_ Green_monkey 3 NT_ Green_monkey 4 NT_

37 Green_monkey 5 NT_ Night_monkey 1 NT_ Night_monkey 2 NT_ Night_monkey 3 NT_ Night_monkey 4 NT_ Night_monkey 5 NT_ Squirrel_monkey 1 NT_ Squirrel_monkey 2 NT_ Squirrel_monkey 3 NT_ Squirrel_monkey 4 NT_ Squirrel_monkey 5 NT_ Squirrel_monkey 6 NT_ Titi 1 NT_ Titi 2 NT_ Titi 3 NT_ Titi 4 NT_ Titi 5 NT_ Titi 6 NT_ Marmoset 1 NT_ Marmoset 2 NT_ Marmoset 3 NT_ Marmoset 4 NT_ Marmoset 5 NT_ Marmoset 6 NT_ Galago 1 NT_ Galago 2 NT_ Galago 3 NT_ Galago 4 NT_ Galago 5 NT_ Mouse 1 NT_ Mouse 2 NT_ Mouse 3 NT_ Mouse 4 NT_ Mouse 5 NT_ Mouse 6 NT_ Mouse 7 NT_ Rat 1 Chr Rat 2 Chr Rat 3 Chr Rat 4 Chr Rat 5 Chr Rat 6 Chr Rat 7 Chr Rat 8 Chr Rat 9 Chr Rabbit 1 NT_ Rabbit 2 NT_

38 Rabbit 3 NT_ Rabbit 4 NT_ Guinea_pig 1 NT_ Guinea_pig 2 NT_ Guinea_pig 3 NT_ St_squirrel 1 NT_ St_squirrel 2 NT_ St_squirrel 3 NT_ St_squirrel 4 NT_ St_squirrel 5 NT_ St_squirrel 6 NT_ Lemur 1 NT_ Lemur 2 NT_ Lemur 3 NT_ Lemur 4 NT_ RfBat 1 NT_ RfBat 2 NT_ RfBat 3 NT_ RfBat 4 NT_ Dog 1 Chr Dog 2 Chr Dog 3 Chr Dog 4 Chr Dog 5 Chr Cat 1 NT_ Cat 2 NT_ Cat 3 NT_ Cat 4 NT_ Cat 5 NT_ Cow 1 Chr Cow 2 Chr Cow 4 Chr Cow 5 Chr Cow 6 Chr Cow 10 Chr Armadillo 1 NT_ Armadillo 2 NT_ Armadillo 3 NT_ Armadillo 4 NT_ Armadillo 5 NT_ Elephant 1 NT_ Elephant 2 NT_ Elephant 3 NT_ Elephant 4 NT_ Opossum 1 ChrUn Opossum 2 ChrUn

We used this manually prepared set of ortholog assignments to analyze the robustness of the automatic, tree-based method described above. In theory, the automatic method would start by constructing the phylogenetic tree for all complete genes that we predicted for the 25 species in the β-globin gene cluster, using Maximum Likelihood. However, because a tree of all genes would be so large that its reliability would be low, we divided these genes into four categories, Primates, Rodentia, Laurasiatheria and Atlantogenata, and constructed a phylogenetic tree for each category separately. The results are shown in Figure 2-11. In three categories, Primates, Rodentia and Laurasiatheria, almost all orthologous genes fall in the same sub-tree, so the phylogenetic tree can be used very reliably to identify orthologs. One exception is the β and δ genes, which are mixed together. The main reason is gene conversion. As shown in Figure 2-12, gene conversion is the process in which one gene is overwritten by the sequence of another, similar gene, while gene duplication refers to duplication of a region of DNA that contains a gene. (For more details, see Chen 2007.) It is difficult to distinguish gene conversion from gene duplication. One possible solution is to reconstruct the ancestral gene cluster. When two similar genes are found in one species but its ancestor contains only one copy, this suggests that a gene duplication occurred. Otherwise, gene conversion is likely.

Figure 2-11: Phylogenetic trees for 25 species in β-globin gene cluster. (A) Primates. (B) Rodentia. (C) Laurasiatheria. (D) Atlantogenata.

In the Atlantogenata category shown in Figure 2-11D, the γ and ε genes are also mixed together. One possible reason is multiple substitutions at the same sites. Another possible reason is low stemminess, which means that internal branches are much shorter than terminal branches.

Low stemminess decreases the probability that the correct topology will be found (Smith et al. 1994). In this project, species in the Atlantogenata category are used only as outgroups; they are not used in the evaluation and therefore cannot affect the accuracy of ortholog assignments.

Figure 2-12: Gene conversion and gene duplication. (A) Gene conversion. (B) Gene duplication.

Species whose speciation events are very close in time cannot be distinguished well using a phylogenetic tree. For instance, the β-globin gene tree for Laurasiatheria does not match its species tree very well. Therefore, when choosing species for the evaluation, we prefer species whose tree lacks short internal branches. In this project, 6 species (human, chimp, rhesus, mouse, rat and dog) are chosen to evaluate the performance of MSA programs, though extra care is needed because dogs separated from primates only shortly before rodents did. In addition, three species (opossum, chicken and zebrafish) are chosen as outgroups. Therefore, the phylogenetic tree generated using genes of these species is generally reliable.

Ortholog assignments for the α-globin gene cluster are also studied in this project. The predicted ortholog assignments of HBA-related genes for 30 mammalian species are shown in Table 2-3. The coordinates of these genes are given in Table 2-4.

Table 2-3: The predicted ortholog assignments of HBA-related genes for 30 mammalian species.
Human 1=HBZ 2=HBZP 3=HBM 4=HBAP1 5=HBA2 6=HBA1 7=HBQ1 Gibbon partial1, partial1, , 6 5, Colobus 1, 2, 1, 2, 4 5 6, 7 6, 7 8

42 partial3 partial3 Rhesus partial1, partial1, partial2 partial2 Baboon 1, 2 1, Owl_monke , 4 3, 4 5 y Squirrel_mo 1, 2 1, nkey Vervet , 4 3, 4 5 Marmoset partial3, partial3, Titi , 4, 5, 6 3, 4, 5, 6 7 Lemur 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Tree_shrew 1, 2 1, Galago 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Mouse 1 1 2, partial4, 5 2, partial4, 5 2, partial4, 5 3, 6 Rat 1 1 2, 4, 6 2, 4, 6 2, 4, 6 3, 5, 7 Rabbit 1, 2, 3, 7, 10, 11, 12, partial15 1, 2, 3, 7, 10, 11, 12, partial15 4, 4, 5, 8, 13 Guinea_pig St_squirrel 1 1 2, 3, 4, 5 2, 3, 4, 5 2, 3, 4, 5 6 RfBat , 4, 5 3, 4, 5 3, 4, 5 6 Sbbat 1, 2 1, 2 3 4, 5 4, 5 4, 5 6, 7 Cat partial1, partial1, Cow 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Horse 1 1 2, 3 2, 3 2, 3 4 Hedgehog 1, 2, 3 1, 2, 3 4 5, 6, 5, 6, 5, 6, 8 partial7 partial7 partial7 Shrew 1, 2 1, 2 3 Partial4, 5, 6 Partial4, 5, 6 Partial4, 5, 6 Armadillo Tenrec 1, 3 1, 3 2, 4 2, 4 2, 4 5 Elephant 1, 2 1, 2 3 4, 5 4, 5 4, 5 6 Rock_hyrax 1, 2 1, Opossum 1, 2 1, 2 3, 4 3, 4 3, 4 31 Table 2-4: Coordinates of all genes for 30 mammalian α-globin gene clusters. Gene Chromosome/ Start End Strand Accession ID Human 1 Chr Human 2 Chr

43 Human 3 Chr Human 4 Chr Human 5 Chr Human 6 Chr Human 7 Chr Gibbon 1 NT_ Gibbon 2 NT_ Gibbon 3 NT_ Gibbon 4 NT_ Gibbon 5 NT_ Gibbon 6 NT_ Gibbon 7 NT_ Colobus 1 NT_ Colobus 2 NT_ Colobus 3 NT_ Colobus 4 NT_ Colobus 5 NT_ Colobus 6 NT_ Colobus 7 NT_ Colobus 8 NT_ Rhesus 1 Chr Rhesus 2 Chr Rhesus 3 Chr Rhesus 4 Chr Rhesus 5 Chr Rhesus 6 Chr Baboon 1 NT_ Baboon 2 NT_ Baboon 3 NT_ Baboon 4 NT_ Baboon 5 NT_ Baboon 6 NT_ Owl_monkey 1 NT_ Owl_monkey 2 NT_ Owl_monkey 3 NT_ Owl_monkey 4 NT_ Owl_monkey 5 NT_ Squirrel_monkey 1 NT_ Squirrel_monkey 2 NT_ Squirrel_monkey 3 NT_ Squirrel_monkey 4 NT_ Squirrel_monkey 5 NT_ Vervet 1 NT_ Vervet 2 NT_ Vervet 3 NT_ Vervet 4 NT_ Vervet 5 NT_

44 Marmoset 1 NT_ Marmoset 2 NT_ Marmoset 3 NT_ Marmoset 4 NT_ Marmoset 5 NT_ Titi 1 NT_ Titi 2 NT_ Titi 3 NT_ Titi 4 NT_ Titi 5 NT_ Titi 6 NT_ Titi 7 NT_ Lemur 1 NT_ Lemur 2 NT_ Lemur 3 NT_ Lemur 4 NT_ Lemur 5 NT_ Lemur 6 NT_ Tree_shrew 1 NT_ Tree_shrew 2 NT_ Tree_shrew 3 NT_ Tree_shrew 4 NT_ Galago 1 NT_ Galago 2 NT_ Galago 3 NT_ Galago 4 NT_ Galago 5 NT_ Galago 6 NT_ Mouse 1 NT_ Mouse 2 NT_ Mouse 3 NT_ Mouse 4 NT_ Mouse 5 NT_ Mouse 6 NT_ Rat 1 Chr Rat 2 Chr Rat 3 Chr Rat 4 Chr Rat 5 Chr Rat 6 Chr Rat 7 Chr Rabbit 1 NT_ Rabbit 2 NT_ Rabbit 3 NT_ Rabbit 4 NT_ Rabbit 5 NT_ Rabbit 6 NT_

45 Rabbit 7 NT_ Rabbit 8 NT_ Rabbit 9 NT_ Rabbit 10 NT_ Rabbit 11 NT_ Rabbit 12 NT_ Rabbit 13 NT_ Rabbit 14 NT_ Rabbit 15 NT_ Guinea_pig 1 NT_ Guinea_pig 2 NT_ Guinea_pig 3 NT_ Guinea_pig 4 NT_ St_squirrel 1 NT_ St_squirrel 2 NT_ St_squirrel 3 NT_ St_squirrel 4 NT_ St_squirrel 5 NT_ St_squirrel 6 NT_ RfBat 1 NT_ RfBat 2 NT_ RfBat 3 NT_ RfBat 4 NT_ RfBat 5 NT_ RfBat 6 NT_ Sbbat 1 NT_ Sbbat 2 NT_ Sbbat 3 NT_ Sbbat 4 NT_ Sbbat 5 NT_ Sbbat 6 NT_ Sbbat 7 NT_ Cat 1 NT_ Cat 2 NT_ Cat 3 NT_ Cat 4 NT_ Cow 1 Chr Cow 2 Chr Cow 3 Chr Cow 4 Chr Cow 5 Chr Cow 6 Chr Horse 1 NT_ Horse 2 NT_ Horse 3 NT_ Horse 4 NT_ Hedgehog 1 NT_

Hedgehog 2 NT_ Hedgehog 3 NT_ Hedgehog 4 NT_ Hedgehog 5 NT_ Hedgehog 6 NT_ Hedgehog 7 NT_ Hedgehog 8 NT_ Shrew 1 NT_ Shrew 2 NT_ Shrew 3 NT_ Shrew 4 NT_ Shrew 5 NT_ Shrew 6 NT_ Armadillo 1 NT_ Armadillo 2 NT_ Armadillo 3 NT_ Tenrec 1 NT_ Tenrec 2 NT_ Tenrec 3 NT_ Tenrec 4 NT_ Tenrec 5 NT_ Elephant 1 NT_ Elephant 2 NT_ Elephant 3 NT_ Elephant 4 NT_ Elephant 5 NT_ Elephant 6 NT_ Rock_hyrax 1 NT_ Rock_hyrax 2 NT_ Rock_hyrax 3 NT_ Rock_hyrax 4 NT_ Rock_hyrax 5 NT_ Opossum 1 Chr Opossum 2 Chr Opossum 3 Chr Opossum 4 Chr Opossum 5 Chr

The genes of the α-globin clusters are divided into four categories, Primates, Rodentia, Laurasiatheria and Atlantogenata, and a phylogenetic tree is constructed for each category separately. The results are shown in Figure 2-13. The trees indicate that some gene conversions occurred between the α and ζ genes.

Figure 2-13: Phylogenetic trees for 30 species in the α-globin gene cluster. (A) Primates. (B) Rodentia. (C) Laurasiatheria. (D) Atlantogenata.

2.3.2 Comparison of different alignment programs

In this project, the performance of four alignment programs, MLAGAN (Brudno et al. 2003), MAVID (Bray et al. 2004), TBA (Blanchette et al. 2004) and ROAST (Hou et al. 2005), is studied. The β-globin gene cluster (Hardies et al. 1984) from the ENCODE project (ENCODE Project Consortium 2004) is used to evaluate the quality of the different alignment programs. The results are shown in Figure 2-14. Figure 2-14A shows the sensitivity of the different alignment programs. In general, the sensitivity of ROAST is better than that of MAVID, MLAGAN and TBA. Figure 2-14B shows the specificity of these alignments. The performance of ROAST is superior to the others for most species.

Figure 2-14: Sensitivity and specificity among different alignment programs. (A) Sensitivity. (B) Specificity.

Comparison of different gene clusters

While we limited the evaluation of MLAGAN and MAVID to the two globin clusters because we had access to those alignments only in ENCODE regions, we can compare aligners developed in our lab on as many gene clusters as we care to analyze. In total, 13 gene clusters are used in this part of the project to evaluate two multiple alignment programs, ROAST and Multiz (Blanchette et al. 2004; Ovcharenko et al. 2005), the latter being the multi-aligner component of TBA (Blanchette et al. 2004). (More detailed information on these 13 gene clusters can be found in Table 2-5.) The results are shown in Figure 2-15. Among the 13 gene clusters, the sensitivities computed for three gene clusters, IFN (Interferon genes), SPRR (Small proline rich genes) and UGT2 (UDP glycosyltransferase 2 family), are worse than the others. All three of these gene clusters contain many orthologous genes with many-to-many relationships, which means that a gene in one species is orthologous to many genes in another species, while at the same time many genes in the first species are orthologous to the same gene in the other species. This suggests that existing multiple-alignment programs do not align orthologous genes with many similar copies very well. Better handling of these three clusters could be a goal for the next generation of multiple-alignment programs.

Table 2-5: Detailed information for 13 gene clusters.
Gene clusters | Assemblies | Chromosome | Start | End
hg18 Chr pantro2 Chr BTN rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr CCL rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr CYP2 rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr HB hg18 Chr pantro2 Chr rhemac2 Chr

50 39 HLA-D IFN MMP MT OR SERPINA SPRR TRIM mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr rhemac2 Chr mm8 Chr rn4 Chr canfam2 Chr hg18 Chr pantro2 Chr

UGT2
rhemac2 Chr14 67623262 67763703
mm8 Chr7 104092629 104244403
rn4 Chr1 162059723 162190566
canfam2 Chr21 31792348 31892437
hg18 Chr4 69546932 70548006
pantro2 Chr4 60980943 61774067
rhemac2 Chr
mm8 Chr
rn4 Chr
canfam2 Chr

Figure 2-15: Sensitivity and specificity of different gene clusters. (A) Sensitivity. (B) Specificity.

Comparison of different species

We also examined how performance differs among species in our approach to evaluating MSA programs. In particular, the sensitivity and specificity for 5 species, i.e. chimp, rhesus, mouse, rat and dog, are analyzed. The results are shown in Figure 2-16. The performance for rodents, i.e. mouse and rat, is worse than for the other species. Although the speciation of dog (relative to human) preceded that of rodents, the substitution rate in rodents is faster than in the dog lineage. Therefore, rodent genes are less similar to human genes than dog genes are, so multiple-alignment programs cannot align the rodent sequences very well. Thus, another main challenge for the next generation of multiple-alignment programs could be to align sequences from distant species more accurately.

Figure 2-16: Sensitivity and specificity of different species. (A) Sensitivity. (B) Specificity.

2.4 Conclusion

The work described above represents the initial portion of a project that is a component of a larger effort in the Miller lab to produce an improved suite of tools for evaluating MSAs. This effort includes (1) building an improved sequence-evolution simulator that consolidates and extends earlier work in the lab (Blanchette 2004; Ovcharenko 2005; Ma 2007; Zhang 2007), (2) use and enhancement of tools and ideas developed elsewhere for evaluating whole-genome alignments (Margulies 2007; Prakash and Tompa 2007; Wang 2007) and (3) development of new tools to evaluate alignments in gene clusters. In each case, the aim is to produce robust tools that can easily be run by others.

The third component of this effort, i.e., evaluation tools for gene-cluster MSAs, is our responsibility. As described above, we have developed a method that uses phylogenetic trees to identify orthologous genes within gene clusters. A detailed analysis of the α- and β-globin gene clusters indicates that orthology relationships inferred this way are reasonably accurate. These orthology assignments were applied to analyze the quality of alignments produced by several alignment programs, i.e., MAVID, MLAGAN, TBA and ROAST. The results show that the performance of ROAST is better than the others in terms of sensitivity and specificity. We also analyzed two aligners, ROAST and Multiz, on 13 gene clusters.

Chapter 3

Gene conversion detection between a pair of genes

In this chapter, we try to detect gene conversion between a pair of genes. A site-by-site compatibility method is proposed to detect the occurrence and directionality of gene conversion between a pair of genes. This method is applied to two data sets, namely the beta and delta genes, and the two gamma genes. Detailed gene conversion information for these two data sets is shown in this chapter.

3.1 Introduction

Motivation

Phylogenetic trees are used to infer orthologs by comparing the gene tree with the species tree. However, in some cases the phylogenetic tree is not reliable, so the inferred orthology could be incorrect. For instance, figure 3-1 shows the phylogenetic tree for the beta globin gene cluster. The beta genes and delta genes are mixed together, and the phylogenetic tree cannot separate them. The main reason is gene conversion: some delta genes have been converted by the beta genes, so their sequences are closer to each other than expected. In this chapter, we deal with the gene conversion problem.

Figure 3-1: Phylogenetic tree of beta globin gene cluster.

What is gene conversion

Gene conversion is a process in which one gene is overwritten by the sequence of another, similar gene. As shown in figure 3-2A, there are originally two similar genes, and one of them is replaced by the other, leaving two very similar genes. Figure 3-2B shows the process of gene duplication. The difference between the two is that gene conversion can affect only part of a gene: part of one gene can be converted by another gene while the rest remains the same, and the result is still a complete gene. In gene duplication, by contrast, the whole gene must be duplicated; otherwise the copy cannot form a complete gene.

Figure 3-2: Gene conversion and gene duplication. (A) Gene conversion. (B) Gene duplication.

Impact of gene conversion on the inference of orthology

Gene conversion makes the inference of phylogenetic trees more difficult. For example, figure 3-3A shows a gene tree for five genes. Suppose there is a gene conversion event in some region between gene D and gene E. Then the gene tree for the converted region might look like the tree shown in figure 3-3B, so the relationships among these five genes are inconsistent across regions. Current tree reconstruction methods cannot deal with this problem.

Figure 3-3: Effect of gene conversion. (A) Gene tree for five genes. (B) Phylogenetic tree for the conversion region.

Methods for gene conversion detection

Many gene conversion detection methods have been proposed. They can be classified into four types. Similarity methods look for highly conserved regions; however, high conservation is sometimes due to selection pressure. Phylogenetic methods look for regions whose phylogenetic trees differ from those of other regions. Compatibility methods find all parsimoniously informative sites and check whether they are compatible with each other. Substitution methods look for a significant clustering of substitutions or some other particular pattern.

Limitations of these methods

Most current methods share the following problems. They find gene conversion events only in each gene and cannot find events in the ancestors. Some of them also cannot determine the directionality of gene conversion: they can find a conversion event between two genes, but they cannot tell which gene was converted. Finally, most of these methods do not handle multiple overlapping gene conversion events in a gene very well.

3.2 Methods

We use a method called the site-by-site compatibility method, which checks all parsimoniously informative sites and finds all possible gene conversion events for all genes and their ancestors.

This method was first proposed by Fitch; however, it was carried out by hand. Drouin proposed an automated method, but that method still has some problems. The first is its hypothesis of a gene conversion event: a gene conversion event is assumed if the paralogous genes are identical and the orthologous genes are different. In some cases this is not correct. For example, figure 3-4 is the gene tree for four genes. According to their method, there is a gene conversion event between A1 and A2, and another between B1 and B2. However, the answer depends on the status of their parent. If the status of their parent is T, there is only one gene conversion event, between B1 and B2. If the status of their parent is A, there is only one gene conversion event, between A1 and A2. Another problem is the directionality of gene conversion: they did not propose a method to determine which gene is converted. Finally, they also did not have a method to determine the boundaries of gene conversion events.

Figure 3-4: An example showing the issues with Drouin's method.

Site-by-site compatibility method

To deal with these problems, we first determine the status of all ancestors. We use Fitch's algorithm to find a parsimony tree. Fitch's algorithm includes two phases, a bottom-up phase and a top-down phase. The bottom-up phase finds the possible status s_i of internal node i with children j and k using the following formula:

(3-1)

Figure 3-5 shows an example of the bottom-up phase, in which the possible statuses for all internal nodes are determined.

Figure 3-5: An example of bottom-up phase.

The top-down phase then determines the final status $s_j$ of internal node j with parent i using the following formula:

(3-2)

Figure 3-6 shows an example in which the final statuses for all internal nodes are determined.
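The two phases can be sketched in Python for a single site. This is a minimal illustration under stated assumptions, not the implementation used in this work: the bottom-up step uses the standard Fitch rule (intersection of the children's candidate sets if it is non-empty, otherwise their union), the top-down tie-break (take the alphabetically first candidate when the parent's status is not among a node's candidates) is an assumption, and the names `fitch`, `children`, and `leaf_state` are hypothetical.

```python
def fitch(children, leaf_state, root):
    """Fitch parsimony for one site.

    children:   dict mapping each internal node to its (left, right) children
    leaf_state: dict mapping each leaf to its observed nucleotide
    Returns (candidate sets from the bottom-up phase, final statuses).
    """
    sets = {}

    def bottom_up(node):
        if node not in children:                 # leaf: its observed status
            sets[node] = {leaf_state[node]}
        else:
            left, right = children[node]
            a, b = bottom_up(left), bottom_up(right)
            sets[node] = (a & b) or (a | b)      # intersection, else union
        return sets[node]

    final = {}

    def top_down(node, parent_status):
        if parent_status in sets[node]:
            final[node] = parent_status          # agree with the parent if possible
        else:
            final[node] = min(sets[node])        # assumed deterministic tie-break
        for child in children.get(node, ()):
            top_down(child, final[node])

    bottom_up(root)
    top_down(root, None)
    return sets, final


# Hypothetical four-leaf tree: ((a, b), (c, d)) with statuses A, A, A, T.
children = {"root": ("X", "Y"), "X": ("a", "b"), "Y": ("c", "d")}
leaf_state = {"a": "A", "b": "A", "c": "A", "d": "T"}
sets, final = fitch(children, leaf_state, "root")
```

In this toy tree the node `Y` has two candidate statuses after the bottom-up phase, and the top-down phase resolves it to the status of the root.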

Figure 3-6: An example of top-down phase.

Gene conversion inference

Based on the statuses of all internal nodes, for each parsimoniously informative site k, pairs of paralogous genes, $s_j$ with parent $s_i$ and $s_{j'}$ with parent $s_{i'}$, that favor the hypothesis of a gene conversion are determined using the following formula:

(3-3)

Figure 3-7 shows an example of how to detect gene conversion and its directionality. All pairs of paralogous genes are checked. The paralogous genes F1 and F2 are each the same as their parents but different from each other, so there is no gene conversion

between this pair of genes. For the pair E1 and E2, E1 is different from its parent but the same as its paralog, E2. So there is a gene conversion event between this pair, and E1 is converted by E2. Similarly, there is a gene conversion event between the paralogous genes C1 and C2, but the directionality is different: C2 is converted by C1. Finally, for the pair B1 and B2, both genes differ from their parents but are the same as each other. There is a gene conversion event between this pair, but we cannot determine its directionality.

Figure 3-7: An example showing how to determine gene conversion and its directionality.
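The per-site decision illustrated by the four cases of Figure 3-7 can be sketched as a small Python function. This is only a reading of the worked example, not a transcription of formula (3-3); the function name `classify_site`, its return labels, and the nucleotides in the usage lines are hypothetical.

```python
def classify_site(s1, p1, s2, p2):
    """Classify one informative site for a pair of paralogous genes.

    s1, s2: statuses of the two paralogs at this site
    p1, p2: statuses of their respective parents in the tree
    """
    if s1 != s2:
        return "no conversion"            # paralogs differ: no evidence
    changed1 = s1 != p1                   # gene 1 departs from its parent
    changed2 = s2 != p2                   # gene 2 departs from its parent
    if changed1 and changed2:
        return "conversion, direction undetermined"
    if changed1:
        return "gene 1 converted by gene 2"
    if changed2:
        return "gene 2 converted by gene 1"
    return "no conversion"                # identical by descent


# Hypothetical statuses mirroring the four cases described for Figure 3-7.
case_F = classify_site("A", "A", "T", "T")
case_E = classify_site("T", "A", "T", "T")
case_C = classify_site("A", "A", "A", "T")
case_B = classify_site("G", "A", "G", "T")
```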

Boundaries of gene conversion

Finally, the boundaries of gene conversion events are determined. Let an r-site be a site that indicates a conversion event and an s-site be a site that shows no evidence of gene conversion. For a sequence with s s-sites and r r-sites, Stephan proposed a statistical test that gives the probability that there are at least k r-sites between a randomly chosen pair of consecutive s-sites:

$$P_m = \sum_{j=k}^{r} \binom{s+r-j-3}{r-j} \Big/ \binom{s+r-2}{r} \qquad (3\text{-}4)$$

Therefore, the probability that at least one of the s − 1 pairs of consecutive s-sites contains at least k r-sites is:

$$P = 1 - (1 - P_m)^{s-1} \qquad (3\text{-}5)$$

However, this method can only find a region in which every site shows evidence of gene conversion. Usually, some sites within a converted region do not show evidence of conversion just by chance. Therefore, we modify the formula to allow some sites without evidence of conversion inside a conversion region. For a sequence with s s-sites and r r-sites, the probability that there are at least k r-sites between a randomly chosen pair of s-sites that encloses i s-sites is:

$$P_m^i = \sum_{j=k}^{r} \binom{j+i}{i} \binom{s+r-j-i-3}{r-j} \Big/ \binom{s+r-2}{r} \qquad (3\text{-}6)$$

Therefore, the probability that at least one of the s − i − 1 pairs of s-sites that enclose i s-sites has at least k r-sites between them is:

$$P = 1 - (1 - P_m^i)^{s-i-1} \qquad (3\text{-}7)$$
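As a numerical illustration, the boundary statistics can be evaluated with binomial coefficients. The sketch below assumes that equations (3-4) and (3-6) sum over the number j of r-sites in the window, with j appearing inside the binomial coefficients, and that s − i ≥ 3 so that every coefficient is well defined; the function names `p_window` and `p_any_window` are hypothetical.

```python
from math import comb


def p_window(s, r, k, i=0):
    """Probability of at least k r-sites in a random window enclosing i s-sites.

    With i = 0 this is the consecutive-pair statistic (3-4);
    with i > 0 it is the modified statistic (3-6).
    """
    if s - i < 3:
        raise ValueError("this sketch assumes s - i >= 3")
    total = comb(s + r - 2, r)
    hits = sum(comb(j + i, i) * comb(s + r - j - i - 3, r - j)
               for j in range(k, r + 1))
    return hits / total


def p_any_window(s, r, k, i=0):
    """Probability that at least one of the s - i - 1 windows qualifies (3-5, 3-7)."""
    return 1 - (1 - p_window(s, r, k, i)) ** (s - i - 1)


# Example: 5 s-sites, 4 r-sites, at least 2 r-sites in a window.
consecutive = p_window(5, 4, 2)          # 10/35
with_one_gap = p_window(5, 4, 2, i=1)    # 22/35
```

Setting k = 0 gives a probability of 1 in both forms, which is a quick sanity check that the terms sum to the normalizing coefficient.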

Results and limitations

Beta and delta genes

We use this method to find all gene conversion events between the beta and delta genes. The result is shown in Figure 3-8. An empty square represents no gene conversion event. There are three types of gene conversion events: a blue line indicates that the delta gene is converted, a red line indicates that the beta gene is converted, and a green line means that the directionality cannot be determined. There are gene conversion events in the coding regions of cat, dog and rf_bat. There are also gene conversion events in the common ancestors of human, chimp, rhesus, baboon and marmoset (HCRBM); cat and dog (CaD); cat, dog and rf_bat (CaDRf); and all species (HCRBMCaDRf).

Figure 3-8: All gene conversion events between beta and delta genes. Gene conversion events are detected within 8 species, i.e. human (H), chimp (C), rhesus (R), baboon (B), marmoset (M), cat (Ca), dog (D) and rf_bat (Rf), and in their ancestors.

Two gamma genes

The proposed method is also applied to detect all gene conversion events between the two gamma genes. The result is shown in Figure 3-9. There are gene conversion events in both the coding and non-coding regions of all species, as well as events that occurred in the ancestors of these species.

Figure 3-9: All gene conversion events between two gamma genes. Gene conversion events are detected within 8 species, i.e. human (H), gibbon (G), colobus monkey (C), rhesus (R), baboon (B), squirrel monkey (S), marmoset (M), and dusky titi (T), and in their ancestors.

Limitations

However, there are still some problems in using this method to find orthologous genes for different gene clusters. The main problem is that the method requires the phylogeny to be known in advance, and for some gene clusters we do not know the phylogeny. Furthermore, as shown in Figure 2-7, three types of orthologous relationships can be determined from the gene clusters: one-to-one, one-to-many and many-to-many. Each type of orthologous relationship needs to be taken into consideration.

Chapter 4

Gene conversion detection for whole genome

Gene conversion events are often overlooked in analyses of genome evolution. In such an event, an interval of DNA sequence (not necessarily containing a gene) overwrites a highly similar sequence. The event creates relationships among genomic intervals that can confound prediction of orthologs and attempts to transfer functional information between genomes. Here we analyze 1,112,202 highly conserved pairs of human genomic intervals, and detect a conversion event for about 13.5% of them. Properties of the putative gene conversions are analyzed, such as the distributions of the lengths of the converted regions and the spacing between source and target. We also compare results from the whole-genome predictions with previous analyses for several well-studied gene clusters, including the globin genes.

4.1 Introduction

Several classes of evolutionary operations have sculpted the human genome. Nucleotide substitutions have been studied in great detail for years, and much attention is now focused on large-scale events such as insertions, deletions, inversions, and duplications. Frequently overlooked are gene conversion events (reviewed by Hurles 2004 and Chen et al. 2007), in which one region is copied over the location of a highly similar region; before the operation there are two genomic intervals, say A and B with 95% identity, and afterwards there are two identical copies of A, one in the position formerly occupied by B.

Conversion events need to be accounted for when attempting to understand the human genome based on identification of orthologous regions in other species. To take a hypothetical

example, suppose human genes A and B are related by a duplication event that pre-dated the separation of humans and Old World monkeys, so that rhesus macaques also have genes A and B. A conversion event in a human ancestor that overwrote some of A with sequence from B could cause all or part of A's amino-acid sequence to be more closely related to the rhesus B protein than to the rhesus A protein, even though A's regulatory regions might remain intact. Successful design and interpretation of experiments in rhesus to understand gene A might well require knowledge of these evolutionary relationships.

Gene conversion events have been studied in a variety of species. Drouin (2002) characterized conversions within 192 yeast gene families; Semple and Wolfe (1999) detected conversion events in 7,397 Caenorhabditis elegans genes; Ezawa et al. (2006) studied 2,641 gene quartets, each consisting of two pairs of orthologous genes in mouse and rat, and found that 488 (18%) appear to have undergone gene conversion; and Xu et al. (2008) detected 377 gene conversion events within 626 multigene families in the rice genome. However, all of these studies detect gene conversion events only between pairs of protein-coding genes, although conversion can occur between any pair of highly similar regions (Chen et al. 2007). Some analyses of gene conversion have also been done in the human genome (Jackson et al. 2005; McGrath et al. 2009; Benovoy and Drouin 2009), but none of the datasets used in these previous studies is comparable in size to the one analyzed here. In this paper, we cover more than one million paralogous pairs of regions, which requires a more efficient method to deal with such a large dataset.

Evidence of conversion between human genes frequently appears in cases where the conversion involves only part of a duplicated region. For instance, consider the δ-globin and β-globin genes, which lie close together on human chromosome 11 in a gene cluster shaped by conversion events (Papadakis and Patrinos 1999). A human-human alignment reveals similarities extending beyond the genes, created by an ancient duplication event pre-dating the radiation of

placental mammals, roughly 100 million years ago; see Figure 4-1B. To test whether the elevated percent identity in the protein-coding regions can be explained entirely by purifying selection on those regions, we can compare the pattern of sequence conservation between the paralogous human regions with that between human δ-globin and its ortholog in an appropriately diverged species. Using dog, we see that in most of the interval around the δ-globin gene, the human sequence is more similar to the dog δ-globin region (blue) than to the human β-globin region (red) as expected, but this is reversed in a large interval containing exons 1 and 2, and perhaps in part of exon 3; see Figure 4-1C. One reasonable inference from this observation is that a conversion event overwrote that interval with the homologous sequence from the β-globin gene, or vice versa. Indeed, the procedure described in this paper identifies a conversion event covering an interval that starts somewhat upstream of exon 1 and extends just beyond exon 2, while the potential conversion of part of exon 3 is not significant in this test (Figure 4-1D). On the other hand, testing with marmoset sequence instead of dog does find significant evidence (p = 0.0006) of a conversion event involving part of exon 3 (data not shown).

Figure 4-1: Evidence of gene conversion in the human δ-globin gene. (A) Schematic view of the gene. (B) Percent identity plot of an alignment to an interval containing the human β-globin gene; each short horizontal line indicates the percent identity over a subinterval of the alignment. (C) Plot of the alignment

to the dog δ-globin gene (blue) compared to the paralogous human alignment (red). (D) An interval of gene conversion detected by the method described in this paper, and a smaller interval where the dog sequence does not detect statistically significant evidence of conversion. See the text for further discussion.

A number of statistical tests have been proposed to detect gene conversions. However, most of these tests are only efficient for small data sets, e.g. individual gene clusters. Boni et al. (2007) nicely summarize computational methods available for detecting mosaic structure in sequences, and propose a new method that is particularly economical in terms of computer execution time for large data sets. One drawback is that their algorithm requires large amounts of computer memory. However, we show here that this method can be reformulated so that the memory requirements are no longer a limiting factor, which allows us to conduct a comprehensive scan for gene-conversion events in the human genome, starting with 1,112,202 pairs of paralogous human intervals. For each pair of paralogous intervals, say H1 and H2, we choose a sequence from another species, say C1, that is believed to be orthologous to H1. These triplets of sequences are examined to find cases where part of H1 is more similar to H2 than to C1, while another part is more similar to C1. In such cases, the interval of high H1-H2 similarity is inferred to have resulted from a conversion event, as illustrated in Figure 4-1.

Our findings include the following observations about predicted human gene conversions. About 71% of the 149,799 detected conversion events occurred between two intervals on different chromosomes, but the conversion rate for intra-chromosomal paralogous pairs is about 1.5-fold that for inter-chromosomal paralogous pairs. For the intra-chromosomal conversions, the distance between the pair of sequences is a key factor affecting conversion frequencies. Pairs of similar sequences that are too close or too distant have a lower conversion frequency; the highest conversion frequency is between paralogs separated by 10Kb to 100Kb. Moreover, we find that: (i) conversion frequencies are proportional to the length of the paralogs; (ii) 57% of conversion events cover less than 100 bp; (iii) selection plays a critical role

for the occurrence of gene conversion; (iv) conversion frequency varies among chromosomes, and does so in a manner suggesting that the mechanisms generating conversions are not associated with the process of genome replication; (v) the relative orientation of the two sequences has little effect on conversion frequency; (vi) gene conversions have a preference for lower GC-content.

4.2 Methods

Highly conserved pairs of sequences

We aligned each pair of human chromosomes, including self-alignments, using BLASTZ (Schwartz et al. 2003) with T=2 and default values for the other parameters. Chaining of the human-human alignments was performed using the method of Zhang et al. (1994). For alignments between human intervals and their putative orthologs in other species, we used the pairwise alignment nets (Kent et al. 2003) downloaded from the UCSC Genome Browser website (Kent et al. 2002).

Gene conversion detection between each pair of sequences

For each pair of similar human sequences, say H1 and H2, we found an interval, C1, from another species (perhaps chimpanzee) that appears to be orthologous to H1. Our plan was to look for cases where part of H1 is more like H2, while other parts are more like C1. Following Boni et al. (2007), we used the H1-H2 alignment and the H1-C1 alignment to identify informative positions in H1, such that either H1 and H2 have one nucleotide and C1 has another (score −1), or H1 and C1 have one nucleotide and H2 has another (score +1). In Figure 4-2, we plot idealized examples of the cumulative sum of these scores along H1, which constitute what is called a

hypergeometric random walk (HGRW; Feller 1957) under the assumption that H1's relationships to H2 and C1 are invariant across the interval. If the duplication event producing H1 and H2 preceded the speciation that divided H1 and C1, then the plotted quantity will generally increase because there will be more +1 scores (H1 like C1) than −1 scores (H1 like H2); this is illustrated in panels A and D of Figure 4-2. Panels C and F illustrate the opposite case where the duplication followed the speciation. The case of interest to us is when duplication preceded speciation, but where following duplication, a subinterval of H1 was overwritten by the corresponding subinterval of H2; in that subinterval the plot decreases, contrary to the behavior in the rest of the plot (panels B and E). Our task is to identify cases where the +1s and −1s are distributed along H1 so as to create an interval of maximum descent in the cumulative sum that cannot be explained by chance alone. The maximum descent is the maximum decrease of scores across an interval (in one direction only), as shown in Figure 4-2.

For a given pair H1 and H2, we needed to find sequence from a species at an appropriate evolutionary distance, i.e., one that split from the human lineage somewhat after the duplication event and before the most recent conversion. Thus, we tried a gamut of available mammalian genome sequences: chimp, orangutan, rhesus, marmoset, dog and opossum. Each of these species could be used to identify gene conversion events in a particular period of evolution along the lineage leading to human. Because the orthologs of H1 and H2 often differ, up to 12 triplets were used to look for gene conversion between a given paralogous pair.

More formally, we detect conversions using the test statistic $x_{m,n,k}$, which is the probability of a maximum descent of k occurring by chance for a triplet with m +1s and n −1s. Boni et al. (2007) give a dynamic programming algorithm for computing $x_{m,n,k}$.
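The scoring, the maximum descent, and the probability $x_{m,n,k}$ can be sketched in Python for small inputs. This is an illustrative sketch, not the software used in this study: it assumes three aligned, gap-free sequences of equal length, it follows the four-index recursion of Boni et al. (2007) that appears as equation (4-3) later in this chapter, with the index j read as the depth of the walk's minimum below its starting point, and it uses exact fractions with memoization rather than the table-based layout needed at scale. All function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations


def site_scores(h1, h2, c1):
    """Score informative columns: +1 if H1 matches only C1, -1 if only H2."""
    scores = []
    for a, b, c in zip(h1, h2, c1):
        if a == c and a != b:
            scores.append(+1)
        elif a == b and a != c:
            scores.append(-1)
    return scores


def max_descent(scores):
    """Largest drop of the cumulative sum from any earlier point (start included)."""
    height = peak = drop = 0
    for step in scores:
        height += step
        peak = max(peak, height)
        drop = max(drop, peak - height)
    return drop


@lru_cache(maxsize=None)
def y(m, n, k, j):
    """P(max descent = k and minimum = -j) for a random order of m ups, n downs."""
    if j < 0 or j > k or k > n:
        return Fraction(0)
    if m == 0 and n == 0:
        return Fraction(1)                      # empty walk: k = j = 0
    up, down = Fraction(m, m + n), Fraction(n, m + n)
    if j == 0:                                  # first step must be an up
        return up * (y(m - 1, n, k, 1) + y(m - 1, n, k, 0)) if m else Fraction(0)
    total = up * y(m - 1, n, k, j + 1) if m else Fraction(0)
    if j < k:
        total += down * y(m, n - 1, k, j - 1)
    else:                                       # j == k > 0
        total += down * (y(m, n - 1, k - 1, k - 1) + y(m, n - 1, k, k - 1))
    return total


def x(m, n, k):
    """P(max descent = k), summing y over the possible minima."""
    return sum(y(m, n, k, j) for j in range(k + 1))


def brute_force_x(m, n, k):
    """Same probability by enumerating every ordering (small m and n only)."""
    hits = total = 0
    for downs in combinations(range(m + n), n):
        steps = [-1 if i in downs else +1 for i in range(m + n)]
        hits += max_descent(steps) == k
        total += 1
    return Fraction(hits, total)
```

The brute-force enumeration is included only as a cross-check of the recursion; it grows combinatorially and is not usable beyond a handful of informative sites.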

Figure 4-2: Determining the occurrence of gene conversion events in a triplet, where H1 and H2 are paralogous human intervals and C1 is an ortholog of H1 from another species. Panels A to C depict three scenarios relating the three sequences, while D to F are associated plots where the horizontal axis shows positions along H1 and the slope is positive if and only if H1 matches C1 more frequently than it matches H2. A region with a statistically improbable decrease is predicted to identify a conversion event.

Space-efficient modifications

The original formulation by Boni et al. (2007) requires an amount of computer time and memory that is proportional to $B^4$, where B is an upper bound for m, n, and k. For a triplet with

informative sites, this approach would use 6.4 GB of computer memory, allowing the method to work only with relatively short sequences. We modified that method to need only space proportional to $mn + n^2 + SP$, as we now describe. In the notation of Boni et al., the test statistic $x_{m,n,k}$ is defined as $P(\mathrm{md}\,H_{m,n} = k)$ and can be calculated using the equation

$$x_{m,n,k} = \sum_{j=0}^{k} y_{m,n,k,j} \qquad (4\text{-}1)$$

where:

$$y_{m,n,k,j} = P(\mathrm{md}\,H_{m,n} = k \,\wedge\, \min H_{m,n} = -j) \qquad (4\text{-}2)$$

The probabilities y can be obtained by dynamic programming based on the following recursive relationships:

$$y_{m,n,k,j} = \begin{cases}
\dfrac{m}{m+n}\left[y_{m-1,n,k,1} + y_{m-1,n,k,0}\right] & \text{if } j = 0, \\[2ex]
\dfrac{m}{m+n}\,y_{m-1,n,k,j+1} + \dfrac{n}{m+n}\,y_{m,n-1,k,j-1} & \text{if } k > j > 0, \\[2ex]
\dfrac{n}{m+n}\left[y_{m,n-1,j-1,j-1} + y_{m,n-1,j,j-1}\right] & \text{if } j = k > 0, \\[2ex]
0 & \text{if } j > k.
\end{cases} \qquad (4\text{-}3)$$

In order to reduce memory usage, we introduce the additional variable $A_{m,n,k}$, defined as:

$$A_{m,n,k} = y_{m,n,k,k} = \frac{n}{m+n}\left[y_{m,n-1,k-1,k-1} + y_{m,n-1,k,k-1}\right] = \frac{n}{m+n}\left[A_{m,n-1,k-1} + y_{m,n-1,k,k-1}\right] \qquad (4\text{-}4)$$

Then,

$$\begin{aligned}
x_{m,n,k} &= \sum_{j=0}^{k} y_{m,n,k,j} = y_{m,n,k,0} + \sum_{j=1}^{k-1} y_{m,n,k,j} + y_{m,n,k,k} \\
&= \frac{m}{m+n}\left[y_{m-1,n,k,1} + y_{m-1,n,k,0}\right] + \sum_{j=1}^{k-1}\left[\frac{m}{m+n}\,y_{m-1,n,k,j+1} + \frac{n}{m+n}\,y_{m,n-1,k,j-1}\right] + \frac{n}{m+n}\left[y_{m,n-1,k-1,k-1} + y_{m,n-1,k,k-1}\right] \\
&= \frac{m}{m+n}\sum_{j=0}^{k} y_{m-1,n,k,j} + \frac{n}{m+n}\left[\sum_{j=0}^{k-1} y_{m,n-1,k,j} + y_{m,n-1,k-1,k-1}\right] \\
&= \frac{m}{m+n}\,x_{m-1,n,k} + \frac{n}{m+n}\left[\sum_{j=0}^{k} y_{m,n-1,k,j} - y_{m,n-1,k,k} + y_{m,n-1,k-1,k-1}\right] \\
&= \frac{m}{m+n}\,x_{m-1,n,k} + \frac{n}{m+n}\left[x_{m,n-1,k} - A_{m,n-1,k} + A_{m,n-1,k-1}\right]
\end{aligned} \qquad (4\text{-}5)$$

The key observation is that for fixed k, the only component of equation (4-3) that depends on k − 1 is the case j = k > 0, and in that case the required value is $A_{m,n-1,k-1}$. (On the other hand, the initialization of $y_{m,n,k,j}$ for m = 0 does depend on k.) Consequently, provided that we record the 3-dimensional array of values $A_{m,n,k}$, we can store the values of y for a fixed k in another 3-dimensional array that we call $y_{m,n,j}$ and overwrite them with the values corresponding to k + 1 as the computation proceeds. The resulting algorithm uses only two arrays of size $mn^2$ (x and y can be stored in the same array), as shown in Figure 4-3. It can handle triplets with 2000 informative sites on a mid-sized workstation.

MODIFIED-3SEQ(MAX_M, MAX_N)
    for m ← 0 to MAX_M do
        A[m, 0, 0] ← 1
        for n ← 1 to MAX_N do
            A[m, n, 0] ← 0
    for k ← 1 to MAX_N do
        for m ← 0 to MAX_M do
            for j ← 0 to MAX_N do
                y[m, 0, j] ← 0
        for n ← 0 to MAX_N do
            for j ← 0 to MAX_N do
                if k = n and j = n then
                    y[0, n, j] ← 1
                else y[0, n, j] ← 0
        for m ← 1 to MAX_M do
            for n ← 1 to MAX_N do
                for j ← 0 to k − 1 do
                    if k > n or k < n − m then
                        y[m, n, j] ← 0
                    else if j > k or j > n or j < n − m then
                        y[m, n, j] ← 0
                    else if j = 0 then
                        y[m, n, j] ← m / (m + n) · (y[m − 1, n, 1] + y[m − 1, n, 0])
                    else y[m, n, j] ← m / (m + n) · y[m − 1, n, j + 1] + n / (m + n) · y[m, n − 1, j − 1]
                if k > n or k < n − m then
                    A[m, n, k] ← 0
                else A[m, n, k] ← n / (m + n) · (A[m, n − 1, k − 1] + y[m, n − 1, k − 1])
                y[m, n, k] ← A[m, n, k]
    for n ← 0 to MAX_N do
        for k ← 0 to MAX_N do
            if n = k then
                x[0, n, k] ← 1
            else x[0, n, k] ← 0
    for m ← 1 to MAX_M do
        x[m, 0, 0] ← 1
        for k ← 1 to MAX_N do
            x[m, 0, k] ← 0
        for n ← 1 to MAX_N do
            x[m, n, 0] ← 0
            for k ← 1 to MAX_N do
                x[m, n, k] ← m / (m + n) · x[m − 1, n, k] + n / (m + n) · (x[m, n − 1, k] + A[m, n − 1, k − 1] − A[m, n − 1, k])
    return x

Figure 4-3: A cubic-space algorithm for computing the probabilities $x_{m,n,k}$.

Furthermore, since the value of x depends only on the values in the same loop, e.g. $x_{m,n-1,k}$, and in the previous loop, e.g. $x_{m-1,n,k}$ (when using m as the outer loop), an $O(mn + n^2 + SP)$ space method (where S = number of outgroup species and P = number of pairs of sequences) is possible. First, the values of m, n, and k for the triplets of all pairs of sequences are determined

and stored in a three-dimensional linked list, which consumes O(SP) space. Then the value of x is calculated and summed to the relevant triplets. Since only those values that are necessary for further calculation are kept, the maximum table size required for the calculation of x is $O(mn + n^2)$.

Although the space requirement is thus reduced, the time complexity is still quartic (exponent 4). Also, the longest interval in our data is 251,067 base pairs. In order to deal with long alignments, those with length greater than 5000 are divided into several sub-alignments that overlap by 1000 sites. The p-value for each sub-alignment is then calculated, and a multiple-comparison correction method (Holm 1979) is used to determine if the set of sub-alignments supports an assertion that the whole alignment shows significant signs of a conversion.

Extension to quadruplet testing

It is not uncommon that we have a pair of paralogs in the other species, say C1 and C2 in chimpanzee, that are orthologs for H1 and H2 in human, respectively. In a fashion similar to the triplet testing for gene conversions, we can perform quadruplet testing (H1, H2, C1, C2) that is the summation of the hypergeometric random walks of two triplets, i.e. (H1, H2, C1) and (H1, H2, C2), as shown in Figure 4-4. Quadruplet testing may have higher specificity and sensitivity than triplet testing for detecting conversions. For example, in Figure 4-4A, a weakly significant (p = 0.032) conversion event was detected between the HBD and HBBP1 paralog pair in one triplet test, which is inconsistent with a previous study (Papadakis and Patrinos 1999) of the beta-globin gene cluster. This could be due to a faster evolutionary rate in HBBP1, which is a pseudogene. However, quadruplet testing did not show any evidence of conversion in this region. This suggests that the effect of one triplet can be neutralized by that of another triplet when there is no conversion between a paralog pair.
On the other hand, when applying quadruplet testing for a

conversion region, we can get a more significant result, as shown in Figure 4-4B (p = 2.4e-18). Therefore, whenever orthologs for both H1 and H2 are available in a particular outgroup species, we combine the results of the two triplets to perform quadruplet testing, and use the same formula as triplet testing, i.e. equation (4-5), to get p-values.

Figure 4-4: Comparisons between quadruplet testing and triplet testing. (A) The HBD and HBBP1 paralog pair; (B) the HBB and HBD paralog pair.

Multiple-comparison correction

When several statistical tests are performed simultaneously, a multiple-comparison correction should be applied. In our study, six outgroup species are used. Here we use the Bonferroni correction (Holm 1979); we multiply the smallest p-value for each paralogous human pair by the number of tests (up to 6), and then compare the adjusted p-value to the p-value threshold, α. Multiple-comparison correction is also applied to the tests for all pairs of paralogous sequences. For the 1,112,202 pairs that were analyzed, we used a multiple-comparison correction method that controls the false discovery rate (FDR), proposed by Benjamini and Hochberg (1995). The cutoff threshold for p-values can be found by the following algorithm:

CutOff(α, P-values)
    sort P-values
    for i ← 1 to number of P-values do
        if P_i > (i / number of P-values) · α then
            return (i / number of P-values) · α

Figure 4-5: Algorithm for determining the cutoff position of P-values.

In our study, α is 0.05; only a test whose p-value after Bonferroni correction falls below the resulting cutoff threshold is considered significant for gene conversion.

Directionality of gene conversion

We attempt to determine the source and target of a conversion event as follows. As shown in Figure 4-6B, let us suppose that a conversion event happened z years ago, with x > y > z, and consider a converted position. Regardless of the direction of the conversion (from H1 to H2, or vice versa), in the converted region, H1 and H2 are separated by 2z total years. If H1 converted H2 (i.e. part of H1 overwrote part of H2), then the separation of H1 and C1 is 2y but the separation of H2 and its ortholog, C2, is 2x > 2y. This observation serves as a basis for determining the conversion direction. Figure 4-7 shows an example of determining the source and target of a conversion from HBB to HBD.

Specifically, assume $(m_1, n_1)$ with maximum descent $k_1$ in the first triplet (H1, H2, and C1), and $(m_2, n_2)$ with maximum descent $k_2$ in the second triplet (H1, H2, and C2). Note that $m_i$ and $n_i$ here are not the m and n in equation (4-1); rather, they are the numbers of ups and downs within

the common maximum descent region of the two triplets (union). The probabilities of going down in the maximum descent regions of the two triplets are:

$$p_1 = \frac{n_1}{m_1 + n_1} \qquad (4\text{-}6)$$

$$p_2 = \frac{n_2}{m_2 + n_2} \qquad (4\text{-}7)$$

When combining these data, there are a total of $(m_1 + m_2)$ ups and $(n_1 + n_2)$ downs, and the probability of going down in the combined data is:

$$p = \frac{n_1 + n_2}{m_1 + m_2 + n_1 + n_2} \qquad (4\text{-}8)$$

As shown in Figure 4-6B, if H1 converted H2, H1 and C1 are closer than H2 and C2 in the converted region. Thus, $p_1$ should be smaller than $p_2$. Our objective function (O) therefore measures how significant the difference $p_1 - p_2$ is, based on the binomial distribution:

$$O = \frac{(p_1 - p_2) - E(\hat{p}_1 - \hat{p}_2)}{\sqrt{V(\hat{p}_1 - \hat{p}_2)}} \qquad (4\text{-}9)$$

where:

$$E(\hat{p}_1 - \hat{p}_2) = 0 \qquad (4\text{-}10)$$

$$V(\hat{p}_1 - \hat{p}_2) = \left(\frac{1}{m_1 + n_1} + \frac{1}{m_2 + n_2}\right) p\,(1 - p) \qquad (4\text{-}11)$$

In this paper, 6 outgroup species are used to detect gene conversions. We use the outgroup species that shows the most significant difference $p_1 - p_2$ to determine the directionality of conversion for a given paralogous pair. However, there are several reasons why the direction of a conversion might not be clear, even when using several outgroup species, including conversions in the outgroup species and missing outgroup data. Our approach indicates a direction for 65.4% of the putative conversions.
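The objective function can be evaluated directly from the four counts. The sketch below is a minimal reading of equations (4-6) to (4-11) as a pooled two-proportion statistic; the function name `direction_statistic` and the example counts are hypothetical, and the function raises a `ZeroDivisionError` when the pooled proportion is 0 or 1, since the variance is then zero.

```python
from math import sqrt


def direction_statistic(m1, n1, m2, n2):
    """Standardized difference p1 - p2 between the two triplets.

    m1, n1: ups and downs in the common maximum-descent region, triplet 1
    m2, n2: ups and downs in the same region, triplet 2
    """
    p1 = n1 / (m1 + n1)                                    # (4-6)
    p2 = n2 / (m2 + n2)                                    # (4-7)
    p = (n1 + n2) / (m1 + m2 + n1 + n2)                    # (4-8), pooled
    variance = (1 / (m1 + n1) + 1 / (m2 + n2)) * p * (1 - p)   # (4-11)
    return (p1 - p2) / sqrt(variance)                      # (4-9), with E = 0


# Hypothetical counts: triplet 1 goes down far less often than triplet 2,
# so the statistic is negative (p1 < p2).
o_value = direction_statistic(m1=8, n1=2, m2=3, n2=7)
```

A strongly negative value corresponds to $p_1$ being smaller than $p_2$, the pattern expected in the text when H1 converted H2.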

Figure 4-6: Timing of evolutionary events. The assumed duplication, speciation, and conversion events occurred respectively x, y, and z years ago. See text for further explanation.

Figure 4-7: Evidence that the β-globin gene (HBB) converted the δ-globin gene (HBD). Percent identity plots for (A) HBB and (B) HBD showing alignments to the human paralog in red, and alignments to the putative marmoset ortholog in blue. In the converted region, the human-marmoset alignments have 92% identity for β-globin and 85% identity for δ-globin.

4.3 Results

We downloaded the March 2006 assembly of the human genome from the UCSC Genome Browser (Kent et al. 2002), aligned each pair of chromosomes using BLASTZ (Schwartz et al. 2003), and collected alignments into longer chains, each of which is intended to identify the

results of a duplication event in which one or both copies may have been subsequently disturbed by insertion or deletion events (Kent et al. 2003). Table 4-1 contains information about the resulting set of duplicated genomic intervals.

Table 4-1: Information for duplicated human genomic intervals used in this study.

Number of duplicated regions                          1,112,202
Length of the longest duplicated region (bp)          251,067
Average length (bp)                                   876
Intra-chromosomal pairs                               241,141
Inter-chromosomal pairs                               871,061
Both regions contain coding sequences                 122,207
Only one region contains annotated coding sequence    225,144
Neither region contains annotated coding sequence     764,

Number and distribution of gene conversion events in human

Of the 1,112,202 analyzed pairs of human sequences, 149,799 (13.5%) indicated a gene conversion event (Table 4-2). The occurrence of a gene conversion for 6,737 (0.6%) pairs could not be tested due to a lack of available orthologous sequence in the other species used in this study. These results are consistent with results of Ezawa et al. (2006), where about 13% of the mouse sequence pairs show signs of gene conversion. Among these 149,799 putative gene conversion events, approximately 71% (106,872) occurred between chromosomes. However, the fraction of intra-chromosomal pairs indicating a conversion (17.8%) is significantly higher than for inter-chromosomal conversions (12.3%). (Note that a substantial majority of the pairs are inter-chromosomal.) The frequencies of intra-chromosomal conversions are shown for each human chromosome in Figure 4-8, in which they appear to vary substantially. For instance, the conversion frequency in chromosome 5 (25.7%) is more than double that in chromosome 18

(11.1%). We performed a chi-square test (Abramowitz et al. 1965) to see whether conversion frequency depends on the chromosome. The test, with 22 degrees of freedom, rejects the null hypothesis that conversion frequencies are constant across chromosomes. Following the reasoning applied by, e.g., Makova et al. (2004), the fact that the conversion frequencies for the sex chromosomes are similar to those of the autosomes suggests that gene conversion events are not associated with cell replication.

Table 4-2: Distribution of intra- and inter-chromosomal gene conversions.

                      Intra-chromosome    Inter-chromosome    Total
Gene conversion       42,927 (17.8%)      106,872 (12.3%)     149,799 (13.5%)
No gene conversion    195,180 (80.9%)     760,486 (87.3%)     955,666 (85.9%)
Unknown (a)           3,034 (1.3%)        3,703 (0.4%)        6,737 (0.6%)
Total                 241,141             871,061             1,112,202

(a) For some pairs of sequences, no orthologous sequence can be found in any of the other species, so it is impossible to determine whether a gene conversion occurred.
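The percentages quoted for Table 4-2 can be recomputed from its counts. The short Python sketch below uses only the numbers in the table; the variable names are hypothetical.

```python
# Counts taken from Table 4-2.
intra_conv, intra_total = 42_927, 241_141
inter_conv, inter_total = 106_872, 871_061

rate_intra = intra_conv / intra_total            # fraction of intra pairs converted
rate_inter = inter_conv / inter_total            # fraction of inter pairs converted
fold = rate_intra / rate_inter                   # intra rate relative to inter rate

all_conv = intra_conv + inter_conv
inter_share = inter_conv / all_conv              # share of events between chromosomes
overall = all_conv / (intra_total + inter_total) # overall conversion fraction

print(f"intra {rate_intra:.1%}, inter {rate_inter:.1%}, fold {fold:.2f}")
print(f"inter-chromosomal share {inter_share:.0%}, overall {overall:.1%}")
```

The ratio of the two rates is the figure described as roughly 1.5-fold in the introduction to this chapter.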

Figure 4-8: Frequencies of gene conversion events in each human chromosome.

Correlations with the distance, length, and relative orientation of the paralogs

To study how the physical distance between paralogous sequences affects intra-chromosomal conversions, conversion frequencies for different ranges of distance were examined (Figure 4-9). When the pairs of sequences are very close (< 1Kb), conversion frequencies are low (9.35%); they also decrease gradually when the separation exceeds 100Kb. The distances with the highest conversion frequency (25.74%) lie between 10Kb and 100Kb. This result differs from some previous studies (Ezawa et al. 2006; Xu et al. 2008), which suggest that the frequency of gene conversion is inversely proportional to the physical separation of the paralogs. However, other reports suggest an optimal separation of 850 bp for yeast (Sugawara et al. 2000) and 3800 bp in mammals (Schildkraut et al. 2005). Schildkraut et al. offered the

plausible explanation that homologous recombination (HR), which repairs DNA double-strand breaks (DSB), can occur by two mechanisms, one conservative and one non-conservative. Gene conversion is conservative, and single-strand annealing (SSA) is non-conservative. When the paired sequences are very close (< 10 kb), SSA increasingly outcompetes gene conversion as the distance shrinks, since SSA requires less extensive end-processing. This could explain why the frequency of gene conversion decreases when the separation is less than 10 kb.

Figure 4-9: Frequency of intra-chromosomal gene conversions as a function of distance between the paralogs.

We are also interested in the correlation between gene conversion and the length of the homologous pair of sequences. Our results show that the rate of gene conversion increases with sequence length (Figure 4-10). When the length is less than 200 bp, the frequency of gene conversion is very low (3.95%). This is consistent with previous studies (Liskay et al. 1987; Waldman and Liskay 1988; Reiter et al. 1998), which suggest that the so-called minimal efficient processing segment for gene conversion in mammals exceeds 200 bp. Naturally, the opportunity for gene conversion increases with sequence length.

Figure 4-10: Correlation with length of the paralogous human sequences.

The correlation with orientation is also studied. Pairs of sequences are classified into two types of orientation, i.e. same-direction and reverse-direction, based on the physical map. Only intra-chromosomal pairs of sequences are analyzed. For these pairs, more gene conversion events (24,533) occur between same-direction sequences (Table 4-3), although same-direction and reverse-direction pairs have similar conversion frequencies. To understand this apparent discrepancy between the number and the frequency of intra-chromosomal gene conversions, the relationship between the orientation and the distance of the paired sequences is analyzed (Figure 4-11). The result shows that conversion frequencies for different ranges of distance are almost the same for the two types of orientation (Figure 4-11A).

However, approximately 66% (73,682 / 111,689) of the pairs of sequences separated by less than 1 Mb have the same orientation (Figure 4-11B). This could explain why conversion events are more numerous among same-direction paralogs than among reverse-direction paralogs. Therefore, these analyses indicate that relative orientation has little effect on conversion frequencies.

Table 4-3: Distribution of gene conversion events classified by orientation.

                      Same direction       Reverse direction
Gene conversion       24,533 (17.8%)       18,394 (17.8%)
No gene conversion    111,783 (81.0%)      83,397 (80.8%)
Unknown               1,610 (1.2%)         1,424 (1.4%)
Total                 137,926              103,215

Figure 4-11: Correlation between orientation and separation distance of the human paralogs. (A) Conversion frequencies for the two types of orientation in different ranges of distance. (B) Number of pairs of sequences for the two types of orientation in different ranges of distance.

Length of converted regions

To analyze the conversion lengths, the region with maximal descent (Figure 4-2E) is taken as the converted region. The distribution of these lengths is shown in Figure 4-12, which indicates that conversion events become less frequent with increasing length of the gene

conversion regions. Approximately 56.7% (84,983 / 149,799) of conversion events are shorter than 100 bp. By comparison, the data of Xu et al. (2008) indicate that 66% of conversion events in the rice genome are shorter than 100 bp.

Figure 4-12: Distribution of the lengths of the converted regions.

The effect of protein-coding DNA

To investigate whether selective pressure affects the occurrence or fixation of gene conversion events, we classified each pair of sequences into three categories: 2-coding, 1-coding and non-coding, meaning respectively that both, only one, or neither of the two paralogs contains coding sequence according to the UCSC KnownGenes annotations. The results (Table 4-4) indicate that conversion frequencies are higher for 2-coding and 1-coding paralogs.

One possible explanation is that non-coding pairs diverge the fastest and more quickly leave the state where gene conversion is possible; conversion is thought to require at least 92% nucleotide identity, with over 95% identity being typical (Chen et al. 2007). Furthermore, to better understand the influence of selection, the 1-coding category is separated into three subcategories based on the directionality of conversion: coding-to-noncoding, noncoding-to-coding and unknown (Table 4-5). The result shows that conversions from coding sequence to non-coding sequence outnumber those from non-coding sequence to coding sequence. This could suggest that selection pressure reduces the probability of occurrence and/or fixation of gene conversion. Selection therefore appears to play an important role in the occurrence of gene conversion.

Table 4-4: Conversion frequency as a function of the presence of protein-coding sequence.

                      2-coding a           1-coding b           Non-coding c
Gene conversion       17,809 (14.6%)       33,133 (14.7%)       98,857 (12.9%)
No gene conversion    104,101 (85.2%)      190,768 (84.7%)      660,797 (86.4%)
Unknown               297 (0.2%)           1,243 (0.6%)         5,197 (0.7%)
Total                 122,207              225,144              764,851

a Both paralogs contain coding sequence.
b Only one paralog contains coding sequence.
c Neither paralog contains coding sequence.

Table 4-5: Number of conversions for each directionality in the 1-coding category.

                      coding-to-noncoding  noncoding-to-coding  unknown a
Gene conversion       13,541               8,311                11,281

a There are several reasons why the direction of a conversion might not be clear, including conversions in the outgroup species and missing outgroup data.
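The directional asymmetry in Table 4-5 can be quantified with a few lines of arithmetic. The sketch below is illustrative only: it compares the two resolved directions against an even split with a one-degree-of-freedom Pearson chi-square, a test chosen for this example rather than taken from the analysis above, and it ignores the conversions whose direction is unknown.

```python
# Counts from Table 4-5 (1-coding pairs with a detected conversion).
coding_to_noncoding = 13_541
noncoding_to_coding = 8_311
unknown_direction = 11_281

resolved = coding_to_noncoding + noncoding_to_coding
total = resolved + unknown_direction  # should equal the 1-coding count in Table 4-4

# Share of resolved conversions that run from coding to non-coding sequence.
share_coding_to_noncoding = coding_to_noncoding / resolved
ratio = coding_to_noncoding / noncoding_to_coding

# Pearson chi-square against the hypothesis of an even split (1 d.f.).
expected = resolved / 2
chi2 = (
    (coding_to_noncoding - expected) ** 2 + (noncoding_to_coding - expected) ** 2
) / expected

print(f"total 1-coding conversions: {total:,}")
print(f"coding-to-noncoding share of resolved events: {share_coding_to_noncoding:.3f}")
print(f"coding-to-noncoding / noncoding-to-coding ratio: {ratio:.2f}")
print(f"chi-square vs. even split: {chi2:.1f}")
```

Because about a third of the 1-coding conversions have no assigned direction, the share computed here describes only the resolved events; if the unresolved events were biased toward one direction, the true asymmetry could be larger or smaller.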

Correlation with GC-content

Furthermore, the correlation between gene conversion and GC-content is studied (Figure 4-13). Figure 4-13A shows the number of duplication events for different ranges of GC-content. More than 76% (1,004,879 / 1,308,632) of duplication events have a GC-content below 50%. Furthermore, higher conversion frequencies occur in duplication events with lower GC-content (Figure 4-13B). These results indicate that both duplication events and conversion events have a preference for lower GC-content, which is consistent with the hypothesis that DNA with higher GC-content is more stable than DNA with lower GC-content because a GC pair is held together by three hydrogen bonds.

Figure 4-13: Correlation between gene conversion and GC content. (A) Number of pairs of sequences for different ranges of GC content. (B) Conversion frequency for different ranges of GC content.

4.4 Discussion

For much of the half-century since multigene families were discovered, it has been known that copies of the repeated genes within a species are more similar than would be expected from their interspecies divergence. The processes generating this sequence homogeneity in repeated DNA are mechanisms of concerted evolution. Gene conversion is one of those processes, and while its impact on disease genes is appreciated (Chen et al. 2007), the extent of its

impact on the evolution of the human genome has not been fully investigated in previous studies. Our work documents about one hundred and fifty thousand conversions (13.5%) between duplicated DNA segments in humans. Similarly large fractions of conversion events among duplicated segments have been reported in whole-genome studies of yeast (Drouin 2002), Drosophila melanogaster (Osada and Innan 2008) and rodents (Ezawa et al. 2006), although the total number of observed gene conversions is much higher in our study. The genome-wide identification of DNA segments undergoing concerted evolution via gene conversion will make the application of comparative genomics to functional annotation considerably more accurate. This resource will allow the conversion process to be factored into functional inference based on sequence similarity to other species; for example, it could flag potential false positives for inferred positive or negative selection.

In this thesis, we study gene conversion in the human genome. To characterize gene conversion in human, 1,112,202 highly conserved pairs of human sequences are extracted. To determine the occurrence of gene conversion for this large number of sequence pairs, a memory-intensive but very rapid program is used. To reduce memory usage, we propose a modified method that greatly lowers memory consumption, so that gene conversion events can be determined for longer pairs of sequences. The analyses reveal many interesting characteristics of gene conversion in the human genome:
(i) More gene conversion events occurred between chromosomes, although inter-chromosomal conversion frequencies are lower than intra-chromosomal conversion frequencies;
(ii) Gene conversion frequencies in different chromosomes are not constant and are not normally distributed;
(iii) The physical distance between the paired sequences plays an important role in the frequency of intra-chromosomal gene conversion;
(iv) Longer pairs of sequences have more opportunity for gene conversion;
(v) The orientation of the paired sequences does not significantly affect the occurrence of gene conversion;
(vi) Most gene conversion events are very short;
(vii) Selection pressure would reduce the chance of gene conversion, although it could also increase the similarity level, which would increase conversion frequencies;
(viii) Gene conversions have a preference for lower GC-content.

All of these results could help us gain a more detailed understanding of the formation and impact of gene conversion.

Chapter 5

Applying gene conversion detection method to gene clusters

5.1 Introduction

In this chapter, our method for detecting gene conversion in whole genomes is applied to three gene clusters, i.e. the beta-globin, CCL, and Interferon gene clusters. For each gene cluster, gene conversion events are detected in all species. For this purpose, all other species are used as outgroups when detecting gene conversion in one species. Furthermore, phylogenetic trees are constructed to demonstrate the correctness of our method.

5.2 Results

Beta-globin gene cluster (hg18.chr11:5,180,996-5,270,995)

The beta-globin gene cluster has been shown to have conversion rates 3 to 30 times higher than normal (Smith et al. 1998). To study gene conversion events in this gene cluster, 13 species from the ENCODE project (ENCODE Project Consortium 2004) are used and 57 genes are extracted using GeneWise2 (Birney et al. 2004). All conversion events detected in this gene cluster are shown in Figure 5-1. Many conversion events occur between the two gamma genes and between the beta and delta genes.

Figure 5-1: Gene tree and detected conversion events for the beta-globin gene cluster. Red arrows indicate conversions that might affect the topology of the gene tree.

To understand how these conversion events affect the evolution of the beta-globin gene cluster, a detailed analysis is applied to the beta and delta genes. According to the gene conversion results, the three exons of the gene have been affected by different gene conversion events. Thus, we divide the multiple sequence alignment of the beta and delta genes into four regions, as shown in Figure 5-2A, and construct a phylogenetic tree for each region (Figure 5-2B). The evolutionary histories of these four regions appear to be dissimilar. The intron 2 region does not seem to be under the influence of gene conversion, since the beta and delta genes of all species are separated very well. On the other hand, the different phylogenies of the other three regions suggest that they have been affected by different gene conversion events.
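The idea of comparing regions of one alignment can be illustrated with a small sketch. It is not the procedure used here, which builds neighbor-joining trees with 1000 bootstraps; instead it computes simple pairwise p-distances for each region. The sequences, gene names and region boundaries are toy values invented for the example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else float("nan")


def regional_distances(alignment, regions):
    """Return {region_name: {(name1, name2): p-distance}} for each alignment slice."""
    result = {}
    for region_name, (start, end) in regions.items():
        result[region_name] = {
            (n1, n2): p_distance(alignment[n1][start:end], alignment[n2][start:end])
            for n1, n2 in combinations(sorted(alignment), 2)
        }
    return result


# Hypothetical toy alignment: two paralogs (beta, delta) in two species.
alignment = {
    "human_beta":  "ACGTACGTAC" "GGATCCGGAT",
    "human_delta": "ACGTACGTAC" "GCTTCAGGTT",
    "dog_beta":    "ACTTACGAAC" "GGATCCGGAT",
    "dog_delta":   "ACTTACGAAC" "GCTTCAGGTT",
}
# Half-open column ranges; hypothetical boundaries.
regions = {"region_1": (0, 10), "region_2": (10, 20)}

distances = regional_distances(alignment, regions)
for region_name, table in distances.items():
    print(region_name)
    for pair, dist in table.items():
        print(f"  {pair[0]} vs {pair[1]}: {dist:.2f}")
```

In this toy data the paralogs within each species are identical in region_1, the pattern expected after a conversion, while in region_2 each gene is closest to its ortholog in the other species, the pattern expected without conversion.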

Figure 5-2: Influences of gene conversion between the beta and delta genes. (A) Informative sites in the aligned sequences of the beta and delta genes are divided into four regions based on our detected conversion events. (B) A neighbor-joining phylogenetic tree (1000 bootstraps) is constructed for each region.

Figure 5-3 shows the inferred evolutionary histories of the mammalian beta and delta genes, based on the phylogenetic trees of the different regions and our detected gene conversion events. The beta and delta genes were separated by an ancient duplication event pre-dating the radiation of placental mammals. Afterward, the three exons and intron 1 of the delta gene were converted by the corresponding regions of the beta gene after the split of Xenarthra and Boreoeutheria. Following the separation of Laurasiatheria and Euarchontoglires, a conversion event covering an interval that starts somewhat upstream of exon 1 and extends just beyond exon 2 occurred in both lineages. Finally, after the speciation event separating apes and New World monkeys, a region including exon 1 and intron 1 of the delta gene was converted once more by the beta gene.

Figure 5-3: Inferred evolutionary histories for mammalian beta and delta genes. The three speciation events are the separations of Xenarthra and Boreoeutheria (X-B), Laurasiatheria and Euarchontoglires (L-E), and ape and New World monkey (Ape-NWM), respectively. Four species, i.e. armadillo (A), dog (D), marmoset (M) and human (H), are shown in these trees.

CCL gene cluster (hg18.chr17:31,334,806-31,886,998)

This gene cluster contains several chemokine ligand genes. Some of these genes have been shown to prevent HIV from entering the cell (Modi et al. 2006). In our study, we find that this gene cluster expanded after the separation of humans and Old World monkeys and

several conversion events are detected. To study this gene cluster, 7 species from the National Human Genome Research Institute (NHGRI) are used and 42 genes are extracted using GeneWise2 (Birney et al. 2004). Our results show that several conversion events occurred between CCL15 and CCL23 in different lineages. Some conversions are also detected between CCL18 and CCL3 and between CCL4 and CCL4L1. To evaluate the correctness of our results, three phylogenetic trees are constructed for different regions of the CCL15 and CCL23 genes (Figure 5-4). The trees in Figure 5-4B show that the evolutionary histories of these three regions differ. In the exon 1 region, the conversion event seems to have occurred after the separation of humans and New World monkeys, while in the exon 2 region a later conversion event occurred after the speciation event between humans and Old World monkeys. In addition, several gene conversion events occurred in the coding regions between two black lemur genes. These inferences are consistent with our results, which are shown in Figure 5-5.

Figure 5-4: Phylogenetic trees for the CCL gene cluster. (A) Alignment columns of the CCL gene family at which at least three genes share a mutation relative to the consensus sequence are shown. The alignment is divided into three regions based on our detected conversion events. (B) A neighbor-joining phylogenetic tree (1000 bootstraps) is constructed for each region.


Figure 5-5: Evidence of gene conversions between the CCL15 and CCL23 genes. (A) The red line is the self-alignment of gorilla and the blue line shows the pairwise alignment between gorilla and dusky titi. A conversion region (yellow region) is detected around exon 1. (B) A conversion region around exon 2 is detected in gorilla (red line) using ag monkey (blue line) as the outgroup species. (C) Two conversion events are found in the coding regions of black lemur.

Figure 5-6 shows the inferred evolutionary histories of the CCL gene cluster. The cluster contained five genes at the root of the primates and has undergone significant expansion since the separation of humans and Old World monkeys. Several gene conversion events occurred in different lineages.
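The kind of comparison plotted in Figure 5-5 (a within-species alignment of two paralogs against an alignment to an outgroup species) can be mimicked with a sliding-window identity scan. The sketch below is a deliberately simplified stand-in and is not the detection method of this thesis: it assumes that a window is a conversion candidate whenever the two paralogs are more similar to each other than one of them is to its outgroup ortholog. The function names, window size and sequences are all hypothetical.

```python
def window_identity(seq_a, seq_b, start, size):
    """Fraction of identical, ungapped columns in one window of a pairwise alignment."""
    matches = compared = 0
    for a, b in zip(seq_a[start:start + size], seq_b[start:start + size]):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def candidate_windows(paralog_1, paralog_2, outgroup_ortholog, size=5, step=5):
    """Window starts where paralog_1 is closer to paralog_2 than to its outgroup ortholog."""
    hits = []
    for start in range(0, len(paralog_1) - size + 1, step):
        within = window_identity(paralog_1, paralog_2, start, size)
        between = window_identity(paralog_1, outgroup_ortholog, start, size)
        if within > between:
            hits.append(start)
    return hits


# Hypothetical, pre-aligned toy sequences of equal length.
paralog_1 = "ACGTTGCAAG" + "TTGACCGTAA"  # e.g. one paralog in the ingroup species
paralog_2 = "ACGTTGCAAG" + "TAGTCGGTTA"  # the other paralog in the same species
outgroup  = "ATGTCGCTAG" + "TTGACCGTAA"  # ortholog of paralog_1 in an outgroup species

print(candidate_windows(paralog_1, paralog_2, outgroup))
```

A real analysis would need a statistical criterion for the difference between the two identity profiles rather than a bare comparison, since short windows fluctuate by chance.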

Figure 5-6: Inferred evolutionary histories for the CCL gene cluster.

IFN gene cluster (hg18.chr9:21,048,761-21,471,698)

The human type I interferon gene cluster spans about 500 kb on chromosome 9 and has been associated with important effects such as an antiproliferative effect and the modulation of expression of cell surface molecules (Diaz 1995). The mammalian type I interferon gene cluster can be divided into several subfamilies, i.e. IFN-β, IFN-α, IFN-ω and IFN-ε. Among these subfamilies, IFN-α is the largest gene family and can be separated into two groups, the distal group and the proximal group, based on their location. The distal group that

originated more recently contains genes that are closer to the telomere. On the other hand, the proximal group, which is closer to the centromere, expanded earlier (Diaz 1995). Previous studies (Woelk et al. 2007) have shown that the Interferon gene cluster is subject to the influence of gene conversion. To study gene conversion events in this gene cluster, 7 species from the National Human Genome Research Institute (NHGRI) are used and 76 genes are extracted using GeneWise2 (Birney et al. 2004). All conversion events detected in this gene cluster are shown in Figure 5-7. The results show that there are many conversion events in the IFN-α gene family, especially in the distal group.

Figure 5-7: Gene tree and detected conversion events for the Interferon gene cluster.

Our results suggest that there are many conversions in the 5′ half of the gene and its 5′ flanking region, and also in the 3′ half of the gene and its 3′ flanking region. To understand in more detail how gene conversion affects the phylogeny of the distal group, the alignment of the genes in the distal group is divided into two regions, as shown in Figure 5-8A. The phylogenetic tree for each region is shown in Figure 5-8B. The phylogenies in these two regions are quite different, which could be the consequence of gene conversion. Our method suggests that there are some conversion events in the 5′ half of the gene and its 5′ flanking region among the Old World monkeys. In addition, many conversion events occurred in the 3′ half of the gene and its 3′ flanking region in both human and Old World monkeys.

Figure 5-8: Influences of gene conversion on the phylogeny of the distal group. (A) Alignment columns of the IFNA gene family at which at least three genes share a mutation relative to the consensus sequence are shown. The alignment is divided into two regions based on our detected conversion events. (B) A neighbor-joining phylogenetic tree (1000 bootstraps) is constructed for each region.
