Computational approaches to the study of human trypanosomatid infections


1 University of Iowa Iowa Research Online Theses and Dissertations Fall 2012 Computational approaches to the study of human trypanosomatid infections Jason Lee Weirather University of Iowa Copyright 2012 Jason Lee Weirather This dissertation is available at Iowa Research Online: Recommended Citation Weirather, Jason Lee. "Computational approaches to the study of human trypanosomatid infections." PhD (Doctor of Philosophy) thesis, University of Iowa, Follow this and additional works at: Part of the Genetics Commons

2 COMPUTATIONAL APPROACHES TO THE STUDY OF HUMAN TRYPANOSOMATID INFECTIONS by Jason Lee Weirather An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Genetics in the Graduate College of The University of Iowa December 2012 Thesis Supervisors: Professor Mary E. Wilson Associate Professor Anne E. Kwitek

ABSTRACT

Trypanosomatids cause human diseases such as leishmaniasis and African trypanosomiasis. Trypanosomatids are protists from the order Trypanosomatida and include species of the genera Trypanosoma and Leishmania, which occupy a similar ecological niche. Both have digenetic life cycles, alternating between an insect vector and a range of mammalian hosts. However, the strategies used to subvert the host immune system differ greatly between species, as do the clinical outcomes of infection. The genomes of both the host and the parasite inform us about the strategies the pathogens use to subvert the human immune system, and about adaptations by the human host that allow us to better survive infections.

We have applied unsupervised learning algorithms to aid visualization of amino acid sequence similarity and the potential for recombination events within Trypanosoma brucei's large repertoire of variant surface glycoproteins (VSGs). Methods developed here reveal five groups of VSGs within a single sequenced genome of T. brucei, indicating many likely recombination events occurring between VSGs of the same type, but not between those of different types. These tools and methods can be broadly applied to identify groups of non-coding regulatory sequences within other Trypanosomatid genomes.

To aid in the detection, quantification, and species identification of Leishmania DNA isolated from environmental or clinical specimens, we developed a set of quantitative PCR primers and probes targeting a taxonomically and geographically broad spectrum of Leishmania species. This assay has been applied to DNA extracted from both human and canine hosts as well as the sand fly vector, demonstrating its flexibility and utility in a variety of research applications.

Within the host genomes, fine-mapping SNP analysis was performed to detect polymorphisms in a family study of subjects in a region of Northeast Brazil that is endemic for Leishmania infantum chagasi, the parasite causing visceral leishmaniasis. These studies identified associations between genetic loci and the development of visceral leishmaniasis, with a single polymorphism associated with an asymptomatic outcome after infection.

The methods and results presented here have capitalized on the large amount of genomics data becoming available that will improve our understanding of both parasite and host genetics and their role in human disease.


6 Graduate College The University of Iowa Iowa City, Iowa CERTIFICATE OF APPROVAL PH.D. THESIS This is to certify that the Ph.D. thesis of Jason Lee Weirather has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Genetics at the December 2012 graduation. Thesis Committee: Mary E. Wilson, Thesis Supervisor Anne E. Kwitek, Thesis Supervisor Lori L. Wallrath Jeffrey C. Murray Todd E. Scheetz Wendy J. Maury

ACKNOWLEDGMENTS

I would like to thank Dr. John E. Donelson for his mentorship and guidance throughout my course of study and through all projects undertaken. I would also like to thank Dr. Selma M. Jeronimo (Federal University of Rio Grande do Norte, Natal, Rio Grande do Norte, Brazil), Dr. Edgar M. Carvalho (Immunology Service, Federal University of Bahia, Salvador, Brazil), Dr. Albert Scharief (Immunology Service, Federal University of Bahia, Salvador, Brazil), and Dr. Shyam Sundar (Banaras Hindu University, Varanasi, India) for the continuing collaborative efforts made by themselves and their laboratories, without which this research would not have been possible. I also thank Dr. Priya Duggal (Johns Hopkins University) for continual assistance in supporting the analysis of human genetics data, as well as Dr. Jenefer M. Blackwell, Dr. Michaela Fakiola, and Dr. Anne E. Kwitek for their advice about the analysis of human genetic data. I would like to thank Dr. Jong Kwang Kim for the introduction to the MCL unsupervised clustering algorithm. I would like to thank Nikhil Anand and Adam Deluca for maintaining the computing cluster provided by the Bioinformatics Training Grant and educating me on its use. I would also like to thank Christopher D. Campbell, Glenn P. Johnson and others managing the University of Iowa's Helium high-performance computing cluster for their helpful assistance. Finally, I would like to personally thank all the members of the Wilson lab for lessons and assistance rendered throughout this research, and Dr. Mary E. Wilson for her mentorship. I would like to thank the Genetics Training Grant and the Bioinformatics Training Grant for funding me during this research. I would also like to thank the Department of Veterans Affairs for grants supporting the development of the quantitative PCR assay.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER I. INTRODUCTION
  The Trypanosomatid protozoa
  Gene expression in Trypanosomatids
  Trypanosomatid genomes
  Human genetic susceptibility and resistance to leishmaniasis
  Summary
  Published works

CHAPTER II. MAPPING OF VSG SIMILARITIES IN TRYPANOSOMA BRUCEI
  Introduction
  Materials and methods
  VSG sequence database
  A web-based interface to simplify the visualization of clusters of similar sequences
  Construction of a graph of VSG similarities
  Identification of N-terminal and C-terminal VSG sequence types
  Betweenness centrality analysis
  Statistical analysis of distance associations and N- and C-terminal associations
  Heatmaps
  Alignments and profile-hidden Markov models
  2D and 3D visualizations of clustered graphs
  Detection of mosaic genes
  Results
  Similarities based on BLAST reveal five distinct N-terminal domain types
  Profile-HMMs accurately classify N-terminal types
  BLAST hit similarities do not readily separate C-terminal domain types
  Unsupervised MCL clustering agrees with previous clustering and can further sub-divide VSG types
  Atypical and functional VSGs display high betweenness centrality
  Pairings between N- and C-terminal types show some biases
  Locations of similar VSGs are usually independent of their chromosomal positions
  N-termini of Expressed Functional VSGs of T. brucei and T. congolense
  Evidence for mosaic gene conversion events
  Discussion

CHAPTER III. SERIAL QUANTITATIVE PCR ASSAY FOR DETECTION, SPECIES-DISCRIMINATION AND QUANTIFICATION OF LEISHMANIA SPP. IN HUMAN SAMPLES
  Introduction
  Materials and methods
  Leishmania species and strains
  qPCR primers and probe design
  DNA extraction
  qPCR assay conditions
  Multiplex TaqMan qPCR
  Results
  Detection and differentiation between Leishmania species
  Specificity of primer pairs
  Leishmania species identification
  kDNA copy numbers
  Detection of L. donovani in clinical and other unknown isolates
  Discussion

CHAPTER IV. FIELD APPLICATIONS OF QPCR: SPECIES IDENTIFICATION AND QUANTIFICATION
  Introduction
  Applications
  Quantification in the sand fly vectors and asymptomatic human hosts in Rio Grande do Norte, Brazil
  Detection and species determination of parasites in human specimens in Bahia, Brazil
  Quantitative analyses during treatment of visceral leishmaniasis due to L. donovani in blood from subjects with visceral leishmaniasis in Bihar State, India
  Speciation of Leishmania spp. in fixed, stained slides of cutaneous lesions from subjects in Tegucigalpa, Honduras
  Discussion

CHAPTER V. HOST IMMUNOGENETIC FACTORS INFLUENCING THE OUTCOME OF VL
  Introduction
  Methods
  Subject entry and ascertainment
  Phenotype determination
  Numbers of subjects
  Quantitative trait normalization
  Pedigree selection by exposure
  SNP selection for fine mapping
  Genotyping methods
  Quality control (summarized in Figure 15)
  Corrections for multiple tests and haplotype blocks
  Linkage and association analyses
  Results
  Population characteristics
  Family-based association tests
  Chromosome
  Chromosome
  Chromosome
  Candidate genes
  Discussion

CHAPTER VI. DISCUSSION
  A preface to the discussion
  qPCR assay applications
  Application of unsupervised clustering to the analysis of conserved non-coding genomic elements in Trypanosomatids
  Annotating unknown protein-coding genes
  Human population genetics and susceptibility to leishmaniasis
  Future combined annotation aggregation from both protein-coding genes and regulatory sequences
  Conclusion

APPENDIX A. CONSTRUCTING AN INFORMATICS INFRASTRUCTURE FOR ANNOTATING PROTEIN-CODING GENES OF UNKNOWN FUNCTION IN ANY SPECIES
  A.1. Introduction
  A.2. Methods
  A.2.1. Profile-hidden Markov models
  A.2.2. Genomic Sequences and Annotations
  A.2.3. High-throughput HMM scanning
  A.2.4. Efficient storage with an SQL database
  A.2.5. Taxonomy data
  A.3. Proposed applications
  A.4. Future directions

APPENDIX B. CONSTRUCTING AN INFORMATICS INFRASTRUCTURE FOR IDENTIFYING AND ANNOTATING CONSERVED NON-CODING SEQUENCES OF UNKNOWN FUNCTION IN TRYPANOSOMATIDS
  B.1. Introduction
  B.2. Methods
  B.3. Preliminary results
  B.4. Future directions

REFERENCES

LIST OF TABLES

Table 1. Preferences in N- and C-terminal pairing
Table 2. Co-localization of same-type VSGs
Table 3. Sources of Leishmania spp. isolates used in this project development
Table 4. Primers and probes used for Leishmania qPCR diagnosis and speciation
Table 5. Relative efficiency of SYBR green or TaqMan assays
Table 6. Sensitivity of qPCR assays for parasite detection
Table 7. Serial qPCR studies recommended for detection and speciation of Leishmania spp. in clinical or environmental specimens based on geographic region
Table 8. The relative copy number differences in kinetoplast mini- and maxicircle sequences
Table 9. Detection of Leishmania (L.) infantum chagasi in human serum
Table 10. Tissue biopsies and the detection of L. (V.) braziliensis in Bahia, Brazil
Table 11. Species specificity of primers used to identify the species of Leishmania from DNA samples of samples extracted from slides
Table 12. Candidate genes chosen for study of associations with VL and DTH phenotypes
Table 13. Fine-mapping population characteristics
Table 14. Most significant results of association testing between three phenotypes and SNPs in linkage follow-up regions and candidate genes
Table 15. Locations and allele frequencies for SNPs most highly associated with VL, DTH+ and DTH size phenotypes
Table A1. Brief descriptions of SQL column names

LIST OF FIGURES

Figure 1. Flow chart of the analyses of VSG sequences
Figure 2. Clustering of N-terminal VSG sequences depicted in two dimensions
Figure 3. Locations of BLAST hits within the N-termini, with representative primary and secondary sequence features
Figure 4. Clustering of C-terminal portions of VSG sequences
Figure 5. Unsupervised clustering using the MCL algorithm
Figure 6. VSGs with a high betweenness centrality
Figure 7. Evidence for mosaic gene conversion between similar VSG N-terminal domains
Figure 8. Melt curves of selected qPCR assays useful for species discrimination
Figure 9. Flow chart of the serial qPCR assay
Figure 10. kDNA copy numbers during stage transition
Figure 11. qPCR detection of L. (L.) donovani in human sera
Figure 12. Quantification of Leishmania in Sand Fly DNA samples
Figure 13. Quantification of L. (L.) donovani from the blood of VL patients given different preparations of amphotericin
Figure 14. Quantitative DTH size transformation
Figure 15. Quality control
Figure 16. Results of genotyping in chromosome 9 linkage follow-up regions
Figure 17. Results of genotyping in chromosome 15 linkage follow-up regions
Figure 18. Results of genotyping in chromosome 19 linkage follow-up regions
Figure 19. Regions of highest association with VL in linkage follow-up regions
Figure 20. Associations between SNPs in candidate genes and the symptomatic VL phenotype
Figure 21. Associations between SNPs in candidate genes and the asymptomatic DTH response phenotype (qualitative)
Figure 22. Associations between SNPs in candidate genes and the asymptomatic DTH response size phenotype (quantitative)
Figure 23. Most highly associated SNPs in candidate genes
Figure A1. An SQL implementation of the described data schema
Figure B1. Known SIDER elements can be separated using BLAST and MCL
Figure B2. Visualizing conserved inter-coding region sequence clustering and motifs

CHAPTER I. INTRODUCTION

1.1. The Trypanosomatid protozoa

Trypanosomatids are members of the order Trypanosomatida (class Kinetoplastida) and are single-celled, flagellated protozoa. Many Trypanosomatid species are parasitic and likely descended from a common parasitic ancestor. 1 Of special interest are pathogenic Trypanosomatids causing diseases in humans and other mammals. These pathogenic species share the feature of having dimorphic life-stages, one of which infects an insect vector and the other a mammalian host. The preferred insect vector, the parasite survival strategy within the host, and the disease resulting from infection differ depending on the Trypanosomatid species.

Leishmania spp. are Trypanosomatids that are transmitted by a phlebotomine sand fly vector belonging to the genus Lutzomyia or Phlebotomus. Parasites are transmitted between insect and mammalian hosts, where they cause a spectrum of diseases collectively called leishmaniasis. Within their mammalian host, Leishmania are obligate intracellular pathogens and primarily infect mononuclear phagocytic cells, including macrophages and blood monocytes. Despite their similar vectors and host cells, Leishmania spp. have diverged in the means by which they evade the host immune response, causing distinct diseases in humans depending on the infecting Leishmania species. Such diseases include cutaneous infections with slow-healing ulcers and visceral infections spreading to the liver and spleen.

Unlike Leishmania, Trypanosoma species of African origin are extracellular and thrive in the bloodstream, lymph glands and central nervous system. These trypanosomes are transmitted by the tsetse fly and cause a wasting disease called African sleeping sickness in humans and nagana in infected livestock. These African trypanosomes differ from the New World trypanosome, Trypanosoma cruzi. T. cruzi is transmitted between hosts by insects of the family Reduviidae, and in the mammal it infects a variety of host cells, replicates intracellularly and causes Chagas disease.

The broad spectrum of diseases caused by these parasites makes identification and study of each individual species and its interaction with the host immune system important. However, the common parasitic ancestry of the order provides an opportunity to study which common biological features of the Trypanosomatids make them so well suited as pathogens with dimorphic life-stages.

All Trypanosomatids have a kinetoplast organelle. This organelle carries out oxidative phosphorylation functions similar to those of the mitochondrion in other eukaryotes. The kinetoplast DNA is organized into a network of minicircle and maxicircle sequences. Transcripts from maxicircle DNA undergo a decrypting step known as RNA editing to become functional. 2 RNA editing involves inserting uridylate residues into, and removing them from, the RNA, and requires guide RNAs transcribed from the minicircle sequences. Minicircles are circular strands of DNA, typically smaller than 2.5 kb, and thousands may exist within the kinetoplast. The requirement for carrying out oxidative phosphorylation can vary between the insect and mammalian host. 3 Maintaining the ability to (eventually) use oxidative phosphorylation, even in the absence of selective pressure provided by one of the two hosts, is therefore essential, making the kinetoplast an important feature to study to better understand parasitism. Plasticity in minicircle copy number has been reported in cultures of L. tarentolae, 4 but variability across species and between life-stages remains a matter of interest since the kinetoplast minicircle is extensively used as a target for molecular probes in the detection of the parasites. 5

The insect life-stage form of Leishmania species is called the promastigote, named for an anterior (pro) flagellum (mastigote).
In the mammalian host, the intracellular life-stage form is called an amastigote (without flagellum). Trypanosomes exist as procyclics during their insect life-stage, and as bloodstream forms in the mammalian host. Both Leishmania and trypanosomes have intermediate life-stage forms that give them survival advantages when transitioning from host to vector.

The parasites would not be able to survive in both insect and mammalian hosts without the ability to undergo dramatic changes in protein expression in response to environmental stimuli. Leishmania undergo dramatic changes in mRNA expression between axenic growth as promastigotes and intracellular amastigotes (growth conditions with a lowered pH and raised temperature). 6 Life-stage specific gene expression controls changes in the surface molecules expressed by parasites. 7, 8 These can facilitate survival in the insect environment through processes such as attachment to the insect gut. 9 In their mammalian hosts, expression of different sets of surface molecules assists in the successful evasion of the immune system. Examples include the Trypanosoma brucei Variant Surface Glycoprotein (VSG) and the Leishmania spp. Major Surface Protease (MSP, also called GP63). 10, 11

Leishmania species are pervasive throughout tropical and subtropical regions of the world. An estimated 350 million people worldwide are exposed to infection. 12 Several clinical forms of disease can be caused by Leishmania infection and are collectively referred to as leishmaniasis. Visceral leishmaniasis (VL) is the most deadly form of the disease, infecting an estimated 500,000 people annually. 13 In VL, a suppressed cellular immune response allows parasites to proliferate in the liver and spleen, and will ultimately result in death if untreated. Tegumentary forms of leishmaniasis, involving the skin and mucosal surfaces, are even more common. These forms include localized cutaneous leishmaniasis, which commonly occurs as a painless ulcerating lesion at the site of inoculation, sometimes with a few local satellite lesions, that eventually self-heals. An estimated 1.5 million people are infected annually. 13 Tegumentary forms common in Latin America also include Diffuse Cutaneous Leishmaniasis (DCL, often due to L. mexicana), in which there are disseminated nonulcerative lesions containing parasites throughout the body, 14 and Disseminated Leishmaniasis (DL, often caused by L. braziliensis), 15 another disseminated form in which numerous acneiform lesions spread throughout the skin, with few parasites present in lesions. Forms of tegumentary leishmaniasis also include mucosal leishmaniasis, which occurs at the time of a cutaneous lesion or months to years afterwards and is sometimes called mucocutaneous leishmaniasis. In this disease, infection with Leishmania braziliensis can cause a disfiguring mucocutaneous infection involving mucosal tissues at lower temperature than the core temperature, involving a hyperactive cellular immune response but with very few parasites present in lesions. 16

The outcomes of other Trypanosomatid infections also vary based on the infecting species. African trypanosomiasis, or African sleeping sickness, is caused by Trypanosoma brucei subspecies. West African trypanosomiasis, caused by Trypanosoma brucei gambiense, has a long preclinical dormant cycle. East African trypanosomiasis, caused by Trypanosoma brucei rhodesiense, progresses more rapidly to a severe acute phase and will quickly result in death if untreated. Human African Trypanosomiasis (HAT) occurs mostly in rural areas of the Democratic Republic of Congo, Angola, Sudan, Central African Republic, Chad and northern Uganda. 17, 18

A critical mechanism by which the parasite escapes recognition by the human immune system is its capacity to vary its major surface antigen, the Variant Surface Glycoprotein (VSG). This occurs through varied mechanisms, the most common of which is duplication of a donor VSG gene and its transposition into the active expression site (ES). Other mechanisms include duplication of the entire telomeric region from one chromosome to another telomere, or exchange of an entire VSG and its expression site-associated genes (ESAGs) with another telomere (telomere exchange). 10 We will examine in this thesis the possibility that further variation occurs through the generation of mosaic genes amongst the basic VSG copies in the T. brucei genome.
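The difference between replacing the whole expressed VSG and forming a mosaic gene can be made concrete with a toy model. The following minimal Python sketch is illustrative only: the sequences, the function names (`duplicative_transposition`, `segmental_conversion`), and the fixed breakpoints are hypothetical, and nothing here is taken from the T. brucei genome or from the methods used in this thesis.

```python
# Toy model of two ways the expressed VSG could change.
# All sequences, names, and breakpoints are hypothetical illustrations.


def duplicative_transposition(expression_site: str, donor: str) -> str:
    """Replace the whole resident gene with a copy of the donor gene."""
    # The resident sequence is discarded; the donor copy takes its place.
    return donor


def segmental_conversion(expression_site: str, donor: str,
                         start: int, end: int) -> str:
    """Copy only donor[start:end] into the resident gene, giving a mosaic."""
    if len(expression_site) != len(donor):
        raise ValueError("toy model assumes equal-length, pre-aligned sequences")
    if not 0 <= start <= end <= len(donor):
        raise ValueError("breakpoints must lie within the sequence")
    return expression_site[:start] + donor[start:end] + expression_site[end:]


resident = "AAAAAAAAAAAA"  # gene currently in the expression site
donor = "CCCCCCCCCCCC"     # silent donor gene elsewhere in the genome

whole_gene = duplicative_transposition(resident, donor)
mosaic = segmental_conversion(resident, donor, start=4, end=8)

print(whole_gene)  # CCCCCCCCCCCC
print(mosaic)      # AAAACCCCAAAA
```

In this toy model the mosaic shares segments with both parent sequences, whereas whole-gene replacement leaves no trace of the resident gene. Shared segments of this kind are the sort of signal a search for mosaic genes would look for.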
Treatment of visceral leishmaniasis is especially problematic, as the treatment is lengthy and the most affordable available agents are toxic. All except one drug for visceral leishmaniasis must be administered intravenously over a prolonged period, which can be difficult in endemic regions. The one oral drug, miltefosine, has considerable gastrointestinal toxicity, and compliance is poor. 19, 20 The disease is usually fatal if untreated; however, there is significant mortality, approximately 10%, even among people who receive treatment. 21, 22

Early and accurate diagnosis is helpful in guiding therapeutic decisions. Current clinical diagnostic methods include culture and microscopy of specimens obtained by invasive procedures (bone marrow, skin or splenic biopsy). While these can be highly specific, they lack sensitivity as some parasites will fail to grow in culture conditions. Definitive species determination currently requires successfully culturing parasites and subjecting them to isoenzyme analysis, a process that can take 4-6 weeks. Serologic tests are unreliable, especially for cutaneous leishmaniasis, and in endemic regions they are not specific enough for identifying active disease. Selection of treatment is also guided by knowledge of the infecting species. 23 Currently, clinicians guide their choice of treatment using a combination of factors including clinical presentation, culture, microscopy and serology, together with their best estimate given what is known about the species present in the region. 23 Thus more accuracy in detection and species determination would be invaluable in controlling the disease.

1.2. Gene expression in Trypanosomatids

Gene expression in Trypanosomatids has some important differences from gene expression in higher eukaryotes. Despite targeted searches, Trypanosomatids have been found to lack consensus RNA pol II promoter DNA sequences. 24 However, studies of the Trypanosoma brucei genome have revealed a major contribution of epigenetic factors controlling transcription start and termination. These include four histone variants: H2AZ, H2BV, H3V, and H4V. H2AZ and H2BV, along with H4K10ac and BDF3, mark RNA pol II transcription start sites.
H3V and H4V are enriched at probable transcription termination sites. 25 These transcription start sites occur in divergent strand switch regions, where long unidirectional clusters with up to hundreds of protein-coding genes proceed away from one another on both DNA strands in a 5' to 3' direction. 25 Beyond these epigenetic markers the only remarkable sequence feature has been runs of guanine nucleotides. 25 This mode of transcription initiation is specific to RNA pol II.

Transcription initiated from these sites produces long polycistronic transcripts. 26 Unlike in bacteria, the genes comprising these polycistronic precursor RNAs are often unrelated in function. Trypanosomatids individually regulate messages by decoupling messages within the polycistronic transcript through trans-splicing events. During trans-splicing, a capped spliced leader (SL) sequence (39 bp in the case of Leishmania spp.) is added to the 5' end of each mRNA, in a reaction coupled with polyadenylation of the upstream transcript in the polycistronic RNA. 27 These mRNAs are subject to regulation primarily at the level of mRNA stability, and elements in the 3' UTRs are critical for this post-transcriptional regulation. 8, 28 As such, the inter-coding regions contain critical information needed for gene regulation. The sequenced genomes of Trypanosomatid species provide an opportunity to explore common regulatory sequences conserved between multiple species. Families of RNA binding proteins further support the importance of post-transcriptional regulation of gene expression.

1.3. Trypanosomatid genomes

The release of the complete genomes of Trypanosomatid species has provided a powerful tool for understanding common mechanisms in the regulation of gene expression while exploring differences that may allow each species to function in its own unique niche. This began with the release of the Tri-tryp genomes: T. brucei, T. cruzi, and L. major. Since then the genomes of L. braziliensis and L. infantum have been published, 33 providing additional Leishmania species for comparison. Most recently the L. mexicana genome has been sequenced and made available 34 with certain restrictions, pending completion of publications.
Additional Trypanosoma genomes have also been sequenced. T. brucei gambiense, which causes human disease, and T. vivax and T. congolense, which infect livestock, have been sequenced and made available.

Some of the first clues from this genomic data about a large family of cis-regulatory elements regulating gene expression came from the Leishmania genomes, as did the observation of inactive transposable elements, called retroposons, present in the 3' UTRs of genes. Some of these sequences, named SIDERs (Short Interspersed Degenerated Retroposons), 35 have since been shown to be life-stage specific cis-regulatory sequences, some of which confer instability to the upstream mRNA. These sequences are approximately nucleotides in length, and are thought to have once been a retrotransposable element that has since become widely used throughout the genome to enable post-transcriptional regulation of mRNA by modifying the rate of mRNA degradation. 36 The function of many of these SIDER elements remains to be determined. Additionally, SIDER sequences are unique to the Leishmania species, suggesting that other sequences related to transposable elements, such as the ingi-related retroposons in trypanosomes, 37 or other mechanisms may contribute to the control of stage-specific gene expression in Leishmania and other Trypanosomatid protozoa.

1.4. Human genetic susceptibility and resistance to leishmaniasis

Leishmania infantum chagasi causes visceral leishmaniasis (VL) in humans. The majority of infections (>75%) are asymptomatic or subclinical, but the remainder go on to develop VL. 38 Differences in susceptibility of distinct mouse strains to Leishmania infection clearly show an important role for host genetics in the outcome of the disease. Evidence for a contribution of genetics to susceptibility to VL in humans stems from two classical segregation analyses 39, 40 and from observations of familial aggregation of VL amongst first-degree family members. The mode of inheritance of susceptibility to VL is predicted to be complex, as many genes are involved in guiding the immune response.

23 8 Asymptomatic individuals exposed to leishmania antigens develop a positive delayed type hypersensitivity (DTH) skin test to leishmania antigen, also called the Montenegro test or LST (leishmania skin test), which is detected by an induration of 5mm or higher hrs after test placement. A positive DTH response (DTH+) indicates the person was exposed to the parasite, and is able to mount an appropriate Th1 response, indicating resistance to the parasite. 41 Segregation analysis of Brazilian subjects has provided evidence for a contribution of genetics to an asymptomatic infection detected by a DTH+ skin test response. 42 A genome-wide linkage analysis of families in Northeast Brazil included both probands treated for VL, and DTH+ asymptomatic individuals. This study indicated linkage between regions of chromosome 15 and 19 and the size of the DTH response among asymptomatic individuals with a heritability of H=0.84. The study also suggested linkage between a region on chromosome 9 and susceptibility to VL. 43 Following up on these three regions suggestive of linkage, fine mapping SNP genotyping has been completed, and an analysis of these SNPs may provide insights into which genetic factors are associated with the DTH+ quantitative and VL traits. Also, SNPs among 43 candidate genes, selected for their hypothesized involvement in VL, have been genotyped in the same population. The data gathered in this study can identify genetic variants associated with either an asymptomatic outcome of infection or the development of visceral leishmaniasis Summary The studies presented here have produced computational tools that have aided in the description of Trypanosomatid genomes, producing visual representations of the similarities between biological sequences. The quantitative PCR assays based on sequences from those genomes have aided in the detection, quantification, and species identification of Leishmania spp. parasites. 
Finally, genetic analysis of the human host population has identified genetic loci that influence the outcome of Leishmania infection.

Together, these studies draw on a multitude of parasite and host genomes in order to better understand the human diseases they cause.

Published works

The works presented in chapter II and chapter III have been adapted from peer-reviewed publications where Jason L. Weirather (JLW) is the primary author, and he designed and performed all experiments in conjunction with his mentors Dr. Donelson and Dr. Wilson. Chapter II, "Mapping of VSG similarities in Trypanosoma brucei," was adapted from the like-named article published in Molecular and Biochemical Parasitology (February 2012). Content from chapter III was adapted from the Journal of Clinical Microbiology (November 2011) article "Serial quantitative PCR assay for detection, species discrimination, and quantification of Leishmania spp. in human samples." Content from Chapter IV includes both published and unpublished results of which JLW is a contributing author. With the exception of Figure 13 and Table 9, all studies were performed by JLW. Chapter V material was adapted from a work being prepared for publication of which JLW is an equally contributing primary author along with Priya Duggal (Johns Hopkins). Dr. Duggal chose the fine mapping and candidate gene SNPs for the genotyping and did the original quality control of data to remove markers out of Hardy-Weinberg equilibrium and make corrections to the pedigrees. All the analyses presented were performed by JLW under the guidance of Dr. Duggal as well as Dr. Wilson, Dr. Kwitek, and Dr. Murray.

CHAPTER II. MAPPING OF VSG SIMILARITIES IN TRYPANOSOMA BRUCEI

2.1. Introduction

African trypanosomiasis or African sleeping sickness is caused by Trypanosoma brucei subspecies gambiense (West African sleeping sickness) or rhodesiense (East African sleeping sickness). In contrast to visceral leishmaniasis, in which immunity is suppressed, T. brucei guides the host immune response to produce an effective but insufficient antibody response against the Variant Surface Glycoprotein (VSG). The capacity of the parasite to continually switch between VSG antigen types allows outgrowth of clones containing antigenically diverse VSGs, letting the parasite stay ahead of the development of an effective host immune response that can eradicate the infection. 10 T. brucei clonally produces a single, highly antigenic, VSG as its primary surface coating. While the host mounts an adaptive immune response to the antigens present on the VSG, a small fraction of T. brucei will have changed their surface coating and escaped the immune response, resulting in the characteristic cycle of fevers experienced with infection. 10 Individual trypanosomes switch from one VSG to another at a rate of 10⁻² to 10⁻⁷ switches per generation time of 5-10 hours 44 in a process called antigenic variation. Trypanosoma brucei cycles between tsetse flies and the mammalian bloodstream, remaining extracellular throughout its life cycle. The VSG is a glycosylphosphatidylinositol (GPI)-anchored protein covering bloodstream trypanosomes with about 10⁷ copies. This surface protein constitutes about 5% of the total protein of the organism, a high percentage for a surface protein. The only known function of this VSG monolayer is to protect other invariant constituents of the membrane from immune attack 45. Mature VSGs are amino acids in length with the GPI anchor linked to their C-terminus. Immunologically distinct VSGs typically have less than 25% amino

acid identity. 46 The three-dimensional structures of two VSG N-terminal regions (the first 80% of the protein) have been determined by X-ray crystallography. 47 Despite little primary sequence similarity, they share a nearly identical dumbbell-like shape that permits the VSGs to pack tightly on the parasite's surface. Two large α-helices form a coiled-coil that acts as a backbone for other smaller helices and loops of the structure. The solution structure of a VSG C-terminal region (the last 20%) has been determined by NMR spectroscopy, and molecular modeling indicates that all of the semi-conserved VSG C-terminal domains share highly similar tertiary structures. 47 Thus, all VSGs are thought to adopt very similar tertiary structures mediated by similar α-helical backbones and disulfide bridges. This monolayer of closely packed VSGs confronts the immune system with a highly repetitive (10⁷) set of epitopes on the surface of each trypanosome. The genome of T. brucei includes linear chromosomes of varying sizes. VSG genes (italics refers to the DNA sequence while non-italicized VSG refers to the amino acid sequence or protein product) occur on all three chromosomal size classes, i.e., the mega-base, intermediate-sized and mini-chromosomes. 48 The eleven mega-base chromosomes, which have been sequenced, 30 possess nearly 1000 VSGs and the entire archive of VSGs, including those on the yet-to-be-sequenced intermediate and mini-chromosomes, has been estimated at 1600, 49 or about 20% of the protein-coding capacity of the genome. Only about 7% of the nearly 1000 sequenced VSGs actually encode functional VSGs, i.e., VSGs with conserved cysteines and semi-conserved C-terminal domains. Most of the remaining (66%) are pseudogenes (with frameshifts or in-frame stop codons), some (9%) are atypical genes (that do not encode all of the expected cysteines and/or C-terminal similarities) and the rest (18%) are gene fragments, most of which encode C-terminal domains. 
These VSGs occur in transcriptionally silent arrays of 3 to 250 members located in subtelomeric regions that are 100 kb or more from the telomere repeats. Functional VSGs are scattered amongst the pseudo-VSGs in the arrays in no discernible pattern. The expressed or active VSG is not located within an array, but

is always immediately adjacent to a telomere of one of the mega-base or intermediate chromosomes. The most common VSG switching mechanism is a gene conversion event in which a duplicated copy of a VSG in a silent array replaces the active VSG in a telomere-linked expression site (reviewed in 10, 50, 51). Since VSGs may enter a telomeric expression site via homologous recombination, there is potential for the formation of new VSGs that are mosaics of existing VSGs. This potential for a fragmented exchange of sequence information within the VSG repertoire led us to use BLAST, which is suitable for identifying fragments of similarity, to generate graphs of relations among the T. brucei VSGs. Previous comparisons of VSG primary sequences have categorized their N-terminal domains into three types, A, B and C, and their C-terminal domains into six types, 1-6, on the basis of cysteine positions and sequence similarities. 52 Different N-terminal domains can associate with different C-terminal domains. 46 An online database of VSGs located on the eleven mega-base chromosomes is available. 52 We used this database and these alternative methodologies to re-examine the N- and C-terminal domain categories and to search for the presence of potential mosaic VSGs created via multiple cross-over events within VSG coding regions. These methods were used to assign clusters of VSGs to categories and sub-divide categories of more closely related VSGs. We find that the original N-terminal type A domain can be sub-divided into three additional groupings for a total of five N-terminal domain categories and that mosaic VSG formation between VSGs of the same N-terminal type is likely a regular occurrence.

Materials and methods

VSG sequence database

Sequences and annotations were obtained from VSGdb, an online resource of annotated VSG sequences created by Marcello et al. 52 VSGdb provides separate information on the VSG N- and C-terminal domains, as well as categorization of

domains, and predicted functional annotations (i.e., functional, atypical, or pseudogene VSG). Chromosomal locations for the VSGs were derived from the annotated genome of T. brucei stock TREU obtained from data repositories formerly provided by The Institute for Genomic Research (TIGR) and presently available from the Wellcome Trust Sanger Institute. All VSGs analyzed were from this genome unless otherwise indicated. A local database was created to provide the combined VSG information for subsequent steps of the analysis. Sequences and annotations in FASTA format were processed by a pipeline of scripts to visualize networks of relationships among VSGs. Additional sequences, such as additional T. brucei and T. congolense expressed (cDNA) VSG sequences, could be conveniently appended to the FASTA files for comparison in the analysis. The software used to generate clustered plots of nucleotide and amino acid sequences, including a web-based interface for more user-friendly access (see Section 2.2.2), is freely available 53. An overview of the analytical approaches for analyzing the VSG sequences is summarized in Figure 1.

A web-based interface to simplify the visualization of clusters of similar sequences

A web-based interface was created as part of the previously mentioned software to visualize graphical output of reciprocal BLAST searches. This interface minimally accepts FASTA files and a BLAST E-value cutoff (the expected number of such matches occurring by chance in a random set of sequences of the same size) as inputs. Individual sequences are required to have a unique identification (ID) as the first term in the FASTA header, accordingly color-code clusters of similar VSGs, the Markov clustering algorithm (MCL) 54 is used to perform unsupervised clustering. When MCL is used to partition clusters, the interface will also output individual FASTA files for each (profile-HMMs) 55, 56 generated from the alignments. This software can be generally

Figure 1. Flow chart of the analyses of VSG sequences

Note: From the top, the VSG sequences and annotations were selected for analysis from a locally generated database composed of sequences and annotations derived from the online T. brucei VSGdb database 52, and the sequenced T. brucei genome 30. More generally, the analysis approach can be applied to any group of sequences input in FASTA format. Blue modules represent scripts written to organize and direct data through the analysis, including one to execute BLAST searches between all selected VSGs. BLAST reports generated from the searches were used to compute statistics about the relations between VSGs, or to generate graphs to visualize the network of similarity among VSGs. Network graphs were clustered and processed into images using software represented by the gray modules.

applied to simply generate and visualize clusters from any set of nucleotide or protein sequences.

Construction of a graph of VSG similarities

Reciprocal BLAST searches 57 initiated from Perl scripts using the BioPerl interface 58 were conducted among the approximately 800 annotated VSG sequences, in FASTA format, selected from the local database. Both amino acid and nucleotide sequences were analyzed using the described approach. Graphs could be constructed from N- or C-terminal nucleotide or amino acid sequences using the local database. The similarity relationships indicated by BLAST hits meeting a given threshold E-value were used to construct an unweighted, undirected graph of related VSGs. An E-value threshold was used to limit the graph to a set of more closely related VSGs. At high (less stringent) E-values the similarities between more distantly related genes produce a denser graph of relationships at the risk of more false positives being permitted. When the E-value threshold was lowered (more stringent), only highly similar VSGs remained connected, and sub-graphs became disjointed, allowing the resolution of smaller clusters of more highly related VSGs. The BLAST reports were compiled into a flat file to store the names of every VSG pair matched, which part of the sequence matched, and the E-value of each match.

Identification of N-terminal and C-terminal VSG sequence types

BLAST searches were performed at either the nucleotide or amino acid level with gap size parameters set to 30 nucleotides or 10 amino acids. The VSG sequences represented in the BLAST searches were labeled with sequence annotations such as chromosomal locations, previously reported N-terminal variable region sequence types, C-terminal sequence types, and potential functional significance (i.e., either a functional, atypical or pseudo-VSG). The annotated BLAST report comprises a graph of similarities

between VSG sequences. Nodes (or vertices) on these graphs are the VSGs, and BLAST hits make up the edges (or lines) on the graph between nodes. E-value thresholds for BLAST were varied to resolve the graph at multiple scales, or densities, of BLAST hits. These graphs were used to identify clusters of similar VSGs and produce reports on the statistical relationships between the described types of VSG N- and C-termini and their chromosomal locations. The same graphs were also used for visualizing clusters of similar VSGs. To validate the visualized five N-terminal type clustering scheme (see Section 2.9 for methods), the MCL algorithm was applied to the BLAST results of the N-terminal protein sequences. MCL is an unsupervised clustering algorithm, requiring no prior class-label information in order to sort sequences into groups, although such information is useful in validating and interpreting the results. Based on sequence similarity determined by BLAST, MCL is capable of partitioning the domains of VSGs into subfamilies in an unsupervised manner. 54 MCL clustering of BLAST results has been applied to trypanosome, and other eukaryotic, genomes for the purpose of identifying groups of orthologs in the OrthoMCL database. 59 The MCL inflation parameter (i-value) affects the granularity, or number, of clusters generated from each graph, so i-values were chosen that would maximize the modularity (Q) of the clustering scheme for each E-value threshold tested. Modularity is the fraction of edges within clusters minus the fraction that would be expected if the edges were distributed at random. This maximization was done to highlight only the most well segregated clusters of VSGs at each E-value. E-value thresholds were tested on a log scale from 1 to 1e-10, with i-values ranging from 1.1 to 19. For each E-value threshold, an i-value that maximized modularity for clusters of more than 8 VSG members was selected. 
Maximization resulted in nominal i-values ranging from 1.2 to 2. These optimized clusters were subjected to further graph analyses for evidence of inter-cluster mosaic VSG formation and measurements of VSG centrality (see section 2.5) within each cluster.
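The graph construction and the modularity criterion described above can be sketched in a few lines of Python. This is a minimal illustration and not the Perl pipeline used in this study: the function names `build_graph` and `modularity`, the `(query, subject, evalue)` tuple layout, and the toy hits are all assumptions made for the example, and Q is computed in the standard Newman-Girvan form (fraction of edges inside clusters minus the fraction expected from the degree totals of each cluster).

```python
from collections import defaultdict


def build_graph(hits, max_evalue):
    """Build an unweighted, undirected edge set from BLAST-like hits.

    `hits` is an iterable of (query, subject, evalue) tuples (assumed layout).
    Reciprocal hits collapse into one edge; self-hits are ignored.
    """
    edges = set()
    for query, subject, evalue in hits:
        if query != subject and evalue < max_evalue:
            edges.add(frozenset((query, subject)))
    return edges


def modularity(edges, cluster_of):
    """Newman-Girvan modularity Q of a partition.

    `cluster_of` maps every node name to a cluster label.
    Q = sum over clusters of (edges_inside / m) - (degree_total / 2m)^2.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    inside = defaultdict(int)   # edges with both ends in the cluster
    degree = defaultdict(int)   # summed node degree per cluster
    for edge in edges:
        a, b = tuple(edge)
        ca, cb = cluster_of[a], cluster_of[b]
        degree[ca] += 1
        degree[cb] += 1
        if ca == cb:
            inside[ca] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


if __name__ == "__main__":
    # Hypothetical hits: two triangles joined by one weak bridge.
    hits = [
        ("a", "b", 1e-30), ("b", "a", 1e-28),  # reciprocal pair, one edge
        ("b", "c", 1e-12), ("a", "c", 1e-9),
        ("d", "e", 1e-20), ("e", "f", 1e-15), ("d", "f", 1e-11),
        ("c", "d", 1e-4),                      # bridge between the groups
        ("a", "f", 0.5),                       # fails the threshold
        ("a", "a", 0.0),                       # self-hit, ignored
    ]
    edges = build_graph(hits, max_evalue=1e-3)
    two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    one_cluster = {name: 0 for name in two_clusters}
    print(round(modularity(edges, two_clusters), 4))  # 0.3571
    print(modularity(edges, one_cluster))             # 0.0
```

In a sweep such as the one described, a function like `modularity` would be evaluated once per candidate i-value, and the partition with the highest Q retained for that E-value threshold.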

Betweenness centrality analysis

Betweenness centrality is the number of times any particular node (i.e., VSG) lies on the shortest path between any other two nodes in the cluster. 60 By this measurement, VSGs with a high betweenness will be the most similar VSGs between less related subfamilies. Tendencies for atypical and functional type VSGs to have high betweenness centrality were determined to assess their possible importance to the diverse community structure. The average betweenness within individual (MCL identified) clusters of N-terminal VSG amino acid sequences was measured at E-values ranging from 0.1 to 1e-10 clustered by MCL. These clusters are the same as those visualized as colored groups within each scale of E-value (see Figure 5). Statistical significance of the observed average betweenness of functional and atypical genes within the undirected graphs of each of the described clusters was calculated by comparison with two randomized datasets. One data set (random) was a random graph with the same number of edges and nodes, whereas the other (permutation) was a random selection of VSGs from the same graph in which the atypical or functional VSGs were removed. Additionally, a random set of VSGs of equal size to the set of atypical or functional VSGs was removed prior to the calculation of the observed betweenness in each permutation to maintain an equal number of nodes in the observed and permuted betweenness calculations of each iteration. Both types of analysis were conducted 1000 times to compute p-values.

Statistical analysis of distance associations and N- and C-terminal associations

Possible enrichment within a chromosomal region of the same N-terminal type VSGs was calculated by averaging distances between all pairs of VSGs of the same type on the same contiguous region (contig). This average was then compared to a recalculated average after shuffling all VSG type labels. The averages from these

permutations were recalculated 1e05 times. For each contig, the number of simulations in which the average distance between same-type VSGs was less than or equal to the observed distance was divided by the number of permutations (1e05) to find the p-value indicating an aggregation of types. Biases in the associations of N- and C-terminal types were determined by simulating pools of VSGs based on randomly matching the N- and C-termini, then counting the number of associations that exceeded the observations. Associations between N- and C-terminal domains of VSGs were randomly permuted 1e06 times to identify significant associations.

Heatmaps

The number of sequences sharing similarity on a portion of the primary sequence, according to BLAST, was counted within each N-terminal type, and image files were generated using the matrix2png tool. These BLAST hit overlaps reveal the most common location of BLAST hits occurring within particular N-terminal types. This analysis was used for observing differences in N-terminal types, and determining which portion of the sequence made the greatest contribution to the N-terminal type clustering. The observation that BLAST hits tended to overlap with cysteine-rich regions of the VSGs was tested for statistical significance by comparing the observed fraction of cysteine residues within BLAST hits in each N-terminal type to 1000 sets of randomly shuffled VSG amino-acid sequences. Only amino acid sequences were shuffled with each iteration; the graph and the start and stop positions of each BLAST hit were held constant.

Alignments and profile-hidden Markov models

The primary amino-acid sequences were aligned for each N-terminal type to identify regions of similarity and conserved cysteine residues using ClustalW 61, and formatted for viewing using the ESPript tool 62. Secondary structures representative of

each N-terminal type were calculated using the PSIPRED 63 secondary structure prediction server 64. Profile-HMMs (profile-hidden Markov models) representative of the alignments were generated by HMMER software 55, 56. These models were useful in rapidly identifying VSGs and recapitulating the type definitions determined in this study. Automated generation of profile-HMMs from clustered sequences is a feature of the software developed for this analysis. 53 The classification accuracy of profile-HMM models created from the N-terminal clustering scheme described in this report was tested using a leave-one-out approach. Profile-HMMs from each cluster, minus one VSG, were built and then tested for classification accuracy as applied to the left-out VSG. This approach was iteratively performed for all VSGs clustered.

2D and 3D visualizations of clustered graphs

The 2D and 3D visualizations were generated through a pipeline of Perl scripts. To visualize clusters in either 2D or 3D, all the sequence similarities between VSGs, after being computed as BLAST reports, were compiled into dot format graphs and color-coded according to the annotation information. Each VSG is represented by a single node on the graph. An edge (line) between two nodes indicates a BLAST hit between those two VSGs. In both the 2D and 3D graphs, clustering (node dispersal and edge length) is determined from an unweighted graph constructed from the presence of BLAST hits (graph edges) between VSGs (graph nodes). The order of edges written into the dot graph was randomized to ensure that the clusters generated were not dependent on the order of inputs. To render the 2D graphs, the dot files were processed by neato, a component of Graphviz 65. The 3D cluster visualizations were generated by clustering dot files with springgraph 66, and rendering the springgraph output with MegaPOV 67, an implementation of POV-Ray 68, a ray-tracing program. POV-Ray can also be used to generate a series of

images to be spliced into a 3D movie by ImageJ 69. These 3D visualizations reveal substructures not easily visualized in the 2D graphs.

Detection of mosaic genes

BLAST hits between functional VSGs and other VSGs were used to illustrate regions of possible mosaic gene formation. Multiple sequence alignments from ClustalW were used to examine these BLAST hits and alignment data were used to create illustrations of where sequences in the related genes matched, or did not match. The purpose was to identify short regions at which potential cross-over events between VSGs occurred.

Results

Similarities based on BLAST reveal five distinct N-terminal domain types

The N-termini of VSGs (the first 80% of the amino acids; ~330 residues) are much more variable than the semi-conserved C-termini (the last 20%) 30. Therefore, similar to earlier studies 46, 49, 71, we analyzed these two domains separately. Three general groupings, A, B and C, of T. brucei VSG N-termini, which are based primarily on cysteine positioning and sequence similarity, have been described previously 46. When the 767 available VSG N-termini whose corresponding chromosomal gene locations are known were visualized in a 2D clustering based on amino acid identity determined by a BLAST search with an E-value < 1e-03, five distinct clusters of N-termini were observed (Figure 2A). E < 1e-03 was found to be the most stringent E-value threshold at which most BLAST hits in the graph remain connected. These five types, called N1-N5 (Figure 2B), agree with, and expand on, the previously described three types, A, B and C, and provide a detailed look at sub-clusters of VSG N-termini.
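One way to explore how a threshold such as E < 1e-03 might be chosen is to sweep candidate thresholds and track how the graph fragments. The sketch below is illustrative only and is not the software used in this study: the function names, the `(query, subject, evalue)` tuple layout, and the toy hits are assumptions, and connected components are found with a simple union-find.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)


def sweep_thresholds(hits, thresholds):
    """For each E-value threshold, report (threshold, n_components, largest_size)."""
    nodes = sorted({name for q, s, _ in hits for name in (q, s)})
    summary = []
    for threshold in thresholds:
        edges = [(q, s) for q, s, e in hits if q != s and e < threshold]
        components = connected_components(nodes, edges)
        summary.append((threshold, len(components), len(components[0])))
    return summary


if __name__ == "__main__":
    # Hypothetical hits forming a chain of six sequences.
    hits = [
        ("v1", "v2", 1e-20), ("v2", "v3", 1e-8), ("v3", "v4", 1e-4),
        ("v4", "v5", 1e-12), ("v5", "v6", 5e-2),
    ]
    for row in sweep_thresholds(hits, [1e-1, 1e-3, 1e-10]):
        print(row)
    # (0.1, 1, 6)
    # (0.001, 2, 5)
    # (1e-10, 4, 2)
```

In this toy case the largest component stays nearly intact at 1e-03 and breaks apart at 1e-10, which is the kind of transition the threshold choice above is meant to capture.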

Figure 2. Clustering of N-terminal VSG sequences depicted in two dimensions

Figure 2. Continued.

Note: Panels A-E contain 767 VSGs for which all annotation information is available, including N- and C-terminal sequences, chromosomal gene location and predicted functional annotations (functional, atypical, or pseudogene) 30, 49. Panel F contains 17 additional expressed (cDNA) N-terminal VSG sequences from other T. brucei isolates 71-74, and 17 expressed (cDNA) N-terminal VSG sequences from T. congolense 75-77, for a total of 801 sequences. All panels were clustered according to similarity between the N-terminal VSG amino acid sequences determined by a BLAST hit of E < 1e-03. Solid lines represent a match from one continuous BLAST hit. Dotted lines represent multiple non-contiguous BLAST hits between two VSGs. (A) VSG N-termini are color-coded according to previously assigned three types A, B and C. 71 The red ovals indicate additional divisions into five distinct types (N1-N5), three of which (N1-N3) are subsets of the previous group A 71. (B) VSGs are color-coded according to these five newly defined N-terminal types. The arrow indicates a VSG discussed in the text that shares similarity with both N1 and N2 types. (C) VSGs are clustered according to their N-terminal type and color-coded according to the C-terminal type (see Figure 3), illustrating a non-random distribution of some C-terminal types with N-terminal types (see also Table 1). (D) Color coding the clustered N-termini by chromosome does not reveal a tendency for highly similar VSGs to be located near each other within the genome. (E) Color coding according to the VSG annotations as either functional, atypical or pseudogene 30, 49 indicates that functional and atypical VSGs exist within all five N-terminal types. (F) The sampling of additional expressed (cDNA) VSG sequences from T. brucei and the related species T. congolense show that the T. brucei expressed (cDNA) sequences have similarity to types N1, N2, and N4, whereas the available T. 
congolense expressed (cdna) sequences only show similarity with type N

N-terminal type N4 (equivalent to type B in previous studies 71) is the most clearly resolved and tightly clustered of the five types. Three distinct clusters of VSGs among those previously labeled type A were observed and designated as types N1, N2 and N3. These data also are consistent with the observation 49 that N5 (equivalent to type C) and N3 (part of the original type A) are not as distinctly separated. Nevertheless, profile-HMMs from N3 and N5 can be applied to accurately distinguish between N3 and N5 types (see Section 2.3.2). Types N1 and N2 display very little similarity with each other based on BLAST analysis (Figure 2A and B). Only one VSG (Tb ) shares any similarity at this E < 1e-03 threshold with both types (red arrow, Figure 2B). This VSG is actually a truncated pseudogene encoding about 70 amino acids at its extreme N-terminus that are similar to type N1, followed by approximately 70 amino acids more similar to type N2. Therefore, this particular VSG was not assigned a type. Since there is only one nonfunctional potential gene conversion event between these two large clusters of VSGs, and none involving the other large cluster, N4, we speculate that mosaic VSGs formed between these families may be nonfunctional, and that most mosaic combinations between genes encoding different N-terminal types, if they even occur, result in a nonviable VSG. No apparent overall clustering of similar N-terminal coding sequences on specific chromosomes was observed (Figure 2D); however, a number of examples of similar VSGs of the same N-terminal type occurring on the same chromosome or in close proximity on the same contiguous sequence (contig) were observed in the statistical analysis (see below). In addition, VSGs predicted to be nonfunctional pseudogenes or atypical are represented in each of the five types of N-terminal domains (Figure 2E). 
Substructures of related VSG N-terminal types can also be observed through 3D animation of the clustered BLAST analysis graph. 70 Using an even less stringent E-value of E < 1e-02, the separation between five N-terminal types is clearly observed. The video figure S1 published by Weirather et al. 70 also illustrates additional features, such as the possibility of subgroups within the N1 cluster.

Figure 3. Locations of BLAST hits within the N-termini, with representative primary and secondary sequence features

Note: VSG families are represented by 161 type N1, 105 type N2, 47 type N3, 349 type N4 and 32 type N5 VSGs. Overall, these N-terminal sequences have an average length of 330 amino acids. An illustration of the predicted secondary structures within the family is shown for each type. Dark blue winding structures indicate a predicted alpha helix. Rectangles indicate a predicted beta sheet structure. A black-to-green heatmap is also shown for each of the five N-terminal types to display the approximate location of BLAST hits within that type. The lighter green areas depict more BLAST hits, indicating the region of the N-terminal domain making the greatest contribution to the intra-family cluster, and the black areas indicate fewer BLAST hits. Underlying each family is a multiple alignment of the primary amino acid sequence. The cysteine residues that are most conserved have been highlighted in orange and indicated in bold print at the bottom of each sequence alignment for each type.

Residues making the greatest contribution to the intra-family clustering are shown in Figure 3, which is a heatmap of (blastp) hits overlaid with secondary structure predictions and cysteine residue conservation. This analysis shows that in types N1, N3 and N4 the central portion of the N-terminal domain contributes the greatest number of BLAST hits to the clustering, whereas sequences proximal to the N-terminus contribute the least. In contrast, type N2 differs in that sequences proximal to the N-terminus of the N-terminal domain most frequently share similarity, whereas sequences proximal to the C-terminus share the least. Type N5 appears to be intermediate between these two extremes of N1, N3 and N4 versus N2, but it also contains the fewest VSGs (32 members), which may compromise its analysis. BLAST hits within N-terminal types show a significant tendency to overlap with cysteine residues (p < 0.05) for N1-N4, but not N5 (see Section for the methods). These results imply those portions of the N-terminal domain may be under different structural constraints, hindering diversifying selection. Comparing the predicted secondary structures between different N-terminal types, types N1-N3 and N5 were observed to share conserved long predicted alpha helices at the N-terminal end, whereas the corresponding region of type N4 has shorter discontinuous segments of predicted helical structure. It is worth noting, however, that secondary structure predictions are sometimes inaccurate.

Profile-HMMs accurately classify N-terminal types

The classification accuracy of profile-HMMs generated from the clusters used in this study was tested using an iterative leave-one-out approach. The five-type clustering scheme described in Figure 2 could accurately classify 773 of 778 N-termini tested (99.4% accuracy). 
MCL's unsupervised clustering at the same scale (E < 1e-03) had the best accuracy of any scale of E-value threshold for the generation of the graph tested, with correct classification of 756 of 763 VSGs (99.1% accuracy). These results support the conclusion that the selection of E-value threshold and the subsequent clustering

Figure 4. Clustering of C-terminal portions of VSG sequences

Note: Similar sequences among the C-terminal regions of 799 VSGs identified by BLAST between nucleotide sequences with E-values less than 1e-10 are visualized as clusters in 3 dimensions (A, B) and 2 dimensions (C, D). The six reported VSG C-terminal domain types (C1-C6) are represented by nodes color-coded according to the C-terminal type (panels A, B, C), and lines between nodes represent BLAST hits. Clustering among C-terminal types is especially apparent in the case of type C2, which is outlined by red lines in panels A, C and D, and expanded in B. Similar clustering is not observed when C-termini are color-coded according to the chromosomes on which their genes are located (D). The color coding is difficult to see in panels A and B.

scheme of N-termini are distinct and distinguishable groups of sequences. These profile-HMMs can be automatically generated from the web-interface and were useful in identifying the VSG types that T. congolense shared with T. brucei, and support the distinctiveness of the types described (see Section 2.3.8).

BLAST hit similarities do not readily separate C-terminal domain types

Within the semi-conserved C-termini, six distinct types, C1-C6, have been described previously based on conservation of cysteine positions and the ~20-amino acid hydrophobic tail that is replaced with a GPI anchor 46, 49, 71. Thus, we also generated a clustered graph of the relationships among VSG sequences coding for C-terminal domains, excluding the ~20 amino acids replaced by the GPI anchor (Figure 4). This graph is based on similarities between nucleotide sequences identified by BLAST with an expected (E) value less than 1e-10. When amino acid sequences are used for the clustering, the graph is too dense to resolve any C-terminal types. The clustered VSGs were color-coded by the six C-terminal type definitions in both 3 dimensions (3D; Figure 4A, B) and 2 dimensions (2D; Figure 4C, D). Type C2 can be visualized as a cluster with many more intra-cluster edges than inter-cluster edges in both formats, as highlighted by red-lined squares. Types C4 and C6 are not readily discernible (Figure 4A, C). Viewing the clustering in a 3D animation provides better visual resolution of the C2 type; 70 however, the high degree of similarity among most of the C-terminal types and thus the dense graph of BLAST hits among types precludes resolution of all but the most different types. Interestingly, some small sets of VSG sequences coding for C-terminal domains did not cluster with any other known types. These may represent more diverse C-terminal domain structures. 
Alternatively, these sequences could represent artifacts from incomplete fragments of VSGs, and would benefit from re-evaluation prior to any updates to the annotations of the VSG sequences. Thus, our methodology and our exclusion of

Figure 5. Unsupervised clustering using the MCL algorithm

Figure 5. Continued.

Note: Unsupervised clustering using the MCL algorithm also partitions VSGs into groups of similar sequences at various E-value thresholds. Nodes represent the N-termini of VSG proteins, and connecting edges represent BLAST hits between them at stringencies determined by the E-value listed to the left. As E-value thresholds become more stringent, fewer BLAST hits are included, the graphs become sparser, and subsets of VSGs become separable by MCL. Colors indicate clusters partitioned by the unsupervised MCL algorithm. The coloring scheme for the E < 1e-03 plot is the same as in Figure 2B. Arrowed lines drawn between different scales of E-value indicate how a cluster of VSGs was subdivided at the next scale. Only clusters with at least 8 members were plotted, so fewer nodes are plotted at more stringent thresholds, where some VSGs no longer share that level of similarity. E < 1e-03 is the last E-value threshold before large sub-graphs become disconnected.
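The effect of the E-value threshold on graph connectivity can be illustrated with a short sketch. Note that this example uses plain connected components (via union-find) as a stand-in; it is not the MCL algorithm, which can additionally split a connected sub-graph into several clusters. The node names and E-values are hypothetical.

```python
def components(nodes, hits, max_evalue):
    """Group nodes linked by BLAST hits with E-value below max_evalue.

    Connected components only; a simplified stand-in for MCL.
    """
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b, evalue in hits:
        if evalue < max_evalue:
            parent[find(a)] = find(b)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical toy graph: two tight groups joined by weaker hits.
nodes = ["a", "b", "c", "d", "e", "f"]
hits = [("a", "b", 1e-8), ("b", "c", 1e-8), ("d", "e", 1e-8),
        ("c", "d", 5e-3), ("e", "f", 5e-2)]

for cutoff in (1e-1, 1e-2, 1e-3):
    sizes = [len(group) for group in components(nodes, hits, cutoff)]
    print(f"E < {cutoff:g}: component sizes {sizes}")
```

In this toy graph the single component present at E < 1e-01 breaks into progressively smaller pieces as the cutoff becomes more stringent, mirroring the sparsening described in the caption.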

the highly conserved C-terminal hydrophobic tail from the analysis did not result in as clear a resolution of six specific C-terminal regions as previously described; nevertheless, our findings do not disagree with this earlier categorization of the VSG C-terminal domains. 46, 49 We also color-coded each VSG C-terminus according to its corresponding gene's chromosome (Figure 4D) and did not generally observe an overall association between C-terminal type and chromosome distribution. An exception to this can be seen in the bottom left corner of panels 4C and 4D, where a cluster of 5 related C-termini are tandemly repeated on chromosome 9. However, these do not have a defined C-terminal type and they do not share significant similarity with other types. A similar cluster of 6 C-termini can be observed in the upper right, which also does not share similarity at this scale of E-value with any other sequences. However, 5 of the 6 are of type C5, and only 4 are tandemly repeated (on chromosome 4).

Unsupervised MCL clustering agrees with previous clustering and can further sub-divide VSG types

The relationships observed between VSG N-terminal sequences at E < 1e-03 were useful in classifying the types and observing differences among them (Figure 2). As expected, when the stringency of the E-value was increased, graphs became sparser since less similar BLAST hits were excluded. As an alternative clustering approach, unsupervised clusterings with MCL at various thresholds of E-value were conducted (see Section 2.4 for methods), revealing further N-terminal sub-categories and highlighting smaller groups of highly similar VSGs that may indicate a recent expansion or selective process. Figure 5 shows representative graphs generated by MCL with increasing stringency from E < 1e-01 to 1e-04. At the high (less stringent) E < 1e-01 value, only type N4 is well resolved from the other types.
At E < 1e-02, types N1, N3, and N4 are resolved whereas N3 is not distinguishable from N5, and at E < 1e-03, five clusters are

apparent that closely resemble the N1-N5 clusters shown in Figure 2. At a still lower (more stringent) E-value (E < 1e-04), the five clusters decompose into much smaller subdivisions. For example, N5 partitioned into two additional sub-families and N2 into multiple sub-families.

Atypical and functional VSGs display high betweenness centrality

Betweenness centrality measures how centrally a node is located in a graph, based on how often it lies on the shortest paths connecting other pairs of nodes. Therefore, VSGs with a high betweenness centrality are positioned more centrally between less-connected sub-families of VSGs (see Section 2.2.5). As shown in two examples in Figure 6, both functional and atypical VSGs have, on average, significantly higher betweenness values than other VSGs in multiple clusters when identified at multiple scales of E-value and compared to either random graphs or random permutations of the same graphs (p < 0.05). These tendencies for atypical and functional VSGs to have a high average betweenness in MCL-generated clusters were tested at E-value thresholds ranging from 0.1 to 1e-10. For the functional VSGs, 42/47 (89%) were determined to be part of a cluster at one of the tested E-values in which the average betweenness was significantly high (p < 0.05). In the case of atypical VSGs, 34/88 (39%) were part of clusters that had significantly high betweenness. Since these observations were made across multiple scales of E-value, some caution should be taken in the consideration of the p-values; however, no sub-graph studied at any scale of E-value was found to have atypical or functional VSGs with an average betweenness significantly lower than those measured in random graphs or permuted graphs (p < 0.05). Furthermore, atypical or functional VSGs with a high betweenness centrality were observed within all five N-terminal types. These results show that both atypical and functional

Figure 6. VSGs with a high betweenness centrality

Note: Betweenness centrality is described in section Atypical (Tb ) and functional (Tb ) VSGs represent VSGs from among MCL-identified clusters with significantly high average betweenness for functional or atypical VSGs. These high average betweenness clusters are shown at multiple scales of E-value. Pseudogenes are color-coded gray, atypical VSGs orange and functional VSGs blue. As graphs become sparser at more stringent E-values, the tendency for an atypical or functional VSG to occur between less connected sub-graphs becomes more visibly apparent, illustrating the meaning of high betweenness.
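To make the meaning of high betweenness concrete, the following sketch computes unnormalized betweenness centrality for a small unweighted, undirected graph using Brandes' algorithm. It is an illustration only; the software actually used for the thesis analysis is not specified here, and the toy graph (two triangles joined through a single hub node) is hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    `adj` maps each node to the set of its neighbours. Returns
    unnormalized scores in which each unordered pair counts once.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search counting shortest paths from `source`.
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every pair was visited from both endpoints, so halve the totals.
    return {v: total / 2 for v, total in score.items()}

# Hypothetical graph: two triangles joined through hub node "h".
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "h"},
    "h": {"c", "d"},
    "d": {"h", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
scores = betweenness(adj)
print(max(scores, key=scores.get))
```

The hub node lies on every shortest path between the two triangles and therefore receives the highest score, which is the situation described for atypical and functional VSGs positioned between less connected sub-graphs.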

VSGs are more centrally connected by similarity, occurring between more divergent subgraphs distributed across the VSG repertoire.

Pairings between N- and C-terminal types show some biases

When the five N-terminal types are considered at the E < 1e-03 threshold, the six C-terminal types do not assort exclusively with a specific N-terminal type, i.e., an N-terminal type can be matched with any C-terminal type (Figure 2C). However, some pairings are clearly favored or disfavored, as shown when the observed N- and C-terminal combinations are compared to expected values (Table 1). For example, N1 is much more likely to pair with C2 or C5 than with C1 or C3. N2 pairs with any C-terminal type except C5, for which no pairs were observed. N4 tends to pair with either C1 or C5. N3 and N5 are small enough groups that it is difficult to identify a bias. These tendencies in N- and C-terminal pairing likely provide further evidence for the importance of substructures in the overall VSG tertiary structure.

Locations of similar VSGs are usually independent of their chromosomal positions

We considered whether two or more VSGs of the same N-terminal type were more likely to occur on the same chromosome in closer physical proximity than expected by random chance. The presence of some similar VSGs located in close physical proximity to one another is supported by statistical analysis (Table 2). When analyzing all VSGs on the same contig, some VSGs of the same N-terminal type were more likely to be located physically close to other VSGs of the same type than to VSGs of different N-terminal types, a finding consistent with the possibility that some members of the same N-terminal type are generated by tandem gene duplication. However, this statistical observation was only seen on 2 of 19 contigs [chromosome (chr) 4 and Tryp_IXb-68a05.-2k3050], indicating that clustering by physical location does not occur

Table 1. Preferences in N- and C-terminal pairing

C1 C2 C3 C4 C5 C6
Obs Exp Obs Exp Obs Exp Obs Exp Obs Exp Obs Exp
N N N N N

Note: Based on 710 VSGs containing both N- and C-terminal annotations, the bias of each N-terminal type to associate with a particular C-terminal type is shown by the observed (Obs) number of VSGs with a particular N-C terminal combination and the number expected (Exp) if N- and C-termini associated randomly with each other, given the abundance of each N- and C-terminal type. Italicized values show combinations occurring significantly less often than expected (p < 0.001, calculated by permutations of simulated data). Bolded combinations are significantly enriched, and the remaining pairs are not significantly different from expected.
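The comparison of observed and expected pairings can be sketched as below. Two assumptions are made that the table note does not spell out: the expected count for a combination is taken as (N-type total × C-type total) / total VSGs, and significance is estimated by shuffling the C-terminal labels among the VSGs. The function names and the toy data are hypothetical and do not reproduce the values of Table 1.

```python
import random
from collections import Counter

def expected_counts(pairs):
    """Expected count of each N-C combination under random pairing."""
    n_totals = Counter(n for n, _ in pairs)
    c_totals = Counter(c for _, c in pairs)
    total = len(pairs)
    return {(n, c): n_totals[n] * c_totals[c] / total
            for n in n_totals for c in c_totals}

def depletion_p(pairs, combo, n_perm=10000, seed=0):
    """One-sided permutation p-value for `combo` being under-represented.

    C-terminal labels are shuffled across VSGs; the p-value is the
    fraction of shuffles with a count no larger than the observed one.
    """
    rng = random.Random(seed)
    n_labels = [n for n, _ in pairs]
    c_labels = [c for _, c in pairs]
    observed = sum(1 for pair in pairs if pair == combo)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(c_labels)
        count = sum(1 for pair in zip(n_labels, c_labels) if pair == combo)
        if count <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_perm + 1)

# Hypothetical toy repertoire of 20 VSGs, not the data behind Table 1.
pairs = ([("N1", "C2")] * 8 + [("N1", "C1")] * 2 +
         [("N2", "C1")] * 8 + [("N2", "C2")] * 2)

expected = expected_counts(pairs)
p_value = depletion_p(pairs, ("N1", "C1"))
print(expected[("N1", "C1")], p_value)
```

An enrichment test follows the same pattern with the comparison reversed (counting shuffles in which the combination is at least as frequent as observed).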

Table 2. Co-localization of same-type VSGs

Name of contig (chr = chromosome) | Number of VSGs | Number of N1, N2, N3, N4 and N5 types | p value for a smaller than expected average distance between same-type VSGs based on 100k permutations
27P2_V3 | 35 | 6, 11, 1, 17, |
Tb927_01_v4 (chr 1) | 5 | 0, 1, 0, 3, |
Tb927_02_v4 (chr 2) | 9 | 1, 1, 0, 6, |
Tb927_03_v4 (chr 3) | 34 | 10, 10, 0, 14, |
Tb927_04_v4 (chr 4) | 29 | 7, 8, 3, 11, |
Tb927_05_v4 (chr 5) | 63 | 8, 21, 2, 29, |
Tb927_06_v4 (chr 6) | 46 | 13, 10, 2, 20, |
Tb927_07_v4 (chr 7) | 9 | 0, 3, 0, 5, |
Tb927_08_v4 (chr 8) | 30 | 6, 5, 2, 14, |
Tb927_09_v4 (chr 9) | | , 50, 15, 81, |
Tb927_10_v4 (chr 10) | 26 | 4, 6, 2, 12, |
Tb927_11_01_v4 (chr ) | | , 18, 8, 54, |
Tb927_11_02_v4 (chr ) | 60 | 7, 27, 8, 26, |
Tryp_IXb-217g08.q1c | 6 | 0, 1, 1, 4, |
Tryp_IXb-218d07.p1c | 29 | 5, 5, 0, 17, |
Tryp_IXb- a05.p2k3050 | 13 | 0, 2, 4, 2, |
Tryp_X-254c10.q1c | 16 | 3, 5, 1, 6, |
Tryp_X-302f11.q1ca | 9 | 0, 2, 0, 7, |
Tryp_X-324h11.p1k | 24 | 6, 6, 0, 12, |

Note: For each genomic sequence fragment (contig) containing VSGs, the number of each VSG type is listed. To compute a p-value indicating a significant co-localization of same-type VSGs, the average distance between VSGs encoding the same N-terminal type was compared to simulated repertoires in which the type labels were permuted. Two of the contigs examined have same-type VSGs occurring significantly close together and are bolded; the others did not show significant co-localization of same-type VSGs.
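The label-permutation test described in this note can be sketched as follows. The sketch assumes that "average distance" means the mean absolute distance over all pairs of same-type VSGs on a contig, which the note does not state explicitly, and it defaults to 10,000 permutations for speed rather than the 100k used for Table 2. All names and coordinates are hypothetical.

```python
import random
from itertools import combinations

def mean_same_type_distance(positions, labels):
    """Mean absolute distance over all pairs sharing a type label."""
    distances = [abs(positions[i] - positions[j])
                 for i, j in combinations(range(len(positions)), 2)
                 if labels[i] == labels[j]]
    return sum(distances) / len(distances) if distances else float("nan")

def colocalization_p(positions, labels, n_perm=10000, seed=0):
    """P-value for same-type VSGs lying closer together than expected.

    Type labels are shuffled over the fixed gene positions; the p-value
    is the fraction of shuffles whose mean same-type distance is at
    most the observed one.
    """
    rng = random.Random(seed)
    observed = mean_same_type_distance(positions, labels)
    shuffled = list(labels)
    at_least_as_close = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_same_type_distance(positions, shuffled) <= observed:
            at_least_as_close += 1
    return (at_least_as_close + 1) / (n_perm + 1)

# Hypothetical contig: two tandem arrays, each of a single type.
positions = [1000, 2000, 3000, 4000, 50000, 51000, 52000, 53000]
labels = ["N1"] * 4 + ["N4"] * 4

observed = mean_same_type_distance(positions, labels)
p_value = colocalization_p(positions, labels)
print(round(observed, 1), p_value)
```

Because shuffling preserves the number of VSGs of each type, the test isolates the arrangement of types along the contig from their relative abundance.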

ubiquitously. Some of these tandem duplication events appear to involve atypical VSGs that may actually be VSG-like genes. On chr 4 (Tb927_04_v4), which contains 29 VSGs, significant clustering of same-type VSGs (p < 0.01) was observed. However, if an eight-member locus of VSGs composed of 4 repeated atypical VSGs, 3 pseudogenes, and 1 functional VSG is excluded, this significant co-localization is lost (p < 0.81). These results indicate that highly similar VSGs, whose similarity is possibly due to recombination, do not as a rule exist in close physical proximity to one another; however, tandem duplications are likely an exception in which neighboring sequences tend to be highly similar.

N-termini of Expressed Functional VSGs of T. brucei and T. congolense

The above analyses were conducted on VSGs that occur in the T. brucei stock TREU 927 genome, 30 most of which are unexpressed pseudogenes located in subtelomeric regions. To extend these analyses to functional VSGs known to be expressed by bloodstream-stage T. brucei, we also compared the N-termini of 17 VSGs from several different T. brucei isolates that are derived from complete VSG cDNA sequences available in the literature. These 17 VSGs were found to occur in the three most abundant N-terminal types, N1, N2 and N4, with the majority occurring in N2 (Figure 2F, red coloring). The presence of these three types in expressed VSGs, along with the existence of similar functional genomic VSGs in each of the five types, suggests that all N-terminal types have the capacity to produce functional VSGs. We also examined whether functional expressed VSGs of another African trypanosome species, Trypanosoma congolense, fall into similar N-terminal types. Using an E < 1e-03 threshold of similarity, 17 available T. congolense VSGs (as determined from cDNAs) occurred in the N4 N-terminal type (Figure 2F, yellow coloring). At higher levels of stringency, however, this relationship between the T.

congolense and T. brucei VSGs is lost (not shown). When hidden Markov models representative of VSG N-terminal types were used to search the unpublished T. congolense genomic sequences available online from the Wellcome Trust Sanger Institute 78, both type N2 and N4 functional VSGs could be identified in the T. congolense genomic sequences (data not shown). Thus, as more T. congolense genomic sequence becomes available and annotated, this will be a fruitful area of comparison between these two African trypanosome species.

Evidence for mosaic gene conversion events

Mosaic gene conversion occurs when a recombination event copies only part of a VSG into a VSG expression site, or when crossovers during a recombination event leave fragments from both the old and new VSG in place. Several examples have been reported of mosaic VSG formation during the duplicative transposition (i.e., gene conversion) of a VSG into a VSG expression site. In the current analysis, N-terminal VSG sequences with highly similar or identical segments are excellent candidates for recent mosaic gene conversion events, since their regions of high similarity or identity imply there may have been a recent duplication or gene conversion event. We examined those N-terminal VSGs with substantial similarity, i.e., E < 1e-20, at both the amino acid and nucleotide level for potential gene cross-over regions. We then further narrowed the selected set by focusing on functional VSGs and their relatives, since functional VSGs contain complete coding sequences and may have been involved in a more recent gene conversion event. At the stringent criterion of E < 1e-20 (plot not shown), no similarity between functional VSGs was observed. However, many relationships were detected between functional VSGs and the atypical VSGs and pseudogenes, of which eleven examples are shown in the upper panels of Figure 7.

Figure 7. Evidence for mosaic gene conversion between similar VSG N-terminal domains

Figure 7. Continued.

Note: (Top panels). BLAST hits at E < 1e-20 between VSG nucleotide sequences coding for functional products (green), related pseudogenes (blue), or sequences coding for an atypical product (red) were examined for evidence of mosaic gene conversion. Nodes are labeled by VSG name, and lines connecting nodes represent BLAST hits and are labeled with the range of the sequence matches and the total number of nucleotides in the query sequence. Solid lines represent a match from one continuous BLAST hit and dashed lines represent multiple non-contiguous BLAST hits between two VSGs. Borders of each square or rectangle are color-coded according to the indicated N-terminal type. Examples numbered 1-4 in the upper panels correspond to examples 1-4 in the lower panels. (Lower panels). Four examples of the relationships between a functional VSG gene and two related VSG pseudogenes are illustrated, where the green line represents the indicated functional VSG product. The blue bars above and below the green line represent two related VSG pseudogene N-terminal sequences, and the output of the BLAST hit between that VSG sequence and the functional VSG is given for each match. In regions where the VSG sequence coding for a functional product is identical or nearly identical to both related VSG sequences, the green line is in the center. In regions where the functional sequence is more similar to one of the two related sequences, the green line is offset, and when the functional sequence is unrelated to either of the two sequences, there is no green line.
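The mapping shown in the lower panels can be approximated by labeling each position of a functional VSG according to which related sequence has a BLAST hit covering it. The sketch below is an illustration under stated assumptions: hit ranges are taken as 1-based inclusive coordinates on the functional (query) sequence, and the function name, labels, and toy ranges are hypothetical.

```python
def segment_support(length, hits_a, hits_b):
    """Split a query of `length` nucleotides into labeled segments.

    `hits_a` and `hits_b` are lists of (start, end) ranges on the query,
    1-based and inclusive, for two related sequences A and B. Each
    returned tuple is (start, end, label) with label in
    {"both", "A", "B", "none"}.
    """
    def covered(hits):
        mask = [False] * (length + 1)  # index 0 is unused
        for start, end in hits:
            for pos in range(max(1, start), min(length, end) + 1):
                mask[pos] = True
        return mask

    in_a, in_b = covered(hits_a), covered(hits_b)
    segments = []
    for pos in range(1, length + 1):
        if in_a[pos] and in_b[pos]:
            label = "both"
        elif in_a[pos]:
            label = "A"
        elif in_b[pos]:
            label = "B"
        else:
            label = "none"
        if segments and segments[-1][2] == label:
            segments[-1][1] = pos  # extend the current segment
        else:
            segments.append([pos, pos, label])
    return [tuple(segment) for segment in segments]

# Hypothetical hit ranges on a 100-nucleotide query.
segments = segment_support(
    100,
    hits_a=[(1, 30), (51, 70), (91, 100)],
    hits_b=[(1, 30), (71, 100)],
)
for start, end, label in segments:
    print(start, end, label)
```

Alternating "A" and "B" segments, or "none" gaps flanked by shared regions, correspond to the patterns interpreted in the text as candidate cross-over points.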

In these examples, green-colored VSGs indicate functional VSGs, blue-colored are pseudogenes, red-colored are atypical genes, and lines represent a BLAST hit between two similar VSGs. The four examples numbered 1-4 in the upper panels of Figure 7 are groups of three genes shown in more detail in the lower panels. In example 1, functional VSG Tb matches both of its pseudogene relatives for the first several hundred nucleotides. This segment of identity is followed sequentially by (i) a region of no similarity with other VSGs, (ii) a short region of similarity with both pseudogenes, (iii) a region in which the functional gene shares sequence with pseudogene Tb10.v4.0009, (iv) a region with identity to pseudogene Tb , and finally (v) a region highly similar to both pseudogenes. This pattern suggests that multiple cross-over events have transpired between these three genes and likely at least one other genomic sequence that is not in any of the annotated VSGs. In example 2, functional VSG Tb matches one of its pseudogene relatives near the beginning of the N-terminus and matches both of its pseudogene relatives at the end of the N-terminal domain. The region or gap in between has little or no similarity with either pseudogene relative. Example 3 shows a functional VSG bearing extensive similarity with two pseudogenes across the entire N-terminal domain, whereas example 4 shows functional VSG Tb whose N-terminal domain has only small segments of similarity with one or the other of its pseudogene relatives, followed by a large region of identity with both pseudogenes. The gaps in identity and the regions of greater identity to differing pseudogenes could both be taken as evidence suggesting mosaic gene formation. It is notable that gaps in the BLAST search results can identify regions without similarity (highly dissimilar); however, the BLAST search is not sensitive in differentiating regions where differences are due to only small sequence changes.
This reduces our ability to apply this method to identify mosaic gene conversion events when the gene conversion events occur between highly similar sequences, as may often be the case. This BLAST method of similarity

mapping has the advantage of being able to identify discontinuous regions of identity, and also can be efficiently applied to describe the relationships between a large number of genes.

Discussion

About 20% of the protein-coding capacity of the T. brucei genome is consumed by the estimated 1600 VSG sequences. 30, 49 Although only one VSG is transcribed at a time, the sequentially-expressed functional VSG proteins must be both structurally similar and antigenically diverse in order to pack closely on the parasite surface and achieve the parasite's goal of evading the immune system. By examining the large repertoire of VSGs, it is possible to describe the clusters of relationships between similar VSGs and their annotated locations. We utilized a previously created VSG database 52 to re-examine these VSG relationships in ways that allowed them to be clustered and visualized graphically in 2D or 3D form (Figures 1-2, 4-6), and to be searched for multiple, non-contiguous segments of sequence similarity (Figure 7). We detected differences between these gene families and observed how subsets of expressed VSG sequences are related to the genomic sequences. We also identified gaps in similarity between VSG sequences and found extensive evidence for potential cross-over events among multiple VSGs. Despite the need for antigenic diversity, the VSG N-terminal domains display amino acid similarities, which have been used to classify them into groupings or types. Previous analyses have described three VSG N-terminal types 46, 49 and we report here the existence of five N-terminal domain types, N1-N5, of which the first three, N1-N3, are derived from one of the three originally described types. These five N-terminal domain types are based on amino acid identities meeting an E < 1e-03 stringency where separation of the five clusters is most apparent (Figure 2). As the E-value is changed to higher stringencies, increasing numbers of subtypes are detected within each of the five

types (Figure 5). Furthermore, the region of maximal similarity within the ~330 amino acid N-terminal domain differs among the five types. Members of types N1, N3 and N4 are most similar in the middle of the domain, type N2 members are most similar near the N-terminus of the domain and N5 members are more uniformly similar along the entire domain (Figure 3). These differences do not correlate with the overall extent of similarity among members within a type, i.e., the compactness of the five domain clusters observed in Figure 2. For example, of the five types the N4 family members are the most similar to each other (most intra-cluster BLAST hits, forming the most compact cluster), consistent with an earlier finding 49, and N3 members are the least similar (Figure 2A). It is likely these differences in regions of maximal similarity among the five types reflect both evolutionary expansion of the family and structural constraints imposed on the functional VSGs themselves. The N4 family is also the largest family, suggesting it may have been more prone to recent expansion events than the other types. The N4 family also has a somewhat different predicted alpha-helical structure than the other four types (Figure 3), although secondary structure predictions must be treated with caution. No clear restrictions lie in the association of a given N-terminal and C-terminal domain type, although some N- and C-terminal pairings are preferred (Table 1), again likely reflecting possible structural constraints on the VSG protein to ensure close packing of the trypanosome surface, and highlighting the utility in analyzing the N-terminal domain independently from the C-terminal domain. Varying the threshold of similarity determined by the E-value in BLAST provided a means to resolve different-sized clusters of similar VSGs.
The unsupervised MCL algorithm was applied to generate these clusters, within which there was a tendency for the functional and atypical VSGs to be more centrally connected (similar) to divergent sub-groups than are pseudogenes, suggesting these VSGs might be important, either as the progenitors of different sub-groups of degenerating pseudogenes, and/or the product of an ancient mosaic between two more diverse clusters. The central positioning of

atypical VSGs is especially interesting since they do not appear to be, in general, diverging in form from the rest of the VSG repertoire, but rather may be important for maintaining the diverse community structure indicated by the different VSG types described. Experimental evidence for mosaic VSG formation dates back to the 1980s, 76, 79 and has indicated that mosaic VSGs are more likely to appear late, rather than early, in infection. 49, 79 The data described here show that mosaic VSG formation is confined almost entirely to donor VSGs from within the same N-terminal type. Despite the different N-terminal types being dispersed throughout the genome, only one clear example was detected of a mosaic VSG containing sequences derived from two N-terminal types (N1 and N2) (red arrow in Figure 2B). This result is consistent with the likelihood that recombination between highly similar segments of family members within an N-terminal type is responsible at least in part for formation of mosaic VSGs, and the lack of inter-type mosaics suggests those mosaics may be non-functional, possibly due to improper folding. This interpretation is supported by a close examination of the four examples of mosaic functional VSGs shown in Figure 7B. In each of these cases, the boundaries of potential recombinational cross-over points that could be examined occur in regions of highly similar sequences. Left unexplained are the origins of sequences within the mosaic functional VSGs of Figure 7B that do not have a counterpart sequence in a known donor VSG, but these sequences could be derived from the several hundred VSGs on minichromosomes or intermediate chromosomes whose sequences have yet to be determined. 48 Our methods provide a means to explore the communities of related VSGs at varying degrees of similarity. As we have shown with the T.
congolense VSG cDNAs, profile-HMMs generated from these sets of VSGs can be useful for comparing and contrasting the composition of the VSG repertoire between species. The approaches used here can be readily applied to compare and contrast the VSG repertoires in different

trypanosome isolates. As the genome sequences of additional species and strains of the Trypanosomatid protozoa and other microorganisms emerge, these methods can be applied to comparisons between families of genes to assess the diversity or relatedness of virulence factors in almost any genus. Correlation of these data with differences in clinical presentation or disease severity may add new dimensions to our ability to discern whether microbial diversity might account for diverse manifestations of different microbial infectious diseases.

CHAPTER III. SERIAL QUANTITATIVE PCR ASSAY FOR DETECTION, SPECIES-DISCRIMINATION AND QUANTIFICATION OF LEISHMANIA SPP. IN HUMAN SAMPLES

3.1. Introduction

The Leishmania spp. are Kinetoplastid protozoa that are transmitted to humans and other mammalian hosts by a sand fly vector. Different species of the parasites cause different types of clinical disease, which are collectively known as leishmaniasis. The infecting species of Leishmania require different approaches to therapy, but differentiating species is difficult. 20, 23 Procedures for diagnosis of leishmaniasis are often invasive, and isolates are frequently difficult to grow in vitro. Tests less invasive than bone marrow or splenic biopsy have low sensitivity. Current diagnostic tests include parasite culture, microscopic examination of biopsy specimens, and serology. Serology is unreliable both because titers are low in tegumentary leishmaniasis, and because high titers can persist for unknown periods of time after individuals recover from visceral leishmaniasis. Cultures of biopsy specimens require prolonged incubation times and lack adequate sensitivity since some Leishmania spp. isolates are difficult to grow in culture. Tests to distinguish between the Leishmania species involve the separation of isoenzymes of culture-derived parasites, which can be accomplished only when parasites grow in culture and takes 4-6 weeks. 23 Therefore we set out to develop a rapid, species-distinguishing test for these parasites, based on the emerging knowledge of Leishmania spp. genomes. The Leishmania spp. have been detected in and isolated from blood cultures of subjects with all forms of leishmaniasis, and in the blood of asymptomatic individuals 83, living in regions of risk. Some of these isolates were derived from serum specimens, indicating that leukocytes containing the parasite lyse during blood draw or (less likely) there can be circulating free parasites. This led us to query whether

Leishmania DNA may be present in the bloodstream more often than previously recognized. We therefore hypothesized that amplification-based methods to detect parasite DNA in blood or serum might be a feasible means of diagnosis and species differentiation. The spectrum of symptomatic human leishmaniasis is wide, and the most important factor determining the clinical outcome of infection seems to be the species of Leishmania. Nonetheless there are variable clinical presentations of disease due to each species, and increasing reports document atypical presentations of leishmaniasis, sometimes but not always in the setting of immunocompromise. 91, 92 Differentiation between the Leishmania species is important, since there are overlapping and dynamic geographic regions of risk, and different susceptibilities to treatment. 19, 93 Thus, a method of diagnosis that is sensitive enough to detect low levels of the parasite in asymptomatic or early symptomatic infection, and can distinguish between the different Leishmania species, could be of tremendous utility in endemic and non-endemic regions. 94 Nucleic acid-based methods avoid the need for parasite cultivation, replacing this with either hybridization or amplification. 83, The latter approaches provide the advantage of increased sensitivity. Amplification methods reported for the detection of individual Leishmania species include conventional PCR or quantitative PCR methods, including reverse transcriptase quantitative polymerase chain reaction (RT-qPCR), DNA-based qPCR, quantitative nucleic acid sequence-based amplification (QT-NASBA) and in situ hybridization to quantify Leishmania spp. in blood or tissue samples. 100 Protozoa belonging to the order Trypanosomatida (class Kinetoplastida) include Leishmania spp. and Trypanosoma spp. The Leishmania spp. can be divided into two subgenera: Leishmania (Leishmania) spp. and Leishmania (Viannia) spp.
These parasites are all morphologically similar, but their genomes share more homology within a subgenus than between subgenera. Both subgenera are characterized

by a prominent kinetoplast structure containing the mitochondrial DNA in the parasite's single mitochondrion. Whereas Leishmania spp. have chromosomes in their nuclear genomes, 33 the kinetoplast contains tens to hundreds of DNA maxicircles encoding genes that are destined for RNA editing, and thousands of DNA minicircles, circular molecules with a conserved origin of replication encoding guide RNA sequences for RNA editing. 101 Because of their abundance, specificity and repetitive nature, kinetoplast DNA (kDNA) sequences have frequently been targeted for nucleic acid-based detection. A drawback of the use of kDNA for parasite quantification is the uncertainty of whether the kDNA copy number differs between Leishmania species, strains, and growth stages. The goal of this study was to develop a serial nucleic acid amplification-based method for diagnosis and speciation of Leishmania spp. parasites in human or animal-derived tissues. As such we developed a set of primers and probes for serial qPCR assays. The assays were sensitive enough to detect low levels of parasites, and to distinguish between Leishmania species in human specimens. Using non-species-discriminating probes, we quantified the relative differences in kDNA copy numbers between parasite species, among isolates of the same species, and between stages of the same parasite strain. The serial qPCR assay has potential applications for diagnosis and species discrimination, as well as providing novel approaches to determining parasite load and following treatment response in infected humans.

Materials and methods

Leishmania species and strains

Promastigotes were cultivated in modified minimal essential medium [HOMEM] 107 or in Schneider's Insect Medium with 10% heat-inactivated FCS and 50 µg gentamicin/ml. DNA was extracted from the species and strains listed in Table 3. L. (L.)

Table 3. Sources of Leishmania spp. isolates used in this project development

Leishmania species | Source | Name if any
L. (L.) tropica 1 | F. Steurer, CDC | FJ
L. (L.) tropica 2 | F. Steurer, CDC | You
L. (L.) tropica 3 | F. Steurer, CDC | Sp
L. (L.) tropica 4 | F. Steurer, CDC | KiR
L. (L.) tropica 5 | F. Steurer, CDC | GH
L. (L.) tropica 6 | F. Steurer, CDC | TAM
L. (L.) chagasi | S. Jeronimo, UFRN, Natal, Brasil | MHOM/BR/00/1669
L. (L.) chagasi attenuated strain L5 | Above isolate multiply passaged in vitro, M. Wilson | MHOM/BR/00/1669
L. (L.) infantum | D. McMahon-Pratt, Yale U. | (reference 108 )
L. (L.) donovani | Buddy Ullman, OHSU | D1700
L. (L.) donovani Reference | Shyam Sundar, Banaras Hindu University | LEM 138(MHOM/IN/DEVI)
L. (L.) donovani 1 | Shyam Sundar, Banaras Hindu University | BHU 764
L. (L.) donovani 2 | Shyam Sundar, Banaras Hindu University | BHU 770
L. (L.) donovani 3 | Shyam Sundar, Banaras Hindu University | BHU782
L. (L.) donovani 4 | Shyam Sundar, Banaras Hindu University | BHU796
L. (L.) donovani 5 | Shyam Sundar, Banaras Hindu University | BHU814
L. (L.) donovani 6 | Shyam Sundar, Banaras Hindu University | BHU 921
L. (L.) donovani 7 | Shyam Sundar, Banaras Hindu University | BHU922
L. (L.) major | M. Wilson, University of Iowa | CDCID:LW9
L. (V.) braziliensis | R. Almeida and E. Carvalho, UFBA, Salvador, Brazil | CDCID:LW10
L. (V.) braziliensis | A. Schriefer, UFBA, Salvador, Brazil |
L. (V.) braziliensis | A. Schriefer, UFBA, Salvador, Brazil | 13968

Table 3. Continued. L. (V.) guyanensis (French Guyana) L. (V.) guyanensis (French Guyana) L. (V.) panamensis (Panama) L. (V.) panamensis (Panama) N. Aronson, WRAIR WR 2853 N. Aronson, WRAIR WR 2334 N. Aronson, WRAIR WR 2306 N. Aronson, WRAIR WR 2307 L. (L.) mexicana R. Almeida and E. Carvalho, UFBA, Salvador, Brazil L. (L.) amazonensis Diane McMahon-Pratt, Yale University Leptomonas sp. BHU Shyam Sundar, Banaras Hindu University CDCID:LW BHU 151( Leptomonas seymouri like) Crithidia fasciculata ATCC LEM 138(MHOM/IN/DEVI)

chagasi MHOM/BR/00/1669 was originally isolated from a Brazilian patient with visceral leishmaniasis.

qPCR primers and probe design

Primers targeting kinetoplast minicircle, maxicircle, and nuclear genome sequences were designed for use in a SYBR green qPCR assay for detection, species discrimination and quantification (Table 4, forward and reverse primers). In four cases, these assays were designed using primers from a previously reported TaqMan assay (marked as "adapted from" in Table 4). All other primer pairs used for SYBR green qPCR assays were designed from sequence databases to amplify either genes that had been targeted by other PCR diagnostic assays of leishmaniasis, or genes targeted uniquely for the current study. In addition to SYBR green assays, we developed probes between a subset of these primer sequences for use as TaqMan assays. TaqMan probes were designed by JW with the exception of the two sequences indicated in Table 4 (kDNA 5, DNA polymerase 1).

Sequences from kinetoplast minicircles were derived from the NCBI Entrez nucleotide database. 111 Primers and probes suitable for qPCR were designed using the Primer3 website. 112 Product sizes were designed to be less than 150 bp. Minicircle primers were designed against both conserved targets and species-specific sequences. Sequences from the kinetoplast maxicircles were derived from protein coding sequences that are not subject to extensive RNA editing. 113 The maxicircle 1 primer set was specifically designed against a variable region to maximize usefulness in species discrimination. Other maxicircle primer sets were designed to amplify cytochrome B sequences.

Within the nuclear genome, the assay for the DNA polymerase 1 gene was based upon a reported TaqMan assay. 109 The flanking primer set for this TaqMan assay was found to be adequate for an independent SYBR green qPCR. The DNA polymerase gene

66 Table 4. Primers and probes used for Leishmania qpcr diagnosis and speciation Designation Forward Primer Reverse Primer TaqMan Sequence Source 1 Reference Probe 5 FAM TM / 3 TAMRA TM kdna 1 minicircle Minicircle DNA GGGTAGGGGCGTT CTGC TACACCAACCCC CAGTTTGC M94088 JW 1 kdna 2 kdna 3 minicircle AACTTTTCTGGTCC TCCGGGTAG GGGTAGGGGCGTT CTGC ACCCCCAGTTTCC CGCC EU CCCGGCCTATTTT ACACCAACC M94088 JW Leish1 and 2 kdna 4 minicircle GGGTGCAGAAATC CCGTTCA CCCGGCCCTATTT TACACCA ACCCCCAGTTTCCC GCCCCG Conserved kdna 2 JW kdna 5 minicircle kdna 7 minicircle CTTTTCTGGTCCTC CGGGTAGG AATGGGTGCAGAA ATCCCGTTC CCACCCGGCCCT ATTTTACACCAA CCACCACCCGGC CCTATTTTAC TTTTCGCAGAACG CCCCTACCCGC L infantum Contig1335 CCCCAGTTTCCCGC CCCGGA Conserved kdna 2 JW TaqMan qpcr 3 L.(L.) amazonensis kdna 1 GGTCCCGGCCCAA ACTTTTC CCGGGGTTTCGC ACTCATTT U19810 JW L. (L.) amazonensis kdna 2 GGTAGGGGCGTTC TGCGAAT CCCGGCCTATTTT ACACCAACC EU JW 51 51

67 Table 4. Continued. L. (L.) amazonensis kdna 3 L. (L.) amazonensis kdna 4 L (V.) braziliensis kdna 1 GGGTAGGGGCGTT CTGC TGAGTGCAGAAAC CCCGTTCATA AATTTCGCAGAAC GCCCCTAC TACACCAACCCC CAGTTTGC M94089 JW ACACCAACCCCC AGTTGTGA EU JW GTACTCCCCGAC ATGCCTCTG U19807 JW L (V.) braziliensis kdna 3 TGCTATAAAATCG TACCACCCGACA GAACGGGGTTTC TGTATGCCATTT TTGCAGAACGCCC CTACCCAGAGGC AF Adapted from PCR 4 Adapted L (L.) infantum Minicircle 1 TCCGCAGGAGACT TCGTATG CACGACTATCCA CCCCATCC CTGAGAGACCCGC CGGGGCG AF from nested PCR L. (L.) major Minicircle 1 L. (L.) mexicana Minicircle 1 ACGGGGTTTCTGC ACCCATT AATGCGAGTGTTG CCCTTTTG GTAGGGGCGTTC TGCGAAAA LM15_BIN_Contig 406 JW GCCGAACAACGC CATATTAACC AY JW L. (L.) tropica Minicircle 1 GGGGGTTGGTGTA AAATAGGG ACCACCAGCAGA AGGTCAAAG TCCTGGCGGGGGT TTTCGCT AF JW L. (L.) donovani Minicircle 1 GCGGTGGCTGGTT TTAGATG TCCAATGAAGCC AAGCCAGT CCCATACCACCAA ACGCAGCCCA FJ JW 52 52

68 Table 4. Continued. Cytochrome B 1 L. (L.) amazonensis Cytochrome B1 L. (L.) tropica Cytochrome B 1 L. (L.) tropica Cytochrome B 2 L. (L.) tropica Cytochrome B 3 L. (L.) tropica Cytochrome B 4 Maxicircle 1 Alpha-tubulin 1 Maxicircle DNA ATTTTAGTATGAGT GGTAGGTTTTGTT GCGGAGAGGAAAG AAAAGGCTTA CAGGTTGCTTACT ACGTGTTTATGGT G TCAGGTTGCTTACT ACGTGTTTATGGT G TGACACACATATT TTAGTGTGGGTGG TAGG CACATATTTTAGTG TGGGTGGTAGGTT TTG GCTTGGTTGGATT ATTTTTGCTG Genomic DNA GAGGTGTTTGCCC GCATC CAATAACTGGGA CGGTTGCT CCATGTACGATGA TGTCGTATTGAGG TCTAACA AB JW AAAAGTCATGCT AAACACACACCA CA AB JW TCGTATTACAAA CCCTAAATCAAA ATCTCA AB JW TGCTAAACAAAC ACCACATATGAT CTGC AB JW TCCCCAATAAGA CATCATTGTACAT GGTAA EF JW TCCCCAATAAGA CATCATTGTACAT GGTAA EF JW AACAACATTTTA ACTCTTGTAGGAT TCG CTCGCCCATGTCG TCG CTTTAGGTAGGGA GTTGTACTACGTTT TTTGACCT DQ JW TGAGGGCATGGAG GAGGGCG XM_ JW 53 53

69 Table 4. Continued. DNA polymerase 1 DNA polymerase 2 Mini-exon 1 Mini-exon 2 MSP Associated Gene 1 (MAG 1) MSP Associated Gene 2 (MAG 2) SIDER repeat 1 L. (V.) braziliensis DNA polymerase 1 L. (V.) braziliensis DNA polymerase 2 L.(L.) major MSP associated gene 1 (L. major MAG 1) L (L.) amazonensis DNA polymerase 1 TGTCGCTTGCAGA CCAGATG AGGAGGATGGCAA GCGGAAG CGAAACTTCCGGA ACCTGTCTT GTGTGGTGGCGGG TGTATGT AGAGCGTGCCTTG GATTGTG AGTTTTGGTTGGC GCTCCTG CGACCCTGTCACC ACCACAG TCGTTGAGGGAGG AGGTGTTTC ACGTCGCCAACTG CTTCACC GTCGTTGTCCGTGT CGCTGT GACGACGACGAGG AGGATGG GCATCGCAGGTG TGAGCA GCGACGGGTACA GGGAGTTG CACCACACGCAC GCACAC qpcr CAGCAACAACTTC GAGCCTGGCACC AF (ref 109 ) 4 TGGGGTCGAGCAC CATGCCGCC AF JW CGGCAAGATTTTG GAAGCGCGCA AL JW GCCCAGGTCGCT GTGAGG LbrM02_V JW CGCTGCGTTGATT GCGTTG CCCACTCGCTTTC CTTGGTC TGCGCACTGCACT GTCGCCCCC AF JW CGCTGAGAGCGAG GCAGGCACGC AF JW GAGGCCACCCTA TCGCTGAC AM JW TCGGCTTTGAGGT TGGCTTC XM_ JW GTGTTCGCACCG CCTTGAC XM_ JW CGCTGTGTGTGTC CGTGTGT XM_ JW GCGACGGGTACA GGGAGTTG AF JW TaqMan 54 54

70 Table 4. Continued. GPI HSP70-1 HSP70-4 SLACS CCAGATGCCGACC AAAGC GAAGGTGCAGTCC CTCGTGT TCGAGATCGACGC GTTGTT GGAGAAACTCACG GCACAGG Leptomonas DNA CGCGCACGTGAT GGATAAC AM (ref 110 ) CCTCCGTCTGCTT FN GCTCTTG JW CCGCACAGCTCC TCGAA FN JW GCGCCTCGTAGG TCACAGTT XM_ JW Leptomonas Mini-exon 1 TGGAGCGGGTGCA TTAACTC GGTCTCGAGGTG CCCATGAC S78663 JW Leptomonas GAPDH 2 AGAAGCCGGATGT GCTTGTG GCCCTCAGCCTTC ACCTTGT AF JW Human DNA Human TNF alpha 1 GCCCTGTGAGGAG GACGAAC AAGAGGTTGAGG GTGTCTGAAGGA CCTTCCCAAACGC CTCCCCTGCCCC NM_ JW Human TNF alpha 2 GCGCTCCCCAAGA AGACAGG TGCCACGATCAG GAAGGAGAAG CACCGCCTGGAGC CCTGGGGC NM_ JW Human GAPDH 1 GGGCTCTCCAGAA CATCATCC CCAGTGAGCTTC CCGTTCAG NG_ JW Human GAPDH 2 CATCAAGAAGGTG GTGAAGCAG CGTCAAAGGTGG AGGAGTGG NG_ JW 55 55

Table 4. Continued. Human GAPDH 3 GCATGGCCTTCCG TGTCC CGCCTGCTTCACC ACCTTCT NG_ JW

Note: Primers and probes used for Leishmania qPCR diagnosis and speciation. All sequences were developed for use as SYBR green and/or TaqMan assays.
1 JW: Primers and probes were designed by JW from sequences available on the web.
2 The conserved kDNA sequences used multiple sources aligned with CLUSTALW.
3 TaqMan qPCR: Primers and probes are the exact ones used for a TaqMan, but not a SYBR green, assay reported in the literature citation.
4 Adapted from (ref): The targeted sequence in the literature reference was used to re-design primers and probes by JW.

is a single copy gene, raising its utility for quantitative assay of parasites. Mini-exon 1 and 2, alpha-tubulin 1, HSP70 and the SIDER repeat 1 primer sets were chosen because they amplify repetitive sequences and should therefore be sensitive. The MSP Associated Gene (mag) 1 and 2 primer sets were designed against mag gene sequences known to be present only in Leishmania (L.) infantum and L. (L.) chagasi ( ), 114 in anticipation that these sequences could be useful in species discrimination. L. (L.) major contains a hypothetical ortholog of mag (LmjF ), identified through a shared Pfam-B domain (21814) assigned by the ADDA algorithm. 115 This prompted the design of the primer set L. (L.) major MSP associated gene 1. Finally, the gene encoding glucose phosphate isomerase (GPI) was selected due to prior reports of its use in qPCR assays. 110

The genus Leishmania has two subgenera, designated L. Viannia spp. and L. Leishmania spp. Examination of genomes has revealed that L. Viannia braziliensis has retrotransposable elements not found in L. Leishmania species. 33 We designed a primer set to target the splice-leader associated (SLACS) retrotransposons of L. (V.) braziliensis. Although genome sequences are not available for other members of the Viannia subgenus, L. (V.) guyanensis and L. (V.) panamensis, we also tested the ability of this primer set to differentiate between L. (V.) braziliensis and each of these other species.

Two non-Leishmania members of the order Trypanosomatida (Crithidia fasciculata and Leptomonas) were studied for potential cross-reactivity of primers. The latter organism, Leptomonas sp., has been recovered from the spleens of several individuals with symptoms of visceral leishmaniasis. 116 Primers for human TNF alpha 1 and 2 target the single copy human TNF alpha gene. 117 These and the primers for human GAPDH were designed as positive controls for qPCR amplification.

DNA extraction

Cultures of 5-10 ml were suspended in lysis buffer (150 mM NaCl, 4 mg/ml SDS, 10 mM EDTA, 10 mM Tris-HCl, pH 7.5) with 200 µg proteinase K/ml for 1 hr at 56 °C. DNA was extracted in Tris-equilibrated phenol, then phenol:CHCl3:isoamyl alcohol (25:24:1), followed by CHCl3:isoamyl alcohol (49:1). Samples were ethanol precipitated and resuspended in water.

qPCR assay conditions

All reactions were conducted on a 7900 Fast Real-Time PCR system from Applied Biosystems (ABI) in 10 µl reaction volumes. SYBR green reactions were composed of ABI Power SYBR green 2x master mix and 500 nM each of forward and reverse primers. TaqMan reactions were performed with ABI TaqMan universal PCR 2x master mix, 375 nM each of forward and reverse primers, and 250 nM of probe (label 5' FAM, quench 3' TAMRA). Thermocycling parameters were: hold (95 °C, 10 min) followed by 40 temperature cycles (95 °C for 15 sec, 60 °C for 1 min). A melt curve analysis was performed on all SYBR green reactions. Results were analyzed in SDS 2.4 software with an automatic baseline and a manual fluorescence threshold of 0.2 for determining the cycle threshold (CT) in all reactions. The R software package was used to generate melt curve plots for primer sets in which melting temperatures differed by at least 1 degree between species.

Multiplex TaqMan qPCR

A multiplex TaqMan qPCR assay was designed to detect all tested Leishmania species, and simultaneously differentiate between members of the visceralizing L. donovani complex and other species that cause primarily tegumentary disease. The DNA polymerase 2 primers and probe (label 5' TET, quench Iowa Black) and the MSP associated gene 2 primers and probe (label 5' FAM, quench TAMRA) were used together in a 10 µl reaction with 125 nM of each forward and reverse primer and 83 nM of each probe. Thermocycling parameters were identical to the conditions described above.
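As a rough illustration of how the stated final concentrations translate into pipetting volumes for a 10 µl reaction, the following sketch applies C1·V1 = C2·V2. The 10 µM working-stock concentration and the function names are assumptions made for this example only; the text does not state the stock concentrations that were used.

```python
REACTION_UL = 10.0   # reaction volume stated in the text
STOCK_UM = 10.0      # ASSUMED working-stock concentration (not given in the text)


def stock_volume_ul(final_nM, reaction_ul=REACTION_UL, stock_uM=STOCK_UM):
    """Volume of stock (in µl) giving final_nM in the reaction (C1*V1 = C2*V2)."""
    return final_nM * reaction_ul / (stock_uM * 1000.0)


def reaction_mix(primer_nM, probe_nM=None):
    """Per-well volumes (µl) for a single-target reaction with 2x master mix."""
    mix = {
        "2x master mix": REACTION_UL / 2.0,
        "forward primer": stock_volume_ul(primer_nM),
        "reverse primer": stock_volume_ul(primer_nM),
    }
    if probe_nM is not None:
        mix["probe"] = stock_volume_ul(probe_nM)
    # Whatever volume is left is available for template DNA and water.
    mix["template + water"] = REACTION_UL - sum(mix.values())
    return mix


sybr_mix = reaction_mix(primer_nM=500)                  # SYBR green conditions
taqman_mix = reaction_mix(primer_nM=375, probe_nM=250)  # TaqMan conditions
```

Under the assumed stock concentration, both reaction types leave the same volume for template and water, which would allow the same template dilution to be used for either chemistry.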

Results

Detection and differentiation between Leishmania species

Based on the SYBR green primer sets and TaqMan probes designed above, we developed qPCR assays with two objectives: to detect all Leishmania species infections in clinical or experimental specimens with high sensitivity, and to discriminate between the different Leishmania species. For these purposes, forty-one Leishmania-specific primer pairs were tested in SYBR green assays. All were tested in the presence of a 10-fold excess of human DNA relative to parasite DNA, to ensure that the reaction performs appropriately when most of the available template is host DNA. Seventeen targeted the kinetoplast minicircle, seven targeted maxicircle sequences and seventeen targeted genes in the nuclear genome. Two primer pairs were developed to identify the related Leptomonas spp. protozoa, which have been reported as co-isolates with L. donovani. 116 Five primer pairs were also included as positive controls for human DNA in specimens. These could additionally be used for normalization to a baseline of human DNA in the clinical samples tested.

For some but not all primer sets, differences in melt temperature curves were useful in distinguishing between Leishmania species (Figure 8). For instance, kDNA 1 clearly distinguished between L. (L.) amazonensis and L. (L.) mexicana or L. (L.) tropica and L. (L.) major, whereas L. (L.) donovani, L. (L.) chagasi and L. (L.) infantum exhibited very similar Tm peaks. In another example, the MSP associated gene 1 primer pair distinguished L. (L.) donovani from L. (L.) chagasi/L. (L.) infantum.

Figure 8. Melt curves of selected qPCR assays useful for species discrimination

Note: Rate of change in the intensity of the fluorescent qPCR signal is plotted across a 10 degree temperature range for each species with each primer set. Data are only shown for sets that amplified with a CT of less than 30, and showed peaks of melt curves greater than 1 degree apart between species. Peaks of curves indicate the melting temperature of the amplicon. Each melt curve is color-coded according to the parasite species it represents. Plots with a solid line indicate that the reaction amplified with a CT less than 25. Plots with a dotted line indicate the reaction amplified with a CT greater than or equal to 25 but less than 30.
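The quantities described in the figure note can be sketched in a few lines: the melt peak is the temperature at which the negative rate of change of fluorescence is largest, two species are treated as separable when their peaks lie more than 1 degree apart, and the line style follows the CT of the reaction. The curves below are synthetic sigmoids with made-up melting temperatures, and all function names are hypothetical; this is not the R code used to draw the figure.

```python
import math


def melt_peak(temps, fluorescence):
    """Temperature (interval midpoint) at which -dF/dT is largest."""
    best_temp, best_rate = None, float("-inf")
    for i in range(1, len(temps)):
        rate = -(fluorescence[i] - fluorescence[i - 1]) / (temps[i] - temps[i - 1])
        if rate > best_rate:
            best_rate = rate
            best_temp = (temps[i] + temps[i - 1]) / 2.0
    return best_temp


def peaks_separable(tm_a, tm_b, min_gap=1.0):
    """True when two melt peaks are more than min_gap degrees apart."""
    return abs(tm_a - tm_b) > min_gap


def line_style(ct):
    """Plot style used in the figure: solid below CT 25, dotted for 25 to <30."""
    if ct < 25:
        return "solid"
    if ct < 30:
        return "dotted"
    return None  # reactions at or above the CT cutoff are not plotted


def synthetic_curve(tm, temps, width=0.6):
    """Illustrative sigmoid melt curve; tm and width are invented values."""
    return [1.0 / (1.0 + math.exp((t - tm) / width)) for t in temps]


temps = [75.0 + 0.5 * i for i in range(21)]  # a 10 degree window in 0.5 degree steps
tm_species_a = melt_peak(temps, synthetic_curve(80.25, temps))
tm_species_b = melt_peak(temps, synthetic_curve(82.75, temps))
separable = peaks_separable(tm_species_a, tm_species_b)
```

Taking the interval midpoint limits the resolution to the temperature step of the instrument, so real data would benefit from smoothing or peak interpolation before the 1 degree criterion is applied.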

TaqMan assays corresponding to a subset of the SYBR primers listed in Table 4 were experimentally validated. Table 5 lists the relative efficiency of each SYBR green or TaqMan assay for detecting and differentiating between Leishmania spp. CT values listed in this table reflect the cycle at which the fluorescence intensity exceeded a threshold of 0.2 when amplifying 0.1 ng of total parasite DNA in the presence of 1 ng of human DNA. Products that amplified with a CT greater than 30 were excluded because these extreme values were often caused by primer-dimers or off-target amplification in SYBR green assays. Such artifacts can be discerned using melt curve analyses (not shown) for some primer sets. The cutoff could be raised for TaqMan assays, as primer-dimers do not impact those results. With SYBR green reagents, the cutoff of 30 cycles is roughly equivalent to parasites per well when amplifying minicircle sequences, or 102 parasites when amplifying a genomic sequence. With TaqMan reagents, the cutoff of 30 cycles is roughly equivalent to 0.18 parasites per well when amplifying minicircle sequences, or 318 parasites when amplifying a genomic sequence.

Primer pairs targeting the multi-copy minicircle kDNA (kinetoplast DNA) were the most sensitive tests for detecting any species of Leishmania. For many of the minicircle primers, exponential amplification occurred at a CT below 15 when amplifying 100 femtograms of parasite DNA using SYBR green. Maxicircle and nuclear genome targets were also effective in detecting parasites, although these primer sets were not as sensitive and required 25 or more cycles to reach detection (Table 5). Some of the late-amplifying markers demonstrated melt curves that were useful for distinguishing between species (Table 5, Figure 8). L. (L.) major MSP associated gene 1 is notable in that it amplifies all Leishmania species without amplifying the Leptomonas or Crithidia negative controls.
The CT values in Table 5 can be used to assess the relative performance difference between species, but strains within species may differ slightly. Also, absolute CT values may vary somewhat between different laboratories.

77 62 Table 5. Relative efficiency of SYBR green or TaqMan assays SYBR Green L. (L.) ama zone nsis L. (V.) braz ilien sis L. (L.) cha gasi L. (L.) don ova ni L. (L.) infa ntu m L. (L.) maj or L. (L.) mex ican a L. (L.) trop ica Lept omo nas Crit hidi a Alpha-tubulin 21.4 > Cytochrome B 1 >30 > >30 >30 >30 >30 >30 DNA polymerase >30 >30 DNA polymerase GPI >30 >30 HSP >30 >30 HSP kdna kdna 2 > kdna kdna kdna 5 > kdna L. (L.)amazonensis >30 >30 >30 >30 >30 Cytochrome B 1 L. (L.) amazonensis DNA polymerase 1 L. (L.) amazonensis kdna 3 L. (L.) amazonensis 15.9 >30 >30 >30 >30 > >30 >30 >30 kdna 4 L. (L.) amazonensis 15.2 >30 > >30 >30 >30 kdna 2 L. (L.) amazonensis > > >30 >30 >30 kdna 1 L. (V.) braziliensis > >30 >30 >30 >30 >30 >30 >30 >30 DNA polymerase 1 L. (V.) braziliensis DNA polymerase > >30

78 63 Table 5. Continued. L. (V.) braziliensis kdna 1 L. (V.) braziliensis kdna 2 L. (V.) braziliensis kdna 3 L. (L.) chagasi SIDER 1 L. (L.) infantum minicircle 1 L (L.). major MSP associated gene 1 (L. major MAG 1) L. (L.) major minicircle 1 L. (L.) mexicana minicircle 1 L. (L.) tropica Cytochrome B 1 L. (L.) tropica Cytochrome B 2 L (L.). tropica Cytochrome B 3 L. (L.) tropica Cytochrome B 4 L. (L.) tropica minicircle 1 L. (L.) donovani minicircle 1 Leptomonas GAPDH 2 Leptomonas miniexon 1 MSP associated gene 1 (MAG 1) MSP associated gene 2 (MAG 2) > >30 >30 >30 >30 >30 >30 >30 >30 > >30 >30 >30 >30 >30 >30 >30 >30 > >30 >30 >30 >30 >30 >30 >30 >30 >30 > >30 > >30 >30 >30 > > >30 >30 >30 >30 > >30 > > >30 >30 >30 >30 >30 > >30 >30 >30 >30 >30 >30 >30 >30 >30 > >30 >30 >30 > >30 > >30 > >30 > > > >30 > > >30 >30 >30 >30 >30 >30 >30 >30 > >30 >30 * >30 > >30 >30 >30 >30 >30 > >30 >30 >30 >30 >30 > >30 >30 >30 >30 >30 >30 >30 > >30 >30 > >30 > >30 >30 >30 > >30 >30 >30 >30 >30 maxicircle 1 >30 > >30 >30 >30 >30 >30 mini-exon mini-exon >30 >30 >30

79 64 Table 5. Continued. SLACS > >30 > TAQMAN L. (L.) ama zone nsis L. (V.) braz ilien sis L. (L.) cha gasi L. (L.) don ova ni L. (L.) infa ntu m L. (L.) maj or L. (L.) mex ican a L. (L.) trop ica Lept omo nas Crit hidi a TaqMan alphatubulin 26.1 > >30 >30 TaqMan >30 > >30 >30 >30 >30 >30 Cytochrome B 1 TaqMan DNA 27.1 > >30 >30 polymerase 1 TaqMan DNA >30 >30 polymerase 2 TaqMan kdna 4 >30 > >30 >30 >30 >30 >30 TaqMan kdna 5 >30 > >30 >30 TaqMan kdna 7 >30 > >30 >30 >30 >30 TaqMan L. (V.) > >30 >30 >30 >30 >30 >30 >30 >30 braziliensis kdna 3 TaqMan L. (L.) infantum minicircle 1 >30 > >30 >30 >30 >30 >30 >30 >30 TaqMan L. (L.) tropica minicircle 1 TaqMan L. (L.) donovani minicircle 1 TaqMan MSP associated gene 1 (MAG 1) TaqMan MSP associated gene 2 (MAG 2) TaqMan maxicircle 1 TaqMan mini-exon 1 >30 >30 >30 >30 >30 >30 > >30 >30 * >30 >30 > >30 >30 >30 >30 >30 >30 >30 > >30 > >30 >30 >30 > >30 >30 >30 >30 >30 >30 > >30 >30 >30 >30 > > > >30 >30

Table 5. Continued. Multiplex TaqMan TaqMan MSP associated gene 2 (MAG 2; 5' FAM / 3' TAMRA) TaqMan DNA polymerase 2 (5' TET / 3' Iowa Black) L. (L.) amazonensis L. (V.) braziliensis L. (L.) chagasi L. (L.) donovani L. (L.) infantum L. (L.) major L. (L.) mexicana L. (L.) tropica Leptomonas Crithidia >30 > >30 >30 >30 >30 > >30 >30

Note: Relative efficiency of SYBR green or TaqMan assays for detecting and differentiating Leishmania spp., or for Leptomonas spp. or Crithidia fasciculata. Data indicate the average CT values of 2 replicates for Leptomonas and L. amazonensis, or 4 replicates for all other species. CT values reflect the cycle in which fluorescence intensity reached 0.2 when amplifying 0.1 ng of total parasite DNA in the presence of 1 ng of human DNA. The cutoff of CT=30 avoids ambiguities caused by primer-dimers in SYBR green assays. Please note that CT cutoffs could differ between different strains, or when assays are performed in different labs.

Species in the subgenus Viannia were recognized by the primer set designated L. (V.) braziliensis kDNA 3, which was designed against the L. (V.) braziliensis genome (average CT values were 17.7, 13.5 and 13.3 when amplifying from L. (V.) braziliensis, L. (V.) guyanensis and L. (V.) panamensis genomic DNA, respectively). We therefore tested additional primers that might differentiate L. (V.) braziliensis from the other Viannia subgenus members. The splice-leader associated (SLACS) retrotransposons are found in the L. (V.) braziliensis genome, but not in the subgenus Leishmania spp. analyzed to date. 33 The primer pair SLACS amplified L. (V.) braziliensis DNA as predicted, although the primer set suffered from primer-dimer peaks at CTs greater than 28 in many species (Table 5). Because of a lack of genomic sequence information, it was unknown whether these primers would also amplify DNA from the other Viannia subgenus members L. (V.) guyanensis and L. (V.) panamensis. Analyses revealed that, despite the background amplification at CT values >28 due to primer-dimers, the SLACS primer set specifically amplified sequences in genomic DNA of L. (V.) braziliensis, but not of L. (V.) guyanensis or L. (V.) panamensis (average CT values were 19.7, 30.2, and 32.5 using genomic DNA from L. (V.) braziliensis, L. (V.) guyanensis and L. (V.) panamensis as templates, respectively). SLACS was therefore the only primer set tested that distinguished L. (V.) braziliensis from L. (V.) guyanensis and L. (V.) panamensis within the Viannia subgenus.

To quantitatively assess the relative sensitivity of SYBR green versus TaqMan assays of kDNA or genomic DNA sequences, we used kDNA 5 (minicircle) or DNA polymerase primers to amplify L. (L.) chagasi DNA. Standard curves were generated from DNA extracted from a known number of parasites. The approximate CT values for detection of a single parasite and the number of parasites detected at the CT cutoff point of 30 were determined (Table 6).
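A standard curve of this kind is usually summarized by a straight-line fit of CT against log10 of the parasite number. The sketch below fits such a line by least squares and reads off the two quantities mentioned above: the CT expected for one parasite (the intercept) and the parasite number corresponding to the CT cutoff of 30. The dilution series is invented purely for illustration and does not reproduce any measurement from this study; the function names are likewise hypothetical.

```python
import math


def fit_standard_curve(parasites, cts):
    """Least-squares fit of CT = slope * log10(parasites) + intercept."""
    xs = [math.log10(n) for n in parasites]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # CT expected for a single parasite
    return slope, intercept


def parasites_at_ct(ct, slope, intercept):
    """Parasite number per well that corresponds to a given CT."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0


# HYPOTHETICAL dilution series: parasites per well and the CT observed for each.
example_parasites = [1, 10, 100, 1000, 10000]
example_cts = [35.0, 31.5, 28.0, 24.5, 21.0]

slope, intercept = fit_standard_curve(example_parasites, example_cts)
threshold_parasites = parasites_at_ct(30.0, slope, intercept)
efficiency = amplification_efficiency(slope)
```

Because the detection threshold is read from the fitted line, it can fall below one parasite per well for multi-copy targets such as minicircles, where a single parasite contributes many template copies.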
The table demonstrates, first, increased sensitivity of the minicircle target sequence compared to the single copy gene target, as expected (20,400 fold enhanced sensitivity in the SYBR green assay). The table also showed increased sensitivity of SYBR green compared to TaqMan for detection of either sequence when

Table 6. Sensitivity of qPCR assays for parasite detection. Rows: Approximate CT for 1 promastigote; Detection threshold with CT cutoff of 30. Columns: Minicircle, SYBR green; Minicircle, TaqMan; Genomic, SYBR green; Genomic, TaqMan.

Note: Sensitivity of qPCR assays for parasite detection. DNA extracted from known numbers of L. infantum chagasi promastigotes was amplified with kDNA 5 minicircle primers, or with primers hybridizing to the single copy gene DNA polymerase. The average CT values corresponding to numbers of promastigote genomes in a single well are indicated. Please note that this table should not be used for exact quantification, because the efficiency of the PCR reaction is expected to vary between the Leishmania species.

using earliest amplification as a measure of performance. TaqMan is expected to have greater specificity than SYBR green, although when used in combination with melt curve analysis SYBR green can approach the sensitivity of TaqMan assays for numbers of parasites exceeding the threshold.

Specificity of primer pairs

The reactions in Table 5 were carried out using 0.1 ng of parasite DNA in the presence of 10 ng of human DNA. No primer set amplified human DNA alone at a CT lower than 30. The alpha-tubulin, DNA polymerase 2, HSP70-4, L. amazonensis DNA polymerase 1, L. (V.) braziliensis DNA polymerase 2, kDNA 1-7, L. (L.) amazonensis kDNA 3, L. (L.) major minicircle 1, and mini-exon 1 primer sets showed cross-reactivity with Crithidia fasciculata or Leptomonas but not with human DNA, indicating primer hybridization to sequences in other Trypanosomatid protozoa. The CT can be used to differentiate species. A low CT serves as a criterion for choosing one marker over another with a higher expected CT for the species of interest. Examination of melt curves for some of these primer sets (e.g. mini-exon 2 in Figure 8) shows low amplification (high CT, shown

as dotted lines in the figure) for many different species, with peaks at a lower melt temperature than the main peak, which amplifies at the lowest CT value (L. (V.) braziliensis in Figure 8). This low amplification could have been caused by the production of primer-dimers, although we cannot rule out a universally shared off-target DNA sequence. The final identification of a Leishmania species rather than a cross-reactive sequence can be made either by examining the melt temperature in SYBR green assays, or by use of the TaqMan probe at the validation step.

Leishmania species identification

The above qPCR assays can be used, first, to identify and quantify any organism belonging to the genus Leishmania, and second, to discriminate between Leishmania species. Species identification is accomplished either by observing the melt temperature of the amplicon, or by observing the presence or absence of some amplicons. Nine of the designed primer sets were found to be useful in discriminating between Leishmania species based upon the melt temperature of the amplicon (see Figure 8). L. (L.) infantum and L. (L.) chagasi could not be distinguished based on any melt curve analysis. Because these two species are considered virtually identical, this result supports the validity of the melt temperature approach. More distantly related Leishmania species could be readily distinguished, depending on the primer set used.

In some instances, species identification could be attained through the exclusive amplification of one or a few species by a primer set. For example, L. (L.) mexicana minicircle 1 uniquely amplifies L. (L.) mexicana DNA but not DNA from other species. We used the above information to generate a flow chart of qPCR tests recommended to (1) determine whether any Leishmania spp. is present in a clinical or environmental sample and (2) identify which species is present (Figure 9). The example shows a sequence of tests appropriate for species found in Latin America.
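The serial logic of such a flow chart can be written down as a small decision function. The sketch below is only one reading of the two-step scheme described in this section (detection with kDNA 1 and L. (V.) braziliensis kDNA 3, followed by species discrimination with melt curves and species-restricted primer sets). The dictionary keys and category labels are hypothetical names introduced for this example, and assigning a melt curve to a category is assumed to have been done beforehand against positive control DNAs.

```python
def identify_species(results):
    """Walk the serial qPCR scheme for one sample.

    `results` maps hypothetical test names to observed outcomes:
    booleans for amplification, strings for melt curve categories.
    """
    # Step 1: detection. Viannia species are picked up by the
    # L. (V.) braziliensis kDNA 3 primer set.
    if results.get("braz_kdna3_positive"):
        slacs = results.get("slacs_positive")
        if slacs is None:
            return "Viannia subgenus (SLACS not tested)"
        if slacs:
            return "L. (V.) braziliensis"
        return "Viannia subgenus, not L. (V.) braziliensis"
    if not results.get("kdna1_positive"):
        return "no Leishmania detected"

    # Step 2: species discrimination, guided by the kDNA 1 melt curve.
    group = results.get("kdna1_melt_group")
    if group == "tropica":
        return "L. (L.) tropica"
    if group == "mexicana_or_major":
        if results.get("mexicana_minicircle1_positive"):
            return "L. (L.) mexicana"
        return "L. (L.) major"
    if group == "amazonensis_or_donovani_complex":
        if results.get("amazonensis_kdna2_positive"):
            return "L. (L.) amazonensis"
        mag1 = results.get("mag1_melt_group")
        if mag1 == "donovani":
            return "L. (L.) donovani"
        if mag1 == "chagasi_infantum":
            return "L. (L.) chagasi / L. (L.) infantum"
    return "Leishmania sp., species undetermined"


example_call = identify_species({
    "kdna1_positive": True,
    "braz_kdna3_positive": False,
    "kdna1_melt_group": "mexicana_or_major",
    "mexicana_minicircle1_positive": True,
})
```

Returning an explicit "undetermined" result, rather than guessing, mirrors the recommendation that additional primer sets be chosen according to the species expected in the region.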
According to the figure, the DNA is first amplified using SYBR green qPCR assays for kDNA 1 and L. (V.) braziliensis kDNA 3, to detect all Leishmania species. The next step is examination

Figure 9. Flow chart of the serial qPCR assay

Note: This figure shows a minimal application of the serial diagnostic qPCR assay to determine the presence and species of Leishmania in a sample. In step 1 (detection), an unknown DNA sample can be tested for the presence of Leishmania spp. DNA using SYBR primers. kDNA 1 amplifies most species. L. (V.) braziliensis Minicircle 3 amplifies kDNA sequences within L. (V.) braziliensis; this primer set would be included when testing samples from Latin America to provide the most sensitive detection across species. In step 2, a sample that has tested positive for parasite kDNA 1, but not L. (V.) braziliensis kDNA 3, can be classified according to its species. Melt curve analysis of kDNA 1 amplicons can distinguish between several Old World species (L. (L.) tropica, L. (L.) major, L. (L.) infantum). SYBR green melt curve analysis using MSP associated gene (MAG) 1 is capable of separating L. (L.) chagasi and L. (L.) infantum from L. (L.) donovani. The presence or absence of L. (L.) mexicana-specific minicircle amplicons is sufficient to differentiate between L. (L.) mexicana and L. (L.) major. Distinguishing L. (L.) amazonensis from the members of the L. (L.) donovani complex requires an additional primer set such as L. (L.) amazonensis kDNA 2 (not shown in flowchart). The inability of any primer pair tested to distinguish between L. (L.) chagasi and L. (L.) infantum is consistent with the current belief that these species are virtually identical. 119

Note: The primer sets listed are minimal sets. It is advisable to select additional primers based upon the expected Leishmania species in the geographic region.

of the melting temperature of kDNA amplicons, a step that will differentiate false from true positives. Samples that test positive for Leishmania DNA would undergo a secondary set of SYBR green qPCR reactions to determine the species of infecting parasite, and to validate the initial result. These include L. (L.) amazonensis kDNA 2 and MSP associated gene 1 for specimens in which L. (L.) amazonensis, L. (L.) chagasi/infantum or L. (L.) donovani are suspected from the kDNA 1 melt curve, and L. mexicana minicircle 1 when L. mexicana or L. major are suspected. The melting temperature of an amplicon depends on its length and GC composition. It is recommended to conduct the species test along with appropriate positive control DNAs, as melt curve temperatures may vary slightly between experiments due to subtle differences in buffers or between machines. As a final verification, which could be considered optional especially in settings where cost is a primary issue, a TaqMan assay specific for the infecting Leishmania species may be used. The inability of any primers to distinguish between L. chagasi (also referred to as L. (L.) infantum chagasi) and L. infantum is consistent with the current belief that these species are genetically identical. 118, 119

Other series of tests would be recommended for detection and species discrimination in other geographic regions (see Table 7 for a list of markers to distinguish common species). The list is not exhaustive, and existing tests will need to be evaluated for their capacity to distinguish additional species as needed for some regions.

The combination of DNA polymerase 2 and MSP associated gene 2 TaqMan qPCR assays was developed into a multiplex reaction that distinguished between visceralizing and other Leishmania species. DNA polymerase 2 is capable of amplifying DNA from all species, albeit less efficiently in L. (L.) tropica. The latter required several more cycles of amplification compared to other species tested.
Among the visceralizing species L. (L.) chagasi, L. (L.) donovani and L. (L.) infantum, the MSP associated gene was amplified at earlier CTs than DNA polymerase, mitigating the possibility that the DNA polymerase 2 amplification could consume the reaction reagents

86 71 Table 7. Serial qpcr studies recommended for detection and speciation of Leishmania spp. in clinical or environmental specimens based on geographic region Region North, Central and South America North, Central and South America North, Central and South America Clinical Syndrome a LCL MCL, DL, DCL VL Expected Leishmania species L. (L.) amazonensis L. (L.) mexicana L. (V.) b braziliensis, L. (V.) panamensis,l. (V.) guyanensis L. (L.) infantum chagasi (rare) L. (V.) braziliensis, L. (V.) panamensis, L. (V.) guyanensis L. (L.) amazonensis L. (L.) mexicana L. (L.) infantum L. (L.) amazonensis (rare) Europe VL L. (L.) infantum L. (L.) donovani Middle East, Northern and sub- Saharan Africa Middle East, Northern and sub- Saharan Africa CL, DCL, LR VL, PKDL L. (L.) major L. (L.) tropica L. (L.) infantum (rare) L (L.) donovani (rare) L. (L.) infantum L. (L.) donovani L. (L.) tropica (rare) First step: Detection kdna 1 L. braziliensis kdna 3 L. amazonensis kdna 3 L braziliensis kdna 3 kdna 2 L. amazonensis kdna4 kdna 1 kdna 1 kdna 1 Second step: Species discrimination kdna 1 melt curve MAG 1 melt curve L. braziliensis kdna 3 SLACS L. amazonensis kdna3 melt curve L. braziliensis kdna 3 SLACS kdna 2* L. amazonensis kdna4* MAG 1 melt curve kdna 1 melt curve MSP associated gene 1 (MAG 1) melt curve L. tropica Cytochrome B 1* kdna 1 melt curve MAG 1 melt curve L. tropica cytochromeb1*

Table 7. Continued. India/Bangladesh/Pakistan/Nepal: VL, PKDL; L. (L.) donovani; kDNA 1-7 (any one). Asia: VL; L. (L.) donovani; kDNA 1-7 (any one).

Note: Clinical syndromes are: DCL, diffuse cutaneous leishmaniasis; DL, disseminated leishmaniasis; LCL, localized cutaneous leishmaniasis; LR, leishmaniasis recidivans; MCL, mucocutaneous leishmaniasis; VL, visceral leishmaniasis. b Viannia or V. refers to parasites belonging to the subgenus Viannia. All other species listed belong to the Leishmania subgenus. *Marker amplifies only one species listed in the 3rd column.

and mask the results (Table 5). L. (L.) tropica did show some amplification at late CTs with the MSP associated gene 2 primer set. However, since this amplification occurred late in the reaction, and later than the DNA polymerase 2 amplification, L. (L.) tropica is easily discernible from the L. (L.) donovani complex members. This finding is of interest due to the ability of L. (L.) tropica to cause not only cutaneous ulcers but also disseminated disease, 120 although one cannot draw conclusions without sequence and functional data.

kDNA copy numbers

Real-time and TaqMan PCR are capable of quantifying parasite numbers in clinical or experimental specimens. Absolute quantification is determined by comparison to standard curves from parasite DNA. Since kDNA primer sets are the most sensitive, they are a natural choice for quantification. However, the use of kDNA for quantification is complicated by the fact that Leishmania species contain multiple copies of both minicircle and maxicircle kDNA. It is not known whether copy numbers differ between the Leishmania species, between different isolates of the same species, or between parasite stages. We therefore employed minicircle and maxicircle kDNA qPCR assays to determine the relative difference in minicircle and maxicircle kDNA copy numbers between three different species of Leishmania (Table 8), between a recent and a multiply-passaged line of the same L. (L.) chagasi isolate (Table 8A), between eight strains of L. (L.) donovani derived from patients in the same endemic region of Bihar State, India (Table 8B), and between four L. (L.) tropica isolates recently derived from infected humans in the Middle East (Table 8C). L. (L.) tropica isolates were kindly provided by Dr. F. Steurer of the Centers for Disease Control.
To calculate differences in kDNA copy numbers between species or growth conditions, kDNA quantification was normalized to a DNA polymerase I target, which is a single-copy gene in the parasite genome, and relative copy number differences were calculated using the ΔΔCT method.
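The normalization described above can be sketched as follows. This is a minimal illustration of the standard 2^-ΔΔCT calculation, not the analysis code used in the study; the function name and the CT values are hypothetical, and the sketch assumes roughly 100% amplification efficiency for both the kDNA target and the single-copy reference.

```python
def fold_change_ddct(ct_target_sample, ct_ref_sample,
                     ct_target_calibrator, ct_ref_calibrator):
    """Relative copy number of a target by the 2^-ddCT method.

    Assumes ~100% amplification efficiency for both primer sets
    (an assumption of this sketch, not a result from the text).
    """
    # Normalize the target (e.g. minicircle kDNA) to the single-copy
    # reference gene (e.g. DNA polymerase I) within each sample.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    # Compare the test isolate with the reference (*) isolate.
    ddct = delta_sample - delta_calibrator
    return 2.0 ** (-ddct)


# Hypothetical CT values: the test isolate's kDNA crosses threshold one
# cycle earlier than the calibrator's, at equal reference-gene CT.
fold = fold_change_ddct(14.0, 24.0, 15.0, 24.0)
print(fold)  # 2.0, i.e. twice as many kDNA copies per genome
```

A value above 1 indicates more target copies per genome than in the calibrator isolate, and a value below 1 indicates fewer.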

Table 8. The relative copy number differences in kinetoplast mini- and maxicircle sequences

(A) L. donovani complex (columns: Leishmania species, Maxicircle, Minicircle)
L. (L.) chagasi
L. (L.) chagasi (attenuated)
L. (L.) donovani
L. (L.) infantum*
Mean ± SE fold increase: maxicircle 3.61 ± 1.45; minicircle 3.56 ± 1.60

(B) L. donovani isolates (columns: Leishmania species, Maxicircle, Minicircle)
L. (L.) donovani (reference)*
L. (L.) donovani (seven further isolates)
Mean ± SE fold decrease/increase: maxicircle 2.18 ± 0.41; minicircle 3.13 ± 0.77

(C) L. tropica isolates (columns: Leishmania species, Maxicircle, Minicircle)
L. (L.) tropica 1*
L. (L.) tropica (four further isolates)
Mean ± SE fold increase: maxicircle 2.44 ± 0.78; minicircle 2.53 ± 0.92

Note: (A) The relative copy number difference in kinetoplast mini- and maxicircle sequences between the indicated Leishmania donovani complex species. Changes are relative to the (*) indicated sample. (B) The relative copy number of maxi- or minicircle DNA is indicated between 8 isolates of L. donovani and (C) 5 isolates of L. tropica.

These data showed that there are indeed differences in the copy numbers of maxi- or minicircle DNA, both between species and between isolates within a single species. The minicircle or maxicircle copy numbers varied by 2.53±0.92- or 2.44±0.78-fold (mean±SE), respectively, between different isolates of L. tropica (Table 8C). Mini- or maxicircle copy numbers differed by 3.13±0.77- or 2.18±0.41-fold, respectively, between different isolates of L. donovani (Table 8B). Although we did not compare all Leishmania species listed in Tables 1 and 3, there were differences in maxi- or minicircle copy numbers between the 3 species tested (3.61±1.45- or 3.56±1.60-fold, respectively).

As a means of addressing changes in kDNA copy numbers during stage transition, we applied the same technique to an L. (L.) chagasi strain capable of transforming between promastigote and axenic amastigote in culture. Mini- and maxicircle copies were quantified weekly during the conversion from promastigote to amastigote, or from amastigote to promastigote, in response to culture conditions inducing stage transition (Figure 10). At the week one time-point there was no significant difference in minicircle copies as parasites converted either from amastigote to promastigote or from promastigote to amastigote (p>0.1, N=3; two-tailed t-test assuming unequal variances). Maxicircle copies underwent a greater fold change, although this did not reach significance. One week after the transition from amastigote to promastigote, newly transformed promastigotes contained 0.74-fold fewer copies of maxicircles than the amastigote baseline (solid bar week 1; p=0.09, N=3). Conversely, one week after conversion from promastigotes, axenic amastigotes contained 1.50-fold higher numbers of maxicircle copies than the baseline in promastigotes (striped bar week 1; p=0.08, N=3).
Although axenic amastigotes are not identical to tissue-derived amastigotes, these findings suggest that standard curves generated from promastigote-derived DNA, particularly using minicircle kDNA probes, can be validly applied to quantification of amastigotes of the same species/strain.
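Absolute quantification against a promastigote standard curve can be sketched as below. This is an illustrative example only: the dilution series, CT values and function names are hypothetical, and the sketch assumes the usual log-linear relationship between CT and the starting number of parasite equivalents.

```python
import math


def fit_standard_curve(parasite_counts, ct_values):
    """Least-squares fit of CT = slope * log10(count) + intercept."""
    xs = [math.log10(count) for count in parasite_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def parasites_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate parasite equivalents."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical 10-fold dilution series of promastigote DNA.
standards = [1e1, 1e2, 1e3, 1e4, 1e5]
standard_cts = [31.0, 27.7, 24.4, 21.1, 17.8]

slope, intercept = fit_standard_curve(standards, standard_cts)
unknown = parasites_from_ct(26.0, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, estimate={unknown:.0f}")
```

Because kDNA copy numbers vary between isolates and species, the standards in such a curve would ideally come from the same species or strain as the unknown sample.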

Figure 10. kDNA copy numbers during stage transition

Note: Maxicircle (left) or minicircle (right) kDNA primer pairs were used to quantify copy numbers in the promastigote and amastigote life stages when converting in vitro between the two life stage forms. Solid bars: Total parasite DNA was extracted from amastigotes to determine basal expression, and then weekly after in vitro conversion to promastigotes for comparison. Striped bars: Total parasite DNA was extracted from promastigotes to use as baseline, and then weekly after in vitro conversion to amastigotes for comparison. kDNA abundance was normalized to the single-copy gene DNA polymerase I, and expressed as fold change in promastigotes (solid bars, over an illustration of a sand fly) or amastigotes (striped bars, over an illustration of a macrophage) relative to the pre-conversion stage. Arrows between an illustration of a sand fly or macrophage represent the change of culture conditions introduced at time zero.

Detection of L. donovani in clinical and other unknown isolates

SYBR green based qPCR assays were applied to serum samples from patients with visceral leishmaniasis (VL) caused by L. donovani, and from control patients undergoing treatment for other diseases (non-VL) (Figure 11). A priori, serum specimens would be expected to be cell-free, and therefore not to contain DNA from intracellular Leishmania spp. parasites. Nonetheless, if there were inapparent lysis of leukocytes during blood collection, DNA could be present. Indeed, DNA was extracted from the serum specimens, and these were positive for Leishmania DNA in all five patients with VL, using the kDNA 3 SYBR green assay (Figure 11). All negative controls were negative for parasite DNA, including five control Bangladeshi subjects, one individual from Iowa, and control human DNA from a North American subject. Parasites were quantified in all positive subjects, revealing 595 ± 178 (standard error) Leishmania parasites per ml of serum.

Serial qPCR was used to speciate parasites from a different type of clinical specimen, i.e. cutaneous biopsies for diagnosis of cutaneous leishmaniasis. Biopsy specimens are routinely obtained for diagnostic evaluation of cutaneous ulcers that are suspected to be cutaneous leishmaniasis in clinical settings in northeast Brazil. To evaluate the serial qPCR for diagnosis/speciation using these specimens, DNA was extracted from the biopsies of three patients living in a region of the state of Bahia, Brazil, where L. (V.) braziliensis and L. (L.) amazonensis have been endemic. 121, 122 Extracted DNA samples from all three patient biopsies were tested by serial qPCR as in Figure 9, and all three were positive for the presence of L. Viannia spp. according to the L. (V.) braziliensis kDNA 3 TaqMan assay.
We tested an additional 11 DNA samples extracted from parasites isolated from cutaneous lesions of subjects living in Manaus, Amazonas, Brazil, as well as 19 DNA samples from different clinical forms of tegumentary

Figure 11. qPCR detection of L. (L.) donovani in human sera

Note: DNA extracted from 200 µl of serum was used to detect and quantify Leishmania DNA in five VL patients from Bangladesh (VL). The low numbers of parasites detected may be the result either of whole parasites being sampled or of lysed parasite DNA floating in peripheral blood. Non-VL endemic control (EC) patients and donor serum without exposure to Leishmania (NC 1) showed no evidence of infection. Other negative controls (NC) were a human DNA template (NC 2) and water (NC 3).

leishmaniasis from Corte de Pedra, Bahia, Brazil (6 from cutaneous, 7 from disseminated, and 6 from mucosal leishmaniasis). L. (V.) braziliensis kDNA 3 amplified members of the L. Viannia subgenus regardless of species. Furthermore, all 19 isolates from subjects in Corte de Pedra amplified with the SLACS primer set, 123 which has specifically amplified L. (V.) braziliensis from among the Viannia subgenus, suggesting that these samples were L. (V.) braziliensis as opposed to L. (V.) guyanensis or L. (V.) panamensis. These data also suggested that the SLACS genes were present regardless of clinical form. None of the 11 samples from Manaus were amplified by the SLACS primer set, indicating that these 11 isolates were species of the Leishmania Viannia subgenus, but not L. (V.) braziliensis. The presence of LRV1 RNA viruses in L. (V.) guyanensis has recently been reported to be associated with distinct clinical presentations of disease. 123 It was of interest to consider whether retrotransposons might also be a biomarker for the clinical presentation of symptomatic L. (V.) braziliensis infections. However, no significant differences in SLACS genomic copy numbers were observed between isolates with different clinical presentations (tested via the ΔΔCT method relative to the genomic L. (V.) braziliensis DNA polymerase 2 primer set, with Student's t-tests at p < 0.05 between clinical presentations).

Five DNA samples originally obtained from patients in different regions of the world were tested as unknowns using the serial qPCR assay. No information about geographic location or species was provided at the time of serial qPCR testing. Species identification of the isolates, determined at the Centers for Disease Control by isoenzyme testing, indicated they were L. (V.) braziliensis, L. (L.) donovani, L. (L.) infantum, L. (L.) mexicana, and L. (L.) major.
Using the qPCR primer pair selection strategy outlined in Table 7, the species identified through the qPCR serial testing corresponded to the species designation according to isoenzyme analysis for all 5

unknowns. Primer sets used in this analysis were kDNA 1, L. (V.) braziliensis kDNA 3, MSP associated gene 1 (MAG 1), and L. (L.) mexicana minicircle 1.

Discussion

The global burden of leishmaniasis is high. 124 Visceral leishmaniasis has reached epidemic proportions in three regions (Bangladesh/Nepal/Bihar State, India, southern Sudan-Ethiopia/Afghanistan and Brazil). Drug resistance has further impacted the disease burden in northern India, 23 and the HIV epidemic has led to new patterns of visceral leishmaniasis in Spain and Portugal. 125 Urbanization has caused spread of visceral leishmaniasis in peri-urban settings of northeast and southern Brazil. 21 Cutaneous leishmaniasis is most prevalent in the Middle East, Syria, Brazil and Peru, 126 but imported cutaneous leishmaniasis is becoming problematic in the USA, particularly in military personnel. 127, , 129 Leishmaniasis in the blood supply is becoming a concern, and an outbreak of Leishmania (L.) infantum infection in foxhounds has introduced a possible reservoir into the USA. 130 As such, leishmaniasis is emerging as a concerning disease of both human and veterinary importance.

Diagnostic testing for leishmaniasis is less than optimal. The need for improved diagnostic procedures is evident in the current setting of expanding disease burden. Clinical presentation and a positive test of immune response to the parasite (serology during visceral leishmaniasis or Leishmania skin testing during cutaneous leishmaniasis) can be suggestive, but definitive diagnosis requires parasite identification. The latter can be accomplished by microscopic examination of tissue biopsies, bone marrow or splenic aspirates, and/or cultivation of live organisms from clinical tissue specimens. The sensitivity varies with the specimen source. Microscopic exam and culture from splenic aspirate has >95% specificity and close to 100% sensitivity for diagnosis of Indian kala azar, respectively, but the procedure is risky.
Even in the most skilled centers there have been deaths due to hemorrhage after splenic biopsy. Thus the procedure is not

practiced in the Western Hemisphere, but it is the diagnostic procedure of choice in India. The sensitivity of bone marrow aspirate is lower (52-69%) but the procedure is safer, 114 albeit still invasive. Microscopic identification and/or culture from biopsied cutaneous lesions is less sensitive, and sensitivity varies with the disease form, with parasites isolated from only 30-50% of localized cutaneous ulcers and rarely isolated from lesions of mucosal or disseminated leishmaniasis. The species of Leishmania are most commonly differentiated by electrophoretic mobility of isoenzymes from cultured promastigotes, 134 a lengthy method that is only feasible if parasites are isolated in culture.

Molecular methods to detect Leishmania spp. DNA include hybridization of infected tissues 135 and amplification using routine PCR (polymerase chain reaction), PCR-ELISA, 139 or quantitative PCR. 103, 106, 140 These methods have been successfully applied for detection of individual Leishmania species in clinical samples, including peripheral blood leukocytes of patients with visceral leishmaniasis or asymptomatic infection. 83, 141, 142 Discrimination between the Leishmania species has been reported using RAPD (random amplification of polymorphic DNA) 143 and qPCR. Species-discriminating qPCR assays that can distinguish between the two Leishmania subgenera Viannia and Leishmania, between Leishmania complexes (L. (L.) mexicana, L. (L.) donovani/infantum, L. (L.) major) or between L. (L.) donovani strains 144 have been reported. Despite many reports, a standardized rapid and sensitive test that distinguishes the spectrum of common Leishmania species is still needed. 105

The purpose of this study was to develop a comprehensive series of qPCR assays adequate for detection and speciation of Leishmania spp. in clinical, environmental or experimental samples.
To accomplish this goal, we chose as qPCR targets sequences that are present in multi-copy genes, or reported targets for conventional or quantitative PCR in the literature. Targets also included kinetoplast minicircle DNA, 106, 140 other repetitive sequences such as rRNA coding or intergenic spacer regions, and single-copy genes including those encoding DNA polymerase 109 or glucose phosphate isomerase. 110

From these we designed a series of primers and probes that were then tested for their sensitivity and specificity to detect eight different Leishmania species. All assays retained specificity in the presence of 10-fold excess human DNA. As expected, multi-copy targets were the most sensitive assays for detection of Leishmania spp. DNA. The most efficient was minicircle kDNA, although the sensitivity of individual kDNA primer sets differed between Leishmania species. A combination of two kDNA primer sets was finally chosen (kDNA 1, L. (V.) braziliensis kDNA 3) to detect all Leishmania spp., with the latter used to detect L. (V.) braziliensis alone. Species discrimination was approached starting with the melt curve information from the kDNA 1 primer set. The MAG 1 primer set could be used to differentiate L. (L.) donovani from L. (L.) infantum chagasi. Identification of L. (L.) amazonensis could be confirmed with the L. (L.) amazonensis kDNA 2 primer set. The L. (L.) mexicana minicircle 1 primer set was found to differentiate L. (L.) mexicana from L. (L.) major. If the above led to ambiguity regarding species identification, a number of other primer sets could be employed, with melt-curve criteria used for species differentiation as shown in Figure 8. Thus, detection and speciation can be achieved using SYBR green alone, without a need for expensive TaqMan probes, and the sets should provide enough flexibility to detect and differentiate infections with Leishmania isolates in many clinical specimens. In addition to the above SYBR green assays, we also tested several TaqMan probes and primer sets. These could be used to provide additional validation of findings in problematic cases.
Furthermore, at least some of these can be developed into multiplex assays for detection/species determination, illustrated by the ability of a multiplex assay containing MSP associated gene 2 and DNA polymerase 2 TaqMan primer pairs to distinguish visceralizing species from all other Leishmania species (Table 5). There are a few reports documenting individuals with symptoms resembling visceral leishmaniasis, from whom Leptomonas has been isolated alone or in combination

with L. donovani. 116 Whether these are primary causes of disease or coinfections is not yet clear. Therefore we incorporated Leptomonas into our design of primers and probes. GAPDH and repetitive mini-exon sequences were sufficient to identify and quantify Leptomonas species. Although Leptomonas GAPDH 2 primer sets also amplified L. braziliensis, Leptomonas mini-exon 1 did not, making these primer sets useful even in the case of Leptomonas-Leishmania spp. co-infections.

Because parasites in clinical specimens need to be quantified with the most sensitive primer sets, we investigated the validity of minicircle primers for comparison of clinical samples with standard curves of promastigote DNA. We therefore applied qPCR tests using primers for mini- or maxicircles, normalized to a single-copy gene (DNA polymerase I), to compare kDNA copy numbers between the promastigote and amastigote life stages, between different strains of individual Leishmania species, and between different species of Leishmania. There were no significant differences in copy numbers of either mini- or maxicircle DNA between the promastigote and the amastigote stages, suggesting that the use of promastigote DNA to quantify amastigotes in mammalian specimens would be valid. However, kinetoplast DNA sequence copy numbers differed moderately between strains of the same species, and even more so between different species of the Leishmania donovani complex. The latter conclusions are complicated by the expected variability of primer hybridization to kDNA of different species. We conclude that the use of minicircle primer sets for quantification might be most accurate by comparison with a standard curve generated using DNA from the same species, but absolute numbers would not be as accurate as quantification based on a chromosomal gene target.
Logically, it seems valid to use the kDNA primer set for relative quantification in a single patient to follow response to therapy, or in a single household or family that might share parasite strains. However, caution must be used in comparing absolute parasite numbers between isolates or species using kDNA primers.

In conclusion, we report herein a series of qPCR assays that is sufficient to identify and speciate parasite DNA in serum samples from patients with visceral leishmaniasis or lesion biopsies of subjects with tegumentary leishmaniasis. The validation using SYBR green makes this a rapid and relatively inexpensive means of detecting, quantifying and determining the Leishmania species in clinical or epidemiologic samples. In addition to diagnosis, useful applications could include quantitative analysis of parasites in buffy coat or serum specimens to document response to therapy, species determination in tissue biopsies or tissues scraped from microscopic slides, and studies of parasite loads in sand flies from endemic regions. The qPCR serial testing strategy requires a reference lab with the technical capacity to perform qPCR. This technology is becoming available in many countries and could be developed in a central diagnostic laboratory in endemic countries.

CHAPTER IV. FIELD APPLICATIONS OF qPCR: SPECIES IDENTIFICATION AND QUANTIFICATION

4.1. Introduction

Transmission of leishmaniasis occurs when an infected sand fly feeds on a human host. Leishmania replicate over a 7-9 day period during which promastigotes undergo a series of morphologic changes while traveling retrograde from the midgut to the foregut and proboscis of the insect vector. They ultimately become virulent non-replicating metacyclic stage parasites, ready for transmission to a new host. 145, 146 Leishmania are capable of infecting a number of mammalian hosts, and current literature considers domestic dogs to be the primary reservoir for Leishmania infantum chagasi in Brazil. In contrast, humans are the primary reservoir for Leishmania donovani in India. 147 Rodents and wild animals serve as reservoirs for a number of other species. 148 Efforts to cull infected dogs in Northeast Brazil have failed to decrease the incidence of human visceral leishmaniasis, leading us to question whether the domestic dog is always the most important reservoir.

Asymptomatic infections with the Leishmania species are commonly found in humans. In Brazil, these are detected by a positive Montenegro skin test with Leishmania antigens. 22 It is unknown, however, how many asymptomatic infections go undetected. 149 Furthermore, there are cases in which a delayed-type hypersensitivity (DTH) response is lost; as such, we cannot be certain that all asymptomatically infected persons will be DTH positive. 149 Efforts to measure the risks of asymptomatic infection and the ratio of asymptomatic to symptomatic infections are hampered by our inability to assess how many people are actually exposed to infected sand flies. This led us to question whether the prevalence of Leishmania infections in sand flies in endemic neighborhoods could be assessed using molecular methods. This would allow us to identify the most at-risk households and

neighborhoods. Furthermore, access to sand flies in houses where visceral leishmaniasis is spread would give us access to the blood meal, and allow us to determine the source of the blood meal, an indication of the most likely reservoir. For these reasons we have applied the quantitative PCR assay described in Chapter III to the detection and quantification of Leishmania in both sand fly vectors and human hosts.

We ascertain the houses in which there has been recent spread of L. infantum chagasi through individuals admitted to the local public hospitals with acute visceral leishmaniasis. After obtaining consent, our collaborator, Dr. Jeronimo, is able to travel to their homes with a member of the state Ministry of Health. This has allowed her and colleagues to collect sand flies from the insides and outsides of houses in these endemic regions, and from the homes of consenting close neighbors. She is also able to determine the ratio of humans to dogs or other potential reservoirs in the region. Furthermore, access to members living in the same households as subjects with VL enables us to screen for individuals with asymptomatic infection in a population that is likely to be exposed.

Quantitative PCR (qPCR) assays provide a sensitive means of detection for the pathogenic parasites of the genus Leishmania. Assays capable of detecting Leishmania spp. in both the host and vector, and of determining species, give laboratories in Leishmania-endemic regions an improved ability to address questions in molecular epidemiology by providing a higher throughput of parasite detection and speciation. We will provide background on specific applications of our serial qPCR assay, and highlight how it is aiding investigations in the states of Bahia, Brazil, Rio Grande do Norte, Brazil, and Bihar, India. The serial qPCR assay described in Chapter III has been applied to the projects of several investigators in endemic regions.
Preliminary results from these studies have informed investigators as to the prevalent species causing disease in Bahia, Brazil, and Honduras. Along with investigators in Rio Grande do Norte, Brazil we have quantified the parasite burden carried by infected sand fly vectors. These assays have also been

applied to laboratory samples of patients infected with VL in Rio Grande do Norte, Brazil, and Bihar, India. Access to qPCR in endemic regions is limited to regional hospitals and research laboratories. In a setting such as Natal, Brazil, the serology and qPCR are centralized in our collaborator's laboratory and thus are available to all clinicians in the city. However, qPCR would not be accessible in countries where cases are not peri-urban, but occur in more remote locations (e.g. India). qPCR assays are being adopted by research laboratories interested in molecular epidemiology and the evaluation of clinical interventions worldwide. Thus it is useful even in regions where its implementation for clinical use is not currently cost effective. Further extensions of the assay, using the simple constant-temperature LAMP (loop-mediated isothermal amplification of DNA) technology, 150 may enable future adaptations of this assay to be applied to clinical diagnosis.

Applications

Quantification in the sand fly vectors and asymptomatic human hosts in Rio Grande do Norte, Brazil

In Northeast Brazil, visceral leishmaniasis (VL) is caused by Leishmania (Leishmania) infantum chagasi. Near the city of Natal, Rio Grande do Norte, Brazil, these infections are often acquired in homes within impoverished peri-urban areas. Sand flies that had recently taken a blood meal were captured in households in Natal, Brazil, and DNA was isolated. Preliminary results of quantifications made with kDNA 5 primers and TaqMan probe (Table 4) estimate the number of Leishmania present in infected sand flies (Figure 12). The results, expressed as the number of Leishmania per sand fly, were back-calculated based on the proportion of the total amount of sample DNA that was included in the reaction. Values of less than one parasite could be attributed to parasite DNA degradation during the digestion of the blood meal by a sand fly, or alternatively to problems with the efficiency of the DNA extraction.
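The back-calculation from a single reaction to a whole sand fly can be sketched as below. This is an illustration only, assuming one fly per extraction and that the fraction of sample DNA in the reaction equals the template volume divided by the elution volume; the function name and the volumes are hypothetical, not values from the study.

```python
def parasites_per_fly(parasites_in_reaction, template_volume_ul, elution_volume_ul):
    """Scale a per-reaction estimate up to the whole DNA extract.

    Assumes the extract came from a single sand fly and that the
    template is a representative aliquot of the eluted DNA.
    """
    if template_volume_ul <= 0 or elution_volume_ul <= 0:
        raise ValueError("volumes must be positive")
    if template_volume_ul > elution_volume_ul:
        raise ValueError("template volume cannot exceed elution volume")
    fraction_in_reaction = template_volume_ul / elution_volume_ul
    return parasites_in_reaction / fraction_in_reaction


# Hypothetical example: 2.5 parasite equivalents measured in a reaction
# that used 2 ul of a 50 ul DNA extract (1/25 of the sample).
estimate = parasites_per_fly(2.5, 2.0, 50.0)
print(round(estimate, 1))  # about 62.5 parasites per fly
```

The same scaling shows why fractional values can arise: a per-reaction estimate below the reciprocal of the dilution factor yields fewer than one parasite per fly.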
These results confirm the accuracy

Figure 12. Quantification of Leishmania in sand fly DNA samples

Note: DNA samples from 10 flies judged infected by L. (L.) infantum chagasi by conventional PCR and 10 samples judged uninfected by conventional PCR were used to quantify the number of Leishmania present in each fly.

Source: Macedo e Silva VP, Martins DRA, de Queiroz PVS, Freire CC, Gomes CEM, Pontes NNC, Monteiro GRG, Queiroz JW, Pearson RD, Weirather JL, Wilson ME, Jerônimo SMB, Ximenes MFFM; manuscript in preparation.

of qPCR as compared with conventional PCR based detection performed on the same source DNA, while also indicating that there could be wide disparities in the parasite burden carried by infected sand flies that cannot be observed with conventional PCR.

In addition to understanding the infection among the vectors, research undertaken in Northeast Brazil also aims to better understand how hosts other than humans may be acting as reservoirs for the disease. Studies conducted by Lima et al. have measured Leishmania infections among humans and canines in Natal, Rio Grande do Norte, Brazil. Of 345 human subjects tested, 85 were positive for anti-leishmania antibodies, and 15 of those subsequently tested positive for Leishmania kDNA in their serum (Table 9). These subjects were asymptomatic for VL, and the extent to which this subclinical or pre-symptomatic group could contribute to disease is an open question which qPCR will be helpful in addressing.

Table 9. Detection of Leishmania (L.) infantum chagasi in human serum

Positive for anti-leishmania antibodies: 85 of 345
Positive for anti-leishmania antibodies and qPCR positive for Leishmania kDNA: 15 of the 85 antibody-positive subjects

Source: Lima ID, Queiroz JW, Lacerda HG, Queiroz PV, Pontes NN, Barbosa JD, Martins DR, Weirather JL, Pearson RD, Wilson ME, Jeronimo SM. Leishmania infantum chagasi in northeastern Brazil; asymptomatic infection at the urban perimeter. Am J Trop Med Hyg 86(1):99-107.

This analysis of 345 asymptomatic individuals living in the endemic region in Northeast Brazil shows that a subset of individuals with anti-leishmania antibodies also have Leishmania DNA circulating in their bloodstream. Estimates for the amount of time cell-free DNA remains circulating are between 4 and 30 minutes, 151 based on measurements of

male DNA in maternal bloodstreams post-birth; therefore even small quantities of DNA detected in patient blood samples may be due to a current infection. As antibody response may outlast a current infection, the qPCR results may indicate individuals with a persistent infection. These results do not include measurements of infection based on qPCR among individuals who were antibody negative. Future work should involve testing this population to better assess the number of asymptomatically infected individuals in the population.

Detection and species determination of parasites in human specimens in Bahia, Brazil

Tegumentary Leishmania infections acquired in rural Bahia, Brazil can take on any of three forms. The most common is a cutaneous infection, an ulcerating lesion with parasites present in the edge around the wound. Cutaneous infections can, at times, lead to mucosal involvement, with a vigorous inflammatory response at mucosal sites such as the nose and pharynx. Increasingly, patients are being recognized with a third, emerging form of tegumentary leishmaniasis called disseminated leishmaniasis (DL). In this disease, disseminated acneiform and ulcerating lesions form throughout the body, distant from the site of initial acquisition. 152 Beyond a visual assessment of lesions, a diagnosis can be made by growing parasite cultures of samples taken from either aspirates or biopsies around the lesion. However, this test has low sensitivity (~30%), and parasites cannot always be cultured from samples taken from patients. In such cases, the species cannot be determined using existing methods.

Historically, both L. (L.) amazonensis and L. (V.) braziliensis have caused infections in Bahia. Therefore, we first applied our serial qPCR assay to archived laboratory samples of DNA extracted from parasite cultures grown from patient samples, to confirm the hypothesis that L. (V.) braziliensis is now exclusively causing the infections acquired and treated in the area.
Knowing the species, a TaqMan qPCR assay

specific for the L. Viannia subgenus, which includes L. (V.) braziliensis, was used to evaluate the ability of qPCR assays to detect L. (V.) braziliensis in either aspirates or biopsies taken from patients. Our data indicate that the serial qPCR assay was more effective at detecting parasites in DNA isolated from skin punch biopsies than in DNA isolated from aspirates drawn from the wound by needle (Table 10).

Table 10. Tissue biopsies and the detection of L. (V.) braziliensis in Bahia, Brazil

Biopsy and aspirate for both patients (columns: positive, negative, total, percent positive)
Biopsy, 0.1 parasite cutoff
Biopsy, 1 parasite cutoff
Aspirate, 0.1 parasite cutoff
Aspirate, 1 parasite cutoff

Note: The sensitivity of qPCR primers and a TaqMan probe targeting the kinetoplast minicircle DNA of L. braziliensis is compared between DNA extracted from biopsies and aspirates from cutaneous lesions of patients from Bahia, Brazil.

Source: AQ Silva, RS Sousa, JL Weirather, ME Wilson, P Machado, A Schriefer; manuscript in preparation.

Differences in the DNA extraction efficiency between biopsies and aspirates should be considered as an alternative explanation for the results observed in Table 10. As these results may be useful in guiding the development of future clinical protocols involving invasive procedures, improving the effectiveness of the aspirate assay could be very beneficial. Regardless, the current results indicate that lesion biopsies provide an excellent source of parasite DNA, and show that qPCR could greatly improve the speed of diagnosis, as parasite cultures require days.
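The comparison in Table 10 amounts to counting how many specimens meet a parasite cutoff. A minimal sketch follows; the per-specimen parasite estimates are invented for illustration, and treating a value equal to the cutoff as positive is an assumption of this example rather than a rule stated in the text.

```python
def percent_positive(parasite_estimates, cutoff):
    """Percentage of specimens whose qPCR estimate meets the cutoff.

    A specimen is called positive when its estimate is >= cutoff
    (an assumption of this sketch).
    """
    if not parasite_estimates:
        raise ValueError("no specimens supplied")
    positives = sum(1 for value in parasite_estimates if value >= cutoff)
    return 100.0 * positives / len(parasite_estimates)


# Hypothetical parasite estimates per specimen (not data from Table 10).
biopsy = [12.0, 0.4, 3.1, 0.05, 150.0]
aspirate = [0.2, 0.0, 1.5, 0.02, 0.6]

for cutoff in (0.1, 1.0):
    print(cutoff,
          percent_positive(biopsy, cutoff),
          percent_positive(aspirate, cutoff))
# 0.1 80.0 60.0
# 1.0 60.0 20.0
```

Raising the cutoff trades sensitivity for confidence that a positive call reflects at least one intact parasite equivalent in the specimen.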

Quantitative analyses during treatment of visceral leishmaniasis due to L. donovani in blood from subjects with visceral leishmaniasis in Bihar State, India

Visceral leishmaniasis (VL) is endemic to the state of Bihar, India, and is caused by the species L. (L.) donovani. VL is lethal without treatment, and amphotericin B is currently an effective treatment. However, some patients relapse or fail to respond to treatment, and there are currently no in vitro tests for drug-resistant organisms. 19, 20 Recovered or partially recovered patients in India are also at risk of developing post-kala-azar dermal leishmaniasis (PKDL). PKDL is a cutaneous infection occurring in 5-10% of patients in India, months or decades after a seemingly successful treatment of visceral disease. Therefore, an effective way to monitor the progress of treatment would be helpful in quickly identifying patients at risk.

Amphotericin B works by binding sterols in the parasite membrane, permitting ion flux. However, treatment is not always effective and relapses have been known to occur. A qPCR assay was used to monitor parasite load in patients receiving different preparations of amphotericin B for VL. Patients received either a more expensive liposomal amphotericin B preparation, or a preformed fat emulsion preparation. DNA was extracted from the blood of VL patients drawn during the course of treatment. Using a kDNA primer set described in our serial qPCR assay with SYBR green chemistry (Table 4), reactions carried out by M. Sudarshan showed a reduction in the amount of measured parasite DNA during the course of treatment, with more significant discrepancies in the effect of treatment observed later as parasite numbers diminished (Figure 13). Importantly, the qPCR revealed parasite loads that diminished over the time of treatment. As such, we suggest that qPCR could be used as a measure to estimate the response, or lack of response, during treatment.
This assay offers a promising way for researchers and clinicians to monitor the efficacy of treatment and

Figure 13. Quantification of L. (L.) donovani from the blood of VL patients given different preparations of amphotericin B
Note: The SYBR green primers kdn2 (Table 4) were used to compare the clearance of L. donovani from the blood of visceral leishmaniasis patients given different preparations of amphotericin B. ApL indicates a typical fat emulsion preparation of amphotericin B. L-AmB represents a liposomal preparation of amphotericin B. Source: M Sudarshan, JL Weirather, ME Wilson and S Sundar. Study of parasite kinetics with antileishmanial drugs using real-time quantitative PCR in Indian visceral leishmaniasis. J. Antimicrob. Chemother. 66(8): , 2011; Figure 2.

monitor for a relapse with a sensitive method that is less costly and less invasive than culture or microscopy from splenic aspirates.

Speciation of Leishmania spp. in fixed, stained slides of cutaneous lesions from subjects in Tegucigalpa, Honduras

Patients treated for leishmaniasis in Honduras could have been exposed to a number of Leishmania species there or in neighboring regions, including L. (V.) braziliensis, L. (L.) infantum chagasi, or L. (L.) mexicana. To further complicate species determination, L. (L.) infantum chagasi in Honduras can cause either a non-ulcerating dermal infection 153 or a classical disseminated visceral leishmaniasis. Thus all species in the region are capable of causing cutaneous infections.

We applied a set of primers from the serial qPCR assay (Table 4) to detect Leishmania and determine its species in DNA extracted from fixed tissue prepared on microscopy slides after long-term storage. We detected L. (V.) braziliensis in one sample, and either L. (L.) infantum chagasi or L. (L.) mexicana in 14 other samples (Table 11). The inability to distinguish between L. (L.) infantum chagasi and L. (L.) mexicana is most likely because the small quantities of DNA extracted from the slides do not provide adequate template for the L. (L.) infantum chagasi-specific primer set, MSP associated gene 1, which targets a template with a lower copy number than the templates targeted by the kinetoplast primer sets. These results are informative for future studies: when the quantity of parasite DNA is expected to be low, it will be advisable to use TaqMan qPCR chemistry where possible, as it does not produce as much noise from primer dimers as SYBR green, and to use multiple primer sets targeting the kinetoplast to confirm the results. Species-specific kDNA-targeting TaqMan primer sets would be especially useful in these situations where a very small amount of parasite DNA is expected to be

Table 11. Species specificity of primers used to identify the species of Leishmania from DNA samples extracted from slides
A. Capability of each primer set to detect the species
Primers L. (V.) braziliensis L. (L.) chagasi L. (L.) mexicana
kDNA 1 X X
kDNA 3 X X
kDNA 4 X X
L. braziliensis Minicircle 3 X
L. mexicana Minicircle X
MSP associated gene 1
B. Observed species
Species L. chagasi or L. mexicana 21 L. braziliensis 1 Undetermined 14 Number observed X
Note: The bottom table (B) shows the numbers of the identified species. Ambiguities between L. chagasi and L. mexicana are likely due to a lack of sensitivity of the MSP associated gene 1 primer set in extractions with low DNA quantity or, possibly but less likely given its higher copy number target, of the L. mexicana Minicircle primer set. Source: EA Gulleen, JL Weirather, ME Wilson, J Alger; Unpublished.

present. For these DNA extractions taken from fixed slides, where the amount of parasite DNA template is extremely limited, SYBR green results may be difficult to distinguish from primer dimers. Nevertheless, parasite DNA was amplified from fixed slides, and this could provide a means to confirm hypotheses or enrich our understanding of the causative species present in archived microscopy samples.

Discussion

Investigators working in the Leishmania-endemic nations of Brazil, India, and Honduras have applied the serial qPCR assay to a variety of ongoing projects. Results thus far have demonstrated that qPCR can augment ongoing studies by simplifying the methods used to quantify parasites and identify parasite species. Most field applications thus far indicate a need for TaqMan qPCR assays that allow sensitive detection even where little parasite DNA is present. The applications of such assays include determining whether or not sand flies are infected, as well as estimating the number of parasites in the sand fly. Detection of Leishmania DNA in host serum or tissues demands even more sensitive methods, as detection may depend on very few parasites or free-floating copies of DNA. We hope that wide adoption of these assays, and our involvement in their deployment, can help ensure good communication among all involved researchers, enabling them to develop protocols that follow best practices and produce high-quality results.

CHAPTER V. HOST IMMUNOGENETIC FACTORS INFLUENCING THE OUTCOME OF VL

5.1. Introduction

Visceral leishmaniasis (VL) is a potentially fatal parasitic disease of humans caused by protozoa belonging to the Leishmania donovani complex. Symptomatic VL is a severe progressive infection that is usually fatal if untreated. Despite its potential severity, 80-90% of individuals infected with the causative parasites harbor either subclinical or asymptomatic infection. 154 Protozoa causing VL include Leishmania infantum chagasi in Latin America, Leishmania infantum in regions surrounding the Mediterranean Sea, and Leishmania donovani in India and Africa. The species and strain of Leishmania, environmental factors such as sand fly vectors, and host nutritional state, age, sex and immunocompetence all contribute to the clinical outcome of infection. 92, 147, 154

Host genetics also influences the outcome of infection. Evidence for a genetic susceptibility to visceral leishmaniasis stems from two segregation analyses performed on populations in Northeast Brazil. These analyses implicated either a single major locus 39 or one or two major loci 40 contributing to susceptibility to leishmaniasis. Furthermore, a segregation analysis of asymptomatic infection in Bahia, Brazil identified a major genetic component to the transmission of infection. 42 However, because these segregation analyses may rely on uniform exposure to Leishmania infection within families, and exposure is difficult to measure, additional evidence for a genetic component to susceptibility should also be considered. Experimental infections in mouse models and studies of human populations from endemic regions have documented evidence that genetic polymorphisms are partial determinants of the outcome of murine or human infection with the Leishmania species causing VL.

VL in humans is accompanied by high titers of antibodies and a

suppressed cellular immune response to parasite antigens. The latter manifests as a negative delayed-type hypersensitivity response (DTH, also called the Montenegro or the leishmania skin test) to Leishmania antigens. Antibody titers drop to low levels after treatment, simultaneous with development of a DTH response. In contrast, individuals with prior Leishmania infection, whether asymptomatic or symptomatic VL, characteristically have a positive DTH response. 21, 149 Studies of human genetics have highlighted regions of the genome potentially associated with the development of asymptomatic or symptomatic infections. 154,

The purpose of the current study was to identify host genetic factors that influence whether a fatal or a self-curing outcome will ensue after infection with Leishmania infantum chagasi. These findings may highlight important host immune factors that contribute to the cure of, or immune protection from, leishmaniasis. The current study focuses on individuals from highly endemic neighborhoods of Northeast Brazil, in which we conducted a family study of VL and asymptomatic infection (the latter detected by the Montenegro DTH response). This allowed us to address loci associated with either disease or innate protection against developing disease. Our previously reported genome-wide microsatellite scan showed weak linkage between several chromosomal regions and susceptibility to symptomatic VL or the size of the Montenegro DTH response among asymptomatically infected individuals. 168 The current study follows up the most promising of these genomic regions. We genotyped a set of fine-mapping SNPs spanning the regions showing potential linkage with VL susceptibility or the size of the DTH response, as well as SNPs tagging 42 candidate genes. We confined our analyses to the families in our study population with the highest chance of uniform exposure of family members to sand flies.
Furthermore, we analyzed the DTH response as a qualitative response, in an attempt to treat asymptomatic infection as a single entity.

Our data suggested that several polymorphic loci are associated with symptomatic VL, and that a single polymorphic locus is associated with the presence of an asymptomatic infection (qualitative: a positive DTH response). These studies highlight mechanisms and pathways that deserve future investigation into the pathogenesis of VL in individuals harboring distinct allelic variants.

Methods

Subject entry and ascertainment

Probands enrolled in the study were diagnosed and treated for VL in public hospitals in Natal, Rio Grande do Norte, Brazil. Criteria for a diagnosis of VL were compatible clinical presentation, parasitologic diagnosis by identification of parasites on bone marrow aspirate and positive serology, and response to treatment. Subjects were encountered in hospitals. If patients with VL and any family members present consented, they were visited in their homes after discharge. During home visits, consent was obtained from all subjects and from parents or guardians of minors. Minors age signed an assent form. Control non-VL families living in the closest household with a similar family structure were also invited to enter the study. Non-VL control families lived within 500 yards of the proband household.

Data collection included family medical history, review of symptoms for each subject, and physical exam. Venous blood was collected for anti-leishmanial serology and routine blood counts. Leishmania serology was performed using both SLA and rK39 antigens as previously described. The cutoff for a positive result was determined as the mean + 3 SD of the absorbance for 30 control sera. 169 The Montenegro DTH response test for Leishmania antigen and PPD to detect tuberculosis were placed on all consenting subjects. Skin tests were read hrs after placement in a second home visit. The cutoff for a positive Montenegro was 5 mm of induration; the cutoff for a positive PPD was 10 mm.
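The serology cutoff rule described above (mean plus 3 SD of control absorbances) can be illustrated with a minimal Python sketch. The function names and the control values are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption; the source does not specify which was used.

```python
from statistics import mean, stdev


def serology_cutoff(control_absorbances, n_sd=3.0):
    """Return mean + n_sd * SD of the control absorbances.

    Assumption: the sample standard deviation is used.
    """
    return mean(control_absorbances) + n_sd * stdev(control_absorbances)


def is_seropositive(absorbance, cutoff):
    """A serum is called positive when its absorbance exceeds the cutoff."""
    return absorbance > cutoff


# Hypothetical absorbances for a handful of control sera (illustrative only;
# the study used 30 control sera).
controls = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10]
cutoff = serology_cutoff(controls)
print(f"cutoff = {cutoff:.4f}")
print(is_seropositive(0.25, cutoff), is_seropositive(0.12, cutoff))
```

With only six hypothetical controls the cutoff is unstable; a larger control panel, as in the study, gives a steadier estimate of the mean and SD.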

Phenotype determination

Both VL and a history of treated VL were verified according to the above definition from current or prior hospital records. The DTH response was categorized as unknown for VL patients in this study, since it progresses from negative to positive after recovery. 21 Individuals with an induration diameter of 5 mm or more hrs after placement of the Montenegro were considered DTH positive, a qualitative phenotype indicating an asymptomatic infection. 170 Subjects with negative serology and a DTH size of 0-5 mm, whether living in VL or non-VL control households, were considered DTH negative. Individuals with positive serology but a negative DTH response were omitted from genetic studies, since they are considered to have early infection and could develop into any other phenotype.

Numbers of subjects

A total of 1200 individuals were genotyped for the study. The phenotypes of 38 additional family members were known, but they did not contribute a DNA sample. Of the subjects, 49% were male and 51% were female. The median pedigree size for nuclear and extended families was 8 individuals. Thirteen individuals with DNA were removed as a result of quality control analyses, and 4 additional individuals no longer contributing were removed (see and Figure 15), leaving 1183 contributing to allele frequency information. Phenotype information was unavailable for a subset of subjects, usually due to a lack of a DTH response test result. The numbers of individuals with DNA available for genotyping, and with known phenotypes, are indicated in Table

Quantitative trait normalization

DTH diameter was assessed as both a qualitative trait (DTH+) and as a quantitative trait. For the quantitative trait analysis, DTH induration size measurements greater than 0 were normalized using a Box-Cox transformation with λ = 0.481, selected by a best fit to normality as assessed by the Kolmogorov-Smirnov test statistic in the

'nortest' (Tests for Normality) package for the R statistical computing software (Figure 14).

Pedigree selection by exposure

Because it is not possible to measure whether an individual has been exposed to leishmania-infected sand flies, the analyses were limited to nuclear families within which at least 50% of assessed individuals demonstrated evidence of exposure to leishmania infection, as indicated by either a history of VL or a positive DTH test. We reasoned that DTH-negative individuals in these nuclear families would have a higher chance of exposure to the parasite than individuals in families with a lower documented infection rate. The strategy was implemented in the analysis by setting the phenotypes of the less exposed families to unknown. This limitation reduced the number of VL cases considered from 196 to 145, and DTH positive individuals from 506 to 421 (Table 13), while permitting the less exposed families to still contribute genetic information to the study for calculations of genotype frequencies and linkage disequilibrium (LD) structure.

SNP selection for fine mapping

SNPs were selected to cover the regions of putative linkage in our prior study. 168 Tagging SNPs were selected from haplotype blocks, chosen using three populations (African, Native Brazilian, and European) representing admixture in the region, 171 with minor allele frequency >0.05. The median distance between the 5485 SNPs analyzed in the linkage follow-up regions was 10.2 kb. An additional 248 SNPs were selected to tag 42 candidate genes, chosen from the literature because of their relevance to leishmaniasis or related diseases. Candidate genes included cytokines, chemokines and their receptors, signal transduction molecules, collagens and transporters (Table 12). SNP selection criteria were the same as those described above. The median distance between SNPs tagging each candidate gene was 2.4 kb.
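The exposure-based pedigree selection described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`family`, `phenotype`), the phenotype labels, and the function name are hypothetical, and "assessed" is taken to mean "has a known phenotype". Genotypes are untouched, so masked families would still contribute to allele frequency and LD estimates.

```python
from collections import defaultdict

EXPOSED = {"VL", "DTH+"}  # phenotypes taken as evidence of exposure


def mask_low_exposure(subjects, min_fraction=0.5):
    """Set phenotypes to None (unknown) in nuclear families where fewer than
    min_fraction of assessed members show evidence of exposure.

    subjects: list of dicts with hypothetical keys 'id', 'family', 'phenotype'
    (phenotype is 'VL', 'DTH+', 'DTH-' or None when not assessed).
    """
    by_family = defaultdict(list)
    for s in subjects:
        by_family[s["family"]].append(s)

    result = []
    for members in by_family.values():
        assessed = [m for m in members if m["phenotype"] is not None]
        exposed = [m for m in assessed if m["phenotype"] in EXPOSED]
        keep = bool(assessed) and len(exposed) / len(assessed) >= min_fraction
        for m in members:
            result.append({**m, "phenotype": m["phenotype"] if keep else None})
    return result


# Hypothetical example: family A is 2/3 exposed (kept), family B is 1/3 (masked).
subjects = [
    {"id": 1, "family": "A", "phenotype": "VL"},
    {"id": 2, "family": "A", "phenotype": "DTH+"},
    {"id": 3, "family": "A", "phenotype": "DTH-"},
    {"id": 4, "family": "B", "phenotype": "DTH+"},
    {"id": 5, "family": "B", "phenotype": "DTH-"},
    {"id": 6, "family": "B", "phenotype": "DTH-"},
]
masked = mask_low_exposure(subjects)
print([(s["id"], s["phenotype"]) for s in masked])
```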

Figure 14. Quantitative DTH size transformation

Note: Different approaches to standardizing the DTH size measurements are compared: (A) no transformation, (B) a log2 transformation, and (C) a Box-Cox transformation. All distributions are normalized to have a mean of zero and a standard deviation of 1. Deviations from expected normal distributions are shown in Q-Q plots, and histograms show the distribution of the values. The value of λ=0.481 selected for the Box-Cox transformation provided the best fit to a normal distribution, especially for the larger DTH size measurements.
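For readers unfamiliar with the Box-Cox transformation, the sketch below applies it with the reported λ = 0.481 and then standardizes the result to mean 0 and standard deviation 1, as in the figure. It is a Python illustration rather than the R code used in the study; the induration values are hypothetical, and the selection of λ by a normality test is not reproduced here.

```python
import math
from statistics import mean, stdev


def box_cox(x, lam):
    """Box-Cox transform of a positive value: (x**lam - 1) / lam, or ln(x) when lam == 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


LAMBDA = 0.481  # value reported in the text

# Hypothetical DTH induration diameters in mm (only sizes > 0 are transformed).
dth_mm = [5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0]
transformed = [box_cox(d, LAMBDA) for d in dth_mm]
z_scores = standardize(transformed)
print([round(z, 2) for z in z_scores])
```

A power below 1 compresses the upper tail, which is consistent with the note that the transformation mainly improved the fit for the larger DTH measurements.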

120 105 Table 12. Candidate genes chosen for study of associations with VL and DTH phenotypes Gene Description Location Tagging SNPs LEPR leptin receptor chr1: FCGR2A Fc fragment of IgG, low affinity IIa, receptor (CD32) chr1: IL10 interleukin 10 chr1: SLC11A1 solute carrier family 11 (protoncoupled divalent metal ion transporters), member 1 chr2: COL4A3 collagen, type IV, alpha 3 chr2: COL4A4 collagen, type IV, alpha 4 TGFBR2 IL12A transforming growth factor, beta receptor II interleukin 12A (natural killer cell stimulatory factor 1, cytotoxic lymphocyte maturation factor 1, p35) chr3: chr3: IL21 interleukin 21 chr4: IL15 interleukin 15 chr4: CSF2 colony stimulating factor 2 (granulocyte-macrophage) IL3 interleukin 3 (colonystimulating factor, multiple) chr5: IL13 interleukin 13 chr5: IL4 interleukin 4 IL12B interleukin 12B (natural killer cell stimulatory factor 2, cytotoxic lymphocyte maturation factor 2, p40) chr5: LST1 leukocyte specific transcript 1 chr6:

121 106 Table 12. Continued. LTA LTB TNF lymphotoxin alpha (TNF superfamily, member 1) lymphotoxin beta (TNF superfamily, member 3) tumor necrosis factor IFNGR1 interferon gamma receptor 1 chr6: COL15A1 collagen, type XV, alpha 1 chr9: TGFBR1 transforming growth factor, beta receptor 1 GATA3 GATA binding protein 3 chr10: SCNN1A TNFRSF1A PTPN6 sodium channel, non-voltagegated 1 alpha subunit tumor necrosis factor receptor superfamily, member 1A protein tyrosine phosphatase, non-receptor type 6 chr12: chr12: IFNG interferon gamma chr12: SCARB1 scavenger receptor class B, member 1 chr12: SMAD6 SMAD family member 6 chr15: IL4R interleukin 4 receptor chr16: CCL2 chemokine (C-C motif) ligand 2 chr17: CCL7 chemokine (C-C motif) ligand 7 CCL5 chemokine (C-C motif) ligand 5 chr17: CCR7 chemokine (C-C motif) receptor 7 CCL18 chemokine (C-C motif) ligand 18 (pulmonary and activationregulated) CCL4 chemokine (C-C motif) ligand 4 chr17: chr17: TBX21 T-box 21 chr17:

122 107 Table 12. Continued. SMAD7 SMAD family member 7 chr18: EBI3 CD209 Epstein-Barr virus induced gene 3 dendritic cell-specific intracellular adhesion molecules (ICAM)-3 grabbing nonintegrin chr19: chr19: IL27RA interleukin 27 receptor, alpha chr19: IFNGR2 interferon gamma receptor 2 (interferon gamma transducer 1) chr21: Note: Candidate genes chosen for study of associations with VL and DTH phenotypes. The genes listed were selected as candidates for associations with a VL or DTH phenotype based on reported or proposed roles in a leishmania infection. Locations for genes or groups of nearby genes are based on HG18 coordinates. The number of tagging SNPs for that region is listed on the right. Locations and numbers of SNPs in regions encompassing more than one gene are shown in boxes in columns 3 and 4.

Genotyping methods

A custom SNP assay on an Illumina bead platform was designed by Dr. Priya Duggal (Johns Hopkins University). Sequencing reactions were performed by the Center for Inherited Disease Research (Johns Hopkins University).

Quality control (summarized in Figure 15)

SNPs from unrelated individuals were tested for deviations from Hardy-Weinberg equilibrium (HWE), assessed by the SNP-HWE Perl script written by Joshua C. Randall 172. Median p-values for HWE deviations for each SNP were determined by iterating the SNP-HWE calculation through random sets of unrelated individuals. No SNPs were included with a median p-value less than An analysis of SNPs removed for violations of HWE is not included in these results. Such SNPs may violate HWE because of allelic heterogeneity arising from the admixture of the population. The possibility of type II errors being introduced by the omission of SNPs in violation of HWE remains to be investigated. The Pedstats and Merlin software were used to identify and remove any Mendelian errors and unlikely genotypes. Individuals or SNPs with more than 2% bad calls or errors were removed from the analysis. Nuclear families with more than 5% errors were also excluded. Finally, SNPs with a MAF (minor allele frequency) < 0.05 were excluded from the analysis (Figure 15).

Corrections for multiple tests and haplotype blocks

Linkage disequilibrium (LD) blocks were identified using the default settings of the Haploview program 173. The numbers of haplotype blocks within linkage follow-up regions or among all candidate genes, respectively, were used to estimate the number of independent tests for Bonferroni multiple-test corrections. Chromosome 9 had 289 blocks or singlets, chromosome 15 had 355, chromosome 19 had 317, and the candidate

Figure 15. Quality control

Note: Beginning with 6295 autosomal SNPs in 1200 individuals, 45 SNPs were removed for lack of heterozygosity (MAF < 0.001). The remaining 6250 SNPs were tested for violations of Hardy-Weinberg equilibrium, and none were indicated by a p-value < (step 1). Error rates comprised either missing genotypes or Mendelian errors detected by the Merlin software package. Out of 1200 individuals, 7 with error rates greater than 0.02 were removed (step 2). From the remaining 1193 individuals, SNPs with per-SNP error rates greater than 0.02 were removed from the analysis, leaving 6246 SNPs (step 3). Nuclear families with error rates greater than 0.05 were also removed from the analysis. After removal of 3 nuclear families and uninformative spouses, 1183 individuals remained (step 4). Finally, 513 SNPs with a minor allele frequency less than 0.05 were removed from the analysis, leaving 5733 SNPs (step 5).
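The idea of taking a median HWE p-value over random sets of unrelated individuals can be sketched as below. This is not the SNP-HWE script: for brevity it substitutes a one-degree-of-freedom chi-square goodness-of-fit test for an exact test, and it assumes that a "random set of unrelated individuals" is formed by drawing one member per family. The data layout and function names are hypothetical.

```python
import math
import random
from statistics import median


def hwe_chisq_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions.

    Swapped-in technique: a chi-square test stands in for an exact test.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or p == 1.0:  # monomorphic: nothing to test
        return 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def median_hwe_p(genotypes_by_family, n_iter=100, seed=0):
    """Median HWE p-value over random draws of one individual per family.

    genotypes_by_family: dict mapping family id to a list of genotypes,
    each coded as the number of B alleles (0, 1 or 2).
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_iter):
        counts = [0, 0, 0]
        for members in genotypes_by_family.values():
            counts[rng.choice(members)] += 1
        p_values.append(hwe_chisq_p(*counts))
    return median(p_values)


# Hypothetical single-member families in exact HWE proportions (25:50:25).
families = {i: [g] for i, g in enumerate([0] * 25 + [1] * 50 + [2] * 25)}
print(median_hwe_p(families))
```

Using the median across draws limits the influence of any one unlucky subset of individuals on the decision to retain or drop a SNP.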

genes had 46. Additionally, significance thresholds were calculated using a set of 1000 simulated populations generated with the Merlin software simulation feature. Simulated data sets maintain the same allele frequencies and missing data points as the original study population.

Linkage and association analyses

For the purpose of measuring linkage (not the family-based association tests that are the focus of the presented results), SNPs were first pruned to eliminate SNPs with R² > 0.2 using Plink SNP pruning 174, and then SNPs with the lowest minor allele frequency were iteratively removed from among SNPs with a pairwise D′ > 0.2 calculated in Haploview. Pruning reduced the number of SNPs included in linkage analyses from 5733 to 435. Linkage was assessed across the pruned SNPs using the Merlin software package for qualitative [ALL] and quantitative traits [VC and QTL]. The median distance between pruned SNPs covering the linkage follow-up region was 107 kb.

The population in Northeast Brazil comprises individuals with a mix of European, African, and Native Brazilian ancestry. 171 Therefore, family-based tests of association robust against population stratification were used in the analysis. 175, 176 Association tests for qualitative traits (VL or DTH positive results) were conducted on all 5733 SNPs with the LAMP (Linkage and Association Modeling in Pedigrees) 175 software package with the [--ignore-linkage] option to measure associations independently for each SNP in a computationally efficient manner. A prevalence of 0.7 was assigned for calculations of DTH positive associations, and 0.5 for VL associations. For the quantitative trait DTH size, associations were measured using the Merlin software 176 association test [ASSOC]. Plots of associations were generated with the Locuszoom software package 177.
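The block-based ("adjusted") Bonferroni correction can be made concrete with a short sketch. It assumes the correction multiplies the uncorrected p-value by the number of haplotype blocks or singlets in the region, capped at 1; that reading is an assumption, though it reproduces the corrected values reported in the Results for chromosome 9 (5.9e-05 × 289 ≈ 0.017) and chromosome 19 (1.4e-05 × 317 ≈ 4.44e-03). Function and dictionary names are illustrative.

```python
# Numbers of haplotype blocks or singlets reported in the text.
BLOCKS = {"chr9": 289, "chr15": 355, "chr19": 317, "candidate_genes": 46}


def adjusted_bonferroni(p_uncorrected, n_blocks):
    """Multiply the raw p-value by the number of independent tests, capped at 1."""
    return min(1.0, p_uncorrected * n_blocks)


def significance_threshold(n_blocks, alpha=0.05):
    """Per-test threshold equivalent to a region-wide alpha."""
    return alpha / n_blocks


for region, n in BLOCKS.items():
    print(f"{region}: threshold = {significance_threshold(n):.2e}")

# Uncorrected p-values taken from the Results text.
p_chr9 = adjusted_bonferroni(5.9e-05, BLOCKS["chr9"])
p_chr19 = adjusted_bonferroni(1.4e-05, BLOCKS["chr19"])
print(f"chr9 corrected p = {p_chr9:.3f}, chr19 corrected p = {p_chr19:.2e}")
```

Counting blocks rather than individual SNPs gives a less conservative correction than a plain Bonferroni over all 5733 markers, because SNPs in strong LD are not independent tests.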

Results

Population characteristics

Characteristics of subjects in the total population, the subjects who donated DNA and were genotyped, and the subset of genotyped subjects living in households in which at least 50% of the members were infected with Leishmania spp. are delineated in Table 13. Importantly, the average ages and sex ratios did not differ significantly between all study participants for whom phenotype information was available and subjects in the highly exposed nuclear families with at least 50% of members showing evidence of exposure (age: T-test p=0.99; sex: p=0.636, Chi Squared Test). The latter highly exposed subjects were used for the analyses in this study. As expected, we observed differences between the age and gender of VL subjects compared with other phenotypes, in both the total population and the highly exposed sub-population. The average age of VL subjects was significantly lower than that of the total population, and a significantly greater proportion of VL subjects were male than in the total population. Although the difference was not as great as for VL, the average age of individuals with a positive DTH response was significantly higher than that of the total population, and we did not observe a sex bias. Similar observations were made within highly exposed nuclear families as in the overall population, although the difference only reached significance when comparing subject ages.

Family-based association tests

Using Haploview to analyze genotypes of all subjects, we were able to determine haplotype blocks in our population. This enabled us to prune SNPs and include only the most polymorphic markers from each haplotype block in linkage analysis. Regions on chromosomes 9, 15 and 19 that had shown potential linkage with putative loci determining risk for VL or DTH response size, according to our previous genome-wide microsatellite scan, were genotyped by fine mapping SNPs spanning those

128 Table 13. Fine-mapping population characteristics Subjects with DNA and/or phenotype total Male (%) Age (mean, sd)* Genotyped Male (%) Total Age (mean, sd)* Genotyped, 50% exposed families Male (%) Age (mean, sd)* Genotyped* , , 19.4 VL , *** 11.5, 13.4# ,12.4*** non-vl , , , 19.4 DTH , ,19.3## ,19.1## DTH , , , 18.6 phenotype known , , 19.8 phenotype/ genotype known , , 19.5 Note: Numbers of total subjects are listed, as well as subjects for whom the phenotype and/or genotypes were known in all, or in presumed highly exposed families. Highly exposed families were those in which 50% of individuals had proven exposure to Leishmania (DTH+ or VL). These families were considered for the association and linkage analyses. All genotyped subjects contributed to calculations of allele frequency. * Age calculations based on 1147 of 1183 subjects - Redundant data is omitted. Significantly different from the total population: *** p<1.0e-10 (T-test); #p=0.002 (Chi squared); ## p<5.0e-07 (T-test)

regions (illustrated at the top of Figures 16-18). We chose to evaluate potential risk alleles only in members of families in which the risk of exposure to the parasite is high, based on a high rate of infection (VL or a positive DTH response) in nuclear family members. An adjusted Bonferroni correction was applied to correct for multiple tests. For loci that neared significance, an additional calculation was performed using simulations based on the allele frequencies in the original population to add stringency to the tests of significance. The adjusted Bonferroni correction consistently yielded higher significance than simulations. A list of the best associations between each phenotype and SNP allele frequencies in linkage follow-up regions or candidate genes is available in Table 14. The table includes SNPs with an uncorrected p value less than 0.01 in candidate genes, an uncorrected p value less than in linkage follow-up regions, or the top five SNPs from each trait and region. Detailed descriptions of allele frequencies and the observed penetrance values for the markers most highly associated with each phenotype are shown in Table

Chromosome 9

On chromosome 9, the data suggested that SNP rs is associated with a VL outcome (p=5.9e-05 UN [Uncorrected], p=0.017 AB [Adjusted Bonferroni correction for multiple tests], p=0.089 ES [Empirical Simulation used to correct for multiple tests]) (Figure 16). This SNP is situated in an intergenic region between TMEM215 (Transmembrane protein 215) and APTX (Aprataxin) (Figure 19A). The association held up to adjusted Bonferroni correction for multiple testing. The data showed that the 'A' major allele was significantly associated with VL, and/or being homozygous for the 'G' minor allele was associated with the absence of VL. No significant associations were detected between either the qualitative or the quantitative DTH phenotypes and any marker on

130 115 Table 14. Most significant results of association testing between three phenotypes and SNPs in linkage follow-up regions and candidate genes Candidate genes SNP LOD Pvalue hap_blocks correctedpval location relative VL phenotype rs TGFBR2 upstream rs TGFBR2 intron rs FCGR2A upstream rs IL27RA intron rs SLC11A1 intron rs COL4A4 intron DTH qualitative phenotype rs FCGR2A upstream rs SMAD7 intron rs CCL2 intron rs SMAD7 intron rs IL4R intron DTH size phenotype rs IL12A intron rs IL4R intron rs CCL5 intron rs TNFRSF1A intron rs LEPR intron Chromosome 9 SNP LOD Pvalue hap_blocks correctedpval location relative VL phenotype rs TMEM215 APTX intergenic rs TMEM2 upstream rs ATP8B5P intron rs UNC13B intron rs UNC13B intron

131 116 Table 14. Continued. rs PIP5K1B PRKACG intergenic rs TMEM2 upstream DTH qualitative phenotype rs TUSC1 LOC intergenic rs MOB3B intron rs LINGO2 upstream rs CD72 intron rs KIAA1045 3' utr DTH size phenotype rs C9orf11 intron rs TJP2 intron rs TRPM3 TMEM2 intergenic rs TRPM3 TMEM2 intergenic rs TJP2 intron Chromosome 15 SNP LOD Pvalue hap_blocks correctedpval location relative VL phenotype rs C15orf60 intron rs HCN4 intron rs SLCO3A1 upstream rs KIAA1199 intron rs LOC upstream DTH qualitative phenotype rs MEX3B EFTUD1 intron rs MAN2A2 intron rs SCAPER intron rs TMC3B MEX3B intergenic rs THSD4 intron

132 117 Table 14. Continued. DTH size phenotype rs ARIH1 intron rs TMEM202 ARIH1 intergenic rs ARIH1 GOLGA6B intergenic rs HEXA intron rs HEXA 5' utr Chromosome 19 SNP LOD Pvalue hap_blocks correctedpval location relative VL phenotype rs LTBP4 intron rs LTBP4 intron rs LTBP4 intron rs LTBP4 intron rs PSG5 intron rs PSG11 downstream rs LTBP4 intron DTH qualitative phenotype rs MAP3K10 intron rs BAX intron rs ZNF229 upstream rs FCGBP intron rs MAP3K10 exon DTH size phenotype rs KIAA0355 intron rs LIN7B upstream rs PLA2GC4 intron rs NUCB1 intron rs SNRNP70 intron rs PPFIA3 intron Note: Candidate genes or linkage follow-up regions on the indicated chromosomes were studied for associations with the 3 phenotypes. VL and DTH positive, both qualitative traits, were analyzed using LAMP. The ASSOC test from Merlin software package was used for association testing of DTH size, a quantitative trait. Hap blocks

133 118 Table 14. Continued. indicates the number of haplotype blocks in either linkage follow-up region of the indicated chromosome, or in all candidate genes. Haplotype blocks were used for the adjusted Bonferroni correction. LOD and P value indicate raw outputs from the program. Corrected P value shows the adjusted Bonferroni correction. Locations of SNPs and relative position of the SNP were based on proximity in the UCSC hg18 version of the human genome 38, 39.

134 Table 15. Locations and allele frequencies for SNPs most highly associated with VL, DTH+ and DTH size phenotypes Location Phen Alleles (Linkage follow-up fine mapping) rs Chr 9 Genotype frequencies Total (Affected) per genotype Penetrance per genotype Association 115kb downstream of TMEM215 VL A = GG = 0.13 GG = 107 (13) GG = p un = kb downstream of G = AG = 0.45 AG = 299 (78) AG = p ab = APTX intergenic AA = 0.42 AA = 259 (48) AA = p es = rs LTBP4 VL C = CC = 0.34 CC = 254 (48) CC = p un = Chr 19 intronic T = CT = 0.5 CT = 331 (63) CT = p ab = TT = 0.16 TT = 80 (28) TT = p es = rs kb upstream of FCGR2A DTH G = GG = 0.43 GG = 190 (156) GG = p un = Chr 1 A = GA = 0.46 GA = 253 (186) GA = p ab = AA = 0.11 AA = 64 (62) AA = p es = rs kb upstream of TGFBR2 VL A = AA = 0.77 AA = 525 (102) AA = p un = Chr 3 G = AG = 0.22 AG = 133 (34) AG = p ab = GG = GG = 6 (3) GG = p es =

135 Table 15. Continued. Location Pheno Alleles (Candidate genes) Genotype frequencies rs ARIH1 DTH A = GG = GG = 8 Asymptomatics per genotype Chr 15 intronic size G = AG = 0.22 AG = 89 AA = 0.77 AA = 307 rs KIAA0355 DTH A = GG = 0.02 GG = 3 Chr 19 intronic size G = AG = 0.19 AG = 85 AA = 0.79 AA = 316 Mean ± Standard deviation induration diameter (mm) per genotype GG = 15.5±8.1 mm AG = 17.0±8.2 mm AA = 13.5±7.2 mm GG = 20.7±7.4 mm AG = 16.9±8.6 mm AA = 13.5±7.1 mm Association p un = p ab = 0.11 p es = 0.48 p un = p ab = p es = 0.24 Note: The chromosomal and relative location of SNPs associated with the phenotype listed under the Pheno column are displayed, along with alleles and their frequencies. Also the penetrance of the phenotype for each genotype is shown to provide insight into the possible effect of each allele. The three p-values displayed under the Association column are p un for uncorrected, p ab for adjusted Bonferroni corrected, and p es for empirical simulation corrected
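As a reading aid for the "Total (Affected)" and "Penetrance per genotype" columns of Table 15, the sketch below computes an observed penetrance per genotype. It assumes penetrance here simply means affected divided by total subjects with that genotype; the source does not spell out the formula. The counts are the chromosome 9 SNP entries as they appear in Table 15 (GG = 107 (13), AG = 299 (78), AA = 259 (48)); the function name is illustrative.

```python
def observed_penetrance(counts):
    """Return affected / total for each genotype.

    counts: dict mapping genotype to a (total, affected) tuple.
    Assumption: penetrance is the simple proportion of affected subjects.
    """
    return {
        genotype: (affected / total if total else float("nan"))
        for genotype, (total, affected) in counts.items()
    }


# Total (affected) counts for the chromosome 9 SNP, VL phenotype, from Table 15.
chr9_vl = {"GG": (107, 13), "AG": (299, 78), "AA": (259, 48)}
penetrance = observed_penetrance(chr9_vl)
for genotype, value in penetrance.items():
    print(f"{genotype}: {value:.3f}")
```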

Figure 16. Results of genotyping in chromosome 9 linkage follow-up regions Note: Linkage follow-up regions on chromosome 9 are indicated by the red boxes on the chromosome illustrations. The distribution of plotted SNPs across the region of interest is shown under the illustration of each chromosome. Three panels for each chromosome show the p-values for associations of SNPs with the VL, DTH positive (LAMP analysis), and DTH size traits (Merlin association test). The large blue points represent the strongest association for that trait within the region of interest. The LD structure between the SNP with the best p-value and surrounding SNPs is indicated by a color gradient on other large points, with white being linked, and red being unlinked. The fine dotted line represents an adjusted Bonferroni threshold of significance based on a p-value of 0.05 divided by the number of haplotype blocks and single SNPs as determined by Haploview within the linkage follow-up region. Dashed lines mark a p-value threshold of 0.05 based on 1000 empirical simulations of the study population.

chromosome 9. This was expected, as the chromosome 9 region was selected because of putative linkage to the VL phenotype.

Chromosome 15

As part of the fine mapping of chromosome 15, the first intron of ARIH1 (Drosophila homolog of Ariadne) (Figure 17) was genotyped for several SNPs in LD with one another. The uncorrected association test for SNP rs suggested an association with the DTH size phenotype (p=2.97e-04 UN). However, this did not hold up well to multiple test corrections (p= AB, p=0.475 ES). The SNP rs on chromosome 15 associated with the VL phenotype (uncorrected p-value of 2.7e-04) lies 1Mb more distal from the centromere and is not in LD with SNP rs .

Chromosome 19

The most significant association among the linkage follow-up regions was on chromosome 19 between the SNP rs , and nearby SNPs, with the VL phenotype (p=1.4e-05 UN, p=4.44e-03 AB, p=0.022 ES) (Figure 18, Table 14 and Table 15). This association held up to multiple testing according to both simulation and adjusted Bonferroni methods. A higher frequency of the 'T' allele, and a correspondingly lower frequency of the 'C' allele, in subjects with VL could indicate either an association of the 'T' minor allele with developing VL, or an association of the 'C' allele with resistance to VL. This SNP is located within an intron of LTBP4 (Latent transforming growth factor beta-binding protein 4) (Figure 19). There was not a significant association of these SNPs with the DTH+ phenotype. The SNP rs on chromosome 19 associated with the DTH positive phenotype (Figure 18) is located 426kb upstream of rs , more proximal to the centromere, and is not in LD with rs . The association did not hold up to multiple test corrections (p=2.1e-04 UN, p=0.067 AB).
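The two corrections quoted throughout this section (AB and ES) can be illustrated with a short sketch. It assumes that the adjusted Bonferroni value is the uncorrected p-value multiplied by the number of haplotype blocks and single SNPs, and that the empirical value is the fraction of null simulations whose best p-value is at least as small as the observed one. The function names are illustrative, and the count of 46 tests is an assumption inferred from the ratio of the reported candidate-gene values (3.864e-04 / 8.4e-06), not a number stated in the text.

```python
def adjusted_bonferroni(p_uncorrected, n_blocks_and_single_snps):
    """Scale a raw p-value by the number of effectively independent tests (capped at 1)."""
    return min(1.0, p_uncorrected * n_blocks_and_single_snps)


def bonferroni_threshold(n_blocks_and_single_snps, alpha=0.05):
    """Significance threshold drawn as the fine dotted line in the figures."""
    return alpha / n_blocks_and_single_snps


def empirical_p(p_observed, null_best_p_values):
    """Fraction of simulated null datasets with a best p-value <= the observed one."""
    hits = sum(1 for p in null_best_p_values if p <= p_observed)
    return hits / len(null_best_p_values)


if __name__ == "__main__":
    n_tests = 46  # assumed, see the lead-in
    print(f"{adjusted_bonferroni(8.4e-06, n_tests):.3e}")  # 3.864e-04
    print(empirical_p(0.01, [0.5, 0.005, 0.2, 0.02]))      # 0.25
```

With 1000 simulations, as used for the dashed lines in the figures, an empirical value of 0.022 would correspond to 22 simulated datasets matching or beating the observed association under this counting scheme.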

Figure 17. Results of genotyping in chromosome 15 linkage follow-up regions Note: Linkage follow-up regions on chromosome 15 are indicated by the red boxes on the chromosome illustrations. The distribution of plotted SNPs across the region of interest is shown under the illustration of each chromosome. Three panels for each chromosome show the p-values for associations of SNPs with the VL, DTH positive (LAMP analysis), and DTH size traits (Merlin association test). The large blue points represent the strongest association for that trait within the region of interest. The LD structure between the SNP with the best p-value and surrounding SNPs is indicated by a color gradient on other large points, with white being linked, and red being unlinked. The fine dotted line represents an adjusted Bonferroni threshold of significance based on a p-value of 0.05 divided by the number of haplotype blocks and single SNPs as determined by Haploview within the linkage follow-up region. Dashed lines mark a p-value threshold of 0.05 based on 1000 empirical simulations of the study population.

Figure 18. Results of genotyping in chromosome 19 linkage follow-up regions Note: Linkage follow-up regions on chromosome 19 are indicated by the red boxes on the chromosome illustrations. The distribution of plotted SNPs across the region of interest is shown under the illustration of each chromosome. Three panels for each chromosome show the p-values for associations of SNPs with the VL, DTH positive (LAMP analysis), and DTH size traits (Merlin association test). The large blue points represent the strongest association for that trait within the region of interest. The LD structure between the SNP with the best p-value and surrounding SNPs is indicated by a color gradient on other large points, with white being linked, and red being unlinked. The fine dotted line represents an adjusted Bonferroni threshold of significance based on a p-value of 0.05 divided by the number of haplotype blocks and single SNPs as determined by Haploview within the linkage follow-up region. Dashed lines mark a p-value threshold of 0.05 based on 1000 empirical simulations of the study population.

Figure 19. Regions of highest association with VL in linkage follow-up regions Note: Associations between the SNPs on chromosomes 9 (A) and 19 (B) are shown in detail. On chromosome 9, rs is indicated as the blue plot point. The distribution of SNPs covering 350kb is shown above. On chromosome 19, rs is shown with 200kb around the region. The blue points represent the strongest association for that trait within the region of interest. The LD structure between the SNP with the best p-value and surrounding SNPs is indicated by a color gradient on flanking SNPs, with white being linked, and red being unlinked. The fine dotted line represents an adjusted Bonferroni threshold of significance based on a p-value of 0.05 divided by the number of haplotype blocks and single SNPs as determined by Haploview within the overall linkage follow-up region. Dashed lines mark a p-value threshold of 0.05 based on 1000 empirical simulations of the study population.

Candidate genes

Potential associations were also tested between tagging SNPs covering 42 candidate genes and the VL and both DTH phenotypes (Figures 20-22). The SNP rs on chromosome 1 upstream of FCGR2A (Low affinity immunoglobulin gamma Fc region receptor II-a) was associated with a positive DTH response (p=8.4e-06 UN, p=3.864e-04 AB, p=0.002 ES) (Figure 23). This value was significant when corrected for multiple tests by both methods. The DTH+ phenotype was present in most individuals homozygous for the 'A' minor allele of rs (Table 15). In addition to being the best association with the DTH+ phenotype among candidates, this SNP also has an uncorrected p-value of for its association with the VL phenotype, making it the second best candidate gene association with the VL phenotype following TGFBR2 (TGF-beta receptor type 2) (Table 14). The best association between a VL outcome and a candidate gene was observed for the SNP rs upstream of TGFBR2 (p=5.7e-04 UN, p= AB, p= ES) (Figure 23).

Discussion

The contribution of host genetics to the outcome of Leishmania infections has been well documented in mouse models. Susceptibility to infection with Leishmania donovani, Salmonella typhimurium and Mycobacterium BCG was mapped to the same locus on murine chromosome 1, termed the Lsh, Ity and Bcg gene. 155 This susceptibility gene was positionally cloned and found to encode a trans-membrane protein, initially called Nramp1 but later renamed SLC11A1. Mouse strains can be characterized as Bcg s or Bcg r based on a single mutation in an SLC11A1 trans-membrane region. These observations provided an impetus for the search for susceptibility genes influencing the outcome of human leishmaniasis. The hypothesis that human genetic variants also influence susceptibility to both VL and a positive DTH response indicative of an asymptomatic infection was supported

Figure 20. Associations between SNPs in candidate genes and the symptomatic VL phenotype Note: Associations for candidate genes are shown for VL. The dotted line represents a significance threshold based on an adjusted Bonferroni test, as in Figure 19. The blue points represent the strongest association for the trait within the candidate gene. As indicated in the color key, the LD structure between the most closely associated SNP and other flanking SNPs is indicated by a color gradient on flanking SNPs, with white being linked, and red being unlinked.

Figure 21. Associations between SNPs in candidate genes and the asymptomatic DTH response phenotype (qualitative) Note: Associations for candidate genes are shown for the DTH qualitative phenotype. The dotted line represents a significance threshold based on an adjusted Bonferroni test, as in Figure 19. The blue points represent the strongest association for the trait within the candidate gene. As indicated in the color key, the LD structure between the most closely associated SNP and other flanking SNPs is indicated by a color gradient on flanking SNPs, with white being linked, and red being unlinked.

Figure 22. Associations between SNPs in candidate genes and the asymptomatic DTH response size phenotype (quantitative) Note: Associations for candidate genes are shown for the DTH size. The dotted line represents a significance threshold based on an adjusted Bonferroni test, as in Figure 19. The blue points represent the strongest association for the trait within the candidate gene. As indicated in the color key, the LD structure between the most closely associated SNP and other flanking SNPs is indicated by a color gradient on flanking SNPs, with white being linked, and red being unlinked.

Figure 23. Most highly associated SNPs in candidate genes Note: Associations between the DTH positive phenotype and the rs , upstream of FCGR2A (A), and between the VL phenotype and rs upstream of TGFBR2 (B) are plotted along with tagging SNPs covering the candidate gene. The blue points represent the strongest association for the trait within the candidate gene. The LD structure between the most closely associated SNP and other flanking SNPs is indicated by a color gradient on flanking SNPs, with white being linked, and red being unlinked. The fine dotted line represents an adjusted Bonferroni threshold of significance based on a p-value of 0.05 divided by the number of haplotype blocks and single SNPs as determined by Haploview within the overall linkage follow-up region. Dashed lines mark a p-value threshold of 0.05 based on 1000 empirical simulations of the study population.

by segregation analyses in Brazilian populations. 42, 178 Efforts to identify the specific genes conferring susceptibility have inspired both microsatellite-based genome scans and studies of candidate genes. Reports have documented both individual genes and genetic regions associated with either VL or a persistent dermal complication of L. donovani infection called post-kala azar dermal leishmaniasis (PKDL). SLC11A1 is only one of these putative susceptibility loci. 154, 156, 161, 162

The current study constitutes a fine mapping examination of regions on chromosomes 9, 15 and 19 that we previously reported were weakly linked to symptomatic or asymptomatic Leishmania infantum chagasi infection in subjects residing in the vicinity of Natal, Northeast Brazil, where the parasite is endemic. 168 Three definable phenotypes were examined: (1) symptomatic VL, a dramatic and severe illness resembling leukemia; (2) a positive DTH skin test response to leishmania antigen in the absence of symptoms of disease; and (3) the size of the DTH skin test response (induration size). Both (2) and (3) are measures that indicate a protective type 1 cellular immune response has developed. In the absence of clinical disease, a positive DTH response is indicative of asymptomatic infection. 21, 38, 168

Regions of potential linkage were investigated by fine mapping with a dense set of SNPs covering the region, chosen with our best estimate of haplotype blocks based on studies of admixture. 171 SNPs in candidate genes of interest were also studied. Because the two sets of markers were chosen according to different criteria, they were analyzed separately. A pruned set of these SNPs used for linkage analyses did not reveal any additional significant regions of linkage within the follow-up regions. However, three markers were identified that were significantly associated with symptomatic VL after adjustment for multiple testing with the Bonferroni correction. 
The most robust was a group of SNPs on chromosome 19 in an intron of the LTBP4 gene. The most significant of these markers (rs ) remained significant after both the Bonferroni correction and after simulation, the more stringent of the corrections for multiple testing. Also

significant after Bonferroni correction were a SNP in an intergenic region on chromosome 9 between TMEM215 and APTX, and SNPs in the promoter region of TGFBR2 on chromosome 3. Chromosome 9 was the only chromosome implicated with potential linkage to the VL phenotype in our published study. 168 The potential association with rs in the intergenic region between TMEM215 and APTX is a result of fine mapping in the region of this putative linkage. The APTX gene encodes aprataxin, a DNA-binding protein involved in single strand DNA repair that is mutated in hereditary ataxia-telangiectasia. 182, 183

Two regions encoding proteins that influence TGF-β activity were associated with symptomatic VL. The first comprised SNPs on chromosome 19 in an intron of the LTBP4 gene. Similar to LTBP1 and LTBP3, LTBP4 binds the inactive TGF-β complexed with latency associated peptide (LAP) both intracellularly and upon release from cells. The LTBP4 complex remains in extracellular tissues until activated via a number of mechanisms that alter its physicochemical characteristics, causing it to release its cargo. Mutations in LTBP4 that render it unable or poorly able to bind LAP likely affect its capacity as a storage molecule for this cytokine. 184, 185 The second is a SNP that lies upstream of the gene encoding the TGF-β receptor subunit 2, which was also significantly associated with the symptomatic VL phenotype. TGF-β is important in visceral leishmaniasis both as a suppressor of T cell responses 189 and potentially as an activator of TH17 development. As such, the availability of extracellular TGF-β stores, and the efficiency with which the active form of the cytokine is retained or released, could greatly influence disease outcome. These data raise hypotheses about the potential impact of this pathway on the development of symptomatic versus asymptomatic Leishmania infantum chagasi (L. i. chagasi) infections in exposed individuals.

The one genotype associated with the DTH qualitative trait that remained significant after all corrections for multiple testing was rs , which lies upstream of the FCGR2A gene on chromosome 1. Variants in rs upstream of FCGR2A are reported to be associated with an increased risk of ulcerative colitis 190 and systemic lupus 191. In addition to autoimmune disorders, variants in FCGR2A have been associated with variable responses to infectious diseases, including an increased risk of chronic Pseudomonas aeruginosa infection among CF patients, 192 and protection from or risk of developing severe pneumococcal and viral pneumonia. 193, 194 Furthermore, the FcγR has been found to mediate IgG-dependent uptake of L. major amastigotes, triggering IL-10 release and consequently suppressing parasite clearance. 196 Ligation of FcγR also mediates enhanced survival of L. i. chagasi in macrophages of infected Brazilian subjects. 195, 197 DTH+ individuals could potentially harbor persistent parasites without disease. This association therefore leads to logical hypotheses regarding the importance of IgG and FcγR ligation in modifying the microbicidal responses to L. i. chagasi.

After infection with Leishmania infantum chagasi, most individuals in Northeast Brazil will go on to develop an asymptomatic infection. In contrast, fewer than 1 in 6 subjects will develop symptomatic VL requiring treatment 21, 198, 199. The immunogenetic factors influencing this dichotomous outcome provide an opportunity to learn more about the underlying immune mechanisms helpful for successful management of a leishmania infection. Given the transient and migratory pattern of VL risk, individuals who are infected with L. i. chagasi but develop a self-curing outcome are difficult to locate. 21 In the current study we chose to examine families in an endemic area of Northeast Brazil in which there was a high likelihood of exposure to leishmania-infected sand flies, assessed by the high rate of skin test and disease positivity in immediate household members. Although this underlying assumption is undoubtedly imperfect, the exclusion of low DTH-rate households likely helped us pinpoint families with a spectrum of infection outcomes. We examined DTH+ versus DTH- as both a qualitative and a quantitative

trait, based on the assumption that exposure is an all-or-nothing phenomenon, whereas DTH size includes measures of the strength of the systemic and dermal immune responses. The data highlighted a role for TGF-β signaling in influencing the outcome of infection. Studies also underscored the importance of the FcγR in protection from disease. Understanding the roles of polymorphic alleles in these regions, and the physiologic roles of these pathways in the host response to L. i. chagasi, may have future implications for the development of immune strategies to interrupt infection.

CHAPTER VI. DISCUSSION

6.1. A preface to the discussion

The completed human and parasite genomes have improved our understanding of the breadth and role of genetic diversity in the human diseases African trypanosomiasis and leishmaniasis. The studies undertaken here have involved genome and multi-genome data from both parasites and hosts, ranging from a description of a large gene family within the genome of one organism, namely the Trypanosoma brucei VSG repertoire, to the development of a molecular assay for the detection, species determination and quantification of Leishmania species, and finally to assessing the degree to which polymorphisms within the human genomes of a population in Northeast Brazil influence the outcome of visceral leishmaniasis infections. Although seemingly diverse, the products of these genomes interact to create a human disease, and thus each is a critically important determinant of disease. As a part of the studies of leishmania genome diversity, we have also included the development of a molecular assay for detecting leishmania DNA and determining the species in clinical or environmental samples. This approach not only exploits the interspecies differences between leishmania genomes, but also provides a tool that will be useful to take forward in refining the definitions of clinical disease phenotypes.

The scope of these studies was made possible only by the nascent age of genomics. I hope to underscore in this discussion the lessons that can be learned from the array of genomic information that has been collected, and their potential applications to the study of human diseases caused by the Trypanosomatid protozoa. I hope to communicate the idea that the large and growing databases collected by scientists in this field should be considered a boon to investigators, refining our understanding of

disease. As annotations and methods for interpreting data continuously improve, these databases are becoming more effective.

qPCR assay applications

The purpose of developing the qPCR assay described in Chapter III and applied in Chapter IV was to learn more about the interactions between the parasite, the host, and their surrounding environment. These tools provide a means to better understand the niche occupied by Leishmania spp. parasites. This test can be used to identify infected sand flies, which can lead to timely identification of human or animal populations serving as reservoirs for the disease, toward which treatment or vector control can be efficiently targeted. As hosts and parasites are both constantly migrating and adapting to their changing environments, the ability to identify the infecting species of Leishmania allows researchers to monitor changes in parasite populations and allows clinicians to choose the most appropriate treatment.

We used both similarities and differences between the genetic sequences of different Leishmania spp. to screen for primer sets informative both for sensitive detection of the parasite and for species determination. We designed primers targeting both genomic sequences within the parasite and the many-copy kinetoplast minicircle sequences. Both genomic and kinetoplast targets are useful, as we have found that kinetoplast copy number can vary somewhat between isolates and species. A practical way to overcome this shortcoming is to use genomic primers when the parasite population being quantified is sufficiently dense, and to fall back on kinetoplast DNA primers when the amount of parasite template DNA is very small.

The future directions of the serial qPCR assay include applications by collaborating labs in endemic regions, many of which were mentioned in Chapter IV. We have tried to make this set of qPCR primers one that could be utilized by leishmania researchers working in many different parts of the world, and we hope it can help

contribute to standardizing ways of quantifying parasite loads, and create a more efficient means of species determination. Nevertheless, better primer sets and simpler formulations of the serial assay, requiring fewer tests or being more specific to strains or species of interest, are bound to be created. Therefore it would benefit the community to revisit this question in the near future, in order to compare the performance of primer sets in use against a large panel of species and strains, and ideally to share these results in a web-based format where investigators can design a serial assay specific to their lab's preferences for qPCR chemistries and species specificity requirements. Furthermore, simpler applications could be developed that would be compatible with remote sites. In particular the LAMP (Loop-mediated isothermal amplification) 150 method seems promising for implementation in a low technology setting. 200, 201

With an estimated 350 million individuals at risk for Leishmania infection 12, the extent to which asymptomatically infected individuals may harbor infections, and therefore their risk of developing future infections or transmitting disease, requires further exploration. The qPCR assays described here are being applied to better measure and understand this population. Quantification is especially important as parasite loads in some individuals may be much higher than others, and the extent to which parasite loads vary within asymptomatic individuals remains to be explored. Understanding infection among the healthy population in endemic regions will help better identify individuals at risk for developing more serious disease, and help protect otherwise uninfected individuals from exposure through the blood supply.
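The strategy described above, of quantifying with genomic primers when template is plentiful and falling back on kinetoplast primers when it is scarce, can be sketched as follows. This is only an illustration under stated assumptions: quantities are read from a conventional linear standard curve of Ct against log10 template amount, and the Ct cutoff of 35 is a hypothetical value, not one taken from the assay. All function names are likewise hypothetical.

```python
def fit_standard_curve(log10_quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    n = len(ct_values)
    mean_x = sum(log10_quantities) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_quantities)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_quantities, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate template quantity."""
    return 10 ** ((ct - intercept) / slope)


def quantify(ct_genomic, ct_kinetoplast, genomic_curve, kinetoplast_curve,
             ct_cutoff=35.0):  # hypothetical cutoff for a reliable genomic signal
    """Prefer the genomic target; fall back on kinetoplast DNA for sparse template."""
    if ct_genomic is not None and ct_genomic <= ct_cutoff:
        return "genomic", quantity_from_ct(ct_genomic, *genomic_curve)
    if ct_kinetoplast is not None:
        return "kinetoplast", quantity_from_ct(ct_kinetoplast, *kinetoplast_curve)
    return "undetected", 0.0
```

Because kinetoplast copy number varies between isolates and species, a quantity obtained through the fallback branch is best read as an estimate relative to the standard used, rather than as an absolute parasite count.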

Application of unsupervised clustering to the analysis of conserved non-coding genomic elements in Trypanosomatids

Parasite genetic diversity can be an important adaptation for survival in the presence of host immune elements. This becomes evident in the large coding capacity, an estimated 10-20% of the Trypanosoma brucei genome 30, devoted to its Variant Surface Glycoprotein (VSG) genes. Growth of trypanosomes expressing one dominant VSG elicits an effective host response capable of eradicating this clone, but through programmed switching of the dominant expressed VSG, each population of parasites expressing one VSG is sequentially replaced by new parasites expressing an alternate dominant VSG antigen. In this manner the trypanosomes in a host are able to maintain a high parasitemia in the face of an otherwise functional host immune system.

The mechanisms by which trypanosomes diversify their VSG antigenic repertoire have been the subject of fascinating research. Trypanosoma brucei spp. contain > 1000 basic copies of VSG genes in their genome, but only a single VSG is expressed at a time. The expressed VSG is located in a telomeric expression site. In conjunction with 8 expression site associated genes (ESAGs), the expressed VSG is transcribed by RNA pol I from a promoter to produce a polycistronic transcript. Basic copies can be switched into an expression site by homologous recombination, or by an inactive telomere changing to the active telomere. 10 During this study we questioned whether additional diversity of VSG genes arises through the generation of mosaic genes from basic chromosomal copies. To address this, we compared all the known VSG genes in existing databases for evidence of mosaicism. A few VSG sequences in the database were generated by individual labs over years of study, while most were the result of the Trypanosoma brucei whole genome sequencing project 30.

To assess the similarity and potential recombination events between VSG genes, we developed a software pipeline to simplify the process of clustering and visualizing clusters of similar sequences. The methods used for clustering sequences in Trypanosoma brucei's VSG repertoire also provide a promising means of identifying and categorizing groups and sub-groups of conserved non-coding elements in any Trypanosomatid species. Trypanosomatids alter their gene expression depending on whether they are within their host or vector life-stage. Regulation of gene expression occurs largely post-transcriptionally, and conserved sequences in the 3' UTRs of Leishmania spp. have been shown to regulate RNA message abundance through message stability and degradation 202. Better understanding which genes are co-regulated in a stage-specific manner may help researchers better understand the virulence factors that are required to survive the human immune system.

Unsupervised clustering using the MCL algorithm clustered the Variant Surface Glycoprotein (VSG) genes of Trypanosoma brucei in a way that both agreed with previously published analyses of gene similarity and added to our understanding by showing possible recombination events occurring between clusters of VSGs. Visualization of these clusters was an especially useful way to view similarities and potential mosaic genes between VSG types; however, a limitation of visualization is that, while differences between large groups are readily seen, the individual relationships between genes are not. Nevertheless, in the process of generating the visualizations of gene clusters, we also obtain a graph structure of the similarity between all the biological sequences. We have shown how this structure lends itself well to computational analysis of graph topology. In particular, the modularity of the clusters and the betweenness centrality of individual genes or sets of genes in the graph can be measured. These measurements were helpful in calibrating the graph to generate the most tightly clustered groups (modularity) and to identify genes or groups of genes which tended to share

sequence similarity with more diverse sub-groups of genes (betweenness centrality). The genes with high betweenness centrality were particularly interesting because these data suggest that atypical VSGs (non-pseudogene VSGs that are not in typical VSG expression sites) could be contributing to the maintenance of the overall genetic diversity of the VSG repertoire.

The ease of implementing the MCL algorithm in sorting VSG proteins into types led us to hypothesize that this method could be extended to aid in the categorization and classification of non-coding sequences in Trypanosomatid genomes, as described in Appendix B. The first step in identifying conserved non-coding sequences amongst Trypanosomatids makes the application of the pipeline described in Appendix B unique to Trypanosomatids. As Trypanosomatids have highly conserved synteny between protein-coding genes, and inter-coding regions between these genes are relatively short (4 kilobase median length), the search space for potentially conserved regulatory sequence elements within syntenic regions is fairly small. Using the pipeline described, the requirement that a sequence be present in the more divergent L. (V.) braziliensis species in addition to other Leishmania (Leishmania) sub-genus species was critical in reducing the output to only the most conserved sequences. Preliminary results from this study indicate that known regulatory SIDER elements were readily identified through the search pipeline. The characterization of the many other clusters of conserved sequences identified in this analysis remains to be addressed in the future. We hope the adoption or adaptation of some of the techniques described in this pipeline could help any group assembling a similar pipeline, and we hope that the results we have generated could aid researchers trying to understand the regulation of genes of interest in Trypanosomatids by providing a tool to find other genes with a similar regulatory element and perhaps predict its function.
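For readers unfamiliar with the MCL algorithm referred to above, the following is a minimal pure-Python sketch of its core loop (expansion by matrix squaring, inflation by element-wise powering and column renormalization) applied to a small similarity matrix. It is an illustrative reimplementation under simplifying assumptions, not the MCL program or the pipeline used in this work; the function names, the inflation value of 2.0, and the toy matrix are all hypothetical.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    out = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = matrix[i][j] / total
    return out


def mcl(similarity, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster nodes of a symmetric similarity matrix; returns sorted lists of node indices."""
    n = len(similarity)
    m = [[float(v) for v in row] for row in similarity]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops stabilize the random walk
    m = normalize_columns(m)
    for _ in range(max_iterations):
        # Expansion: two-step random walk (matrix squared).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        inflated = normalize_columns([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tolerance:
            break
    # Each row that retains flow is an attractor; its nonzero columns form a cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


if __name__ == "__main__":
    # Toy graph: two tight triangles (0,1,2) and (3,4,5) joined by one weak edge 2-3.
    toy = [[0, 1, 1, 0, 0, 0],
           [1, 0, 1, 0, 0, 0],
           [1, 1, 0, 0.1, 0, 0],
           [0, 0, 0.1, 0, 1, 1],
           [0, 0, 0, 1, 0, 1],
           [0, 0, 0, 1, 1, 0]]
    print(mcl(toy))
```

In a sequence-clustering setting, the matrix entries would be pairwise similarity scores between sequences, and raising the inflation value generally yields finer-grained clusters. Modularity and betweenness centrality, as discussed above, would then be computed on the same graph with a separate graph library.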

Annotating unknown protein-coding genes

Genomes are now available for multiple Trypanosomatids, but knowing the amino acid sequence has been insufficient to predict the function of some genes, as there are often no well-curated domains within the proteins. This deficiency has left a large blind spot in all high-throughput studies employing analyses which look for enrichments in gene ontologies or pathways. We have sought to extend our ability to predict the structure and function of these genes through methods described in Appendix A. Here we have obtained a graph useful in aggregating annotations from other proteins by relaxing as many constraints as practically possible. This meant incorporating all domains identified by either curated or automatically generated profile-hidden Markov models, and including all proteins and taxonomic information from any species. The graph containing this information, stored in an indexed SQL database, can fit on and be traversed on a typical personal computer (currently operating without issue on a computer using 512 MB of RAM and occupying less than 40 GB of hard disk space). This graph could potentially be useful for finding or enriching the annotations for any gene sequence. More complete annotations leading to a better understanding of datasets acquired from high-throughput sequencing and proteomics analyses could aid in the identification of proteins critical for immune evasion.

Human population genetics and susceptibility to leishmaniasis

In contrast to the parasite, in which genetic plasticity of the population provides the ability to adapt to host situations, genetic diversity of the human host provides a variable response to infection with the Trypanosomatids. Our investigations of genetic loci associated with visceral leishmaniasis or asymptomatic Leishmania infantum chagasi infections highlighted some genomic regions worthy of further pursuit. 
The results from this analysis indicated potential associations between symptomatic visceral leishmaniasis

and polymorphisms in TGFBR2 and LTBP4. There are also potential associations between a positive cutaneous delayed-type hypersensitivity (DTH+) response to leishmania antigen (asymptomatic infection) and FCGR2A.

Visceral leishmaniasis is a disease characterized by immunosuppression, and TGF-β is one factor contributing to this state. The TGF-β pathway has already been found to play a major role in the immunosuppression observed in murine leishmaniasis. 186, 203, 204 TGF-β serves to suppress both the development of a type 1 immune response and the proliferation of both type 1 and type 2 CD4+ cells. 205, 206 Not only are local levels elevated in tissues, facilitating local parasite growth, but the parasite itself secretes a cathepsin B-type protease that activates latent TGF-β in the extracellular milieu. TGF-β levels are elevated in the sera of subjects with active visceral leishmaniasis, and diminish at the time of symptomatic cure. 207 One could envision a role for the latent TGF-β binding protein 4, encoded by LTBP4, in facilitating the extracellular activation by proteases. 184, 185 The TGF-β receptor II subunit is the subunit that binds active TGF-β, after which the TGF-β receptor I subunit is phosphorylated, leading to intracellular signaling events. Thus both proteins affect the availability of, or cellular response to, active TGF-β.

The role for genetics in an asymptomatic infection has only begun to be investigated. Thus the potential association of the low affinity Fcγ receptor (encoded by FCGR2A) with the DTH+ response raises the possibility that known effects of FcγR ligation may participate in the asymptomatic response. There are high levels of antibodies produced during visceral leishmaniasis, and Leishmania spp. amastigotes are coated with IgG after emerging from a host cell. 195, 197 Ligation of FcγRs during entry into the next macrophage elicits production of IL-10, a cytokine that suppresses IFN-γ mediated clearance of the parasites. Ironically, in a mouse model, IL-10 is important for maintaining a low level of persistent parasites in tissues, and maintaining protective immunity to the next parasite exposure. 196

Dissecting the determinants of the asymptomatic state could potentially provide insight into the factors allowing a host to clear Leishmania infantum chagasi infection with no adverse consequences. The presence of the DTH+ response may be due to immune memory from a past infection, or it may be due to immune surveillance maintained secondary to a current subclinical or latent infection. One potential use of parasite genomes in these studies could be our ability to detect asymptomatic infections by amplification of parasite DNA from asymptomatic exposed hosts. Indeed, provided great care is taken to mitigate possible contamination during sample collection and preparation, asymptomatic or latent infections present in some individuals in the Brazilian population could be detected and measured using the qPCR approaches described in Chapter III. If the hypothesized infections were detected in the asymptomatic population, these could lead to new phenotypic information (presence of infection, evidence of clearance, or quantification of parasite load) for use in past or future genomics studies. With the great cost and diminishing returns involved in adding more families from the region, future studies of these results should first include checking for data supporting these associations within genome-wide association studies in populations in India and Brazil that are currently underway. Further sequencing to identify possible functional variants in the TGF-β pathway genes may also be useful. The FCGR2A polymorphism has been associated with other immune disorders, and tests of function involving IgG and this receptor may provide more information as to how host genetic variants in this receptor contribute to an asymptomatic response. Our finding of potential association between the latent TGF-β binding protein 4 and visceral leishmaniasis is the first evidence that studies of interactive microbe-human genomes may elucidate some aspects of disease pathogenesis.
Cathepsin B released by leishmania can activate latent TGF-β, and presumably the relative ability of LTBP4 to retain TGF-β in an inactive bound form, versus releasing it into a biologically active form through the activity of the leishmania protease would influence the course of disease.

Although the analysis will be complex, such interactions could be investigated considering the polymorphisms in the human and parasite genomes.

Future combined annotation aggregation from both protein-coding genes and regulatory sequences

Future studies of the Trypanosomatid genomes will be more informative if we can discern more about the functions of hypothetical genes and the functions of regulatory regions of the parasites. Annotations can be enhanced using a combination of data produced by the approaches in Appendix A and Appendix B. The presence of conserved regulatory elements could help enrich our annotations of genes of unknown function. Subsequently, those new proposed functions could be used to help annotate other orthologous genes of unknown function. Before this method would be effective, we would also need to accurately estimate the degree to which erroneous annotations would corrupt the overall graph relating genes. Developing measures of confidence in, or of the potential for error within, a predicted annotation would be highly advisable moving forward with this sort of annotation aggregation. It is conceivable that the degree to which certain domains or annotations may be accurately assigned, or may contribute to erroneous results, may vary between protein families, domains, or even between species. I hypothesize that if error rates in these different circumstances could be measured using bootstrapping in subsets of data, then a machine learning approach may be effective in selecting the best annotations or providing measures of confidence on annotations given the domains present and the species being annotated. In pursuit of this goal, principal components analysis of the profile-HMMs included in the analysis may be able to reduce some redundancy in the current graph data structure. The results currently produced by annotation aggregation based on the random walks of a probability graph require a great deal of scrutiny from the investigator,

and reducing the amount of time spent pruning erroneous results would make this a more useful tool. The ultimate goal will be to gain an understanding of the expression and function of leishmania genes in response to the environments the parasite encounters in the insect and human hosts.

Conclusion

The data presented in this thesis have shown how the completed human and Trypanosomatid genome projects have added to our knowledge beyond just human or parasite genomic composition. They have also provided necessary tools for designing better molecular probes that can be used to probe the environment to define the determinants of human exposure, and to probe human specimens to define the extent of disease. We have just begun to explore methods by which an amalgamation of available genomics data and annotations from many species can contribute to our understanding of genes or non-coding regulatory sequences of Trypanosomatids. Large studies of human populations have added to our knowledge of how genetic variation in genes involved in host immunity can contribute to the outcome of a leishmania infection. Either the parasite or the human host alone is complex, and the interaction between them is even more daunting. Nonetheless, our studies have led to some hypotheses regarding polymorphic features of both the parasite and the host that could influence the outcome of infection. The studies described here have utilized parasite genetics to develop an assay to identify the species within the host. The future of such studies may lie in high-throughput sequencing and the analysis of gene expression in both the parasite and host. Typically such a study would be difficult, as the amount of host genetic material would limit the detection of mRNA from the parasite. However, splice leader trapping is a method that capitalizes on a unique feature of the Trypanosomatid genomes to overcome this difficulty. 208 This method takes advantage of the conserved 5′ splice leader sequence that

is trans-spliced to the beginning of all Trypanosomatid mRNAs. A primer targeting this conserved 5′ splice leader sequence allows the high-throughput reaction to specifically amplify Trypanosomatid messages. Thus, future analyses of parasite gene expression within the host may be feasible even when the parasite load is low. We hope these assays can be applied to improve our knowledge of host and parasite genetics and the interactions between the two, as costs of high-throughput sequencing continue to fall. Such a study will require large numbers of participants and produce a large amount of sequence data, but the interpretation of these data could better our understanding both of the parasite proteins necessary to evade the immune system and of the host genetic variations useful in resisting infection.

APPENDIX A. CONSTRUCTING AN INFORMATICS INFRASTRUCTURE FOR ANNOTATING PROTEIN-CODING GENES OF UNKNOWN FUNCTION IN ANY SPECIES

A.1. Introduction

Whole genome sequencing has been completed for Leishmania, Trypanosoma cruzi, and Trypanosoma brucei species. Multiple genomes have been made available for each species, providing a wealth of information about the organization and composition of these species' genomes. The incredible divergence of the Trypanosomatids from the more well studied model organisms, and their remarkable differentiation during life-stages, makes them well suited for studying questions of evolution and basic biology, but their role as human pathogens makes understanding them even more imperative. Functional annotations of the completed sequences are only available for about one third of the sequenced genes. The InterPro tool has been applied to enrich the genomics annotation information available from the EuPathDB/TriTrypDB group, and has added information to about 2/3 of the genes. However, this still leaves approximately 1/3 of the protein-coding genes lacking any functional annotations. A means of predicting the function of these remaining genes could be a boon to high-throughput studies, improving the interpretations of data generated from these techniques. We hypothesized that orthologs of unannotated genes do exist in other species, and that identifying these relationships, however weak they may be, could contribute to predicting function. To address this hypothesis, and to attempt to provide annotations for unknown genes, we have conducted an exhaustive search for both known and predicted protein domains using profile-hidden Markov models (profile-HMMs) among all available protein sequences from all species. The product of this search is a graph of relationships between all protein sequences that can be used to find similar

genes with annotated functions. This graph can be queried in a variety of ways, including random walks of a probability graph to link proteins with known annotations to unannotated proteins. This permits a great deal of flexibility and customization in queries, so investigators asking a question about a gene can focus on only the most relevant portions of their protein of interest and, when helpful, limit their results to those gathered from a subset of species.

A.2. Methods

A.2.1. Profile-hidden Markov models

Three sources of profile-HMMs were used in this study: (1) Pfam-A domains and (2) Pfam-B domains, both provided by the Sanger Institute, and (3) PANTHER domains. The search included approximately 10K Pfam-A domains, approximately 420K Pfam-B domains, and 60K PANTHER domains. These domains will certainly contain a great deal of redundancy, especially between Pfam-A and PANTHER domains, but our primary goal is to be as complete as possible in building a detailed graph between all proteins. Pfam-B domains as provided by the Sanger Institute are algorithmically generated by the ADDA algorithm. Since sequences from Trypanosomatid genomes are included in this algorithm, there is excellent coverage of these genes. A note of caution, though, is that the graph generated between some unknown genes in Trypanosomatids may not be well connected to species beyond the Trypanosomatids, and in these situations, knowledge of domains may not translate well into the annotation of the genes. The consequence is that even though we achieved 99.9% coverage of the Leishmania infantum genes with this set of domains, the number of genes to which we can confidently ascribe annotations will undoubtedly be less than that number.

A.2.2. Genomic Sequences and Annotations

The sequences of completed Trypanosomatid genomes used in this analysis were obtained from TriTrypDB, with the exception of L. (L.) donovani, which was obtained from NCBI. All other protein sequences were obtained from the CAMERA non-redundant protein database.

A.2.3. High-throughput HMM scanning

Profile-HMMs were used to scan all sequences using the HMMER 3.0 tool on the University of Iowa's high-performance computing grid. The approximately 27 million sequences searched were split into chunks of thousands of sequences, and the tasks of scanning each sequence chunk were distributed to computers throughout the grid. Each node would hold a temporary local copy of all profile-HMMs being used for the scan. Upon completion of the scan, results were aggregated together, discarding the precise alignment of each matching sequence to conserve space.

A.2.4. Efficient storage with an SQL database

Results from HMMER scans with all the profile-HMMs on all of the protein sequences were stored in a MySQL database using the MyISAM storage engine. MySQL, a free relational database, provides an efficient means to interface with the data generated in this study, but any SQL implementation with a comparable storage engine should work equally well. SQL queries offer a powerful means to formulate almost any question you might ask of the data, if the data is properly organized. The output from the profile-HMM scan is the key to annotating the genes, and an example implementation of how this hits table relates to other annotation resources is shown in Figure A1 and Table A1. Other tables containing annotation information are also necessary; these include the profile-HMM and protein related tables, with relevant annotations from their original data sources.
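The chunked distribution of sequences described in A.2.3 could be sketched roughly as follows. This is an illustration only, not the scripts used on the grid: the function names read_fasta and chunk_records, the chunk size, and the toy sequences are all assumptions, and the step that would hand each chunk to HMMER on a compute node is deliberately left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most chunk_size, one list per scan task."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Toy input (hypothetical sequences); a real run would stream a FASTA file.
example = """>p1
MKV
LLA
>p2
MSTN
>p3
MAAA
""".splitlines()

chunks = list(chunk_records(read_fasta(example), chunk_size=2))
for number, chunk in enumerate(chunks):
    print(number, [header for header, _ in chunk])
```

Because both functions are generators, the full set of sequences never needs to be held in memory at once, which matters at the scale of tens of millions of proteins.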

Figure A1. An SQL implementation of the described data schema

Notes. An SQL implementation of the described data schema with 27 million protein sequences, requiring approximately 30GB of hard disk space for implementation, or 7GB of space for long term storage. SQL table names are listed at the top of each block. The first column contains the names of the SQL columns, for which brief descriptions are listed in Table A1. The second column contains a recommended storage type within a MySQL database. The third, rightmost, column contains a P if that SQL column should be a unique primary key, and an X if it should be indexed for quicker access. The bottom of each block lists the number of entries in the example database and the storage requirements currently being used by the table in a MyISAM storage engine.

Table A1. Brief descriptions of SQL column names

Name | Tables | Description
protein_db_id | hits, proteins | Integer representation of the protein.
profile_db_id | hits, profiles, profile_annotations | Integer representation of the profile-HMM.
full_e_val_neg_log2 | hits | HMMER output of the E-value for a match to the whole sequence.
full_score | hits | HMMER matching score for the full sequence.
full_bias | hits | HMMER bias for the full sequence.
domain_c_eval_neg_log2 | hits | HMMER output for the domain's E-value.
domain_i_eval_neg_log2 | hits | HMMER output of a corrected domain E-value.
domain_score | hits | HMMER matching score for a domain.
domain_bias | hits | HMMER bias for the domain.
hmm_coord_from | hits | Where the alignment begins on the profile-HMM.
hmm_coord_to | hits | Where the alignment ends on the profile-HMM.
ali_coord_from | hits | Where the alignment begins on the protein sequence.
ali_coord_to | hits | Where the alignment ends on the protein sequence.
env_coord_begins | hits | Number of amino acids prior to ali_coord_from at which HMMER predicts the match to really begin.
env_coord_after | hits | Number of amino acids after ali_coord_to at which HMMER predicts the match to really end.
reliability_of_alignment | hits | HMMER's measure of the quality of the alignment.
name | profiles | The name of the profile-HMM.
accession | profiles | Another identifier of the profile-HMM.
profile_length | profiles | Number of amino acids of the profile-HMM.
annotation | profiles | Description of the profile.

Table A1. Continued.

Name | Tables | Description
annotation_category | profiles | Categorical description of the profile annotation.
protein_name | proteins | Unique protein name from TriTrypDB or CAMERA.
protein_length | proteins | Number of amino acids in protein.
taxon | proteins, taxonomy_basics, taxonomy_spines | The taxonomy ID for a species from NCBI.
protein_external_id | proteins | A compressed list of NCBI sequence ids.
protein_description_db_id | proteins, protein_descriptions | An integer representation of a descriptive text.
hypothetical | proteins | 1 if the protein has no description or is unknown/hypothetical. 0 if it has some other description.
protein_category | proteins | Identifier indicating if this protein came from TriTrypDB or CAMERA.
description | protein_descriptions | Text description of a protein.
taxonomy_name | taxonomy_basics | The name of the taxon from NCBI.
unique_name | taxonomy_basics | Alternative name from NCBI.
name_class | taxonomy_basics | Type of name from NCBI.
protein_count | taxonomy_basics | The number of proteins under this taxon.
taxonomy_rank_db_id | taxonomy_basics | The rank of the taxon, i.e. kingdom, genus, species.
species_count | taxonomy_basics | The number of species.
higher_parent | taxonomy_spines | A taxon id above the one called taxon, e.g. genus Leishmania being over species L. (L.) infantum.
steps_up | taxonomy_spines | Number of levels over the column labeled taxon.

Note: Brief descriptions of SQL column names from Figure A1.
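To make the role of the hmm_coord_from, hmm_coord_to, and higher_parent columns in Table A1 concrete, the following is a minimal sketch of a query over a cut-down version of the schema. It uses Python's built-in sqlite3 module as a stand-in for MySQL/MyISAM, keeps only a handful of the columns, and fills the tables with invented rows; the taxon ids, protein names, and the helper clade_restricted_overlaps are all hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE proteins (protein_db_id INTEGER PRIMARY KEY,
                       protein_name TEXT, taxon INTEGER);
CREATE TABLE hits (protein_db_id INTEGER, profile_db_id INTEGER,
                   domain_score REAL,
                   hmm_coord_from INTEGER, hmm_coord_to INTEGER);
CREATE TABLE taxonomy_spines (taxon INTEGER, higher_parent INTEGER,
                              steps_up INTEGER);
CREATE INDEX hits_profile_coords
    ON hits (profile_db_id, hmm_coord_from, hmm_coord_to);
""")

# Invented toy rows: taxa 11 and 12 sit under clade 10, taxon 20 under clade 30.
conn.executemany("INSERT INTO proteins VALUES (?, ?, ?)", [
    (1, "query_protein", 11),
    (2, "same_clade_overlap", 12),
    (3, "same_clade_no_overlap", 12),
    (4, "other_clade_overlap", 20),
])
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", [
    (1, 100, 50.0, 10, 80),
    (2, 100, 45.0, 60, 120),
    (3, 100, 30.0, 90, 150),
    (4, 100, 48.0, 5, 70),
])
conn.executemany("INSERT INTO taxonomy_spines VALUES (?, ?, ?)", [
    (11, 10, 1), (12, 10, 1), (20, 30, 1),
])

QUERY = """
SELECT p.protein_name
FROM hits AS q
JOIN hits AS h
  ON h.profile_db_id = q.profile_db_id
 AND h.protein_db_id != q.protein_db_id
 AND h.hmm_coord_from <= q.hmm_coord_to   -- the two hits overlap on the
 AND h.hmm_coord_to >= q.hmm_coord_from   -- profile-HMM coordinates
JOIN proteins AS p ON p.protein_db_id = h.protein_db_id
JOIN taxonomy_spines AS s ON s.taxon = p.taxon
WHERE q.protein_db_id = ? AND s.higher_parent = ?
ORDER BY h.domain_score DESC
"""


def clade_restricted_overlaps(connection, query_protein, clade_taxon):
    """Proteins in a clade whose hits overlap the query's hits on the same profile."""
    rows = connection.execute(QUERY, (query_protein, clade_taxon)).fetchall()
    return [name for (name,) in rows]


result = clade_restricted_overlaps(conn, query_protein=1, clade_taxon=10)
print(result)
```

Protein 3 hits the same profile but on a non-overlapping stretch, and protein 4 overlaps but lies outside the requested clade, so only protein 2 is returned. Dropping the taxonomy_spines join would remove the clade restriction.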

For search outputs connecting proteins with domains to be useful, it is essential that the matching portions of the domain actually overlap. Many times a domain hit will cover only part of the domain, and there would be a great deal of noise in the results if searches also returned domain hits that did not overlap for at least part of the profile-HMM. This is the reason that the hmm_coord_from and hmm_coord_to fields are indexed in the example database (Figure A1). Indexing these fields greatly improves the speed at which queries involving them can be processed, but also requires a substantial amount of memory. Indexing a buffered subset of results could also be a viable alternative. To summarize with a simplified example, these indexed columns permit us to efficiently ask the question "Which proteins share this specific portion of the domain?" instead of "Which proteins have any part of this domain?".

A.2.5. Taxonomy data

Taxonomy provides a means to group domains and annotations, giving the investigator control over the source species or clade of the annotation information they are pulling together. An implementation of a relational database structure for taxonomic data is shown in Figure A1 and Table A1. The taxonomy_basics table holds the taxonomy name and its rank (species, genus, order, etc.). The taxonomy_spines table is a shortcut for rebuilding the taxonomic tree structure, or pruning that structure in whichever way is required by the investigator. This is accomplished by pairing each taxonomy id in the taxon column with every one of its higher parent taxonomies, one entry each, in the higher_parent column. Therefore, to select every taxonomy falling under a particular taxonomy, such as the genus Leishmania, you would simply ask for any taxon where the genus Leishmania is present as a higher parent.

A.3. Proposed applications

As the data generated here comprises an intractable number of proposed functions for all sorts of protein-coding genes in every species, the results are probably

best described in terms of what questions can be asked of it, followed by a discussion of how this differs from what is available and what the limitations of these methods are. First, this data can be used to search for orthologs based on amino acid sequence similarity. The domain hits can be likened to BLAST hits, and requesting other proteins that have the same domains can produce a search result similar to BLAST in a computationally efficient way, as the most difficult calculations measuring similarity have been pre-computed for everything. The results generated from this search will not, however, be ordered appropriately from most to least similar as with BLAST. Ranking results in terms of most similar domain score and most similar domain coverage addresses this shortcoming. This type of search enables the investigator to pull in the annotations of any matching genes and discover which domains were contributing to that annotation, and where on the protein those domains were matching. The data structures proposed here give the investigator the option to limit results to only those generated from a part of a protein. This should be helpful when an unknown domain or portion of a protein fails to be annotated properly because the results from a common domain falling on the same protein are drowning out any results from the unannotated part. Another data set useful for filtering results is the taxonomy data. If the investigator wants results exclusively from one clade it is possible to make that selection, and alternatively a clade could be excluded from the results. A probability graph structure can be useful in joining together annotations from known genes with unannotated genes, and seems thus far to be a useful way to connect these proteins. 
This is accomplished by setting the probability of traversing an edge on the graph, starting from an unknown gene of interest, to be proportional to the HMMER score of the shared domain hits, and optionally proportional to the amount of overlap the two domain hits share. This graph structure can be traversed by random walks determined by those weights, and has the nice feature of being able to gather annotations from nodes multiple steps away on the graph while fairly weighting the informativeness of these less related genes. A counting

approach can be taken to choose the annotation that is observed most often on random walks starting from a protein of interest. Future research should involve a measure of confidence which rewards information provided by multiple sources. Only counting the number of times an annotation is seen on a random walk may unfairly favor a single gene that is traversed multiple times. The data structures described also allow for filtering out unannotated results, but it may be helpful to keep these hypothetical results in the graph so that the distance from a gene to any annotated genes can be seen in proportion to the number of walks that do not cross any annotations. This may be a helpful way to identify which annotations will ultimately have a low confidence score.

A.4. Future directions

The data structure described already serves well the function of pulling together annotations for any single gene from across the known annotations of all proteins. Rather than discarding results of high-throughput assays that include these unannotated proteins, we intend to apply this process to improving our annotations and hypothesized functions of unannotated genes. Further research into how machine learning approaches may be used to reduce the complexity of the graph, or to determine a confidence score for predicted annotations, would help expedite the interpretation of future analyses.
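The random-walk counting approach described in A.3 could be sketched as follows. This is a simplified illustration rather than the implementation used in this work: the graph is a small hand-built dictionary, edge weights stand in for HMMER scores, each walk stops at the first annotated protein it reaches, and the names random_walk_annotations, UNANNOTATED, n_walks, and max_steps are assumptions introduced for the example.

```python
import random
from collections import Counter

UNANNOTATED = "(no annotation reached)"


def random_walk_annotations(edges, annotations, start,
                            n_walks=2000, max_steps=4, seed=0):
    """Count the first annotation met on weighted random walks from start.

    edges maps each protein to {neighbour: weight}; a step to a neighbour is
    taken with probability proportional to its weight (e.g. a HMMER score).
    Walks that meet no annotated protein are tallied under UNANNOTATED.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_walks):
        node, found = start, None
        for _ in range(max_steps):
            neighbours = edges.get(node, {})
            if not neighbours:
                break  # dead end: nothing further to traverse
            targets = list(neighbours)
            weights = [neighbours[t] for t in targets]
            node = rng.choices(targets, weights=weights, k=1)[0]
            if node in annotations:
                found = annotations[node]
                break
        counts[found if found is not None else UNANNOTATED] += 1
    return counts


# Toy graph: an unknown protein shares a strong hit with an annotated kinase
# and a weaker hit with a hypothetical protein that neighbours a transporter.
edges = {
    "unknown": {"hypo1": 30.0, "kinaseA": 90.0},
    "hypo1": {"unknown": 30.0, "transporterB": 40.0},
    "kinaseA": {"unknown": 90.0},
    "transporterB": {"hypo1": 40.0},
}
annotations = {"kinaseA": "protein kinase", "transporterB": "transporter"}

counts = random_walk_annotations(edges, annotations, start="unknown")
for label, n in counts.most_common():
    print(label, n)
```

Keeping the UNANNOTATED tally corresponds to the suggestion above of retaining hypothetical proteins in the graph: a large share of walks that never cross an annotation is one possible signal that the winning annotation deserves a low confidence score.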

APPENDIX B. CONSTRUCTING AN INFORMATICS INFRASTRUCTURE FOR IDENTIFYING AND ANNOTATING CONSERVED NON-CODING SEQUENCES OF UNKNOWN FUNCTION IN TRYPANOSOMATIDS

B.1. Introduction

One of the most remarkable features of the completed Trypanosoma brucei, Trypanosoma cruzi, and Leishmania major genomes is the highly conserved synteny between species. Additionally, Leishmania braziliensis and Leishmania infantum genomes have been completed and published. The lengths of inter-coding regions are relatively short (4 kb median length). Within these inter-coding regions are regulatory elements such as the reported SIDER elements. As protein-coding genes lack traditional pol II promoters, and are transcribed in poly-cistronic transcripts that are then broken into individual messages through trans-splicing events, it is hypothesized that gene regulation largely takes place post-transcriptionally, and often by means of regulating the rate of message degradation. We hypothesized that cis-regulatory elements like the SIDER sequences could be identified and functionally classified by a computational analysis pipeline which capitalizes on the conserved syntenic regions between species.

B.2. Methods

The ability of the MCL algorithm to cluster non-coding cis-regulatory sequences in the Leishmania genome was assessed with known SIDER1 and SIDER2 elements. The clusters color-coded by MCL (bottom panel Figure B1) using the software from Chapter II were compared to known SIDER types for Leishmania infantum SIDER elements. SIDER1 and SIDER2 elements did cluster as predicted (top panel Figure B1). This provided an assurance that a Blast-MCL pipeline was at least capable of properly sorting known SIDER elements.

Figure B1. Known SIDER elements can be separated using Blast and MCL

Note: SIDERs color-coded according to known type are depicted in the top panel. In the bottom panel, SIDERs have been clustered using the same visualization technique, but non-grouped sequences have been removed and the MCL algorithm has assigned and color-coded type labels to different clusters.
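For readers unfamiliar with MCL, the clustering step of the Blast-MCL pipeline alternates an expansion step (squaring a column-stochastic matrix) with an inflation step (raising entries to a power and re-normalizing columns). The following is a minimal textbook-style sketch in pure Python and is not the software from Chapter II; the function name mcl_clusters, the inflation value of 2.0, the added self-loops, the fixed iteration count, and the toy similarity matrix are all assumptions made for illustration.

```python
def normalize_columns(matrix):
    """Scale each column of a square matrix in place so that it sums to 1."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def mcl_clusters(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops, a common MCL convention
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: flow spreads along two-step paths (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strong flows are boosted, weak flows are suppressed.
        m = [[value ** inflation for value in row] for row in m]
        normalize_columns(m)
    # Each row that still carries flow lists the members of one cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Toy similarity matrix: sequences 0-2 hit one another, as do sequences 3-5,
# with no hits between the two groups (1 = Blast hit, 0 = no hit).
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl_clusters(similarity)
print(clusters)
```

In practice the matrix entries would be Blast similarity scores rather than 0/1 values, and the inflation parameter controls how finely the sequences are split, which is the parameter one would tune when adjusting cluster granularity.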

First, syntenic regions within the genomes must be identified. This was accomplished by identifying the best matching genes between genomes by Blast and then comparing Blast hits among syntenic protein-coding genes by stepping down the chromosome in order to find the length of the syntenic region. Next, syntenic inter-coding regions among the genomes were grouped together. In the case of Leishmania, this includes the genomes of L. infantum, L. major, L. braziliensis, and more recently L. mexicana. The syntenic regions shared between these species were searched against themselves for similar sequences, observed as Blast hits. If a Blast hit occurred in multiple (or all) syntenic sequences, it was considered to contain a putative conserved regulatory sequence, and the sequence fragment whose bounds were set by the Blast hit was stored in a list of putative regulatory sequences. This process was repeated thousands of times until each inter-coding region had been searched against its conserved counterparts in other species, and all putative regulatory sequences were pooled together prior to clustering by the MCL algorithm. The requirement for the number of matching species is a parameter that can be adjusted in the software pipeline. The unsupervised clustering MCL algorithm, as described in Chapter I, was used to group clusters of conserved regulatory sequences, and to visualize those clusters. Some shortcomings of this method are that it will be insensitive to very short conserved sequences, which will not score as a Blast hit, and that conserved sequences that are few in number will be harder to visualize. To improve our throughput so that we do not have to go through the tedious process of annotating each clustered sequence, we search the clusters for enrichments in ontologies or functional analyses of upstream or downstream genes. 
For the Trypanosomatids this can include some gene ontology data, and also some studies of life-stage specific expression of genes. The results of these enrichments could be used to tweak the MCL clustering parameters to find a cluster of regulatory elements that most likely suits the identified function.
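The first step of the method above, stepping down the chromosome to find the extent of a syntenic region, could be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the pipeline itself: best Blast hits are supplied as a ready-made dictionary, only same-orientation collinearity is considered, and the function name syntenic_blocks, the min_genes threshold, and the gene labels are invented for the example.

```python
def syntenic_blocks(order_a, order_b, best_hit, min_genes=2):
    """Find runs of adjacent genes in genome A whose best hits are adjacent in B.

    order_a, order_b: gene identifiers in chromosomal order.
    best_hit: maps a gene in A to its best-matching gene in B (e.g. by Blast).
    Returns a list of blocks, each a list of genes from genome A.
    """
    position_b = {gene: index for index, gene in enumerate(order_b)}
    blocks, current, previous = [], [], None
    for gene in order_a:
        index = position_b.get(best_hit.get(gene))
        if index is not None and previous is not None and index == previous + 1:
            current.append(gene)  # collinear with the previous gene: extend
        else:
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if index is not None else []
        previous = index
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Toy chromosomes: genome B carries an extra gene (bX) that breaks collinearity.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "bX", "b5", "b6"]
best_hit = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b5", "a5": "b6"}

blocks = syntenic_blocks(order_a, order_b, best_hit)
print(blocks)
```

The inter-coding regions between consecutive genes within each block would then be the candidates to group across species and search against one another for conserved sequence.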

Optionally, after a set of conserved nucleotide sequences has been identified, this set can be used to generate new profile-HMMs, and the genomes of the Trypanosomatids can be searched for these sequences. Note that the software from Chapter II automatically generates these profile-HMMs for clustered sequences, making this an easy addition to the pipeline. The completed sets of conserved regulatory sequences can then be searched for motifs. We have used the MEME software to identify the motifs shown in these results (Figure B1 bottom panel).

B.3. Preliminary results

Our preliminary searches for cis-regulatory elements among the Leishmania genomes have successfully found the SIDER2 family (Group 1 Figure B2). An initial survey of results has also clustered and identified several other motifs. Most groups did not readily show enrichment for any pathways in a gene ontology analysis; however, Group 3 was enriched in some metabolic pathways (results not shown).

B.4. Future directions

As more Leishmania and other Trypanosomatid genomes become available, our ability to identify conserved elements should also improve. Proceeding forward with this research, care should be taken to assign accurate confidence values to annotations assigned to clusters of sequences. Although we began this pipeline with very small sets of sequences due to the highly conserved synteny of inter-coding regions, we quickly amass a larger search space when we consider all of the different possible clustering schemes and potential annotations. Performing functional follow-ups using reporter constructs would be a good starting point in the investigation of functions for any of these motifs. If an element is involved in the differential regulation of protein expression in different life-stages or different metabolite conditions, then a reporter assay may be informative as to its function. Understanding these regulatory elements may lead to a better understanding of

how parasite gene expression is controlled during its different life stages, and elements important to survival in the mammalian host may help indicate previously undescribed virulence factors contributing to Trypanosomatid pathogenesis.

Figure B2. Visualizing conserved inter-coding region sequence clustering and motifs

Note: These are clusters of conserved inter-coding sequences in Leishmania infantum, Leishmania major, Leishmania mexicana, and Leishmania braziliensis, using software from Chapter I. Groups of sequences have been clustered according to Blast similarity and separated into groups and color-coded by the MCL algorithm in the top panel. Motifs found within the conserved elements are listed in the bottom panel. The group labeled number 1 corresponds to the SIDER2 elements. We have not investigated the other groups.