Genetic marker(s) for species determination and typing. How to choose what to choose New markers selection

1 Genetic marker(s) for species determination and typing. How to choose, what to choose, new marker selection. Marco Lalle, EURLP, Istituto Superiore di Sanità, Rome, Italy

2 The key to developing a reliable PCR-based molecular method is to define one or more suitable DNA target regions (genetic markers or loci) based on DNA sequence data. DIFFERENT GENES EVOLVE AT DIFFERENT RATES

3 The DNA region selected should be sufficiently variable in sequence to allow identification of parasites at the taxonomic level required. To detect species: analyse highly or moderately conserved coding regions that consistently allow delineation among species. To identify variants, investigate the transmission of genotypes and trace sources of infection: discriminatory fingerprinting techniques that identify individual isolates or clonal lineages are required. This holds regardless of the detection technique chosen!

4 Desirable genetic marker: Polymorphic (any difference in the nucleotide sequence between individuals). Selectively neutral (the gene variants detected have no direct effect on fitness; such markers have great potential for investigating processes such as gene flow, migration or dispersal). Evenly and frequently distributed throughout the genome. Reproducible. Easy, fast and cheap to detect. High resolution with a large number of samples (informative!).

5 A range of target regions in the nuclear and mitochondrial genomes have been employed to identify parasites to species or sub-specific genotypes. Ribosomal rDNA/rRNA: highly sensitive, but not good for closely related species. Mitochondrial DNA targets. Specific nuclear genomic DNA sequences: less sensitive, but highly specific for a single species. Single-copy vs multicopy genes/elements: multicopy targets increase sensitivity!

6 Markers for species determination

7 Ribosomal RNA (rRNA). In nuclear ribosomal genes and spacers, there is often less sequence variation among individuals within a population and between populations than between species, which makes them suitable as species-specific markers. Nuclear ribosomal DNA (rDNA): the first (ITS-1) and second (ITS-2) internal transcribed spacers (ITSs) of nuclear rDNA provide reliable genetic markers for the specific identification of a range of nematodes.

8 Ribosomal RNA (rRNA). Ideal gene for phylogenetic studies because it: is an essential gene that is present in all organisms; is a common target for sequencing studies, so a large database is available for comparisons; contains sites that are relatively conserved (stems) and sites that are more free to vary (loops).

9 Internal transcribed spacer (ITS). ITS refers to the spacer DNA situated between the small-subunit ribosomal RNA (rRNA) and large-subunit rRNA genes in the chromosome. ITS1 is located between the 18S and 5.8S rRNA genes; ITS2 is located between the 5.8S and 28S rRNA genes.

10 Internal transcribed spacer (ITS): small size, together with the availability of highly conserved flanking sequences; easy to detect due to the high copy number of the rRNA clusters; undergoes rapid concerted evolution via unequal crossing-over and gene conversion; high degree of variation even between closely related species. The variation can be explained by the relatively low evolutionary pressure acting on such non-coding spacer sequences.

11 Mitochondrial DNA. All eukaryotic cells contain mitochondria, and animal (e.g. helminth) mitochondrial DNA (mtDNA) has a relatively fast mutation rate, which generates diversity within and between populations over relatively short evolutionary timescales (thousands of generations).

12 Mitochondrial DNA. Higher mutation rates and more rapid sorting of variation result in divergence of mtDNA sequences among species, with comparatively small variance within species. The mtDNA genome is inherited only from the mother!

13 Example: identification of nematodes other than Trichinella in meat. Evidence: artificial digestion is the gold standard diagnostic method to detect Trichinella sp. larvae in the muscle tissues of animals. It is not unusual for larvae of nematodes that live in or migrate through other niches of the vertebrate body (gut lumen, liver, lungs, lymphatic or blood vessels) to be mistakenly identified as larvae of the genus Trichinella. Size, shape and experience can exclude the risk of a glaring blunder. Problem: damage to the inner structures or the external cuticle caused by the digestion process makes identification hard!

14 Method: the use of short DNA sequences as taxon barcodes to differentiate or discover new species. Steps: DNA purification from a single larva (DNA IQ System, Promega); PCR (PCR Kit, Qiagen); sequencing; sequence analysis by BLAST against GenBank.

15 18S rRNA: part of the ribosomal DNA gene cluster that is present in multiple copies in the genome. Sequences are available in GenBank for many organisms. The primers target a region of about 1 kb, while a smaller region (about 500 bp) comprising the 5′ third of the gene is sequenced. The 5′ third of 18S contains most of the nucleotide variability, as it includes both conserved stem and highly divergent loop regions. ITS1 (internal transcribed spacer 1): located next to the 18S gene on the same repeated gene cluster, it therefore represents an abundant target. As a non-coding region, it shows high variability in nucleotide composition. Forward and reverse primers annealing to conserved regions of 18S and 5.8S are used to amplify this molecular target.

16 COI (cytochrome c oxidase subunit I gene): located in the mitochondrial DNA, it is frequently used for phylogenetic and population genetic studies and, more recently, for taxon identification within the DNA barcoding project. Universal primers targeting conserved regions flanking a variable sequence of about 450 base pairs are used to amplify the gene. 12S rRNA gene: located in the mitochondrial DNA. It is widely used for phylogenetic analysis because it evolves more rapidly than the nuclear rRNA genes. The primers used amplify a region of about 500 bp inside the gene.

17 Results: The recovery of nematodes other than those belonging to the genus Trichinella during artificial digestion of muscle tissue samples is not unusual. A molecular identification system allows a reliable and quick response that unambiguously identifies these nematodes.

18 Markers for genotyping. A number of applications require tools with higher discriminatory power (i.e., more polymorphic markers): studies of population genetics, outbreak investigations, and all situations in which the source of infection must be determined precisely.

19 What makes a gene polymorphic? (Gene polymorphism)

20 Single nucleotide polymorphisms (SNPs). A SNP is a single-nucleotide change at a particular location in the genome. SNPs are known to be the most common form of genetic variation. A major cause of SNPs is the replacement of the nucleotide cytosine (C) with thymine (T) in a part of the DNA.
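
As a minimal sketch of how SNPs can be listed once two sequences of the same locus have been aligned: the function name `find_snps` and the two short example fragments are illustrative assumptions, not part of the slides, and the sketch assumes the sequences have the same length.

```python
def find_snps(ref, alt):
    """Return (position, ref_base, alt_base) for every mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    have the same length. Columns containing an ambiguous base (N) or a
    gap (-) are skipped, because they are not single-nucleotide changes.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for i, (a, b) in enumerate(zip(ref.upper(), alt.upper()), start=1):
        if a != b and "N" not in (a, b) and "-" not in (a, b):
            snps.append((i, a, b))
    return snps


# Hypothetical 12-base fragments that differ by a C>T change.
print(find_snps("ATGCCGTACGTA", "ATGCTGTACGTA"))
```

Skipping N and gap columns is a design choice of this sketch; gaps correspond to the insertions/deletions discussed on the next slide.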

21 Small-scale insertions/deletions. Small insertions and deletions are called indels; this type of gene polymorphism arises from the insertion or deletion of DNA bases in an organism.

22 Polymorphic repetitive elements. Repetitive DNA can be valuable. The hyperevolution of repeats means that many are specific to an individual species or a clade of related species. Moreover, some such repeats exist at copy numbers several thousand times that of individual gene markers. These factors have inspired the development of many diagnostic probes based on mini- or microsatellite DNA.

23 Minisatellites consist of repetitive, generally GC-rich, motifs that range in length from 10 to over 100 base pairs. Microsatellites consist of repeats of a 1-6 base pair DNA motif. They are commonly used as molecular markers, especially for identifying the relationship between alleles.
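
A rough sketch of how microsatellites (tandem repeats of a 1-6 bp motif) could be located in a sequence with a regular expression. The function name `find_microsatellites`, the default threshold of 4 repeats and the example sequence are assumptions made for illustration.

```python
import re


def find_microsatellites(seq, min_repeats=4, max_motif=6):
    """Return (start, motif, repeat_count) for tandem repeats.

    start is 1-based. The motif is matched lazily, so the shortest
    repeating unit (1 to max_motif bases) is reported.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start() + 1, motif, repeats))
    return hits


# Hypothetical sequence containing a (CA)6 repeat.
print(find_microsatellites("GGTTCACACACACACAGGA"))
```

Differences in the repeat count between isolates are what make such loci useful for genotyping.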

24 In principle, all kinds of non-coding sequences, such as intergenic sequences, introns, and mini- or microsatellites, are potentially interesting sources of polymorphism. Remember that the genomes of the most important human and animal parasites have been, or are currently being, sequenced. So there is a huge amount of information that is freely available on the web (you can mine databases).

25

26 Example: an epidemiological study of Trichinella. Evidence: two foci of Trichinella britovi appeared in free-ranging pigs in two restricted areas of Corsica (France) and Sardinia (Italy), both considered until then to be Trichinella-free. Hypothesis: are the two foci epidemiologically linked through geographic proximity and illegal animal trade between the two islands? Method: use microsatellites to investigate the origin of these foci by analysing the intraspecific genetic diversity of T. britovi isolates.

27 24 single larvae from each of the 27 T. britovi isolates (5 Sardinia; 8 Italy; 8 Corsica; 7 France) were genotyped at each locus. Workflow: DNA purification from a single larva (DNA IQ System, Promega); multiplex PCR to confirm the species (Multiplex PCR Kit, Qiagen); PCR with primers designed for microsatellites (Type-it Microsatellite PCR Kit, Qiagen); run on a capillary electrophoresis system (QIAxcel with high-resolution cartridge, Qiagen); sequencing to confirm band size. Microsatellite size was deduced by comparison with fragments of known length.

28 Molecular diagnostic of Cryptosporidium. Species-specific markers: 18S rDNA, COWP, actin, beta-tubulin, HSP70. Subtyping markers: glycoprotein GP60, minisatellites, microsatellites.

29 Molecular diagnostic of Giardia. Genotypes of Giardia are traditionally named Assemblages (A to H), which are identified by analysis of single or multiple loci. For the primary diagnosis of giardiasis: small subunit ribosomal RNA (SSU rRNA), beta-giardin, triosephosphate isomerase (TPI), intergenic spacer (IGS) regions. For further genotyping: SSU rRNA, glutamate dehydrogenase (GDH), TPI, beta-giardin, IGS region, elongation factor 1-alpha (EF1-alpha).

30

31 Giardia and domestic animals. Assemblage E is prevalent; Assemblage A mainly AI; Assemblage B almost absent. Assemblage F is prevalent; Assemblage C rare; Assemblage A mainly AI; Assemblage B rare. Assemblage C / D prevalent; Assemblage B rare; Assemblage A more common; mixed infections common. G. duodenalis assemblage-specific strains are common in their respective hosts. Some of the assemblage A and B subtypes that have been identified in animals are genetically identical to those found in humans, so at this level of resolution zoonotic transmission seems to be supported.

32 Giardia: from single to multiple loci The analysis of single loci suggests potential zoonotic transmission, but the level of resolution is low. What happens when multiple loci are used?

33 Multilocus Sequence Typing. For MLST, depending on the degree of discrimination required, a number of housekeeping genes of an isolate are amplified and sequenced on both strands. Each sequence variant within each gene is assigned a distinct allele number, and the combination of alleles within an isolate defines its allelic profile or sequence type (SQT). Housekeeping genes are typically constitutive genes that are required for the maintenance of basic cellular function.
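
The allele and sequence-type bookkeeping described for MLST can be sketched as follows. This is an illustration only: the function name `assign_sequence_types`, the locus names and the toy sequences are assumptions, and the numbers are assigned in order of first appearance rather than taken from any curated MLST scheme.

```python
def assign_sequence_types(isolates):
    """isolates: dict isolate_name -> dict locus -> sequence string.

    Returns (profiles, sequence_types):
      profiles[isolate] is a tuple of allele numbers, one per locus
      (loci in sorted order); sequence_types[isolate] is an integer
      shared by all isolates with the same allelic profile.
    """
    loci = sorted({locus for seqs in isolates.values() for locus in seqs})
    alleles = {locus: {} for locus in loci}  # sequence -> allele number
    st_numbers = {}                          # allelic profile -> ST number
    profiles, sequence_types = {}, {}
    for name, seqs in isolates.items():
        profile = []
        for locus in loci:
            seq = seqs[locus].upper()
            # A sequence seen for the first time gets the next free number.
            number = alleles[locus].setdefault(seq, len(alleles[locus]) + 1)
            profile.append(number)
        profile = tuple(profile)
        profiles[name] = profile
        sequence_types[name] = st_numbers.setdefault(profile, len(st_numbers) + 1)
    return profiles, sequence_types


# Hypothetical isolates typed at two loci.
toy = {
    "iso1": {"tpi": "ACGT", "gdh": "GGCC"},
    "iso2": {"tpi": "ACGT", "gdh": "GGCT"},
    "iso3": {"tpi": "ACGT", "gdh": "GGCC"},
}
print(assign_sequence_types(toy))
```

Isolates iso1 and iso3 share every allele and therefore the same sequence type, while iso2 differs at one locus and receives a new one.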

34 Zoonotic transmission doesn't seem to occur commonly. No evident clusters (in terms of hosts, sample origin, etc.) are observed for assemblage B.

35 Molecular diagnostic of Giardia. While molecular markers for assemblage A appear to produce robust and easy-to-read sequences, the allelic sequence heterozygosity (ASH) shown to exist at the single-cell level in assemblage B isolates, sometimes further complicated by other coinfecting assemblage B subgenotypes, makes precise identification impossible.

36 Molecular diagnostic of Giardia

37 You have to take into account the genome ploidy of your parasite! Ploidy is the number of sets of chromosomes in a cell, and hence the number of possible alleles of each gene.

38 Molecular diagnostic of Toxoplasma. Considered a single species in the genus Toxoplasma, with only 3 genetic types (I, II and III) showing limited variation. Type I isolates are uniformly lethal to outbred mice, while type II and III isolates are significantly less virulent. Now: genotypes #1 (type II), #2 (type III) and #3 (type II variant) worldwide and in Europe; genotypes #1, #2, #3, #4 and #5 in North America; genotypes #2 and #3 (type III and type II variant) in Africa; genotypes #9 (Chinese 1) and #10 (type I) in East Asia; highly variable atypical genotypes in South America. Atypical isolates often cause severe acute or disseminated toxoplasmosis in immunocompetent individuals.

39 Molecular diagnostic of Toxoplasma. Several single-copy genes, such as SAG1, SAG2 and GRA1, are used as PCR targets. To achieve high sensitivity, several multicopy target genes are used for detection in biological samples.

40 Molecular diagnostic of Toxoplasma. Genotyping in a single multiplex PCR assay using 15 microsatellite (MS) markers: 8 MS markers (TUB2, W35, TgM-A, B18, B17, M33, IV.1 and XI.1) differentiate types I, II and III from all the atypical genotypes; 7 MS markers (M48, M102, N60, N82, AA, N61 and N83) enhance genetic resolution in differentiating closely related isolates within a clonal lineage. Multilocus sequence typing (MLST) based on DNA sequence polymorphisms, including some alleles unique to the Brazil isolates (atypical isolates): 5′-SAG2, 3′-SAG2, BTUB, GRA6 and SAG3.

41 How to choose the target: 1. Look for specific sequences of my parasite. 2. Download these to my computer. 3. Analyse these on my computer or using free resources on the web. 4. Develop and test primers.

42 How to choose the target. The huge amount of sequence data available from many online databases offers a powerful resource for the design of primers and probes to be used in the detection of pathogens. Alignments of target sequences from related and unrelated organisms are used to aid the design of organism-specific detection assays, and the target sequence chosen is one that is specific for the organism of interest and does not show sequence variation within the organism. Nevertheless, isolates from field samples or clinical specimens can deviate significantly from the reference strain sequence reported in the database, as the reference strains represent single isolates. For this reason, the largest possible number of sequences should be considered in the design of the molecular probes and, if needed, in-house sequencing should be performed.

43 How to look for specific sequences in public databases. The huge amount of sequence data available from many online databases offers a powerful resource for the design of primers and probes to be used in the detection of pathogens. There are 3 major publicly available databases: one is maintained in the USA (NCBI), one in Europe (EBI) and one in Japan (DDBJ). The content of each database is always mirrored by the others.

44 Finding sequences in public databases: ENTREZ at Entrez/ SRS at dbest at nlm.nih.gov/ dbest/index.html SWISSPROT is available through expasy. huge.ch/ sprot/. Database searching: BLAST searches at BLAST/ BLAST2 is at nlm.nih.gov/ cgi-bin/blast/ nphnewblast?jform=1 PSI-BLAST is at nlm.nih.gov/ cgi-bin/ BLAST/ nph-psi_blast The TIGR database search engines are accessible through igr.org/ EBI Blast searches blast2/ [FASTA searches at ebi.ac.uk/ fasta3/] Sequence Search Site at The Sanger Centre genome sequencing teams have specific search engines for each project, see sanger.ac.uk/ Caenorhabditis elegans specific searches at sanger.ac.uk/ Projects/C_ elegans/. Genome computing/bioinformatics centres: EUROPEAN BIOINFORMATICS INSTITUTE cgi-bin/rbanner/ index.cgi SANGER CENTRE ac.uk/ WASHINGTON UNIVERSITY GENOME CENTRE THE INSTITUTE FOR GENOME RESEARCH tigr.org/. Multiple alignment tools on the WWW: CLUSTALW at the EBI at clustalw Multiple alignment bcm.tmc.edu:9331/ multi-align/ multialign.html see BCM Search Launcher at tmc.edu:8088/searchlauncher/launcher.html for additional tools. Restriction enzymes: New England Biolabs REBASE at rebase/rebase.htm Cut your DNA online at ccsi.com/ firstmarket/cutter

45 Search for: here you type the keywords of your search (one word or a combination of words). A scroll-down menu lets you select the database to search.

46 If I type Giardia, the engine will find sequences. If I focus my search by typing Babesia and ribosomal, then only 4270 entries contain the two keywords. If I type Trichinella, I will find entries! But if I type Trichinella and complete cds, the number becomes Why? Because most of the entries are partial sequences (like cDNAs).

47 Why is FASTA format important? Because you can download the sequences to your computer in a very compact format (text), and work with them in many software tools. For instance, to design primers it is important to use all the information (all the sequences) available for one particular locus. You need to obtain a MULTIPLE alignment of your sequences. How do you do that?
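
Because FASTA is plain text (a header line starting with ">" followed by sequence lines), it is easy to read programmatically. The sketch below is one possible minimal reader; the function name `parse_fasta` and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict: header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError("duplicate header: " + header)
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts).upper() for name, parts in records.items()}


example = """>isolate_A hypothetical record
acgtacgt
ttga
>isolate_B hypothetical record
acgtacgg
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```

Rejecting duplicate headers is a deliberate choice here: alignment programs need a distinct name for each sequence.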

48 Multiple alignment tool: CLUSTAL-W You can run CLUSTAL from the web; in that case you are asked to load your sequences from a file (that you have already prepared). The sequences must be in one of the recognized formats, and one (the simplest) format is FASTA.

49 Example. Let us imagine that you are interested in studying one locus of, say, Giardia duodenalis. You have searched the GenBank database using the keywords: Giardia AND triose phosphate isomerase. After checking the entries, you have downloaded the sequences you want to align in FASTA format and renamed them. You have created a Word file containing the sequences and saved it as text.

50 This is how your Word file will look:
>AF
tcaagtgtaanggctctcttgactttatcaagagccacgtggcggcaattgctgcccataagatccctgattccgtggacgtcgtcattgccccttccgccgtacacctgtcaacagccattgcggcaaacacgtcaaaacagttgaggat
agcagcgcagaatgtgtacctagaggggaacggggcgtggactggcgagacaagtgttgagatgcttcaggacatgggtttgaagcatgtgatagtagggcactctgaaagacgcagaatcatgggggagaccgacgagcaaagc
gccaagaaggctaagcgtgccctggaaaaggggatgacggtcatcttctgcgtcggagagaccttggacgagcgcaaggccaaccgcaccatggaggtgaacatcgcccagcttgaggcgcttggcaaggagctcggagagtcca
agatgctctggaaggaggttgtcat
>AF
tcaagtgtaacggctctcttgactttatcaagagccacgtggcggcaattgctgcccataagatccctgattccgtggacgtcgtcattgccccctccgccgtacacctgtcaacagccattgcggcaaacacgtcaaaacagttgaggat
agcagcgcagaatgtgtacctagaggggaacggggcgtggactggcgagacaagtgttgagatgcttcaggacatgggtttgaagcatgtgatagtagggcactctgaaagacgcagaatcatgggggagaccgacgagcaaagc
gccaagaaggctaagcgtgccctggaaaaggggatgacggtcatcttctgcgtcggagagaccttggatgagcgcaaggccaaccgcaccatggaggtgaacatcgcccagcttgaggcgcttggcaaggagctcggagagtcca
agatgctctggaaggaggttgtcattgcttacgagccc
>AF
ttaagtgcaacggctcgctcgactttatcaagagtcacgtgggggccattgctgcccacaagatccctgattccgtggacgttgttgtcgccccttctgccgtgcacctgtcaacagccattgcggcaaacacgtcaaagcagttgaagat
agcggcgcagaatgtgtacctagaggggaacggggcgtggaccggtgagacgagcgttgagatgctccaggacatgggcctagagcatgtgataatagggcactctgaaaggcgcagaatcatgggggagaccgacgagcaga
gcgccaggaaggcgaagcgcgctctagaaaaggggatgacggtcatcttctgcgtaggagagaccctggacgagcgcaaggccaaccgcaccatggaggtgaacatcgcccagcttgaggcgctcagcaaggagcttggagagt
cgaagatgctctggaagggagttgttattgcctacgagccc
>AF
ttaagtgtaacggctcncttgattttatcaagagccacgtggcggccattgctgcccacaagatccccgattccatagacgttgttgttgccccttctgccgtacatttatcaacagctattgcagcaaacacgtcaaaacagttgaagatag
cggcgcagaatgtgtacctagaggggaatggagcgtggactggtgagacgagtgttgagatgcttcaggacatgggcttggagtacgtgataatagggcattctgaaaggcgtagaatcatgggggagaccgacgagcagagtgcc
aagaaggctaagcgtgctctagaaaaggggatgacggttatcttttgtgttggagagacccttgatgagcgcaaggccaaccgcaccatggaggtaaacattgctcagcttgaggcgctcagcaaagagctcggggagtctaagctg
ctatggaagaaagtcgttattgcttacgagccc

51 Then you have to run ClustalX to get the multiple alignment. ClustalX will ask you to name the 2 output files it will produce: you choose the filename (for example GiardiaTPI) and Clustal will add the extensions .aln and .dnd. The .aln file is the alignment, while the .dnd file is a dendrogram. You can open the GiardiaTPI.aln file using Word and edit it.

52 CLUSTAL multiple sequence alignment
AF    TCAAGTGTAANGGCTCTCTTGACTTTATCAAGAGCCACGTGGCGGCAATTGCTGCCCATA
AF    TCAAGTGTAACGGCTCTCTTGACTTTATCAAGAGCCACGTGGCGGCAATTGCTGCCCATA
AF    TTAAGTGCAACGGCTCGCTCGACTTTATCAAGAGTCACGTGGGGGCCATTGCTGCCCACA
AF    TTAAGTGTAACGGCTCNCTTGATTTTATCAAGAGCCACGTGGCGGCCATTGCTGCCCACA
      * ***** ** ***** ** ** *********** ******* *** *********** *
AF    AGATCCCTGATTCCGTGGACGTCGTCATTGCCCCTTCCGCCGTACACCTGTCAACAGCCA
AF    AGATCCCTGATTCCGTGGACGTCGTCATTGCCCCCTCCGCCGTACACCTGTCAACAGCCA
AF    AGATCCCTGATTCCGTGGACGTTGTTGTCGCCCCTTCTGCCGTGCACCTGTCAACAGCCA
AF    AGATCCCCGATTCCATAGACGTTGTTGTTGCCCCTTCTGCCGTACATTTATCAACAGCTA
      ******* ****** * ***** **  * ***** ** ***** **  * ******** *
AF    TTGCGGCAAACACGTCAAAACAGTTGAGGATAGCAGCGCAGAATGTGTACCTAGAGGGGA
AF    TTGCGGCAAACACGTCAAAACAGTTGAGGATAGCAGCGCAGAATGTGTACCTAGAGGGGA
AF    TTGCGGCAAACACGTCAAAGCAGTTGAAGATAGCGGCGCAGAATGTGTACCTAGAGGGGA
AF    TTGCAGCAAACACGTCAAAACAGTTGAAGATAGCGGCGCAGAATGTGTACCTAGAGGGGA
      **** ************** ******* ****** *************************
Asterisks mark the positions that are identical in all the sequences present in the alignment.
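
To make the meaning of the asterisk line concrete, here is a small sketch that recomputes it for the first 60 columns of the alignment on this slide. The function name `conservation_line` is an assumption, and the sketch is not how Clustal itself is implemented.

```python
def conservation_line(aligned):
    """Return a string with '*' where all aligned sequences are identical.

    Columns that differ, or that consist of gaps, are marked with a space.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# First block of the Giardia TPI alignment shown on the slide.
block = [
    "TCAAGTGTAANGGCTCTCTTGACTTTATCAAGAGCCACGTGGCGGCAATTGCTGCCCATA",
    "TCAAGTGTAACGGCTCTCTTGACTTTATCAAGAGCCACGTGGCGGCAATTGCTGCCCATA",
    "TTAAGTGCAACGGCTCGCTCGACTTTATCAAGAGTCACGTGGGGGCCATTGCTGCCCACA",
    "TTAAGTGTAACGGCTCNCTTGATTTTATCAAGAGCCACGTGGCGGCCATTGCTGCCCACA",
]
line = conservation_line(block)
for seq in block:
    print(seq)
print(line)
print(line.count("*"), "of", len(line), "positions are identical")
```

Note that an ambiguous base (N) counts as a difference in this sketch, so a column containing N is never marked as identical.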

53 Design primers. You can then use the information in multiple alignments to: look for conserved regions (if you want to design primers that can bind and amplify all species); look for variability (if you want to push the system towards some degree of specificity). TEST them, and good luck!
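
As a rough illustration of "look for conserved regions", the sketch below scans an alignment for windows in which every sequence is identical, which are candidate binding sites for primers meant to amplify all the aligned sequences. The function name `conserved_windows`, the window length and the toy alignment are assumptions; real primer design also has to consider melting temperature, GC content, secondary structure and specificity, which are not modelled here.

```python
def conserved_windows(aligned, length=18):
    """Return 1-based start positions of fully conserved windows.

    A window qualifies when, at each of its `length` columns, all
    sequences carry the same unambiguous base (A, C, G or T).
    """
    aligned = [s.upper() for s in aligned]
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    starts = []
    run = 0  # number of consecutive conserved columns ending here
    for i, column in enumerate(zip(*aligned)):
        same = len(set(column)) == 1 and column[0] in "ACGT"
        run = run + 1 if same else 0
        if run >= length:
            starts.append(i - length + 2)  # 0-based end -> 1-based start
    return starts


# Hypothetical toy alignment; columns 4 to 10 are conserved.
toy = ["AAACCCGGGTTT", "AAACCCGGGTTT", "AATCCCGGGTAT"]
print(conserved_windows(toy, length=6))
```

Inverting the test (keeping windows that contain differences between groups of sequences) would be the starting point for the opposite goal, primers with some degree of specificity.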

54 But what if we need to do more? Ex. 1: only a few genes of a specific parasite are available (i.e. ribosomal and housekeeping genes), but new genetic markers useful for epidemiological studies are needed. How can I identify them? Ex. 2: you noted that some strains of parasite species X apparently circulate only in humans, but the genome of just one isolate is available. Is it possible to identify potential determinants of host specificity? One approach is to compare informative parasite isolates at the level of the whole genome, with the objective of identifying genetic variations that may be responsible for the observed phenotypic differences.

55 Next-generation sequencing (NGS). NGS methods provide cheap and reliable large-scale DNA sequencing. They are used extensively for de novo sequencing, for disease mapping, for quantifying expression levels through RNA sequencing and in population genetic studies. In NGS methods, a whole genome, or targeted regions of the genome, is randomly broken into small fragments that are sequenced as short reads; the reads are then either aligned to a reference genome or assembled.

56 Overview of the workflow of an NGS experiment. Step 1: planning the experiment. Step 2: library preparation (wet lab). Step 3: sequencing (wet lab). Step 4: data analysis (dry lab).

57 One can look at many things in the genome, e.g.: presence/absence of specific genes; distribution of SNPs (in genes, in intergenic regions); distribution of synonymous versus nonsynonymous substitutions (genes under selection); repetitive sequences.
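
For the synonymous versus nonsynonymous distinction, a minimal sketch is given below. It assumes the standard genetic code (which does not hold for every genome, for example many mitochondrial genomes), and the names `CODON_TABLE` and `classify_substitution` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order;
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Say whether changing ref_codon into alt_codon alters the amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


# GAA and GAG both encode glutamate; GTA encodes valine.
print(classify_substitution("GAA", "GAG"))
print(classify_substitution("GAA", "GTA"))
```

Counting the two classes across a gene in many isolates is the basis for asking whether that gene is under selection.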