BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES


1 BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

2 We sequenced and assembled a genome, but this is only a long stretch of A, T, C and G. What should we do now? 1. Find genes

3 Gene calling The simplest approach is to look for ORFs (open reading frames). An ORF is a stretch of DNA that starts with a start codon and ends with a stop codon. Our goal is to call (that is, to find) the ORFs in a genome sequence

4 ORF calling So we need software that will recognize start and stop codons. These usually are: ATG = start (methionine); TGA, TAA, TAG = stop

5 ORF calling So we need software that will recognize start and stop codons in all six possible frames (three on each strand)
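
A minimal sketch of six-frame ORF calling, assuming the standard start codon ATG and the stop codons TAA, TAG and TGA listed above. The function names and the `min_len` parameter are illustrative, not taken from any specific gene caller, and real tools handle alternative genetic codes, nested ORFs and introns, which this sketch ignores.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) offsets of ORFs found in one reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START:
            start = i                      # first start codon opens the ORF
        elif start is not None and codon in STOPS:
            yield start, i + 3             # stop codon closes the ORF
            start = None


def find_orfs(seq, min_len=6):
    """Return (strand, frame, start, end, orf) tuples for all six frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    results.append((strand, frame, start, end, s[start:end]))
    return results


print(find_orfs("CCATGAAATAGCC"))
```

On this toy sequence the only ORF is ATGAAATAG, in the third frame of the forward strand.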

6 Gene calling Sounds pretty easy... however there are some issues

7 Gene calling issues 1. The genetic code is NOT really universal, so we need to know which variant of the code our organism follows. 2. Eukaryotes have introns. Rules for intron/exon boundaries vary among species, so we will need software that is suited to our organism

8 Gene calling Easy and straightforward Fundamental to use the right software Which is in general a good rule for bioinformatics

9 Gene Annotation The process to assign a name and a function to each of our genes This is done by comparing each gene in our genome to a database, to detect a gene that is similar enough for us to say that our novel sequence has the same function

10 Gene Annotation ...comparing each gene in our genome... When I want to compare two sequences, or two sets of sequences, I use the NCBI BLAST algorithm

11 Gene Annotation - BLAST BLAST means Basic Local Alignment Search Tool It can be used online or offline Offline is better for entire genomes It is fast and accurate It is highly customizable It outputs hits with a score, indicating the strength of the similarity

12 Gene Annotation - BLAST It is highly customizable. Four main algorithms, with varying inputs: combinations of nucleotides and proteins in the input and in the database sequences

13 Gene Annotation - BLAST It is highly customizable

14 Gene Annotation - BLAST Even more customizable offline

15 Gene Annotation - BLAST Even more customizable offline. We can set a number of parameters, such as: Cost of a gap (how large a score penalty a gap in the alignment causes); % identity between the query and the database hit; Output format (for example a table). The most important parameter is possibly the E value
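
As an illustration of an offline search, the sketch below builds a command line for the BLAST+ program `blastp` with an E value threshold and tabular output (`-outfmt 6`). The file and database names (`genes.faa`, `nr`, `hits.tsv`) are placeholders, and the helper function is hypothetical. It only assembles the argument list, so it runs even where BLAST+ is not installed.

```python
def build_blastp_command(query_fasta, database, out_path, evalue=1e-5):
    """Build the argument list for an offline protein-vs-protein BLAST search."""
    return [
        "blastp",
        "-query", query_fasta,     # FASTA file with our predicted proteins
        "-db", database,           # name of a formatted BLAST protein database
        "-evalue", str(evalue),    # report only hits at or below this E value
        "-outfmt", "6",            # tabular output, one hit per line
        "-out", out_path,          # where to write the hit table
    ]


cmd = build_blastp_command("genes.faa", "nr", "hits.tsv")
print(" ".join(cmd))
```

With BLAST+ installed and a formatted database available, the list could be passed to `subprocess.run(cmd, check=True)`.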

16 Gene Annotation - BLAST E value The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. So very low E values indicate a very low probability that the hit was found because of a random similarity between the two sequences. High E values indicate a high probability of a random hit

17 Gene Annotation - BLAST E value So an E value of 1 is VERY BAD! The E value depends strictly on the database size: a bigger database contains more sequences, and thus more sequences that will be randomly similar to the input. 10^-5 is widely considered a stringent E value. HOWEVER the parameter must be set based on the task
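
The dependence on score and database size can be made concrete with the usual simplified relation E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. This is a sketch under that assumption: BLAST itself uses corrected "effective" lengths, so real E values differ somewhat. The function name and the numbers are illustrative only.

```python
def expect_value(bit_score, query_len, db_len):
    """Simplified E value: expected number of chance hits with this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


small_db = expect_value(bit_score=50, query_len=300, db_len=1e9)
large_db = expect_value(bit_score=50, query_len=300, db_len=2e9)
better_hit = expect_value(bit_score=51, query_len=300, db_len=1e9)

print(small_db, large_db, better_hit)
```

Doubling the database size doubles E, while every additional bit of score halves it. The same alignment can therefore pass a threshold against a small custom database and fail it against a very large one.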

18 Gene Annotation Databases...comparing each gene in our genome to a database... Which database? There are multiple possibilities some are very general, some are species-specific

19 Gene Annotation Databases NR = non-redundant NCBI database (proteins); NT = non-redundant NCBI database (nucleotides); UCSC Genome Browser for human genes; COG = clusters of orthologous genes; FlyBase for Drosophila; RAST for bacteria

20 Gene Annotation Databases Multiple possibilities. The choice should be made carefully, based on the organism, and multiple databases should be compared when possible. A specific database can be generated for the task, for example based on NCBI searches

21 Gene Annotation...comparing each gene in our genome to a database, to detect a gene that is similar enough for us to say that our novel sequence has the same function How much is enough?

22 The case of the creeping Fox terrier clone Stephen Jay Gould Essay contained in the book 'Bully for Brontosaurus' (1992)

23 The case of the creeping Fox terrier Clone We may imagine the earliest herds of horses in the lower Eocene as resembling a lot of Fox-Terriers in size... HF Osborn was the first to use this comparison in 1905, and since then most of the books started using it Do authors really know the size of a Fox-terrier? Or are they just copying the old comparison?

24 The case of the creeping Fox terrier clone When we use comparisons to annotate our genes, we need to be careful How many times has this comparison been used starting from a gene with a function that was experimentally determined?

25 Avoid misannotation when possible Use multiple databases. Use stringent BLAST parameters. Double-check the important genes (we cannot check them all, since we are working high-throughput)

26 Genes can be useful for many tasks A couple of examples Evaluating the metabolic potential of the newly sequenced genome Determining the phylogenetic position of the organism

27 From genes to metabolisms The presence of a gene often indicates an active enzyme. If all enzymes of a pathway are present, our organism can very probably complete the pathway. Specific software can reconstruct metabolic pathways, or cellular structures
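
The "all enzymes present" rule can be sketched as a simple set comparison. The enzyme identifiers below are made up for illustration, and real pathway reconstruction tools also account for alternative enzymes and pathway variants, which this sketch does not.

```python
def pathway_completeness(pathway_enzymes, annotated_enzymes):
    """Return (fraction of pathway enzymes found, sorted list of missing ones)."""
    required = set(pathway_enzymes)
    found = required & set(annotated_enzymes)
    missing = sorted(required - found)
    return len(found) / len(required), missing


# Hypothetical pathway of four enzymes and a hypothetical annotation result.
pathway = ["E1", "E2", "E3", "E4"]
genome_annotation = {"E1", "E3", "E9"}

fraction, missing = pathway_completeness(pathway, genome_annotation)
print(fraction, missing)
```

A fraction of 1.0 would suggest the organism can probably complete the pathway. Missing enzymes are candidates for a closer look, since they may simply have escaped annotation.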

28 From genes to metabolisms KEGG Kyoto encyclopedia of genes and genomes

29 Phylogenetics Phylogenetics is the study of the evolutionary history of organisms

30 Phylogeny All organisms have a common ancestor Similarly to genealogy, phylogeny aims at reconstructing a 'tree' Reconstruction of EVOLUTION, using differences and common traits

31 Phylogeny

32 Phylogeny Reconstruction of EVOLUTION, using differences and common traits Originally it was based on morphology

33 Phylogenetics To study phylogenies we need hereditary characters that group and separate the units present in our dataset We can use morphology, but...

34 Phylogenetics We can use morphology, but interpretation and analogy can provide false evidence or make our results noisy. Advanced techniques of geometric morphometrics can work

35 Phylogenetics It is important to use traits that are Homologous: that derive from a common ancestor. And not Analogous: that evolved independently, in a process of convergent evolution

36 Fins in this dataset are an analogous character (fish and whales). Endothermy is a homologous character (mammals)

37 Phylogeny ATCTTC TG ATCTTCATG ACCTTCATG ATCTGCATG ATCAGCATG ATCTGCATG ATCCGCATG Reconstruction of EVOLUTION, using DNA sequences DNA is perfect, because it is a digital character that is not influenced by interpretation

38 Phylogeny We do not have information on the ancestors!! ATCTTC TG ACCTTCATG ATCAGCATG ATCCGCATG So we need to infer the evolution of the character

39 Phylogenetics DNA is perfect, because it is a digital character that is not influenced by interpretation. With single genes: phylogenetic analyses. With entire genomes: phylogenomic analyses. More characters = more power

40 Phylogenetics More characters = more power I can discriminate ancient events where the noise is very strong

41 Phylogenetics More characters = more power I can discriminate extremely recent events, where the variation between the different taxa is extremely low

42 Phylogenetics 1. sequence and assemble genome 2. extract genes 3. obtain genes of other organisms from a database 4. align them 5. run a phylogenetic software

43 Phylogenomics obtain genes Finding homologous genes is not enough: the genes that we want are called orthologous genes. Orthologous: genes that derive from a speciation event. Paralogous: genes that derive from a duplication event

44 Phylogenomics obtain genes The software OrthoMCL is an example of a tool to obtain orthologous genes. This software 1. compares all the genes of all the organisms in a dataset (bidirectional BLAST hit) 2. uses a Markov Cluster algorithm to create networks to determine orthologous genes

45 Phylogenomics obtain genes To obtain orthologous genes: bidirectional BLAST hit. The software only accepts the gene pairs for which each of the two genes is the best hit in the BLAST search against the other genome
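
The bidirectional best hit criterion can be sketched in a few lines. This is not OrthoMCL's implementation, only the reciprocal filter described above. Hits are assumed to be (query, subject, bit score) triples, for example parsed from a tabular BLAST output, and the gene names and scores are invented.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep only the pairs where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented hits: genome A searched against genome B, and the reverse search.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 80)]
b_vs_a = [("b1", "a1", 198), ("b2", "a1", 140), ("b2", "a2", 60)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a2 and a3 each have a best hit in genome B, but that hit prefers a1 in the reverse search, so only the pair (a1, b1) is accepted.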

46 Align the genes Orthologous genes are generally very similar, so they can be aligned by software such as MUSCLE

47 Phylogeny Starting from an alignment we can use specific algorithms with the goal of understanding the evolutionary relationships between our organisms. A number of such phylogenetic algorithms exist. Maximum Likelihood methods: try to find the evolutionary tree that is most likely to explain the variation present in the dataset. Maximum parsimony methods: try to find the tree that explains the variation with the smallest number of evolutionary changes
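
To make the parsimony idea concrete, the sketch below counts the minimum number of changes a given tree requires, using Fitch's small-parsimony algorithm. It scores candidate trees rather than searching for the best one, and it assumes rooted binary trees written as nested tuples with aligned sequences at the leaves. The four sequences are taken from the toy alignment shown earlier; the two tree shapes are made up for comparison.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, changes needed) for one column."""
    if isinstance(tree, str):                  # leaf: the observed nucleotide
        return {tree[site]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, site)
    right_states, right_changes = fitch_site(right, site)
    common = left_states & right_states
    if common:                                 # children agree: no new change
        return common, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def first_leaf(tree):
    """Return the leftmost leaf sequence, used to read the alignment length."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def parsimony_score(tree):
    """Total number of changes the tree needs over all alignment columns."""
    return sum(fitch_site(tree, i)[1] for i in range(len(first_leaf(tree))))


s1, s2, s3, s4 = "ATCTTCATG", "ACCTTCATG", "ATCAGCATG", "ATCCGCATG"
tree_a = ((s1, s2), (s3, s4))
tree_b = ((s1, s3), (s2, s4))

print(parsimony_score(tree_a), parsimony_score(tree_b))
```

Tree A needs 4 changes and tree B needs 5, so a maximum parsimony method would prefer tree A among these two candidates.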

48 Phylogenetics 1. sequence and assemble genome 2. extract genes 3. obtain genes of other organisms from a database 4. align them 5. run a phylogenetic software When we work with very similar genomes, orthologous genes can contain too little information, as they are too similar: we need higher resolution

49 Phylogenetics 1. sequence and assemble genome 2. extract genes 3. obtain genes of other organisms from a database 4. align them 5. run a phylogenetic software

50 SNPs analysis When we work with very similar genomes, orthologous genes can contain too little information, as they are too similar: we need higher resolution. We need to work at a 'lower' level, not on genes, but on single positions: Single Nucleotide Polymorphisms. This approach makes it possible to detect single variations - in highly variable genes excluded from the orthology analysis - in intergenic regions

51 SNPs ANALYSIS how to We sequence and assemble our genomes (contigs). We align them to a REFERENCE GENOME. REFERENCE GENOME: a closely related genome that we can use as a blueprint, as a reference, to compare our novel genomes to. Variations between the reference and our novel genomes can be recorded and used for comparison purposes, such as SNP-based phylogeny

52 SNPs ANALYSIS how to Alignment of the genomes to a REFERENCE GENOME. This can be done using specific software: MAUVE does the alignment and gives us the SNPs

53 SNPs ANALYSIS While the phylogenomic approach can exclude important information, the SNPs approach may include areas with questionable alignment. To avoid this problem not all variations are used, but just the CORE SNPs: SNPs that are flanked on both sides by identical nucleotides in all the genomes of our alignment. This makes it possible to obtain a dataset that is precise and informative
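
A minimal sketch of core SNP extraction from a multiple alignment, following the definition above. The width of the conserved flank is not stated in the slides, so the `flank` parameter (default 1) is an assumption, as are the function names and the toy alignment. Columns containing gaps or ambiguous bases are never treated as SNPs or as conserved flanks.

```python
NUCLEOTIDES = set("ACGT")


def column_bases(alignment, col):
    """Set of characters found in one alignment column."""
    return {seq[col] for seq in alignment}


def core_snps(alignment, flank=1):
    """Return columns that vary and are flanked by fully conserved columns."""
    length = len(alignment[0])
    snps = []
    for col in range(flank, length - flank):
        bases = column_bases(alignment, col)
        if len(bases) < 2 or not bases <= NUCLEOTIDES:
            continue                            # not a clean SNP column
        neighbours = list(range(col - flank, col)) + list(range(col + 1, col + flank + 1))
        if all(
            len(column_bases(alignment, n)) == 1
            and column_bases(alignment, n) <= NUCLEOTIDES
            for n in neighbours
        ):
            snps.append(col)
    return snps


alignment = [
    "ACGTACGTACGG",
    "ACGTTCGTACGG",
    "ACGTACGTCTGG",
]
positions = core_snps(alignment)
# One short string of SNP characters per genome, usable as phylogeny input.
snp_strings = ["".join(seq[p] for p in positions) for seq in alignment]
print(positions, snp_strings)
```

Column 4 is kept because both neighbours are identical in all genomes. Columns 8 and 9 both vary, so each spoils the other's flank and neither is reported.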

54 SNPs ANALYSIS SNPs analysis makes it possible to detect minimal differences between very similar genomes. It is the analysis of choice when working in human genomics, and in general in the genomics of model systems. With the increase of available genomes, it has also become the method of choice for bacterial genomics of single species, a field that is called genomic epidemiology

55 Phylogenetics with SNPs 1. sequence and assemble genome 2. align the contigs to a reference genome 3. extract core SNPs 4. run a phylogenetic software However there is a FASTER alternative

56 Phylogenetics with SNPs 1. sequence and assemble genome 2. align the contigs to a reference genome 3. extract core SNPs 4. run a phylogenetic software However there is a FASTER alternative: mapping the reads directly to the REFERENCE GENOME

57 MAPPING OF THE READS The assembly of the reads into a genome is not the only way. Assembly from reads = DE NOVO, as in 'try to generate a novel genome de novo, without previous information'. An alternative that is very useful in specific situations is mapping the reads to a genome we already know, indicated, again, as REFERENCE GENOME

58 MAPPING OF THE READS Mapping means using a bioinformatic algorithm to determine at what position of a previously sequenced REFERENCE GENOME we can locate our reads, without assembling them
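
The idea of mapping can be sketched with a seed-and-verify approach: index every k-mer of the reference, look up the first k bases of each read, then count mismatches over the full read length. This is a toy illustration, not the algorithm of any particular read mapper. It ignores the reverse strand, insertions, deletions and base qualities. The reference below is just the three toy sequences from the phylogeny slides joined together.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for each acceptable placement of a read."""
    hits = []
    for pos in index.get(read[:k], []):        # seed: exact match of first k bases
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue                           # read would run off the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ATCTTCATGACCTTCATGATCAGCATG"
index = build_index(reference, k=4)

print(map_read("ACCTTCATG", reference, index, k=4))   # exact match
print(map_read("ACCTTGATG", reference, index, k=4))   # one mismatch: a candidate SNP
```

A mismatch between a well-supported read placement and the reference is exactly the kind of variation that is later recorded as a SNP.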


60 Phylogenetics with MAPPING 1. sequence genome 2. map the reads to the reference 3. extract core SNPs 4. run a phylogenetic software Perfect for big genomes (less computational power needed) and also useful for finding variations for genomics of alleles... and for transcriptomics

61 Genomic epidemiology Tracing the origin of epidemic outbreaks: whole-genome sequencing and the microevolution of pathogenic agents

62 Genomic epidemiology Molecular typing of pathogens Important in microbiology to classify bacteria at the subspecies level: find virulent clones... Analysis of a single gene (e.g. 16S rDNA): ~ 1000 bp, ~ 50 Euro; MLST: ~ 4000 bp, ~ 300 Euro; Whole genome sequencing: 1-5 million bp, Euro (plasmids included), NOW

63 Genomic epidemiology WHOLE GENOME typing of pathogens Approaches and advantages Thousands of characters to discriminate between different strains. Comparative genomics can be used to study the origin of phenotypic traits and host/environment adaptation mechanisms. Not only classification/clustering of microorganisms but also reconstruction of their evolutionary history thanks to phylogeny

64 Genomic epidemiology WHOLE GENOME typing of pathogens Approaches and advantages Small genomic changes can be used to track the spread of a pathogen in different time and space scales This makes WGS the perfect tool for investigation

65 Genomic epidemiology Example in medical epidemiology Klebsiella pneumoniae The model: Klebsiella pneumoniae is a nosocomial pathogen, known for its multiple resistances to antibiotics, usually carried by PLASMIDS. The plasmid gene KPC gives resistance to carbapenem antibiotics. The problem: resistance to carbapenem antibiotics has rapidly spread in Italy in the last few years. How has this happened?

66 Genomic epidemiology THE APPROACH 1. K. pneumoniae isolates of various antibiotic resistance profiles were collected from 5 Italian hospitals 2. Whole genome sequencing using the MiSeq machine from Illumina 3. Genome assembly using the software MIRA 4. Comparative genomics and phylogeny

67 Genomic epidemiology project GLOBAL phylogeny All available K. pneumoniae genomes from all over the world (n=230) were added to the database, for a total of 319 genomes. Multiple genomic alignment, based on several pairwise alignments (using Mauve). Extraction of single nucleotide polymorphisms (SNPs) with an in-house suite of scripts (Python, Perl, R, shell)

68 Genomic epidemiology project GLOBAL phylogeny 94,812 core SNPs detected. Core SNPs are one-base mutations in genomic regions present in all genomes of the alignment. SNP phylogeny: Maximum Likelihood, 100 bootstrap replicates (RAxML software)

70 Genomic epidemiology project GLOBAL phylogeny

71 Branch length in phylogenies Phylogenetic trees contain the information of the phylogenetic relationship between the analyzed organisms However they can also contain the information of how 'distant' the different organisms are This information can be shown as branch length

72 Genomic epidemiology project GLOBAL phylogeny

73 Genomic epidemiology project GLOBAL phylogeny 203 genomes cluster here! THEY FORM THE CLONAL COMPLEX 258 (CC258) (i.e. all of them have Multilocus Sequence Type 258 or single mutations of it). 97% of them have the gene blaKPC. Only 4% of the others have the gene

74 Genomic epidemiology project GLOBAL phylogeny Why are almost all blaKPC-positive genomes in CC258? Maybe the plasmid cannot be transferred... NO! Plenty of evidence in the literature of plasmid transfer. Maybe there is a genomic reason: A) a genomic element of CC258 acts as a plasmid magnet B) genomic traits make these strains highly virulent and/or highly fit (so that they are massively isolated worldwide)

75 Genomic epidemiology project recombination events Two genomic areas with high SNP density were detected. Are they recombinations? PHYLOGENY OF PUTATIVE RECOMBINATIONS Yes, they are!

76 Genomic epidemiology project recombination events ~5.6 Mb ~1.3 Mb ~1.1 Mb ~300 Kb

77 Genomic epidemiology project the Klebsiella hopeful monster Recombinations as evolutionary leaps: CC258 derived from giant genomic 'fusions', with a high fitness, as indicated by its global spread in all hospitals around the world, in less than 30 years... sounds like punctuated equilibrium! Commentary in mBio

78 Genomic epidemiology project GLOBAL phylogeny Four Italian clades in the CC258 Four different diffusion events in Italy

79 Genomic epidemiology project molecular clock Date the nodes, date the 4 events of entrance into Italy. Method used: Bayesian inference (BEAST)

80 Genomic epidemiology project molecular clock Recombination events were also dated

81 Outbreak reconstruction almost forensic genomics Outbreak of CC258 K. pneumoniae in a hospital in northern Italy. 7 genomes (that fit in one of the four Italian clades). Using DATES of isolation and SNPs it is possible to reconstruct the spreading route of the pathogen

82 Outbreak reconstruction almost forensic genomics Whose fault is it? Star-like diffusion. The diffusion does not correlate with the arrangement of the beds. The pathogen is likely to be carried around by the hospital staff: a better safety protocol is needed. In addition, comparative genomics shows that the isolates from the seven patients do not present any specific virulence or resistance factor that makes them different from other strains from the same hospital.