BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES


1 BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

2 We sequenced and assembled a genome, but this is only a long stretch of A, T, C and G. What should we do now? 1. Find genes. What are the start and end points of a gene?

3 Gene calling: The simplest thing is to look for ORFs (open reading frames). An ORF is a stretch of DNA that starts with a start codon and ends with a stop codon. Our goal is to call (which means to find) the ORFs in a genome sequence.

4 ORF calling: So we need software that recognizes start and stop codons. These usually are: ATG = start (methionine); TGA, TAA, TAG = stop.

5 ORF calling: So we need software that detects start and stop codons in all six possible reading frames (three on each strand).
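The six-frame search described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard start codon ATG and the three standard stop codons; the function names and the `min_codons` parameter are made up for this example, and real gene callers do much more (alternative genetic codes, introns, statistical models).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six frames for stretches that go from ATG to the next in-frame stop.

    Returns tuples (strand, frame, start, end, orf_sequence). Coordinates are
    0-based, end-exclusive, and refer to the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first start codon of a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


for orf in find_orfs("CCATGAAATAGGG"):
    print(orf)
```

An ORF that starts but never reaches a stop codon before the end of the sequence is ignored here, which is a simplification: on real contigs, genes can be truncated at the contig ends.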

6 Gene calling Sounds pretty easy... however there are some issues

7 Gene calling issues: 1. The genetic code is NOT really universal, so we need to know which variation of the code our organism follows. 2. Eukaryotes have introns. Rules for intron/exon boundaries vary among species, so we will need software that is suited to our organism.

8 Gene calling: Easy and straightforward, but it is fundamental to use the right software, which is in general a good rule for bioinformatics.

9 Gene Annotation: The process of assigning a name and a function to each of our genes. This is done by comparing each gene in our genome to a database, to detect a gene that is similar enough for us to say that our novel sequence has the same function. Do you know any software that compares sequences?

10 Gene Annotation: ...comparing each gene in our genome... When I want to compare two sequences, or two sets of sequences, I use the NCBI BLAST algorithm.

11 Gene Annotation - BLAST: BLAST means Basic Local Alignment Search Tool. It can be used online or offline; offline is better for entire genomes. It is fast and accurate. It is highly customizable. It outputs hits with a score indicating the strength of the similarity.

12 Gene Annotation - BLAST: It is highly customizable. Four main algorithms, with varying inputs: combinations of nucleotides and proteins in the input and in the database sequences.

13 Gene Annotation - BLAST It is highly customizable

14 Gene Annotation - BLAST The local version is even more customizable

15 Gene Annotation - BLAST: We can set a number of parameters, such as: cost of a gap (how much negative score a gap in the alignment causes); % identity between the query and the database sequence; output format (for example a table). The most important parameter is possibly the E value.
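As an illustration of working with the table output, the sketch below filters hits by E value and % identity. It assumes the 12-column tabular layout produced by BLAST+ with `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The hit lines, the subject names and the default thresholds are invented for the example, not real results.

```python
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, max_evalue=1e-5, min_identity=30.0):
    """Keep only hits with evalue <= max_evalue and pident >= min_identity."""
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        if float(hit["evalue"]) <= max_evalue and float(hit["pident"]) >= min_identity:
            kept.append(hit)
    return kept


# Hypothetical hits, written by hand for illustration only.
example = "\n".join([
    "\t".join(["gene_001", "hypothetical_subject_A", "98.5", "300", "4", "0",
               "1", "300", "1", "300", "1e-150", "550"]),
    "\t".join(["gene_001", "hypothetical_subject_B", "35.0", "120", "70", "3",
               "10", "130", "5", "125", "0.5", "32"]),
    "\t".join(["gene_002", "hypothetical_subject_C", "25.0", "200", "140", "5",
               "1", "200", "1", "200", "1e-8", "60"]),
])

for hit in filter_hits(example):
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

The thresholds are deliberately exposed as parameters, because the right values depend on the task.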

16 Gene Annotation - BLAST E value: The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. So very low E values indicate a very low probability that the hit was found because of a random similarity between the two sequences. High E values indicate a high probability of a random hit.

17 Gene Annotation - BLAST E value: So an E value of 1 is VERY BAD! It depends directly on the database size: a bigger database contains more sequences, and thus more sequences that will be randomly similar to the input. 10^-5 is widely considered a stringent E value, HOWEVER the parameter must be set based on the task.
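The dependence on score and database size can be made concrete with the standard relationship between E value and bit score, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below is a simplification: it uses raw lengths, whereas BLAST itself works with corrected "effective" lengths, and the numbers in the example are arbitrary.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same hit (bit score 40, query of 300 residues) against databases of growing size.
for db_size in (1e6, 1e9, 1e12):
    print(f"database length {db_size:.0e}: E = {expect_value(40, 300, db_size):.3g}")
```

Each extra bit of score halves E, while a database that is twice as large doubles it, which is why the same alignment can pass a threshold against a small custom database and fail it against a very large one.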

18 Gene Annotation Databases: ...comparing each gene in our genome to a database... Which database? There are multiple possibilities: some are very general, some are species-specific.

19 Gene Annotation Databases: NR = non-redundant NCBI database (proteins); NT = non-redundant NCBI database (nucleotides); UCSC Genome Browser for human genes; COG = cluster of orthologous genes; FlyBase for Drosophila; RAST for bacteria.

20 Gene Annotation Databases: Multiple possibilities. The choice should be made carefully, based on the organism, and multiple databases should be compared when possible. Specific databases can be generated LOCALLY for the task, for example based on NCBI searches.

21 Gene Annotation: ...comparing each gene in our genome to a database, to detect a gene that is similar enough for us to say that our novel sequence has the same function. How much is enough?

22 The case of the creeping Fox terrier clone: Stephen Jay Gould. Essay contained in the book 'Bully for Brontosaurus' (1992).

23 The case of the creeping Fox terrier clone: "We may imagine the earliest herds of horses in the lower Eocene as resembling a lot of Fox-Terriers in size..." HF Osborn was the first to use this comparison in 1905, and since then most books have used it. Do authors really know the size of a Fox terrier? Or are they just copying the old comparison?

24 The case of the creeping Fox terrier clone: When we use comparisons to annotate our genes, we need to be careful. How many steps of copied annotation separate our gene from a gene whose function was experimentally determined?

25 Avoid misannotation when possible: Use multiple databases. Use stringent BLAST parameters. Double-check the important genes (we cannot check them all, we are working high-throughput).

26 Genes can be useful for many tasks. A couple of examples: evaluating the metabolic potential of the newly sequenced genome; determining the phylogenetic position of the sequenced organism.

27 From genes to metabolisms: The presence of a gene often indicates an active enzyme. If all enzymes of a pathway are present, our organism can very probably complete the pathway. Specific software can reconstruct metabolic pathways, or cellular structures.
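The reasoning "all enzymes present, so the pathway is probably complete" amounts to a set comparison. The sketch below shows the idea with placeholder enzyme labels (`enzyme_A` and so on are hypothetical, not real identifiers); dedicated tools work from curated pathway definitions instead.

```python
def pathway_completeness(required_enzymes, annotated_enzymes):
    """Return (fraction of required enzymes found, sorted list of missing ones)."""
    required = set(required_enzymes)
    present = required & set(annotated_enzymes)
    missing = sorted(required - present)
    return len(present) / len(required), missing


# Hypothetical pathway definition and hypothetical annotation of our genome.
pathway = ["enzyme_A", "enzyme_B", "enzyme_C"]
genome_annotation = {"enzyme_A", "enzyme_C", "enzyme_X"}

fraction, missing = pathway_completeness(pathway, genome_annotation)
print(f"{fraction:.0%} of the pathway found, missing: {missing}")
```

A missing enzyme does not prove the pathway is absent: the gene may have been missed by the gene caller, misannotated, or replaced by an unrelated enzyme with the same function.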

28 From genes to metabolisms: KEGG = Kyoto Encyclopedia of Genes and Genomes.

29 Phylogenetics: Phylogenetics is the study of the evolutionary history of organisms.

30 Phylogeny: All organisms have a common ancestor. Similarly to genealogy, phylogeny aims at reconstructing a 'tree'. Reconstruction of EVOLUTION, using differences and common traits.

31 Phylogeny

32 Phylogeny: Reconstruction of EVOLUTION, using differences and common traits. Originally it was based on morphology.

33 Phylogenetics: To study phylogenies we need hereditary characters that group and separate the units present in our dataset. We can use morphology, but...

34 Phylogenetics: We can use morphology, but interpretation and analogy can provide false evidence or make our results noisy. Advanced techniques of geometric morphometrics can work.

35 Phylogenetics: It is important to use traits that are homologous (that derive from a common ancestor) and not analogous (that evolved independently, in a process of convergent evolution).

36 Fins in this dataset are an analogous character (fish and whales). Endothermy is a homologous character (mammals).

37 Phylogeny: [Figure: sequences ATCTTC-TG, ATCTTCATG, ACCTTCATG, ATCTGCATG, ATCAGCATG, ATCTGCATG, ATCCGCATG] Reconstruction of EVOLUTION, using DNA sequences. DNA is perfect for the task, because it is a digital character that is not influenced by interpretation.

38 Phylogeny: We do not have information on the ancestors! [Figure: sequences ATCTTC-TG, ACCTTCATG, ATCAGCATG, ATCCGCATG] So we need to infer the evolution of the character.
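Because DNA is a "digital" character, the differences between the four sequences shown above can simply be counted. The sketch below counts mismatching positions for every pair; the labels `seq1` to `seq4` are invented for the example, and treating the gap as just another mismatch is an assumption made here for simplicity.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of alignment columns where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def pairwise_differences(alignment):
    """Map each pair of sequence names to its number of differing columns."""
    return {
        (a, b): count_differences(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


alignment = {
    "seq1": "ATCTTC-TG",
    "seq2": "ACCTTCATG",
    "seq3": "ATCAGCATG",
    "seq4": "ATCCGCATG",
}

for pair, diff in pairwise_differences(alignment).items():
    print(pair, diff)
```

Here seq3 and seq4 differ at a single position, which already hints that they are each other's closest relatives in this small dataset.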

39 Phylogenetics: DNA is perfect, because it is a digital character that is not influenced by interpretation. With single genes: phylogenetic analyses. With entire genomes: phylogenomic analyses. More characters = more power.

40 Phylogenomics: More characters = more power. I can discriminate ancient events, where the noise is very strong.

41 Phylogenomics: More characters = more power. I can discriminate extremely recent events, where the variation between the different taxa is extremely low.

42 Phylogenomics: 1. sequencing, assembly, annotation; 2. obtain orthologous genes of other organisms from a database; 3. align orthologs; 4. run phylogenetic software.

43 Phylogenomics, obtain genes: Finding homologous genes is not enough. The genes that we want are called orthologous genes. Orthologous: genes that derive from a speciation event. Paralogous: genes that derive from a duplication event.

44 Phylogenomics, obtain genes: The software OrthoMCL is an example of a tool to obtain orthologous genes. This software 1. compares all the genes of all the organisms in a dataset (bidirectional BLAST hit); 2. uses a Markov Cluster algorithm to create networks to determine orthologous genes.

45 Phylogenomics, obtain genes: To obtain orthologous genes: bidirectional BLAST hit. The software only accepts the gene pairs for which each of the two genes is the best hit in the BLAST search against the other genome.
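The bidirectional best hit rule can be sketched as follows. This only illustrates the rule as stated above, not OrthoMCL's actual implementation; the gene names and bit scores are hypothetical, and the input is assumed to be already reduced to one score per (query, subject) pair.

```python
def best_hits(scores):
    """scores maps (query, subject) -> bit score; return query -> best subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (gene_a, gene_b) where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores: genome A searched against genome B, and the reverse.
a_vs_b = {
    ("gene_a1", "gene_b1"): 500, ("gene_a1", "gene_b2"): 300,
    ("gene_a2", "gene_b1"): 450, ("gene_a2", "gene_b2"): 200,
}
b_vs_a = {
    ("gene_b1", "gene_a1"): 500, ("gene_b1", "gene_a2"): 450,
    ("gene_b2", "gene_a1"): 300, ("gene_b2", "gene_a2"): 200,
}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this example gene_a2 also finds gene_b1 as its best hit, but the pair is rejected because gene_b1's best hit is gene_a1. A one-directional best hit search would have accepted it.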

46 Phylogenomics, obtain genes: To obtain orthologous genes: bidirectional BLAST hit. A single best hit can give false positive results, for example in the case of duplicated genes. [Figure: Organism1 with gene_a1 and gene_a2; Organism2 with gene_b1 and gene_b2]

47 Align the genes: Orthologous genes are generally very similar, so they can be easily aligned by software such as Clustal or MUSCLE.

48 Phylogeny: Starting from an alignment, we can use specific algorithms with the goal of understanding the evolutionary relationships between our organisms. A number of such phylogenetic algorithms exist. Two EXAMPLES: Maximum likelihood methods try to find the evolutionary tree that is most likely to explain the variation present in the dataset. Maximum parsimony methods try to find the tree that explains the variation with the smallest number of evolutionary changes.
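To make the parsimony idea concrete, the sketch below scores the three possible groupings of four sequences with Fitch's algorithm, which counts the minimum number of changes a given tree requires, and keeps the tree with the lowest count. The sequences are the four shown earlier in the deck; the labels `seq1` to `seq4`, the nested-tuple tree encoding and the choice to treat the gap as an extra character state are assumptions made for the example. Real programs also have to search a vastly larger space of trees.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one column: return (possible ancestral states, changes)."""
    if isinstance(tree, str):  # a leaf: its state is known
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children agree: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of changes the tree needs over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {
    "seq1": "ATCTTC-TG",
    "seq2": "ACCTTCATG",
    "seq3": "ATCAGCATG",
    "seq4": "ATCCGCATG",
}

candidate_trees = [
    (("seq1", "seq2"), ("seq3", "seq4")),
    (("seq1", "seq3"), ("seq2", "seq4")),
    (("seq1", "seq4"), ("seq2", "seq3")),
]

for tree in candidate_trees:
    print(tree, parsimony_score(tree, alignment))

best_tree = min(candidate_trees, key=lambda t: parsimony_score(t, alignment))
print("most parsimonious:", best_tree)
```

Only one column (the fifth, T T G G) distinguishes the three trees; the other variable columns cost the same on every tree, which illustrates why more characters give more power.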

49 Phylogenomics: 1. sequencing, assembly, annotation; 2. obtain orthologous genes of other organisms from a database; 3. align orthologs; 4. run phylogenetic software. When we work with very similar genomes, orthologous genes can contain too little information, as they are too similar. Example: population genomics, comparing multiple individuals from a single species. We need higher resolution.

50 SNPs analysis: When we work with very similar genomes, orthologous genes will contain too little information, as they are too similar. We need higher resolution. We need to work at a 'lower' level, not on genes, but on single positions: Single Nucleotide Polymorphisms. This approach allows us to detect single variations in highly variable genes excluded from the orthology analysis, and in intergenic regions.

51 SNPs ANALYSIS, how to: We sequence and assemble our genomes (contigs). We align them to a REFERENCE GENOME. REFERENCE GENOME: a closely related genome that we can use as a blueprint, as a reference, to compare our novel genomes to. Variations between the reference and our novel genomes can be recorded and used for comparison purposes, such as SNP-based phylogeny.

52 SNPs ANALYSIS, how to: Alignment of the genomes to a REFERENCE GENOME. This can be done using specific software: MAUVE does the alignment and gives us the SNPs.

53 SNPs ANALYSIS: While the phylogenomic approach can exclude important information, the SNP approach may include areas with questionable alignment. To avoid this problem, not all variations are used, but just the CORE SNPs: SNPs that are flanked on both sides by identical nucleotides in all the genomes of our alignment. This yields a dataset that is precise and informative.
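One possible reading of the core SNP definition is sketched below: a column of the alignment counts as a core SNP if it is variable, contains no gaps, and the neighbouring columns on both sides are identical in every genome. The `flank` parameter (how many neighbouring columns must be conserved), the labels `genome1` to `genome4` and the toy alignment are assumptions for the example; actual pipelines may define the flanking region differently.

```python
def column(alignment, i):
    """All characters found at alignment position i."""
    return [seq[i] for seq in alignment.values()]


def is_conserved(alignment, i):
    """True if position i holds the same nucleotide (not a gap) in every genome."""
    bases = set(column(alignment, i))
    return len(bases) == 1 and "-" not in bases


def core_snps(alignment, flank=1):
    """0-based positions of SNPs whose `flank` neighbours on each side are conserved."""
    length = len(next(iter(alignment.values())))
    positions = []
    for i in range(flank, length - flank):
        bases = set(column(alignment, i))
        if len(bases) < 2 or "-" in bases:
            continue  # not variable, or involves a gap
        neighbours = list(range(i - flank, i)) + list(range(i + 1, i + flank + 1))
        if all(is_conserved(alignment, j) for j in neighbours):
            positions.append(i)
    return positions


alignment = {
    "genome1": "ATCTTC-TG",
    "genome2": "ACCTTCATG",
    "genome3": "ATCAGCATG",
    "genome4": "ATCCGCATG",
}

print(core_snps(alignment))
```

In this toy alignment, the two adjacent variable columns exclude each other and the gapped column is skipped, so only the isolated SNP at position 1 is kept.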

54 SNPs ANALYSIS: SNP analysis allows us to detect minimal differences between very similar genomes. It is the analysis of choice when working in human genomics, and in general in the genomics of model systems. With the increase of available genomes, it has also become the method of choice for bacterial genomics of single species, a field that is called genomic epidemiology.

55 Phylogenetics with SNPs: 1. sequence and assemble genome; 2. align the contigs to a reference genome; 3. extract core SNPs; 4. run phylogenetic software. However, there is a FASTER alternative.

56 Phylogenetics with SNPs: 1. sequence and assemble genome; 2. align the contigs to a reference genome; 3. extract core SNPs; 4. run phylogenetic software. However, there is a FASTER alternative: mapping the reads directly to the REFERENCE GENOME.

57 MAPPING OF THE READS: Assembling the reads into a genome is not the only way. Assembly from reads = DE NOVO, as in 'try to generate a novel genome de novo, without previous information'. An alternative that is very useful in specific situations is MAPPING the reads to a genome we already know, indicated, again, as the REFERENCE GENOME.

58 MAPPING OF THE READS: Mapping means using a bioinformatic algorithm to determine at what position of a previously sequenced REFERENCE GENOME we can locate our reads, without assembling them.
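The idea of mapping can be shown with a deliberately naive sketch: slide each read along the reference and keep the position with the fewest mismatches. The reference, the reads and the `max_mismatches` parameter are invented for the example. Real read mappers rely on indexed data structures to do this efficiently on millions of reads, and they also handle the reverse strand and gaps, which this sketch does not.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None if none is good enough."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "ACGTTAGCCATGGA"  # toy reference genome
reads = ["TAGCC", "CATCGA", "GGGGGG"]  # toy reads

for read in reads:
    print(read, map_read(read, reference))
```

A read that maps with one mismatch, like the second one here, is exactly the kind of signal that is later examined as a candidate SNP relative to the reference.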

59 MAPPING OF THE READS: Mapping means using a bioinformatic algorithm to determine at what position of a previously sequenced REFERENCE GENOME we can locate our reads, without assembling them.

60 Phylogenetics with MAPPING: 1. sequence genome; 2. map the reads to the reference; 3. extract core SNPs; 4. run phylogenetic software. Perfect for big genomes (less computational power needed), and also useful for finding variations for genomics of alleles... and for transcriptomics.