Bioinformatics? Assembly, annotation, comparative genomics and a bit of phylogeny.

1 Bioinformatics? Assembly, annotation, comparative genomics and a bit of phylogeny stefano.gaiarsa@unimi.it

2 Case study! It's a FAKE ONE, do not run away in panic! There's an outbreak of Mycoplasma bovis (a serious problem in cattle!) that is killing a lot of cattle and some humans!! What makes this bug so tough?! Let's take a look at the GENOME! We have the genome of the REFERENCE STRAIN (a.k.a. the wild type clone): Mycoplasma bovis strain PG45. We are going to look at the differences between the reference and ours!

3 Case study! It's a FAKE ONE, do not run away in panic! We sequence our genome with an Illumina MiSeq machine and we get the reads: hundreds of reads, thousands of reads, millions of reads!!

4 We have our reads... so What now?

5 We have our reads... so what now? [Workflow diagram: READS -> Assembly -> GENOME -> Gene calling -> GENES. From GENOME: GENOMIC STRUCTURE. From GENES: Annotation -> GENE FUNCTIONS; Ortholog search -> Alignment -> Phylogenomic analysis -> PHYLOGENY]

6 What now? [Workflow diagram: READS -> Assembly -> GENOME -> Gene calling -> GENES. From GENOME: GENOMIC STRUCTURE. From GENES: Annotation -> GENE FUNCTIONS; Ortholog search -> Alignment -> Phylogenomic analysis -> PHYLOGENY]

7 Reads. How many are they? How good are they? We need reliable data for our project: should we throw something away? We need to use: fastqc

8 Reads quality

9 Reads quality. In the .fastq format you have a line with one symbol for each base in the sequence: the symbols represent the quality (the reliability) of each sequenced base. The quality tends to drop towards the end of the read. [Plot axis: position in the read]
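To make the quality line less mysterious, here is a minimal sketch of how the symbols can be decoded. It assumes the Phred+33 encoding (the usual one for recent Illumina data); the function names and the example quality line are made up for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality line into a list of Phred scores.

    Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
    """
    return [ord(symbol) - offset for symbol in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Made-up quality line: good bases first, bad bases at the end of the read
quality = "IIIIII5555##"
for position, score in enumerate(phred_scores(quality), start=1):
    print(position, score, error_probability(score))
```

A score of 20 means 1 error in 100 base calls, a score of 40 means 1 in 10,000.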

10 Reads quality. Bad quality? We can DISCARD some reads and/or just TRIM them at the 3' end: fastq_quality_trimmer from the package fastx-toolkit. Syntax: you@yourpc:yourdirectory$ fastq_quality_trimmer -t [min_qual] -l [min_length_trimmed_read] -i [input_file] -o [output_file]
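The idea behind trimming and discarding can be sketched in a few lines of Python. This is only a simplified illustration of the two parameters (minimum quality, minimum length), assuming Phred+33 quality symbols; it is not the actual code of fastq_quality_trimmer, and the function name is hypothetical.

```python
def trim_3prime(sequence, quality_line, min_qual, min_length, offset=33):
    """Cut low-quality bases from the 3' end of one read.

    Returns (sequence, quality_line) after trimming, or None when the
    trimmed read is shorter than min_length and should be discarded.
    Assumption: Phred+33 quality encoding.
    """
    end = len(sequence)
    # Walk back from the 3' end while the base quality is below the threshold
    while end > 0 and ord(quality_line[end - 1]) - offset < min_qual:
        end -= 1
    if end < min_length:
        return None  # too short after trimming: discard the read
    return sequence[:end], quality_line[:end]


print(trim_3prime("ACGTACGT", "IIIII###", min_qual=20, min_length=3))
print(trim_3prime("ACGT", "I###", min_qual=20, min_length=3))
```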

11 Genome assembly... a huge puzzle with short reads. We have millions of short sequences (30 to 500 nt; longer with the latest, expensive technologies)

12 solving the puzzle. This is how we sampled our reads: We don't have this picture above, i.e. we don't know the position of each read on the genome. We only have the short sequences: >read1 ACGTAAACGTAA >read2 CGTAAATCGGTC >read3 ATCGATAATGCC >read4 CGGTCATCGATA What do we do?

13 solving the puzzle the greedy way. We look for overlaps and we connect the reads!
read1      ACGTAAACGTAA
read2             CGTAAATCGGTC
read4                    CGGTCATCGATA
read3                         ATCGATAATGCC
CONSENSUS  ACGTAAACGTAAATCGGTCATCGATAATGCC
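The "look for overlaps and connect the reads" step can be written as a toy program. This is a minimal sketch under simplifying assumptions (exact overlaps only, no sequencing errors, no reverse-complement reads, no read contained inside another); the function names and the minimum overlap of 3 bases are choices made for this example, not something taken from a real assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the assembly stays in pieces
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ACGTAAACGTAA", "CGTAAATCGGTC", "ATCGATAATGCC", "CGGTCATCGATA"]
print(greedy_assemble(reads))
```

With the four reads of the slide, the sketch ends with a single contig equal to the consensus.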

14 solving the puzzle: what kind of puzzle is this? Assembling genomes is not like solving a jigsaw puzzle. It is more similar to DOT TO DOT!

15 yeah...but more like this one!

16 solving the puzzle, i.e. the GRAPH: 2 different approaches
The greedy way: each read is a dot in our graph (our puzzle): a VERTEX or NODE. Each time two reads overlap we draw a line: an ARC (or EDGE). [Figure: the reads ACGTAAACGTAA, CGTAAATCGGTC, CGGTCATCGATA and ATCGATAATGCC drawn as connected nodes]
- older method
- time consuming
- memory consuming
- not apt for big genomes
- a bit more precise
- example of software: MIRA (still used!)

17 solving the puzzle, i.e. the GRAPH: 2 different approaches
The new approach (De Bruijn graphs): divide the reads into smaller pieces of a given length K. Each piece (a k-mer) is used as a VERTEX. Each time two k-mers overlap with a 1-base overhang, they are joined! Example: CGTAA GTAAC TAACG AACGG
- faster
- less memory consuming (the same k-mer is present many times in the reads, but we add just one copy to our graph!)
- apt for bigger genomes
- example of software: Velvet, SPAdes (we will use it now)
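The k-mer graph described above can be sketched as follows. This is a toy version that follows the slide (k-mers as vertices, an edge between consecutive k-mers of a read); real assemblers such as Velvet or SPAdes do much more (error correction, reverse complements, graph simplification). All function names are hypothetical.

```python
def kmers(read, k):
    """All the pieces of length k in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it."""
    graph = {}
    for read in reads:
        pieces = kmers(read, k)
        for piece in pieces:
            graph.setdefault(piece, set())  # each k-mer is stored only once
        for left, right in zip(pieces, pieces[1:]):
            graph[left].add(right)  # consecutive k-mers overlap by k-1 bases
    return graph


def branching_vertices(graph):
    """Vertices with more than one outgoing connection."""
    return sorted(v for v, nxt in graph.items() if len(nxt) > 1)


print(kmers("CGTAACGG", 5))
print(build_graph(["CGTAACGG"], 5))
```

Vertices returned by `branching_vertices` are the places where the path through the graph is ambiguous, which is one reason why an assembly can stay in several pieces.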

18 solving the puzzle, i.e. the GRAPH. Ok! We will assemble genomes in one piece!!! NOPE! Sometimes a vertex has more than one connection! Your genome will still be in pieces, whichever method you choose!

19 solving the puzzle: De Bruijn graphs. SPAdes: St. Petersburg genome Assembler

20 solving the puzzle: De Bruijn graphs, SPAdes. Let's try to use it with just a few reads. We need: 1. to use the head command 2. to remember how the .fastq files are formatted. Please extract the first 100 reads from the forward (1) read file
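As a hint for the exercise, the same extraction can be sketched in Python. It assumes the usual .fastq layout of 4 lines per read (header, sequence, "+", quality), so the first 100 reads are the first 400 lines; the function name and the tiny example file are made up.

```python
import io
from itertools import islice


def first_reads(fastq_lines, n_reads):
    """Return the lines of the first n_reads reads.

    Assumption: every read takes exactly 4 lines in the .fastq file.
    """
    return list(islice(fastq_lines, 4 * n_reads))


# Tiny made-up file with 2 reads, used in place of a real forward read file
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nII5#\n")
print("".join(first_reads(example, 1)))
```

On a real file, the file handle returned by `open(...)` can be passed instead of the `io.StringIO` object.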

21 solving the puzzle: De Bruijn graphs, SPAdes. python spades.py -s [reads] --careful -t 1 -k [length_of_the_kmer] -o [output_name] Let's check the quality! you@yourpc:yourdirectory$ assembly-stats [genome_file]

22 Assembly quality assembly-stats how many contigs? how long is the genome? do you remember N50?
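As a reminder of what N50 means: sort the contigs from longest to shortest and add up their lengths until you reach at least half of the total assembly length; the length of the contig that gets you there is the N50. A minimal sketch (the function name and the example lengths are made up):

```python
def n50(contig_lengths):
    """Length of the contig at which the running sum of the sorted
    lengths (longest first) reaches half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Total = 20, half = 10: 6 + 5 = 11 reaches it, so the N50 is 5
print(n50([2, 3, 4, 5, 6]))
```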

23 solving the puzzle De Brujin graphs SPAdes Can we try with more reads? try with 10,000 reads!

24 One more piece to the puzzle. SINGLE: [Figure: Adapter, Genomic fragment, Adapter, Primer]. We can add the paired-end information. PAIRED: Wait! I'm reading the SAME THING twice! OK, let's use a bigger fragment

25 One more piece to the puzzle. If THEY OVERLAP, I can make a UNIQUE big SEQUENCE. OR... they are A BIT FURTHER APART. And the bases in between??

26 One more piece to the puzzle paired-end assembly, the old way ACGTAAACGTAA CCTATTGCTATA CGTAAATCGGTC CTATAGTCCGAT ATCGATAATGCC GATTCGGAATGC CGGTCATCGATA CGATCGATTCGG NO PROBLEM! >CONSENSUS ACGTAAACGTAAATCGGTCATCGATAATGCCTATTGCTATAGTCCGATCGATTCGGAATGC

27 One more piece to the puzzle: De Bruijn's way. SPAdes: paired-end information is added after the De Bruijn graph approach. you@yourpc:yourdirectory$ python spades.py -1 [forward_reads] -2 [reverse_reads] --careful -t 1 -k [length_of_the_kmer] -o [output_name] Use 10,000 forward reads and 10,000 reverse reads... and... check the quality now!

28 Visualize the assembly. Besides having statistics on our assemblies, we can visualize them using Bandage. This software is useful to see the contigs and all their possible connections. In this way we can use other approaches to join the last pieces, both using bioinformatics (e.g. scaffolding) and using molecular biology (PCR). (This only makes sense when you have very high quality assemblies, otherwise you will only see a mess of squares and lines.)

29 SEE YOU NEXT WEEK!

30 Case study! Remember the Mycoplasma outbreak?! We have the genome! Now let's compare it with the WT. We need to do some COMPARATIVE GENOMICS... a.k.a. spot the differences!

31 What now? [Workflow diagram: READS -> Assembly -> GENOME -> Gene calling -> GENES. From GENOME: GENOMIC STRUCTURE. From GENES: Annotation -> GENE FUNCTIONS; Ortholog search -> Alignment -> Phylogenomic analysis -> PHYLOGENY]

32 Genomic structure... based on sequence similarity. Mauve: go to the right folder and type: java -jar Mauve.jar

33 Case study! Remember the Mycoplasma outbreak?! DON'T FORGET OUR RESULTS, GUYS!!! Insertion of a big piece of DNA

34 What now? [Workflow diagram: READS -> Assembly -> GENOME -> Gene calling -> GENES. From GENOME: GENOMIC STRUCTURE. From GENES: Annotation -> GENE FUNCTIONS; Ortholog search -> Alignment -> Phylogenomic analysis -> PHYLOGENY]

35 Gene calling... better call it ORF calling. ORF = Open Reading Frame: any genomic subsequence that has the potential to code for a protein or peptide, i.e. it starts with a START codon and ends with a STOP codon. >SEQ_1 CTTATGATATGGATCGATCGATTCGATCGATCGATTTAGCTAGGCTAGCTAGCTGACTAG >SEQ_1 CTT ATG ATA TGG ATC GAT CGA TTC GAT CGA TCG ATT TAG CTA GGC TAG CTA GCT GAC TAG

36 Gene calling... better call it ORF calling. There are 6 ways to read your genome in triplets: 3 on each DNA strand. >SEQ_1 CTTATGATATGGATCGATCGATTCGATCGATCGATTTAGCTAGGCTAGCTAGCTGACTAG >SEQ_1 CTT ATG ATA TGG ATC GAT CGA TTC GAT CGA TCG ATT TAG CTA GGC TAG CTA GCT GAC TAG >SEQ_1 CT TAT GAT ATG GAT CGA TCG ATT CGA TCG ATC GAT TTA GCT AGG CTA GCT AGC TGA CTA G
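The six reading frames and the "START codon ... STOP codon" rule can be turned into a small program. This is a naive sketch that only uses ATG as START and TAA/TAG/TGA as STOP; it is not how Prodigal works internally (real gene callers also score candidate genes and accept alternative start codons). The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """The other DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, frame):
    """Split seq into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def orfs_in_frame(seq, frame):
    """ORFs (from the first ATG to the next STOP codon) in one reading frame."""
    orfs, current = [], None
    for codon in codons(seq, frame):
        if current is None:
            if codon == "ATG":
                current = [codon]  # open a new ORF
        else:
            current.append(codon)
            if codon in STOP_CODONS:
                orfs.append("".join(current))  # close the ORF
                current = None
    return orfs


def six_frame_orfs(seq):
    """ORFs for the 3 frames of each strand, keyed by (strand, frame)."""
    strands = (("+", seq), ("-", reverse_complement(seq)))
    return {(name, f): orfs_in_frame(s, f) for name, s in strands for f in range(3)}


seq_1 = "CTTATGATATGGATCGATCGATTCGATCGATCGATTTAGCTAGGCTAGCTAGCTGACTAG"
for key, found in six_frame_orfs(seq_1).items():
    print(key, found)
```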

37 Gene calling... better call it ORF calling. Prodigal: searches all possible ORFs on your genome and throws the short ones away. prodigal -i [genome_file] -d [found_genes] -a [translated_found_genes] BE CONSISTENT WITH EXTENSIONS!! Usually: .fna files are genomes, .ffn files are genes, .faa files are proteins

38 What now? [Workflow diagram: READS -> Assembly -> GENOME -> Gene calling -> GENES. From GENOME: GENOMIC STRUCTURE. From GENES: Annotation -> GENE FUNCTIONS; Ortholog search -> Alignment -> Phylogenomic analysis -> PHYLOGENY]

39 Annotation... what does my gene do? Similar sequence, similar function... most of the time!! In these cases we use BLAST. It is a LOCAL alignment tool, but it is used for: Aligning, Comparing, Searching... in one word... blasting

40 Annotation... what does my gene do? BLAST: we align a QUERY sequence on a DATABASE of sequence(s): a small sequence on a big sequence, or a small sequence on a small sequence

41 Annotation... what does my gene do? Similar sequence, similar function. Which is the most similar? We can BLAST on a BIG database. Prepare the database: you@yourpc[yourdirectory] makeblastdb -in [fasta_file] -dbtype [nucl/prot] Do the search with blastp (or blastn, blastx, tblastn, tblastx): you@yourpc[yourdirectory] blastp -query [query_fasta_file] -db [database] -out [output_file] [plus a lot of other options...]

42 Annotation... what does my gene do? COGs (Clusters of Orthologous Groups) are groups of proteins of known function. Prepare the database: makeblastdb -in COG_formatted.fa -dbtype prot Do the search: blastp -query [query_fasta_file] -db COG_formatted.fa -out [output_file] -max_target_seqs 1 -outfmt "6 qseqid sseqid" Then: you@yourpc[yourdirectory] python get_function_hist.py [annotation_input_file]
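To give an idea of what can be done with the two-column BLAST output, here is a minimal sketch that counts how many of our genes hit each database entry. It is NOT the course script get_function_hist.py (whose content is not shown here); it only assumes a tab-separated file with the query ID in the first column and the subject ID in the second, with the first line of each query taken as its best hit. The function names and the example IDs are made up.

```python
from collections import Counter


def best_hits(blast_lines):
    """Keep the first subject reported for each query (tab-separated lines)."""
    hits = {}
    for line in blast_lines:
        line = line.strip()
        if not line:
            continue
        query, subject = line.split("\t")[:2]
        hits.setdefault(query, subject)  # later hits of the same query are ignored
    return hits


def subject_histogram(blast_lines):
    """Count how many queries have each subject as their first hit."""
    return Counter(best_hits(blast_lines).values())


# Made-up output lines: qseqid <TAB> sseqid
example = ["gene1\tCOG0001\n", "gene1\tCOG0002\n", "gene2\tCOG0001\n"]
print(subject_histogram(example))
```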

43 Case study! Remember the Mycoplasma outbreak?! DON'T FORGET OUR RESULTS, GUYS!!! New functions added!!

44 What now? [Workflow diagram: READS -> Assembly -> GENOME -> Gene calling -> GENES. From GENOME: GENOMIC STRUCTURE. From GENES: Annotation -> GENE FUNCTIONS; Ortholog search -> Alignment -> Phylogenomic analysis -> PHYLOGENY]

46 Phylogeny: reconstruction of EVOLUTION, using differences and common morphological traits

47 Phylogeny ATCTTC-TG ATCTTCATG ACCTTCATG ATCTGCATG ATCAGCATG ATCTGCATG ATCCGCATG Reconstruction of EVOLUTION, using DNA sequences MOLECULAR PHYLOGENY

48 Phylogeny We do not have information on the ancestors!! ATCTTC-TG ACCTTCATG ATCAGCATG ATCCGCATG Reconstruction of EVOLUTION, using DNA sequences MOLECULAR PHYLOGENY
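The raw material of molecular phylogeny is the differences between aligned sequences. As a minimal sketch, the differences between the four sequences above can be counted position by position. This simple count (which skips positions with a gap, a choice made for this example) is only an illustration and is not the method used by tree-building programs; the function name is hypothetical.

```python
from itertools import combinations


def differences(a, b):
    """Number of aligned positions where a and b carry different bases.

    Positions where either sequence has a gap ("-") are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned (same length)")
    return sum(1 for x, y in zip(a, b) if x != y and "-" not in (x, y))


aligned = ["ATCTTC-TG", "ACCTTCATG", "ATCAGCATG", "ATCCGCATG"]
for a, b in combinations(aligned, 2):
    print(a, b, differences(a, b))
```

Sequences with fewer differences are candidates for being closer relatives, which is the intuition that tree-building methods formalize.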

49 Phylogeny We will reconstruct the story of a single gene. 1) We will look for the same gene/protein from the inserted region in all the organisms we want to analyze. YOUR TURN! What tool would you use?

50 Phylogeny We will reconstruct the story of a single gene. 1) We will look for the same protein from the inserted region in all the organisms we want to analyze. YOUR TURN! What tool would you use? yes, the Swiss army knife of bioinformatics:... BLAST! you@yourpc[yourdirectory] blastp -query [query_fasta_file] -db [database] -out [output_file] -max_target_seqs 1

51 Phylogeny. INSTRUCTIONS FOR THE EXERCISE: You will find the gene MBG_protein.faa in the protein_search folder. Pick an organism in which you wish to find the protein homologous to ours. Run the makeblastdb and blastp commands (don't use -outfmt)

52 Phylogeny. 2) We need to ALIGN all the proteins, in order to spot the differences 3) then eliminate some noisy amino acids 4) do the phylogeny. Phylogeny is really complex. There are a lot of possible ways to perform analyses. Some people have dedicated their entire scientific career to molecular phylogeny, so it's not really a 5 minute thing... but we are going to take it the easiest way and use some simple tools in SeaView

53 Phylogeny. The tools that we have used inside SeaView:
- muscle: to align all the sequences
- Gblocks: to cut all the positions that could not be used for phylogeny (i.e. gaps and bases/amino acids with low quality alignment)
- phyml: to run the phylogeny
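To illustrate the cleaning step, here is a toy sketch that removes every alignment column containing a gap. It is far simpler than what Gblocks really does (which also looks at how well conserved the neighbouring columns are); the function name is hypothetical.

```python
def drop_gap_columns(aligned):
    """Remove every column of the alignment in which at least one sequence has a gap."""
    if not aligned:
        return []
    keep = [i for i in range(len(aligned[0])) if all(seq[i] != "-" for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]


print(drop_gap_columns(["ATCTTC-TG", "ACCTTCATG"]))
```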

54 Case study! Remember the Mycoplasma outbreak?! Most probable conclusion: a DNA insertion from Yersinia pestis makes our Mycoplasma strain really BAD ASS!