Carl Woese. Used 16S rRNA to develop a method to identify any bacterium, and discovered a novel domain of life

1 METAGENOMICS

2 Carl Woese Used 16S rRNA to develop a method to identify any bacterium, and discovered a novel domain of life. His amazing discovery, coupled with his solitary behaviour, made many contemporary biologists think he was crazy.

3 16S rRNA One of the three structural RNAs that compose the prokaryotic ribosome (together with 5S and 23S, and with 52 proteins). Eukaryotic ribosomes are composed of 79 proteins and 4 rRNAs (5S, 5.8S, 28S, 18S). S stands for...

4 16S rRNA S stands for SVEDBERG, after Theodor Svedberg, winner of the 1926 Nobel Prize in Chemistry and inventor of the ultracentrifuge. It measures how fast a molecule sediments in an ultracentrifuge; 1 S = 10^-13 seconds.

5 16S rRNA Translation is one of the most conserved biological processes. Ribosomal RNAs and proteins tend to be very similar even in very different organisms.

6 16S rRNA Woese used this characteristic to compare ribosomal sequences across eukaryotes and prokaryotes and to redraw the tree of life.

7 16S rRNA contains highly conserved regions, but also variable regions (V1-V9), in those parts of the molecule that are not fully constrained by their function in the ribosome.

8 16S rRNA We can design PCR primers in the conserved regions and use them to amplify and sequence the variable regions. This method, called universal bacterial PCR or panbacterial PCR, allows us to: - amplify (almost) all bacteria - identify them based on the V regions
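The primer idea can be sketched as an in-silico PCR: find the two conserved primer sites and keep what lies between them. This is a minimal sketch, assuming exact primer matches; the primer strings, taxon names and template sequences are invented for illustration, and real primers are longer and tolerate mismatches.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_pcr(template: str, fwd_primer: str, rev_primer: str):
    """Return the region between the two primer sites (primers excluded).

    Returns None when either primer site is missing from the template.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the template
    # strand we look for its reverse complement.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start + len(fwd_primer):end]


# Hypothetical conserved primers and two toy "16S genes" that share the
# conserved sites but differ in the variable region between them.
FWD = "GTGCCAGC"
REV = "GGACTACC"
genomes = {
    "taxon_A": "AAGTGCCAGC" + "TTTACGTT" + reverse_complement(REV) + "CC",
    "taxon_B": "CGGTGCCAGC" + "TTGACCTT" + reverse_complement(REV) + "TA",
}
amplicons = {name: in_silico_pcr(seq, FWD, REV) for name, seq in genomes.items()}
print(amplicons)
```

The same primer pair amplifies both templates, and the amplified variable regions differ, which is what makes identification possible.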

9 16S panbacterial sequencing is the basis of microbial ecology studies, and it made us understand that bacteria are everywhere.

10 Microbiome At the interface between bioinformatics, microbiology & ecology

11 The field of study named MICROBIOMICS includes all the strategies to study the MICROBIOTA. Microbiota = the entire microbial community of a habitat. Microbiome = the genomes of a microbiota.

12 How many human cells are in my body?

13 How many human cells are in my body? 30 trillion

14 How many human cells are in my body? 30 trillion. How many cells are in my body?

15 How many human cells are in my body? 30 trillion. How many cells are in my body? 70 trillion. This means that over 50% of the cells that compose my body are NOT HUMAN! (Different studies suggest variable values, from 50 to 90%.)
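The "over 50%" figure follows directly from the two numbers on the slide. A quick check, using only the slide's own estimates (30 trillion human cells, 70 trillion cells in total):

```python
# Estimates taken from the slide; other studies report different values.
human_cells = 30 * 10**12
total_cells = 70 * 10**12

non_human_cells = total_cells - human_cells
non_human_fraction = non_human_cells / total_cells

print(f"non-human cells: {non_human_cells:.1e}")
print(f"non-human fraction: {non_human_fraction:.0%}")
```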

16 JEFFREY GORDON Identified correlations between obesity and the microbiota. He colonized germ-free twin mice, one with 'lean microbiota' and one with 'obese microbiota' (Nature 2006).

17 MICROBIOME STUDIES Human microbiome studies (gut, mouth...). Strong influence of the microbiota on: immunity, infections, metabolism, happiness (serotonin).

18 MICROBIOME STUDIES Ecology, agriculture, wastewater, biofuels, animals, plants.

19 Fecal microbiome transplantation (FMT) The medical procedure of taking stool from a healthy donor and transplanting it into a patient who has a difficult-to-eliminate infection, mainly from Clostridium difficile. FMT is currently considered an experimental treatment, but it has shown extremely promising results.

20 MICROBIOME STUDIES Filarial nematodes cause terrible diseases. Symbiotic bacteria are found in filarial nematodes and are necessary for the host's survival. Antibiotic treatment kills the symbionts --> the worm dies --> patients are cured.

21 MICROBIOME STUDIES Microbiome of a hydrothermal vent 2,000 m deep between Norway and Greenland: a novel phylum of Archaea was found that could be the missing link between prokaryotes and eukaryotes.

22 Microbiome 1.0 Amplicon sequencing

23 How to study microbiomes? Classical approach: 1. PCR with panbacterial primers to amplify a variable region of the 16S rRNA 2. Clone the PCR products 3. Sanger sequencing of a number of clones 4. Analysis of the obtained sequences

24 Next-gen sequencing can be used for microbiome studies: 1. PCR with panbacterial primers (variable region of the 16S rRNA) 2. Use of the PCR product as a template for next-generation sequencing (many more reads) 3. Bioinformatic analysis

25 What technology? When we choose the right next-gen technology for an experiment, the two main questions are: How long must the reads be? How many reads do we need?

26 How long must the reads be? In metagenomics, longer reads are better. Longer reads = more variable regions sequenced = more discriminatory power.

27 How many reads do we need? As the number of sequences obtained increases, it becomes possible to identify a greater number of taxa. After a certain number of sequences, which depends on the complexity of the community in the sample, a plateau is reached.
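The plateau can be illustrated with a toy rarefaction curve: count how many distinct taxa are seen as more and more reads are drawn. This is a sketch under simple assumptions; the community composition and taxon names are invented, and each read is assumed to be already assigned to a taxon.

```python
import random


def rarefaction_curve(read_labels, depths, seed=0):
    """Return (depth, number of distinct taxa observed) for each depth.

    read_labels: list with one taxon label per read.
    Reads are shuffled once, so deeper samples contain the shallower ones.
    """
    rng = random.Random(seed)
    shuffled = list(read_labels)
    rng.shuffle(shuffled)
    return [(depth, len(set(shuffled[:depth]))) for depth in depths]


# Hypothetical community: two abundant taxa, one uncommon, one rare.
community = (
    ["taxon_A"] * 600 + ["taxon_B"] * 300 + ["taxon_C"] * 90 + ["taxon_D"] * 10
)
curve = rarefaction_curve(community, depths=[10, 50, 100, 500, 1000])
for depth, observed in curve:
    print(depth, observed)
```

Once every taxon has been observed, adding reads no longer increases the count; a more complex community would reach that point later.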

28 How many reads do we need? Enough reads to detect all the taxa present in the chosen sample. This strongly depends on the complexity of the microbial community: human gut, very complex; soil sample, extremely complex; arthropod sample, simple. In any case, fewer reads than in genome sequencing projects.

29 What technology? We need long reads, and we need few reads.

30 What technology? We need long reads, and we need few reads. It used to be Sanger, then it became 454 (long reads). Now even metagenomic studies are performed with Illumina.

31 What technology? Now even metagenomic studies are performed with Illumina. The quality, quantity and low price of Illumina sequences win, even if the method is not optimal for microbiome studies. The current microbiomics approaches are thus tailored for Illumina reads: a high number of short reads. In the future, 3rd-gen methods will allow sequencing of the entire 16S rRNA... and the bioinformatics methods will change accordingly.

32 Multiplexing The output of next-gen technologies, especially Illumina, is excessive for metagenomics. Physical multiplexing was invented for 454: the plate is divided into up to 16 spaces, to load 16 different samples.

33 Multiplexing Physical multiplexing helps but... No more than 16 samples. Physical space of the sequencing plate is lost: not productive.

34 Biochemical multiplexing

35 Biochemical multiplexing During library preparation: ligation of the adapters + a barcode sequence. BARCODE: a short sequence (usually 8 nt). A different BARCODE is associated with each sample.

36 Biochemical multiplexing A different BARCODE is associated with each sample. Up to 384 samples (currently, but it depends on the technology) can be pooled in one sequencing run. The sequences of each sample are then separated downstream with bioinformatic tools that recognize the sequence of each barcode. The bioinformatics of demultiplexing is very simple: the Illumina software discriminates barcodes.
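Demultiplexing is essentially a dictionary lookup on the barcode. The sketch below assumes, for simplicity, that the 8 nt barcode is the first 8 bases of each read and must match exactly; the barcodes, sample names and reads are invented, and this is not how the Illumina software is implemented.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len=8):
    """Group reads by sample according to their leading barcode.

    The barcode is stripped from each read; reads whose barcode is not in
    the lookup table are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical barcodes and pooled reads.
barcodes = {"ACGTACGT": "sample1", "TGCATGCA": "sample2"}
pooled = [
    "ACGTACGTGGGTTT",
    "TGCATGCACCCAAA",
    "ACGTACGTGGGTTA",
    "NNNNNNNNAAAA",
]
demultiplexed = demultiplex(pooled, barcodes)
print(demultiplexed)
```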

37 BIOINFORMATICS of 16S metagenomics 0. Sequencing 1. Sample reduction 1.1 Quality control 1.2 Identical sequences merge 2. Selection of homologous 16S sub-sequences 3. OTU classification 3.1 OTU identification 3.2 OTU annotation

38 BIOINFORMATICS of 16S metagenomics 0. SEQUENCING: Barcoded paired-end reads are generated. Barcoded, so multiple samples can be sequenced at once. Paired-end, so we have more information (more nt per read). Paired-end reads for metagenomics are constructed so that they overlap. This allows merging of the two paired-end reads and generation of a longer sequence. Example: Read 1 = 250 nt, Read 2 = 250 nt, overlap = 100 nt, merged read = 400 nt.

39 BIOINFORMATICS of 16S metagenomics 1. Sample reduction. Since we are working with Illumina, we are generating: SHORT reads, ACCURATE reads, and very likely an EXCESS of reads. This is taken into account in the bioinformatic pipeline. KNOW YOUR DATA!!!

40 BIOINFORMATICS of 16S metagenomics 1. Sample reduction 1.1 Quality control: paired-end assembly and quality check. Each pair of reads is assembled into one consensus. Read pairs with a number of mismatches greater than a threshold are removed (the threshold is usually ZERO). Assembled reads shorter than a length threshold are removed.
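Paired-end assembly with the two filters can be sketched as follows. This is a simplified illustration, assuming the overlap length is known in advance (real assemblers search for the best overlap) and that read 2 is reported on the opposite strand; the function and parameter names are hypothetical. With 250 nt reads and a 100 nt overlap, the merged length is 250 + 250 - 100 = 400 nt.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, overlap, max_mismatches=0, min_length=0):
    """Merge a read pair into one sequence, or return None if it is rejected.

    read2 is given as sequenced (opposite strand) and is reverse-complemented
    before merging. The pair is rejected when the overlapping bases disagree
    at more than max_mismatches positions, or when the merged read is shorter
    than min_length.
    """
    read2_fwd = reverse_complement(read2)
    tail, head = read1[-overlap:], read2_fwd[:overlap]
    mismatches = sum(a != b for a, b in zip(tail, head))
    if mismatches > max_mismatches:
        return None
    merged = read1 + read2_fwd[overlap:]
    return merged if len(merged) >= min_length else None


# Toy 20 nt fragment sequenced as two 12 nt reads overlapping by 4 nt.
fragment = "ACGTTGCAAGGCTTAACCGG"
read1 = fragment[:12]
read2 = reverse_complement(fragment[8:])
merged = merge_pair(read1, read2, overlap=4)
print(merged, len(merged))
```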

41 BIOINFORMATICS of 16S metagenomics 1. Sample reduction 1.2 Identical sequences are merged. Selected assembled reads are compared all versus all and identical reads are merged: just one is maintained in the reads dataset, and information about the merging is stored in a text file. This step decreases the number of reads subjected to the next analyses, reducing the CPU power and time required to complete the analysis (metagenomics can be time consuming!). Simple, optimized ALGORITHMS are used (we only look for sequences that are exactly identical).
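Because only exactly identical sequences are merged, this step can be done by hashing instead of explicit pairwise comparison. A minimal sketch, with toy reads; storing the count next to each unique sequence stands in for the text file mentioned above.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads.

    Returns a list of (sequence, count) pairs, most abundant first, so the
    abundance information is kept while each sequence is analysed only once.
    """
    return Counter(reads).most_common()


reads = ["AAA", "CCC", "AAA", "AAA", "GGG", "CCC"]
unique_reads = dereplicate(reads)
print(unique_reads)
```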

42 BIOINFORMATICS of 16S metagenomics 2. Selection of homologous 16S sub-sequences. Select the subset of reads to use for the subsequent identification. In order to compare the reads we must calculate their pairwise nucleotide distances, but an all-vs-all approach on the full dataset is not feasible (the number of comparisons grows quadratically with the number of reads). To do it we need to compare homologous regions, but the Illumina reads are too short to cover the entire V4 region of the 16S gene: we have to select a gene sub-region and perform the analysis using only the reads that align to that V4 sub-region.

43 BIOINFORMATICS of 16S metagenomics 3.1 OTU identification. The selected (homologous) reads are aligned all versus all and clustered on the basis of nucleotide similarity: reads with a nucleotide distance lower than a threshold (usually 3% for the V4 region of the 16S rDNA) are grouped in the same cluster. Each cluster is defined as an Operational Taxonomic Unit (OTU). Why OTUs and not species? What is a species?

44 BIOINFORMATICS of 16S metagenomics 3.1 OTU identification Why OTUs and not species? The concept of species is difficult to define and derives mostly from historical and phenotypical reasons. The concept of OTU is: - standardized (good) - free of phenotypical meaning (good and bad). A single species can comprise multiple OTUs. A single OTU can group samples from multiple species.
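The 3% rule used for OTU identification can be sketched with a greedy clustering: each read joins the first OTU whose seed is closer than the threshold, otherwise it founds a new OTU. Note that this greedy scheme is a simplification chosen for brevity, not the all-versus-all clustering described on the slides; reads are assumed to be already aligned to the same length, and the sequences are invented.

```python
def distance(a: str, b: str) -> float:
    """Fraction of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(reads, threshold=0.03):
    """Assign each read to the first OTU whose seed is within the threshold.

    Each OTU is a list of reads; its first element is the seed.
    """
    otus = []
    for read in reads:
        for otu in otus:
            if distance(read, otu[0]) < threshold:
                otu.append(read)
                break
        else:
            otus.append([read])
    return otus


# Toy 100 nt reads: a seed, a variant with 2 substitutions (2% distance),
# and an unrelated sequence.
seed = "ACGT" * 25
variant = "T" + seed[1:50] + "A" + seed[51:]
unrelated = "TTGA" * 25
otus = greedy_otu_clustering([seed, variant, unrelated])
print([len(otu) for otu in otus])
```

The variant falls below the 3% threshold and joins the seed's OTU, while the unrelated read founds its own.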

45 BIOINFORMATICS of 16S metagenomics 3.2 OTU annotation. All the reads belonging to an OTU are used to generate a consensus sequence: a sequence that contains, at each position, the base with the highest frequency in the alignment. The consensus is compared to a 16S database and annotated.
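The consensus definition translates almost directly into code. A minimal sketch, assuming the reads of the OTU are already aligned to the same length; ties are resolved in favour of the base seen first, which is an arbitrary choice made here.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the sequence with the most frequent base at each position."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


otu_reads = ["ACGT", "ACGA", "ACTT"]
otu_consensus = consensus(otu_reads)
print(otu_consensus)
```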

46 BIOINFORMATICS of 16S metagenomics 3.2 OTU annotation: taxonomic annotation of the reads. The OTUs are aligned against a manually curated alignment of 16S sequences, representative of the known bacterial diversity (several 16S alignments are available in databases; the most used are SILVA and RDP).
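As a rough illustration of annotation, an OTU consensus can be labelled with the most similar reference sequence. This is only a sketch: the two-entry reference dictionary and its genus names are invented stand-ins for a curated database, sequences are assumed to be pre-aligned to the same length, and real tools use proper alignment or dedicated classifiers.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def annotate(consensus_seq, references):
    """Return (name, identity) of the reference most similar to the consensus."""
    best = max(references, key=lambda name: identity(consensus_seq, references[name]))
    return best, identity(consensus_seq, references[best])


# Hypothetical reference "database".
references = {"Genus_one": "ACGTACGTAC", "Genus_two": "TTGTACCTAC"}
hit = annotate("ACGTACGTAA", references)
print(hit)
```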

47 SOFTWARE FOR 16S METAGENOMICS PERFORMS ALL THE STEPS: 1. Sample reduction 1.1 Quality control 1.2 Identical sequences merge 2. Selection of homologous 16S sub-sequences 3. OTU classification 3.1 OTU identification 3.2 OTU annotation. Example: MOTHUR. Knowledge of the process is fundamental to interact with the software, set parameters and analyze results. The pipelines are highly customizable.

48 BIOINFORMATICS of 16S metagenomics Finally the output! [Table: OTUs as rows; Sample1, Sample2, Sample3 as columns] Note: the number of OTUs does not correspond to the number of species! Several OTUs can be assigned to the same species. Note 2: discrimination at the species level is difficult; better to stop at the genus.

49 Comparative analysis of microbiomes 16S Microbiomics is comparative by nature

50 Comparative analysis of microbiomes The obtained data can be used to compare the composition of different microbial communities. Alpha-diversity is the measure of the diversity within a population (how many different types are in the sample). Beta-diversity is the measure of the inter-population diversity.

51 Comparative analysis of microbiomes Several indexes and statistical tests are available to study alpha and beta diversities. These methods are imported directly from ecology.

52 Comparative analysis of microbiomes Alpha-diversity: many of the indexes used to study microbial alpha diversity are also used in ecology, e.g. the Shannon index and the Simpson index, used to quantify the biodiversity of a habitat. They take into account the number of species present and the abundance of each species.
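Both indexes are computed from the relative abundance of each OTU in a sample. A minimal sketch; the Simpson index has several conventions, and the form used here (1 minus the sum of squared proportions, often called Gini-Simpson) is one assumption among them. The count vectors are invented.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2): chance that two reads differ in OTU."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


even_sample = [25, 25, 25, 25]    # four equally abundant OTUs
skewed_sample = [97, 1, 1, 1]     # same richness, one dominant OTU
for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, round(shannon_index(counts), 3), round(simpson_index(counts), 3))
```

Both samples contain four OTUs, but the even one scores higher on both indexes because abundance is taken into account as well as richness.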

53 Comparative analysis of microbiomes Beta-diversity: in order to compare the composition of different microbial populations we can use the Bray-Curtis dissimilarity index to calculate a dissimilarity matrix, and then apply clustering methods to group the populations.
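The Bray-Curtis dissimilarity between two samples is the sum of absolute differences in OTU counts divided by the sum of all counts. The sketch below builds the pairwise matrix and picks the closest pair, which is the first step a hierarchical clustering would take; the OTU counts are invented, and samples are assumed to be non-empty.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: 0 = identical composition, 1 = nothing shared."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def dissimilarity_matrix(samples):
    """Return {(name1, name2): dissimilarity} for every ordered pair of samples."""
    return {
        (p, q): bray_curtis(samples[p], samples[q])
        for p in samples
        for q in samples
    }


# Hypothetical OTU counts (same OTU order in every sample).
samples = {
    "Sample1": [10, 20, 30],
    "Sample2": [10, 20, 30],
    "Sample3": [0, 0, 60],
}
matrix = dissimilarity_matrix(samples)
closest_pair = min(((p, q) for (p, q) in matrix if p < q), key=matrix.get)
print(closest_pair, matrix[closest_pair])
```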

54 MICROBES = bacteria? Additional approaches consider: the fungal community = MYCOBIOME (18S or ITS amplicon sequencing); the viral community = VIROME (???). Could we design an unbiased approach to obtain the entire MICROBIOME?

55 Microbiome 2.0 Shotgun metagenomics

56 Next-gen sequencing technologies are so productive that a complex sample such as a microbial community can be fully sequenced without 'filter' steps. This is the concept of shotgun metagenomics. Ultra-deep sequencing is needed.

57 SHOTGUN METAGENOMICS This approach hypothetically allows identification of not just every taxon, but every gene present in the sample. Consequently, it is possible to analyze the complexity of the metabolic reactions present in the sample.

58 Downstream bioinformatics analysis to discriminate what is present in a shotgun microbiome sample can be very challenging. Ad-hoc software is now being developed specifically for these approaches.

59 Downstream analyses Once you get your metagenomic reads, you can perform: 1. IDENTIFICATION AND QUANTIFICATION: know what species are in the sample and how many cells there are for each of them. 2. FUNCTIONAL ANALYSIS: know what functions/genes are in the sequenced community (with or without knowing which species has which gene). 3. More specific things such as GROWTH ANALYSIS: know which population is growing. A closed reference genome and organisms with one single origin of replication are needed (very new and rare).

60 1. IDENTIFICATION AND QUANTIFICATION It is possible to characterize and quantify the community in a sequenced sample. APPROACH: look for reads corresponding to marker genes (signature genes suited to recognizing an organism). This can be performed by mapping on more than one marker gene at the same time (more precise than the amplicon-based approach). There are two ways of calculating the taxonomy: based on sequence alignments and similarity, or using phylogeny. [Figure: one of the output images of MetaPhlAn2]

61 2. FUNCTIONAL ANALYSIS Reads can be assembled to look for functions/genes. Gene function can be annotated by sequence similarity or predicted. Sometimes the assembly is so fragmented that you get only pieces of genes (function can still be predicted). Whose gene is this? Functions can be assigned to a taxonomy by BINNING, which can be performed before or after the assembly. Taxonomy assignment is based on: sequence similarity (alignment to known genomes); sequence content (percentage of GC, presence of specific k-mers); similar read coverage. [Figure labels: E. coli, B. subtilis, P. putida]
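The sequence-content features used for binning are simple to compute. The sketch below shows GC percentage and a k-mer frequency profile for a contig; how these features are then grouped into bins is left out, and the example sequences are invented.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_profile(seq: str, k: int = 4):
    """Relative frequency of every k-mer observed in the sequence."""
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(kmers.values())
    return {kmer: count / total for kmer, count in kmers.items()}


contig = "GGCCAATT"
print(gc_content(contig))
print(kmer_profile("ACGTAC", k=2))
```

Contigs from the same genome tend to share similar values for these features, which is why they can be used to group contigs of unknown origin.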

62 3. GROWTH ANALYSIS Know which population is growing. A closed reference genome and organisms with one single origin of replication are needed (very new and rare). Peak-to-trough ratio: look at the difference in coverage between the origin of replication and other genome locations.
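The peak-to-trough idea can be sketched on binned coverage along a circular genome: in a replicating population, regions near the origin are covered more than regions near the terminus, so the ratio exceeds 1. This is a toy version with a simple circular moving average; the window size and coverage values are assumptions, and published methods use more careful smoothing and fitting.

```python
def smooth_circular(values, window=3):
    """Moving average over a circular genome (window must be odd)."""
    n = len(values)
    half = window // 2
    return [
        sum(values[(i + j) % n] for j in range(-half, half + 1)) / window
        for i in range(n)
    ]


def peak_to_trough_ratio(bin_coverage, window=3):
    """Ratio between the highest and lowest smoothed coverage bins."""
    smoothed = smooth_circular(bin_coverage, window)
    if min(smoothed) <= 0:
        raise ValueError("coverage must be positive in every smoothed bin")
    return max(smoothed) / min(smoothed)


# Hypothetical coverage per genome bin, starting at the origin of replication.
growing = [20, 18, 15, 12, 10, 12, 15, 18]
resting = [10] * 8
print(round(peak_to_trough_ratio(growing), 2), peak_to_trough_ratio(resting))
```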

63 Microbiome 3.0 Ad-hoc approaches

64 Mini metagenomics A complex sample is passed through a cell sorter.

65 Mini metagenomics Cells of interest are selected. These can be all bacteria, all eukaryotic cells, all bacteria of one species, all leukocytes... The sample is sequenced. Shotgun metagenomics methods are used... on a much simpler community.

66 Single-cell genomics If I can sort all cells of a single type, I can also select SINGLE CELLS. A single cell must be processed in nanovolumes. The genome must be amplified using unbiased techniques of whole-genome amplification. The DNA can then be sequenced and analyzed. Examples of applications: full-genome prenatal diagnosis, cancer cell genomics. Single-cell approaches can be used to analyze expression patterns as well: single-cell transcriptomics.