Phylogenetic methods for taxonomic profiling

Similar documents
Using Ensembles of Hidden Markov Models in metagenomic analysis

TIPP: Taxon Iden,fica,on using Phylogeny-Aware Profiles

Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

Microbiomes and metabolomes

SUPPLEMENTARY INFORMATION

Introduction to OTU Clustering. Susan Huse August 4, 2016

Carl Woese. Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life

Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014

A base composition analysis of natural patterns for the preprocessing of metagenome sequences

Evaluation of the liver abscess microbiome and liver abscess prevalence in cattle reared for production of natural branded beef

Grundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson

Metagenomics Computational Genomics

Main Topics. Microbial habitats. Microbial habitats. Lecture 21: Bacterial diversity and Microbial Ecology. Ecological characteristics of bacteria

Carl Woese. Used 16S rrna to develop a method to Identify any bacterium, and discovered a novel domain of life

Genome 373: Genomic Informatics. Elhanan Borenstein

Microbiome: Metagenomics 4/4/2018

Antibiotic Resistance Genes: From The Farm To The Human Gut

Developing Tools for Rapid and Accurate Post-Sequencing Analysis of Foodborne Pathogens. Mitchell Holland, Noblis

Bioinformatics for Microbial Biology

Bioinformatic tools for metagenomic data analysis

Microgenetix Meaningful Microbial Diversity Gut Health for Good Health

ReprDB and pandb: minimalist databases with maximal microbial representation

Functional profiling of metagenomic short reads: How complex are complex microbial communities?

Assigning Sequences to Taxa CMSC828G

Supplementary Figure 1 Schematic view of phasing approach. A sequence-based schematic view of the serial compartmentalization approach.

Evaluation of the liver abscess microbiome and liver abscess prevalence in cattle reared for production of natural branded beef

CBC Data Therapy. Metagenomics Discussion

Introduc)on to QIIME on the IPython Notebook

Metagenomic species profiling using universal phylogenetic marker genes

Methods for the phylogenetic inference from whole genome sequences and their use in Prokaryote taxonomy. M. Göker, A.F. Auch, H. P.

HMP Data Set Documentation

SUPPLEMENTARY INFORMATION

Creation of a PAM matrix

Scoring Alignments. Genome 373 Genomic Informatics Elhanan Borenstein

Taxator-tk: precise taxonomic assignment of metagenomes by fast approximation of evolutionary neighborhoods.

INTRODUCTION A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all

HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies

Methods for comparing multiple microbial communities. james robert white, October 1 st, 2007

MB 668 Microbial Bioinformatics and Genome Evolution. 4 credits Spring, 2017

GPU-Meta-Storms: Computing the similarities among massive microbial communities using GPU

Integrating Evolutionary, Ecological and Statistical Approaches to Metagenomics. A proposal to the Gordon and Betty Moore Foundation

Practical Bioinformatics for Life Scientists. Week 14, Lecture 27. István Albert Bioinformatics Consulting Center Penn State

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Experimental Design Microbial Sequencing

Machine Learning. HMM applications in computational biology

Lecture 8: Predicting metagenomic composition from 16S survey data

Genetics and Bioinformatics

Report on database pre-processing

BE CHMARKI G BLAST ACCURACY OF GE US/PHYLA CLASSIFICATIO OF METAGE OMIC READS

Novel bacterial taxa in the human microbiome

Functional annotation of metagenomes

Meta-IDBA: A de Novo Assembler for Metagenomic Data

Supplementary Figures

Suppl Methods for Biogeochemistry paper

What is metagenomics?

Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

Alignment-free d2 oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences

Analysis of milk microbial profiles using 16s rrna gene sequencing in milk somatic cells and fat

03-511/711 Computational Genomics and Molecular Biology, Fall

Evidence-Based Clustering of Reads and Taxonomic Analysis of Metagenomic Data

Microbiome Analysis. Research Day 2012 Ranjit Kumar

Supplementary Online Content

An introduction into 16S rrna gene sequencing analysis. Stefan Boers

Lecture 8: Predicting and analyzing metagenomic composition from 16S survey data

SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM

Nature Biotechnology: doi: /nbt Supplementary Figure 1. MBQC base beta diversity, major protocol variables, and taxonomic profiles.

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Computing for Metagenome Analysis

UNIVERSITY OF EAST ANGLIA School of Computing Sciences Main Series UG Examination ALGORITHMS FOR BIOINFORMATICS CMP-6034B

Microbiome Diversity on Materials

Introduction to CGE tools

RNA Structure Prediction. Algorithms in Bioinformatics. SIGCSE 2009 RNA Secondary Structure Prediction. Transfer RNA. RNA Structure Prediction

ISQuest: finding insertion sequences in prokaryotic sequence fragment data

TECHNIQUES FOR STUDYING METAGENOME DATASETS METAGENOMES TO SYSTEMS.

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Functional analysis using EBI Metagenomics

DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences

COMPARING MICROBIAL COMMUNITY RESULTS FROM DIFFERENT SEQUENCING TECHNOLOGIES

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

Applications of Next Generation Sequencing in Metagenomics Studies

UTAX accurately predicts taxonomy of marker gene sequences

Overview of CIDT Challenges and Opportunities

Single-cell genome sequencing at ultra-high-throughput with microfluidic droplet barcoding

CBC Data Therapy. Metatranscriptomics Discussion

RHIZOSPHERE METAGENOMICS OF THREE BIOFUEL CROPS. Jiarong Guo

Log abundance distribution: assess detection limit of as low as DNA of three microbes.

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

Log abundance distribution: assess detection limit of as low as DNA of three microbes.

Comparative genomics of clinical isolates of Pseudomonas fluorescens, including the discovery of a novel disease-associated subclade.

From AP investigative Laboratory Manual 1

Biology 644: Bioinformatics

Jianguo (Jeff) Xia, Assistant Professor McGill University, Quebec Canada June 26, 2017

Introduction to Microbial Sequencing

Genome sequence of Brucella abortus vaccine strain S19 compared to virulent strains yields candidate virulence genes

GenScale Scalable, Optimized and Parallel Algorithms for Genomics. Dominique LAVENIER

DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment

Fungal ITS Bioinformatics Efforts in Alaska

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

Enrichment of the lung microbiome with gut bacteria in sepsis and the acute respiratory distress syndrome

Transcription:

Phylogenetic methods for taxonomic profiling Siavash Mirarab University of California at San Diego (UCSD) Joint work with Tandy Warnow, Nam-Phuong Nguyen, Mike Nute, Mihai Pop, and Bo Liu

Phylogeny reconstruction pipeline samples Sequencing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG Bioinformatic processing 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT 2

Phylogeny reconstruction pipeline Step 1: Multiple sequence alignment samples ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG Sequencing CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G Bioinformatic processing 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT 000 2

Phylogeny reconstruction pipeline Step 2: Species tree reconstruction Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method Orangutan anzee 2

Phylogeny reconstruction pipeline Step 2: Species tree reconstruction Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA 2

Phylogeny reconstruction pipeline Step 2: Species tree reconstruction Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA Step 3: Phylogenetic placement 00 ----ATGGCGA--- gene 30 -ACATGGCT----- 000 -ACATGGCT----- -----CATTGCT-- 2

Phylogeny reconstruction pipeline PASTA UPP Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G Step 2: Species tree reconstruction 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA Step 3: Phylogenetic placement 00 ----ATGGCGA--- gene 30 -ACATGGCT----- 000 -ACATGGCT----- -----CATTGCT-- 2

Phylogeny reconstruction pipeline PASTA UPP Sta$s$cal binning Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G Step 2: Species tree reconstruction 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method ASTRAL Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA Step 3: Phylogenetic placement 00 ----ATGGCGA--- gene 30 -ACATGGCT----- 000 -ACATGGCT----- -----CATTGCT-- 2

Phylogeny reconstruction pipeline PASTA UPP Sta$s$cal binning Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G Step 2: Species tree reconstruction 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method ASTRAL Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA Step 3: Phylogenetic placement SEPP TIPP 00 ----ATGGCGA--- 2 gene 30 -ACATGGCT----- 000 -ACATGGCT----- -----CATTGCT--

Phylogeny reconstruction pipeline PASTA UPP Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G Step 2: Species tree reconstruction 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA Step 3: Phylogenetic placement SEPP TIPP 00 ----ATGGCGA--- 3 gene 30 -ACATGGCT----- 000 -ACATGGCT----- -----CATTGCT--

Microbiome analyses using evolutionary trees ACCG CGAG CGGT GGCT TAGA GGAG GcTT ACCT Escherichia coli Salmonella enterica Salmonella bongori Yersinia pestis Vibrio cholerae Fragmentary metagenomic reads A reference dataset of full length sequences with an alignment and a tree Place each fragmentary read independently on a reference tree of known sequences 4

Phylogenetic placement Input: A backbone multiple sequence alignment for a marker gene, including sequences from known species A backbone ML phylogenetic tree, corresponding to the backbone alignment A collection of (fragmentary, error-prone) query sequences

Phylogenetic placement Input: A backbone multiple sequence alignment for a marker gene, including sequences from known species A backbone ML phylogenetic tree, corresponding to the backbone alignment A collection of (fragmentary, error-prone) query sequences Output: Probabilistic placements of each query sequence on the phylogenetic tree after (locally) aligning the query to the reference

Phylogenetic placement Input: A backbone multiple sequence alignment for a marker gene, including sequences from known species A backbone ML phylogenetic tree, corresponding to the backbone alignment A collection of (fragmentary, error-prone) query sequences Output: Probabilistic placements of each query sequence on the phylogenetic tree after (locally) aligning the query to the reference Tools: Alignment: HMMER Placement: pplacer (Matsen) and EPA (RAxML)

Phylogenetic placement Input: A backbone multiple sequence alignment for a marker gene, including sequences from known species SATe-Enabled Phylogenetic Placement A backbone ML phylogenetic tree, corresponding to the backbone alignment (SEPP) A collection of (fragmentary, error-prone) query sequences Output: Probabilistic placements of each query sequence on the phylogenetic tree after (locally) aligning the query to the reference Tools: Alignment: HMMER Placement: pplacer (Matsen) and EPA (RAxML)

Phylogenetic placement simulations S. Mirarab et al., PSB. (2012). 0.0 Increasing rate of evolution

Reference tree

HMM for the alignment step

Ensemble of HMMs

Ensemble of HMMs

Ensemble of HMMs

SATe-Enabled Phylogenetic Placement (SEPP) Step 1: Align each query sequence to the backbone alignment Use an ensemble of disjoint HMMs instead of using a single HMM to improve accuracy. The ensemble is created based on the reference tree such that each model better captures details of a part of a tree 12 S. Mirarab et al., PSB. (2012).

SATe-Enabled Phylogenetic Placement (SEPP) Step 1: Align each query sequence to the backbone alignment Use an ensemble of disjoint HMMs instead of using a single HMM to improve accuracy. The ensemble is created based on the reference tree such that each model better captures details of a part of a tree Step 2: Place each query sequence into the backbone tree, using extended alignment Use divide-and-conquer on the backbone tree to improve scalability to reference trees with tens of thousands of leaves 12 S. Mirarab et al., PSB. (2012).

SEPP on simulated data S. Mirarab et al., PSB. (2012). 0.0 0.0 Increasing rate of evolution

SEPP on large 16S references Simulations: 16S bacteria, 13k curated backbone tree, 13k fragments Running time Memory

SEPP on large 16S references Simulations: 16S bacteria, 13k curated backbone tree, 13k fragments Running time Memory Real data (with Rob Knight s lab; Daniel McDonald): EMP: placing ~300,000 fragments on the greengenes reference tree with 203,452 sequences 8 hours (16 cores) AG: placing ~40,000 fragments on the greengenes reference tree with 203,452 sequences 10 minutes (16 cores)

Taxonomic Profiling Input: Reference multiple sequence alignments for a collection of marker genes, each including sequenced species Reference trees for marker genes. We force trees to be compatible with the taxonomy (not necessary). A metagenomic sample: a collection of fragmentary reads from many species with different abundances

Taxonomic Profiling Input: Reference multiple sequence alignments for a collection of marker genes, each including sequenced species Reference trees for marker genes. We force trees to be compatible with the taxonomy (not necessary). A metagenomic sample: a collection of fragmentary reads from many species with different abundances Output: The taxonomic profile of the sample Genus % Pseudomonas 16.6 Campylobacter 8.9 Streptomyces 7.6 Pasteurella 6.4 Clostridium 5.1 Alcanivorax 4.5 unclassified 1.2 Phylum % Proteobacteria 63.1 Actinobacteria 9.6 Firmicutes 9.6 Euryarchaeota 7.6 Cyanobacteria 4.5 Crenarchaeota 3.8 unclassified 0.0

Taxonomic Profiling Input: Reference multiple sequence alignments for a collection of marker genes, each including sequenced species Reference trees for marker genes. We force trees to be compatible with the taxonomy (not necessary). A metagenomic sample: a collection of fragmentary reads from many species with different abundances Output: The taxonomic profile of the sample Genus % Pseudomonas 16.6 Campylobacter 8.9 Streptomyces 7.6 Taxon Identification and Pasteurella 6.4 Clostridium 5.1 Alcanivorax 4.5 Phylogenetic Profiling unclassified 1.2 (TIPP) Phylum % Proteobacteria 63.1 Actinobacteria 9.6 Firmicutes 9.6 Euryarchaeota 7.6 Cyanobacteria 4.5 Crenarchaeota 3.8 unclassified 0.0

Algorithmic steps Step 1: map fragments to ~30 marker genes using BLAST 16 Nguyen et al., Bioinformatics (2014)

Algorithmic steps Step 1: map fragments to ~30 marker genes using BLAST Step 2: Use SEPP to place reads on the marker trees Take into account uncertainty: use several alignments and placements on the tree (to reach a predefined level of statistical support) 16 Nguyen et al., Bioinformatics (2014)

Algorithmic steps Step 1: map fragments to ~30 marker genes using BLAST Step 2: Use SEPP to place reads on the marker trees Take into account uncertainty: use several alignments and placements on the tree (to reach a predefined level of statistical support) Step 3: Summarize results across genes to get a taxonomic profile Each read contributes to each branch and all branches above it proportionally to the probability that it belongs to that branch Results from all genes are simply aggregated as counts 16 Nguyen et al., Bioinformatics (2014)

Phylogeny reconstruction pipeline PASTA UPP Step 1: Multiple sequence alignment supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G Step 2: Species tree reconstruction 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Orangutan anzee Gene tree estimation samples Sequencing Bioinformatic processing ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG CTGAGCATCG CTGAGCTCG ATGAGCTC CTGACACG 000 AGCCACGCCATA ATGGCACGCCTA AGCTACCACGGAT ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 000 Summary method Orangutan anzee TGGCACGCAACG ATGGCACGCTA ATGGCACGCA AGCTAACACGGAT ATGGCACGA Step 3: Phylogenetic placement SEPP TIPP 00 ----ATGGCGA--- 17 gene 30 -ACATGGCT----- 000 -ACATGGCT----- -----CATTGCT--

Multiple sequence alignment Input: a (potentially ultra-large) set of input sequences from a single gene sequence may be full or fragmentary Output: a multiple sequence alignment Optional: co-estimate the alignment and tree Relevance: useful to get very large reference alignments and trees with up to hundreds of thousands of leaves

Multiple sequence alignment Input: a (potentially ultra-large) set of input sequences from a single gene PASTA sequence may be full or fragmentary Great for trees Output: a multiple sequence alignment Not good for fragmentary data Optional: co-estimate the alignment and tree Relevance: useful to get very large reference alignments and trees with up to hundreds of thousands of leaves

Multiple sequence alignment Input: a (potentially ultra-large) set of input sequences from a single gene PASTA sequence may be full or fragmentary Great for trees Output: a multiple sequence alignment Not good for fragmentary data Optional: co-estimate the alignment and tree UPP Good for fragmentary data Relevance: useful to get very large reference alignments and trees with up to hundreds of thousands of leaves

UPP Steps Step 1: randomly select a small subset of full length sequences (e.g., 1000) as backbone. Step 2: align the backbone using other tools (e.g., using PASTA) Step 3: Use a SEPP-like approach to align the remaining sequences into the reference Note: leaves some insertion sites unaligned

PASTA: Iterative divide-and-conquer alignment and tree estimation A B C D E Decompose dataset A B C D E Estimate tree Align subproblems ABCDE Merge sub-alignments A B C D E S. Mirarab et al., Res. Comput. Mol. Biol. (2014). S. Mirarab et al., J. Comput. Biol. 22 (2015). 20

PASTA on Greengenes Testing the performance of PASTA for building green genes 16S reference tree Q1: Ability to distinguish samples using unifrac? unweighted weighted GG PASTA GG PASTA 88 soils 0.78 0.78 0.75 0.74 infant-time-series 0.55 0.55 0.37 0.42 moving pictures 728 724 2188 2439 global gut 52.9 51.1 79 72 Q2: Speed: 97% tree ( 99,322 leaves): 28 hours (16 cores) 99% tree (203,452 leaves): 49 hours With Daniel McDonald (Knight s lab) and Uyen Mai

Software availability PASTA: github.com/smirarab/pasta (internally uses FastTree, Mafft, HMMER, and OPAL) SEPP: github.com/smirarab/sepp (internally uses pplacer and HMMER) UPP: https://github.com/smirarab/sepp/blob/master/readme.upp.md (internally uses HMMER) TIPP: https://github.com/smirarab/sepp/blob/master/readme.tipp.md (internally uses pplacer and HMMER) Species tree estimation: Statistical binning: https://github.com/smirarab/binning ASTRAL: github.com/smirarab/astral

Acknowledgments Nam-Phuong Nguyen Tandy Warnow s lab: Mike Nute Mihai Pop s lab: Bo Liu Rob Knight s lab Daniel McDonlad Mirarab lab Uyen Mai