Peter T Kim. Department of Mathematics and Statistics University of Guelph, Canada

1 Department of Mathematics and Statistics, University of Guelph, Canada. Department of Pathology and Molecular Medicine, McMaster University, Canada. Program Leader, SAMSI LDHD. Working Group Leader: Nonlinear Low-dimensional Structures in High-dimensions for Biological Data. SAMSI LDHD Instructor: Geometric and Topological Summaries of Data and Inference II. 7 February 2014

2 Metagenomics. Genomics: the study of the genome of individuals. Metagenomics: the study of the cumulative genome in groups of individuals.

4 Data Acquisition. We will describe: DNA Sequencing; Alignment space (and the first ); Trees (and the second ). But first, some preliminaries.

5 DNA. 5′-AGCCTGA-3′. DNA sequences: the free product over an alphabet of four letters, $\mathcal{A} = \{A, G, T, C\}$: $\mathrm{Seq} = \bigcup_{k=1}^{\infty} \mathcal{A}^k$. These letters are the nucleotides (or DNA bases): A = Adenine, G = Guanine, T = Thymine, C = Cytosine.

6 Orientation. 5′-AGCCTGA-3′. The 5′ and 3′ notation gives an orientation. The labels take their name from the numbering of the carbon atoms on the sugar backbone of the nucleotides. DNA is replicated in the 5′ → 3′ direction: across the build frames the strand grows at its 3′ end, from 5′-AGCCTGA-3′ through 5′-AGCCTGAA-3′, 5′-AGCCTGAAT-3′ and 5′-AGCCTGAATC-3′ to 5′-AGCCTGAATCT-3′.

11 Complementary DNA. 5′-AGCCTGA-3′ paired with 3′-TCGGACT-5′. Orientation also matters for complementary DNA sequences, which are generated by interchanging A and T, and G and C, and then switching directions. When DNA is replicated, we refer to the elongating strand as the complement and the other as the template.
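The complement rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function names `complement` and `reverse_complement` are hypothetical.

```python
# Base pairing used on the slide: A <-> T and G <-> C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    """Swap each base for its partner, keeping the written order.

    If seq is written 5'->3', the result reads 3'->5'.
    """
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """Complement, then switch direction so the result again reads 5'->3'."""
    return complement(seq)[::-1]

top = "AGCCTGA"                    # 5'-AGCCTGA-3' from the slide
print(complement(top))             # TCGGACT, read 3'->5' as drawn on the slide
print(reverse_complement(top))     # TCAGGCT, the same strand written 5'->3'
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check on the pairing table.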

12 DNA Mutations. Types: point mutations, insertions, deletions, inversions. Starting pair: 5′-AGCCTGA-3′ / 3′-TCGGACT-5′.

13 DNA Mutations. Point mutation: 5′-AGCATGA-3′ / 3′-TCGTACT-5′.

14 DNA Mutations. Insertion: 5′-AGCGTAATGA-3′ / 3′-TCGCATTACT-5′.

15 DNA Mutations. Deletion: 5′-AGCTGA-3′ / 3′-TCGACT-5′.

16 DNA Mutations. Inversion: 5′-ACAGCA-3′ / 3′-TGTCGT-5′.
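The four mutation types can be written as small string operations. The sketch below is an illustration only: the positions and the inserted fragment are my reading of the sequences shown on the mutation slides, chosen so that the chain of edits passes through the same strings, and the helper names are hypothetical. Inversion is modelled as replacing a segment by its reverse complement.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def point_mutation(seq, pos, base):
    """Replace the base at index pos."""
    return seq[:pos] + base + seq[pos + 1:]

def insertion(seq, pos, fragment):
    """Insert fragment before index pos."""
    return seq[:pos] + fragment + seq[pos:]

def deletion(seq, start, end):
    """Remove the bases at indices start..end-1."""
    return seq[:start] + seq[end:]

def inversion(seq, start, end):
    """Replace seq[start:end] by its reverse complement (assumed model of inversion)."""
    flipped = "".join(COMPLEMENT[b] for b in reversed(seq[start:end]))
    return seq[:start] + flipped + seq[end:]

s0 = "AGCCTGA"
s1 = point_mutation(s0, 3, "A")   # AGCATGA
s2 = insertion(s1, 3, "GTA")      # AGCGTAATGA
s3 = deletion(s2, 3, 7)           # AGCTGA
s4 = inversion(s3, 1, 5)          # ACAGCA
print(s0, s1, s2, s3, s4)
```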

17 Which DNA? To identify the bacteria in a community, we require well-defined DNA tags (sequences). Such a tag should satisfy:
- the tag is present in all bacteria
- the tag is highly conserved
- the tag is varied enough to allow for discrimination
- bacteria do not exchange or otherwise acquire new tags
- each bacterium has exactly one such tag

23 Which DNA: 16S rRNA. The 16S gene encodes the 16S rRNA of the bacterial ribosome, which is crucial for protein production. It has 9 hypervariable regions (V1, V2, ..., V8, V9). Bacterial species are fully identified by their 16S sequence. There are well-defined sequences to choose from, including functional genes. Alternatively, one can search for functional pathways by the shotgun method.

24 Specimen collection. Pipeline overview: collect and process DNA; sequencing and quality control; sequence alignment; analysis at various resolutions; classification. (Figure: reads such as AGTCAGTC, and a tree on the sequences s1 to s5.)

25 Specimen collection. Specimens are collected. Bacteria are present in the specimen. DNA is extracted from the bacteria. Environmental contamination is possible. (Figure labels across the build frames: Specimen, Bacterium, DNA.)

29 PCR (i.e. signal amplification). Polymerase Chain Reaction (PCR):
- dsDNA is suspended in solution.
- Sequence-specific primers are added, as are dNTPs and DNA polymerase.
- The solution is heated to denature the dsDNA.
- The solution is cooled to allow the primers to anneal.
- DNA polymerase elongates 5′ to 3′.
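To see why this cycle amounts to signal amplification, note that each heat, cool and elongate round can at best copy every strand once, so the copy number at most doubles per cycle. The sketch below is a back-of-envelope model, not something stated on the slides; the function name and the `efficiency` parameter (fraction of strands successfully copied per cycle) are assumptions made for illustration.

```python
def amplified_copies(initial_copies, cycles, efficiency=1.0):
    """Expected copy number after a number of PCR cycles.

    efficiency = 1.0 is the idealised case where every strand is copied
    in every cycle; real reactions fall below this.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    return initial_copies * (1.0 + efficiency) ** cycles

# One starting molecule, ten ideal cycles: 2**10 copies.
print(amplified_copies(1, 10))
# Same run at an assumed 90% per-cycle efficiency.
print(amplified_copies(1, 10, efficiency=0.9))
```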

42 PCR Problems. There are some problems associated with PCR:
- Because the PCR mix lacks the biological error-correction tools, insertions, deletions, and point mutations are not corrected.
- Chimera formation. Chimeras are DNA strands that occur when a sequence is not completed during the elongation cycle, is denatured from its template during heating, and then bonds to a different template.
Figure, first frame: template 3′-AGCCTGAAGCCTGA-5′ with the incomplete strand 5′-TCGGACT-3′. Second frame: the same incomplete strand on a different template, 3′-AGCTTGAACGTTGA-5′.
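Chimera formation can be mimicked with the two templates from the figure. The following is a toy sketch under simplifying assumptions: the incomplete strand re-anneals at the same offset on the second template, elongation then runs to the end, and the helper name `extend` is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend(partial, template):
    """Finish an incomplete strand along a template.

    partial is written 5'->3', template is written 3'->5', so the two
    strings pair index by index. Only the bases past the end of partial
    are newly synthesised.
    """
    new_bases = "".join(COMPLEMENT[b] for b in template[len(partial):])
    return partial + new_bases

template_a = "AGCCTGAAGCCTGA"   # 3'->5', first frame of the figure
template_b = "AGCTTGAACGTTGA"   # 3'->5', second frame of the figure
partial = "TCGGACT"             # 5'->3', copied from the start of template_a

faithful_a = extend("", template_a)     # complete copy of template_a
faithful_b = extend("", template_b)     # complete copy of template_b
chimera = extend(partial, template_b)   # started on a, finished on b

print(faithful_a)
print(faithful_b)
print(chimera)
```

The chimera matches neither faithful copy: its first half comes from one template and its second half from the other.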

46 Sequencing. There is a murky area between when we place the DNA in the sequencer and when we get the reads. During this process, hundreds of thousands of DNA strands are sequenced, and various quality filters are applied. We obtain information about the quality of each base in a sequence; we trim each sequence at the point where a floating window shows a cumulative quality score below a certain threshold. We also trim sequences to remove ambiguous bases entirely. Example read: 5′-AGTCAGTCAGTC-3′.
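The window-based trimming described here might look like the sketch below. It is an illustration under stated assumptions rather than the pipeline used in the talk: the window length, the threshold, the per-base quality values, and the choice to cut at the start of the first failing window are all hypothetical, as is the name `trim_read`.

```python
def trim_read(bases, quals, window=4, min_sum=80):
    """Trim a read by window quality, then at the first ambiguous base.

    The read is cut at the start of the first window whose summed
    quality is below min_sum. Whatever remains is cut again at the
    first character that is not A, C, G or T.
    """
    end = len(bases)
    for start in range(len(bases) - window + 1):
        if sum(quals[start:start + window]) < min_sum:
            end = start
            break
    kept = bases[:end]
    for i, base in enumerate(kept):
        if base not in "ACGT":
            return kept[:i]
    return kept

read = "AGTCAGTCAGTC"
quality = [30] * 8 + [10] * 4          # made-up scores that decay at the 3' end
print(trim_read(read, quality))        # AGTCAGT
print(trim_read("AGNCAGTC", [30] * 8)) # AG
```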

48 Sequencing. Output of sequencing:
5′-AGATCCCTAGT-3′
5′-AGATCCAG-3′
5′-AGCTCCCGAGTCCGGTCG-3′
5′-AGCTCCGGAGGGTC-3′
5′-AGCTCCCGAGTTCG-3′
5′-AGCTCCCTCCGGTCG-3′
5′-AGATGAGTCCCGTCG-3′

50 Sequence alignment. What accounts for genetic diversity also makes direct sequence comparison difficult.
5′-AGATCCCTAGT-3′
5′-AGATCCAG-3′
5′-AGCTCCCGAGTCCGGTCG-3′
5′-AGCTCCGGAGGGTC-3′
5′-AGCTCCCGAGTTCG-3′
5′-AGCTCCCTCCGGTCG-3′
5′-AGATGAGTCCCGTCG-3′
Insertions and deletions shift read frames between sequences; we must adjust. We must somehow align the homologous subsequences.
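The slides do not say which alignment method is used. As one standard way to insert gaps so that matching positions line up, here is a minimal Needleman-Wunsch global alignment; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and this is a swapped-in textbook technique rather than the author's procedure.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment by dynamic programming.

    Returns (aligned_a, aligned_b, score), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

# Two of the reads from the slide.
row_a, row_b, total = needleman_wunsch("AGATCCCTAGT", "AGATCCAG")
print(row_a)
print(row_b)
print(total)
```

Both output rows have the same length, and deleting the gap characters recovers the original reads.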

53 Sequence alignment. Defining positional homology is tricky. Our attempt to formalise it follows. Definition: Given two (sub)sequences λ1 and λ2, λ1 is said to be a functional ancestor of λ2 if λ2 serves a purpose functionally similar to λ1 and if λ2 could have arisen through mutation of a direct series of functional ancestors from λ1. Example pair: 5′-AGTCCTGAACG-3′ and 5′-AGTTOOOAACG-3′.

54 Sequence alignment. Definition: Given two sequences, subsequences λ1 and λ2 (one from each) are said to be homologous if they originate by way of mutation from the same functional ancestor. We seek to align sequences according to their homologous subsequences; we refer to this as positional homology. This permits comparison and clustering. Example pair: 5′-AGTCCTGAACG-3′ and 5′-AGTTOOOAACG-3′.

56 Sequence alignment.
Specimen:
5′-AGATCCCTAGT-3′
5′-AGATCCAG-3′
5′-AGCTCCCGAGTCCGGTCG-3′
5′-AGCTCCGGAGGGTC-3′
5′-AGCTCCCGAGTTCG-3′
5′-AGCTCCCTCCGGTCG-3′
5′-AGATGAGTCCCGTCG-3′
Alignment Library (rows run 5′ to 3′ and are truncated on the slide):
A---T-G--A-C-G-C--T-G-G-C--G-G--C-A-T-G
AC--G----A-C-G-C--T-G-G-C--G-G--C-A-T-G
AC--G----A-C-G-T--T---G-C--G-A--T-G-C-G
AC--G-A----C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A----C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A----CAG-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-G-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-T-G
AC--G-A--A-C-G-C--T-G-G-C--G-G--C-A-T-G...

57 Sequence alignment. Score 58. (Figure: the specimen reads placed against the rows of the alignment library; the first library row is 5′-A---T-G--A-C-G-C--T-G-G-C--G-G--C-A-T-G...-3′.)

59 Example. [Figure: stacked bar chart of % relative abundance (0% to 100%) by phylum: Verrucomicrobia, Tenericutes, Spirochaetes, Proteobacteria, Fusobacteria, Firmicutes, Cyanobacteria, Bacteroidetes, Actinobacteria. In the second build frame one bar is labelled "ME!".]

61 Phylogenetic relationships. Up to now, the identities of the sequences are unknown. We can now make use of a reference library whose sequence origins are known, or otherwise build a hierarchy using the notion of genetic distance. Some divergences used for genetic distance include:
- eachgap: each point mutation is penalized (e.g. m = 1), and each base/gap pair is penalized (e.g. g = 1)
- onegap: each point mutation is penalized, and a contiguous series of base/gap pairs is penalized as one
- nogap: each point mutation is penalized, and gaps are ignored
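The three divergences can be computed in one pass over a pair of aligned sequences. The sketch below is my reading of the definitions above, with a few assumptions flagged: gaps are written as '-', columns where both rows have a gap are ignored and end a gap run, and the function name and the example pair are illustrative only.

```python
def divergences(a, b, m=1, g=1, gap="-"):
    """Return the eachgap, onegap and nogap divergences of two aligned rows.

    m is the penalty per point mutation, g the penalty per base/gap pair
    (eachgap) or per contiguous run of base/gap pairs (onegap).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = gap_columns = gap_runs = 0
    in_run = False
    for x, y in zip(a, b):
        if (x == gap) != (y == gap):      # exactly one row has a gap here
            gap_columns += 1
            if not in_run:
                gap_runs += 1
            in_run = True
        else:
            in_run = False                # assumption: gap/gap columns end a run
            if x != gap and x != y:
                mismatches += 1
    return {
        "eachgap": m * mismatches + g * gap_columns,
        "onegap": m * mismatches + g * gap_runs,
        "nogap": m * mismatches,
    }

# One mismatch (C vs T) and one run of three gap columns.
print(divergences("AGTCCTGAACG", "AGTT---AACG"))
```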

62 Trees and Persistence. Consider a phylogenetic tree with leaves s1, ..., s5. Using the leaves s_i as points in a metric space, with distance given by branch lengths, obtain a persistence diagram at the β0 level. β0 tracks the connectedness of the sublevel sets. Persistence is trivial at higher dimensions by contractibility. (Figure: the tree on s1 to s5 beside a β0 persistence plot against height h.)

66 Trees and Persistence. Consider an arbitrary metric space M. Take a finite collection of points X from M. Compute persistence on a tree as before, or apply the persistence algorithm directly to X. What is different?
⇒ Higher-dimensional persistence is no longer trivial.
⇒ β0 no longer perfectly tracks sublevel sets.

73 Example. Consider the following collection of sequences taken from sequence space under the eachgap metric, and the true tree.
s1 = GGTTGCAA
s2 = GATTCCAA
s3 = GATTCCAT
s4 = GCTTCCAT
s5 = GGTTGCAT
Recall the eachgap metric: given two aligned sequences, we apply a penalty (e.g. 1) to each pairwise mismatch and a penalty (e.g. 1) to each nucleotide/gap pair.
(Figures: the true tree on s1 to s5 against time; the VR-complex on s1 to s5 at ε = 0, 1/16, 2/16 and 3/16; the resulting β0 and β1 persistence plots against ε.)
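The β0 part of this example can be reproduced by hand or with a short script. β0 of a Vietoris-Rips complex at a given scale equals the number of connected components of the graph that joins two points whenever their distance is at most that scale, so a union-find pass suffices. This is a sketch under assumptions: penalties are 1, thresholds are raw mismatch counts rather than the ε values on the slides, and the names `eachgap` and `components_at` are hypothetical.

```python
from itertools import combinations

SEQS = {
    "s1": "GGTTGCAA",
    "s2": "GATTCCAA",
    "s3": "GATTCCAT",
    "s4": "GCTTCCAT",
    "s5": "GGTTGCAT",
}

def eachgap(a, b, m=1, g=1):
    """Penalty m per mismatch and g per base/gap column of two aligned rows."""
    return sum(g if "-" in (x, y) else m for x, y in zip(a, b) if x != y)

def components_at(threshold, seqs=SEQS):
    """Number of connected components when pairs within threshold are joined."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(seqs, 2):
        if eachgap(seqs[u], seqs[v]) <= threshold:
            parent[find(u)] = find(v)
    return len({find(name) for name in seqs})

for t in range(4):
    print(t, components_at(t))
```

At threshold 1 the points split into {s1, s5} and {s2, s3, s4}; at threshold 2 everything is connected. At that same threshold the edges s1-s2, s2-s3, s3-s5 and s5-s1 form a four-cycle whose diagonals both have distance 3, which is consistent with the nontrivial β1 plotted on the slide.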

80 Example: Contrast. Tree-based: β0 persistence against height h on the tree with leaves s1 to s5. Versus metric-based: β0 and β1 persistence against ε.

81 Questions. Three questions arise:
- How treelike is our data?
- Does our metric violate treeness?
- Does our tree selection violate the data?
What do these terms mean?

86 Acknowledgments. Stephen Rush. Jane Belanger, Tatiana Petukhova, Shaun Pinder, Connor Richardson, Amy Sturgeon. Christine Lee MD, Marek Smieja MD, Scott Weese DVM. SAMSI WG: Nonlinear Low-dimensional Structures in High-dimensions for Biological Data.
