Review of whole genome methods. Suffix-tree based: MUMmer, Mauve, multi-mauve. Gene based: Mercator, multiple orthology approaches. Dot plot/clustering based: MUMmer 2.0, Pipmaker, LASTZ. 10/3/17
Rationale: MUMmer 2.0. The original implementation required large amounts of memory. Advantages: chromosome-scale inversions in bacteria; large-scale duplications in Arabidopsis; ancient human duplications when amino acid space is explored. >70% of human chr 14 derives from chr 2.
Improvements: uses suffix trees for a linear time and space solution, but with room for improvement. Memory reduced from 293MB to 100MB using the suffix tree improvements of Kurtz (20 bytes/bp). Time down from 74s to 27s using streaming.
Idea of algorithm: we take a streaming string and run McCreight's algorithm to find where it would go in the suffix tree of the reference. If it branches off in a leaf edge, the match is unique in the reference. We then check the character immediately to the left in both strings for left maximality.
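The matching idea above can be illustrated with a small sketch. It deliberately swaps the suffix tree and McCreight's algorithm for naive substring search, so it is far slower than linear time; it only shows the three checks (longest extension to the right, uniqueness in the reference, left maximality). The function names and the `min_len` cutoff are hypothetical, not part of MUMmer.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def unique_maximal_matches(reference, query, min_len=3):
    """Return (ref_pos, query_pos, length) for matches that are unique in the
    reference and cannot be extended to the left or right.

    Naive stand-in for streaming the query against a suffix tree.
    """
    matches = []
    for q in range(len(query)):
        # Extend to the right while the piece still occurs in the reference.
        length = 0
        while q + length < len(query) and query[q:q + length + 1] in reference:
            length += 1
        if length < min_len:
            continue
        piece = query[q:q + length]
        # In suffix tree terms: the match must end in a leaf edge.
        if count_occurrences(reference, piece) != 1:
            continue
        r = reference.find(piece)
        # Left maximality: the characters to the left must differ (or not exist).
        if q > 0 and r > 0 and query[q - 1] == reference[r - 1]:
            continue
        matches.append((r, q, length))
    return matches


if __name__ == "__main__":
    print(unique_maximal_matches("GATTACAGG", "CATTACAT"))
```

Note that uniqueness is only checked in the reference here; nothing in this sketch checks whether the matched piece is also unique in the query.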
A mini quiz: you are given two genomes that your biologist colleagues think have perfectly matching repeats (>2 copies in each). How would you find the length of the longest matching repeat within one genome (and in how much time)? How would you find the longest repeat shared between two genomes?
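One possible way to experiment with the quiz on small strings is sketched below. It sorts suffixes naively and compares neighbours, which is not the linear-time suffix tree approach the lecture has in mind, and it reports substrings occurring at least twice rather than more than two times. The function names and the `#` separator (assumed absent from both genomes) are illustrative choices.

```python
def common_prefix_len(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:]."""
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n


def longest_repeat(genome):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    order = sorted(range(len(genome)), key=lambda i: genome[i:])
    best = ""
    # The longest repeat is a common prefix of two suffixes adjacent in sorted order.
    for i, j in zip(order, order[1:]):
        n = common_prefix_len(genome, i, j)
        if n > len(best):
            best = genome[i:i + n]
    return best


def longest_shared(a, b, sep="#"):
    """Longest substring present in both a and b (sep must not occur in either)."""
    s = a + sep + b
    order = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for i, j in zip(order, order[1:]):
        # Only compare neighbouring suffixes that start in different genomes.
        if (i < len(a)) == (j < len(a)):
            continue
        n = common_prefix_len(s, i, j)
        if n > len(best):
            best = s[i:i + n]
    return best


if __name__ == "__main__":
    print(longest_repeat("ATCGATCGA"))
    print(longest_shared("GATTACA", "TTACGG"))
```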
Pros and cons. Question 1: If you stream one or more strings against a suffix tree, are matches guaranteed to be unique in the queries? Question 2: What are the advantages and disadvantages (if any) of using protein sequences instead of nucleotide ones?
Yeast paper: beer may have cemented human societies through social acts, rituals, medicine, and uncontaminated water. Yeast, along with crops, may have also been domesticated.
Background: brewing evolved in Middle Ages Europe to produce ale-type beer via Saccharomyces cerevisiae, the same yeast used in wine and leavened bread. Lager-brewing arose in 15th century Bavaria and is the most popular technique. Lager, however, requires slow, low-temperature fermentation by cryotolerant yeast(s).
Results: Saccharomyces are associated with oak trees in the Northern Hemisphere. This study focused on Patagonia in South America, with 123 cryotolerant species and two isolates of S. cerevisiae. The fact that so many were cryotolerant is unique relative to the Northern Hemisphere. In biological assays, these group with the two known contaminants of lager/cider/wine fermentation.
Genome sequencing: relationships are contentious, as the lager yeast and related yeasts were previously found only in human fermentation efforts. To address this issue, the authors sequenced representatives from Patagonia and breweries using short-read/next-gen technology. Comparisons were done to inform the biology here.
Domestication and analysis: lager yeast is a mix of at least three yeast species. Interestingly, all cryotolerant species have the same chunk of S. cerevisiae useful for processing maltose. Maltose is one of the most abundant sugars in the wort used in brewing. Fusion seems to have happened at least twice (see optional paper on course site).
Sequence Assembly Required! ISMB 2007
Sequence Assembly: Genome -> Sequenced Fragments (reads) -> Assembled Contigs -> Finished Genome
Greedy solution is bounded
Typical assembly strategy: comparing all (n choose 2) pairs takes Θ(n^2 l^2) run-time. Directly detect promising pairs instead: an exact matching filter leaves O(n) pairs and O(n l^2) run-time.
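A minimal sketch of an exact-matching filter follows, assuming n is the number of reads and l the read length. It indexes every length-k substring of every read and proposes only pairs of reads that share at least one such substring; the function name `candidate_pairs` and the choice of k are hypothetical. The O(n) pair count on the slide is not guaranteed by this sketch: a k-mer from a repeat that appears in many reads still produces a large bucket of pairs.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        # Every two reads in the same k-mer bucket are worth aligning.
        pairs.update(combinations(sorted(rids), 2))
    return pairs


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "TTTTTT"]
    print(candidate_pairs(reads, k=4))
```

Only the surviving pairs would then go through the more expensive overlap alignment.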
Traditional Assemblers: TIGR Assembler, CAP3/PCAP, PHRAP, Celera Assembler, ARACHNE, JAZZ, PHUSION, ATLAS. Advantages: effective heuristics to solve this NP-complete problem; brute-force parallelization is easy to implement. Limitations: Θ(n^2) space required in the worst case; limited scaling as a result of using disk.
A look at the maize genome: repeats, gene islands
Problems due to repeats
Types of sequencing gaps (slide from Mihai Pop and Michael Schatz)
Modern assembly using de Bruijn graphs: G = (V, E), where V is the set of all length-k subfragments and E contains a directed edge between two nodes if they overlap by k-1 characters. Relevant papers: De Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001. Good news: the correct assembly exists as a path through G. Bad news: there are many such paths!
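The graph definition above can be written out directly. This is a sketch under the slide's node-centric definition (nodes are length-k substrings, edges join nodes overlapping by k-1 characters); the function name `de_bruijn` and the adjacency-dictionary representation are illustrative choices, and no error correction or graph simplification is attempted.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each length-k substring to the nodes it overlaps by k-1 characters."""
    nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    # Group nodes by their first k-1 characters for fast successor lookup.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:k - 1]].append(node)
    # An edge u -> v exists when the last k-1 characters of u start v.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}


if __name__ == "__main__":
    graph = de_bruijn(["ACGT", "CGTA"], k=3)
    for node, successors in sorted(graph.items()):
        print(node, "->", successors)
```

A node with more than one successor is a point where several paths, and so several candidate assemblies, diverge.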
Try it out! Consider the text: "It was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness". Nodes in the graph are overlapping phrases of length 4, e.g. "It was the best" and "was the best of". Draw an edge between nodes if the last three words of one node match the first three of another.
Iowa State University
Try it out! (part 2) Consider the text: "It was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness". How could you construct an assembly based on this graph? Are there multiple answers? How many possible answers are correct?
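To explore the exercise by computer, the sketch below builds the word-level graph and enumerates walks through it. It rests on several assumptions that are not on the slide: the text is lowercased so "It" and "it" are the same word, the starting phrase is known, and each phrase must be used exactly as often as it occurs in the text (in real assembly this multiplicity would have to be estimated). The names `build_graph` and `reconstructions` are hypothetical.

```python
from collections import Counter, defaultdict

TEXT = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")


def build_graph(words, k=4):
    """Nodes are k-word phrases; edges join phrases overlapping by k-1 words."""
    grams = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
    counts = Counter(grams)
    by_prefix = defaultdict(list)
    for node in counts:
        by_prefix[node[:-1]].append(node)
    edges = {node: by_prefix.get(node[1:], []) for node in counts}
    return grams, counts, edges


def reconstructions(start, counts, edges):
    """Enumerate walks from start that use each node exactly counts[node] times."""
    total = sum(counts.values())
    remaining = Counter(counts)
    results = []

    def walk(node, path):
        if len(path) == total:
            # Spell the text: first phrase, then the last word of each later phrase.
            results.append(" ".join(path[0] + tuple(n[-1] for n in path[1:])))
            return
        for nxt in edges[node]:
            if remaining[nxt] > 0:
                remaining[nxt] -= 1
                walk(nxt, path + [nxt])
                remaining[nxt] += 1

    remaining[start] -= 1
    walk(start, [start])
    return results


if __name__ == "__main__":
    grams, counts, edges = build_graph(TEXT.split())
    for text in reconstructions(grams[0], counts, edges):
        print(text)
```

Under these assumptions the enumeration gives two readings: the original, and one in which the "worst of times" and "age of wisdom" clauses trade places. Relaxing the multiplicity assumption allows further walks.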