Unlocking Genomic Diversity without Assembly or Alignment

1 Unlocking Genomic Diversity without Assembly or Alignment. Borevitz Laboratory, Division of Plant Sciences and Division of Ecology, Evolution and Genetics, Research School of Biology (RSB), The Australian National University, Canberra, ACT 2601, Australia. Photo: H Dempewolf

2 Intro: the personalised genome

3 ($5 Whole Genome Sequencing library prep). kwip, an assembly- and alignment-free estimator of genetic similarity. Outline: current best practice in whole-genome population genomics; motivation for why you might consider alignment-free methods; alignment-free methods (D2, kwip, Mash/AAF); examples; conclusion. Peterson BK, PLoS ONE (2012) May 31;7(5):e

4 The global Brachypodium collection (Jared Streich, ANU). 350 USDA Public Global Accessions: Dave Garvin and John Vogel; 120 Private Collection: Spanish lines, Luis Mur; 130 Spanish Private: Accessions, Pilar Catalan; 120 Private Collection: US accessions, Shuangshuang Liu via Kent Bradford; 950 Borevitz Lab: 240 EU accessions, 660 Australian accessions, 48 North America accessions; 400 Armenia, Israel, Lebanon, Greece, Private Collections, Ezrati Lab; 120 Private Collection: Turkey, Budak Lab; 50 Private Collection: Italy, Greece, Georgia, Armenia, Spain, Hazen Lab

5 A lucky find: large amounts of leftover reads after mapping. Ready

6 In a nutshell: Wouldn't it be nice... Raw reads (fastq) -> "Then a miracle occurs!" -> Pairwise distance matrix. Kevin Murray, PhD student

7 Deconstruct a sequence into words (k-mers). Kevin Murray, PhD student
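
To make the idea of deconstructing a sequence into words concrete, here is a minimal sketch. The function names `kmers` and `count_kmers` are made up for illustration and are not part of kwip or khmer; real tools typically also fold each k-mer together with its reverse complement, which is left out here.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping word of length k (k-mer) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(seq, k):
    """Return a bag of words: k-mer -> number of occurrences."""
    return Counter(kmers(seq, k))


# A read of length 8 contains 8 - 4 + 1 = 5 overlapping 4-mers.
counts = count_kmers("ACGTACGT", 4)
for word, n in sorted(counts.items()):
    print(word, n)
```

A read of length L yields L - k + 1 overlapping k-mers, so the toy read above gives five 4-mers, with ACGT seen twice.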

8 How do I efficiently compare bags of words? Easy! Pairwise vector product of word counts! Cheng Soon Ong, Christfried Webers

9 Pairwise, with the vector (inner) product of word counts as similarity measure. The inner product of two word-count vectors is a single number, and this should be a suitable similarity measure. Let's develop software that counts words (k-mers) into vectors and then does pairwise comparisons by vector products, everyone with everyone (kwip).
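
The "everyone with everyone" comparison can be sketched as follows. This is an illustration under stated assumptions, not kwip's implementation: the sample names and sequences are toy data, and `inner_product` and `pairwise_similarity` are hypothetical helper names.

```python
from collections import Counter


def count_kmers(seq, k):
    """Bag of words: k-mer -> count."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def inner_product(a, b):
    """Sum of count products over the words the two samples share."""
    return sum(n * b[word] for word, n in a.items() if word in b)


def pairwise_similarity(samples, k):
    """All-vs-all inner products; returns {(name_i, name_j): similarity}."""
    counts = {name: count_kmers(seq, k) for name, seq in samples.items()}
    return {(i, j): inner_product(counts[i], counts[j])
            for i in counts for j in counts}


# Toy data, purely illustrative.
samples = {"s1": "ACGTACGT", "s2": "ACGTTTTT", "s3": "GGGGGGGG"}
sim = pairwise_similarity(samples, k=4)
print(sim[("s1", "s2")], sim[("s1", "s3")])
```

Samples that share many abundant words get a large value; samples with no words in common get zero.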

10 Similar approaches. Jellyfish: Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6). Feature Frequency Profiles (FFPs): Sims, G. E. et al. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 106 (2009). D2 statistics (D2, D2*, D2S), most recently reviewed in: Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., & Sun, F. (2013). Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads. Journal of Computational Biology, 20(2). Too slow, too complicated, no software, not applicable to raw sequence data.

11 kwip: k-mer hashing with khmer. Sequencing read (e.g., 150 bp) -> k-mers, k=20. Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE Dec 31;9(7):e Crusoe MR, Alameldin HF, Awad S, Boucher E, Caldwell A, Cartwright R, et al. The Khmer Software Package: Enabling Efficient Nucleotide Sequence Analysis. F1000Research pmid: Based on its sequence, each k-mer is converted into a number. This number is (or is converted into) an address in a vector (bin), and the corresponding bin count is increased.
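
The hashing step can be sketched in a few lines. This is a simplified illustration with a single table of bins and does not use khmer's API; the hash function (BLAKE2b from the standard library), the bin count, and the function names are assumptions chosen for the example.

```python
import hashlib


def kmer_to_bin(kmer, n_bins):
    """Convert a k-mer to a number, then to an address (bin) in the vector."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    number = int.from_bytes(digest, "big")
    return number % n_bins


def count_into_bins(read, k, n_bins):
    """Increase the bin count once for every k-mer in the read."""
    bins = [0] * n_bins
    for i in range(len(read) - k + 1):
        bins[kmer_to_bin(read[i:i + k], n_bins)] += 1
    return bins


# Toy read and a deliberately small vector; real vectors are far larger.
bins = count_into_bins("ACGTACGTAC", k=4, n_bins=1000)
print(sum(bins))
```

Two different k-mers can land in the same bin, so a bin count can overestimate a k-mer's true count but never underestimate it. That is the price of a fixed-size vector that needs no dictionary of k-mers.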

12 Clustering technical replicates, 6 each (3,000 Rice Genomes Project). Can we detect technical replicates? How does the D2 statistic perform? The 3,000 rice genomes project. The 3,000 rice genomes project. GigaScience 3, 7 (2014).

13 Weighting by information content to raise signal above noise: entropy vector weighting. We weight each k-mer differently based on how often it is observed across the dataset, measured as the proportion of samples with non-zero counts of a given k-mer (bin). The Shannon entropy of this frequency is used as the weight of each k-mer bin.
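
A minimal sketch of the weighting follows. It assumes that "Shannon entropy of this frequency" means the two-outcome entropy of presence versus absence, H(f) = -f log2 f - (1 - f) log2 (1 - f); the function names and the toy count vectors are made up for illustration.

```python
import math


def presence_frequency(count_vectors):
    """Per bin: proportion of samples with a non-zero count."""
    n_samples = len(count_vectors)
    n_bins = len(count_vectors[0])
    return [sum(1 for v in count_vectors if v[b] > 0) / n_samples
            for b in range(n_bins)]


def shannon_entropy(f):
    """Entropy (bits) of a present/absent variable with P(present) = f."""
    if f <= 0.0 or f >= 1.0:
        return 0.0  # seen in every sample or in none: no information
    return -(f * math.log2(f) + (1.0 - f) * math.log2(1.0 - f))


def weighted_inner_product(a, b, weights):
    """Inner product of two count vectors with one weight per bin."""
    return sum(x * y * w for x, y, w in zip(a, b, weights))


# Four toy samples, three bins each.
vectors = [[3, 1, 0], [2, 0, 4], [5, 0, 0], [1, 2, 0]]
weights = [shannon_entropy(f) for f in presence_frequency(vectors)]
print(weights)
print(weighted_inner_product(vectors[0], vectors[3], weights))
```

Under this assumption a bin present in every sample (or in none) gets weight zero, and a bin present in half of the samples gets the maximum weight of one, so ubiquitous k-mers stop dominating the similarity.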

14 Clustering technical replicates, 6 each (3,000 Rice Genomes Project). Can we detect technical replicates? Now with weighting! 96 of 28,000 runs. The 3,000 rice genomes project. The 3,000 rice genomes project. GigaScience 3, 7 (2014).

15 Whole-genome genomics: population structure. 20 strains of Chlamydomonas reinhardtii: alignment-free vs. alignment-based (published). Flowers, J. M., Hazzouri, K. M., Pham, G. M., Rosas, U., Bahmani, T., Khraiwesh, B., et al. (2015). Whole-Genome Resequencing Reveals Extensive Natural Variation in the Model Green Alga Chlamydomonas reinhardtii. The Plant Cell, 27(9),

16 September 5: early online in PLoS Comput Biol

17 The missing heritability might be in your metadata! Motivation: We might have no (or the wrong/inappropriate) reference genome. Samples might have the wrong label, and there is no way of checking. Excessive or cryptic variation, mix-ups of samples or file labels. Merging data or experiments (e.g., downloads from SRA). Solution: kwip, a k-mer based de novo genetic relatedness estimator. Inner product between k-mer counts to determine relatedness. Produces a pairwise distance matrix from raw NGS reads. Implemented in a software tool: kwip. Novel probabilistic data structure (khmer, Titus Brown, UC Davis). Information-theoretic weighting scheme. Murray KD, Webers C, Ong CS, Borevitz J, Warthmann N (2017) kwip: The k-mer weighted inner product, a de novo estimator of genetic similarity. PLoS Comput Biol 13(9): e
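
The slide says the inner products end up as a pairwise distance matrix but does not show the conversion. One common way to turn inner-product similarities into distances, sketched below as an assumption rather than as kwip's documented procedure, is to normalise each similarity by the two self-similarities and take d = sqrt(2 - 2s). The function names and toy vectors are made up for illustration.

```python
import math


def kernel_matrix(vectors):
    """All-vs-all inner products of the count vectors."""
    n = len(vectors)
    return [[sum(x * y for x, y in zip(vectors[i], vectors[j]))
             for j in range(n)] for i in range(n)]


def kernel_to_distance(K):
    """Normalise similarities to [0, 1], then convert to distances."""
    n = len(K)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            norm = math.sqrt(K[i][i] * K[j][j])
            # An all-zero sample has no defined direction; treat as dissimilar.
            s = K[i][j] / norm if norm > 0 else 0.0
            D[i][j] = math.sqrt(max(0.0, 2.0 - 2.0 * s))
    return D


# Toy count vectors: the third is the first at twice the coverage.
vectors = [[1, 0], [0, 1], [2, 0]]
D = kernel_to_distance(kernel_matrix(vectors))
print(D[0][1], D[0][2])
```

The normalisation makes the result insensitive to sequencing depth in this toy case: the first and third vectors differ only in coverage and come out at distance zero.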

18 Mash, Ondov, et al. (2016). Ondov et al. Genome Biology (2016) 17:132. SOFTWARE, Open Access. Mash: fast genome and metagenome distance estimation using MinHash. Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren and Adam M. Phillippy.

Abstract: Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license.

Keywords: Comparative genomics, Genomic distance, Alignment, Sequencing, Nanopore, Metagenomics

Background: When BLAST was first published in 1990 [1], there were less than 50 million bases of nucleotide sequence in the public archives [2]; now a single sequencing instrument can produce over 1 trillion bases per run [3]. New methods are needed that can manage and help organize this scale of data. To address this, we consider the general problem of computing an approximate distance between two sequences and describe Mash, a general-purpose toolkit that utilizes the MinHash technique [4] to reduce large sequences (or sequence sets) to compressed sketch representations. Using only the sketches, which can be thousands of times smaller, the similarity of the original sequences can be rapidly estimated with bounded error. Importantly, the error of this computation depends only on the size of the sketch and is independent of the genome size. Thus, sketches comprising just a few hundred values can be used to approximate the similarity of arbitrarily large datasets. This has important applications for large-scale genomic data management and emerging long-read, single-molecule sequencing technologies. Potential applications include any problem where an approximate, global distance is acceptable, e.g. to triage and cluster sequence data, assign species labels, build large guide trees, identify mis-tracked samples, and search genomic databases.

The MinHash technique is a form of locality-sensitive hashing [5] that has been widely used for the detection of near-duplicate Web pages and images [6, 7], but has seen limited use in genomics despite initial applications over ten years ago [8]. More recently, MinHash has been applied to the relevant problems of genome assembly [9], 16S rDNA gene clustering [10, 11], and metagenomic sequence clustering [12]. Because of the extremely low memory and CPU requirements of this probabilistic approach, MinHash is well suited for data-intensive problems in genomics. To facilitate this, we have developed Mash for the flexible construction, manipulation, and comparison of MinHash sketches from genomic data. We build upon past applications of MinHash by deriving a new significance test to differentiate chance matches when searching a database, and derive a new distance metric, the Mash distance, which estimates the mutation rate between two sequences directly from their MinHash sketches. Similar alignment-free methods have a long history in bioinformatics [13, 14]. However, prior methods based on word counts have relied on short words of only a few nucleotides, which lack the power to differentiate between closely related sequences and produce distance ...

Correspondence: adam.phillippy@nih.gov, Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. 2016 The Author(s). Open Access, distributed under the terms of the Creative Commons Attribution 4.0 International License.
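
The sketching idea in the Mash abstract can be illustrated with a small bottom-s MinHash example. This is a simplified sketch, not Mash's code: it uses BLAKE2b from the Python standard library instead of Mash's hash function, ignores reverse complements, and the function names and toy sequences are assumptions. The distance formula D = -(1/k) ln(2j / (1 + j)) is the Mash distance from Ondov et al. (2016).

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Set of 64-bit hashes, one per distinct k-mer in seq."""
    return {int.from_bytes(
                hashlib.blake2b(seq[i:i + k].encode("ascii"),
                                digest_size=8).digest(), "big")
            for i in range(len(seq) - k + 1)}


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(a, b, s):
    """Estimate Jaccard similarity from the s smallest hashes of the union."""
    merged = sorted(set(a) | set(b))[:s]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when j is 0."""
    if j == 0:
        return 1.0
    return -math.log(2.0 * j / (1.0 + j)) / k


# Toy sequences; real sketches hold hundreds to thousands of hashes.
k, s = 4, 16
a = sketch("ACGTACGTTGCATGCA", k, s)
same = jaccard_estimate(a, a, s)
other = jaccard_estimate(a, sketch("GGGGGGGGGGGGGGGG", k, s), s)
print(mash_distance(same, k), mash_distance(other, k))
```

Only the sketches are compared, never the sequences themselves, which is why the cost depends on the sketch size s and not on the genome size.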

19 Implications for population genetics/genomics/museomics. Provides a quality control instrument for metadata; detects mislabels and mix-ups. Identity checks might become mandatory: cell line authentication, seed provenance. Informs population and species groupings (without a reference genome). Overcoming taxonomy anarchy* (and vandalism). Overcoming reference genome bias and helping with a population genetics paradigm shift: reference- and alignment-free population structure first, reference genome(s) second. Enables integration across whole-genome datasets: deconstruction into k-mers is sequencing platform/protocol agnostic. Establishes relationships across different levels of relatedness: through the weighting, the marker sets are dynamic and allow for zooming. Effective for hybrids and species complexes: proper accounting for admixture and heterozygosity. Vector-count-file (.ct.gz) as a thumbnail representation of the full genome data for collections/websites. * Garnett ST, Christidis L. Taxonomy anarchy hampers conservation. Nature May 31;546(7656):25-7.

20 Acknowledgements: Kevin Murray, ANU; Justin Borevitz, ANU; Cheng Soon Ong, Data61 (CSIRO); Christfried Webers, Data61 (CSIRO)

21 Whole-genome genomics: need for QC. Whole-genome genomics datasets