Korilog Bioinformatics Solutions

KLAST: high-performance sequence similarity search tool and integration with the KNIME platform
Patrick Durand, PhD, CEO

Sequence analysis: a big challenge
Context: a deluge of data
1. Modern sequencers produce huge amounts of data
2. Reference databanks contain large amounts of data
Problem: how to compare all that data quickly and efficiently?
Our answer: KLAST

The KLAST project
Context: public/private collaborative R&D project to create new high-performance tools for NGS
Partners: www.korilog.com, team.inria.fr/genscale
Duration:
+ Stage 1, 2011-2013: creation of KLAST
+ Stage 2, 2013-2015: enhancements of KLAST + metagenomics-oriented tools
Funding and support: Région Bretagne / Oséo Innovation / CRITT Santé Bretagne

KLAST algorithm
KLAST = PLAST + ORIS + data filtering engine
Algorithm PLAST: protein/protein algorithm (KLASTp, KLASTx, tklastn, tklastx)
  Van-Hoa Nguyen, Dominique Lavenier, «PLAST: Parallel Local Alignment Search Tool for database comparison», BMC Bioinformatics 2009
Algorithm ORIS: nucleotide/nucleotide algorithm (KLASTn)
  Dominique Lavenier, «Ordered Index Seed Algorithm for Intensive DNA Sequence Comparison», HiCOMB 2008
Efficient usage of hardware capabilities:
+ Multi-core architectures
+ SSE3 architecture / AVX2 ready
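To make the idea of indexed seed search more concrete, here is a minimal Python sketch. It is not the PLAST or ORIS implementation (those are described in the papers cited above); it only illustrates the general principle of indexing fixed-length words (seeds) in both sequence sets and reporting the positions where the same seed occurs on each side. The function names `build_seed_index` and `find_seed_hits`, the seed length `k=4`, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict

def build_seed_index(sequences, k):
    """Map each k-mer (seed) to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index

def find_seed_hits(queries, subjects, k=4):
    """Return (seed, query id, query offset, subject id, subject offset) hits."""
    query_index = build_seed_index(queries, k)
    subject_index = build_seed_index(subjects, k)
    hits = []
    # Visit shared seeds in sorted order so all occurrences of one seed
    # are handled together.
    for seed in sorted(query_index.keys() & subject_index.keys()):
        for q_id, q_off in query_index[seed]:
            for s_id, s_off in subject_index[seed]:
                hits.append((seed, q_id, q_off, s_id, s_off))
    return hits

# Toy data (hypothetical sequences, for illustration only)
queries = {"q1": "ACGTTGCA"}
subjects = {"s1": "TTACGTTG", "s2": "GGGGCCCC"}
hits = find_seed_hits(queries, subjects, k=4)
for hit in hits:
    print(hit)
```

In a real tool each seed hit would then be extended and scored; here the three hits against `s1` all lie on the same diagonal (subject offset minus query offset equals 2), which is the kind of signal an extension step would pick up.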

KLAST algorithm
Differences between KLAST and BLAST:
* amino acid algorithm: hit localization (subset seeds) + hit filtering (specific hardware usage)
* nucleotide algorithm: hit filtering (specific heuristic & specific hardware usage)
* data filtering engine (scores, e-value, identity, coverage, etc.)
Common parts between KLAST and BLAST:
* last algorithm step (dynamic programming) + statistical model
Questions:
* What are the differences between the tools' results? (quality)
* How fast is KLAST compared to BLAST? (speed)
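The data filtering engine mentioned above selects hits on criteria such as score, e-value, identity and coverage. The following Python sketch shows what such a filter could look like in principle. The `Hit` record, its field names, the `filter_hits` function and the default thresholds are hypothetical and chosen for illustration; they do not reflect KLAST's actual options or output format.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One query/subject alignment with the usual summary statistics."""
    query_id: str
    subject_id: str
    bit_score: float
    e_value: float
    identity_pct: float
    query_coverage_pct: float

def filter_hits(hits, max_e_value=1e-3, min_identity_pct=30.0,
                min_query_coverage_pct=50.0, min_bit_score=0.0):
    """Keep only the hits that satisfy every threshold."""
    return [
        h for h in hits
        if h.e_value <= max_e_value
        and h.identity_pct >= min_identity_pct
        and h.query_coverage_pct >= min_query_coverage_pct
        and h.bit_score >= min_bit_score
    ]

# Toy hits (made-up values, for illustration only)
hits = [
    Hit("read_1", "P1", 250.0, 1e-60, 85.0, 95.0),
    Hit("read_1", "P2", 45.0, 0.5, 32.0, 40.0),    # fails e-value and coverage
    Hit("read_2", "P3", 120.0, 1e-20, 28.0, 80.0), # fails identity
    Hit("read_3", "P4", 90.0, 1e-10, 55.0, 70.0),
]
kept = filter_hits(hits)
print([h.subject_id for h in kept])
```

Applying the thresholds as a conjunction keeps the behaviour easy to reason about: a hit is reported only if it passes every criterion.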

Benchmark: study of Tara Oceans data sets
KLAST and BLAST+ benchmark: comparison of 8,245 sequences (translated 454 reads) from Tara Oceans metagenomic data against 15 million proteins from Uniprot. Both algorithms ran on 8 Intel Xeon cores.
1. Speedup is 18x: 8,238 min (BLAST+) vs. 469 min (KLAST)
2. KLAST covers 96% of BLAST results
Benchmark data courtesy of Jean-Marc Aury, Eric Pelletier and Thomas Vannier research team (National Sequencing Centre CEA, France).
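The 18x figure follows directly from the two run times reported in the benchmark. A short Python check of the arithmetic, using only the numbers given above (the variable names are illustrative):

```python
# Run times reported in the benchmark, in minutes
blast_minutes = 8238
klast_minutes = 469

# Speedup is the ratio of the two run times
speedup = blast_minutes / klast_minutes
time_saved_hours = (blast_minutes - klast_minutes) / 60

print(f"speedup: {speedup:.1f}x (rounds to {round(speedup)}x)")
print(f"time saved: {time_saved_hours:.1f} h")
```

In other words, a job that takes more than five days with BLAST+ completes in under eight hours with KLAST on the same 8 cores.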

KLAST integration
Challenge: combining the KLAST sequence comparison engine with data integration & analysis tools

Integration with KNIME
What is the KLAST Extension for KNIME?
Provides nodes to:
+ run KLAST on NGS data sets
+ annotate results: Enzyme, GO, InterPro, NCBI Taxonomy (full and Lowest Common Ancestor)
+ filter data
+ import and export sequences and results
+ quickly prototype sequence analysis workflows
+ manage databanks: EMBL, Genbank, Uniprot, RefSeq, Silva, DNA Barcoding, standard FASTA, user-defined, etc.
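The Lowest Common Ancestor (LCA) annotation listed above can be illustrated with a small Python sketch. The idea, in general terms, is that when a read matches several proteins from different organisms, it is assigned to the deepest taxon shared by all of their lineages. This is a generic illustration, not the code of the KNIME nodes; the function name `lowest_common_ancestor` and the simplified lineages are assumptions.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each ordered root to leaf)."""
    if not lineages:
        return None
    lca = None
    # zip() walks the lineages rank by rank and stops at the shortest one
    for ranks in zip(*lineages):
        if all(taxon == ranks[0] for taxon in ranks):
            lca = ranks[0]
        else:
            break
    return lca

# Simplified, illustrative lineages for the organisms hit by one read
lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrionales", "Vibrio"],
]
lca = lowest_common_ancestor(lineages)
print(lca)
```

With the first two lineages only, the LCA would be the more specific `Enterobacterales`; adding the third hit pushes the assignment up to the class level.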

KLAST: graphical mode
Study of Tara Oceans data sets
Dataset: klastp comparison of 8,245 proteins vs. Uniprot (15 million sequences)
CPU time: 469 minutes for the klastp workflow vs. 8,238 minutes for blastp on a Genoscope cluster node (8 cores). Speedup: 18x
Datasets and tests provided by Jean-Marc Aury, Eric Pelletier and Thomas Vannier (French National Sequencing Center / Genoscope / CEA)

KLAST: graphical mode
Study of functional & taxonomy diversity
Display the taxonomy diversity of a result dataset as a pie chart
Dataset: klastx comparison of 97,000 sequences (454 reads) vs. SwissProt_bacteria (350,000 sequences)
Computation: 2 h on an Apple MacBook Air (4 cores, 4 GB RAM).
Metagenomics dataset provided by Philippe Vandenkoornhuyse and Alexis Dufresne, CAREN CNRS UMR 6553 EcoBio, Rennes
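The data behind such a pie chart is simply the share of reads assigned to each taxon. A minimal Python sketch of that aggregation step follows; the function name `taxon_shares` and the toy assignments are hypothetical, and the actual workflow does this with KNIME nodes rather than with this code.

```python
from collections import Counter

def taxon_shares(assignments):
    """Return {taxon: percentage of reads}, most frequent taxon first."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.most_common()}

# Toy taxonomic assignments, one per read (made-up values)
assignments = [
    "Proteobacteria", "Proteobacteria", "Firmicutes", "Proteobacteria",
    "Actinobacteria", "Firmicutes", "Proteobacteria", "Proteobacteria",
]
shares = taxon_shares(assignments)
for taxon, pct in shares.items():
    print(f"{taxon}: {pct:.1f}%")
```

Each percentage corresponds to one slice of the chart.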

Bioinformatics Solutions
More information: contact Patrick Durand
Email: pdurand@korilog.com
Phone: +33 (0) 960 368 038
www.klast-search.com