Korilog. high-performance sequence similarity search tool & integration with KNIME platform. Patrick Durand, PhD, CEO. BIOINFORMATICS Solutions

KLAST high-performance sequence similarity search tool & integration with KNIME platform Patrick Durand, PhD, CEO

Sequence analysis: a big challenge
[Figure labels: DNA sequence... / deluge of data]
Context
  1. Modern sequencers produce huge amounts of data
  2. Reference databanks contain large amounts of data
Problem
  How to compare all that data quickly and efficiently?
Our answer
  KLAST

The KLAST project
Context
  Public/private collaborative R&D project to create new high-performance tools for NGS
Partners
  www.korilog.com
  team.inria.fr/genscale
Duration
  Stage 1: 2011-2013: creation of KLAST
  Stage 2: 2013-2015: enhancements of KLAST + metagenomics-oriented tools
Funding and Support
  Région Bretagne / Oséo Innovation / CRITT Santé Bretagne

KLAST algorithm
KLAST = PLAST + ORIS + data filtering engine
Algorithm PLAST
  Protein / Protein algorithm (klastp, klastx, tklastn, tklastx)
  Van-Hoa Nguyen, Dominique Lavenier, «PLAST: Parallel Local Alignment Search Tool for database comparison», BMC Bioinformatics 2009
Algorithm ORIS
  Nucleotide / Nucleotide algorithm (klastn)
  Dominique Lavenier, «Ordered Index Seed Algorithm for Intensive DNA Sequence Comparison», HiCOMB 2008
Efficient usage of hardware capabilities
  Multi-core architectures
  SSE3 architecture / AVX2 ready
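The slide names two seed-based algorithms but does not describe their internals. As a rough illustration of the general idea of indexing two sequence banks by seed and pairing up shared seeds, here is a minimal Python sketch. It is not the PLAST or ORIS implementation; the function names `index_seeds` and `seed_hits`, the seed length `k=4`, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict


def index_seeds(sequences, k):
    """Map every k-mer (seed) to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def seed_hits(queries, subjects, k=4):
    """Yield (query id, query offset, subject id, subject offset) for each shared seed.

    Hypothetical helper: both banks are indexed, then seeds present in both
    indexes are visited in sorted order.
    """
    q_index = index_seeds(queries, k)
    s_index = index_seeds(subjects, k)
    for seed in sorted(q_index.keys() & s_index.keys()):
        for q_id, q_pos in q_index[seed]:
            for s_id, s_pos in s_index[seed]:
                yield q_id, q_pos, s_id, s_pos


# Toy banks (made-up sequences, for illustration only).
hits = list(seed_hits({"q1": "ACGTACGA"}, {"s1": "TTACGTAA"}, k=4))
print(hits)
```

In a real search tool each seed hit would then be extended and scored; this sketch stops at hit localization.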

KLAST algorithm
Differences between KLAST and BLAST
  amino acid algorithm
    * hit localization (subset seeds) + hit filtering (specific hardware usage)
  nucleotide algorithm
    * hit filtering (specific heuristic & specific hardware usage)
  data filtering engine (scores, e-value, identity, coverage, etc.)
Common parts between KLAST and BLAST
  last algorithm step (dynamic programming) + statistical model
Questions
  What are the differences between the tools' results? (quality)
  How fast is KLAST compared to BLAST? (speed)
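The data filtering engine is described only by the criteria it works on (scores, e-value, identity, coverage). The following Python sketch shows what filtering on those criteria could look like. It is an assumption-laden illustration, not KLAST's engine: the `Hit` record, the `filter_hits` function, its default thresholds, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment result (hypothetical record layout)."""
    query: str
    subject: str
    bit_score: float
    evalue: float
    identity: float  # percent of identical positions in the alignment
    coverage: float  # percent of the query covered by the alignment


def filter_hits(hits, min_score=0.0, max_evalue=1e-3,
                min_identity=0.0, min_coverage=0.0):
    """Keep only the hits that satisfy every threshold."""
    return [
        h for h in hits
        if h.bit_score >= min_score
        and h.evalue <= max_evalue
        and h.identity >= min_identity
        and h.coverage >= min_coverage
    ]


# Made-up hits, for illustration only.
sample = [
    Hit("read1", "protA", 250.0, 1e-60, 92.0, 88.0),
    Hit("read1", "protB", 45.0, 0.5, 31.0, 40.0),
    Hit("read2", "protC", 120.0, 1e-20, 55.0, 30.0),
]
kept = filter_hits(sample, max_evalue=1e-5, min_identity=50.0, min_coverage=50.0)
print([h.subject for h in kept])
```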

Benchmark
Study of Tara Oceans data sets
KLAST and BLAST+ benchmark: comparison of 8,245 sequences (translated 454 reads) from Tara Oceans metagenomic data against 15 million proteins from Uniprot. Both algorithms ran on 8 Intel Xeon cores.
  1. Speedup is 18x (BLAST+: 8,238 min vs. KLAST: 469 min)
  2. KLAST covers 96% of BLAST results
Benchmark data courtesy of Jean-Marc Aury, Eric Pelletier and Thomas Vannier research team (National Sequencing Centre CEA, France).
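The speedup figure follows from the two run times on the slide: 8,238 / 469 is roughly 17.6, which the slide rounds to 18x. The slide does not say how the 96% coverage was measured; the sketch below assumes it is the fraction of BLAST (query, subject) pairs that also appear in the KLAST results. The function names and the toy hit sets are hypothetical.

```python
def speedup(baseline_minutes, candidate_minutes):
    """Ratio of baseline run time to candidate run time."""
    return baseline_minutes / candidate_minutes


def result_coverage(reference_hits, candidate_hits):
    """Fraction of reference (query, subject) pairs also found by the candidate.

    Assumed definition: the slides do not state how coverage was computed.
    """
    reference = set(reference_hits)
    if not reference:
        return 0.0
    return len(reference & set(candidate_hits)) / len(reference)


# Run times taken from the benchmark slide.
ratio = speedup(8238, 469)
print(round(ratio, 1))

# Toy hit sets (made up) to show the coverage calculation.
blast_pairs = [("r1", "pA"), ("r1", "pB"), ("r2", "pC"), ("r3", "pD")]
klast_pairs = [("r1", "pA"), ("r2", "pC"), ("r3", "pD"), ("r3", "pE")]
print(result_coverage(blast_pairs, klast_pairs))
```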

KLAST integration
Challenge
  Combining the KLAST sequence comparison engine with data integration & analysis tools

Integration with KNIME
What is the KLAST Extension for KNIME?
Provides nodes to
  + run KLAST on NGS data sets
  + annotate results: Enzyme, GO, InterPro, NCBI Taxonomy (full and Lowest Common Ancestor)
  + filter data
  + import and export sequences and results
  + quickly prototype sequence analysis workflows
  + manage databanks: EMBL, Genbank, Uniprot, RefSeq, Silva, DNA Barcoding, standard FASTA, user-defined, etc.
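The annotation nodes mention a Lowest Common Ancestor (LCA) taxonomy mode. As a hedged illustration of the general LCA idea, the Python sketch below takes the lineages of all hits for one read and returns the deepest taxon they share. It is not the KLAST extension's code; the function name and the root-first list representation of a lineage are assumptions.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if there is none.

    Each lineage is assumed to be a list of taxon names ordered from the
    root towards the leaf.
    """
    if not lineages:
        return None
    common = []
    for ranks in zip(*lineages):
        if all(name == ranks[0] for name in ranks):
            common.append(ranks[0])
        else:
            break
    return common[-1] if common else None


# Lineages of two hits for the same read (shortened for illustration).
hit_lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hit_lineages))
```

Assigning a read to the LCA of its hits is a conservative choice: the more the hits disagree, the higher in the taxonomy the read is placed.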

KLAST: graphical mode
Study of Tara Oceans data sets
Dataset: klastp comparison of 8,245 proteins vs. Uniprot (15 million sequences)
CPU time: 469 minutes for the klastp workflow vs. 8,238 minutes for blastp on a Genoscope cluster node (8 cores). Speedup: 18x
Datasets and tests provided by Jean-Marc Aury, Eric Pelletier and Thomas Vannier (French National Sequencing Center / Genoscope / CEA)

KLAST: graphical mode
Study of functional & taxonomy diversity
Display the taxonomy diversity of a result dataset as a pie chart
Dataset: klastx comparison of 97,000 sequences (454 reads) vs. SwissProt_bacteria (350,000 sequences)
Computation: 2 h on an Apple MacBook Air (4 cores, 4 GB RAM).
Metagenomics dataset provided by Philippe Vandenkoornhuyse and Alexis Dufresne, CAREN CNRS UMR 6553 EcoBio, Rennes

BIOINFORMATICS Solutions
More information: contact Patrick Durand
Email: pdurand@korilog.com
Phone: +33 (0) 960 368 038
www.klast-search.com