Bioinformatics for High Throughput Sequencing

Eric Rivals, LIRMM & IBC, Montpellier
http://www.lirmm.fr/~rivals

High Throughput Sequencing or Next Generation Sequencing

Overview of techniques

Name              Read length   Time   Gb/run   Pros / cons
454 GS Flex       700           23 h   0.7      long
Illumina HiSeq    2*100         48 h   120      short/cost
SOLID (LifeSc)    85            8 d    150      long time
Ion Proton        200           2 h    100      new
PacBio Sciences   3-15000       0.3    3        high error rate

HTS output: an example
One human RNA library: 75 million reads of 100 bp each.
Analysis reveals that it represents > 140,000 splice events on 16,000 expressed genes.
Bottleneck: bioinformatics read analysis.

What can life scientists do with NGS assays?

Domains of application
- bio-molecular research
- biotechnology (e.g. bio-fuels)
- biodiversity monitoring
- personalised medicine
- epidemiology surveillance
- pharmacogenomics
- personal genomics
- forensics
- agronomy (animal and plant research)

Biological questions
- sample genomic variations in a population of individuals
- detect genotype differences related to a disease
- measure variations in gene expression & identify RNA variants
- study replication, transcription or translation processes
- interrogate protein binding sites on the whole genome or RNAs
- assess epigenetic modifications on the genome (3D structure)
- estimate the fitness contribution of each gene in bacteria
- identify genes involved in pathogenicity or adaptation
- study gene interactions and their role in regulatory or metabolic pathways
- survey the species or assess the biodiversity of an environment
- list the bio-molecular functions or processes active in an environmental sample

Remarks on sequencing-based assays
These kinds of questions and assays existed before NGS, but NGS made them cheaper, high-throughput, and genome-wide.
Genome-wide is the major qualitative change: no predefined target, no prior knowledge required, potentially all sites are scrutinized.

Two situations in genomics
1. A reference genome is available: map reads on the genome.
2. Without a reference genome: assemble the reads to get the genome, or perform a comparative analysis of several read sets.

A pattern matching primer

Outline
1. The problem
2. Text indexing approach
3. Filtration approach

Pattern Matching
Given:
1. a text T of length n
2. a pattern M of length m
3. generally m << n
Example: M := tgtg, T := ctgtgtgtacatgtg (positions 1 to 15)
Solution: {2, 4, 12}
For one read: slide a window of length m along T and compare the pattern with the text at each window position.
How to do it for millions of reads?

Naive and involved algorithms
Naive algorithm:
- for each window, m pairwise symbol comparisons
- about n windows
- total time proportional to n × m (complexity)
Linear time solutions:
- idea: exploit the results on one window to ease the work on overlapping windows
- Boyer-Moore or Knuth-Morris-Pratt algorithms, in the 1970s
- total time proportional to n + m
Limitations: single query and exact match
Answers: text indexing and filtration approaches
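The two families of algorithms can be compared on the tgtg example. The following is a minimal Python sketch, not code from the talk: the function names are illustrative, and the text is assumed to be the 15-symbol string ctgtgtgtacatgtg. The naive version compares the pattern against every window; the Knuth-Morris-Pratt version reuses what was learned on the previous window through a failure table.

```python
def naive_search(text, pattern):
    """Return the 1-based start of every occurrence; about n windows, m comparisons each."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: same answer in time proportional to n + m."""
    m = len(pattern)
    if m == 0:
        return []
    # fail[j] = length of the longest proper border of pattern[:j + 1]
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    hits, k = [], 0
    for i, symbol in enumerate(text):
        while k > 0 and symbol != pattern[k]:
            k = fail[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == m:
            hits.append(i - m + 2)  # convert the 0-based end index to a 1-based start
            k = fail[k - 1]         # keep going so overlapping occurrences are found
    return hits


print(naive_search("ctgtgtgtacatgtg", "tgtg"))
print(kmp_search("ctgtgtgtacatgtg", "tgtg"))
```

Both functions still answer a single exact query per scan of the text, which is the limitation the indexing and filtration approaches address.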

Multiple pattern matching with a text index
Matching in two steps:
1. Preprocess the text T in time O(n): build and store a data structure, an index, that enables exact search queries.
2. Search for each pattern in the index in O(m) time (optimal).

Text indexing data structures
For a text of length n, a good index:
1. occupies O(n) memory
2. is built in O(n) time
3. enables exact motif search in O(m) time for a motif of length m
Three historical structures:
1. compact suffix tree [Weiner 73, McCreight 76, Ukkonen 92]
2. suffix array: construction in O(n) [Kärkkäinen & Sanders 03]
3. DAWG (Directed Acyclic Word Graph) [Blumer et al. 85]
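To make the "index once, query many times" idea concrete, here is a small suffix array sketch in Python. It is an illustration under simplifying assumptions, not one of the linear-time constructions cited above: the array is built by plain sorting, and each query uses binary search, so it costs O(m log n) rather than O(m). The function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (simple sort-based build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the 1-based positions of pattern, using two binary searches on the index."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(p + 1 for p in suffix_array[first:lo])


text = "ctgtgtgtacatgtg"
index = build_suffix_array(text)           # built once
for read in ("tgtg", "aca", "gg"):         # queried many times
    print(read, find_occurrences(text, index, read))
```

The index is built once for the text and then reused for every pattern, which is what makes the approach attractive when millions of reads are searched against one genome.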

Breakthrough in text indexing
With historical index structures:
1. you need both the text and the index
2. both must be in main memory to keep searches fast
Around 2000, the advent of compressible self-indexing structures:
1. a self-index replaces both the text and the classical index
2. its size can be tuned to the available memory
Examples:
1. Burrows-Wheeler Transform or FM-index [Ferragina & Manzini 00]
2. Compressed k-mer indexes [Philippe et al. 11]
3. Minimum information de Bruijn Graphs for assembly [Li 09, Chikhi & Rizk 13]
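The search principle behind the FM-index can be sketched in a few lines of Python. This is a didactic version only: the occurrence tables are stored uncompressed, the suffix array is kept in full to report positions, and the text is assumed not to contain the sentinel symbol "$". A real FM-index compresses or samples these structures; the function names here are illustrative.

```python
def build_fm_index(text):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    text += "$"  # sentinel, assumed absent from the text and smaller than any symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # symbol preceding each sorted suffix
    counts = {}
    for symbol in text:
        counts[symbol] = counts.get(symbol, 0) + 1
    # C[c] = number of symbols in the text that are strictly smaller than c
    C, total = {}, 0
    for symbol in sorted(counts):
        C[symbol] = total
        total += counts[symbol]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {symbol: [0] * (len(bwt) + 1) for symbol in counts}
    for i, b in enumerate(bwt):
        for symbol in occ:
            occ[symbol][i + 1] = occ[symbol][i] + (b == symbol)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Match the pattern right to left; return the 1-based positions of its occurrences."""
    lo, hi = 0, len(sa)  # current interval of suffixes, half-open
    for symbol in reversed(pattern):
        if symbol not in C:
            return []
        lo = C[symbol] + occ[symbol][lo]
        hi = C[symbol] + occ[symbol][hi]
        if lo >= hi:
            return []
    return sorted(p + 1 for p in sa[lo:hi])


sa, C, occ = build_fm_index("ctgtgtgtacatgtg")
print(backward_search("tgtg", sa, C, occ))
```

Each pattern symbol costs two table lookups, so a query takes time proportional to m regardless of the text length.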

Personalized Medicine

Personalised Medicine
Wikipedia: emphasizes the systematic use of information about an individual patient to select or optimize that patient's preventative and therapeutic care.
US Congress definition: the application of genomic and molecular data to better target the delivery of health care, facilitate the discovery and clinical testing of new products, and help determine a person's predisposition to a particular disease or condition.

Abnormal chromosome pool in cancer
[Figures: normal human karyotype; blood cancer (leukemia) karyotype]

Abnormal chromosome pool in cancer (uses of the karyotype)
- diagnosis of chronic myelogenous leukemia (CML)
- prognosis in myelodysplastic syndrome
[Figure: blood cancer (leukemia) karyotype]

Leukemia with gene fusion
[Figure: translocation]

Translocated gene to fusion RNA

Personalised Medicine for Chronic Myelogenous Leukemia
Test in the bone marrow: presence of BCR-ABL t(9;22) fusion RNA?
1. diagnosis
2. monitoring disease recurrence
3. treatment follow-up
What if the test goes wrong because
- of human genetic variability?
- another form of BCR-ABL fusion is produced?
- other, still unknown, aberrant RNAs are involved in this cancer?

What we need...
Monitoring all active genes, i.e. RNAs, in a cell: very fast, at low cost, and with limited cell material.
High-throughput transcriptomics (RNA-seq): determining which genomic regions are transcribed and activated in a cell, and at which activation/expression level.

Detection needs sensitivity and specificity

Locating reads on a reference sequence: Mapping

A definition of mapping
Locating or mapping reads: for each read, find its location of origin on the reference genome.
How?
- use the sequence similarity between the read and the reference
- approximate pattern matching or alignment
Differences in sequence come from:
1. sequencing errors
2. intra- and inter-individual genetic variability
3. splicing of RNA compared to the DNA sequence

Mapping for genomics, transcriptomics, or epigenomics
For each read, find all genomic positions at which the read matches the genome, either exactly or approximately (+/- strands).
Results: is a read located? Once or more than once?
- unmapped: not found
- uniquely mapped: mapped at a single genomic location
- multiply mapped: mapped at several genomic locations
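The three outcomes can be illustrated with a toy exact-match mapper. The sketch below is an assumption-laden simplification: it scans a short in-memory "genome" string with Python's built-in `str.find` instead of using an index, it handles exact matches only, and the function names and the example genome are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def exact_locations(genome, read):
    """List (1-based position, strand) for every exact occurrence on either strand."""
    queries = [("+", read)]
    rc = reverse_complement(read)
    if rc != read:  # a palindromic read would otherwise be reported twice
        queries.append(("-", rc))
    hits = []
    for strand, query in queries:
        pos = genome.find(query)
        while pos != -1:
            hits.append((pos + 1, strand))
            pos = genome.find(query, pos + 1)
    return hits


def mapping_status(genome, read):
    """Classify a read as unmapped, uniquely mapped, or multiply mapped."""
    n = len(exact_locations(genome, read))
    if n == 0:
        return "unmapped"
    return "uniquely mapped" if n == 1 else "multiply mapped"


genome = "GATTACAGATTACACCTG"  # toy reference, for illustration only
for read in ("GATTACA", "CAGG", "TTTT"):
    print(read, mapping_status(genome, read), exact_locations(genome, read))
```

Real mappers replace the linear scan with an indexed search and tolerate mismatches and gaps, but the classification of the result is the same.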

Bottleneck of mapping
Data volume, typically:
- 3 Gbp of human genome sequence
- 50 million reads, each 100 bases long
Main issue: scalability
- in terms of memory and time
- data flow, especially in sequencing centers
How?
- indexing the genome sequence to answer pattern matching queries
- filtration algorithms for fast alignment

Mapping programs
http://wwwdev.ebi.ac.uk/fg/hts_mappers

Mapping comparison
Data: Human K562 cancer cell line, RNA-Seq library, 12 million reads, 75 bp long
[Figure: percentage of mapped reads, unique vs. multiple, for Bowtie, BWA, SOAP2 and Exact]

High Throughput Sequencing & transcriptomics

HTS Transcriptomics: RNA-Seq
RNA-Seq:
- monitoring gene activation in cells
- cataloguing and discovery of RNAs
Bottlenecks:
- bioinformatics processing
- Big Data and scalability issues

Mapping of RNA-seq reads
Goal: find the alignments of the read with regions of the genome.
[Figure: a spliced read aligned across two exons]

Typical analysis information flow
Multi-step analysis pipeline:
1. Mapping (only genomic locations)
2. Coverage (distinguishing errors from biological events)
3. Prediction of candidates (mutations, splicing, etc.)
Limitations:
- late distinction between sequencing errors and mutations
- no control on false negatives & positives in mapping
- almost no backtracking, hence error propagation

Remedy / Solution?
Integrate all information at once in a single program.

CRAC

CRAC: a tool for analyzing genomic or transcriptomic reads
Case where a reference genome is available.
Inputs:
- the indexed genome sequence
- the set of reads (no FASTQ quality)
- an integer parameter k: the length of k-mers
Questions: for each read, detect
- genome localization: single, a few, many
- error position and error
- mutation position and mutation (substitution & indels)
- exon-exon junctions
- rearrangements or chimeric RNAs
- repeat borders

CRAC principle I: k-mer profiling
Read (positions 1 to 75):
CTAGTTTTATACTTTAGGGGTAAGCAGTGGAAAGTTAGAGTTCGGAGCTGTTTATTGAGGGCAGGGGAAGAATGT
k-mer profile along the read: 16 located k-mers, then 22 k-mers not located, then 16 located k-mers.
Error or mutation?
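A location profile of this kind can be reproduced on a toy case. The sketch below is an illustration only and not CRAC's implementation: a Python set of genome k-mers stands in for the genome index, and `location_profile` and `find_breaks` are hypothetical helper names. As a side check on the slide's numbers, a read of length 75 has 75 - k + 1 k-mers, so 16 + 22 + 16 = 54 k-mers would correspond to k = 22, and the run of 22 non-located k-mers then has length exactly k.

```python
def location_profile(read, genome, k):
    """For each k-mer of the read, tell whether it occurs somewhere in the genome."""
    genome_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    return [read[i:i + k] in genome_kmers for i in range(len(read) - k + 1)]


def find_breaks(profile):
    """Return (0-based start, length) for each maximal run of non-located k-mers."""
    runs, start = [], None
    for i, located in enumerate(profile):
        if not located and start is None:
            start = i
        elif located and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(profile) - start))
    return runs


genome = "ACGTTGCA"  # toy reference
read = "ACGATGCA"    # same sequence with one substitution (T -> A at the 4th base)
profile = location_profile(read, genome, k=3)
print(profile)
print(find_breaks(profile))
```

In the toy case every k-mer overlapping the substituted base fails to locate, which gives a single break of length k = 3; the profile alone cannot tell whether the difference is a sequencing error or a true variation.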

Principle I: genomic location
With the genomic locations of the k-mers, you get:
- the read location
- the differences with the genome
However, mapping alone makes no distinction between genetic variations and errors.

Principle II: error or mutation? An integrated approach
A genetic variation affects all reads covering its position, while a sequencing error occurs in one read and affects only that read.
[Figure: polymorphism, where all reads incorporate the mutation, vs. error, where a single read is affected]

Support: a proxy for local coverage
Definition: the support of a k-mer is the number of reads containing that k-mer (at least once).
Support: an approximation of the local coverage by the read set.
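The definition translates directly into a counting procedure. The sketch below uses a plain `collections.Counter` as a stand-in for the read index; it is meant to illustrate the definition, not the data structure CRAC uses, and the function names are illustrative.

```python
from collections import Counter


def kmer_support(reads, k):
    """Support of each k-mer: the number of reads that contain it at least once."""
    support = Counter()
    for read in reads:
        # The set ensures a k-mer repeated inside one read is counted once for that read.
        support.update({read[i:i + k] for i in range(len(read) - k + 1)})
    return support


def support_profile(read, support, k):
    """Support of every k-mer of a read, in read order."""
    return [support[read[i:i + k]] for i in range(len(read) - k + 1)]


reads = ["ACGTA", "ACGTA", "ACGAA", "AAAA"]
support = kmer_support(reads, k=3)
print(support_profile("ACGAA", support, k=3))
```

In this toy read set, the k-mers that only the third read contains have support 1, while the k-mer it shares with the other reads has a higher support.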

CRAC: idea
For each read, CRAC jointly analyzes two signals for each k-mer:
- its location on the genome, i.e. its matching locations and their number
- its support: the number of reads sharing this k-mer
How? On the fly, using two indexes:
- a compressed Burrows-Wheeler Transform of the genome
- a generalized k-factor table built on all reads [Philippe et al., 2011]

Sequence error vs mutation profile (m = 50, k = 20)
[Figure: two profiles, one per read. Blue dots: support (left scale); red crosses: number of genome locations (right scale).]
Read 1: CGGCTGTGTATTACTGTGCGAGAGTCGGGGGAGATTACTATGATAGTAGT
Read 2: CTGGACCCCCTGGACATGCCCTGCACAACCATCCCCTCCGCGCCCCAGGC

Profile Analysis: Rules for Single Cause
Length of the location break:
- substitution: k
- deletion or splice junction: k - 1
- insertion: k + p, with p the length of the insertion
Issues:
- suppress isolated random locations
- compare the left & right support levels of a break with its inner support level
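The break-length rules and the support comparison can be written down as two small functions. This is a deliberately simplified sketch for a single cause per read: the function names are hypothetical, and the `min_ratio` threshold of 0.5 is an arbitrary assumption chosen for illustration, not a value taken from CRAC.

```python
def explain_break(length, k):
    """Map the length of a location break to its candidate cause (single-cause rules)."""
    if length == k:
        return "substitution"
    if length == k - 1:
        return "deletion or splice junction"
    if length > k:
        return f"insertion of {length - k} nt"
    return "undetermined"


def support_falls(support, start, length, min_ratio=0.5):
    """Compare the mean support inside a break with the mean support on its flanks.

    Returns True when the inner support drops below min_ratio times the flank support
    (suggesting a sequencing error), False otherwise, and None if either side is empty.
    min_ratio is a hypothetical threshold.
    """
    inner = support[start:start + length]
    outer = support[:start] + support[start + length:]
    if not inner or not outer:
        return None
    return sum(inner) / len(inner) < min_ratio * (sum(outer) / len(outer))


print(explain_break(22, k=22))
print(explain_break(21, k=22))
print(support_falls([30, 30, 1, 1, 1, 30, 30], start=2, length=3))
print(support_falls([30, 30, 29, 31, 30, 30, 30], start=2, length=3))
```

A break whose inner support collapses while its flanks stay high points to an error carried by a single read; a break with stable support points to a biological cause shared by all covering reads.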

Support variation: SNV vs. error
Analysis of the support profile over the k-mers of a read that has a break in its location profile.
[Figure: for an SNV the support is stable (30, 30): 30 reads share the k-mer starting here. For an error the support is variable (30, then 1): there is only one read with this erroneous k-mer.]

Random locations
[Figure: read vs. genome, showing the expected break and "mirage" breaks caused by false locations]

Classification process of a read
CRAC analyzes each read according to its location profile (P-loc, FM-index mapping):
- no break: no mutation; the read is unique, duplicated, multiple, or has no location
- location break(s): inspect the support (Gk arrays)
  - support falls: sequencing error
  - no fall: SNV, insertion, deletion, splice, or chimera
  - ambiguous: undetermined or bio-undetermined

CRAC results

Results mapping: simulated data
Human, 42M reads, length 75 bp
[Figure: percent of single mapped reads for Bowtie, BWA, CRAC, GASSST, GSNAP, SOAP2]

Results mapping: simulated data
Human, 48M reads, length 200 bp
[Figure: percent of single mapped reads for Bowtie, BWA-SW, CRAC, GASSST, GSNAP, SOAP2]

Splice junction prediction: simulated data

             75 bp                     200 bp
Tool         Sensitivity   Precision   Sensitivity   Precision
CRAC         79.43         99.5        86.02         99.18
GSNAP        84.17         97.03       72.94         97.09
MapSplice    79.89         97.68       84.72         98.82
TopHat       84.96         89.59       54.07         94.69
TopHat2      82.25         92.71       88.           91.35
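The slide does not spell out how the two measures are computed. Assuming the usual definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)), they can be evaluated from a set of predicted junctions and a set of true junctions as sketched below. The junction coordinates in the example are invented purely for illustration.

```python
def sensitivity_and_precision(predicted, truth):
    """Return (sensitivity, precision) in percent for two sets of junctions."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    sensitivity = 100.0 * true_positives / len(truth) if truth else 0.0
    precision = 100.0 * true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Hypothetical junctions written as (chromosome, donor position, acceptor position).
truth = {("chr1", 100, 250), ("chr1", 400, 900), ("chr2", 50, 75), ("chr2", 300, 520)}
predicted = {("chr1", 100, 250), ("chr1", 400, 900), ("chr2", 50, 75), ("chr3", 10, 20)}
print(sensitivity_and_precision(predicted, truth))
```

A tool that predicts fewer junctions can gain precision while losing sensitivity, which is why the table reports both columns for each read length.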

Memory & time
Data: 42M reads, 75 bp; same number of processors
Computing time in minutes, hours or days (m, h, d) and memory in GB

Program      Time   Memory (GB)
Bowtie       7h     3
BWA          6h     2
GASSST       5h     43
SOAP2        40m    5
CRAC         9h     38
GSNAP        2d     5
MapSplice    4h     3
TopHat       12h    2

Splice junction detection on real data (Human)
Agreement between tools on known RefSeq splice junctions

Reads spanning several exons and junctions
[Figure: a read overlapping exons 2 to 5 of the TIMM50 gene (Human)]
CRAC can detect several successive splice junctions in a single read.

Candidate fusion RNAs in four breast cancer libraries
[Edgren et al. 2011]: 4 cancer cell lines, RNA-seq, 50 million reads of 50 nt
CRAC and TopHat-fusion find 20 and 21, respectively, of the 28 validated fusion RNAs.

Number of reported fusion RNA candidates
Cancer library   CRAC   TopHat-fusion
BT-474           153    81,327
KPL-4            60     23,075
MCF-7            90     27,267
SK-BR-3          152    61,494

CRAC reports 36 fusion candidates that recur in at least 2 libraries, 35 of the 36 with the same junction point.
No recurrent fusion RNAs were found in the original study.

Conclusion

Take home
- Integration of location and support information
- Multiple event predictions
- k-mer profiling gives better mapping, especially for spliced reads
- CRAC sensitivity improves with read length
- Detailed information on each read

Conclusions
- NGS assays pervade many domains of biology and are exploited for numerous and diverse studies
- Bioinformatics analysis is the current bottleneck
- The scalability challenge has been met so far, thanks to text indexing algorithms
- Data integration for prioritizing candidates

CRAC publication & views
Software available on the ATGC platform: http://www.atgc-montpellier.fr/crac

Funding and acknowledgments
MAB team, and in particular B. Cazaux, M. Hébrard, V. Maillol, V. Lefort
MASTODONS SePhHaDe project
Thanks for your attention. Questions?

A few references
CRAC:
- N. Philippe, M. Salson, T. Commes, E. Rivals. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biology 14:R30, 2013.
Filtration and indexing for similarity searches:
- S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, M. Vingron. q-gram Based Database Searching Using a Suffix Array (QUASAR). Proc. of the 3rd International Conference on Computational Molecular Biology (RECOMB99), ACM Press.
Index data structures:
- D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
- V. Mäkinen, G. Navarro. Compressed Text Indexing. Encyclopedia of Algorithms, Springer-Verlag, 2008.
- N. Välimäki, E. Rivals. Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data. ISBRA, LNBI 7875, 2013.

Supplements

Tools for RNA-seq
To detect splice junctions:
- TopHat (v1 & 2) [Trapnell et al., 2009]
- MapSplice [Wang et al., 2010]
- GSNAP [Wu & Nacu, 2010]
- CRAC [Philippe et al. 2013]
To detect fusion RNAs:
- MapSplice [Wang et al., 2010]: single reads
- TopHat fusion [McPherson et al., 2011]: single reads
- FusionSeq [Sboner et al., 2010]: paired reads
- FusionHunter [Li et al., 2011]: paired reads
- CRAC [Philippe et al. 2013]: single & paired

Simulated data: CRAC predictions by category
[Figure, panels (B) and (C): percent of cause found for SNV, insertions, deletions, splices, chimeras, and errors]