Genomic DNA ASSEMBLY BY REMAPPING. Course overview


ASSEMBLY BY REMAPPING
Laurent Falquet, The Bioinformatics Unravelling Group, UNIFR & SIB
MA/MER @ UniFr, Group Leader @ SIB

Course overview
[Diagram: Genomic DNA; PacBio; Illumina; methylation; de novo; remapping; Annotation; Indels calling; SNP calling; Virulence/Resistance genes; VCF annotation; Comparative genomics (roary); Comparative genomics (SNP diff)]

What is remapping?
Originally, "mapping" is the process of finding the location of genes on each chromosome. In the NGS context, "remapping" means identifying (by alignment) all possible locations of a read on a reference sequence (genome).

  AGCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   Reference sequence
    CTGATGTGCCGCCTCACTTCGGTGGT         Short read 1
     TGATGTGCCGCCTCACTACGGTGGTG        Short read 2
      GATGTGCCGCCTCACTTCGGTGGTGA       Short read 3
   GCTGATGTGCCGCCTCACTACGGTG           Short read 4
   GCTGATGTGCCGCCTCACTACGGTG           Short read 5

Next Generation Sequencing and remapping: an easy task?
Remapping reads onto an existing genome:
- Current tools are fast because they use the Burrows-Wheeler Transform
- Success depends on the degree of similarity between the reference and the target
- Detectable variations: SNPs and small insertions or deletions
- Variations difficult to identify: large insertions/deletions, inversions and translocations
[Figure: reference vs. target]
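As a rough illustration of what "identifying all possible locations of a read" means, the sketch below slides each read along the reference and reports every ungapped placement with at most a given number of mismatches. It is a naive scan for teaching purposes only, not how real mappers work; the function names and the mismatch limit are assumptions made for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference, read, max_mismatches=2):
    """Return (position, mismatches) for every ungapped placement within the limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        distance = hamming(reference[pos:pos + len(read)], read)
        if distance <= max_mismatches:
            hits.append((pos, distance))
    return hits


# Reference and reads taken from the slide.
REFERENCE = "AGCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"
READS = {
    "read 1": "CTGATGTGCCGCCTCACTTCGGTGGT",
    "read 2": "TGATGTGCCGCCTCACTACGGTGGTG",
    "read 3": "GATGTGCCGCCTCACTTCGGTGGTGA",
    "read 4": "GCTGATGTGCCGCCTCACTACGGTG",
}

if __name__ == "__main__":
    for name, read in READS.items():
        print(name, map_read(REFERENCE, read))
```

Reads 2 and 4 each map with a single mismatch at the same reference position, which is the kind of signal a SNP caller later looks for.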

Quality Control of the data
- First step after receiving the data
- Sometimes already done by the sequencing center (e.g., chastity)
Objectives:
- Remove bad quality reads
- Remove contaminants
- Trim ends of reads
- Remove orphans (if possible or desirable)
Tools:
- FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
- FastX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
- PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)

Phred quality score, a measure of base call quality
$Q_{\text{sanger}} = -10 \log_{10} p$, where $p$ is the probability that the base call is incorrect.
Phred quality scores are logarithmically linked to error probabilities:

  Phred quality score   Probability of incorrect call   Base call accuracy
  10                    1 in 10                         90%
  20                    1 in 100                        99%
  30                    1 in 1000                       99.9%
  40                    1 in 10000                      99.99%
  50                    1 in 100000                     99.999%

The quality score is ASCII-encoded in the FASTQ format. FASTQ is FASTA with quality scores.
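The relation between Phred scores and error probabilities can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for this example, and the default offset of 33 assumes the Sanger / Illumina 1.8+ encoding.

```python
import math


def phred_to_error_prob(q):
    """Error probability p for a Phred score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability p."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Turn an ASCII-encoded FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33); older Illumina files use 64.
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        print(q, phred_to_error_prob(q))
```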

Example of FASTA

>C3PO_0001:2:1:17:1499#0/1
TGAATTCATTGACCATAACAATCATATGCATGATGCAAATTATAATATCATT
TTTGTTTGAGCAAATGATTCATAATAATGTATTTCAATATTTTTAGGAATAT
CTCCCAATATTGCGCGTGCTGAATTCCATCCGGAATTTTTGACGTCCCCCCC
CGAANGGANGNGANNNNGNNGNNNTNTNNAAANGNNNNN

Example of FASTQ (Illumina 1.8+)

read 1:
@M01867:115:000000000-ABF5V:1:1101:9268:1666 1:N:0:51
AACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCGAACTGGTTGTTGGGTGCTTTTTG
+
--A-6@8CE,@<CEFGGFAFF9CEFF,C@CE@B<8@C:CC,,+,7@C<6,668C,,+8,6,,<9,+

read 2:
@M01867:115:000000000-ABF5V:1:1101:9214:1685 1:N:0:51
AACCGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGTCTACTAGTTGTTGGTGGAGTAAAA
+
--AA@7:FF9C9C@FEFE<CF9FEFF,C@FE:B8,6C:+C6CFD9CE,<C6<C@,,8,,,,;,,,-

read 3:
@M01867:115:000000000-ABF5V:1:1101:18344:1708 1:N:0:51
AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCAACTAGCCGTTGGGAGCCTTGAG
+
--A99E8CE<C9CFFGGG8FF9@CFF9ECFF@F;,CFC7C,CF,,CF@@EE@@@,,+,,6BE@,,-
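The four-line layout of a FASTQ record (header, sequence, separator, quality) can be read with a small generator. This is a simplified sketch: it assumes each record occupies exactly four lines with no wrapped sequences and no blank lines, and `read_fastq` is a name invented for this example rather than a library function.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (a blank line also stops the reader)
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


# First record from the slide.
EXAMPLE = """@M01867:115:000000000-ABF5V:1:1101:9268:1666 1:N:0:51
AACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCGAACTGGTTGTTGGGTGCTTTTTG
+
--A-6@8CE,@<CEFGGFAFF9CEFF,C@CE@B<8@C:CC,,+,7@C<6,668C,,+8,6,,<9,+
"""

if __name__ == "__main__":
    for name, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
        print(name, len(sequence), len(quality))
```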

Warning: various FASTQ formats

ASCII characters 33 to 126 used to encode quality:
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
ASCII code markers on the diagram: 33, 59, 64, 73, 104, 126

  Code  Format          Encoding    Raw reads typically
  S     Sanger          Phred+33    (0, 40)
  X     Solexa          Solexa+64   (-5, 40)
  I     Illumina 1.3+   Phred+64    (0, 40)
  J     Illumina 1.5+   Phred+64    (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
  L     Illumina 1.8+   Phred+33    (0, 41)

Source: http://en.wikipedia.org/wiki/fastq_format (see the discussion in the source article)

Quality control examples
[Figures: quality plots for forward and reverse reads]
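Because the same character means different scores under different encodings, tools usually guess the encoding from the lowest ASCII code seen in the quality strings. The sketch below is a simple heuristic built on the ASCII markers 59 and 64 from the diagram; it is an illustration, not the rule used by any particular tool, and it can be fooled by files that contain only high-quality bases.

```python
def guess_quality_encoding(quality_strings):
    """Guess the FASTQ quality encoding from the smallest ASCII code observed.

    Heuristic (an assumption for this example):
      - any character below ASCII 59 can only occur in Phred+33 files
      - characters from 59 to 63 with nothing lower suggest Solexa+64
      - otherwise the file is treated as Phred+64
    """
    lowest = min(ord(ch) for quality in quality_strings for ch in quality)
    if lowest < 59:
        return "Phred+33"
    if lowest < 64:
        return "Solexa+64"
    return "Phred+64"


if __name__ == "__main__":
    # Start of the quality string of read 1 on the FASTQ example slide.
    print(guess_quality_encoding(["--A-6@8CE,@"]))
```

In practice many reads should be inspected before trusting the guess, since a short sample may not contain any low-quality base.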

Quality Control examples
[Figures: two further quality control example plots]

Read trimming or filtering
- Trimming: remove the 5' and/or 3' ends of reads (bad quality or adapter)
- Filtering: remove full reads (e.g., contaminants)
Tools:
- FastX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
- PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)
- Sickle (https://github.com/najoshi/sickle)
- ea-utils (https://code.google.com/p/ea-utils/)
- Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
- cutadapt (https://cutadapt.readthedocs.org/)
- ...

Error correction
For substitutions (mainly Illumina):
- Quake
- Reptile
- ECHO
- HiTEC
For insertions and deletions (454, IonTorrent, PacBio, Oxford Nanopore):
- Coral
- HSHREC
- Quiver
- Arrow
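The difference between trimming and filtering can be sketched in a few lines. The functions below are deliberately simplistic and do not reproduce the algorithm of any of the tools listed above; the thresholds, the Phred+33 offset and the function names are assumptions chosen for illustration.

```python
def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Trimming: cut low-quality bases from the 3' end of a read.

    Bases are removed from the end while their Phred score is below min_quality.
    """
    scores = [ord(ch) - offset for ch in quality]
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filter(sequence, min_length=30, max_n_fraction=0.1):
    """Filtering: decide whether the whole read is kept or discarded."""
    if len(sequence) < min_length:
        return False
    return sequence.upper().count("N") / len(sequence) <= max_n_fraction


if __name__ == "__main__":
    print(trim_3prime("ACGTACGT", "IIIII###"))
    print(passes_filter("ACGTNNNN", min_length=5))
```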

Remapping methods
- By sequence comparison with Smith-Waterman: much too slow
- By sequence indexing (e.g., BLAST or BLAT)
Conventional tools like BLAST or BLAT do not work well with short sequence reads, so existing alignment algorithms were modified to handle short reads.

Indexing methods
- Suffix tree
- Suffix array
- Seed hash tables
- BWT (Burrows-Wheeler Transform)

Suffix tree
The suffix tree for a string S is a tree whose edges are labelled with strings, such that each suffix of S corresponds to exactly one path from the root to a leaf. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.
Memory footprint: 35 GB for the human genome
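To see why plain Smith-Waterman comparison is described as much too slow for remapping, it helps to look at the dynamic programming itself: every read is compared against every reference position, so the cost grows with (read length x reference length) per read. The sketch below computes only the best local alignment score; the scoring values are arbitrary choices for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty).

    Runs in O(len(a) * len(b)) time, which is what makes it impractical
    for millions of reads against a whole genome.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "ACG"))
```

Indexing the reference once (suffix tree, suffix array, seed hash table or BWT) avoids repeating this full comparison for each read.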

Suffix array: a sorted array of all suffixes of a string
Consider the string BANANA$ of length 7. It has 7 suffixes:

  index  suffix
  0      BANANA$
  1      ANANA$
  2      NANA$
  3      ANA$
  4      NA$
  5      A$
  6      $

After sorting:

  index  suffix
  6      $
  5      A$
  3      ANA$
  1      ANANA$
  0      BANANA$
  4      NA$
  2      NANA$

The suffix array is the array of indices: {6,5,3,1,0,4,2}
Memory footprint: 12 GB for the human genome

Seed hash table
Given the string ACGTACGTAAG of length 11, extract all substrings of length 4 (seeds) and store their starting positions.

  index  seed
  0,4    ACGT
  1,5    CGTA
  2      GTAC
  3      TACG
  6      GTAA
  7      TAAG

After sorting:

  index  seed
  0,4    ACGT
  1,5    CGTA
  6      GTAA
  2      GTAC
  7      TAAG
  3      TACG

The size of the hash table depends on the length of the seed and the complexity of the input string.
Memory footprint: 12 GB for the human genome
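Both index structures on this slide can be built naively in a few lines, which is enough to reproduce the BANANA$ and ACGTACGTAAG examples. This sketch sorts whole suffixes and is therefore far too slow and memory-hungry for a genome; the function names are invented for this example.

```python
from collections import defaultdict


def suffix_array(text):
    """Indices of all suffixes of text, in lexicographic order of the suffixes.

    '$' sorts before the letters because of its ASCII code.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def seed_table(text, seed_length):
    """Map every substring of length seed_length to its starting positions."""
    table = defaultdict(list)
    for start in range(len(text) - seed_length + 1):
        table[text[start:start + seed_length]].append(start)
    return dict(table)


if __name__ == "__main__":
    print(suffix_array("BANANA$"))
    for seed, positions in sorted(seed_table("ACGTACGTAAG", 4).items()):
        print(positions, seed)
```

Looking up a read then amounts to a binary search in the suffix array, or a dictionary lookup of its seeds followed by verification of the candidate positions.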

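Spaced seed hash table indexing (MAQ)
(original algorithm for remapping short reads with 2 mismatches)
MAQ builds 6 hash tables, each indexing 14 of the first 28 bases.
[Figure: seed templates over positions 1, 14 and 28 of the read]
If the 28 bases are viewed as four blocks of 7, the six tables correspond to the six ways of choosing two blocks. With at most 2 mismatches, at least two blocks are error-free, so at least one table gives an exact hit. Hence, MAQ finds all alignments with at most 2 mismatches in the first 28 bases.

Why Burrows-Wheeler?
- The BWT is very compact: approximately ½ byte per base
- The index is as large as the original text (sequence), plus a few extras
- It can fit onto a standard computer with 2 GB of memory
- Exact matches are found with a linear-time search algorithm, proportional to the length of the query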
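The claim that six tables covering 14 of the first 28 bases are enough for 2 mismatches can be verified exhaustively. The sketch below assumes the templates are built from four contiguous blocks of 7 bases taken two at a time; this layout is an assumption made for the illustration, and the names are hypothetical.

```python
from itertools import combinations

BLOCK_SIZE = 7
BLOCKS = [set(range(b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE)) for b in range(4)]

# Six templates, each covering two of the four blocks (14 of 28 positions).
TEMPLATES = [BLOCKS[i] | BLOCKS[j] for i, j in combinations(range(4), 2)]


def some_template_is_clean(mismatch_positions):
    """True if at least one template contains none of the mismatch positions."""
    mismatches = set(mismatch_positions)
    return any(not (template & mismatches) for template in TEMPLATES)


if __name__ == "__main__":
    all_pairs = combinations(range(28), 2)
    print(all(some_template_is_clean(pair) for pair in all_pairs))
```

With three mismatches spread over three different blocks, no template stays clean, which is why the guarantee stops at 2 mismatches.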

Burrows-Wheeler Transform (BWT)
Text: acaacg$

  All rotations    Sorted rotations (Burrows-Wheeler Matrix)
  $acaacg          $acaacg
  g$acaac          aacg$ac
  cg$acaa          acaacg$
  acg$aca          acg$aca
  aacg$ac          caacg$a
  caacg$a          cg$acaa
  acaacg$          g$acaac

BWT(acaacg$) = gc$aaac (the last column of the Burrows-Wheeler Matrix)
Langmead et al. 2009 Genome Biology

See the hidden suffix array?
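The construction on this slide (all rotations, sort, take the last column) translates directly into code. This is a naive sketch that materialises every rotation, so it only suits short strings; real tools derive the BWT from a suffix array instead. The function names are chosen for this example.

```python
def bwt(text):
    """Burrows-Wheeler Transform: last column of the sorted rotation matrix.

    text must end with a unique terminator such as '$'.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def rotation_order(text):
    """Start position of each sorted rotation: the 'hidden' suffix array."""
    return sorted(range(len(text)), key=lambda i: text[i:] + text[:i])


if __name__ == "__main__":
    print(bwt("acaacg$"))
    print(rotation_order("acaacg$"))
```

Each row of the sorted matrix begins with a suffix of the text (followed by the wrapped-around prefix), which is why the row order is the suffix array.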

Burrows-Wheeler Transform: LF mapping property
The i-th occurrence of character X in the Last column corresponds to the same text character as the i-th occurrence of X in the First column.

Text: acaacg$

  $acaacg
  aacg$ac
  acaacg$
  acg$aca
  caacg$a
  cg$acaa
  g$acaac

[Figure: the 2nd occurrence of a character is highlighted in both the First and the Last column]

Using LF, the UNPERMUTE algorithm can recreate the original string.
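The LF mapping is all that is needed to walk backwards through the text and rebuild it from the last column. The sketch below follows that idea; it is an illustrative version written for this example rather than a transcription of the UNPERMUTE pseudocode, and it assumes the text ends with a unique '$'.

```python
def inverse_bwt(last_column):
    """Rebuild the original text from its BWT using the LF mapping."""
    first_column = sorted(last_column)

    # Row index where each character first appears in the First column.
    first_occurrence = {}
    for row, ch in enumerate(first_column):
        first_occurrence.setdefault(ch, row)

    # rank[i] = how many times last_column[i] already occurred above row i.
    seen = {}
    rank = []
    for ch in last_column:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Row 0 starts with '$', so its last character precedes '$' in the text.
    row = 0
    reversed_text = []
    for _ in range(len(last_column) - 1):
        ch = last_column[row]
        reversed_text.append(ch)
        row = first_occurrence[ch] + rank[row]  # LF mapping
    return "".join(reversed(reversed_text)) + "$"


if __name__ == "__main__":
    print(inverse_bwt("gc$aaac"))
```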

Burrows-Wheeler Transform: LF mapping property
Using LF, the EXACTMATCH algorithm from Ferragina and Manzini can find the occurrences of a substring, working from right to left (! greedy)

Mapping tools history
http://www.ebi.ac.uk/~nf/hts_mappers/
[Figure: timeline of mapping tools. DNA mappers in blue, RNA mappers in red, miRNA mappers in green, bisulfite mappers in purple]
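Right-to-left exact matching on the BWT can be sketched as a backward search that narrows a range of rows in the sorted matrix one query character at a time. The version below is a simplified illustration: it recounts characters with `str.count` at every step instead of using precomputed occurrence tables, so it shows the logic but not the speed of a real FM-index. All names are invented for this example.

```python
def build_index(text):
    """Suffix array, BWT and C table for a text ending in a unique '$'."""
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = "".join(text[i - 1] for i in suffixes)
    # smaller[ch] = number of characters in the text that sort before ch.
    smaller = {ch: sum(c < ch for c in text) for ch in set(text)}
    return suffixes, last_column, smaller


def exact_match(text, pattern):
    """Sorted start positions of pattern in text, found by backward search."""
    suffixes, last_column, smaller = build_index(text)
    low, high = 0, len(text)  # current row range [low, high)
    for ch in reversed(pattern):
        if ch not in smaller:
            return []
        low = smaller[ch] + last_column.count(ch, 0, low)
        high = smaller[ch] + last_column.count(ch, 0, high)
        if low >= high:
            return []
    return sorted(suffixes[low:high])


if __name__ == "__main__":
    print(exact_match("acaacg$", "ac"))
```

The number of steps depends only on the length of the query, which is the linear-time property that makes BWT-based mappers fast.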

Example of output formats
[Figure: a) alignment, b) SAM, c) pileup]
Li H et al. Bioinformatics 2009;25:2078-2079

MAQ Pileup example

BA000018.3 36129 A 102 @.,,,,.,...,.,,,,...,...,,,,,.,,,..
BA000018.3 36130 A 103 @,,.,...,.,,,,...,...,,,,,.,,,...
BA000018.3 36131 T 100 @...,.,,,,...,...g.,,,,,.,,,...,.
BA000018.3 36132 T 93 @,...,...,,,,,.,,,...,..
BA000018.3 36133 A 95 @...,...,,,,,.,,,...,..,,,,
BA000018.3 36134 G 98 @...,...,,,,,.,,,...,..,,,,
BA000018.3 36135 T 99 @...,...G,G,,.,,,...,..,,,,...,
BA000018.3 36136 C 97 @...,...,,,,,.,,,...,..,,,,...,,.
BA000018.3 36137 T 96 @.,...,,,,,.,,,...,..,,,,...,,.,,
BA000018.3 36138 A 96 @..,,,,,.,,,...,..,,,,...,,.,,,,
BA000018.3 36139 T 93 @,,,.,,,...,..,,,,...,,.,,,,...
BA000018.3 36140 C 94 @,.,,,...,..,,,,...,,.,,,,...,..
BA000018.3 36141 A 97 @,.,,,...,..,,,,...,,.,,,,...,..,,.
BA000018.3 36142 A 100 @,,...,..,,,,...,,.,,,,...,..,,.,,...
BA000018.3 36143 A 102 @,...,..,,,,...,,.,,,,...,..,,.,,...,..
BA000018.3 36144 A 102 @...,..,,,,...,,.,,,,...,..,,.,,...,..,,.
BA000018.3 36145 G 102 @ttttttttttttttttttttttttttttttttttttttttttt
BA000018.3 36146 A 103 @,..,,,,...,,.,,,,...,..,,.,,...,..,,.,,,,,
BA000018.3 36147 A 105 @,..,,,,...,,.,,,,...,..,,.,,..g.,..,,.,,,,,,,
BA000018.3 36148 A 108 @..,,,,...,,.,,,,...,t.,,.,,...,..,,.,,,,,,,,,,.
BA000018.3 36149 G 110 @.,,,,...,,.,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..
BA000018.3 36150 G 113 @,,,...,,.,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,
BA000018.3 36151 G 109 @,.,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..
BA000018.3 36152 G 110 @,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.
BA000018.3 36153 T 111 @,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,
BA000018.3 36154 T 110 @...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,...,
BA000018.3 36155 G 111 @.,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..
BA000018.3 36156 G 116 @,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...
BA000018.3 36157 G 112 @.,t.,,.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...,,,
BA000018.3 36158 A 108 @.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.
BA000018.3 36159 C 111 @,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.,,..
BA000018.3 36160 T 113 @,,,,,,,,.,..,,...,..,,.,...,,,,,..N...,,,,.,,..,,,.
BA000018.3 36161 G 114 @,,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.,,..,,,.,,..
BA000018.3 36162 T 116 @,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.,,..,,,.,,...
BA000018.3 36163 T 120 @,..,,...,..,,.,...,,,,,..,...,,,,.,,..,,,.,,...,,,,,,,..
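In the pileup above, '.' and ',' stand for reads that agree with the reference on the forward and reverse strand, and letters stand for mismatching bases, so position 36145 (reference G, every read showing t) stands out as a candidate SNP. The sketch below counts the symbols on one line and makes a naive majority call. The column layout (sequence, position, reference base, depth, bases), the treatment of the leading '@' as a placeholder, and the 80% threshold are assumptions for this illustration, not a description of MAQ's caller.

```python
def parse_pileup_line(line):
    """Split one pileup line into (sequence, position, reference base, depth, bases)."""
    name, position, reference, depth, bases = line.split()[:5]
    # Assumption: the leading '@' is a placeholder and not a read base.
    return name, int(position), reference.upper(), int(depth), bases.lstrip("@")


def count_bases(bases):
    """Count reference matches ('.' or ',') and each mismatching base."""
    counts = {}
    for ch in bases:
        key = "ref" if ch in ".," else ch.upper()
        counts[key] = counts.get(key, 0) + 1
    return counts


def call_snp(line, min_fraction=0.8):
    """Return (sequence, position, reference, allele) if one non-reference base dominates."""
    name, position, reference, _depth, bases = parse_pileup_line(line)
    counts = count_bases(bases)
    total = sum(counts.values())
    if total == 0:
        return None
    allele, support = max(counts.items(), key=lambda item: item[1])
    if allele not in ("ref", "N") and support / total >= min_fraction:
        return name, position, reference, allele
    return None


if __name__ == "__main__":
    print(call_snp("BA000018.3 36145 G 102 @tttttttttt"))
```

Note that the call uses only the bases visible on the line; on the slide the displayed strings are shorter than the reported depth.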

SAM/BAM formats
Here is an example of a SAM file:

@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2

BAM is the binary compressed version of the same data.
More details:
- https://samtools.github.io/hts-specs/samv1.pdf
- http://genome.sph.umich.edu/wiki/sam
- http://samtools.sourceforge.net/sam1.pdf

Visualization tools for mapping (non-exhaustive list)

  Tool          Windows     Linux     Mac       Input format
  BAMview       Y           Y         Y         BAM
  Consed/Gap5   N           Y (X11)   Y (X11)   ACE, MAQ, BAM
  Eagleview     Y           Y         Y         ACE
  Gambit        Y           Y         Y         BAM
  Hawkeye       Y (cygwin)  Y         (Y)       afg (AMOS)
  IGViewer      Y           Y         Y         BAM, SAM, GFF, BED, VCF
  Tablet        Y           Y         Y         ACE, MAQ, BAM, afg, SAM, ...
  IGBrowser     Y           Y         Y         BAM, SAM, GFF, BED...

https://en.wikipedia.org/wiki/genome_browser
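Two fields of the SAM records above are compact encodings that are worth decoding by hand at least once: the FLAG (second column) is a bit field, and the CIGAR (sixth column) describes how the read lines up with the reference. The sketch below decodes both using the bit meanings and operation letters from the SAM specification; the short label strings and function names are made up for this example.

```python
import re

# FLAG bits as defined in the SAM specification; labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR.

    Only M, D, N, = and X consume reference positions.
    """
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in "MDN=X")


if __name__ == "__main__":
    print(decode_flag(99), reference_span("10M1D25M"))
    print(decode_flag(147), reference_span("35M"))
```

For the first read, 10M1D25M means 35 read bases spread over 36 reference positions, because the single deleted base is present in the reference but absent from the read.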

Text based with Samtools
[Figure: text-based alignment view in Samtools]

Tablet visualization of the mapping and the SNPs
Mapping of the reads from a Staphylococcus aureus sequencing run, showing 2 SNPs vs the reference genome.

IGV: Integrative Genomics Viewer

Summary: lessons from the remapping
- Easy to map reads onto a closely related reference (always better than de novo)
- Less easy to find non-matching reads and what they are (plasmids, insertion sequences, phages, viruses, other)
- Repeats are a nightmare in any case
- Paired-ends help

Next courses: SNPs, CNVs, and phasing