ASSEMBLY BY REMAPPING

Laurent Falquet, The Bioinformatics Unravelling Group, UNIFR & SIB
MA/MER @ UniFr, Group Leader @ SIB

Course overview
(workflow diagram, labels only) Genomic DNA; PacBio; Illumina; methylation; de novo; remapping; Annotation; Indels calling; SNP calling; Virulence/Resistance genes; VCF annotation; Comparative genomics (roary); Comparative genomics (SNP diff)

What is remapping?
Originally, "mapping" is the process of finding the location of genes on each chromosome. In the NGS context, "remapping" means identifying, by alignment, all possible locations of a read on a reference sequence (genome).

  AGCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   Reference sequence
    CTGATGTGCCGCCTCACTTCGGTGGT         Short read 1
     TGATGTGCCGCCTCACTACGGTGGTG        Short read 2
      GATGTGCCGCCTCACTTCGGTGGTGA       Short read 3
   GCTGATGTGCCGCCTCACTACGGTG           Short read 4
   GCTGATGTGCCGCCTCACTACGGTG           Short read 5

Reads 2, 4 and 5 carry an A where the reference has a T (reference position 21).

Next Generation Sequencing and remapping: an easy task?
Remapping reads onto an existing genome:
- Current tools are fast because they use the Burrows-Wheeler Transform
- Success depends on the degree of similarity between the reference and the sequenced target
- Detectable variations: SNPs and small insertions or deletions
- Variations difficult to identify: large insertions/deletions, inversions and translocations
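The read placement in the example can be reproduced with a deliberately naive sketch: slide each read along the reference and keep the offset with the fewest mismatches. This is only an illustration of the idea, not how real mappers work (they use an index and allow gaps); the function name `best_ungapped_hit` is hypothetical.

```python
def best_ungapped_hit(read, reference):
    """Return (offset, mismatches) of the best gap-free placement of read.

    Offsets are 0-based. Every possible offset is tried, which is far too
    slow for real genomes but is enough for a toy example.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


REFERENCE = "AGCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"
READS = [
    "CTGATGTGCCGCCTCACTTCGGTGGT",  # short read 1
    "TGATGTGCCGCCTCACTACGGTGGTG",  # short read 2
    "GATGTGCCGCCTCACTTCGGTGGTGA",  # short read 3
    "GCTGATGTGCCGCCTCACTACGGTG",   # short read 4
    "GCTGATGTGCCGCCTCACTACGGTG",   # short read 5
]

HITS = [best_ungapped_hit(read, REFERENCE) for read in READS]

for number, (offset, mismatches) in enumerate(HITS, start=1):
    print(f"read {number}: offset {offset}, mismatches {mismatches}")
```

Reads 1 and 3 place without mismatch, while reads 2, 4 and 5 each place with a single mismatch at the same reference base.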

Quality Control of the data
- First step after receiving the data
- Sometimes already done by the sequencing center (e.g., chastity filter)
Objectives:
- Remove bad quality reads
- Remove contaminants
- Trim ends of reads
- Remove orphans (if possible or desirable)
Tools:
- FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
- FastX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
- PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)

Phred quality score, a measure of base call quality
$Q_{\text{sanger}} = -10 \log_{10} p$, where $p$ is the probability that the base call is incorrect.
Phred quality scores are logarithmically linked to error probabilities:

Phred quality score   Probability of incorrect call   Base call accuracy
10                    1 in 10                         90%
20                    1 in 100                        99%
30                    1 in 1000                       99.9%
40                    1 in 10000                      99.99%
50                    1 in 100000                     99.999%

The quality score is ASCII-encoded in the FASTQ format: FASTQ is FASTA with quality scores.
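The Phred relation can be checked numerically with a short sketch. The function names are hypothetical, and the default offset of 33 assumes the Sanger / Illumina 1.8+ encoding.

```python
import math


def phred_to_error_prob(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Turn an ASCII-encoded quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+); older Illumina
    files use an offset of 64.
    """
    return [ord(char) - offset for char in quality_string]


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```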

Example of FASTA

  >C3PO_0001:2:1:17:1499#0/1
  TGAATTCATTGACCATAACAATCATATGCATGATGCAAATTATAATATCATT
  TTTGTTTGAGCAAATGATTCATAATAATGTATTTCAATATTTTTAGGAATAT
  CTCCCAATATTGCGCGTGCTGAATTCCATCCGGAATTTTTGACGTCCCCCCC
  CGAANGGANGNGANNNNGNNGNNNTNTNNAAANGNNNNN

Example of FASTQ (Illumina 1.8+)

  @M01867:115:000000000-ABF5V:1:1101:9268:1666 1:N:0:51
  AACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCGAACTGGTTGTTGGGTGCTTTTTG
  +
  --A-6@8CE,@<CEFGGFAFF9CEFF,C@CE@B<8@C:CC,,+,7@C<6,668C,,+8,6,,<9,+
  @M01867:115:000000000-ABF5V:1:1101:9214:1685 1:N:0:51
  AACCGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGTCTACTAGTTGTTGGTGGAGTAAAA
  +
  --AA@7:FF9C9C@FEFE<CF9FEFF,C@FE:B8,6C:+C6CFD9CE,<C6<C@,,8,,,,;,,,-
  @M01867:115:000000000-ABF5V:1:1101:18344:1708 1:N:0:51
  AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCAACTAGCCGTTGGGAGCCTTGAG
  +
  --A99E8CE<C9CFFGGG8FF9@CFF9ECFF@F;,CFC7C,CF,,CF@@EE@@@,,+,,6BE@,,-

Each FASTQ record spans four lines: header (@...), sequence, separator (+) and quality string. The three records above are read 1, read 2 and read 3.
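A FASTQ record of this four-line form can be read with a few lines of code. The sketch below is a minimal reader, assuming one line per sequence and per quality string (as in the example); `read_fastq` is a hypothetical helper, not part of any library.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"not a FASTQ header: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# First record of the example, copied from the slide.
EXAMPLE = "\n".join([
    "@M01867:115:000000000-ABF5V:1:1101:9268:1666 1:N:0:51",
    "AACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCGAACTGGTTGTTGGGTGCTTTTTG",
    "+",
    "--A-6@8CE,@<CEFGGFAFF9CEFF,C@CE@B<8@C:CC,,+,7@C<6,668C,,+8,6,,<9,+",
]) + "\n"

RECORDS = list(read_fastq(StringIO(EXAMPLE)))

for name, sequence, quality in RECORDS:
    # Illumina 1.8+ uses Phred+33, so subtract 33 from each ASCII code.
    scores = [ord(char) - 33 for char in quality]
    print(name.split()[0], len(sequence), "bases, lowest quality", min(scores))
```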

Warning: various FASTQ formats
ASCII characters used to encode quality values (codes 33 to 126):

  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  Landmark codes: 33 (!), 59 (;), 64 (@), 73 (I), 104 (h), 126 (~)

Code  Format          Encoding    Raw reads typically
S     Sanger          Phred+33    (0, 40)
X     Solexa          Solexa+64   (-5, 40)
I     Illumina 1.3+   Phred+64    (0, 40)
J     Illumina 1.5+   Phred+64    (3, 40), with 0 = unused, 1 = unused, 2 = Read Segment Quality Control Indicator
L     Illumina 1.8+   Phred+33    (0, 41)

Source: http://en.wikipedia.org/wiki/fastq_format

Quality control examples
(figures: Forward and Reverse read panels; images not reproduced)
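Because the encodings overlap, the offset can only be guessed from the range of characters observed. The sketch below is a simple heuristic based on the landmark codes 59, 64 and 74 (the highest code expected for Phred+33 with scores up to 41); the function name and the returned labels are hypothetical, and a real file should be checked against its documentation.

```python
def guess_quality_encoding(quality_strings):
    """Guess the FASTQ quality encoding from the observed ASCII range."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' only exist in the offset-33 encodings.
        return "Phred+33"
    if highest > 74:
        # Characters above 'J' only exist in the offset-64 encodings.
        return "Solexa+64" if lowest < 64 else "Phred+64"
    # Only codes 59..74 seen: compatible with several encodings.
    return "ambiguous"


print(guess_quality_encoding(["--A-6@8CE,@<CEFGGFAFF9CEFF"]))
```

With only a few reads the answer can come back as ambiguous, so it is safer to feed the function a large sample of quality strings.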

Quality Control examples
(two figure slides; images not reproduced)

Read trimming or filtering
- Trimming: remove 5' and/or 3' ends of reads (bad quality or adapter)
- Filtering: remove full reads (e.g., contaminants)
Tools:
- FastX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
- PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)
- Sickle (https://github.com/najoshi/sickle)
- ea-utils (https://code.google.com/p/ea-utils/)
- Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
- cutadapt (https://cutadapt.readthedocs.org/)
- ...

Error correction
For substitutions (mainly Illumina):
- Quake
- Reptile
- ECHO
- HiTEC
For insertions and deletions (454, IonTorrent, PacBio, ONT):
- Coral
- HSHREC
- Quiver
- Arrow
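The difference between trimming and filtering can be illustrated with a minimal sketch. It cuts low-quality bases from the 3' end only and then discards reads that became too short. The threshold values, the Phred+33 offset and the function names are assumptions for illustration; the tools listed above apply more elaborate rules.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Trimming: drop trailing bases whose Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence, min_length=5):
    """Filtering: keep a read only if enough bases remain after trimming."""
    return len(sequence) >= min_length


# Hypothetical toy reads: 'I' encodes Q40 and '#' encodes Q2 with Phred+33.
TOY_READS = [
    ("ACGTACGT", "IIIII###"),
    ("ACGTACGT", "II######"),
]

for sequence, quality in TOY_READS:
    trimmed_seq, trimmed_qual = trim_3prime(sequence, quality)
    status = "kept" if keep_read(trimmed_seq) else "discarded"
    print(sequence, "->", trimmed_seq, status)
```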

Remapping methods
- By sequence comparison with Smith-Waterman: much too slow
- By sequence indexing (e.g., BLAST or BLAT)
Conventional tools like BLAST or BLAT do not work well with short sequence reads.
-> Existing alignment algorithms were modified to handle short reads.

Indexing methods
- Suffix tree
- Suffix array
- Seed hash tables
- BWT (Burrows-Wheeler Transform)

Suffix tree
The suffix tree for a string S is a tree whose edges are labelled with strings, such that each suffix of S corresponds to exactly one path from the root to a leaf. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. This speed comes at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.
35 GB for the human genome

Suffix array: a sorted array of all suffixes of a string
Consider the string BANANA$ of length 7. It has 7 suffixes:

index  suffix
0      BANANA$
1      ANANA$
2      NANA$
3      ANA$
4      NA$
5      A$
6      $

After sorting:

index  suffix
6      $
5      A$
3      ANA$
1      ANANA$
0      BANANA$
4      NA$
2      NANA$

The suffix array is the array of indices: {6,5,3,1,0,4,2}
12 GB for the human genome

Seed hash table
Given the string ACGTACGTAAG of length 11, extract all substrings of length 4 (seeds) and store their starting positions.

index  seed
0,4    ACGT
1,5    CGTA
2      GTAC
3      TACG
6      GTAA
7      TAAG

After sorting:

index  seed
0,4    ACGT
1,5    CGTA
6      GTAA
2      GTAC
7      TAAG
3      TACG

The size of the hash table depends on the length of the seed and the complexity of the input string.
12 GB for the human genome
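Both small examples can be reproduced in a few lines. This is a naive sketch for toy strings only: sorting whole suffixes is quadratic in memory, whereas genome-scale tools use dedicated construction algorithms. The function names are hypothetical.

```python
from collections import defaultdict


def suffix_array(text):
    """Start positions of all suffixes of text, in sorted suffix order.

    Relies on '$' sorting before the letters, which is true in ASCII.
    """
    return sorted(range(len(text)), key=lambda start: text[start:])


def seed_table(text, seed_length):
    """Map every substring of length seed_length to its start positions."""
    table = defaultdict(list)
    for start in range(len(text) - seed_length + 1):
        table[text[start:start + seed_length]].append(start)
    return dict(table)


print(suffix_array("BANANA$"))

SEEDS = seed_table("ACGTACGTAAG", 4)
for seed in sorted(SEEDS):
    print(seed, SEEDS[seed])
```

Looking up a read then amounts to extracting its seeds and fetching the candidate positions from the table before verifying the full alignment.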

Spaced seed hash table indexing (MAQ)
(original algorithm for remapping short reads with 2 mismatches)
MAQ builds 6 hash tables, each indexing 14 of the first 28 bases (figure: read with positions 1, 14 and 28 marked). When a read has at most 2 mismatches in its first 28 bases, at least one of the 6 tables indexes only unaffected bases and still gives an exact hit. Hence, MAQ finds all alignments with at most 2 mismatches in the first 28 bases.

Why Burrows-Wheeler?
- BWT is very compact: approximately ½ byte per base
- As large as the original text (sequence), plus a few extras
- Can fit onto a standard computer with 2 GB of memory
- Linear-time search algorithm: for exact matches, time is proportional to the length of the query
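The numbers 6 and 14 are consistent with one particular layout, which the sketch below assumes: the 28 bases are cut into four blocks of 7, and each table indexes a different pair of blocks. This block layout is an assumption made for illustration, not taken from the slide. Under it, an exhaustive check confirms that any two mismatches leave at least one table untouched.

```python
from itertools import combinations

SEED_LENGTH = 28
BLOCK_SIZE = 7                               # assumed block layout: 4 blocks of 7 bases
BLOCK_COUNT = SEED_LENGTH // BLOCK_SIZE

# One template (hash table) per pair of blocks: C(4, 2) = 6 templates,
# each covering 2 * 7 = 14 of the 28 bases.
TEMPLATES = list(combinations(range(BLOCK_COUNT), 2))


def surviving_templates(mismatch_positions):
    """Templates whose indexed bases contain none of the mismatches (0-based positions)."""
    hit_blocks = {position // BLOCK_SIZE for position in mismatch_positions}
    return [template for template in TEMPLATES if not hit_blocks & set(template)]


# Exhaustive check over every possible pair of mismatch positions.
ALL_PAIRS_COVERED = all(
    surviving_templates(pair) for pair in combinations(range(SEED_LENGTH), 2)
)

print(len(TEMPLATES), "templates; every 2-mismatch case covered:", ALL_PAIRS_COVERED)
```

Three mismatches can fall into three different blocks, in which case no pair of blocks is left intact; that is why the guarantee stops at 2.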

Burrows-Wheeler Transform (BWT)
Text: acaacg$

All rotations    Sorted rotations (BW matrix)
$acaacg          $acaacg
g$acaac          aacg$ac
cg$acaa          acaacg$
acg$aca          acg$aca
aacg$ac          caacg$a
caacg$a          cg$acaa
acaacg$          g$acaac

BWT (last column of the BW matrix): gc$aaac
Langmead et al. 2009, Genome Biology

See the hidden suffix array? Cutting each row of the BW matrix after the $ leaves the suffixes of the text in sorted order.
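The construction can be written directly from its definition. This is a naive sketch that materialises every rotation, which is fine for a 7-character text but not for a genome; the function names are hypothetical.

```python
def bw_matrix(text):
    """Sorted list of all rotations of text (text must end with a unique '$')."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def bwt(text):
    """Burrows-Wheeler Transform: last column of the BW matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


def hidden_suffix_array(text):
    """Suffix start positions read off the BW matrix rows.

    In each row the suffix ends at '$', so its start position in the
    original text is len(text) - 1 - (index of '$' in the row).
    """
    return [len(text) - 1 - row.index("$") for row in bw_matrix(text)]


print(bwt("acaacg$"))                  # gc$aaac
print(hidden_suffix_array("acaacg$"))  # [6, 2, 0, 3, 1, 4, 5]
```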

Burrows-Wheeler Transform: LF mapping property
The i-th occurrence of character X in the Last column corresponds to the same text character as the i-th occurrence of X in the First column.

Text: acaacg$
BW matrix:
  $acaacg
  aacg$ac
  acaacg$
  acg$aca
  caacg$a
  cg$acaa
  g$acaac
(figure: the 2nd occurrence of a character is highlighted in both the First and the Last column)

Using LF, the UNPERMUTE algorithm can recreate the original string.
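A minimal sketch of how the LF property lets the original text be rebuilt from the last column alone. It assumes the text ends with a unique `$` that sorts before every other character; the function names are hypothetical and the code follows the idea of UNPERMUTE rather than any specific published listing.

```python
def lf_mapping(last):
    """For each row, the row whose First-column character is the same text character."""
    first_row_of = {}
    for row, char in enumerate(sorted(last)):
        first_row_of.setdefault(char, row)  # first row starting with char
    seen = {}
    mapping = []
    for char in last:
        rank = seen.get(char, 0)            # this is the rank-th occurrence in Last
        mapping.append(first_row_of[char] + rank)
        seen[char] = rank + 1
    return mapping


def unpermute(last, terminator="$"):
    """Rebuild the original text from its BWT by walking LF backwards."""
    mapping = lf_mapping(last)
    row = 0                                 # row 0 starts with the terminator
    chars = []
    for _ in range(len(last) - 1):
        chars.append(last[row])             # character preceding the current one
        row = mapping[row]
    return "".join(reversed(chars)) + terminator


print(unpermute("gc$aaac"))  # acaacg$
```

Each step reads one character from the Last column and jumps to the row where that same character sits in the First column, so the text is recovered from right to left.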

Burrows-Wheeler Transform: LF mapping property
Using LF, the EXACTMATCH algorithm from Ferragina and Manzini can find the occurrences of a substring by matching it from right to left (greedy).

Mapping tools history
http://www.ebi.ac.uk/~nf/hts_mappers/
(timeline figure; colour legend: DNA mappers in blue, RNA mappers in red, miRNA mappers in green, bisulfite mappers in purple)
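The right-to-left search can be sketched as a backward search over the BWT: a range of BW-matrix rows is narrowed once per pattern character, starting from the last one. The code below is a toy version of that idea, with hypothetical function names; it counts characters by rescanning the string, whereas real tools use precomputed occurrence tables.

```python
def bwt(text):
    """Last column of the sorted rotations of text (text ends with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def count_occurrences(pattern, last):
    """Number of times pattern occurs in the text whose BWT is last."""
    first_row_of = {}
    for row, char in enumerate(sorted(last)):
        first_row_of.setdefault(char, row)   # first row starting with char

    top, bottom = 0, len(last)               # half-open range of matching rows
    for char in reversed(pattern):           # right to left
        if char not in first_row_of:
            return 0
        # Occurrences of char in last[:top] and last[:bottom]; O(n) per step
        # here, precomputed in real implementations.
        top = first_row_of[char] + last[:top].count(char)
        bottom = first_row_of[char] + last[:bottom].count(char)
        if top >= bottom:
            return 0
    return bottom - top


LAST = bwt("acaacg$")
for pattern in ("aac", "ac", "gg"):
    print(pattern, count_occurrences(pattern, LAST))
```

The number of steps depends only on the pattern length, which is the property that makes the approach attractive for millions of short reads.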

Example of output formats
(figure: a) alignment, b) SAM, c) pileup; Li H et al. Bioinformatics 2009;25:2078-2079)

MAQ Pileup example
Columns: reference name, position, reference base, depth, read bases (prefixed with @)

BA000018.3 36129 A 102 @.,,,,.,...,.,,,,...,...,,,,,.,,,..
BA000018.3 36130 A 103 @,,.,...,.,,,,...,...,,,,,.,,,...
BA000018.3 36131 T 100 @...,.,,,,...,...g.,,,,,.,,,...,.
BA000018.3 36132 T 93 @,...,...,,,,,.,,,...,..
BA000018.3 36133 A 95 @...,...,,,,,.,,,...,..,,,,
BA000018.3 36134 G 98 @...,...,,,,,.,,,...,..,,,,
BA000018.3 36135 T 99 @...,...G,G,,.,,,...,..,,,,...,
BA000018.3 36136 C 97 @...,...,,,,,.,,,...,..,,,,...,,.
BA000018.3 36137 T 96 @.,...,,,,,.,,,...,..,,,,...,,.,,
BA000018.3 36138 A 96 @..,,,,,.,,,...,..,,,,...,,.,,,,
BA000018.3 36139 T 93 @,,,.,,,...,..,,,,...,,.,,,,...
BA000018.3 36140 C 94 @,.,,,...,..,,,,...,,.,,,,...,..
BA000018.3 36141 A 97 @,.,,,...,..,,,,...,,.,,,,...,..,,.
BA000018.3 36142 A 100 @,,...,..,,,,...,,.,,,,...,..,,.,,...
BA000018.3 36143 A 102 @,...,..,,,,...,,.,,,,...,..,,.,,...,..
BA000018.3 36144 A 102 @...,..,,,,...,,.,,,,...,..,,.,,...,..,,.
BA000018.3 36145 G 102 @ttttttttttttttttttttttttttttttttttttttttttt
BA000018.3 36146 A 103 @,..,,,,...,,.,,,,...,..,,.,,...,..,,.,,,,,
BA000018.3 36147 A 105 @,..,,,,...,,.,,,,...,..,,.,,..g.,..,,.,,,,,,,
BA000018.3 36148 A 108 @..,,,,...,,.,,,,...,t.,,.,,...,..,,.,,,,,,,,,,.
BA000018.3 36149 G 110 @.,,,,...,,.,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..
BA000018.3 36150 G 113 @,,,...,,.,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,
BA000018.3 36151 G 109 @,.,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..
BA000018.3 36152 G 110 @,,,,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.
BA000018.3 36153 T 111 @,...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,
BA000018.3 36154 T 110 @...,..,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,...,
BA000018.3 36155 G 111 @.,,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..
BA000018.3 36156 G 116 @,.,,...,..,,.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...
BA000018.3 36157 G 112 @.,t.,,.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...,,,
BA000018.3 36158 A 108 @.,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.
BA000018.3 36159 C 111 @,,,,,,,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.,,..
BA000018.3 36160 T 113 @,,,,,,,,.,..,,...,..,,.,...,,,,,..N...,,,,.,,..,,,.
BA000018.3 36161 G 114 @,,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.,,..,,,.,,..
BA000018.3 36162 T 116 @,,,,.,..,,...,..,,.,...,,,,,..,...,,,,.,,..,,,.,,...
BA000018.3 36163 T 120 @,..,,...,..,,.,...,,,,,..,...,,,,.,,..,,,.,,...,,,,,,,..
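In the pileup, dots and commas stand for read bases identical to the reference and letters for read bases that differ. The sketch below parses one pileup line and reports the fraction of displayed bases that disagree with the reference. It assumes the five whitespace-separated columns seen in the example; the function name and the returned keys are hypothetical. The base strings shown in the example are shorter than the depth column, so the fraction is computed over the displayed bases only.

```python
def parse_pileup_line(line):
    """Split a five-column MAQ-style pileup line and summarise mismatches."""
    chrom, position, ref_base, depth, bases = line.split()
    bases = bases.lstrip("@")                     # drop the leading '@' marker
    mismatches = [base for base in bases if base not in ".,"]
    return {
        "chrom": chrom,
        "position": int(position),
        "ref": ref_base,
        "depth": int(depth),
        "mismatches": mismatches,
        "mismatch_fraction": len(mismatches) / len(bases) if bases else 0.0,
    }


# Two lines copied from the example above.
LINES = [
    "BA000018.3 36131 T 100 @...,.,,,,...,...g.,,,,,.,,,...,.",
    "BA000018.3 36145 G 102 @ttttttttttttttttttttttttttttttttttttttttttt",
]

PARSED = [parse_pileup_line(line) for line in LINES]
for record in PARSED:
    print(record["position"], record["ref"], f"{record['mismatch_fraction']:.2f}")
```

Position 36145, where every displayed base is a t against a reference G, stands out as a candidate SNP, whereas the single g at 36131 looks more like a sequencing error.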

SAM/BAM formats
Here is an example of a SAM file (fields are tab-separated in a real file):

  @HD VN:1.0
  @SQ SN:chr20 LN:62435964
  @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
  @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
  read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
  read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2

BAM is the binary compressed version of the same data.
More details:
- https://samtools.github.io/hts-specs/samv1.pdf
- http://genome.sph.umich.edu/wiki/sam
- http://samtools.sourceforge.net/sam1.pdf

Visualization tools for mapping (non-exhaustive list)

Tool          Windows      Linux     Mac       Input format
BAMview       Y            Y         Y         BAM
Consed/Gap5   N            Y (X11)   Y (X11)   ACE, MAQ, BAM
Eagleview     Y            Y         Y         ACE
Gambit        Y            Y         Y         BAM
Hawkeye       Y (cygwin)   Y         (Y)       afg (AMOS)
IGViewer      Y            Y         Y         BAM, SAM, GFF, BED, VCF
Tablet        Y            Y         Y         ACE, MAQ, BAM, afg, SAM
IGBrowser     Y            Y         Y         BAM, SAM, GFF, BED...

https://en.wikipedia.org/wiki/genome_browser
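Two fields of the alignment lines are compact encodings worth decoding by hand once: the FLAG (second column, a bit field) and the CIGAR string (sixth column). The sketch below uses the bit values from the SAM specification; the short labels and the function names are my own, chosen for illustration.

```python
import re

# FLAG bits as defined in the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (bases consumed on the read, bases consumed on the reference)."""
    on_read = on_reference = 0
    for length, operation in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if operation in "MIS=X":   # operations that consume read bases
            on_read += int(length)
        if operation in "MDN=X":   # operations that consume reference bases
            on_reference += int(length)
    return on_read, on_reference


print(decode_flag(99))
print(decode_flag(147))
print(cigar_lengths("10M1D25M"))
```

For the first read of the example, `10M1D25M` describes 35 read bases spanning 36 reference bases because of the one-base deletion, which matches the 35-base sequence shown.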

Text based with Samtools
(figure: text-based alignment view; image not reproduced)

Tablet visualization of the mapping and the SNPs
(figure) Mapping of the reads from a Staphylococcus aureus sequencing run, showing 2 SNPs vs the reference genome.

IGV: Integrative Genomics Viewer
(figure; image not reproduced)

Summary: lessons from the remapping
- Easy to map reads onto a closely related reference (always better than de novo)
- Less easy to find non-matching reads and what they are (plasmids, insertion sequences, phages, viruses, other)
- Repeats are a nightmare in any case
- Paired-ends help

Next courses: SNPs, CNVs, and phasing