Myzus persicae Clone G006 Assembly


R. Chikhi, T. Derrien, F. Legeai
October 8, 2013

1 Reads correction

Although Illumina base calls are generally accurate, errors do occur; they are typically substitutions, arise at a frequency of 0.5-2.5%, and are concentrated at the 3' end of the reads [7]. Assembling reads that contain errors may create false-positive overlaps between k-mers (or reads) or introduce spurious gaps in the final assembly. We therefore used a state-of-the-art program, Quake [7], dedicated to detecting and correcting sequencing errors in reads. Briefly, Quake weights the k-mer counts observed in the reads by the quality values assigned to each base, and uses these weighted counts to choose a coverage cutoff separating trusted k-mers (those that are truly part of the genome) from erroneous k-mers. The main statistics of this correction step are summarized in table 1.

Library   Number of   Validated    Corrected    Trimmed      Total cleaned
name      fragments   fragments*   fragments*   fragments*   fragments
MPA       140         66-63        9-10         38-37        81
MPB       152         45-53        21-10        48-50        72
S(PE)     182         147-127      16-23        21-19        162

Table 1: Summary of Quake corrections on the Myzus libraries. Read numbers are in millions. (* the two numbers correspond to the 2 mate files of each library.)

2 Assembly

2.1 Minia assembly

The Minia assembler (v1.5215) [4] was used to assemble the paired-end reads into contigs. It was executed in de Bruijn graph assembly mode, i.e. no scaffolds were created. The command-line parameters were k=71, a value recommended by a run of kmergenie [3], and min_abundance=43. The contigs were then scaffolded (s) and gap-filled (g), executing each operation twice in the order s+g+s+g. The scaffolding software used is SuperScaffolder [5], a modified version of SSPACE [2] that is still in development. SuperScaffolder version 0.5304 was executed with no parameters other than the read files and the insert size of each library. The gap-filling software used is GapCloser v1.12 from SOAPdenovo [11], executed with default parameters. Only the two mate-pair libraries were used in scaffolding and gap-filling. Minia and SuperScaffolder were run on the non-corrected reads.

2.2 AllPaths-LG2 assembly

The AllPaths-LG (r40324) [13] assembler was run using the default parameters, as recommended by the authors. We provided the following descriptor files:

in_groups.csv:
group_name, library_name, file_name
MPA, MPA, awilson_mpersicaeg006_201119392-01mpa_s_7_?.fastq
MPB, MPB, awilson_mpersicaeg006_201119392-01mpb_s_8_?.fastq
PE, PE, awilson_mpersicaeg006_201119392-01_s_6_?.fastq

in_libs.csv:
library_name, project_name, organism_name, type, paired, frag_size, frag_stddev, insert_size, insert_stddev, read_orientation, genomic_start, genomic_end
MPA, Myzus, Myzus persicae, jumping, 1,,, 5000,500, outward,,
MPB, Myzus, Myzus persicae, jumping, 1,,, 2000,200, outward,,
PE, Myzus, Myzus persicae, fragment, 1, 200, 20,,, inward,,

2.3 Abyss assembly

The Abyss assembler v1.3.2 [15] was run on the corrected reads with different k-mer sizes (k=41, 64 and 91).

3 Metrics

We computed standard statistics on the AllPaths-LG2, Minia and Abyss assemblies, for scaffolds (table 2) and contigs (table 3). These metrics were calculated with the script assemblathon_stats2.pl, used during the Assemblathon contest [6]. The expected genome size was fixed at 350 Mbp, and scaffolds were split into contigs at runs of 3 or more consecutive Ns. All these metrics were calculated on scaffolds larger than 1000 bp.

3.1 Conclusion

Minia gives better metrics, in particular a better scaffold NG50 and fewer Ns in scaffolds. This better performance is not mainly due to the gap-closing step or to better scaffolding, because the contigs also show a better NG50. In comparison with the other assemblers, the metrics given by Abyss are poor (number of scaffolds, NG50, ...). We therefore decided to discard Abyss from further analyses and to compare only the AllPaths-LG and Minia assemblies.
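The N50 and NG50 metrics of section 3 can be illustrated with a minimal Python sketch. This is not the Assemblathon script itself: the function names are hypothetical, and the length filter (>= 1000 bp) is an assumption based on the shortest scaffolds listed in table 2. N50 takes the assembly size as the reference total, whereas NG50 takes the expected genome size (350 Mbp here).

```python
import re


def n50_like(lengths, reference_total):
    """Length L such that sequences of length >= L sum to at least half of reference_total."""
    half = reference_total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the sequences do not reach half of the reference total


def split_contigs(scaffold, min_gap=3):
    """Split a scaffold into contigs at runs of at least min_gap consecutive Ns."""
    pieces = re.split("N{%d,}" % min_gap, scaffold.upper())
    return [piece for piece in pieces if piece]


def assembly_stats(lengths, genome_size=350_000_000, min_length=1000):
    """N50 (relative to assembly size) and NG50 (relative to expected genome size)."""
    kept = [n for n in lengths if n >= min_length]  # assumed filter, see lead-in
    return {
        "n50": n50_like(kept, sum(kept)),
        "ng50": n50_like(kept, genome_size),
    }
```

When the assembly is smaller than the expected genome size, NG50 is lower than or equal to N50, which is the situation of the AllPaths-LG and Minia assemblies in table 2.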

Metric                                                    Allpaths        Minia           Abyss (k=41)    Abyss (k=64)    Abyss (k=91)
Number of scaffolds                                       4 297           3 686           12 043          12 511          27 753
Total size of scaffolds                                   348 159 819     347 538 302     336 430 733     363 405 053     349 109 988
Total scaffold length as percentage of known genome size  99.5%           99.3%           96.1%           103.8%          99.7%
Longest scaffold                                          2 198 345       3 415 201       650 496         806 817         643 518
Shortest scaffold                                         1 001           1 002           1 000           1 000           1 000
Number of scaffolds > 10K nt                              1 858 (43.2%)   1 396 (37.9%)   4 773 (39.6%)   5 128 (41.0%)   8 916 (32.1%)
Number of scaffolds > 100K nt                             788 (18.3%)     673 (18.3%)     976 (8.1%)      1 031 (8.2%)    246 (0.9%)
Number of scaffolds > 1M nt                               38 (0.9%)       67 (1.8%)       0 (0.0%)        0 (0.0%)        0 (0.0%)
Mean scaffold size                                        81 024          94 286          27 936          29 047          12 579
Median scaffold size                                      5 828           4 552           5 846           6 379           4 842
N50 scaffold length                                       434 305         594 200         99 116          109 344         32 150
L50 scaffold count                                        225             161             994             927             2 972
NG50 scaffold length                                      433 380         591 021         94 204          114 469         32 048
LG50 scaffold count                                       227             164             1 064           867             2 985
N50 scaffold - NG50 scaffold length difference            925             3 179           4 912           5 125           102
scaffold %A                                               34.09           34.51           34.15           34.42           33.45
scaffold %C                                               14.65           14.82           14.61           14.77           14.40
scaffold %G                                               14.64           14.82           14.61           14.79           14.39
scaffold %T                                               34.12           34.52           34.14           34.44           33.46
scaffold %N                                               2.41            1.32            2.49            1.57            4.31
scaffold N nt                                             8 400 913       4 603 220       8 360 599       5 715 892       15 045 988

Table 2: Scaffolds metrics

Metric                                                    Allpaths        Minia           Abyss (k=41)    Abyss (k=64)    Abyss (k=91)
Percentage of assembly in scaffolded contigs              95.4%           96.2%           93.5%           88.8%           90.8%
Percentage of assembly in unscaffolded contigs            4.6%            3.8%            6.5%            11.2%           9.2%
Average number of contigs per scaffold                    3.0             2.9             3.8             2.8             3.9
Average length of breaks (3 or more Ns) between contigs   987             649             243             260             186
Number of contigs                                         12 799          10 758          46 293          34 471          107 826
Number of contigs in scaffolds                            10 290          9 271           41 237          28 037          96 021
Number of contigs not in scaffolds                        2 509           1 487           5 056           6 434           11 805
Total size of contigs                                     339 760 004     342 946 467     328 075 422     357 694 813     334 140 822
Longest contig                                            560 224         807 357         134 594         169 705         643 518
Shortest contig                                           28              44              41              64              99
Number of contigs > 500 nt                                12 647 (98.8%)  9 937 (92.4%)   40 259 (87.0%)  32 420 (94.1%)  98 964 (91.8%)
Number of contigs > 1K nt                                 12 302 (96.1%)  8 655 (80.5%)   36 655 (79.2%)  29 900 (86.7%)  85 497 (79.3%)
Number of contigs > 10K nt                                5 915 (46.2%)   3 983 (37.0%)   10 568 (22.8%)  10 926 (31.7%)  4 173 (3.9%)
Number of contigs > 100K nt                               870 (6.8%)      1 072 (10.0%)   9 (0.0%)        65 (0.2%)       1 (0.0%)
Number of contigs > 1M nt                                 0 (0.0%)        0 (0.0%)        0 (0.0%)        0 (0.0%)        0 (0.0%)
Mean contig size                                          26 546          31 878          7 087           10 377          3 099
Median contig size                                        8 130           4 029           3 502           4 703           2 091
N50 contig length                                         77 851          140 790         15 512          23 478          4 837
L50 contig count                                          1 283           691             6 122           4 406           20 504
NG50 contig length                                        75 354          137 895         14 334          24 016          4 606
LG50 contig count                                         1 350           716             6 859           4 244           22 184
N50 contig - NG50 contig length difference                2 497           2 895           1 178           538             231
contig %A                                                 34.93           34.97           35.02           34.97           34.95
contig %C                                                 15.01           15.02           14.98           15.01           15.04
contig %G                                                 15.01           15.02           14.98           15.02           15.03
contig %T                                                 34.97           34.98           35.01           34.99           34.96
contig %N                                                 0.00            0.00            0.00            0.00            0.02
contig N nt                                               1 098           11 385          5 288           5 652           76 822

Table 3: Contigs metrics

Metric                                              Allpaths                Minia
number of reads                                     325 190 012             325 190 012
number of mapped reads                              281 829 650 (86.67%)    285 253 033 (87.72%)
number of aligned reads >1 times                    12 683 526 (7.80%)      48 795 812 (15.01%)
number of properly paired reads                     270 271 882 (83.11%)    259 464 196 (79.79%)
number of reads with itself and mate mapped         275 699 376 (84.78%)    278 271 490 (85.57%)
singletons                                          6 130 274 (1.89%)       6 981 543 (2.15%)
with mate mapped to a different chr                 4 394 368 (1.35%)       6 784 560 (2.09%)
number of improperly paired reads on the same chr   1 033 126 (0.32%)       12 022 734 (3.70%)

Table 4: PE mapping statistics

Metric                                              Allpaths                Minia
number of reads                                     163 566 910             163 566 910
number of mapped reads                              104 622 370 (63.96%)    105 492 323 (64.49%)
number of aligned reads >1 times                    2 314 520 (1.42%)       2 732 018 (1.67%)
number of properly paired reads                     77 187 462 (47.19%)     74 737 904 (45.69%)
number of reads with itself and mate mapped         93 248 938 (57.00%)     93 324 438 (57.06%)
singletons                                          11 373 432 (6.95%)      12 167 885 (7.44%)
with mate mapped to a different chr                 10 095 410 (6.17%)      11 345 488 (6.94%)
number of improperly paired reads on the same chr   5 966 066 (3.64%)       7 241 046 (4.43%)

Table 5: MPA mapping statistics

4 Remapping Reads

4.1 Protocol

The corrected reads were remapped on the genome using bowtie2 [9]. We used the option -X 500 for the paired-end mapping; for the mate-pair reads we used the option --rf and set -X to 8000 and 5000 for the MPA and MPB libraries, respectively. Mapping statistics were computed with the flagstat command of samtools [10]. For the coverage analysis, we used the genomecov tool of bedtools [14] and R scripts.

4.2 Mapping statistics

An overview of tables 4, 5 and 6 shows that both assemblies give very similar results, but AllPaths-LG shows better properly-paired mapping statistics for each library. It also appears that, for the PE library (table 4), about 3 times more reads have multiple putative locations in the Minia assembly, although only slightly more reads are mapped.
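The mapping protocol of section 4.1 could be scripted along the following lines. This is only a sketch: the index name, file names and function names are hypothetical, the -X values are those stated in the protocol, and the commands are built as argument lists without being executed.

```python
# Maximum fragment length (-X) per library, as stated in section 4.1.
# The mate-pair libraries additionally use the --rf orientation.
LIBRARIES = {
    "PE": {"max_insert": 500, "mate_pair": False},
    "MPA": {"max_insert": 8000, "mate_pair": True},
    "MPB": {"max_insert": 5000, "mate_pair": True},
}


def bowtie2_command(index, reads_1, reads_2, out_sam, library):
    """Build the bowtie2 argument list for one library (nothing is executed)."""
    settings = LIBRARIES[library]
    command = [
        "bowtie2",
        "-x", index,          # basename of the bowtie2 index of the assembly
        "-1", reads_1,        # first mate file
        "-2", reads_2,        # second mate file
        "-X", str(settings["max_insert"]),
        "-S", out_sam,        # SAM output file
    ]
    if settings["mate_pair"]:
        command.append("--rf")  # mates point outward in the mate-pair libraries
    return command


def flagstat_command(alignment_path):
    """Build the samtools flagstat argument list for one alignment file."""
    return ["samtools", "flagstat", alignment_path]
```

The returned lists can be passed to subprocess.run once the index and the read files exist.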

Metric                                              Allpaths                Minia
number of reads                                     144 357 844             144 357 844
number of mapped reads                              108 597 964 (75.23%)    109 405 790 (75.79%)
number of aligned reads >1 times                    3 630 280 (2.51%)       3 334 252 (2.31%)
number of properly paired reads                     91 978 062 (63.72%)     89 007 978 (61.66%)
number of reads with itself and mate mapped         103 764 046 (71.88%)    103 702 250 (71.84%)
singletons                                          4 833 918 (3.35%)       5 703 540 (3.95%)
with mate mapped to a different chr                 9 877 238 (6.84%)       10 902 830 (7.55%)
number of improperly paired reads on the same chr   1 908 746 (1.32%)       3 791 442 (2.63%)

Table 6: MPB mapping statistics

Min.    1st Qu.   Median    Mean      3rd Qu.   Max.
2.25    107.90    125.50    285.70    220.20    59110.00

Table 7: Allpaths depth coverage by scaffolds

4.3 Analysis of the coverage

4.3.1 Allpaths

From the read mapping statistics we expected a depth of coverage of about 140x, but the median per-scaffold coverage given by genomecov is slightly lower (cf. table 7). This might be explained by contamination from a bacterial population or by recent transposable element regions in the nuclear genome. We therefore plotted the GC percent against the coverage of each scaffold (see figure 1) and observed a distinct cloud (in the orange circle) with a coverage higher than 1000x. We suspected these scaffolds to be bacterial contamination and discarded them from further analyses. This set includes 222 scaffolds covering 537 269 bp. Among these 222 scaffolds, 173 return at least one hit when compared to the GenBank Bacterial division, 171 of them matching Buchnera aphidicola (see table 8).

4.3.2 Minia

As before, we obtained a median coverage lower than expected, as described in table 9. In figure 2 we can also point out a distinct cloud (in orange) with a coverage higher than 1000x, and we likewise discarded these scaffolds from the set used for further analyses. This set includes 35 scaffolds covering 889 264 bp. Of these 35 scaffolds, 23 return at least one hit when compared to the GenBank Bacterial division, 22 of them matching Buchnera aphidicola (see table 10).
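The coverage filter described in section 4.3 can be sketched in Python as follows. The function names and the input dictionaries are hypothetical; the sketch assumes that a mean depth per scaffold has already been derived from the genomecov output, and it applies the 1000x threshold used above.

```python
def gc_percent(sequence):
    """GC percentage of a sequence, ignoring N and other non-ACGT characters."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def flag_high_coverage(scaffolds, mean_depth, threshold=1000.0):
    """Return {name: (depth, GC percent, length)} for scaffolds above the threshold.

    scaffolds:  dict mapping scaffold name to its sequence
    mean_depth: dict mapping scaffold name to its mean depth of coverage
    """
    flagged = {}
    for name, sequence in scaffolds.items():
        depth = mean_depth.get(name, 0.0)
        if depth > threshold:
            flagged[name] = (depth, gc_percent(sequence), len(sequence))
    return flagged
```

The flagged scaffolds are the candidates to compare against a bacterial database before discarding them.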

Figure 1: Plot of the coverage x GC percent of the AllPaths-LG scaffolds; the orange circle highlights the scaffolds with high coverage

Bacterial hit                                                                                           # scaffolds
Buchnera aphidicola str. Ak (Acyrthosiphon kondoi)                                                      112
Buchnera aphidicola str. JF98 (Acyrthosiphon pisum)                                                     24
Buchnera aphidicola str. Ua (Uroleucon ambrosiae)                                                       11
Buchnera aphidicola str. JF99 (Acyrthosiphon pisum)                                                     7
Buchnera aphidicola str. TLW03 (Acyrthosiphon pisum)                                                    4
Buchnera aphidicola str. 5A (Acyrthosiphon pisum)                                                       3
Buchnera aphidicola str. Sg (Schizaphis graminum)                                                       2
Buchnera aphidicola str. LL01 (Acyrthosiphon pisum)                                                     2
Buchnera aphidicola str. APS (Acyrthosiphon pisum)                                                      1
Buchnera aphidicola (Myzus persicae) hupa-rpoc intergenic spacer                                        1
Buchnera aphidicola DNA polymerase III beta subunit (dnan) gene                                         1
Buchnera aphidicola (Diuraphis noxia) plasmid pleu-dn(usa8)                                             1
Buchnera aphidicola (Acyrthosiphon kondoi) 1-deoxy-D-xylulose 5-phosphate reductoisomerase (dxr) gene   1
Bacterium IS422 gene for 16S rRNA                                                                       1
Bacillus subtilis BSn5                                                                                  1
Azotobacter vinelandii isolate DNA101014 18S ribosomal RNA gene                                         1
No hit                                                                                                  49

Table 8: Hits of the highly covered AllPaths scaffolds against the GenBank Bacterial division

Min.     1st Qu.   Median    Mean       3rd Qu.    Max.
1.265    36.680    80.910    112.100    117.400    31550.000

Table 9: Minia depth coverage by scaffolds

Figure 2: Plot of the coverage x GC percent of the Minia scaffolds longer than 1000 bp; the orange circle highlights the scaffolds with high coverage

Bacterial hit                                                                                                      # scaffolds
Buchnera aphidicola str. Ak (Acyrthosiphon kondoi)                                                                 16
Buchnera aphidicola str. Ua (Uroleucon ambrosiae)                                                                  2
Buchnera aphidicola str. JF98 (Acyrthosiphon pisum)                                                                2
Buchnera aphidicola str. JF99 (Acyrthosiphon pisum)                                                                1
Buchnera aphidicola anthranilate synthase component I (trpe) and anthranilate synthase component II (trpg) genes   1
Bacillus subtilis BSn5                                                                                             1
No hit                                                                                                             12

Table 10: Hits of the highly covered Minia scaffolds against the GenBank Bacterial division

evalue    no hit    unique hit    multiple hits    Median    Mean      Max.
1e-50     2122      1039          208              0.0000    0.4788    12.0000
1e-40     1782      1269          318              0.0000    0.6575    15.0000
1e-30     1407      1487          475              1.000     0.927     17.000
1e-20     937       1757          675              1.000     1.369     51.000

Table 11: Blast hits at different e-value thresholds when comparing the 3369 proteins of the BUSCO Drosophila melanogaster set to the Minia assembly

evalue    no hit    unique hit    multiple hits    Median    Mean      Max.
1e-50     2120      1041          208              0.0000    0.4829    16.0000
1e-40     1780      1275          314              0.0000    0.6575    18.0000
1e-30     1398      1505          466              1.000     0.927     18.000
1e-20     929       1767          673              1.000     1.375     49.000

Table 12: Blast hits at different e-value thresholds when comparing the 3369 proteins of the BUSCO Drosophila melanogaster set to the Allpaths assembly

5 Mapping BUSCO proteins

5.1 Protocol

We extracted the 3 369 Drosophila melanogaster proteins from the OrthoDB BUSCO Arthropoda set (ftp://cegg.unige.ch/orthodb6/busco/). First, we aligned this protein set on the genome with blast and retrieved the hits with an e-value lower than 1e-20.

5.2 Blast hits

We filtered the blast results at different e-value thresholds and recorded, at each threshold, the number of missing proteins and of proteins with unique or multiple matches. The results are reported in tables 11 and 12.

5.3 Completion of the proteins

For each of the 2440 and 2432 proteins having a hit with the Allpaths and Minia assemblies, respectively, we aligned the BUSCO protein to the corresponding scaffold using GeneWise [1]. To analyze the completion of the peptides predicted in the Myzus genome relative to the BUSCO set, we simply compared the sizes of the proteins and plotted the empirical cumulative distribution function (ecdf), as presented in figures 3 and 4. The plots are very similar. For both assemblies, 50% of the proteins are at least 78% complete (red lines), and 30% of the proteins are at least 90% complete (orange lines).
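The completion measure of section 5.3 can be sketched as follows, assuming that completion is the ratio between the length of the predicted peptide and the length of the reference protein. The function names and the input dictionaries are hypothetical, and the second function reports the share of proteins at or above a completion level, i.e. the complement of the ecdf.

```python
def completion_ratios(predicted_lengths, reference_lengths):
    """Ratio predicted length / reference length for every protein with a prediction.

    predicted_lengths: dict mapping protein id to the predicted peptide length
    reference_lengths: dict mapping protein id to the reference protein length
    """
    ratios = []
    for protein_id, reference_length in reference_lengths.items():
        predicted = predicted_lengths.get(protein_id)
        if predicted is None or reference_length <= 0:
            continue  # protein without prediction or without usable reference
        ratios.append(predicted / reference_length)
    return ratios


def fraction_at_least(ratios, level):
    """Share of proteins whose completion ratio is at least `level` (0 to 1)."""
    if not ratios:
        return 0.0
    return sum(1 for ratio in ratios if ratio >= level) / len(ratios)
```

Read this way, the statement "50% of the proteins are at least 78% complete" corresponds to fraction_at_least(ratios, 0.78) being close to 0.5.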

Figure 3: Cumulative distribution of the BUSCO Drosophila melanogaster set completion on the Allpaths-LG genome

5.4 Conclusion

Interestingly, only about two thirds of the BUSCO proteins were successfully anchored on the Myzus persicae genome. Neither assembly gives better results than the other.

6 Mapping ACYPI proteins

6.1 Protocol

We used exactly the same protocol with the 36 275 pea aphid proteins from AphidBase.

6.2 Blast hits

We filtered the blast results at different e-value thresholds and recorded, at each threshold, the number of missing proteins and of proteins with unique or multiple matches. The results are reported in tables 13 and 14.
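The counting behind the "no hit", "unique hit" and "multiple hits" columns can be sketched as follows. The sketch assumes that the blast results are available in the standard 12-column tabular format (query identifier in column 1, e-value in column 11) and that every reported alignment line counts as one hit; neither point is stated in the protocol above, and the function name is hypothetical.

```python
from collections import Counter


def classify_hits(blast_lines, query_ids, max_evalue):
    """Count queries with no, one or several hits below an e-value threshold."""
    hits = Counter()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        # Assumed tabular layout: query id in column 1, e-value in column 11.
        if float(fields[10]) < max_evalue:
            hits[fields[0]] += 1
    summary = {"no hit": 0, "unique hit": 0, "multiple hits": 0}
    for query in query_ids:
        count = hits[query]
        if count == 0:
            summary["no hit"] += 1
        elif count == 1:
            summary["unique hit"] += 1
        else:
            summary["multiple hits"] += 1
    return summary
```

Calling the function once per threshold (1e-50, 1e-40, 1e-30, 1e-20) on the same result file gives one table row per call.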

Figure 4: Cumulative distribution of the BUSCO Drosophila melanogaster set completion on the Minia genome

evalue    no hit    unique hit    multiple hits    Median    Mean     3rd qu.    Max.
1e-50     16 008    9 457         10 810           1.00      8.78     2.00       324.00
1e-40     13 336    10 134        12 805           1.00      12.57    4.00       374.00
1e-30     10 440    10 515        15 320           1.00      18.48    8.00       452.00
1e-20     7 621     10 446        18 208           2.00      28.14    17.00      500.00

Table 13: Blast hits at different e-value thresholds when comparing the 36 275 pea aphid proteins to the Allpaths-LG assembly

evalue    no hit    unique hit    multiple hits    Median    Mean     3rd qu.    Max.
1e-50     16 074    9 526         10 675           1.00      7.813    2.000      259.000
1e-40     13 412    10 120        12 743           1.00      11.23    4.00       300.00
1e-30     10 489    10 490        15 296           1.00      16.58    8.00       401.00
1e-20     7 742     10 300        18 233           2.00      25.43    17.00      500.00

Table 14: Blast hits at different e-value thresholds when comparing the 36 275 pea aphid proteins to the Minia assembly

Figure 5: Cumulative distribution of the pea aphid protein set completion on the Allpaths-LG genome

6.3 Completion of the proteins

For each of the 28 654 and 28 533 proteins having a hit with the Allpaths and Minia assemblies, respectively, we aligned the pea aphid protein to the corresponding scaffold using GeneWise [1]. As before, we plotted the cumulative distribution of the protein size ratio in figures 5 and 6. The plots are very similar. For both assemblies, 50% of the proteins are at least 96.5% complete (red lines), and 64% of the proteins are at least 90% complete (orange lines).

6.4 Conclusion

For both genomes, many pea aphid proteins remained unmapped on the Myzus genome, especially at stringent e-value thresholds. Many proteins were also not uniquely mapped. For most of the proteins, anchoring the best hit gives a close-to-complete prediction.

Figure 6: Cumulative distribution of the pea aphid set completion on the Minia genome

Min.    1st Qu.   Median    Mean    3rd Qu.   Max.    Sum
1000    1184      1486      1742    1998      5301    292 637

Table 15: Statistics on the Allpaths-LG scaffolds including Buchnera genome parts

7 Buchnera genome

7.1 Analysis of the Buchnera sequences in the genomes

We compared the genome sequences to the Buchnera aphidicola str. LSR1 (Acyrthosiphon pisum) whole genome shotgun sequence (accession number: NZ_ACFK01000001) using blastn (e-value threshold: 1e-20). In the Allpaths-LG genome, we found 168 scaffolds that include a part of the Buchnera genome. All of them are very short (see table 15), and their total length is lower than the expected Buchnera genome size (642 011 bp for the pea aphid LSR1 strain). In contrast, the 34 Minia scaffolds that include a part of the Buchnera genome have a much wider size distribution (see table 16). Surprisingly, some scaffolds that include Buchnera sequence are larger than the whole Buchnera genome (see table 16).

Min.    1st Qu.   Median    Mean     3rd Qu.   Max.       Sum
258     1705      6286      63940    15910     1563000    2 173 835

Table 16: Statistics on the Minia scaffolds including Buchnera genome parts

Figure 7: Dot plot of the Buchnera scaffolds 1 aligned to the Buchnera pea aphid strain LSR1

7.2 New Buchnera genome assembly

As a result, none of the assemblies was able to reconstruct the complete Buchnera genome. The reason is that this genome has a very high coverage compared to the nuclear genome: because there are many reads, there are also many errors, which create very complex regions in the de Bruijn graph and prevent the assemblers from completing their traversal. We therefore carried out a specific assembly of the Buchnera genome using Minia, increasing the min_abundance to 20 and the k-mer size to 91 in order to remove a large part of the errors. This stringency is too high to build the nuclear genome, because its coverage is too low, but it is efficient for assembling the Buchnera genome. Indeed, using blast we retrieved the scaffolds similar to the Buchnera genome and found only 2 scaffolds, of 378 928 bp and 264 603 bp respectively. These 2 scaffolds correspond to 2 distinct parts of the Buchnera genome and cover them almost completely (see figures 7 and 8).

Figure 8: Dot plot of the Buchnera scaffolds 1 aligned to the Buchnera pea aphid strain LSR1

8 Duplications

8.1 Protocol

We used the Nucmer algorithm from the MUMmer 3.22 package [8] to compare all the scaffolds against each other, in order to find similar regions which might be due to assembly artefacts. We considered regions as putatively artefactually duplicated regions (PADR) if they were longer than 1000 bp with a percentage of similarity higher than 90. In the Minia genome we found 4415 PADR, corresponding to 7 368 408 bp, and in the AllPaths-LG genome 4662 PADR, corresponding to 10 465 023 bp. We then removed the scaffolds covered at 70% or more by a PADR. As a result, we discarded 184 scaffolds from the AllPaths-LG assembly and 115 from the Minia assembly.

9 Gap Closing and final assembly

9.1 Protocol

In conclusion, we propose to use the AllPaths assembly without the scaffolds with a coverage higher than 1000x, without the scaffolds similar to the Buchnera aphidicola genome and without the scaffolds covered by putative duplicated regions. We performed the gap closing (i.e. filling the Ns between the contigs in scaffolds using the reads) with GapCloser v1.12 from SOAPdenovo2 [12], using the 3 libraries.

9.2 Results

As a result, we obtained the clone G006 assembly with the metrics summarized in table 17.

Metric                            G006 assembly
Number of scaffolds               4 022
Total size of scaffolds           347 304 760
Longest scaffold                  2 199 663
Shortest scaffold                 959
Number of scaffolds > 500 nt      4 022 (100.0%)
Number of scaffolds > 1K nt       4 018 (99.9%)
Number of scaffolds > 10K nt      1 844 (45.8%)
Number of scaffolds > 100K nt     788 (19.6%)
Number of scaffolds > 1M nt       38 (0.9%)
Mean scaffold size                86 351
Median scaffold size              7 170
N50 scaffold length               435 781
L50 scaffold count                224
scaffold %A                       34.82
scaffold %C                       14.94
scaffold %G                       14.93
scaffold %T                       34.78
scaffold %N                       0.53
scaffold N nt                     1 836 185

Table 17: Scaffolds metrics

10 Conclusion

We used various metrics to select the best assembly among the 5 assemblies computed from the Myzus G006 clone reads. The Abyss assemblies have poor general metrics, so we compared AllPaths-LG and Minia using further metrics. Although they gave very similar results, in particular for the protein mapping statistics, the AllPaths assembly appears to have a lower level of redundancy (reflected by the PE remapping statistics) but a larger amount of putative artefactually duplicated regions. Moreover, Minia produced some large nuclear scaffolds including parts of the Buchnera genome, while Allpaths-LG created separate scaffolds for the Buchnera genome. Neither assembly was able to assemble the Buchnera genome completely, so we performed a new specific assembly to achieve this goal. As a result, we retrieved almost the complete Buchnera sequence in 2 large scaffolds.

References

[1] E. Birney, M. Clamp, and R. Durbin. GeneWise and Genomewise. Genome Res., 14(5):988-995, May 2004.

[2] M. Boetzer, C. V. Henkel, H. J. Jansen, D. Butler, and W. Pirovano. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 27(4):578-579, Feb 2011.

[3] R. Chikhi and P. Medvedev. Informed and automated k-mer size selection for genome assembly. Bioinformatics, Jun 2013.

[4] R. Chikhi and D. Lavenier. Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph. In WABI 2011, Sarrebruck, Germany, August 2011. Springer.

[5] R. Chikhi and D. Naquin. Graph-based scaffolding for next-generation sequencing. In JOBIM, 2012.

[6] D. Earl, K. Bradnam, J. St John, A. Darling, D. Lin, J. Fass, H. O. Yu, V. Buffalo, D. R. Zerbino, M. Diekhans, N. Nguyen, P. N. Ariyaratne, W. K. Sung, Z. Ning, M. Haimel, J. T. Simpson, N. A. Fonseca, I. Birol, T. R. Docking, I. Y. Ho, D. S. Rokhsar, R. Chikhi, D. Lavenier, G. Chapuis, D. Naquin, N. Maillet, M. C. Schatz, D. R. Kelley, A. M. Phillippy, S. Koren, S. P. Yang, W. Wu, W. C. Chou, A. Srivastava, T. I. Shaw, J. G. Ruby, P. Skewes-Cox, M. Betegon, M. T. Dimon, V. Solovyev, I. Seledtsov, P. Kosarev, D. Vorobyev, R. Ramirez-Gonzalez, R. Leggett, D. MacLean, F. Xia, R. Luo, Z. Li, Y. Xie, B. Liu, S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, S. Yin, T. Sharpe, G. Hall, P. J. Kersey, R. Durbin, S. D. Jackman, J. A. Chapman, X. Huang, J. L. DeRisi, M. Caccamo, Y. Li, D. B. Jaffe, R. E. Green, D. Haussler, I. Korf, and B. Paten. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res., 21(12):2224-2241, Dec 2011.

[7] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biol., 11(11):R116, 2010.

[8] S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg. Versatile and open software for comparing large genomes. Genome Biol., 5(2):R12, 2004.

[9] B. Langmead and S. L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9(4):357-359, Apr 2012.

[10] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078-2079, Aug 2009.

[11] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, and J. Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20(2):265-272, Feb 2010.

[12] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, J. Tang, G. Wu, H. Zhang, Y. Shi, Y. Liu, C. Yu, B. Wang, Y. Lu, C. Han, D. W. Cheung, S. M. Yiu, S. Peng, Z. Xiaoqian, G. Liu, X. Liao, Y. Li, H. Yang, J. Wang, T. W. Lam, and J. Wang. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1(1):18, 2012.

[13] I. Maccallum, D. Przybylski, S. Gnerre, J. Burton, I. Shlyakhter, A. Gnirke, J. Malek, K. McKernan, S. Ranade, T. P. Shea, L. Williams, S. Young, C. Nusbaum, and D. B. Jaffe. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biol., 10(10):R103, 2009.

[14] A. R. Quinlan and I. M. Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841-842, Mar 2010.

[15] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. ABySS: a parallel assembler for short read sequence data. Genome Res., 19(6):1117-1123, Jun 2009.