Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis
Nicolas Sierro
Philip Morris International R&D, Philip Morris Products S.A., Neuchatel, Switzerland
Introduction
Nicotiana sylvestris and Nicotiana tomentosiformis are diploids (2n=24) originating from overlapping regions of South America.
They probably diverged early in the evolution of the genus Nicotiana, which split from Symonanthus around 15 Myr ago [1].
Their 1C genome size is estimated at 2.65 Gb [1], about 3 times the size of the tomato and potato genomes.
1. Renny-Byfield, S. et al. 2011. Next Generation Sequencing Reveals Genome Downsizing in Allotetraploid Nicotiana tabacum, Predominantly through the Elimination of Paternally Derived Repetitive DNAs. Mol. Biol. Evol. 28:2843.
Page: 2
Introduction
Modern descendants of the maternal and paternal donors that formed tobacco [1].
[Diagram: N. sylvestris ancestor (2n=24) and N. tomentosiformis ancestor (2n=24) leading to N. tabacum (2n=4x=48)]
The determination of their genomes and transcriptomes will contribute to the assembly and annotation of the tobacco genome and transcriptome.
1. Leitch, I.J. et al. 2008. The ups and downs of genome size evolution in polyploid species of Nicotiana (Solanaceae). Ann. Bot. 101(6):805-814.
Page: 3
Genome sequencing and assembly strategy
DNA isolation
 - Leaves
Library preparation
 - Paired ends
 - Mate pairs
Sequencing
 - Illumina 2x100 bp
Quality filtering and trimming
Contig creation
 - SOAPdenovo
Scaffolding
 - SOAPdenovo
Superscaffolding
 - Tobacco WGP physical map (related species)
Page: 4
Genome libraries: Nicotiana sylvestris
Total coverage of 94x
Library type   Read size   Insert size   Cleaned reads    Expected coverage
Paired ends    2x100       180 bp        1 249 808 412    47.5x
Paired ends    2x100       300 bp        1 057 102 557    38.6x
Paired ends    2x100       1 kb             42 216 128     1.6x
Mate pairs     2x100       3 kb             98 524 837     2.8x
Mate pairs     2x100       4 kb             63 727 279     1.8x
Mate pairs     2x100       4 kb             51 368 983     1.5x
Using the 31-mer (31-nucleotide) depth distribution, the genome size is estimated at 2.58 Gb.
Page: 5
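The slide does not say which tool or formula produced the genome size estimate. A common approach, sketched here purely as an assumption, divides the total number of counted 31-mers by the depth at the main peak of the depth distribution. The histogram values and the `min_depth` cutoff below are hypothetical and only illustrate the arithmetic.

```python
def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps a depth to the number of distinct k-mers seen at that depth.
    min_depth is a hypothetical cutoff that drops very low-depth k-mers,
    which are mostly sequencing errors.
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers left after applying min_depth")
    # Depth of the main peak: the depth with the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences, counting each distinct k-mer once per observation.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth


# Toy histogram (made-up numbers): an error peak at depth 1,
# a main peak at depth 10 and a small two-copy repeat peak at depth 20.
toy_histogram = {1: 5000, 9: 100, 10: 300, 11: 100, 20: 50}
print(estimate_genome_size(toy_histogram))
```

Dividing by the peak depth counts repeated k-mers in proportion to their copy number, which is why this kind of estimate can differ from the assembled length.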
Genome libraries: Nicotiana tomentosiformis
Total coverage of 146x
Library type   Read size   Insert size   Cleaned reads    Expected coverage
Paired ends    2x100       140 bp        1 730 522 445    65.7x
Paired ends    2x100       175 bp          823 913 833    31.0x
Paired ends    2x100       350 bp          804 501 117    30.2x
Paired ends    2x100       385 bp          462 732 217    17.6x
Paired ends    2x100       1 kb             34 860 106     1.3x
Mate pairs     2x100       3 kb              8 065 420     0.25x
Mate pairs     2x100       5 kb              7 750 383     0.25x
Using the 31-mer (31-nucleotide) depth distribution, the genome size is estimated at 2.14 Gb.
Page: 6
Genome assemblies
                       N. sylvestris          N. tomentosiformis
Sequences              253 984                159 649
Average length (bp)    8 748.83               10 576.84
Maximum length (bp)    698 072                789 565
N50 length (bp)        79 724                 82 598
Total length (bp)      2 222 062 302          1 688 581 715
Undefined bases        174 351 674 (7.8%)     45 955 292 (2.7%)
Genome coverage        82.9%                  71.6%
Using the S/T regions of the tobacco WGP physical map:
                       N. sylvestris          N. tomentosiformis
Superscaffolds         2 637                  1 989
Components             10 261                 7 463
N50 length (bp)        194 000                166 000
Page: 7
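For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows the usual definition: the length of the sequence at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly length. The toy lengths are made up; the percentage helper simply re-derives the undefined-base fractions from the figures on the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    return 0  # empty input


def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Toy example with made-up scaffold lengths.
print(n50([10, 8, 6, 4, 2]))

# Undefined-base fractions recomputed from the table values.
print(round(percent(174_351_674, 2_222_062_302), 1))  # N. sylvestris
print(round(percent(45_955_292, 1_688_581_715), 1))   # N. tomentosiformis
```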
Repeat content
Species-specific repeat library created using RepeatScout on sequences of at least 200 kb.
Repeat classification using BLAST against known repeat elements.
Repeat content estimation using RepeatMasker with the RepeatScout, TIGR Solanaceae and SOL eudicot repeat libraries.
Page: 8
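The first step above restricts the input to sequences of at least 200 kb. A minimal sketch of such a length filter is shown below, assuming the assembly is available as FASTA text; the function names are hypothetical and the slides do not describe how the selection was actually implemented.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_sequences(lines, min_length=200_000):
    """Keep only the sequences that are at least min_length bases long."""
    return [(name, seq) for name, seq in read_fasta(lines) if len(seq) >= min_length]


# Toy input with a small threshold so the example stays readable.
toy = [">scaffold1", "ACGT", "AC", ">scaffold2 short one", "AC"]
print(long_sequences(toy, min_length=5))
```

With a real file, the same function can be fed an open file handle, since it only needs an iterable of lines.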
Repeat contents
72-75% of the sequenced genome consists of repeats.
625 and 425 Mb of unmasked DNA for N. sylvestris and N. tomentosiformis.
Repeat element      N. sylvestris            N. tomentosiformis
LINE                5 828 979 (0.3%)         2 834 174 (0.2%)
SINE                4 040 138 (0.2%)         5 244 169 (0.3%)
LTR/Copia           203 592 581 (9%)         227 491 087 (13%)
LTR/Gypsy           463 070 166 (21%)        343 784 620 (20%)
LTR/Others          184 881 207 (8%)         90 166 206 (5%)
Transposons         33 621 895 (1.5%)        22 593 004 (1%)
Retrotransposons    230 653 066 (10%)        220 727 245 (13%)
Simple repeats      4 954 900 (0.2%)         4 809 855 (0.3%)
Low complexity      10 145 060 (0.5%)        9 723 109 (0.6%)
Others              293 036 384 (13%)        246 313 534 (15%)
Total               1 605 541 978 (72%)      1 266 206 541 (75%)
Page: 9
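The overall repeat percentages can be re-derived from the repeat totals above and the total assembly lengths given on the genome assemblies slide. The short sketch below does only that arithmetic; treating the difference between assembly length and masked bases as the unmasked DNA is an assumption about how the slide's figures were obtained.

```python
# (total assembly length in bp, total masked repeat bases in bp), taken from the slides.
ASSEMBLIES = {
    "N. sylvestris": (2_222_062_302, 1_605_541_978),
    "N. tomentosiformis": (1_688_581_715, 1_266_206_541),
}


def repeat_percent(total_bp, masked_bp):
    """Percentage of the assembly that is masked as repeats."""
    return 100.0 * masked_bp / total_bp


def unmasked_mb(total_bp, masked_bp):
    """Unmasked bases in megabases, assuming unmasked = total - masked."""
    return (total_bp - masked_bp) / 1e6


for species, (total_bp, masked_bp) in ASSEMBLIES.items():
    print(species, round(repeat_percent(total_bp, masked_bp)), round(unmasked_mb(total_bp, masked_bp)))
```

Under that assumption the unmasked amounts come out at about 617 and 422 Mb, in the same range as the 625 and 425 Mb quoted on the slide.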
Transcriptome sequencing and assembly strategy
RNA isolation
 - Leaves
 - Roots
 - Flowers
Library preparation
 - Paired ends
Sequencing
 - Illumina 2x100 bp
 - 3 biological replicates
Quality filtering and trimming
Read mapping
 - bowtie/tophat
Isoform prediction
 - cufflinks/cuffmerge
ORF finding
Annotation
 - BLAST
 - InterProScan (GO terms)
 - EFICAz (EC number)
Page: 10
Transcriptome assemblies
Nicotiana sylvestris
Tissue     Transcripts   Shortest   Longest   Median
Roots      46 313        72         20 215    1 364
Leaves     46 114        72         23 553    1 372
Flowers    53 247        63         24 850    1 327
Nicotiana tomentosiformis
Tissue     Transcripts   Shortest   Longest   Median
Roots      44 169        69         16 753    1 410
Leaves     43 743        89         19 133    1 415
Flowers    48 043        75         15 607    1 388
Page: 11
Mutual best BLAST hits against UniProt
Proteins predicted by the Trinity ORF finding program.
Minimum length of 100 amino acids.
Mutual BLAST against the UniProt plants collection.
Filter pairs by e-value of less than 1E-10 in either direction.
Select proteins with mutual best hits.
[Diagram: a predicted protein and a UniProt protein, each being the best BLAST hit of the other]
Page: 12
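The selection of mutual best hits can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the slides: hits are assumed to be available as (query, subject, e-value, bit score) tuples, the best hit is taken to be the one with the highest bit score, and the 1E-10 threshold is applied in both search directions. All identifiers in the toy data are hypothetical.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its best subject among hits passing the e-value filter.

    hits is an iterable of (query, subject, evalue, bitscore) tuples.
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:
            continue  # discard weak hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(forward, reverse, max_evalue=1e-10):
    """Return sorted (protein, reference) pairs that are each other's best hit."""
    fwd = best_hits(forward, max_evalue)  # predicted protein -> reference
    rev = best_hits(reverse, max_evalue)  # reference -> predicted protein
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Toy data: p1 and U1, and p2 and U2, are mutual best hits; p3 fails the e-value filter.
forward = [
    ("p1", "U1", 1e-50, 200.0),
    ("p1", "U2", 1e-20, 90.0),
    ("p2", "U2", 1e-30, 120.0),
    ("p3", "U3", 1e-5, 40.0),
]
reverse = [
    ("U1", "p1", 1e-48, 198.0),
    ("U2", "p1", 1e-20, 90.0),
    ("U2", "p2", 1e-30, 120.0),
    ("U3", "p3", 1e-5, 40.0),
]
print(mutual_best_hits(forward, reverse))
```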
Mutual best BLAST hits against UniProt
[Figure: panels for N. sylvestris and N. tomentosiformis; axes: coverage of reference, coverage of query]
82% of the transcripts have homologous UniProt sequences, but some of them cover the reference sequence only partially.
Page: 13
GO term enrichment
GO term enrichment for each species against the pooled set of GO terms using GOStats.
Only small and not highly significant changes in gene composition.
 - N. sylvestris: defense response function
 - N. tomentosiformis: core metabolic functions, protein phosphorylation
Phenotypic differences are more likely to be regulatory than due to loss or gain of genes.
Page: 14
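GOStats is an R/Bioconductor package, and its exact settings are not given on the slide. As a language-neutral illustration of the kind of test behind a GO term enrichment, the sketch below computes a plain one-sided hypergeometric p-value; it is a swapped-in simplification, not a reimplementation of GOStats, and the counts in the example are made up.

```python
from math import comb


def enrichment_p_value(hits_in_sample, sample_size, hits_in_pool, pool_size):
    """One-sided hypergeometric p-value P(X >= hits_in_sample).

    pool_size      : genes in the pooled set (both species)
    hits_in_pool   : genes in the pool annotated with the GO term
    sample_size    : genes from the species being tested
    hits_in_sample : genes from that species annotated with the GO term
    """
    upper = min(sample_size, hits_in_pool)
    favourable = sum(
        comb(hits_in_pool, i) * comb(pool_size - hits_in_pool, sample_size - i)
        for i in range(hits_in_sample, upper + 1)
    )
    return favourable / comb(pool_size, sample_size)


# Toy numbers: 10 genes in the pool, 4 carry the term, 3 genes sampled, 2 carry the term.
print(enrichment_p_value(2, 3, 4, 10))
```

In practice many GO terms are tested at once, so the resulting p-values would also need a multiple-testing correction before being interpreted.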
Transcriptome overlap
OrthoMCL was used to define clusters of orthologous and paralogous genes between species:
 - N. sylvestris
 - N. tomentosiformis
 - Tomato
 - Arabidopsis
And between the root, leaf and flower transcriptomes of N. sylvestris and N. tomentosiformis.
Page: 15
Transcriptome overlap between species
~7 000 clusters are shared between all species.
~3 600 clusters are specific to Nicotiana.
~2 800 clusters are specific to Solanaceae.
Page: 16
Transcriptome overlap in N. sylvestris
~15 000 clusters are shared.
~3 500 clusters are specific to flower.
~2 000 clusters are specific to root.
~1 800 clusters are specific to leaf.
Page: 17
Transcriptome overlap in N. tomentosiformis
~14 500 clusters are shared.
~3 400 clusters are specific to flower.
~2 600 clusters are specific to root.
~1 900 clusters are specific to leaf.
Page: 18
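Counts such as "shared" and "specific to flower" can be obtained by recording, for each cluster, the set of tissues (or species) that contribute at least one member, and then tallying identical sets. The sketch below assumes the cluster memberships have already been parsed into a dictionary; the cluster identifiers and memberships are hypothetical.

```python
from collections import Counter


def overlap_counts(clusters):
    """Count clusters per combination of contributing groups.

    clusters maps a cluster id to the set of groups (tissues or species)
    that have at least one gene in that cluster.
    """
    return Counter(frozenset(groups) for groups in clusters.values())


# Toy memberships for four made-up clusters.
clusters = {
    "c1": {"root", "leaf", "flower"},
    "c2": {"flower"},
    "c3": {"flower"},
    "c4": {"root", "leaf"},
}
counts = overlap_counts(clusters)
print(counts[frozenset({"root", "leaf", "flower"})])  # shared by all tissues
print(counts[frozenset({"flower"})])                  # specific to flower
```

Each key of the counter corresponds to one region of the Venn-style overlap, so the same function works for the four-species comparison as well.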
Conclusions
Nicotiana sylvestris and Nicotiana tomentosiformis have been sequenced at a coverage of about 100x and 150x respectively.
83% of the N. sylvestris genome is covered.
72% of the N. tomentosiformis genome is covered.
The tobacco WGP physical map can be used to superscaffold the assemblies.
Between 45 000 and 50 000 transcripts are identified by mapping of RNA-seq data.
More than 80% have homologs in UniProt.
Page: 19
Conclusions
About 15 000 clusters of orthologous and paralogous genes are shared between root, leaf and flower.
About 3 500 clusters are specific to flowers.
About 7 000 clusters of orthologous and paralogous genes are shared between N. sylvestris, N. tomentosiformis, tomato and Arabidopsis.
About 3 600 clusters of orthologous and paralogous genes are specific to Nicotiana species.
The obtained genomes and transcriptomes will contribute to the assembly and annotation of the tobacco genome.
Page: 20
Acknowledgments
James Battey
Sonia Ouadi
Lucien Bovet
Simon Goepfert
Nicolas Bakaher
Manuel C. Peitsch
Nikolai V. Ivanov
Page: 21