Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro


Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro Philip Morris International R&D, Philip Morris Products S.A., Neuchatel, Switzerland

Introduction
- Nicotiana sylvestris and Nicotiana tomentosiformis are diploids (2n=24) originating from overlapping regions of South America.
- They probably diverged early in the evolution of the genus Nicotiana, which split from Symonanthus around 15 Myr ago [1].
- Their 1C genome size is estimated at 2.65 Gb [1], about 3 times the size of the tomato and potato genomes.

1. Renny-Byfield, S. et al. 2011. Next Generation Sequencing Reveals Genome Downsizing in Allotetraploid Nicotiana tabacum, Predominantly through the Elimination of Paternally Derived Repetitive DNAs. Mol. Biol. Evol. 28:2843.

Introduction
- N. sylvestris and N. tomentosiformis are the modern descendants of the maternal and paternal donors that formed tobacco [1].
- Diagram: N. sylvestris ancestor (2n=24) and N. tomentosiformis ancestor (2n=24) gave rise to N. tabacum (2n=4x=48).
- The determination of their genome and transcriptome will contribute to the assembly and annotation of the tobacco genome and transcriptome.

1. Leitch, I.J. et al. 2008. The ups and downs of genome size evolution in polyploid species of Nicotiana (Solanaceae). Ann. Bot. 101(6):805-14.

Genome sequencing and assembly strategy
- DNA isolation: leaves
- Library preparation: paired ends, mate pairs
- Sequencing: Illumina 2x100 bp
- Quality filtering and trimming
- Contig creation: SOAPdenovo
- Scaffolding: SOAPdenovo
- Superscaffolding: related species, tobacco WGP physical map

Genome libraries: Nicotiana sylvestris
Total coverage of 94x

Library type   Read size (bp)   Insert size   Cleaned reads   Expected coverage
Paired ends    2x100            180 bp        1 249 808 412   47.5x
Paired ends    2x100            300 bp        1 057 102 557   38.6x
Paired ends    2x100            1 kb             42 216 128    1.6x
Mate pairs     2x100            3 kb             98 524 837    2.8x
Mate pairs     2x100            4 kb             63 727 279    1.8x
Mate pairs     2x100            4 kb             51 368 983    1.5x

Using the 31-nucleotide depth distribution, the genome size is estimated at 2.58 Gb.
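The coverage figures in the table above can be related by simple arithmetic. The following is a minimal Python sketch, not the authors' calculation: it sums the per-library values from the slide and, as an assumption, computes a naive coverage for the 180 bp library using untrimmed 100 bp reads and the 2.65 Gb genome size quoted in the introduction. Because the reads were quality-trimmed, the naive figure is only expected to approximate the value on the slide.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: float) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Per-library expected coverage values copied from the N. sylvestris table.
sylvestris_coverage = [47.5, 38.6, 1.6, 2.8, 1.8, 1.5]
total = sum(sylvestris_coverage)

# Assumption: every cleaned read still has its full 100 bp,
# and the genome size is the 2.65 Gb 1C estimate.
naive = expected_coverage(1_249_808_412, 100, 2.65e9)

print(f"summed coverage: {total:.1f}x")
print(f"naive coverage of the 180 bp library: {naive:.1f}x")
```

The summed value agrees with the stated total of about 94x; the naive per-library value differs slightly from the slide, which is consistent with trimmed reads being shorter than 100 bp.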

Genome libraries: Nicotiana tomentosiformis
Total coverage of 146x

Library type   Read size (bp)   Insert size   Cleaned reads   Expected coverage
Paired ends    2x100            140 bp        1 730 522 445   65.7x
Paired ends    2x100            175 bp          823 913 833   31.0x
Paired ends    2x100            350 bp          804 501 117   30.2x
Paired ends    2x100            385 bp          462 732 217   17.6x
Paired ends    2x100            1 kb             34 860 106    1.3x
Mate pairs     2x100            3 kb              8 065 420    0.25x
Mate pairs     2x100            5 kb              7 750 383    0.25x

Using the 31-nucleotide depth distribution, the genome size is estimated at 2.14 Gb.
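The slides do not say how the 31-nucleotide depth distribution was turned into a genome size. A common approach, shown here as a hedged Python sketch, divides the total number of k-mer occurrences by the depth at the main peak of the histogram. The function name, the `min_depth` cutoff for discarding likely error k-mers, and the toy histogram are all assumptions made for illustration.

```python
def estimate_genome_size(histogram: dict, min_depth: int = 3) -> float:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers
    observed at that depth. Depths below min_depth are treated as
    sequencing errors and ignored (an assumption of this sketch).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    # Depth at which the most distinct k-mers occur (main peak).
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences among the retained depths.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram, not data from the slides.
toy = {1: 5000, 2: 800, 18: 50, 19: 120, 20: 200, 21: 120, 22: 50, 40: 10}
print(estimate_genome_size(toy))
```

In a real histogram the repeat-rich fraction appears as a long tail at high depth, which is why the numerator counts occurrences rather than distinct k-mers.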

Genome assemblies

                       N. sylvestris         N. tomentosiformis
Sequences              253 984               159 649
Average length (bp)    8 748.83              10 576.84
Maximum length (bp)    698 072               789 565
N50 length (bp)        79 724                82 598
Total length (bp)      2 222 062 302         1 688 581 715
Undefined bases        174 351 674 (7.8%)    45 955 292 (2.7%)
Genome coverage        82.9%                 71.6%

Using the S/T regions of the tobacco WGP physical map:

                       N. sylvestris         N. tomentosiformis
Superscaffolds         2 637                 1 989
Components             10 261                7 463
N50 length (bp)        194 000               166 000
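For readers unfamiliar with the N50 statistic reported above, here is a minimal Python sketch of the usual definition: the length of the sequence at which the cumulative length, taken from longest to shortest, first reaches half of the assembly. The function names and the toy lengths are illustrative; the last two lines only re-derive the average lengths from the totals and sequence counts given on the slide.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def mean_length(total_length, n_sequences):
    """Average sequence length of an assembly."""
    return total_length / n_sequences


# Toy scaffold lengths, not data from the slides.
print(n50([100, 80, 60, 40, 20]))

# Average lengths recomputed from the slide's totals and counts.
print(round(mean_length(2_222_062_302, 253_984), 2))
print(round(mean_length(1_688_581_715, 159_649), 2))
```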

Repeat content
- Species-specific repeat libraries were created using RepeatScout on sequences of at least 200 kb.
- Repeats were classified using BLAST against known repeat elements.
- Repeat content was estimated using RepeatMasker with the RepeatScout, TIGR Solanaceae and SOL eudicot repeat libraries.

Repeat contents
- 72-75% of the sequenced genome consists of repeats.
- 625 Mb and 425 Mb of DNA remain unmasked for N. sylvestris and N. tomentosiformis, respectively.

Repeat element      N. sylvestris (bp)        N. tomentosiformis (bp)
LINE                5 828 979 (0.3%)          2 834 174 (0.2%)
SINE                4 040 138 (0.2%)          5 244 169 (0.3%)
LTR/Copia           203 592 581 (9%)          227 491 087 (13%)
LTR/Gypsy           463 070 166 (21%)         343 784 620 (20%)
LTR/Others          184 881 207 (8%)          90 166 206 (5%)
Transposons         33 621 895 (1.5%)         22 593 004 (1%)
Retrotransposons    230 653 066 (10%)         220 727 245 (13%)
Simple repeats      4 954 900 (0.2%)          4 809 855 (0.3%)
Low complexity      10 145 060 (0.5%)         9 723 109 (0.6%)
Others              293 036 384 (13%)         246 313 534 (15%)
Total               1 605 541 978 (72%)       1 266 206 541 (75%)

Transcriptome sequencing and assembly strategy
- RNA isolation: leaves, roots, flowers
- Library preparation: paired ends
- Sequencing: Illumina 2x100 bp, 3 biological replicates
- Quality filtering and trimming
- Read mapping: Bowtie/TopHat
- Isoform prediction: Cufflinks/Cuffmerge
- ORF finding
- Annotation: BLAST, InterProScan (GO terms), EFICAz (EC number)

Transcriptome assemblies

Nicotiana sylvestris
Tissue     Transcripts   Shortest   Longest   Median
Roots      46 313        72         20 215    1 364
Leaves     46 114        72         23 553    1 372
Flowers    53 247        63         24 850    1 327

Nicotiana tomentosiformis
Tissue     Transcripts   Shortest   Longest   Median
Roots      44 169        69         16 753    1 410
Leaves     43 743        89         19 133    1 415
Flowers    48 043        75         15 607    1 388

Mutual best BLAST hits against UniProt
- Proteins predicted by the Trinity ORF finding program
- Minimum length of 100 amino acids
- Mutual BLAST against the UniProt plants collection
- Filter pairs by e-value of less than 1E-10 in either direction
- Select proteins with mutual best hits

[Diagram: the best BLAST hit of a predicted protein is a UniProt protein whose own best BLAST hit is the same predicted protein.]
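The mutual best hit selection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: hits are given as `(query, subject, evalue, bitscore)` tuples, the best hit is chosen by bit score, and the 1E-10 e-value cutoff is applied to both search directions (one reading of "in either direction"). All identifiers in the example data are hypothetical.

```python
def best_hits(hits, max_evalue=1e-10):
    """Return {query: best subject} keeping only hits below max_evalue.

    hits: iterable of (query, subject, evalue, bitscore) tuples.
    The best hit per query is the one with the highest bit score.
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(forward, reverse, max_evalue=1e-10):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward, max_evalue)
    rev = best_hits(reverse, max_evalue)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hits: predicted proteins vs. UniProt, and the reverse search.
forward = [
    ("nsyl_1", "P1", 1e-50, 200.0),
    ("nsyl_1", "P2", 1e-30, 150.0),
    ("nsyl_2", "P2", 1e-40, 180.0),
    ("nsyl_3", "P3", 1e-5, 40.0),   # fails the e-value filter
]
reverse = [
    ("P1", "nsyl_1", 1e-48, 195.0),
    ("P2", "nsyl_1", 1e-28, 140.0),
    ("P3", "nsyl_3", 1e-6, 42.0),   # fails the e-value filter
]
pairs = mutual_best_hits(forward, reverse)
print(pairs)
```

Only `nsyl_1` and `P1` select each other, so `nsyl_2` is dropped even though it has a strong one-directional hit.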

Mutual best BLAST hits against UniProt
[Plots for N. sylvestris and N. tomentosiformis: coverage of reference versus coverage of query]
- 82% of the transcripts have homologous UniProt sequences, but some of them cover the reference sequence only partially.

GO term enrichment
- GO term enrichment was computed for each species against the pooled set of GO terms using GOstats.
- Only small and not highly significant changes in gene composition were found.
  - N. sylvestris: defense response function
  - N. tomentosiformis: core metabolic functions, protein phosphorylation
- The phenotypic difference is more likely to be regulatory than due to loss or gain of genes.
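GOstats is an R/Bioconductor package whose enrichment test is based on the hypergeometric distribution. The slides give no parameters, so the following Python sketch only illustrates the underlying one-sided test for a single GO term; the function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(pop_size, pop_hits, sample_size, sample_hits):
    """One-sided hypergeometric p-value P(X >= sample_hits).

    pop_size:    genes in the pooled set
    pop_hits:    pooled genes annotated with the GO term
    sample_size: genes in the species of interest
    sample_hits: genes of that species annotated with the GO term
    """
    upper = min(pop_hits, sample_size)
    numerator = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, sample_size - k)
        for k in range(sample_hits, upper + 1)
    )
    return numerator / comb(pop_size, sample_size)


# Hypothetical counts: 20 pooled genes, 5 carry the term,
# 5 genes sampled, 3 of which carry the term.
p = hypergeom_enrichment_p(20, 5, 5, 3)
print(f"{p:.4f}")
```

In practice one p-value is computed per GO term, so a multiple-testing correction would normally follow.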

Transcriptome overlap
- OrthoMCL was used to define clusters of orthologous and paralogous genes between species:
  - N. sylvestris
  - N. tomentosiformis
  - Tomato
  - Arabidopsis
- It was also applied between the root, leaf and flower transcriptomes of N. sylvestris and N. tomentosiformis.
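Once clusters are defined, counting shared and specific clusters is a matter of grouping them by which species or tissues they contain. The sketch below assumes the clusters have already been parsed into a mapping from a cluster identifier to the set of groups represented in it; it does not read OrthoMCL output, and the cluster identifiers are hypothetical.

```python
from collections import Counter


def count_cluster_categories(clusters):
    """Count clusters by the exact set of groups they contain.

    clusters: dict mapping cluster id -> set of group labels
              (species or tissues) with at least one member gene.
    Returns a Counter keyed by frozenset of labels.
    """
    counts = Counter()
    for groups in clusters.values():
        counts[frozenset(groups)] += 1
    return counts


# Hypothetical clusters for the three tissues of one species.
clusters = {
    "c1": {"root", "leaf", "flower"},
    "c2": {"flower"},
    "c3": {"root", "leaf", "flower"},
    "c4": {"root"},
    "c5": {"leaf", "flower"},
}
counts = count_cluster_categories(clusters)
shared = counts[frozenset({"root", "leaf", "flower"})]
flower_only = counts[frozenset({"flower"})]
print(shared, flower_only)
```

Each key of the resulting counter corresponds to one region of a Venn diagram, so "shared" and "specific" counts can be read off directly.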

Transcriptome overlap between species
- ~7 000 clusters are shared between all species.
- ~3 600 clusters are specific to Nicotiana.
- ~2 800 clusters are specific to Solanaceae.

Transcriptome overlap in N. sylvestris
- ~15 000 clusters are shared.
- ~3 500 clusters are specific to flower.
- ~2 000 clusters are specific to root.
- ~1 800 clusters are specific to leaf.

Transcriptome overlap in N. tomentosiformis
- ~14 500 clusters are shared.
- ~3 400 clusters are specific to flower.
- ~2 600 clusters are specific to root.
- ~1 900 clusters are specific to leaf.

Conclusions
- Nicotiana sylvestris and Nicotiana tomentosiformis have been sequenced at a coverage of about 100x and 150x, respectively.
  - 83% of the N. sylvestris genome is covered.
  - 72% of the N. tomentosiformis genome is covered.
  - The tobacco WGP physical map can be used to superscaffold the assembly.
- Between 45 000 and 50 000 transcripts are identified by mapping of RNA-seq data.
  - More than 80% have homologs in UniProt.

Conclusions
- About 15 000 clusters of orthologous and paralogous genes are shared between root, leaf and flower.
  - About 3 500 clusters are specific to flowers.
- About 7 000 clusters of orthologous and paralogous genes are shared between N. sylvestris, N. tomentosiformis, tomato and Arabidopsis.
  - About 3 600 clusters of orthologous and paralogous genes are specific to Nicotiana species.
- The obtained genomes and transcriptomes will contribute to the assembly and annotation of the tobacco genome.

Acknowledgments
James Battey, Sonia Ouadi, Lucien Bovet, Simon Goepfert, Nicolas Bakaher, Manuel C. Peitsch, Nikolai V. Ivanov