The tomato genome re-seq project http://www.tomatogenome.net 5 February 2013, Richard Finkers & Sjaak van Heusden
Rationale
  Genetic diversity in commercial tomato germplasm is relatively narrow
  Unexploited genetic diversity available in landraces and old varieties?
  Cultivated tomato has lost valuable traits during domestication
  Wild species are a source of genetic diversity: diverse habitats, variation in flowers and fruits, variation in mating systems
  Most wild species can be crossed with cultivated tomato (introgression breeding)
Rationale: Tomato Genome (Re-)Sequencing Project
  Identify alleles underpinning phenotypic diversity across the entire genome and the entire tomato clade
Acknowledgement: Sjaak van Heusden, Paris market
Tomato fruit shape variation Rodríguez et al (2011) Plant physiology 156: 275-85
EU-SOL core collection (https://www.eu-sol.wur.nl)
  Selection: > 7000 landraces -> 1000 landraces -> 200 landraces -> selected landraces for (re-)sequencing
  Markers per selection step: 20 (7000 -> 1000), 384 (1000 -> 200), 7500 (200 -> 34)
  Information: marker data, phenotype data, passport data
  Acknowledgement: Dani Zamir et al. & Keygene N.V.
Landraces & old cultivar collection
Fruit phenotypes EU-SOL collection
Improving with exotic genetic libraries
  Wild tomato species are valuable candidates for novel alleles
  Dani Zamir, Nature Reviews Genetics 2, 983-989 (December 2001)
Improving with exotic genetic libraries Phylogenetic relationships in the Solanum clade Moyle 2008
(Re-)sequencing collection
  [Figure: phylogenetic tree; counts shown on the tree: 51, 4, 6, 2, 3, 2, 2, 1, 3, 2, 7, 2]
  Groups: Lycopersicon, Arcanum, Eriopersicon, Neolycopersicon
  Tree according to Anderson et al. (2010), redrawn from Moyle 2008
Genome alignment
  Read mapping to cv. Heinz
  Genome structure of wild tomato relatives?
Reference genomes: de novo assembly selection
  Heinz 1706: Lycopersicon group
  LA 2157: Arcanum group
  LYC 4: Eriopersicon group
  LA 716: Neolycopersicon group
Data production
  84 resequenced genomes
    500 bp, 2 x 100 bp paired-end Illumina
    Average coverage 41x
  3 de novo genomes (S. arcanum, S. habrochaites, S. pennellii)
    170 bp, 2 x 100 bp paired-end Illumina
    2 kb, 2 x 100 bp mate-pair Illumina
    8 kb mate-pair (454)
    20 kb mate-pair (454)
    Average coverage 205x
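The average coverage figures follow from the usual depth arithmetic: total sequenced bases divided by genome size. A minimal Python sketch of that calculation; the function name is illustrative, and the read-pair count and genome size in the example call are hypothetical values, not project figures.

```python
def average_coverage(read_pairs: int, read_length: int, genome_size_bp: int) -> float:
    """Expected depth = total sequenced bases / genome size.

    Paired-end data contributes two reads per pair.
    """
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return 2 * read_pairs * read_length / genome_size_bp


# Hypothetical input: 156 million 2 x 100 bp pairs on a 760 Mb genome.
depth = average_coverage(156_000_000, 100, 760_000_000)
print(f"{depth:.1f}x")
```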
Genomic sequencing libraries
K-mer graph
  [Figure: 31-mer histogram; x-axis: 31-mer frequency (0-100); y-axis: 31-mer volume (millions, 0-1000); fitted curves (FIT) for samples 001, 045, 046, 053, 054, 058, 072, 074]
  Data: 500 bp, 2 x 100 bp paired-end Illumina
  Acknowledgement: Theo Borm
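A 31-mer histogram of this kind can be built by counting every 31-mer in the reads and then tallying how many distinct 31-mers occur at each frequency. A minimal Python sketch, assuming reads are plain strings held in memory; the function names are illustrative, strand-canonical counting is an assumption, and the k-mer counter actually used in the project is not stated on the slide.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=31):
    """Map each k-mer frequency to the number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:  # skip k-mers containing ambiguous bases
                continue
            counts[canonical(kmer)] += 1
    return Counter(counts.values())
```

If "31-mer volume" on the y-axis means frequency multiplied by the number of distinct 31-mers (an assumption about the plot), each histogram entry would be scaled by its frequency before plotting. A dedicated k-mer counter would be needed for real data volumes; this sketch only shows the idea.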
K-mer exploration
  Fitted modes: homozygous, heterozygous, duplicated (2x)
  Conclusions: the percentage of heterozygosity is negligible; the duplicated portion is not negligible
  [Figure: 31-mer histogram, detail; x-axis: 31-mer frequency (30-90); y-axis: 31-mer volume (millions, 0-300); fitted curves (FIT) for samples 001, 045, 046, 053, 054, 058, 072, 074]
Genome size estimates
  Genomic k-mer based estimate
  Ignores differences in GC-AT ratio
  Underestimation

  Nr  Species  Est. size (Mb)  Draft size (Mb)  %CP
  01  SL       723             760 (Heinz)      1.9
  45  SP       749             -                1.9
  46  SP       775             739 (LA1589)     6.3
  53  SG       728             -                4.4
  54  SC       760             -                6.2
  58  SA       830             -                3.0
  72  SH       779             -                7.1
  74  SP       962             -                8.6

  Acknowledgement: Theo Borm
  The Tomato Genome Consortium, Nature 485, 635-641 (2012)
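A k-mer based size estimate divides the total number of genomic k-mers by the k-mer depth at the main (homozygous) peak. A minimal Python sketch of that simple peak method; it is not the fitted model used for the slide, and the function name, the input format (a mapping from k-mer frequency to number of distinct k-mers), and the `min_multiplicity` cutoff for discarding error k-mers are all assumptions.

```python
def estimate_genome_size(hist, min_multiplicity=5):
    """Estimate genome size (in k-mers, roughly bp) from a k-mer histogram.

    hist maps k-mer frequency -> number of distinct k-mers with that frequency.
    Frequencies below min_multiplicity are treated as sequencing errors (assumed cutoff).
    """
    solid = {m: c for m, c in hist.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers at or above the cutoff")
    peak_depth = max(solid, key=solid.get)  # frequency with the most distinct k-mers
    total_kmers = sum(m * c for m, c in solid.items())
    return total_kmers / peak_depth


# Toy histogram: error k-mers at 1x, main peak at 20x, duplicated sequence at 40x.
print(estimate_genome_size({1: 1000, 20: 100, 40: 10}))
```

Duplicated sequence at twice the peak depth is counted twice in the total, which is why a simple estimate like this one accounts for the duplicated portion of the genome.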
Optimizing assembly strategy
Checking assembly integrity
  Average completeness per 10 contigs: ALL-PATHS (96.62%), CLC-BIO (74.62%)
  Heinz dot plot: SL2.40 ch11 region (1 Mbp)
Status de novo assembly genomes
Status de novo assembly genomes

  Assembly                  N50         N90        Longest     Shortest  Mean     Median  N contigs  Total length
  Heinz 1706 reference      16,467,796  3,041,128  42,121,211  2,000     242,428  2,847   3,223      781,345,411
  S. habrochaites_allpaths  90,424      12,290     990,035     902       43,409   20,461  16,935     735,128,396
  S. habrochaites_scaf      515,730     104,925    3,252,897   902       130,475  9,758   5,873      766,277,628
  S. pennellii_allpaths     64,671      7,460      627,722     887       27,680   11,008  26,589     735,990,792
  S. pennellii_scaf         206,135     38,969     1,269,801   887       49,209   5,932   15,886     781,730,072
  S. arcanum_clc            18,651      2,524      241,690     200       2,869    428     290,145    832,461,203
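N50 and N90 are the contig (or scaffold) lengths at which the cumulative length, summed from the longest sequence downward, first reaches 50% or 90% of the total assembly length. A minimal Python sketch of that definition; the function name and the toy lengths are illustrative and not taken from the assemblies listed here.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic: fraction=0.5 gives N50, fraction=0.9 gives N90."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("no sequence lengths given")


# Toy example with seven contigs totalling 300 bp.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(nx(contigs, 0.5), nx(contigs, 0.9))
```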
Conclusions
  Sequencing completed
  Quality and coverage threshold satisfied
  Cleaning of resequencing data completed
  De novo assemblies of S. habrochaites and S. pennellii comparable with the tomato reference
  De novo assembly of S. arcanum in progress
  Read mapping and SNP analysis finished
And now the fun begins...
Average SNP rate/kb (vs. SL2.40)
Homozygous vs Heterozygous feature rate
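Per-kb rates of homozygous and heterozygous variants can be derived from genotype calls by classifying each call and dividing by the length of the region examined. A minimal Python sketch, assuming diploid genotypes written as VCF-style GT strings (for example "1/1" or "0|1") and a known region length; the function name is illustrative, and the project's own variant-calling pipeline is not described on these slides.

```python
def variant_rates_per_kb(genotypes, region_bp):
    """Return (homozygous, heterozygous) variant counts per kb.

    genotypes: iterable of diploid GT strings such as "0/1", "1/1", "1|2".
    Missing calls ("./.") and homozygous-reference calls ("0/0") are not counted.
    """
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    hom = het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] == alleles[1]:
            if alleles[0] != "0":
                hom += 1  # homozygous for a non-reference allele
        else:
            het += 1
    scale = 1000 / region_bp
    return hom * scale, het * scale


calls = ["1/1", "0/1", "0/0", "./.", "1|1", "0|1", "1/2"]
print(variant_rates_per_kb(calls, 2000))
```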
Exploring the FW9-2-5 locus (Lin5)
  Sucrose synthase gene
  Cloned from S. pennellii
  Amino acid substitutions: 2878 (Asp in LP to Glu in LE), 2932 (Asp to Asn), 2953 (Val to Leu)
  Fridman et al. Proc Natl Acad Sci U S A. 2000 Apr 25;97(9):4718-23.
FW9-2-5 variation (Lin5) S. galapagense
Needs
  Whole genome variant catalogue
  Annotation for the three wild species genomes
  Pan genome reconstruction
  How good is our sampling?
Perspectives
  Direct application for reverse genetics studies
  Use identified allelic variation
  Calculate distance based on all genes?
  Better understanding of genome organization
  Improve introgression breeding
  Homozygous vs. heterozygous features
  Scan for inversions
  Diamond jewelry?
150 tomato genome consortium
Questions
  Project site: http://www.tomatogenome.net
  Phenotype data & images: https://www.eu-sol.wur.nl
  SOL100: http://solgenomics.net or http://solgenomics.wur.nl
Acknowledgments
  Data production: Elio Schijlen, Bas te Lintel Hekkert
  Quality control: Saulo Aflitos, Huanwen Zhu, Minling Xiao, Tao Ma, Xiaoli Wang
  Data management and assembly: Sandra Smit, Jan van Haarst, Henri van de Geest, Lars Smits, Jiumeng Min, Jie Chen, Xiaoli Wang
  Project management: Sander Peters, Richard Finkers, Andries Koops, Jianbo Jian, Yadan Luo, Li Liao, Tina (Na) Xu