Exploiting novel rice baseline datasets: WGS, BAC-based platinum genome sequencing and full-length transcriptomics


1 Exploiting novel rice baseline datasets: WGS, BAC-based platinum genome sequencing and full-length transcriptomics
Dario Copetti, PhD
Arizona Genomics Institute, The University of Arizona
International Rice Research Institute
PacBio Users Group Meeting Dec

2 The 9-Billion People Question: How do we grow enough food to feed the world in < 40 years?
Rice will play a significant role
- Rice feeds half of the world
- It is the rice-dependent population that will double by 2050
- New rice varieties are needed with 2-3X yield, BUT they must require less water, fertilizer, pesticides, and land
Green Super Crops

3 Expansion of rice genomics resources
- Resequencing of O. sativa accessions (3K RGP 2014)
- Development of new RefSeq assemblies
- Characterizing wild rice relatives (genomes/populations)
- Assessment of the rice pan-genome
- Extensive characterization of (new) coding regions
- (Develop a platform for genus-wide evolutionary analyses)
The ultimate goals are to:
- access and catalogue all existing natural rice variation
- characterize function at the molecular level

4 Expansion of rice genomics resources
- Resequencing of O. sativa accessions (3K RGP 2014)
- Development of new RefSeq assemblies
- Characterizing wild rice relatives (genomes/populations)
- Assessment of the rice pan-genome
- Extensive characterization of (new) coding regions
- (Develop a platform for genus-wide evolutionary analyses)
PacBio datasets developed at AGI:
- One WGS rice assembly
- Two BAC-based rice assemblies
- mRNA sequencing (Iso-Seq) of 4 genotypes

5 Rice genome assemblies
O. s. japonica Nipponbare RefSeq
O. s. indica scf N kb
O. s. indica IR64 scf N kb
O. sativa DJ123 (aus) scf N kb
O. sativa Kasalath (aus) ctg N kb
3K RGP, GigaScience 2014

6 Rice genome assemblies
O. s. japonica Nipponbare RefSeq
O. s. indica scf N kb
O. s. indica IR64 scf N kb
O. sativa DJ123 (aus) scf N kb
O. sativa Kasalath (aus) ctg N kb
3K RGP, GigaScience 2014
New rice RefSeqs
- O. s. indica Nagina 22 (upland)
- O. s. indica Var. A
- O. s. indica Var. B
Develop several RefSeq assemblies spanning all rice diversity

7 PacBio datasets developed at AGI:
- One WGS rice assembly
- Two BAC-based rice assemblies
- mRNA sequencing (Iso-Seq) of 4 genotypes

8 WGS of the aus variety Nagina 22
- Upland indica rice
- Deep rooted
- Drought and heat tolerant
- Functional analyses
- Mutant development
- Est. genome size: 380 Mb
Mohapatra et al. NRCPB, New Delhi

9 WGS of the aus variety Nagina 22
WGS PacBio-only approach, targeting 60x raw coverage
- 24.7 Gb of raw data (~65x)
- 57x in reads >10 kb
- 42x in reads >18 kb
- 15 kb BluePippin size selection, P6-C5 chemistry, 42 SMRT cells, 4 h and 6 h movies
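The ~65x figure follows from dividing the raw yield by the estimated genome size. A minimal sketch of that arithmetic, using the 24.7 Gb yield and the 380 Mb estimate quoted on the slides (the function name is an illustrative choice, not part of any PacBio tool):

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


raw_bases = 24.7e9    # 24.7 Gb of raw data (from the slide)
genome_size = 380e6   # estimated Nagina 22 genome size, 380 Mb (from the slide)

coverage = fold_coverage(raw_bases, genome_size)
print(f"Raw coverage: {coverage:.1f}x")  # prints "Raw coverage: 65.0x"
```

The same function applies to the read-length subsets (57x in reads >10 kb, 42x in reads >18 kb) once the summed bases of those subsets are known.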

10 WGS of the aus variety Nagina 22
Assembled with FALCON at HudsonAlpha:
- Mb (>98% of estimated genome size)
- contigs, N kb, L
- Longest contig 3.5 Mb, 613 contigs >100 kb
Thanks to J. Schmutz, J. Jenkins, D. HA
Downstream steps:
- Quiver polishing (99.996% consensus concordance)
- Pseudomolecules with Genome Puzzle Master
- Genome annotation
- Rice 3k dataset variant call
Ken McNally, Nick Alexandrov, Ramil Mauleon

11 PacBio datasets developed at AGI:
- One WGS rice assembly
- Two BAC-based rice assemblies
- mRNA sequencing (Iso-Seq) of 4 genotypes

12 BAC TO THE FUTURE!
Deliver high-quality whole genome assemblies by:
- BAC libraries and physical maps
- PacBio SMRT long read technology
- multi-platform information management system
Qifa Zhang's lab at Huazhong Agricultural Univ.

13 Pilot PacBio BAC pool sequencing experiment
10 rice BACs, 1 pool, 1 SMRT cell: 10 single contigs, 99.99% accuracy
Per-contig table (Contig, Coverage (x), QC Observations, Final Result): all 10 contigs resolved to a circular BAC; 2 of them required manual circularization
Produce a map-based assembly, with no gaps and high accuracy

14 BAC Libraries & Whole Genome Profiling Physical Maps
                      Var. A    Var. B
# of clones
  Original            36,864    36,864
  On-map              30,749    32,829
# of phys. contigs
MTP selected          4,082     4,007
Physical map visualization: MTP BACs in yellow
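The clone counts on this slide can be turned into proportions to compare the two libraries. A small sketch, using only the counts given on the slide; the dictionary layout and the `percent` helper are illustrative assumptions:

```python
# Clone counts copied from the slide; the key names are illustrative.
libraries = {
    "Var. A": {"original": 36864, "on_map": 30749, "mtp": 4082},
    "Var. B": {"original": 36864, "on_map": 32829, "mtp": 4007},
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


summary = {}
for name, counts in libraries.items():
    summary[name] = {
        # share of fingerprinted clones that were placed on the physical map
        "on_map_pct": percent(counts["on_map"], counts["original"]),
        # share of on-map clones selected for the minimum tiling path (MTP)
        "mtp_pct": percent(counts["mtp"], counts["on_map"]),
    }
    print(f"{name}: {summary[name]['on_map_pct']:.1f}% on map, "
          f"{summary[name]['mtp_pct']:.1f}% of on-map clones in the MTP")
```

Only the MTP clones go on to pooling and SMRT sequencing, so the second percentage gives a rough sense of how much sequencing the physical map saves.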

15 BAC pool prep and SMRT sequencing
Template DNA prep:
- ~15 µg of DNA were sheared with Covaris g-tube
- Libraries were constructed with PB 20kb protocol
- Size-selected with Sage Blue Pippin at 4-8 kb
- Sequenced in 1 SMRT cell with 4 h movies, P5-C3

16 BAC Pool Sequencing Runs
Var. A 12-BAC pool: 8 24-BAC pool: BAC pool: BAC pool: 2 96-BAC pool: 1 Total: 209 pools
Var. B 24-BAC pool: 9 32-BAC pool: BAC pool: 2 Total: 167 pools
~9 months

17 BAC Assembly and Address Assignment
- Raw data of each pool was assembled independently with HGAP3
- >90% of the contigs in each pool were >100 kb
- QV >40
- The average coverage was 50x (30-100x)
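A quality value (QV) is a Phred-scaled error rate, so QV 40 corresponds to one expected error per 10,000 bases, or 99.99% accuracy. A minimal sketch of the conversion in both directions, assuming the standard Phred definition QV = -10 * log10(error rate); the function names are illustrative:

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy (0..1)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0..1, strictly below 1) to a Phred-scaled QV."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


print(f"QV 40 -> {qv_to_accuracy(40) * 100:.2f}% accuracy")
print(f"99.99% accuracy -> QV {accuracy_to_qv(0.9999):.1f}")
```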

18 Sequence Quality Based Upon Physical Map BAC Overlapping Sequencing
Table columns: Chr, overlapping clones, overlap seq bp, subst bp, in/del bp
Overlapping clone pairs:
- OSIZBa015B14 / OSIZBa086C
- OSIZBa074N20 / OSIZBa002D
- OSIZBa0094M08 / OSIZBa0026H
- OSIZBa0026H12 / OSIZBa057c
- OSIZBa057c20 / OSIZBa051P
- OSIZBa018B09 / OSIZBa087H
- OSIZBa015N11 / OSIZBa072N
- OSIZBa094C21 / OSIZBa008D
- OSIZBa015D24 / OSIZBa059o
- OSIZBa086E16 / OSIZBa026A
- OSIZBa025B15 / OSIZBa049C
- OSIZBa014J18 / OSIZBa055D
Summary rows: Total; Accuracy based on base pair discrepancies (%); Accuracy based on both substitutions and in/dels (%)
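The two accuracy rows of this table can be derived from the per-pair counts: the overlap between neighbouring BACs was sequenced twice independently, so every disagreement in the overlap counts as an error. A sketch under stated assumptions: "base pair discrepancies" is read here as substitutions only, and the input rows below are hypothetical placeholders, not the values from the slide:

```python
def overlap_accuracy(rows):
    """Summarise accuracy over overlapping BAC pairs.

    rows: iterable of (overlap_bp, substitution_bp, indel_bp), one per clone pair.
    Returns (substitution-only accuracy %, substitution + in/del accuracy %).
    """
    rows = list(rows)
    total_overlap = sum(r[0] for r in rows)
    if total_overlap <= 0:
        raise ValueError("total overlap must be positive")
    total_subst = sum(r[1] for r in rows)
    total_indel = sum(r[2] for r in rows)
    subst_only = 100.0 * (1 - total_subst / total_overlap)
    combined = 100.0 * (1 - (total_subst + total_indel) / total_overlap)
    return subst_only, combined


# Hypothetical counts for three clone pairs (NOT the slide's data).
example_rows = [(50_000, 1, 3), (30_000, 0, 2), (20_000, 1, 3)]
subst_only, combined = overlap_accuracy(example_rows)
print(f"substitutions only: {subst_only:.3f}%")
print(f"substitutions + in/dels: {combined:.3f}%")
```

Totals are summed before dividing, so long overlaps weigh more than short ones; averaging per-pair percentages instead would give a different number.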

19 Information Management Platform

20 PacBio BAC-based Assembly
Post BAC assembly steps:
- BAC circularization and vector trimming
- address assignment with WGP sequence tags
- BAC completion phase assignment
Var. A: 216 contigs, Sequence Length: Mb, N50: 2.88 Mb, Longest: 7.41 Mb, Gaps: 203
Var. B: 318 contigs, Sequence Length: Mb, N50: 1.68 Mb, Longest: 5.28 Mb, Gaps: 305
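N50 is the contig length at which the contigs of that length or longer hold at least half of the assembled bases; L50 is the number of contigs needed to reach that point. A minimal sketch of the calculation, using a toy list of lengths rather than the Var. A or Var. B contigs:

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the shortest contig needed to cover half the assembly
            return length, count


toy_lengths = [100, 80, 60, 40, 20]  # hypothetical contig lengths
print(n50_l50(toy_lengths))  # prints (80, 2)
```

Because N50 depends only on the length distribution, closing gaps (which merges contigs) raises it without adding any sequence.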

21 > 90% BACs Were Fully Assembled and Addressed
Table columns: # MTP Plates, # of Clones picked, BACs with ID, Circularized BACs with ID
Var. A: (4751 unique), 4488 (94.5%) BACs with ID, 4320 (90.9%) circularized BACs with ID
Var. B: (4714 unique), 4571 (96.9%) BACs with ID, 4415 (93.6%) circularized BACs with ID
From BAC assembly to chromosome pseudomolecules:
- PacBio BAC assembly
- BAC end sequence data and WGP sequence tags
- FPC physical maps and unanchored BACs
- Contigs from the NGS assembly were used to close gaps that WGP could not address

22 PacBio-NGS Final Integrated Assembly
- PacBio-derived contigs ordered and oriented on PM
- NGS contigs were used to close gaps
Table columns: Var. A PB only, Var. A PB+NGS, Var. B PB only, Var. B PB+NGS
Table rows: Total length (Mb), Contig N50 (Mb), Longest contig (Mb), # of contigs, # of gaps, Sequence in ctgs > 1 Mb
Sequence in ctgs > 1 Mb: 66.3% 51.9%
Note: O. sativa ssp. japonica Nipponbare RefSeq contains 229 gaps

23 PacBio datasets developed at AGI:
- One WGS rice assembly
- Two BAC-based rice assemblies
- mRNA sequencing (Iso-Seq) of 4 genotypes

24 Iso-Seq of 4 rice varieties
- Leaf transcriptome of 4 cultivars (var. A-D)
- 3 size fractions (1-2, 2-4, >4 kb)
- ~12 SMRT cells for each variety

25 Isoform length distribution Var. A
- FL cDNAs benefit genome annotation, especially for long genes
- transcripts are longer than 10 kb
- (2.6%) are longer than 7 kb
- Isoforms have been aligned to the available genome assemblies
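The length-distribution figures on this slide are counts and percentages of isoforms above a length cutoff. A small sketch of how such a summary could be computed from a list of isoform lengths; the lengths below are hypothetical, not the Var. A data:

```python
def fraction_longer_than(lengths, threshold_bp):
    """Return the fraction of isoforms strictly longer than threshold_bp."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no isoform lengths given")
    return sum(1 for n in lengths if n > threshold_bp) / len(lengths)


# Hypothetical isoform lengths in bp (NOT the slide's data).
toy_isoforms = [900, 1500, 2300, 4100, 7500, 8200, 10500, 12000]

over_7kb = fraction_longer_than(toy_isoforms, 7_000)
over_10kb = fraction_longer_than(toy_isoforms, 10_000)
print(f">7 kb: {over_7kb:.1%}, >10 kb: {over_10kb:.1%}")  # prints ">7 kb: 50.0%, >10 kb: 25.0%"
```

In practice the lengths would come from the Iso-Seq output FASTA; reading that file is left out here to keep the sketch free of assumptions about file names or tools.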

26 Alignment of transcripts to the genome
Total 144,703 isoforms:
- 95% align to Var. A genome and 97.8% align to IRGSP
- Of the 7229 not aligned isoforms, align to Var. A
- A-specific genes
- do not align to either genome
- O. sativa new genes?

27 Iso-Seq of 4 rice varieties
More organs and conditions will be sequenced to:
- Assist and improve gene annotation
- Discover alternative isoforms
- Discover genotype-specific genes

28 Conclusion
- The rice community is actively developing new datasets
- AGI is contributing to expanding their breadth and depth: new RefSeqs, transcriptome collection, developing new BAC-based protocols
- More rice assemblies are being delivered and improved
- Accelerating plant breeding and model genome evolution

29 HZAU: Qifa Zhang, Lizhong Xiong, Lingling Chen, Weibo Xie, Yongzhong Xing, Changyin Wu, Sibin Yu, Daoxiu Zhou, Yu Zhao, Gongwei Wang, Meizhong Luo, Ting Mu, Weiming Li
Arizona: Rod Wing, Jianwei Zhang, Dave Kudrna, Dario Copetti, Seunghee Lee, Jason Talag, Beatriz Padilla, H. Ann Danowitz, Yeisoo Yu (PHYZEN, Seoul, S. Korea)
IRRI: Ken McNally, Nick Alexandrov, Ramil Mauleon
HudsonAlpha: Jeremy Schmutz, Jerry Jenkins, Dave Flowers