Genomics and Transcriptomics of Spirodela polyrhiza

Doug Bryant, Bioinformatics Core Facility & Todd Mockler Group, Donald Danforth Plant Science Center

Desired Outcomes
- High-quality genomic reference sequence
- Transcriptome definition, functional annotation
- Comparison of several additional accessions

Genomic, RNA-Seq Data
- Spirodela accession 9509 deeply sequenced
- Additional 8 accessions, low coverage
- RNA-seq obtained from 9509 and two other accessions under Control and ABA (abscisic acid) conditions
Kuehdorf, Jetschke, Ballani, and Appenroth, 2013

Analysis Strategy
- Genome data acquisition
- Transcriptome data acquisition
- Data quality control
- Genome, transcriptome assembly
- Genome structural annotation
- Transcriptome functional annotation
- Differential expression analysis*

Genome

Data Acquisition
- Genomic: Illumina HiSeq 2000
- Diverse library set: overlap, 300-500 bp, several mate-pair

Quality Control
- Raw Data
- Visualize
- Adaptors
- Verify Insert Sizes
- Retain Pairs Only
- Trim 3' Low Quality

Insert Size  Stdev  Read Length  Trimmed Avg Read Length  Read Pairs Passed QC  Coverage @ 329 Mbp
        388     43          101                   100.79            28,630,414               17.54
        490    168          101                   100.72            33,683,677               20.62
        228     31          101                   100.44            41,684,344               25.45
        166     17          101                    99.54            91,328,045               55.26
        166     17          101                    99.62           100,172,620               60.66
        217     17          101                    83.45            44,051,527               22.35
       4660   1110          101                   100.12           177,983,251              108.33
       4500      -          151                   150.74             6,782,583                6.22
Total                                                              524,316,461              316.43
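The coverage column can be reproduced from the other columns: two reads per pair, times the trimmed average read length, divided by the 329 Mbp genome size named in the column header. The sketch below assumes that formula, which agrees with the tabulated values to two decimals. The function name `sequencing_coverage` is illustrative and not taken from the original analysis.

```python
def sequencing_coverage(read_pairs: int, mean_read_length: float,
                        genome_size_bp: float) -> float:
    """Depth of coverage: total sequenced bases divided by genome size."""
    total_bases = read_pairs * 2 * mean_read_length  # two reads per pair
    return total_bases / genome_size_bp


GENOME_SIZE_BP = 329e6  # genome size used in the table's coverage column

# (insert size, read pairs passed QC, trimmed avg read length) from three rows
libraries = [
    (388, 28_630_414, 100.79),
    (217, 44_051_527, 83.45),
    (4500, 6_782_583, 150.74),
]

for insert_size, pairs, read_length in libraries:
    depth = sequencing_coverage(pairs, read_length, GENOME_SIZE_BP)
    print(f"insert {insert_size} bp: {depth:.2f}x")
```

For these three rows the computed depths round to 17.54x, 22.35x and 6.22x, matching the table.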

Genomic Data: Insert Sizes Distribution, 9509
Insert size  Estimated coverage (x)  Fraction of total data
180 bp       31                      28%
500 bp       25                      22%
2,000 bp     20                      17%
5,000 bp     11                      10%
20,000 bp    26                      23%

Genome Assembly
- Several iterations
- Preliminary assemblies with Velvet, SOAPdenovo
- Final assembly with ALLPATHS-LG
- Polished with SSPACE

Genome Assembly Statistics
Statistic                   9509
Assembly (Mbp) (152 exp.)   146 (96% of exp.)
Scaffolds (#)               774
Scaffolds >= 1 Mbp (#)      32 (4.13%)
N50 scaffold length (bp)    4,305,909
L50 scaffold (#)            11
N90 scaffold length (bp)    1,428,181
L90 scaffold (#)            31
Ns (%)                      7.7
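For readers unfamiliar with the N50/L50 and N90/L90 rows, here is a minimal sketch of the standard definition: sort scaffolds from longest to shortest and walk down the list until the running total reaches the chosen fraction of the assembly. The function name `nx_lx` and the toy lengths are illustrative assumptions, not the 9509 scaffold set.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Nx is the length of the scaffold at which the cumulative sum of
    length-sorted scaffolds first reaches `fraction` of the total;
    Lx is how many scaffolds were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy example (hypothetical lengths, total 310).
toy_scaffolds = [100, 80, 60, 40, 20, 10]
print("N50, L50:", nx_lx(toy_scaffolds, 0.5))
print("N90, L90:", nx_lx(toy_scaffolds, 0.9))
```

In the toy example half of the total (155) is reached at the second scaffold, so N50 is 80 and L50 is 2. Read the same way, the table above says that the 11 longest scaffolds hold at least half of the assembled bases.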

Genomic Physical Coverage
[Figure: physical coverage (x) by library: 180 bp, 500 bp, 2,000 bp, 5,000 bp, 20,000 bp. Total: 370x.]

Genome Assembly Quality Assessment
- Reads used in assembly?
- Reads align to assembly?
- Core eukaryotic genes present?

Genomic Reads Used, Aligned
[Figure: reads used (%) and reads aligned (%) for each library: 180 bp, 500 bp, 2,000 bp, 5,000 bp, 20,000 bp.]

Core Eukaryotic Genes
- Core Eukaryotic Genes Mapping Approach (CEGMA), Korf lab (korflab.ucdavis.edu/)
- Search genome for 248 low-copy, highly conserved genes
- Assess completeness of genome
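The completeness percentages reported from CEGMA are consistent with the number of core genes found divided by the 248 genes searched. A minimal sketch of that arithmetic follows; the function name `percent_core_genes` is illustrative.

```python
CEGMA_CORE_GENES = 248  # size of the core gene set searched for


def percent_core_genes(found: int, total: int = CEGMA_CORE_GENES) -> float:
    """Percentage of the core gene set recovered, to two decimals."""
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    return round(100.0 * found / total, 2)


# 241 of 248 core genes at least partially recovered
print(percent_core_genes(241))
```

With 241 genes found this gives 97.18, the value shown for S. polyrhiza.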

Core Eukaryotic Genes
[Figure: number of core genes identified (complete and partial) per species. Core genes at least partial: A. thaliana 99.60%, B. distachyon 99.19%, Z. mays 97.18%, S. polyrhiza 97.18%. Bar labels as extracted: 246, 247, 246, 237, 241, 241, 226, 215.]

Resequencing

Resequencing Kuehdorf, Jetschke, Ballani, and Appenroth 2013


Resequencing Data
[Figure: sequencing depth of coverage by strain: 9509, 9504, 9506, 9316, 9242, 9502, 9511, 9512, 9501.]

Resequencing Variation
[Figure: SNP/INDEL rate per accession; number of SNP positions and INDEL positions, axis 0 to 600,000.]
Total variant positions (%) by accession:
- 9509: 0.12%
- 9504: 0.39%
- 9506: 0.35%
- 9316: 0.37%
- 9242: 0.37%
- 9502: 0.20%
- 9511: 0.21%
- 9512: 0.29%
- 9501: 0.20%

Resequencing Assemblies
- Per accession: ~30x coverage, single library
- Assembled each using Velvet
- Mean assembled size: 128 Mbp (~84%) (stdev: 6 Mbp)
- Mean N50: 15 kb (stdev: 1.5 kb)
- Nearly all contigs (>98%) align to 9509 genome assembly
- Defining structural differences in progress

Transcriptome

RNA-Seq Data Kuehdorf, Jetschke, Ballani, and Appenroth, 2013

RNA-Seq Data
[Figure: RNA-Seq reads per accession and treatment; number of 101 bp reads (millions) for 9509 Control, 9509 ABA, 9316 Control, 9316 ABA, 9501 Control, 9501 ABA.]

Transcriptome Discovery
1. Reference-guided assembly
   - TopHat2
   - Cufflinks2
2. De novo predictions
   - Maker, informed by assembly
   - SNAP, Augustus, GeneMarkHMM
   - Iteratively trained SNAP

(1) Reference-Guided Transcriptome Assembly
- Align each RNA-seq library (6) to genome
- For each, define transcripts based on alignments
- Merge resulting assemblies to discover gene models, alternative splicing
- Output: gene and transcript annotation (GFF3), transcripts (FASTA)
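The align, assemble, merge sequence above can be sketched as a set of command lines. This is only an illustration under stated assumptions: the slides name TopHat2 and Cufflinks2 but not the merge tool, so `cuffmerge` is assumed here; paired-end reads are assumed; and every file, index, and directory name is hypothetical. The sketch builds the commands without running them and relies on the tools' default output names (`accepted_hits.bam`, `transcripts.gtf`). Cufflinks writes GTF, so any conversion to the GFF3 output listed above is not covered.

```python
from pathlib import PurePosixPath


def reference_guided_commands(genome_index, genome_fasta, libraries,
                              workdir="rnaseq"):
    """Build (but do not run) command lines for a reference-guided assembly.

    `libraries` maps a library name to its (reads_1, reads_2) FASTQ paths.
    Returns (commands, manifest_lines); manifest_lines are the per-library
    GTF paths to write, one per line, into the file handed to cuffmerge.
    """
    work = PurePosixPath(workdir)
    commands, manifest_lines = [], []
    for name, (reads_1, reads_2) in libraries.items():
        align_dir = work / name / "tophat"
        asm_dir = work / name / "cufflinks"
        # Step 1: spliced alignment of one RNA-seq library to the genome.
        commands.append(["tophat2", "-o", str(align_dir),
                         genome_index, reads_1, reads_2])
        # Step 2: define transcripts from that library's alignments.
        commands.append(["cufflinks", "-o", str(asm_dir),
                         str(align_dir / "accepted_hits.bam")])
        manifest_lines.append(str(asm_dir / "transcripts.gtf"))
    # Step 3: merge the per-library assemblies (cuffmerge is an assumption).
    manifest = work / "assemblies.txt"
    commands.append(["cuffmerge", "-o", str(work / "merged"),
                     "-s", genome_fasta, str(manifest)])
    return commands, manifest_lines


# Hypothetical inputs: six libraries (3 accessions x 2 treatments).
libraries = {
    f"{acc}_{trt}": (f"{acc}_{trt}_1.fastq", f"{acc}_{trt}_2.fastq")
    for acc in ("9509", "9316", "9501")
    for trt in ("control", "aba")
}
commands, manifest_lines = reference_guided_commands(
    "spirodela_index", "spirodela.fa", libraries)
for cmd in commands:
    print(" ".join(cmd))
```

Six libraries yield twelve per-library commands plus one merge. Keeping the commands as argument lists makes them easy to pass to `subprocess.run` once the manifest file has been written.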

(2) De novo Transcriptome Discovery
- Discover genes not expressed in RNA-seq experiments
- Train algorithms on reference-guided assembly
  1. Call high-confidence open reading frames in transcript sequences
  2. Use transcripts and translated proteins to inform and train de novo gene callers
  3. Iteratively train SNAP on resulting output

(2) De novo Transcriptome Discovery
- Assembled: 25,090 loci, 41,884 transcripts
- Of 41,884 transcripts, 39,076 have a complete ORF of at least 33 amino acids
- Initial training using these transcripts and proteins
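One way to apply a "complete ORF and at least 33 amino acids" criterion is sketched below. The tool actually used for ORF calling is not named in the slides, so this is a generic stand-in with several assumptions: a complete ORF runs from an ATG to the first in-frame stop codon, both strands are searched, and the amino acid count includes the initial methionine but not the stop. The function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_complete_orf(seq: str, min_aa: int = 33):
    """Longest ORF (ATG through stop codon) encoding >= min_aa residues.

    Searches all three frames on both strands; returns the ORF's
    nucleotide sequence, or None if no ORF passes the length filter.
    """
    forward = seq.upper()
    best = None
    for strand in (forward, reverse_complement(forward)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first in-frame ATG
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    n_aa = (len(orf) - 3) // 3  # exclude the stop codon
                    if n_aa >= min_aa and (best is None or len(orf) > len(best)):
                        best = orf
                    start = None  # look for the next ORF in this frame
    return best


# Hypothetical transcript: ATG + 40 alanine codons + stop = 41 residues.
transcript = "ATG" + "GCT" * 40 + "TAA"
orf = longest_complete_orf(transcript)
print(len(orf) if orf else None)
```

A transcript that lacks an in-frame stop, or whose longest ORF encodes fewer than 33 residues, returns `None` and would be excluded from the training set under this reading of the criterion.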

Transcriptome Discovery: Results
- Preliminary Maker output: 28,600 genes
- Prune: must have RNA-seq evidence across >= 50%, or >= 100 amino acids with complete ORF
- Prune bacterial scaffolds
- Final gene set: 23,495 genes
- Transcriptome size (nucleic acids): 33 Mbp
- Mean protein length: 358 amino acids
- 19,380 (82%) have functional prediction from BLASTP and/or InterProScan
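The pruning rules can be written as a single predicate. This is a sketch under assumptions: "RNA-seq evidence across >= 50%" is read as the fraction of the gene model supported by RNA-seq, and the `GeneModel` record, its field names, and the example scaffold names are hypothetical rather than taken from the Maker output.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    gene_id: str
    scaffold: str
    rnaseq_fraction: float  # assumed: fraction of the model with RNA-seq support
    protein_length: int     # amino acids
    complete_orf: bool


def keep_gene(gene: GeneModel, bacterial_scaffolds: set) -> bool:
    """Apply the pruning rules: drop bacterial scaffolds, then require
    either RNA-seq support >= 50% or a complete ORF of >= 100 residues."""
    if gene.scaffold in bacterial_scaffolds:
        return False
    has_rnaseq = gene.rnaseq_fraction >= 0.5
    has_long_orf = gene.complete_orf and gene.protein_length >= 100
    return has_rnaseq or has_long_orf


# Hypothetical gene models to exercise each rule.
bacterial = {"scaffold_700"}
genes = [
    GeneModel("g1", "scaffold_1", 0.80, 60, False),     # kept: RNA-seq support
    GeneModel("g2", "scaffold_2", 0.10, 250, True),     # kept: long complete ORF
    GeneModel("g3", "scaffold_3", 0.10, 80, True),      # dropped: neither rule
    GeneModel("g4", "scaffold_700", 0.90, 300, True),   # dropped: bacterial
]
kept = [g.gene_id for g in genes if keep_gene(g, bacterial)]
print(kept)
```

Checking the scaffold first means a well-supported gene on a contaminant scaffold is still removed, which matches listing "Prune bacterial scaffolds" as a separate step.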

Transcriptome Functional Annotation
[Figure: Venn diagram of BlastP (77%), InterProScan (66%), and RNA-Seq Evidence (89%). Region counts as extracted: 1,270; 74; 453; 14,066; 2,605; 912; 3,238; 877 (3.7%).]

Transcriptome Annotation
[Figure: annotations by species. Species shown: Brachypodium distachyon, Sorghum bicolor, Cicer arietinum, Setaria italica, Solanum lycopersicum, Fragaria vesca, Zea mays, Cucumis sativus, Ricinus communis, Glycine max, Prunus persica, Populus trichocarpa, Oryza sativa, Theobroma cacao, Vitis vinifera.]

Alternative Splicing
[Figure: number of genes (log10 scale, 1 to 10,000) by number of isoforms (1 to 14).]

Identify Differentially Expressed Genes
[Figure: RNA-Seq reads per accession and treatment; number of 101 bp reads (millions) for 9509 Control, 9509 ABA, 9316 Control, 9316 ABA, 9501 Control, 9501 ABA.]

Differentially Expressed Genes
- 1,727 genes identified as significantly differentially expressed
- 1,105 isoforms identified as significant
- Molecular verification in progress

Ongoing
- Assemble repetitive elements
- Assemble, annotate mitochondria, chloroplast
- Accessions, structural differences
- Molecular investigation of differentially expressed genes of interest

Thank you